1. Introduction
This post takes scraping "数据分析" (data analysis) job postings for Beijing, Shanghai, Guangzhou, and Shenzhen from the Zhaopin recruitment site (zhaopin.com) as a worked example; the sections below walk through the implementation.
2. Workflow and Code Implementation
(1) Installing the playwright module
Install the package from your Python environment, or run pip install playwright in a cmd console.
After the package installs, you also need Playwright's browser drivers:
run playwright install in cmd and wait for the download to finish.
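To confirm the installation works end to end, here is a minimal smoke test (the URL is just a placeholder page, not part of the project):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)   # uses the browser driver installed above
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())                          # expected output: "Example Domain"
    browser.close()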
(2) Data-acquisition approach
① Getting the city codes
Looking at the page URLs, switching to a different city only changes the jl=xxx query parameter, so where do the city codes come from?
Inspecting https://www.zhaopin.com/citymap shows that the codes are embedded in the page's JSON state (the __INITIAL_STATE__ script tag), which is easy to fetch with the requests library.
# _1_getCityCode.py
import requests
import json

url = "https://www.zhaopin.com/citymap"
# Replace the Cookie and User-Agent below with values from your own browser session
header = {
"Cookie": "x-zp-client-id=21b2be20-472f-4d0c-9dee-e457fc42fba1; locationInfo_search={%22code%22:%22685%22%2C%22name%22:%22%E6%B3%89%E5%B7%9E%22%2C%22message%22:%22%E5%8C%B9%E9%85%8D%E5%88%B0%E5%B8%82%E7%BA%A7%E7%BC%96%E7%A0%81%22}; selectCity_search=530; urlfrom2=121113803; adfbid2=0; sensorsdata2015jssdkchannel=%7B%22prop%22%3A%7B%22_sa_channel_landing_url%22%3A%22%22%7D%7D; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2218befd88e51fe8-09fa19b491cdda8-7c546c76-921600-18befd88e52b41%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_utm_source%22%3A%22baidupcpz%22%2C%22%24latest_utm_medium%22%3A%22cpt%22%7D%2C%22identities%22%3A%22eyIkaWRlbnRpdHlfY29va2llX2lkIjoiMThiZWZkODhlNTFmZTgtMDlmYTE5YjQ5MWNkZGE4LTdjNTQ2Yzc2LTkyMTYwMC0xOGJlZmQ4OGU1MmI0MSJ9%22%2C%22history_login_id%22%3A%7B%22name%22%3A%22%22%2C%22value%22%3A%22%22%7D%2C%22%24device_id%22%3A%2218befd88e51fe8-09fa19b491cdda8-7c546c76-921600-18befd88e52b41%22%7D; ssxmod_itna=GqRxuD2DnDc7K0KGH3mexUxDwK4mO7DRGxqTRTg4GNeKoDZDiqAPGhDCb+FOiqaw5K8KGOciqHDEOPKNQrB2g3YYGj+gNdDU4i8DCuG5oqmDYYCDt4DTD34DYDixibsxi5GRD0KDFzHNNZ+qDEDYpyDA3Di4D+zd=DmqG0DDtOS4G2D7Uy7jdY2qTMQGreiB=DjTTD/+hacB=3Oca=lwrb=60abqGyBPGunqXV/4TDCrtDyLiWW7pWmiLo7Ape3ix=+iPNi7DPQiqeYAYGex+0174dn38DGRwmnbD===; ssxmod_itna2=GqRxuD2DnDc7K0KGH3mexUxDwK4mO7DRGxqTRTxA=ceQD/9DF2DqhtMcwrKAPeYRD/44xs34tewvwlR=iu2Ag0A0yqhiz2GONNyRE0SNLg70MvT630Bkhi7Q+SPClkzO+70dq1r7iYKnEdwRxYyIbdwSoo7FCDBzO2mn4=vG2Nh9MtIWGGHzM50nfNyDQqaD7Q=zxGcDiQ=eD===; LastCity=%E6%B3%89%E5%B7%9E; LastCity%5Fid=685; Hm_lvt_38ba284938d5eddca645bb5e02a02006=1700559277,1700560013,1700566119,1700882798; acw_tc=2760828c17008969651903831e8eb913c75ee10d541bac609285468020241e; acw_sc__v2=6561a0c536f48100068fea791f1fbaab54993336; Hm_lvt_363368edd8b243d3ad4afde198719c4a=1700636299,1700640701,1700664421,1700896957; Hm_lpvt_363368edd8b243d3ad4afde198719c4a=1700896957",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188"
}
html = requests.get(url, headers=header)
res = html.text
# The city data sits in a <script> tag as __INITIAL_STATE__={...}
start = "<script>__INITIAL_STATE__="
end = '</script><script src="//fecdn2'
start_index = res.index(start)
end_index = res.index(end)
# Save the raw JSON payload to data.txt
with open("data.txt", "w", encoding="utf-8") as f:
    data = res[start_index + len(start):end_index]
    f.write(data)
# Read the saved JSON back, pull out each target city's code,
# and write the name+code pairs to data_code.txt
city = ["北京", "上海", "广州", "深圳"]
with open("data.txt", "r", encoding="utf-8") as f2:
    data = json.loads(f2.read())
tmp = data["cityList"]["cityMapList"]
with open("data_code.txt", "w", encoding="utf-8") as fp:
    for i in tmp:          # cityMapList groups cities (e.g. by initial letter)
        for j in tmp[i]:
            if j["name"] in city:
                # City name immediately followed by its code, one pair per line
                fp.write(j["name"] + j["code"] + "\n")
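After a run, data_code.txt holds one city per line, the name immediately followed by its three-digit code, which is exactly the format the parsing step in the next script relies on. The content looks roughly like this (the codes shown are illustrative, not guaranteed to match the live site):
北京530
上海538
广州763
深圳765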
② Getting the job postings (the main event)
Ⅰ. How to capture the data
Hovering the cursor over a job card fires a request whose URL contains "position-detail-new?", and the posting's details come back in the JSON response. So the plan is: register a response listener up front, perform a hover on every element matched by the job-card selector, and save each qualifying response into a dictionary.
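Distilled from the full script in section Ⅲ below, the listener-plus-hover pattern looks like this (a minimal sketch; jl=530 is assumed to be Beijing's code, and without login only the first page of results is visible):
from playwright.sync_api import sync_playwright, Response

captured = {}

def monitor_response(res: Response):
    # Keep only the hover-triggered job-detail responses
    if "position-detail-new?" in res.url:
        captured[res.url] = res.json()["data"]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.on("response", monitor_response)
    # kw is the URL-encoded search keyword "数据分析"
    page.goto("https://sou.zhaopin.com/?jl=530&kw=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&p=1")
    page.wait_for_timeout(5000)
    for item in page.query_selector_all(".iteminfo__line1__jobname__name"):
        item.hover()
        page.wait_for_timeout(500)
    print(f"captured {len(captured)} job-detail responses")
    browser.close()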
Ⅱ. Handling user login
Without logging in you cannot view more than the first page of results, and automation tools start from a fresh profile, so any previous login is wiped.
I first tried saving the post-login cookie data and replaying it in the automated session; that did not work.
So here is a somewhat "clumsy" workaround: log in manually first, then drive the already-open browser page from the automation script.
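For reference, the cookie-replay attempt would typically go through Playwright's storage_state API, roughly as sketched below (the login flow here is a placeholder); as noted above, this did not get past Zhaopin's login checks in my tests, hence the CDP workaround that follows.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.zhaopin.com/")           # log in manually in this window
    input("Log in in the browser window, then press Enter...")
    context.storage_state(path="auth.json")         # saves cookies + localStorage
    # A later run would try to restore the session like this:
    restored = browser.new_context(storage_state="auth.json")
    browser.close()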
(1) First, add the directory containing chrome.exe to the PATH environment variable.
(2) In a cmd console, run chrome.exe --remote-debugging-port=12345 --user-data-dir="F:\playwright_chrome"
Log in on the page that opens; after that, you can click 下一页 (next page) and browse further results.
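Before launching the full scrape, it is worth confirming that Playwright can attach to that debugging port (a minimal sketch):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Attach to the Chrome instance started with --remote-debugging-port=12345
    browser = p.chromium.connect_over_cdp("http://localhost:12345/")
    context = browser.contexts[0]    # the existing, logged-in profile
    page = context.pages[0]          # the tab you logged in on
    print(page.title())              # should print the Zhaopin page title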
Ⅲ. Running the program
Each city's data is written to disk as soon as that city finishes, so an interruption partway through does not lose everything already collected.
The complete implementation:
from playwright.sync_api import sync_playwright, Response
import json

# 1. Path to Playwright's bundled Chromium (recorded here, but unused when attaching over CDP)
exepath = r"C:\Users\PC\AppData\Local\ms-playwright\chromium-1091\chrome-win"

# 2. Read the city names and codes produced by _1_getCityCode.py
with open("data_code.txt", "r", encoding="utf-8") as f:
    data = f.readlines()
data = [i.strip() for i in data]
result = []
# Split each "name+code" line into [name, code]; the codes are three digits
for item in data:
    if item[-1].isdigit():
        result.append([item[:-3], item[-3:]])
    else:
        result.append([item, ""])

# 3. Scrape the data
try:
    with sync_playwright() as p:
        # Attach to the already-running, logged-in browser instance
        browser = p.chromium.connect_over_cdp('http://localhost:12345/')
        page = browser.contexts[0].pages[0]
        # Iterate over the cities
        for i in range(len(result)):
            dict_name = f"data_{result[i][0]}"
            jobMessage = {}

            # Response listener: save every qualifying response into the dict
            def monitor_response(res: Response):
                if "position-detail-new?" in res.url:
                    data = res.json()['data']
                    pn = data["detailedPosition"]["positionNumber"]
                    jobMessage[pn] = data

            # Listen for the page's response events
            page.on("response", monitor_response)
            code = result[i][1]
            url = f"https://sou.zhaopin.com/?jl={code}&kw=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&p=1"
            page.goto(url)
            page.wait_for_timeout(5000)
            # Hover over every element matched by the job-name selector
            page_number = 0
            while True:
                page_number += 1
                print(f"Scraping {result[i][0]}, page {page_number}...")
                for item in page.query_selector_all(".iteminfo__line1__jobname__name"):
                    try:
                        item.hover()
                    except Exception:
                        print(item, "hover failed, skipping")
                    page.wait_for_timeout(500)
                next_button = page.query_selector("text=下一页")
                if not next_button:
                    break
                try:
                    next_button.click()
                except Exception:
                    break
                page.wait_for_timeout(5000)
            # Detach this city's listener so the next city starts clean
            page.remove_listener("response", monitor_response)
            # Save each city's data to its own JSON file as soon as it finishes
            # (mode "w": appending would produce invalid JSON on re-runs)
            with open(f"{dict_name}.json", mode="w", encoding="utf-8") as fp:
                json.dump(jobMessage, fp, ensure_ascii=False)
            print(f"Finished {result[i][0]}; data saved to {dict_name}.json")
        browser.close()
except Exception:
    print("Interrupted while scraping")
3. Generating the Word Cloud
Merge the per-city JSON files, parse out the postings, and use the skill-requirement labels as the keywords for a word cloud. The figure can be saved manually from the plot window, or in code as below.
import os
import json
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Merge every per-city JSON file in city_data/ into one dict
city_total = {}
city_list = os.listdir("city_data")
for name in city_list:
    with open(f"city_data/{name}", "r", encoding="utf-8") as f:
        city_total.update(json.loads(f.read()))

# Pull the skill labels out of every posting, space-separated, into values.txt
with open("values.txt", "w", encoding="utf-8") as fp:
    for key, job in city_total.items():
        for label in job["detailedPosition"]["skillLabel"]:
            fp.write(label["value"] + " ")

with open("values.txt", "r", encoding="utf-8") as f:
    data = f.read()

# font_path must point to a font that can render Chinese; search your machine
# for a .ttf file and pick any one
wordcloud = WordCloud(width=800, height=400, background_color='white', collocations=False,
                      font_path=r"F:\Program Files\腾讯游戏\QQ飞车\Fonts\songti.ttf").generate(data)

# Draw the word cloud; save before show() so the saved figure is not blank on some backends
plt.rcParams["font.sans-serif"] = ["kaiti"]
fig = plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
fig.savefig("wordcloud.png", dpi=300)
plt.show()
4. Summary and Reflections
1. The approach is convenient but not entirely stable: scraping can abort partway through because of network speed and similar issues, which is exactly why the script prints which city and page it is currently on. Several such interruptions actually occurred mid-run.
2. Scraping is slow: the four cities (Beijing, Shanghai, Guangzhou, Shenzhen) alone took about half an hour.
3. Corrections and better approaches from experienced readers are welcome.
4. For learning and reference only, not for commercial use; if any legal issue is involved, contact the author and the content will be removed immediately.