
Boss直聘 Job Listings: Scraping + Analysis




Results first: this run scraped 732 job listings and stored them in the database:

[Screenshots: the scraped job records in the database, and a sample record in JSON form]
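The rows land in a MySQL table named boss_job inside a database called spider. The post doesn't show the DDL, so here is a minimal schema sketch reconstructed from the INSERT statement in the code below; every column type is an assumption (everything is stored as text):

from pymysql import connect

# Assumed schema matching the INSERT in connect_mysql() below;
# column types are guesses, not the author's actual DDL.
DDL = '''
create table if not exists boss_job (
    id int primary key auto_increment,
    company_name varchar(255),
    job_title varchar(255),
    job_salary varchar(64),
    job_vline varchar(255),
    job_years varchar(64),
    job_detail text,
    keyword varchar(64)
)'''

conn = connect(host='localhost', port=3306, user='root', passwd='1996101',
               db='spider', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute(DDL)
conn.commit()
conn.close()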

The implementation:

import requests
import json
from lxml import etree
from requests.exceptions import RequestException
from pymysql import connect


class BossSpider():
    def get_html(self, url):
        """Fetch a URL and return the response body, or None on failure."""
        try:
            # A mobile user-agent, so the site serves the lighter mobile pages
            headers = {
                'user-agent': 'Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Mobile Safari/537.36'
            }
            response = requests.get(url=url, headers=headers)
            # Set the encoding explicitly, otherwise the text can come back garbled
            response.encoding = "utf-8"
            if response.status_code == 200:
                return response.text
            return None
        except RequestException:
            return None
    def parse_search_html(self, resp):
        """Parse the search-results JSON and build full detail-page URLs."""
        base_url = 'https://www.zhipin.com'
        resp = json.loads(resp)
        resp_html = resp['html']  # the endpoint wraps an HTML fragment in JSON
        if resp_html:
            resp_html = etree.HTML(resp_html)
            # Each result <li> carries the detail href plus ka/data-lid tracking params
            detail_url_href = resp_html.xpath('//li[@class="item"]/a/@href')
            detail_url_ka = resp_html.xpath('//li[@class="item"]/a/@ka')
            detail_url_datalid = resp_html.xpath('//li[@class="item"]/a/@data-lid')
            detail_url_href = [base_url + i for i in detail_url_href]
            detail_url_params = []
            for ka, lid in zip(detail_url_ka, detail_url_datalid):
                detail_url_params.append({'ka': ka, 'lid': lid})
            detail_url = [self.get_full_url(href, params)
                          for href, params in zip(detail_url_href, detail_url_params)]
            return detail_url
        else:
            return None
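    # The mobile search endpoint appears to return JSON shaped roughly like
    #   {"html": "<li class=\"item\"><a href=\"/job_detail/....html\" ka=\"...\" data-lid=\"...\">...</a></li>"}
    # (inferred from the parsing above; the exact markup is an assumption).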

    def parse_detail_html(self, resp, job_json):
        """Extract the fields of interest from a job-detail page and store them."""
        resp_html = etree.HTML(resp)
        # Job description: collect all text nodes, then strip whitespace
        job_detail = resp_html.xpath('//div[@class="job-detail"]/div[@class="detail-content"]/div[1]/div//text()')
        job_detail = [i.replace('\n', '').replace(' ', '') for i in job_detail]
        job_detail = ''.join(job_detail)
        company_name = resp_html.xpath('//div[@class="info-primary"]/div[@class="flex-box"]/div[@class="name"]/text()')[0]
        job_title = resp_html.xpath('//div[@id="main"]/div[@class="job-banner"]/h1[@class="name"]/text()')[0]
        job_salary = resp_html.xpath('//div[@id="main"]/div[@class="job-banner"]//span[@class="salary"]//text()')[0]
        # Positional text nodes from the banner <p>; the indices were chosen by inspection
        job_vline = resp_html.xpath('//div[@id="main"]/div[@class="job-banner"]/p/text()')[2]
        job_years = resp_html.xpath('//div[@id="main"]/div[@class="job-banner"]/p/text()')[1]
        job_json['job_detail'] = job_detail
        job_json['company_name'] = company_name
        job_json['job_title'] = job_title
        job_json['job_salary'] = job_salary
        job_json['job_vline'] = job_vline
        job_json['job_years'] = job_years
        self.connect_mysql(job_json)

    def get_full_url(self, url, uri: dict) -> str:
        """Append a dict of query parameters to a URL."""
        return url + '?' + '&'.join([str(key) + '=' + str(value) for key, value in uri.items()])
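    # Example with hypothetical values:
    #   get_full_url('https://www.zhipin.com/job_detail/abc123.html', {'ka': 'search_list_1', 'lid': 'xyz'})
    #   -> 'https://www.zhipin.com/job_detail/abc123.html?ka=search_list_1&lid=xyz'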
    def connect_mysql(self, job_json):
        """Insert one scraped job record into the boss_job table."""
        try:
            job_list = [job_json['company_name'], job_json['job_title'],
                        job_json['job_salary'], job_json['job_vline'],
                        job_json['job_years'], job_json['job_detail'], job_json['keyword']]
            # Note: this opens a fresh connection per record, which is fine
            # for a run of this size but wasteful at scale
            conn = connect(host='localhost', port=3306,
                           user='root', passwd='1996101',
                           db='spider', charset='utf8')
            cursor = conn.cursor()
            sql = 'insert into boss_job(company_name,job_title,job_salary,job_vline,job_years,job_detail,keyword) values(%s,%s,%s,%s,%s,%s,%s)'
            cursor.execute(sql, job_list)
            conn.commit()
            cursor.close()
            conn.close()
            print('One record inserted....')
        except Exception as e:
            print(e)
    def spider(self, url, job_json):
        """Fetch one search page, then visit and parse every detail page on it."""
        resp_search = self.get_html(url)
        if resp_search is None:  # request failed or non-200; skip this page
            return
        detail_urls = self.parse_search_html(resp_search)
        if detail_urls:
            for detail_url in detail_urls:
                resp_detail = self.get_html(detail_url)
                if isinstance(resp_detail, str):
                    self.parse_detail_html(resp_detail, job_json)


if __name__ == '__main__':
    keywords = ['数据']  # search keywords; '数据' means "data"
    MAX_PAGE = 20
    boss_spider = BossSpider()
    for keyword in keywords:
        for page_num in range(MAX_PAGE):
            # page is 1-based; city is the site's numeric city code
            url = f'https://www.zhipin.com/mobile/jobs.json?page={page_num+1}&city=101210100&query={keyword}'
            job_json = {'keyword': keyword}
            boss_spider.spider(url, job_json)
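
The "analysis" half of the title stops at ingestion in the code above. As a follow-up, here is a minimal sketch that pulls the rows back out and summarizes salaries per keyword; it assumes the boss_job schema sketched earlier, that pandas is installed, and that salary strings look like '15-25K' (the display format is an assumption, so anything else falls back to None):

import re
import pandas as pd
from pymysql import connect

conn = connect(host='localhost', port=3306, user='root', passwd='1996101',
               db='spider', charset='utf8')
cursor = conn.cursor()
cursor.execute('select job_salary, keyword from boss_job')
rows = cursor.fetchall()
cursor.close()
conn.close()

df = pd.DataFrame(rows, columns=['job_salary', 'keyword'])

def salary_midpoint_k(s):
    # Parse strings like '15-25K' into a midpoint in thousands of RMB;
    # the exact format is an assumption, so unparseable values become None.
    m = re.match(r'(\d+)-(\d+)K', str(s).strip(), re.IGNORECASE)
    if not m:
        return None
    low, high = int(m.group(1)), int(m.group(2))
    return (low + high) / 2

df['salary_mid_k'] = df['job_salary'].map(salary_midpoint_k)
print(df.groupby('keyword')['salary_mid_k'].describe())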


From: https://blog.51cto.com/u_15955938/6039467
