Create a virtual environment and install Selenium version 4 or later in it.
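As a quick sanity check once the environment is set up, a sketch like the one below could confirm that the installed Selenium really is version 4 or later. The helper names `major_version` and `selenium_is_v4_or_later` are made up for illustration; only the standard-library `importlib.metadata` module is used.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '4.10.0'."""
    return int(version_string.split('.')[0])


def selenium_is_v4_or_later():
    """Return True if Selenium is installed and its major version is at least 4."""
    try:
        return major_version(version('selenium')) >= 4
    except PackageNotFoundError:
        # Selenium is not installed in the active environment.
        return False
```

Calling `selenium_is_v4_or_later()` from inside the activated virtual environment should return `True` before the scripts in this post are run.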

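The site's anti-scraping measures are fairly strict, so instead of replaying the HTTP requests directly, this post drives a real browser with Selenium.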

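Approach: 1. First use Selenium to fetch the pages and save them to disk (only the listing pages; fetching every detail page would take far too many requests). 2. Then parse the saved pages to extract the fields we want. Splitting the work this way means that getting blocked halfway through the crawl does not lose what has already been collected.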
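One benefit of saving pages before parsing them is that an interrupted crawl can pick up where it stopped. The sketch below is not part of the original scripts; it assumes the pages are saved as `html/<page>.html` (as in the code in this post) and uses a hypothetical helper, `pending_pages`, to list the pages that still need to be fetched.

```python
import os
import tempfile


def pending_pages(html_dir, total_pages):
    """Return the page numbers in 1..total_pages that have no saved HTML file yet."""
    return [
        p for p in range(1, total_pages + 1)
        if not os.path.exists(os.path.join(html_dir, f'{p}.html'))
    ]


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    # Pretend page 1 was saved before the crawl was interrupted.
    with open(os.path.join(tmp, '1.html'), 'w', encoding='utf-8') as f:
        f.write('<html></html>')
    demo_result = pending_pages(tmp, 3)

print(demo_result)  # [2, 3]
```

A resumed crawl could then jump straight to the first pending page instead of starting again from page 1.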

[Screenshot: part of the results]

The implementation is as follows.

# 1. Fetch the pages
import os
import random
from time import sleep

from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By


def main():
    os.makedirs('html', exist_ok=True)  # make sure the output directory exists
    for p in range(1, 201):
        print(f'Scraping page {p}>>>')
        sleep(5 * random.random())
        for _ in range(140):
            sleep(random.random() / 5)  # random pauses so the scrolling looks less like a bot
            # scrollBy moves down 50px from the current position. Jumping straight to the bottom
            # can miss content that loads while scrolling. (window.scrollTo jumps to a position.)
            driver.execute_script('window.scrollBy(0, 50)')
        res = driver.page_source  # grab the rendered HTML
        with open(f'html/{p}.html', 'w', encoding='utf-8') as f:
            f.write(res)
        if p != 200:
            jump_input = driver.find_element(By.ID, 'jump_page')
            jump_input.clear()
            jump_input.send_keys(str(p + 1))
            sleep(random.random())
            driver.find_element(By.CLASS_NAME, 'jumpPage').click()  # jump to the next page and continue


if __name__ == '__main__':
    options = ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    driver = webdriver.Chrome(options=options)
    # stealth.min.js comes from puppeteer, where it is used to hide the fingerprints of an
    # automated browser. Extracted as a standalone file, it can also be loaded from Selenium
    # to hide Selenium's automation traits and get past the bot detection used by some sites
    # and verification checks.
    with open('stealth.min.js', encoding='utf-8') as f:
        js = f.read()
    driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': js})
    driver.get('https://we.51job.com/pc/search?keyword=&searchType=2&sortType=0&metro=')  # Hangzhou area
    sleep(5)
    main()
    driver.quit()
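To get a feel for how long the fetch step takes, the sleep calls can be averaged: `random.random()` has a mean of 0.5, so each call contributes half of its maximum on average. The following is only a rough back-of-the-envelope sketch; the function name is hypothetical, and page load time and browser overhead are ignored.

```python
def expected_sleep_seconds(pages=200, scroll_steps=140):
    """Estimate the total time spent in sleep() calls by the fetch script."""
    startup_pause = 5            # sleep(5) after the first driver.get
    per_page_pause = 5 * 0.5     # sleep(5 * random.random()), mean 2.5 s
    per_scroll_pause = 0.5 / 5   # sleep(random.random() / 5), mean 0.1 s
    jump_pause = 0.5             # sleep(random.random()) before each page jump
    return (
        startup_pause
        + pages * per_page_pause
        + pages * scroll_steps * per_scroll_pause
        + (pages - 1) * jump_pause  # no jump after the last page
    )


total = expected_sleep_seconds()
print(f'{total:.1f} s, about {total / 60:.0f} minutes')
```

With the values used in the script this comes to roughly 3400 seconds, a little under an hour, and each page is scrolled 140 × 50 = 7000 px in total.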

# 2. Parse
from lxml import etree
import pandas as pd


def collect():
    resLs = []
    for i in range(1, 201):
        with open(f'html/{i}.html', encoding='utf-8') as f:
            tree = etree.HTML(f.read())
        for li in tree.xpath('//div[@class="j_joblist"]/div'):
            name = li.xpath('.//span[@class="jname at"]/text()')[0]
            href = li.xpath('./a/@href')[0]
            time = li.xpath('.//span[@class="time"]/text()')[0]
            sala = (li.xpath('.//span[@class="sal"]/text()') + [''])[0]
            # Some listings lack some of these fields, so pad the list before indexing.
            info = li.xpath('.//span[@class="d at"]/span/text()') + [''] * 5
            addr, exp, edu = info[0], info[2], info[4]
            comp = li.xpath('.//a[@class="cname at"]/text()')[0]
            # The "dc at" paragraph holds "type | size"; the size part may be absent.
            dc = (li.xpath('.//p[@class="dc at"]/text()') + [''])[0].split('|') + ['']
            kind = dc[0].strip()
            num = dc[1].strip()
            ind = (li.xpath('.//p[@class="int at"]/text()') + [''])[0]
            dic = {
                'Title': name,
                'Link': href,
                'Date': time,
                'Salary': sala,
                'Location': addr,
                'Experience': exp,
                'Education': edu,
                'Company': comp,
                'Company type': kind,
                'Company size': num,
                'Industry': ind
            }
            print(dic)
            resLs.append(dic)
    # to_excel no longer accepts an encoding argument in pandas 2.x, so it is omitted here.
    pd.DataFrame(resLs).to_excel('51job_hangzhou.xlsx', index=False)


if __name__ == '__main__':
    collect()
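The parser relies on a padding idiom, `(values + [''] * 5)[index]`, so that a missing field yields an empty string rather than an `IndexError`. As an illustration only, the same idea can be written as a small helper; `pick` is a hypothetical name and the sample list is made-up data, not real output from the site.

```python
def pick(values, index, default=''):
    """Return values[index] for a non-negative index, or default if the list is too short."""
    return values[index] if index < len(values) else default


# Made-up sample: the XPath returned fewer items than usual for this listing.
sample = ['Hangzhou', '|', '3-4 years']

location = pick(sample, 0)
experience = pick(sample, 2)
education = pick(sample, 4)  # missing, so this falls back to ''

# The helper agrees with the padding idiom used in the parser.
assert education == (sample + [''] * 5)[4]
print(location, experience, repr(education))
```

The helper makes the fallback value explicit, which can be handy if an empty string is not the right default for a column.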

Source: https://www.cnblogs.com/socoo-/p/17094959.html
67. Using Selenium automation to scrape 200 pages of job listings (Playwright would also work)