
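Web scraping: logging in to cnblogs with selenium, semi-automatic upvoting on Chouti, using XPath, selenium action chains, automatic login to 12306 (with automation detection disabled), using a captcha-solving platform (cropping the captcha image), and logging in automatically with a captcha-solving platform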

Date: 2023-03-20 18:58:35
Tags: login, selenium, find, bro, import, captcha solving, element, div

Logging in to cnblogs with selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import json

bro = webdriver.Chrome(executable_path='./chromedriver.exe')

try:
    ##### 1. Get the cookies (run once, then comment out)
    # bro.get('https://www.cnblogs.com/')
    # bro.implicitly_wait(10)
    # login_btn = bro.find_element(by=By.LINK_TEXT,value='登录')  # '登录' is the text of the "Log in" link
    # login_btn.click()
    # username = bro.find_element(By.ID,'mat-input-0')
    # password = bro.find_element(By.ID,'mat-input-1')
    # submit_btn = bro.find_element(By.CSS_SELECTOR,'body > app-root > app-sign-in-layout > div > div > app-sign-in > app-content-container > div > div > div > form > div > button')
    #
    # username.send_keys('[email protected]')
    # # Type the password, click the login button and solve the captcha by hand; once logged in, press Enter in the console
    # input()
    # # Read the cookies and save them
    # cookie = bro.get_cookies()
    # print(cookie)
    # with open('cnblogs.json','w',encoding='utf-8')as f:
    #     json.dump(cookie,f)
    ##### 2. Open the home page
    bro.get('https://www.cnblogs.com/')  # not logged in yet
    bro.implicitly_wait(10)
    time.sleep(2)
    # Load the cookies saved in the local JSON file
    with open('./cnblogs.json', 'r', encoding='utf-8') as f:
        cookies = json.load(f)
    for cookie in cookies:
        bro.add_cookie(cookie)

    bro.refresh()  # reload so the page is requested again with the cookies
    time.sleep(5)


except Exception as e:
    print(e)

finally:
    bro.close()
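
The save and load halves of the script above can be separated into two small helpers. This is a minimal sketch that only covers the JSON round trip, not the browser part; the function names `save_cookies` and `load_cookies` and the sample cookie are made up for illustration, and the cookie list is assumed to have the shape that `bro.get_cookies()` returns.

```python
import json
import os
import tempfile


def save_cookies(cookies, path):
    """Write a list of cookie dicts (as returned by get_cookies()) to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cookies, f, ensure_ascii=False, indent=2)


def load_cookies(path):
    """Read the cookie list back from the JSON file."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


# Placeholder data; a real list would come from bro.get_cookies().
sample = [{'name': 'session', 'value': 'placeholder', 'domain': '.cnblogs.com', 'path': '/'}]
path = os.path.join(tempfile.mkdtemp(), 'cnblogs.json')
save_cookies(sample, path)
restored = load_cookies(path)
print(restored == sample)
```

Each dict in the restored list can then be passed to `bro.add_cookie(...)`, as the loop in the script does.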


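Semi-automatic upvoting on Chouti

1  Log in semi-automatically with selenium ---> save the cookies
2  Work out the URL of the upvote request ---> send the same request with the requests module ---> carry the cookies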


from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import json
import requests

# bro = webdriver.Chrome(executable_path='./chromedriver.exe')

try:

    #### 1. Log in first and save the cookies (run once, then comment out)
    # bro.get('https://dig.chouti.com/')
    # bro.maximize_window()
    # bro.implicitly_wait(10)
    #
    # login_btn = bro.find_element(By.LINK_TEXT, '登录')  # '登录' is the text of the "Log in" link
    # # login_btn.click()  # raises an error; the link can no longer be clicked this way
    #
    # # Click it from JavaScript instead; the element is passed in as arguments[0]
    # bro.execute_script("arguments[0].click()", login_btn)
    # time.sleep(3)
    #
    # username = bro.find_element(By.CSS_SELECTOR,
    #                             'body > div.login-dialog.dialog.animated2.scaleIn > div > div.login-body > div.form-item.login-item.clearfix.phone-item.mt24 > div.input-item.input-item-short.left.clearfix > input')
    # password = bro.find_element(By.NAME, 'password')
    #
    # username.send_keys('18953675221')
    # password.send_keys('lqz123')
    # time.sleep(1)
    # submit_btn = bro.find_element(By.CSS_SELECTOR,
    #                               'body > div.login-dialog.dialog.animated2.scaleIn > div > div.login-footer > div:nth-child(4) > button')
    # submit_btn.click()
    #
    # input('')  # if a captcha appears, solve it by hand, then press Enter
    # with open('chouti.json', 'w', encoding='utf-8') as f:
    #     json.dump(bro.get_cookies(), f)

    ### 2. Upvote with requests, carrying the cookies
    # Load the saved cookies first
    with open('chouti.json', 'r', encoding='utf-8') as f:
        cookies = json.load(f)

    # requests cannot use selenium's cookie list directly; reduce it to a name -> value dict

    request_cookies = {}
    for cookie in cookies:
        request_cookies[cookie['name']] = cookie['value']

    print(request_cookies)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
    }
    res = requests.get('https://dig.chouti.com/top/24hr?_=1679277434856', headers=headers)

    for item in res.json().get('data'):
        id_link = item.get('id')
        data = {
            'linkId': id_link
        }
        res2 = requests.post('https://dig.chouti.com/link/vote', headers=headers, data=data, cookies=request_cookies)
        print(res2.text)

except Exception as e:
    print(e)

finally:
    # bro.close()
    pass
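
The cookie conversion in the script above can be pulled out into a helper. This is a minimal sketch, assuming the input has the shape selenium's `get_cookies()` returns (a list of dicts with at least `name` and `value`); the function name `selenium_cookies_to_dict` and the sample values are hypothetical.

```python
def selenium_cookies_to_dict(cookies):
    """Reduce selenium-style cookie dicts to the {name: value} mapping requests accepts."""
    return {c['name']: c['value'] for c in cookies if 'name' in c and 'value' in c}


# Placeholder data in the shape of bro.get_cookies(); not real cookie values.
sample = [
    {'name': 'token', 'value': 'abc', 'domain': '.chouti.com', 'path': '/'},
    {'name': 'deviceId', 'value': '123', 'domain': 'dig.chouti.com', 'path': '/'},
]
request_cookies = selenium_cookies_to_dict(sample)
print(request_cookies)
```

The resulting dict is what the script passes as `cookies=request_cookies` to `requests.post`. Fields such as `domain` and `path` are dropped, which is fine when every request goes to the same site.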

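Using XPath

# Every parsing tool has its own lookup methods
   bs4: find and find_all
   selenium: find_element and find_elements
   lxml: also a parser; supports XPath and CSS selectors

Almost all of these tools also support two shared selector languages, CSS and XPath
    CSS selectors
    XPath, which needs to be learned


What is XPath?
   XPath is the XML Path Language, a language for addressing parts of an XML document
   
   
Only a few pieces of syntax need to be remembered
  /      at the start of a path, begin at the document root; between steps, move to direct children
  /div   select div elements that are direct children
  //     search recursively through all descendants
  //div  find div elements at any depth
  @      select an attribute
  .      the current node
  ..     the parent node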
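
The syntax listed above can be tried out without a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath: paths must be relative (start with `.`), and attributes are read with `.get()` rather than selected with a trailing `/@href`. lxml's `xpath()` method accepts full XPath 1.0 expressions instead. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<html>
  <body>
    <div id="main">
      <a href="/first">first</a>
      <div class="inner"><a href="/second">second</a></div>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(doc)  # root is the <html> element

# '/' between steps: only direct children are matched at each step
direct_divs = root.findall('./body/div')

# '//' : recursive search through all descendants
all_divs = root.findall('.//div')

# '@' inside a predicate: filter on an attribute value
inner_divs = root.findall(".//div[@class='inner']")

# '..' : step up to the parent of each matched node
link_parents = root.findall('.//a/..')

# Attribute values are read from the matched elements
hrefs = [a.get('href') for a in root.findall('.//a')]

print(len(direct_divs), len(all_divs), len(inner_divs))
print([p.get('id') or p.get('class') for p in link_parents])
print(hrefs)
```

In selenium the same kind of expression is passed as `bro.find_element(By.XPATH, '//div[@class="inner"]/a')`.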


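selenium action chains

A person can drag certain elements with the mouse

Some sites rely on press-and-drag interactions
   for example, slider captchas
   
Two ways to do it
   Option 1:
    actions = ActionChains(bro)  # get an action chain object
    actions.drag_and_drop(source, target)  # queue the action; queued actions run in order
    actions.perform()

   Option 2:
    ActionChains(bro).click_and_hold(source).perform()
    distance = target.location['x'] - source.location['x']
    track = 0
    while track < distance:
        ActionChains(bro).move_by_offset(xoffset=2, yoffset=0).perform()
        track += 2
    ActionChains(bro).release().perform()  # let go of the mouse button to finish the drag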

Action chain example

import time
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By  # lookup strategies: By.ID, By.CSS_SELECTOR, ...
from selenium.webdriver.common.keys import Keys  # keyboard keys

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait  # wait for elements to load

try:
    browser = webdriver.Chrome(executable_path='./chromedriver.exe')
    browser.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
    browser.switch_to.frame('iframeResult')  # switch into the frame whose id is iframeResult
    target = browser.find_element(By.ID, 'droppable')  # drop target

    source = browser.find_element(By.ID, 'draggable')  # element to drag

    # Option 1
    # actions = ActionChains(browser)  # get an action chain object
    # actions.drag_and_drop(source, target)  # queue the action; queued actions run in order
    # actions.perform()
    # Option 2
    ActionChains(browser).click_and_hold(source).perform()
    distance = target.location['x'] - source.location['x']
    track = 0
    while track < distance:
        ActionChains(browser).move_by_offset(xoffset=6, yoffset=0).perform()
        track += 6
    ActionChains(browser).release().perform()  # let go of the mouse button to finish the drag

    time.sleep(2)

finally:
    browser.close()
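
The fixed-step loop in the example can overshoot: with a distance of 20 and a step of 6 it moves 24 pixels. A small helper can split the distance into steps that add up exactly. This is a sketch only; the name `build_track` is hypothetical and is not part of selenium.

```python
def build_track(distance, step=6):
    """Split a horizontal drag of `distance` pixels into moves of at most `step` pixels."""
    if step <= 0:
        raise ValueError('step must be positive')
    moves = []
    remaining = max(int(distance), 0)  # a negative distance yields no moves
    while remaining > 0:
        move = min(step, remaining)  # the last move may be shorter than `step`
        moves.append(move)
        remaining -= move
    return moves


track = build_track(20, step=6)
print(track)  # [6, 6, 6, 2]
```

Each value in the list would be passed as `xoffset` to `move_by_offset`, in place of the constant 6 in the loop above.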


Automatic login to 12306 (disabling automation detection)

Logging in to 12306 automatically with selenium
import time
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By  # lookup strategies: By.ID, By.CSS_SELECTOR, ...
from selenium.webdriver.common.keys import Keys  # keyboard keys

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait  # wait for elements to load
from selenium.webdriver.chrome.options import Options


try:
   options = Options()
   options.add_argument("--disable-blink-features=AutomationControlled")  # hide the automation-controlled flag
   browser = webdriver.Chrome(executable_path='./chromedriver.exe',options=options)
   browser.get('https://kyfw.12306.cn/otn/resources/login.html')
   browser.maximize_window()  # maximize the window
   username = browser.find_element(By.ID,'J-userName')
   password = browser.find_element(By.ID,'J-password')
   username.send_keys('') # enter the user name
   password.send_keys('')  # enter the password
   login_btn = browser.find_element(By.ID,'J-login')

   time.sleep(2)
   login_btn.click()
   time.sleep(5)

   span = browser.find_element(By.ID,'nc_1_n1z')

   ActionChains(browser).click_and_hold(span).perform()
   ActionChains(browser).move_by_offset(xoffset=300, yoffset=0).perform()

   # The slide completes, but the login is refused if the site detects selenium; the option set above blocks that detection

   time.sleep(3)

finally:
    browser.close()

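Using a captcha-solving platform

Login pages often show a captcha. A third-party captcha-solving platform can solve it for you, for a fee

Free option: for captchas made only of digits or only of letters, free Python modules exist, but their success rate is low

#  Yundama and Chaojiying (Chaojiying is the example used here)

Yundama: https://zhuce.jfbym.com/price/

# Pricing: what each captcha type costs to solve
http://www.chaojiying.com/price.html

Automatic login with a captcha-solving platform (page screenshot)

Open the page with selenium ---> take a screenshot of the whole page ---> use pillow to crop out the captcha image by its position ---> solve it with the third-party platform ---> type the result into the captcha field and click login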

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from chaojiying import ChaojiyingClient
from PIL import Image
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('http://www.chaojiying.com/apiuser/login/')
bro.implicitly_wait(10)
bro.maximize_window()
try:
    username = bro.find_element(by=By.XPATH, value='/html/body/div[3]/div/div[3]/div[1]/form/p[1]/input')
    password = bro.find_element(by=By.XPATH, value='/html/body/div[3]/div/div[3]/div[1]/form/p[2]/input')
    code = bro.find_element(by=By.XPATH, value='/html/body/div[3]/div/div[3]/div[1]/form/p[3]/input')
    btn = bro.find_element(by=By.XPATH, value='/html/body/div[3]/div/div[3]/div[1]/form/p[4]/input')
    username.send_keys('306334678')
    password.send_keys('lqz123')
    # Get the captcha:
    # 1. Take a screenshot of the whole page
    bro.save_screenshot('main.png')
    # 2. Use pillow to crop the captcha out of the screenshot into code.png
    img = bro.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/div/img')
    location = img.location
    size = img.size
    print(location)
    print(size)
    # Crop box (left, upper, right, lower) of the captcha inside the screenshot
    img_tu = (int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))
    # Open the full screenshot
    img = Image.open('./main.png')
    # Crop it
    fram = img.crop(img_tu)
    # Save the cropped captcha
    fram.save('code.png')
    # 3. Solve it with Chaojiying
    chaojiying = ChaojiyingClient('306334678', 'lqz123', '937234')  # third argument: a software ID generated under User Center >> Software ID
    with open('code.png', 'rb') as f:  # path of the local captcha image
        im = f.read()
    result = chaojiying.PostPic(im, 1902)  # 1902 is the captcha type; see the pricing page on the official site
    print(result)
    res_code = result['pic_str']  # call PostPic only once, since each call is billed
    code.send_keys(res_code)
    time.sleep(5)
    btn.click()
    time.sleep(10)
except Exception as e:
    print(e)
finally:
    bro.close()
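
The crop box computed in the script assumes that one CSS pixel equals one screenshot pixel. On a display with OS scaling (for example 125%), the screenshot is larger than the page coordinates and the crop lands in the wrong place. The sketch below adds a scale factor; the helper name `crop_box` is hypothetical, and reading the factor with `bro.execute_script('return window.devicePixelRatio')` is an assumption about how one might obtain it.

```python
def crop_box(location, size, scale=1.0):
    """Return the (left, upper, right, lower) box that pillow's Image.crop expects.

    `location` and `size` are the dicts selenium exposes as element.location
    and element.size; `scale` is the ratio of screenshot pixels to CSS pixels.
    """
    left = int(location['x'] * scale)
    upper = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    lower = int((location['y'] + size['height']) * scale)
    return (left, upper, right, lower)


# Placeholder coordinates for illustration.
box = crop_box({'x': 100, 'y': 200}, {'width': 80, 'height': 30})
scaled = crop_box({'x': 100, 'y': 200}, {'width': 80, 'height': 30}, scale=1.25)
print(box)     # (100, 200, 180, 230)
print(scaled)  # (125, 250, 225, 287)
```

The tuple goes straight into `Image.open('main.png').crop(...)`. selenium's `WebElement.screenshot('code.png')` can also capture just the element, which avoids the cropping step.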

Scraping JD product information with selenium

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # keyboard keys


def get_goods(bro):
    # Find every li element whose class is gl-item
    li_list = bro.find_elements(By.CLASS_NAME, 'gl-item')
    for li in li_list:
        try:
            img_url = li.find_element(By.CSS_SELECTOR, '.p-img img').get_attribute('src')
            if not img_url:
                # Lazy-loaded images keep a protocol-relative URL in data-lazy-img
                img_url = 'https:' + li.find_element(By.CSS_SELECTOR, '.p-img img').get_attribute('data-lazy-img')
            price = li.find_element(By.CSS_SELECTOR, '.p-price i').text
            name = li.find_element(By.CSS_SELECTOR, '.p-name a').text
            url = li.find_element(By.CSS_SELECTOR, '.p-img a').get_attribute('href')
            if url and url.startswith('//'):  # add the scheme only to protocol-relative links
                url = 'https:' + url
            commit = li.find_element(By.CSS_SELECTOR, '.p-commit a').text
            print('''
            Image URL: %s
            Product URL: %s
            Product name: %s
            Price: %s
            Number of comments: %s
            ''' % (img_url, url, name, price, commit))
        except Exception as e:
            print(e)
            continue

    # Find the "next page" link ('下一页'), click it, then run get_goods again
    next_btn = bro.find_element(By.PARTIAL_LINK_TEXT, '下一页')
    time.sleep(1)
    next_btn.click()
    get_goods(bro)


try:
    bro = webdriver.Chrome(executable_path='./chromedriver.exe')
    bro.get('http://www.jd.com')
    bro.implicitly_wait(10)

    input_key = bro.find_element(By.ID, 'key')
    input_key.send_keys('茅台')
    input_key.send_keys(Keys.ENTER)  # press Enter to search
    # Scroll down the page (to y=5000) so more items are loaded
    bro.execute_script('scrollTo(0,5000)')
    get_goods(bro)



except Exception as e:
    print('error:', e)
finally:
    bro.close()
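
Product pages mix protocol-relative values (`//host/path`, typical of raw attributes such as `data-lazy-img`) with values the browser has already resolved to a full URL. Prefixing `'https:'` unconditionally breaks the second kind. A minimal sketch of a safer normalizer using the standard library; the helper name `absolute_url`, the default base and the sample URLs are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_url(value, base='https://www.jd.com/'):
    """Resolve a protocol-relative or relative URL against `base`.

    Absolute URLs pass through unchanged; empty values become ''.
    """
    return urljoin(base, value) if value else ''


lazy_img = absolute_url('//img10.360buyimg.com/n7/a.jpg')
resolved = absolute_url('https://item.jd.com/100.html')
relative = absolute_url('/list.html')
print(lazy_img)  # https://img10.360buyimg.com/n7/a.jpg
print(resolved)  # https://item.jd.com/100.html
print(relative)  # https://www.jd.com/list.html
```

Inside `get_goods` the helper would wrap each `get_attribute(...)` result before printing.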

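Introduction to scrapy

# Covered so far: the requests, bs4 and selenium modules

# Frameworks: django, scrapy ---> scrapy is a framework built for crawling, the django of the crawling world: large and full-featured, with everything a crawler needs built in

# Installation (hit-or-miss on Windows; no problems on Linux and macOS)
   pip3.8 install scrapy
   
   If it will not install, the cause is almost always that twisted fails to install; install the pieces separately
        1. pip3 install wheel  # enables installing packages from wheel files; wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs
        2. pip3 install lxml
        3. pip3 install pyopenssl
        4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
        5. Download the twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
        6. Run pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
        7. pip3 install scrapy
        
        
        
# Architecture
     Spiders: spiders (written by you; there can be many) ---> define the URLs to crawl and the parsing rules
     Engine: engine ---> controls the flow of data through the whole framework; the coordinator
     Scheduler: scheduler ---> holds the Request objects waiting to be crawled, in a queue
     Downloader middleware: DownloaderMiddleware ---> processes request objects and response objects
     Downloader: Downloader ---> does the actual downloading; very efficient, built on twisted's high-concurrency model
    
     Spider middleware: SpiderMiddleware ---> sits between the engine and the spiders (rarely used)
     Pipelines: pipelines ---> responsible for storing the data
    
    

# Create a scrapy project: run in a terminal (cmd)
    scrapy startproject firstscrapy

# Create a spider: cd into the project directory first
    scrapy genspider <name> <domain>        # creates a spider, comparable to creating an app in django

    # Then open the project in PyCharm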

From: https://www.cnblogs.com/xm15/p/17237307.html
