Web Scraping: Using selenium and XPath


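Using XPath
selenium action chains
Logging in to 12306 automatically
Using a captcha-solving platform
Logging in automatically with a captcha-solving platform
Scraping JD product information with selenium
Introduction to scrapy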

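Using XPath

There are two general ways to select tags in HTML:

CSS selectors
XPath expressions

What is XPath

XPath (XML Path Language) is a language for locating parts of an XML document. HTML parsers such as lxml let you apply it to HTML as well.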

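A brief introduction to XPath syntax

Expression	Description
nodename	Selects all nodes named nodename
/	Selects from the root node (/html/body/div)
//	Selects matching nodes anywhere in the document below the current node, regardless of their position (//div)
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes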

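Handy trick:

In the browser's developer tools (Elements panel), right-click a node, choose Copy > Copy XPath, and paste the expression into your code.

XPath examples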

from lxml import etree

doc = '''
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html' id='id_a'>Name: My image 1 <br/><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
   <a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
  </div>
 </body>
</html>
'''

html = etree.HTML(doc)
# html = etree.parse('search.html',etree.HTMLParser())

# 1. Select all nodes
a = html.xpath('//*')

# 2. Select a specific node (the result is always a list)
b = html.xpath('//head')

# 3. Children and descendants
c = html.xpath('//div/a')
c1 = html.xpath('//body/a')  # empty: <a> is not a direct child of <body>
c2 = html.xpath('//body//a')

# 4. Parent node
d = html.xpath('//body//a[@href="image1.html"]/..')
d1 = html.xpath('//body//a[1]/..')
d2 = html.xpath('//body//a[1]/parent::*')
d3 = html.xpath('//body//a[1]/parent::div')

# 5. Match by attribute
e = html.xpath('//body//a[@href="image1.html"]')

# 6. Get text with text()
f = html.xpath('//body//a[@href="image1.html"]/text()')

# 7. Get attribute values
g = html.xpath('//body//a/@href')
g1 = html.xpath('//body//a/@id')
g2 = html.xpath('//body//a[1]/@id')  # note: XPath positions start at 1, not 0

# 8. Attributes with multiple values
# This <a> has several classes, so an exact match on one of them fails; use contains()
h = html.xpath('//body//a[@class="li"]')  # empty: the attribute value is "li li-item"
h1 = html.xpath('//body//a[@name="items"]')
h2 = html.xpath('//body//a[contains(@class,"li")]')
h3 = html.xpath('//body//a[contains(@class,"li")]/text()')

# 9. Combining several attribute conditions
i = html.xpath('//body//a[contains(@class,"li") or @name="items"]')
i1 = html.xpath('//body//a[contains(@class,"li") and @name="items"]/text()')

# 10. Select by position
j = html.xpath('//a[2]/text()')
j1 = html.xpath('//a[3]/@href')
# The last one
j2 = html.xpath('//a[last()]/@href')
# Positions less than 3
j3 = html.xpath('//a[position()<3]/@href')
# The second to last (last()-1; last()-2 would be the third to last)
j4 = html.xpath('//a[last()-1]/@href')

# 11. Axes

# ancestor: ancestor nodes; * selects all of them
k = html.xpath('//a/ancestor::*')
# Only the div among the ancestors
k1 = html.xpath('//a/ancestor::div')
# attribute: attribute values
k2 = html.xpath('//a[1]/attribute::*')
k3 = html.xpath('//a[1]/attribute::href')

# child: direct children
l = html.xpath('//a[1]/child::*')
# descendant: all descendants
l1 = html.xpath('//a[6]/descendant::*')

# following: every node after the current node in document order (its descendants excluded)
m = html.xpath('//a[1]/following::*')
m1 = html.xpath('//a[1]/following::*[1]/@href')
# following-sibling: siblings that come after the current node
m2 = html.xpath('//a[1]/following-sibling::*')
m3 = html.xpath('//a[1]/following-sibling::a')
m4 = html.xpath('//a[1]/following-sibling::*[2]')
m5 = html.xpath('//a[1]/following-sibling::*[2]/@href')
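
A note on example 8: contains(@class,"li") is a substring test, so it would also match a class such as "list". The following is a minimal sketch, assuming lxml is installed, of a stricter whole-token match built from the standard XPath 1.0 functions concat() and normalize-space(). The sample markup is made up for illustration.

```python
from lxml import etree

# Hypothetical markup: only the first link really has the class "li".
snippet = '''
<div>
  <a href="1.html" class="li li-item">one</a>
  <a href="2.html" class="list">two</a>
  <a href="3.html">three</a>
</div>
'''
tree = etree.HTML(snippet)

# Exact comparison fails because the attribute value is "li li-item".
exact = tree.xpath('//a[@class="li"]/@href')

# Substring test: matches "li li-item" but also "list".
loose = tree.xpath('//a[contains(@class, "li")]/@href')

# Token test: pad the class list with spaces and look for " li " as a whole word.
strict = tree.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " li ")]/@href'
)

print(exact)   # []
print(loose)   # ['1.html', '2.html']
print(strict)  # ['1.html']
```

The token form is longer, but it behaves like a CSS class selector (a.li) and does not depend on the order of the class names.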

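selenium action chains

selenium action chains are often used for slider captchas: the mouse button is held down and moved to produce the sliding effect.

Using selenium

Installing selenium

pip3 install selenium

Download chromedriver.exe and put it in the Scripts directory of your Python installation. Make sure its version matches your installed Chrome version.

Mirror site in China: http://npm.taobao.org/mirrors/chromedriver/

For the latest version, see the official site: https://sites.google.com/a/chromium.org/chromedriver/downloads

Demo page with a draggable box: http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable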

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
import time

# Selenium 4 takes the driver path through a Service object (executable_path is deprecated)
bro = webdriver.Chrome(service=Service('./chromedriver.exe'))
bro.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
bro.implicitly_wait(10)

try:
    bro.switch_to.frame('iframeResult')  # the demo runs inside an iframe
    source = bro.find_element(by=By.ID, value='draggable')
    target = bro.find_element(by=By.ID, value='droppable')
    # Option 1: queue everything on one action chain and run it in sequence
    # actions = ActionChains(bro)  # get an action chain object
    # actions.drag_and_drop(source, target)  # queue the action on the chain
    # actions.perform()

    # Option 2: a separate action chain per step, so each move can use its own offset
    ActionChains(bro).click_and_hold(source).perform()
    distance = target.location['x'] - source.location['x']
    print('x distance from source to target:', distance)
    track = 0
    while track < distance:
        ActionChains(bro).move_by_offset(xoffset=10, yoffset=0).perform()
        track += 10
    ActionChains(bro).release().perform()
    time.sleep(10)

except Exception as e:
    print(e)
finally:
    bro.close()
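
The loop above always moves 10 px, so it overshoots whenever the distance is not a multiple of 10, and a perfectly even motion is easy for a slider captcha to flag. The following is a minimal sketch of one alternative: split the distance into shrinking integer steps that sum to exactly the distance. The function name build_track and the "cover half of what remains" rule are illustrative assumptions, not part of selenium.

```python
def build_track(distance, ratio=0.5, min_step=2):
    """Split a positive pixel distance into steps that shrink as the target nears.

    Each step covers `ratio` of the remaining distance (at least `min_step`),
    and the final step is whatever is left, so the steps sum to `distance`.
    """
    steps = []
    remaining = int(distance)
    while remaining > 0:
        step = max(min_step, int(remaining * ratio))
        step = min(step, remaining)  # never move past the target
        steps.append(step)
        remaining -= step
    return steps


track = build_track(143)
print(track, sum(track))
```

Each value could then be passed to ActionChains(bro).move_by_offset(xoffset=step, yoffset=0).perform() in place of the fixed 10.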

Logging in to 12306 automatically

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


options = Options()
# Hide the automation flag: if 12306 detects that selenium controls the browser, the slider does not appear
options.add_argument("--disable-blink-features=AutomationControlled")
bro = webdriver.Chrome(service=Service('./chromedriver.exe'), options=options)

bro.get('https://kyfw.12306.cn/otn/resources/login.html')  # 12306 login page

bro.maximize_window()  # maximize the window
# Implicit wait: if WebDriver cannot find an element in the DOM, it keeps retrying
# for up to 10 seconds and only then raises a "no such element" exception
bro.implicitly_wait(10)

try:
    username = bro.find_element(by=By.ID,value='J-userName')
    username.send_keys('your account')
    password = bro.find_element(by=By.ID, value='J-password')
    password.send_keys('your password')
    time.sleep(5)
    btn = bro.find_element(by=By.ID,value='J-login')
    btn.click()
    span = bro.find_element(by=By.ID,value='nc_1_n1z')  # the slider handle

    ActionChains(bro).click_and_hold(span).perform()
    # move_by_offset queues the drag; perform() executes the queued actions
    ActionChains(bro).move_by_offset(xoffset=300, yoffset=0).perform()
    # release the mouse button to finish the slide
    ActionChains(bro).release().perform()

    time.sleep(10)

except Exception as e:
    print(e)

finally:
    bro.close()

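Using a captcha-solving platform

What is a captcha-solving platform

You send the captcha image to a third party, the third party solves it and returns the answer, and you only pay for the service.

Captcha-solving platforms:

Yundama (云打码), Chaojiying (超级鹰), and others

Logging in to Chaojiying automatically

Chaojiying recognizes an image captcha according to the captcha type code submitted with it.

The Chaojiying developer documentation provides SDKs in several languages and an HTTP API; download the one you need and call it with your own parameters.

The SDK provided by Chaojiying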

import requests
from hashlib import md5


class ChaojiyingClient():
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def PostPic_base64(self, base64_str, codetype):
        """
        base64_str: the image as a base64-encoded string
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
            'file_base64': base64_str
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: picture ID of the captcha whose answer was wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


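if __name__ == '__main__':
    # Replace the placeholders with your own account, password, and software ID (User Center >> Software ID)
    chaojiying = ChaojiyingClient('your_username', 'your_password', 'your_soft_id')
    with open('a.jpg', 'rb') as f:  # path to a local image file; on Windows the path may need // separators
        im = f.read()
    print(chaojiying.PostPic(im, 1902))  # 1902 is the captcha type, see the price list on the official site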
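
PostPic returns the parsed JSON reply. The login script later in this article reads its pic_str field; the other field names used below (err_no, err_str, pic_id) are assumptions about the reply format and should be checked against the Chaojiying documentation. The following is a minimal sketch of validating a reply before using it, with hand-written sample dictionaries instead of a real request.

```python
class CaptchaError(Exception):
    """Raised when the platform reply does not contain a usable answer."""


def extract_code(reply):
    """Return the recognized text from a reply dict, or raise CaptchaError.

    Assumed reply shape (not confirmed by this article):
    {'err_no': 0, 'err_str': 'OK', 'pic_id': '...', 'pic_str': '...'}
    """
    if reply.get('err_no', 0) != 0:
        raise CaptchaError(reply.get('err_str', 'unknown error'))
    code = reply.get('pic_str', '')
    if not code:
        raise CaptchaError('empty pic_str in reply')
    return code


# Hand-written sample replies, for illustration only.
ok_reply = {'err_no': 0, 'err_str': 'OK', 'pic_id': '1', 'pic_str': 'ab3k'}
bad_reply = {'err_no': -1, 'err_str': 'sample failure', 'pic_str': ''}

print(extract_code(ok_reply))
try:
    extract_code(bad_reply)
except CaptchaError as exc:
    print('failed:', exc)
```

Failing early like this keeps an empty or error reply from being typed into the login form.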

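Calling the Chaojiying SDK to recognize a captcha automatically

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from chaojiying import ChaojiyingClient  # the SDK above, saved as chaojiying.py
from PIL import Image

# Selenium 4 takes the driver path through a Service object (executable_path is deprecated)
bro = webdriver.Chrome(service=Service('./chromedriver.exe'))
bro.get('http://www.chaojiying.com/apiuser/login/')
bro.implicitly_wait(10)  # implicit wait
bro.maximize_window()  # maximize the window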

try:
    username = bro.find_element(by=By.XPATH,value='/html/body/div[3]/div/div[3]/div[1]/form/p[1]/input')
    password = bro.find_element(by=By.XPATH,value='/html/body/div[3]/div/div[3]/div[1]/form/p[2]/input')
    code = bro.find_element(by=By.XPATH,value='/html/body/div[3]/div/div[3]/div[1]/form/p[3]/input')
    btn = bro.find_element(by=By.XPATH,value='/html/body/div[3]/div/div[3]/div[1]/form/p[4]/input')
    username.send_keys('')  # fill in your Chaojiying user name
    password.send_keys('')  # fill in your Chaojiying password
    # Get the captcha image:
    # 1. Take a screenshot of the whole page
    bro.save_screenshot('main.png')
    # 2. Use Pillow to cut the captcha out of the screenshot and save it as code.png
    img = bro.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/div/img')
    location = img.location
    size = img.size

    # Crop box (left, upper, right, lower) of the captcha inside the screenshot
    img_tu = (
        int(location['x']), int(location['y']),
        int(location['x'] + size['width']), int(location['y'] + size['height']))
    # Open the full screenshot
    img = Image.open('./main.png')
    # Crop the captcha
    fram = img.crop(img_tu)
    # Save the cropped image
    fram.save('code.png')
    # 3. Send it to Chaojiying for recognition
    # Replace the placeholders with your own account, password, and software ID
    chaojiying = ChaojiyingClient('your_username', 'your_password', 'your_soft_id')
    with open('code.png', 'rb') as f:
        im = f.read()
    # Call PostPic once and reuse the result; each call submits the image again
    result = chaojiying.PostPic(im, 1902)  # 1902 is the captcha type
    print(result)
    res_code = result['pic_str']
    code.send_keys(res_code)
    time.sleep(5)
    btn.click()
    time.sleep(10)

except Exception as e:
    print(e)
finally:
    bro.close()
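
The crop above assumes that one CSS pixel equals one screenshot pixel. On displays with scaling enabled the screenshot can be larger than the page's CSS coordinates, and the crop then misses the captcha. The following is a minimal sketch of a scale-aware crop box. The helper name crop_box and its scale parameter are illustrative; the scale would typically come from bro.execute_script('return window.devicePixelRatio').

```python
def crop_box(location, size, scale=1.0):
    """Return a (left, upper, right, lower) box in screenshot pixels.

    location and size are dicts like those returned by a selenium element
    ({'x': ..., 'y': ...} and {'width': ..., 'height': ...}); scale is the
    ratio of screenshot pixels to CSS pixels.
    """
    left = int(location['x'] * scale)
    upper = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    lower = int((location['y'] + size['height']) * scale)
    return (left, upper, right, lower)


# Sample values for illustration only.
print(crop_box({'x': 100, 'y': 40}, {'width': 80, 'height': 30}))
print(crop_box({'x': 100, 'y': 40}, {'width': 80, 'height': 30}, scale=1.5))
```

The tuple can be passed straight to Image.crop. Another option is to skip the full-page screenshot and call the element's own screenshot('code.png') method, which selenium provides on WebElement.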

Scraping JD product information

from selenium import webdriver
from selenium.webdriver.common.by import By  # how to locate elements: By.ID, By.CSS_SELECTOR, ...
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service

def get_goods(driver):
    try:
        # gl-item is a class on each product entry, so locate by class name rather than by ID
        goods = driver.find_elements(by=By.CLASS_NAME, value='gl-item')
        for good in goods:
            name = good.find_element(by=By.CSS_SELECTOR, value='.p-name em').text
            price = good.find_element(by=By.CSS_SELECTOR, value='.p-price i').text
            commit = good.find_element(by=By.CSS_SELECTOR, value='.p-commit a').text
            url = good.find_element(by=By.CSS_SELECTOR, value='.p-name a').get_attribute('href')
            img = good.find_element(by=By.CSS_SELECTOR, value='.p-img img').get_attribute('src')
            if not img:
                # lazily loaded images keep their address in data-lazy-img
                img = 'https://' + good.find_element(by=By.CSS_SELECTOR, value='.p-img img').get_attribute('data-lazy-img')

            print('''
                        Name: %s
                        Price: %s
                        Link: %s
                        Image: %s
                        Comments: %s
                        ''' % (name, price, url, img, commit))
        # Go to the next page ('下一页' is the site's "next page" link text) and scrape it;
        # when there is no next page, find_element raises and the except below ends the recursion
        button = driver.find_element(by=By.PARTIAL_LINK_TEXT, value='下一页')
        button.click()
        time.sleep(1)
        get_goods(driver)
    except Exception as e:
        print(e)

def spider(url,keyword):
    driver = webdriver.Chrome(service=Service('./chromedriver.exe'))
    driver.get(url)
    driver.implicitly_wait(10)  # implicit wait

    try:
        input_tag = driver.find_element(by=By.ID, value='key')  # the search box
        input_tag.send_keys(keyword)
        input_tag.send_keys(Keys.ENTER)
        get_goods(driver)

    finally:
        driver.close()

if __name__ == '__main__':
    spider('https://www.jd.com/', keyword='精品内衣')  # the search keyword
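
get_goods builds the fallback image address by prepending 'https://' to data-lazy-img. If that attribute holds a protocol-relative address (one starting with //), plain concatenation yields a malformed URL. The following is a minimal sketch, assuming the attribute is absolute, protocol-relative, or path-relative, that lets urllib.parse.urljoin resolve it against the page URL. The helper name and the sample addresses are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_url(page_url, raw):
    """Resolve an absolute, protocol-relative, or path-relative address."""
    return urljoin(page_url, raw.strip())


page = 'https://www.jd.com/'
print(absolute_url(page, '//img.example.com/n7/a.jpg'))
print(absolute_url(page, 'https://img.example.com/n7/b.jpg'))
print(absolute_url(page, '/static/c.jpg'))
```

Inside get_goods, driver.current_url could serve as the page URL.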

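scrapy

scrapy is a crawler framework; its standing in the crawling world is comparable to that of Django in Python web development.

scrapy packages up everything a crawler needs, so you only have to write fixed kinds of code in fixed places.

Introduction to scrapy

Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping). It lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used much more widely, in areas such as data mining, monitoring, and automated testing, and it can also be used to fetch data returned by APIs or as a general-purpose web crawler.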

Installing scrapy

macOS, Linux:

pip3 install scrapy

Windows (the plain install may fail, depending on your environment):

pip3 install scrapy

If the installation fails, try the following

1. pip3 install wheel  # adds support for installing packages from wheel files (xx.whl)
2. pip3 install lxml
3. pip3 install pyopenssl
4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
6. Run pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
7. pip3 install scrapy

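Creating a scrapy project

1. Installing scrapy provides the scrapy executable
From now on, use it to create crawler projects, just as django-admin creates Django projects
2. Create a crawler project
scrapy startproject myfirstscrapy
3. Create a spider (like creating an app in Django)
scrapy genspider cnblogs www.cnblogs.com
4. Run the spider
scrapy crawl cnblogs --nolog
5. Run it from PyCharm
Create run.py:
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'cnblogs','--nolog'])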

Source: https://www.cnblogs.com/nirvana001/p/16960605.html
