Web Crawler Basics

Tags: name, chrome, basics, crawler, re, print, import, div

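Overview of crawlers
HTTP protocol
The requests library
The re module
The BeautifulSoup library
XPath
    - 1. Introduction
    - 2. XPath syntax rules
CSS selectors
scrapy
    - 1. Environment setup
    - 2. Selector
pymysql/peewee
    - 1. pymysql
    - 2. peewee
Selenium
    - 1. Installation and configuration
    - 2. Simulating a login
Anti-crawling / counter-anti-crawling
    - 1. Concepts
    - 2. Anti-crawling and counter-anti-crawling strategies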

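Overview of crawlers

Crawler: a program used to fetch data from the internet, run automated tests, and so on
robots.txt: a convention file that states which data on a site may be crawled and which may not. It normally sits in the site's root directory; for Zhihu, for example, it is served at https://www.zhihu.com/robots.txt
Crawler collection approaches
1. Collect over the HTTP protocol
2. Collect through API endpoints
3. Collect through the API that the target site offers to the public
A simple crawler example: download a site's HTML file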

from urllib.request import urlopen

url = "http://www.baidu.com"
resp = urlopen(url)

with open("baidu.html",mode="w",encoding="utf-8") as f:
    f.write(resp.read().decode("utf-8"))
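
The robots.txt rules described earlier in this section can be checked from code before a URL is fetched. The following is a minimal sketch using the standard library's urllib.robotparser. The rule text, the crawler name my-crawler, and the example.com URLs are made up for illustration; a real crawler would load the site's actual file instead of a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the sketch runs offline
sample_rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# Ask whether a crawler with the given name may fetch each URL
print(parser.can_fetch("my-crawler", "https://example.com/index.html"))    # True
print(parser.can_fetch("my-crawler", "https://example.com/private/data"))  # False
```

To read a live file, set_url() followed by read() can be used in place of parse().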

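HTTP protocol

The HTTP protocol is something both web development and crawler work require familiarity with; for details, see 图解HTTP ("HTTP Illustrated")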
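
As a quick reminder of what an HTTP response looks like on the wire, the sketch below splits a hand-written response into its status line, headers, and body. The response text is invented for illustration, and the parsing is deliberately naive (it assumes well-formed "Name: value" header lines); real code would rely on an HTTP library.

```python
# A hand-written HTTP/1.1 response (illustrative, not captured from a server)
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<p>hello</p>\n"
)

# An empty line separates the head (status line + headers) from the body
head, _, body = raw_response.partition("\r\n\r\n")
status_line, *header_lines = head.split("\r\n")

# Status line: protocol version, status code, reason phrase
version, code_text, reason = status_line.split(" ", 2)
status_code = int(code_text)

# Each header line has the form "Name: value"
headers = dict(line.split(": ", 1) for line in header_lines)

print(version, status_code, reason)  # HTTP/1.1 200 OK
print(headers["Content-Type"])       # text/html; charset=utf-8
print(repr(body))                    # '<p>hello</p>\n'
```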

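The requests library

This third-party library provides many methods, for example for sending the different types of request that a client browser would send to a server, and for reading the status line, response headers, and response body of the HTTP response
A simple example: request the Baidu home page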

import requests

params = {
    'name': 'NRVCER',
    'age': 23
}
headers = {
    'uid': '1001',
    'date': '2024/1/11',
}

ret = requests.get('http://www.baidu.com', params=params, headers=headers)
print(ret.text)     # Content of the response
print(ret.url)      # final request URL: http://www.baidu.com/?name=NRVCER&age=23
print(ret.headers)  # response headers

Pay particular attention to the character encoding of the data the server returns.
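
To make the encoding point concrete, here is a small standard-library sketch of decoding a response body when the declared charset may be missing or wrong. The helper name decode_body and the fallback order are assumptions chosen for illustration, not part of requests.

```python
from typing import Optional, Sequence


def decode_body(raw: bytes, declared: Optional[str] = None,
                fallbacks: Sequence[str] = ("utf-8", "gbk")) -> str:
    """Decode raw bytes, trying the declared charset first, then the fallbacks."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate
            continue
    # Last resort: keep going, but mark undecodable bytes
    return raw.decode("utf-8", errors="replace")


gbk_bytes = "百度一下".encode("gbk")
print(decode_body(gbk_bytes))           # 百度一下 (utf-8 fails, gbk succeeds)
print(decode_body(gbk_bytes, "utf-8"))  # 百度一下 (declared charset was wrong)
```

With requests itself, the usual fix is to assign the correct name to the response's encoding attribute before reading text; the apparent_encoding attribute offers a guess based on the body.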

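The re module

1. Metacharacters

.: matches any single character except the newline "\n" (with the DOTALL flag it matches newlines as well)
^: matches the start of the string; in multiline mode it also matches at the start of each line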

import re

print(re.findall('^h...o','hellftesthello'))   #[]
print(re.findall('^h...o','hellotesthello'))   #['hello']

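$: matches the end of the string; in multiline mode it also matches at the end of each line
*: matches the preceding element zero or more times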

print(re.findall('h*','hellohhhh')) #['h', '', '', '', '', 'hhhh', '']

+: matches the preceding element one or more times

print(re.findall('h+','hellohhhh')) #['h', 'hhhh']

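?: matches the preceding element zero or one time
{m,n}: matches the preceding element between m and n times
\: escape character; the character that follows it loses its special meaning as a metacharacter. For example, \. matches only a literal .
[]: character set; matches any one of the characters inside it
|: alternation; for example, a|b matches either a or b
\b: matches a word boundary, that is, the position between a word character (\w) and a non-word character or the start or end of the string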

print(re.findall(r'hi\b','hi'))    #['hi']
print(re.findall(r'hi\b','hi$hi testhi^'))    #['hi', 'hi', 'hi']

\B: matches a position that is not a word boundary

print(re.findall(r'hi\B','ti$hitesthi^'))    #['hi']

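\d: matches any single digit; equivalent to [0-9] for ASCII text
\D: matches any single non-digit character; equivalent to [^0-9]
\s: matches any single whitespace character, such as the space, '\t', '\n', '\r', '\f', and '\v'; equivalent to [ \t\n\r\f\v]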

import re
# [' ', ' ', ' ', ' ', ' ', '\r', '\n', '\t', '\x0c', '\x0b']
# The pattern is a raw string so that \s reaches the regex engine unchanged
print(re.findall(r'\s','test_     \r\n\t\f\vtest'))

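\S: matches any single non-whitespace character; equivalent to [^ \t\n\r\f\v]
\w: matches any single letter, digit, or underscore; equivalent to [a-zA-Z0-9_] for ASCII text (with Python 3 str patterns it also matches Unicode letters and digits unless re.ASCII is set)
\W: matches any single character that is not a letter, digit, or underscore; equivalent to [^a-zA-Z0-9_]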

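2. Flags

Flags are the members of re.RegexFlag; they change how a pattern is matched.

DOTALL: with this flag, . is unrestricted and matches any character, including the newline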
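
The effect of flags is easiest to see side by side. This short sketch compares the default behaviour of . and ^ with re.DOTALL and with re.MULTILINE (the multiline mode mentioned for ^ and $ earlier); the sample text is made up for illustration.

```python
import re

text = "first line\nsecond line"

# Without DOTALL, "." stops at the newline
default_dot = re.findall(r"first.*", text)
# With DOTALL, "." also matches the newline
dotall_dot = re.findall(r"first.*", text, re.DOTALL)

# Without MULTILINE, "^" matches only at the start of the string
default_caret = re.findall(r"^\w+", text)
# With MULTILINE, "^" matches at the start of every line
multiline_caret = re.findall(r"^\w+", text, re.MULTILINE)

print(default_dot)      # ['first line']
print(dotall_dot)       # ['first line\nsecond line']
print(default_caret)    # ['first']
print(multiline_caret)  # ['first', 'second']
```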

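3. Functions

Python's built-in re module provides a number of functions.

match: matches from the start of the string only; if the pattern does not match there, it does not search any further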

s = '''first line
second line
third line'''

regex = re.compile(r"s\w+")
print(regex.match(s)) # None


import re

info = "my name is XCER, 生日：2000/1/14，毕业于2024/7/1。"
# Raw string so that \d reaches the regex engine unchanged
result = re.match(r".*生日.*?(\d{4}).*毕业于.*?(\d{4})", info)

# Prints the two captured years: 2000, then 2024
for year in result.groups():
    print(year)

sub: replaces the parts of a string that match the pattern
search: similar to match, except that the match may start at any position in the string

import re

info = "my name is XCER, 生日：2000/1/14，毕业于2024/7/1。"
# No leading .* is needed because search scans the whole string
result = re.search(r"生日.*?(\d{4}).*毕业于.*?(\d{4})", info)

# Prints the two captured years: 2000, then 2024
for year in result.groups():
    print(year)

findall: returns a list of all non-overlapping matches of the pattern in the string
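
The sub and findall functions listed above have no example of their own, so here is a brief sketch of both. The sample sentence is an English adaptation of the one used in the match and search examples.

```python
import re

info = "my name is XCER, born 2000/1/14, graduated 2024/7/1."

# sub: replace every four-digit run with a placeholder
masked = re.sub(r"\d{4}", "YYYY", info)
print(masked)  # my name is XCER, born YYYY/1/14, graduated YYYY/7/1.

# findall: collect every date-like substring
dates = re.findall(r"\d{4}/\d{1,2}/\d{1,2}", info)
print(dates)   # ['2000/1/14', '2024/7/1']
```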

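4. Methods

5. Groups

The content captured by a group can be retrieved on its own. Each group gets an index by default, starting from 1 and assigned in the order in which the opening "(" appears.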
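
The numbering rule is clearest with nested groups. In the sketch below (the pattern and input are illustrative), the outer parenthesis opens first and therefore becomes group 1, even though it closes after groups 2 and 3.

```python
import re

match = re.search(r"((\d{4})/(\d{1,2}))/(\d{1,2})", "2024/7/1")

print(match.group(0))  # 2024/7/1  (group 0 is the whole match)
print(match.group(1))  # 2024/7    (first "(" : year/month)
print(match.group(2))  # 2024      (second "(")
print(match.group(3))  # 7         (third "(")
print(match.group(4))  # 1         (fourth "(")
print(match.groups())  # ('2024/7', '2024', '7', '1')
```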

The BeautifulSoup library

This library parses HTML or XML files and extracts data from them.
A simple example, taken from the official documentation:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# represents the document as a string
# print(soup.prettify())

# <title>The Dormouse's story</title>
print(soup.title)

# 'title'
print(soup.title.name)

# "The Dormouse's story"
print(soup.title.string)

# 'head'
print(soup.title.parent.name)

# <p class="title"><b>The Dormouse's story</b></p>
print(soup.p)

# ['title']
print(soup.p['class'])

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.a)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.find_all('a'))

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
print(soup.find(id="link3"))

# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
print(soup.get_text())

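XPath

1. Introduction

XPath stands for XML Path Language. Expressions written in it address parts of an HTML or XML document, which makes it convenient for extracting data.

2. XPath syntax rules

/: selects from the root node.
1. /html: selects the root element html
2. /article/div[1]: selects the first div element that is a child of article
3. /article/div[last()]: selects the last div element that is a child of article
4. /article/div[last()-1]: selects the second-to-last div element that is a child of article
5. /div/*: selects all child elements of the div element
//: selects matching nodes anywhere among the descendants, regardless of depth.
1. //div: selects all div elements in the document
2. //div[@class]: selects all div elements that have a class attribute
3. //div[@class='value']: selects all div elements whose class attribute equals value
4. //*: selects all elements
5. //div[@*]: selects all div elements that have at least one attribute
6. //div/a | //div/p: selects all a and p elements that are children of a div element
7. //div[@id='value']: selects all div elements whose id attribute equals value
./: selects relative to the current node
..: selects the parent of the current node
@attribute-name: selects the attributes with the given name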
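
Several of these rules can be tried without installing anything, because the standard library's xml.etree.ElementTree supports a limited subset of XPath. The sketch below is only an approximation of the list above: ElementTree paths are relative to the element they are called on (so .//div stands in for //div), and expressions such as @* and | are not supported there. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The root element returned by fromstring() is <article>
article = ET.fromstring("""
<article>
    <div class="intro">first</div>
    <div id="main">second</div>
    <div class="intro">third</div>
</article>
""")

# Positional predicates, relative to <article>
first_div = article.find("div[1]").text
last_div = article.find("div[last()]").text
second_to_last_div = article.find("div[last()-1]").text

# Attribute predicates on all descendant div elements
with_class = [d.text for d in article.findall(".//div[@class]")]
main_div = [d.text for d in article.findall(".//div[@id='main']")]

# All descendant elements
descendant_count = len(article.findall(".//*"))

print(first_div, last_div, second_to_last_div)  # first third second
print(with_class)                               # ['first', 'third']
print(main_div)                                 # ['second']
print(descendant_count)                         # 3
```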

CSS selectors

For this part, see the CSS basics notes; the main thing to be familiar with is CSS selector syntax

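scrapy

scrapy is a web crawling framework

1. Environment setup

Install the dependency: pip install Scrapy
Create a crawler project: the steps below create a project named scrapy_test
1. scrapy startproject scrapy_test
2. Change into the project root directory: cd scrapy_test
3. scrapy genspider csdn csdn.com
4. Run it: scrapy crawl csdn. To make debugging easier, it is common to add a script file in the project root directory with the following content: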
```
from scrapy.cmdline import execute

# execute(["scrapy", "crawl", "<spider name>"])
execute(["scrapy", "crawl", "csdn"])
```
Edit the project's configuration file settings.py

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False
# Value of the User-Agent header
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 2
# Configure a delay for requests for the same website
DOWNLOAD_DELAY = 3
# Whether cookies are enabled
COOKIES_ENABLED = False
# Default request headers
DEFAULT_REQUEST_HEADERS = {
   "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
   "Accept-Language": "en",
}
# Log level
LOG_LEVEL = "INFO"
# Configure item pipelines
ITEM_PIPELINES = {
   "scrapy_test.pipelines.ScrapyTestPipeline": 300,
}

2. Selector

This class provides a css method that selects HTML elements with CSS selector syntax so that data can be extracted from them. For example:

from scrapy import Selector

html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>这是一个测试HTML文档</title>
</head>
<body>
    <div class="info first" id="intro">
        <p class="age">年龄：23</p>
        <p class="name">姓名：XCER</p>
        <p class="work">职业：student</p>
        <p>性别：男</p>
    </div>
    <div class="info second" id="photo">
        <image src="" alt="帅照"/>
    </div>
</body>
</html>
"""

sel = Selector(text=html)

# Select the div elements whose class list includes info
div = sel.css(".info")
print(div)

info = sel.css("div p::text").extract()
# ['年龄：23', '姓名：XCER', '职业：student', '性别：男']
print(info)

age = sel.css("div p.age::text").extract()
if age:
    # 年龄：23
    print(age[0])

name = sel.css("div p[class='name']::text").extract()
if name:
    # 姓名：XCER
    print(name[0])

work = sel.css("div p:nth-child(3)::text").extract()
if work:
    # 职业：student
    print(work[0])

The class also provides an xpath method that selects HTML elements with XPath syntax so that data can be extracted from them. For example:

from scrapy import Selector

html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>这是一个测试HTML文档</title>
</head>
<body>
    <div class="info first" id="intro">
        <p class="age">年龄：23</p>
        <p class="name">姓名：XCER</p>
        <p class="work">职业：student</p>
        <p>性别：男</p>
    </div>
    <div class="info second" id="photo">
        <image src="" alt="帅照"/>
    </div>
</body>
</html>
"""

sel = Selector(text=html)

# XPath expressions can be combined with functions such as contains()
# Select the div elements whose class attribute contains info
div = sel.xpath("//div[contains(@class, 'info')]").extract()
print(div)

info = sel.xpath("//div/p/text()").extract()
# ['年龄：23', '姓名：XCER', '职业：student', '性别：男']
print(info)

age = sel.xpath("//p[@class='age']/text()").extract()
if age:
    # 年龄：23
    print(age[0])

name = sel.xpath("//div[@id='intro']/p[last()-2]/text()").extract()
if name:
    # 姓名：XCER
    print(name[0])

work = sel.xpath("//div/p[3]/text()").extract()
if work:
    # 职业：student
    print(work[0])

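pymysql/peewee

Crawlers need to store the data they collect, which means using a database. This section looks at working with a MySQL database from Python

1. pymysql

pymysql is a package that acts as a MySQL client for accessing MySQL databases

Install the dependency: pip install PyMySQL[rsa]

A simple example, taken from the official documentation

The table structure is as follows: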

CREATE TABLE `users` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `email` varchar(255) COLLATE utf8_bin NOT NULL,
    `password` varchar(255) COLLATE utf8_bin NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
AUTO_INCREMENT=1 ;

Example code that works with the users table; the MySQL server version is 8.0.35

import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             database='spider',
                             cursorclass=pymysql.cursors.DictCursor)

with connection:
    with connection.cursor() as cursor:
        # Create a new record
        sql = "INSERT INTO `users` (`email`, `password`) VALUES (%s, %s)"
        cursor.execute(sql, ('[email protected]', '123456'))

    # connection is not autocommit by default. So you must commit to save
    # your changes.
    connection.commit()

    with connection.cursor() as cursor:
        # Read a single record
        sql = "SELECT `id`, `password` FROM `users` WHERE `email`=%s"
        cursor.execute(sql, ('[email protected]',))
        result = cursor.fetchone()
        print(result)

2. peewee

peewee is an ORM framework.

Install the dependency: pip install peewee
For how peewee field types map to MySQL column types, see https://docs.peewee-orm.com/en/latest/peewee/models.html

A simple example:

Creating a table:

from peewee import *

db = MySQLDatabase("spider", host="127.0.0.1", port=3306, user="root", password="123456")


# Model definition: one Model class corresponds to one table, and the table name is the lowercased class name
class Person(Model):
    name = CharField(max_length=16)
    birthday = DateField()

    class Meta:
        database = db

db.create_tables([Person])

Creating, reading, updating, and deleting records

from peewee import *

db = MySQLDatabase("spider", host="127.0.0.1", port=3306, user="root", password="123456")

# Model definition: one Model class corresponds to one table, and the table name is the lowercased class name
class Person(Model):
    name = CharField(max_length=16)
    birthday = DateField()

    class Meta:
        database = db

# Create, read, update, and delete records
if __name__ == "__main__":
    from datetime import date

    # Create
    user_xcer = Person(name='xcer', birthday=date(2000, 1, 15))
    user_xcer.save()  # user_xcer is now stored in the database

    user_nrv = Person(name='nrv', birthday=date(1988, 8, 25))
    user_nrv.save()  # user_nrv is now stored in the database

    # Read; get() raises DoesNotExist when no row matches
    try:
        user = Person.select().where(Person.name == 'nrv').get()
        if user:
            print(user.name)
    except DoesNotExist as e:
        print("The requested user does not exist")

    query = Person.select().where(Person.name == 'xcer')
    for person in query:
        # Delete
        person.delete_instance()

        # Update
        # person.birthday = date(2023, 1, 17)
        # person.save()

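Selenium

1. Installation and configuration

Selenium is an umbrella project of tools for automating web browsers

Install the dependency: pip install -i https://pypi.douban.com/simple/ selenium
Install a ChromeDriver build that is compatible with your browser version
A simple example: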

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Location of chromedriver
service = Service("D:/chrome/chromedriver-win64/chromedriver-win64/chromedriver.exe")
options = Options()
# Location of the Chrome browser binary
options.binary_location = "D:/chrome/Google/Chrome/Application/chrome.exe"

chrome_browser = webdriver.Chrome(options=options, service=service)
chrome_browser.get('http://www.baidu.com')

# HTML source of the page
# print(chrome_browser.page_source)

chrome_browser.close()

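2. Simulating a login

For sites that require a username and password, Selenium can simulate the login and collect the cookies the server returns.
The example below simulates a manual login and collects the cookies returned by the server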

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

URL = "https://gpu.ai-galaxy.cn/login"


def login(username, password, chrome_browser):
    chrome_browser.get(URL)
    time.sleep(1)

    # Maximize the window
    chrome_browser.maximize_window()

    # Locate each element with XPath
    username_ele = chrome_browser.find_element(By.XPATH, '//input[@placeholder="请输入手机号"]')
    password_ele = chrome_browser.find_element(By.XPATH, '//input[@placeholder="请输入密码"]')
    username_ele.send_keys(username)
    password_ele.send_keys(password)


    login_btn = chrome_browser.find_element(By.XPATH, '//button[@class="login-btn ant-btn ant-btn-primary"]')
    chrome_browser.execute_script("arguments[0].click();", login_btn)

    time.sleep(2)

    cookies = chrome_browser.get_cookies()
    cookie_dict = {}
    for item in cookies:
        cookie_dict[item["name"]] = item["value"]

    chrome_browser.close()
    return cookie_dict


if __name__ == '__main__':
    # Location of chromedriver
    service = Service("D:/chrome/chromedriver-win64/chromedriver-win64/chromedriver.exe")
    options = Options()
    # Location of the Chrome browser binary
    options.binary_location = "D:/chrome/Google/Chrome/Application/chrome.exe"

    chrome_browser = webdriver.Chrome(options=options, service=service)

    username = "xxx"
    password = "xxx"
    cookie_dict = login(username, password, chrome_browser)
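
Once login has returned cookie_dict, the cookies are typically reused in later plain HTTP requests. The sketch below shows one way to turn such a dict into a Cookie request header using only the standard library. The helper name cookie_header and the sample cookie names and values are hypothetical.

```python
def cookie_header(cookie_dict):
    """Join name/value pairs into the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookie_dict.items())


# Hypothetical cookies, shaped like the dict that login() builds
sample_cookies = {"sessionid": "abc123", "token": "xyz789"}

headers = {"Cookie": cookie_header(sample_cookies)}
print(headers)  # {'Cookie': 'sessionid=abc123; token=xyz789'}
```

With the requests library, the dict can also be passed directly through the cookies parameter of requests.get, which avoids building the header by hand.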

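Anti-crawling / counter-anti-crawling

1. Concepts

Anti-crawling: technical measures that prevent data from being fetched in bulk
Counter-anti-crawling: technical measures that get around the anti-crawling strategy the other side has set up

2. Anti-crawling and counter-anti-crawling strategies

The User-Agent header

Anti-crawling: inspect the value of the User-Agent header of the HTTP request. For example, nginx can be configured with a rule like this: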

# Block tools such as Scrapy/Curl/HttpClient/python from crawling
if ($http_user_agent ~* (Scrapy|Curl|HttpClient|python)) {
    return 403;
}

Counter-anti-crawling: the crawler has to set the value of the User-Agent header itself.

The fake-useragent library can be used for this

pip install fake-useragent

For example:

import requests
from fake_useragent import UserAgent


ua = UserAgent()
headers = {
    "User-Agent": ua.random
}

# send a get request
res = requests.get("http://www.baidu.com", headers=headers)
print(res.text)

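IP request-rate limiting
1. Anti-crawling: nginx can track how often each IP sends requests; an IP that requests too quickly gets banned
2. Counter-anti-crawling: spread requests over a pool of proxy IPs to get around the limit. Many free and paid proxy IPs are available online
Forcing users to log in before they can see more information, as JD.com does
Dynamic pages, where most of the data is loaded through XHR requests
Encryption and obfuscation of the front-end JS logic
Using machine learning to analyse crawler behaviour
CSS "poisoning": for example, an a tag is hidden from visitors who use a browser but still visible to a crawler, so any client that follows that link is judged to be a crawler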
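
For the proxy-IP approach mentioned under IP request-rate limiting, a crawler usually cycles through its pool so that consecutive requests leave from different addresses. The following is a minimal sketch of that rotation using only the standard library. The proxy addresses and the generator name proxy_rotation are hypothetical placeholders; a real pool would come from a proxy provider.

```python
import itertools

# Hypothetical proxy addresses, for illustration only
PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]


def proxy_rotation(pool):
    """Yield one proxies mapping per request, cycling through the pool forever."""
    for address in itertools.cycle(pool):
        # Same proxy for both schemes, in the shape HTTP clients commonly expect
        yield {"http": address, "https": address}


rotation = proxy_rotation(PROXY_POOL)
first = next(rotation)
second = next(rotation)
third = next(rotation)  # wraps around to the first proxy again

print(first["http"])   # http://10.0.0.1:8080
print(second["http"])  # http://10.0.0.2:8080
print(third["http"])   # http://10.0.0.1:8080
```

Each mapping can be handed to requests.get through its proxies parameter, for example requests.get(url, proxies=next(rotation)).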

Source: https://www.cnblogs.com/xiaocer/p/17981380