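A First Look at Python Web Crawlers

Tags: Python, URL, crawler, url, html, urls, first look, new, find

Preparation

0x01 What crawlers are and why they matter

a. Introduction

A crawler is a program that automatically fetches data from the internet. It is one of the basic building-block technologies.

b. Value

Crawlers quickly extract valuable information from the web.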

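0x02 The development environment

a. Environment checklist

Python 3.7
Operating system: macOS, Windows, or Linux
Editor: PyCharm
Page download: requests (2.21.0)
Page parsing: BeautifulSoup/bs4 (4.11.2)
Dynamic page download: Selenium (3.141.0)

b. Testing the environment

Create a Python package named test
In that package, create a Python file named test_env

The test code is as follows: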

import requests
from bs4 import BeautifulSoup
import selenium

print("OK!")

If OK! is printed, the modules are installed correctly.

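Fundamentals

0x03 A simple crawler architecture and execution flow

Crawler scheduler (start, stop)
Crawler core (three modules)
graph LR
    A(URL manager)--URL-->B(Page downloader)
    B--HTML-->C(Page parser)
    C-.URL.->A
1. URL manager
  
  Manages URLs to prevent repeated crawling
2. Page downloader
  
  Downloads page content
3. Page parser
  
  Extracts the valuable data and the new URLs to crawl
Output: the valuable data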
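The loop formed by the three modules can be sketched without any network access. The example below is a minimal illustration, not the code used later in this post: FAKE_WEB, download, parse, and crawl are hypothetical names, and the "website" is just a dictionary held in memory.

```python
# Hypothetical in-memory "website": URL -> (title, outgoing links).
FAKE_WEB = {
    "/index": ("Home", ["/a", "/b"]),
    "/a": ("Page A", ["/b", "/index"]),
    "/b": ("Page B", ["/a"]),
}


def download(url):
    """Page downloader stand-in: return the stored page, or None if missing."""
    return FAKE_WEB.get(url)


def parse(page):
    """Page parser stand-in: split a page into its data and its new URLs."""
    title, links = page
    return title, links


def crawl(root_url):
    """Scheduler: loop until the URL manager has nothing left to crawl."""
    new_urls, old_urls = {root_url}, set()  # URL manager state
    results = {}
    while new_urls:
        url = new_urls.pop()
        old_urls.add(url)
        page = download(url)
        if page is None:
            continue
        title, links = parse(page)
        results[url] = title
        # Only queue URLs that have not been crawled yet.
        new_urls.update(link for link in links if link not in old_urls)
    return results


if __name__ == "__main__":
    print(crawl("/index"))
```

Although the pages link to each other in a cycle, each one is visited once, because a URL moves into old_urls as soon as it is taken out.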

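0x04 URL manager

a. Introduction

Purpose: manage the URLs being crawled and prevent duplicate or cyclic crawling
Public interface
- Take out one URL to crawl
- Add new URLs to crawl
Implementation logic
- When a URL is taken out, its state changes to crawled
- When a URL is added, check whether it already exists
Data storage
- Python memory
  - URLs to crawl: set
  - Crawled URLs: set
- Redis
  - URLs to crawl: set
  - Crawled URLs: set
- MySQL
  - urls(url, is_crawled)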
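The table-based storage option can be sketched as follows. This is a rough illustration only: it uses the standard-library sqlite3 module as a stand-in for MySQL, and the class name SqliteUrlManager and its column types are assumptions rather than anything from the original design.

```python
import sqlite3


class SqliteUrlManager:
    """URL manager backed by a urls(url, is_crawled) table (hypothetical class)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT PRIMARY KEY, is_crawled INTEGER NOT NULL DEFAULT 0)"
        )

    def add_new_url(self, url):
        if not url:
            return
        # The primary key turns a repeated insert into a no-op.
        self.conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self.conn.commit()

    def has_new_url(self):
        row = self.conn.execute(
            "SELECT 1 FROM urls WHERE is_crawled = 0 LIMIT 1"
        ).fetchone()
        return row is not None

    def get_url(self):
        row = self.conn.execute(
            "SELECT url FROM urls WHERE is_crawled = 0 LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Taking a URL out marks it as crawled.
        self.conn.execute("UPDATE urls SET is_crawled = 1 WHERE url = ?", (row[0],))
        self.conn.commit()
        return row[0]


if __name__ == "__main__":
    manager = SqliteUrlManager()
    manager.add_new_url("url1")
    manager.add_new_url("url1")
    manager.add_new_url("url2")
    print(manager.get_url(), manager.get_url(), manager.has_new_url())
```

A single table with an is_crawled flag covers both the "to crawl" and "crawled" sets, so a URL that was already crawled is rejected by the same primary-key check as a URL that is still waiting.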

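b. Implementation

Create a Python package named utils
In that package, create a Python file named url_manager

Because the manager has to expose an interface to other modules, it is wrapped in a class. The code is as follows: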

class UrlManager:
    """
    URL manager
    """

    # Initializer
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    # Add one new URL
    def add_new_url(self, url):
        # Skip empty values
        if url is None or len(url) == 0:
            return
        # Skip duplicates (membership test, not identity)
        if url in self.new_urls or url in self.old_urls:
            return
        # Add
        self.new_urls.add(url)

    # Add URLs in bulk
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    # Take out one URL to crawl and mark it as crawled
    def get_url(self):
        if self.has_new_url():
            url = self.new_urls.pop()
            self.old_urls.add(url)
            return url
        else:
            return None

    # Check whether any URL is still waiting to be crawled
    def has_new_url(self):
        return len(self.new_urls) > 0


# Test code
if __name__ == "__main__":
    url_manager = UrlManager()

    # Test adding URLs
    url_manager.add_new_url("url1")
    url_manager.add_new_urls(["url1", "url2"])
    print(url_manager.new_urls, url_manager.old_urls)

    # Test taking URLs out
    print("=" * 20)   # separator
    new_url = url_manager.get_url()
    print(url_manager.new_urls, url_manager.old_urls)

    print("=" * 20)
    new_url = url_manager.get_url()
    print(url_manager.new_urls, url_manager.old_urls)

    print("=" * 20)
    print(url_manager.has_new_url())

0x05 Page downloader (requests)

a. Introduction

Website: python-requests
Installation: pip install requests
Description:

Requests is an elegant and simple HTTP library for Python, built for human beings.

It is commonly used in crawlers to download page content
Execution flow
graph LR
    A(Python program<br/>requests library)--request-->B(Web server)
    B--response-->A

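b. Sending a request

requests.get/post(url, params, data, headers, timeout, verify, allow_redirects, cookies)

url: the URL of the page to download
params: a dictionary of parameters appended to the URL, e.g. ?id=123&name=xxx
data: a dictionary or string, usually the data submitted with a POST request
headers: request headers such as User-Agent and Referer
timeout: the timeout, in seconds
verify: boolean, whether to verify the HTTPS certificate; defaults to True. A path to a CA bundle can be passed instead
allow_redirects: boolean, whether requests should follow redirects; defaults to True
cookies: local cookie data to send with the request

url, data, headers, and timeout are the most commonly used parameters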
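To see what params and headers do to a request, it can be built and inspected locally without sending anything. The sketch below assumes the requests package is installed; the URL and the User-Agent value are placeholders chosen for illustration.

```python
import requests

# Build the request locally; nothing is sent over the network here.
request = requests.Request(
    "GET",
    "https://example.com/search",                # placeholder URL
    params={"id": 123, "name": "xxx"},           # becomes the query string
    headers={"User-Agent": "demo-crawler/0.1"},  # placeholder header value
)
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

In a crawler the same arguments would normally go straight into the call, for example requests.get(url, params=..., headers=..., timeout=3).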

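c. Receiving the response

res = requests.get/post(url)

res.status_code: the status code
res.encoding: the current encoding; assign to it to change the encoding

(requests infers the encoding from the response headers; if a text response declares no charset, it falls back to ISO-8859-1)
res.text: the returned page content as text
res.headers: the HTTP headers of the response
res.url: the URL that was actually visited
res.content: the content as bytes, e.g. when downloading an image
res.cookies: the cookie data the server wants to store locally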
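Why the ISO-8859-1 fallback matters can be shown with plain bytes and no network. The snippet below is only an illustration of the decoding step; the sample text is arbitrary.

```python
# Bytes as a server might send them: UTF-8 encoded, non-ASCII text.
raw = "中文".encode("utf-8")

# ISO-8859-1 maps every byte to a character, so decoding never fails;
# it silently produces garbled text instead.
garbled = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 recovers the original text.
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```

With requests, the equivalent fix is to assign the right codec before reading the text, for example res.encoding = "utf-8" (or res.encoding = res.apparent_encoding to let the library guess from the body).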

d. Demo

Install ipython from cmd with: python -m pip install ipython

Start ipython from cmd with: ipython

In [1]: import requests
In [2]: url = "https://www.cnblogs.com/SRIGT"
In [3]: res = requests.get(url)
In [4]: res.status_code
Out[4]: 200
In [5]: res.encoding
Out[5]: 'utf-8'
In [6]: res.url
Out[6]: 'https://www.cnblogs.com/SRIGT'

0x06 Page parser (BeautifulSoup)

a. Introduction

Website: Beautiful Soup: We called him Tortoise because he taught us.
Installation: pip install beautifulsoup4
Description: a third-party Python library for extracting data from HTML
Usage: import bs4 or from bs4 import BeautifulSoup

b. Syntax

graph LR
    H(HTML page)-->A(Create a BeautifulSoup object)
    A-->B(Search nodes<br/>find_all, find)
    B-.->B1(By node name)
    B-.->B2(By attribute value)
    B-.->B3(By node text)
    B-->C(Access nodes<br/>name, attributes, text)

Creating a BeautifulSoup object

from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML string
soup = BeautifulSoup(
    html_doc,                # the HTML document string
    'html.parser',           # the HTML parser
    from_encoding='utf-8'    # the document encoding (only used when html_doc is bytes)
)

Searching for nodes

# find_all(name, attrs, string)
# Find all nodes whose tag is a
soup.find_all('a')

# Find all a nodes whose link is /xxx/index.html
soup.find_all('a', href='/xxx/index.html')

# Find all div nodes whose class is abc and whose text is python
soup.find_all('div', class_='abc', string='python')

Accessing node information

# Given the node: <a href='1.html'>Python</a>
# Get the tag name of the node
node.name
# Get the href attribute of the a node
node['href']
# Get the link text of the a node
node.get_text()
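The three search styles and the three accessors can be tried together on a small inline document. This is a minimal sketch that assumes beautifulsoup4 is installed; the HTML string is made up to match the calls shown above.

```python
from bs4 import BeautifulSoup

# Made-up HTML that contains one match for each search shown above.
html_doc = """
<div class="abc">python</div>
<div class="abc">java</div>
<a href="/xxx/index.html">Index</a>
<a href="1.html">Python</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

all_links = soup.find_all("a")                                     # by node name
index_links = soup.find_all("a", href="/xxx/index.html")           # by attribute value
python_divs = soup.find_all("div", class_="abc", string="python")  # by text

# Access the name, an attribute, and the text of one node.
node = all_links[1]
print(node.name, node["href"], node.get_text())
print(len(all_links), len(index_links), len(python_divs))
```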

c. Demo

Target page (saved as test.html)

<html>
    <head>
        <meta charset="utf-8">
        <title>Page title</title>
    </head>
    <body>
        <h1>Heading 1</h1>
        <h2>Heading 2</h2>
        <h3>Heading 1</h3>
        <h4>Heading 1</h4>
        <div id="content" class="default">
            <p>Paragraph</p>
            <a href="http://www.baidu.com">Baidu</a>
            <a href="http://www.cnblogs.com/SRIGT">My blog</a>
        </div>
    </body>
</html>

Test code

from bs4 import BeautifulSoup

with open("./test.html", 'r', encoding='utf-8') as fin:
    html_doc = fin.read()

soup = BeautifulSoup(html_doc, "html.parser")
div_node = soup.find("div", id="content")
print(div_node)
print()

links = div_node.find_all("a")
for link in links:
    print(link.name, link["href"], link.get_text())

# The sample page has no <img>, so find() returns None; check before indexing
img = div_node.find("img")
if img is not None:
    print(img["src"])

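Hands-on

0x07 A simple example

import requests
from bs4 import BeautifulSoup

url = "http://www.crazyant.net/"

# Download the page and stop if the server did not answer 200
r = requests.get(url, timeout=3)
if r.status_code != 200:
    raise Exception(f"unexpected status code: {r.status_code}")

html_doc = r.text
soup = BeautifulSoup(html_doc, "html.parser")

# Find every <h2 class="entry-title"> and print the link inside it
h2_nodes = soup.find_all("h2", class_="entry-title")

for h2_node in h2_nodes:
    link = h2_node.find("a")
    print(link["href"], link.get_text())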

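0x08 Crawling all blog pages

Root domain: http://www.crazyant.net (蚂蚁学Python)

Article page URL form: http://www.crazyant.net/<digits>.html, e.g. the post "PyCharm开发PySpark程序的配置和实例"

Sending a cookie dictionary with a requests call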

import requests
cookies = {
    "captchaKey": "14a54079a1",
    "captchaExpire": "1548852352"
}
r = requests.get(
    "http://url",
    cookies = cookies
)

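Fuzzy matching with a regular expression

import re

url1 = "http://www.crazyant.net/123.html"
url2 = "http://www.crazyant.net/123.html#comments"
url3 = "http://www.baidu.com"

# Dots are escaped so that they match a literal "." only
pattern = r'^http://www\.crazyant\.net/\d+\.html$'

print(re.match(pattern, url1))  # a match object
print(re.match(pattern, url2))  # None: the trailing #comments is rejected by $
print(re.match(pattern, url3))  # None: different domain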
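A link such as url2 points at the same article as url1, and links on a page may also be relative. One possible way to handle both, sketched here with the standard-library urllib.parse module, is to normalize each link before matching it. The helper name normalize and the constant ARTICLE_PATTERN are hypothetical, and this step is an optional addition rather than part of the crawler in this post.

```python
import re
from urllib.parse import urldefrag, urljoin

ARTICLE_PATTERN = re.compile(r"^http://www\.crazyant\.net/\d+\.html$")


def normalize(base_url, href):
    """Resolve href against base_url and drop any #fragment (hypothetical helper)."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url


if __name__ == "__main__":
    base = "http://www.crazyant.net/"
    hrefs = [
        "/123.html",
        "http://www.crazyant.net/123.html#comments",
        "http://www.baidu.com",
    ]
    for href in hrefs:
        url = normalize(base, href)
        print(url, bool(ARTICLE_PATTERN.match(url)))
```

After normalization the first two links collapse to the same URL, so a set-based URL manager would store the article only once.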

Crawling every page

from utils import url_manager
from bs4 import BeautifulSoup
import requests
import re

root_url = "http://www.crazyant.net"
# Article pages look like http://www.crazyant.net/<digits>.html
pattern = re.compile(r'^http://www\.crazyant\.net/\d+\.html$')

urls = url_manager.UrlManager()
urls.add_new_url(root_url)

# The with block closes the file even if the crawl raises
with open("craw_all_pages.txt", "w", encoding="utf-8") as file:
    while urls.has_new_url():
        curr_url = urls.get_url()
        try:
            r = requests.get(curr_url, timeout=3)
        except requests.RequestException as exc:
            print("error, request failed", curr_url, exc)
            continue
        if r.status_code != 200:
            print("error, return status_code is not 200", curr_url)
            continue
        soup = BeautifulSoup(r.text, "html.parser")
        # Some pages may have no <title>, so guard against None
        title = soup.title.string if soup.title else ""

        file.write("%s\t%s\n" % (curr_url, title))
        file.flush()
        print("success: %s, %s, %d" % (curr_url, title, len(urls.new_urls)))

        # Queue every link that looks like an article page
        for link in soup.find_all("a"):
            href = link.get("href")
            if href is None:
                continue
            if pattern.match(href):
                urls.add_new_url(href)

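0x09 Crawling the Douban Movie Top 250

❗This list currently has anti-crawling measures in place❗

Steps:

Fetch the pages with requests

Parse the data with BeautifulSoup

Write the data to Excel with pandas

Imports

import pprint

import requests
from bs4 import BeautifulSoup
import pandas as pd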

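Download the HTML of all 10 pages

# Build the list of page offsets: 0, 25, ..., 225
page_indexs = range(0, 250, 25)
list(page_indexs)

# Assumption: the site may reject the default requests User-Agent,
# so a browser-like value is sent instead
headers = {"User-Agent": "Mozilla/5.0"}

def download_all_htmls():
    """
    Download the HTML of every list page for later analysis
    """
    htmls = []
    for idx in page_indexs:
        url = f"https://movie.douban.com/top250?start={idx}&filter="
        print("craw html: ", url)
        r = requests.get(url, headers=headers, timeout=3)
        if r.status_code != 200:
            raise Exception("error")
        htmls.append(r.text)
    return htmls

# Run the crawl
htmls = download_all_htmls()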

Parse the HTML to get the data

def parse_single_html(html):
    """
    Parse a single HTML page and return its data
    @return list of dicts with keys rank, title, rating_star, rating_num, comments
    """
    soup = BeautifulSoup(html, 'html.parser')
    article_items = (
        soup.find("div", class_="article")
            .find("ol", class_="grid_view")
            .find_all("div", class_="item")
    )
    datas = []
    for article_item in article_items:
        rank = article_item.find("div", class_="pic").find("em").get_text()
        info = article_item.find("div", class_="info")
        title = info.find("div", class_="hd").find("span", class_="title").get_text()
        stars = (
            info.find("div", class_="bd")
                .find("div", class_="star")
                .find_all("span")
        )
        rating_star = stars[0]["class"][0]
        rating_num = stars[1].get_text()
        comments = stars[3].get_text()

        datas.append({
            "rank": rank,
            "title": title,
            "rating_star": rating_star.replace("rating", "").replace("-t", ""),
            "rating_num": rating_num,
            "comments": comments.replace("人评价", "")
        })
    return datas


pprint.pprint(parse_single_html(htmls[0]))

all_datas = []
for html in htmls:
    all_datas.extend(parse_single_html(html))

print(all_datas)

Save the results to Excel

# Writing .xlsx files with pandas needs an Excel engine such as openpyxl
df = pd.DataFrame(all_datas)
df.to_excel("TOP250.xlsx")
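If pandas or an Excel engine is not available, the same list of dictionaries can be written as CSV with the standard library. The sketch below is an alternative under that assumption, not the method used in this post; rows_to_csv is a hypothetical helper and the two sample rows are made up purely to show the shape of the data.

```python
import csv
import io

# Two made-up rows in the same shape as the parsed records.
sample_rows = [
    {"rank": "1", "title": "Movie A", "rating_star": "5", "rating_num": "9.7", "comments": "1000"},
    {"rank": "2", "title": "Movie B", "rating_star": "45", "rating_num": "9.6", "comments": "900"},
]


def rows_to_csv(rows):
    """Serialize a non-empty list of dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    # Column order follows the keys of the first row.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(sample_rows))
```

To write a file instead of a string, the same writer can be pointed at open("TOP250.csv", "w", newline="", encoding="utf-8").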

End

From： https://www.cnblogs.com/SRIGT/p/17198841.html
