Python Web Scraping in Practice: Building a Web Crawler from Scratch

Tags: titles, title, Python, text, web scraping, BeautifulSoup, url, from scratch

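A web crawler, also called a web spider or web scraper, is a program that automatically fetches data from pages on the internet. Python is a popular language for writing crawlers thanks to its concise syntax, rich library ecosystem, and active community. This article walks through building a Python web crawler from scratch and covers the related concepts and techniques along the way.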

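1. What Is a Web Crawler

A web crawler is a program that fetches web page data automatically. It sends HTTP requests to retrieve page content, then extracts the information it needs from the HTML. Crawlers are used in search engines, data mining, information gathering, and many other scenarios.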

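2. How a Crawler Works

A web crawler's workflow consists of four steps:

Send a request: send an HTTP request to the target site and retrieve the page's HTML.
Parse the content: use an HTML parsing library (such as BeautifulSoup) to extract the required data from the page source.
Save the data: store the extracted data locally or in a database.
Follow links: extract links from the page and repeat the steps above, which crawls the site recursively.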
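The four steps above can be sketched as a single loop. This is a minimal illustration, not a production crawler: the function name crawl, the max_pages limit, and the in-memory PAGES dictionary (a stand-in for real HTTP requests, so the sketch runs offline) are all assumptions made for the example. It uses BeautifulSoup with the built-in html.parser.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin

from bs4 import BeautifulSoup


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; returns {url: page title} for the visited pages."""
    queue = deque([start_url])
    seen = {start_url}          # URLs already queued, to avoid visiting twice
    results = {}

    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)                              # step 1: send the request
        if html is None:
            continue
        soup = BeautifulSoup(html, "html.parser")      # step 2: parse the content
        title = soup.title.get_text(strip=True) if soup.title else ""
        results[url] = title                           # step 3: save (in memory here)

        for a in soup.find_all("a", href=True):        # step 4: follow links
            link, _ = urldefrag(urljoin(url, a["href"]))  # absolute URL, no #fragment
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


# Hypothetical pages standing in for a real site.
PAGES = {
    "https://example.com/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<title>Page A</title><a href="/">home</a>',
    "https://example.com/b": '<title>Page B</title><a href="/a#top">A again</a>',
}

print(crawl("https://example.com/", PAGES.get))
```

In a real crawler, fetch would be a function that performs an HTTP GET, and the link loop would usually skip URLs outside the starting domain.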

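3. Python Crawling Tools and Libraries

To write a crawler in Python, we use the following libraries and tools:

requests: a concise, easy-to-use HTTP library for sending requests and handling responses.
BeautifulSoup: an HTML parsing library that makes it easy to extract data from pages.
Scrapy: a full-featured crawling framework suited to large projects and data collection tasks.
lxml: a fast XML and HTML parsing library that can serve as BeautifulSoup's parser to speed up parsing.
pandas: a data processing library for cleaning, analyzing, and storing data.

Install them first with pip: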

pip install requests beautifulsoup4 scrapy lxml pandas
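After installing, it can help to confirm which versions ended up in the environment. The sketch below uses the standard library's importlib.metadata; the helper name installed_versions is an assumption for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(
    ["requests", "beautifulsoup4", "scrapy", "lxml", "pandas"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```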

4. Writing a Simple Crawler

Let's write a simple crawler that scrapes news titles from the Python website. First, import requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

Then send an HTTP request to retrieve the page's HTML:

url = "https://www.python.org/"
response = requests.get(url)
html_content = response.text
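The request above assumes the server answers quickly and successfully. A slightly more defensive version might set a timeout and check the status code. The sketch below is one way to do that; the function name fetch_html, the optional session parameter, and the 10-second timeout are assumptions for illustration.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page's HTML, or None if the request fails."""
    http = session if session is not None else requests
    try:
        response = http.get(url, timeout=timeout)
        response.raise_for_status()      # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        print(f"Request to {url} failed: {exc}")
        return None
    # Fall back to the detected encoding when the server does not declare one.
    if response.encoding is None:
        response.encoding = response.apparent_encoding
    return response.text
```

Calling fetch_html("https://www.python.org/") would then return either the HTML text or None, so the caller can skip pages that failed.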

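Next, parse the HTML with BeautifulSoup and extract the news titles:

soup = BeautifulSoup(html_content, "lxml")
# The selector mirrors the page markup at the time of writing; adjust it if nothing matches.
news_titles = soup.select(".shrubbery .list-recent-events li h3")
for title in news_titles:
    print(title.get_text(strip=True))

Running this code prints the list of news titles from the Python website, provided the selector still matches the page's markup. An empty result usually means the markup has changed and the selector needs updating.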

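5. Parsing Pages with BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML. It helps you extract data from page source and supports several parsers, such as lxml and html.parser.

5.1 Common Methods

find(tag, attrs): returns the first matching tag, or None if there is no match.
find_all(tag, attrs): returns all matching tags.
select(selector): returns the tags that match a CSS selector.
get_text(): returns the text inside a tag.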
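The four methods above can be compared side by side on a small document. The HTML snippet here is made up for the illustration, and the sketch uses the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<ul id="menu">
  <li class="item"><a href="/docs">Docs</a></li>
  <li class="item active"><a href="/blog">Blog</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li", class_="item")        # first match, or None
all_items = soup.find_all("li", class_="item")     # list of every match
active_links = soup.select("#menu li.active a")    # CSS selector, returns a list

print(first_item.get_text(strip=True))             # Docs
print(len(all_items))                              # 2
print([a["href"] for a in active_links])           # ['/blog']
```

Note that class_="item" also matches the second element, whose class attribute is "item active", because BeautifulSoup matches against each individual class.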

5.2 Example

Suppose we want to extract book titles and prices from the following HTML:

<html>
  <body>
    <div class="book-list">
      <div class="book">
        <h3 class="title">Python Cookbook</h3>
        <span class="price">$39.99</span>
      </div>
      <div class="book">
        <h3 class="title">Fluent Python</h3>
        <span class="price">$29.99</span>
      </div>
    </div>
  </body>
</html>

We can use the following code:

from bs4 import BeautifulSoup

html = """
(paste the HTML shown above here)
"""
soup = BeautifulSoup(html, "lxml")
# Each book sits in its own <div class="book">.
books = soup.find_all("div", class_="book")
for book in books:
    title = book.find("h3", class_="title").text
    price = book.find("span", class_="price").text
    print(f"{title} - {price}")

Output:

Python Cookbook - $39.99
Fluent Python - $29.99
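The example above works because every book has both a title and a price. On real pages some fields are often missing, and calling .text on the None returned by find() raises an AttributeError. The sketch below shows one way to guard against that and to turn the price into a number. The helper names text_or_none and parse_books and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def parse_books(html):
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for book in soup.select("div.book"):
        price_text = text_or_none(book, "span.price")
        books.append({
            "title": text_or_none(book, "h3.title"),
            # Strip the currency sign so the price can be used as a number.
            "price": float(price_text.lstrip("$")) if price_text else None,
        })
    return books


# The second book deliberately has no price element.
sample = """
<div class="book"><h3 class="title">Python Cookbook</h3><span class="price">$39.99</span></div>
<div class="book"><h3 class="title">Fluent Python</h3></div>
"""
print(parse_books(sample))
```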

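6. Building a Crawler with the Scrapy Framework

Scrapy is a full-featured crawling framework suited to large projects and data collection tasks. It offers rich functionality, including data export, item processing pipelines, and spider middleware.

6.1 Installing Scrapy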

pip install scrapy

6.2 Creating a Scrapy Project

scrapy startproject my_project

6.3 Writing a Scrapy Spider

The following example spider scrapes information from Douban Movie Top250. Save it as a Python file in the project's spiders directory:

import scrapy


class DoubanSpider(scrapy.Spider):
    name = "douban"
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        # Each <li> in the .grid_view list is one movie entry.
        movies = response.css(".grid_view li")
        for movie in movies:
            title = movie.css(".title::text").get()
            rating = movie.css(".rating_num::text").get()
            url = movie.css(".info a::attr(href)").get()

            yield {
                "title": title,
                "rating": rating,
                "url": url,
            }

        # Follow the "next page" link until the last page is reached.
        next_page = response.css(".paginator .next a::attr(href)").get()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, callback=self.parse)
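A spider like the one above identifies itself with Scrapy's default User-Agent, which some sites reject, and sends requests as fast as the default settings allow. The excerpt below is a sketch of project settings that set a custom User-Agent and slow the crawl down. The setting names are standard Scrapy settings; the values, the settings.py location under my_project, and the contact URL are illustrative assumptions.

```python
# my_project/settings.py (excerpt); all values are illustrative assumptions

# Identify the crawler; replace with a string that describes your project.
USER_AGENT = "my_project (+https://example.com/contact)"

# Respect the robots.txt rules published by the target site.
ROBOTSTXT_OBEY = True

# Wait between requests to the same site, in seconds.
DOWNLOAD_DELAY = 1.0

# Limit the number of parallel requests per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
```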

6.4 Running the Scrapy Spider

On the command line, run the following from inside the project directory:

```bash
scrapy crawl douban
```

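7. Asynchronous Crawling and Multithreading

Crawling is mostly I/O-bound, so a crawler spends much of its time waiting for responses. Asynchronous code and multithreading both improve throughput by overlapping that waiting: an asynchronous crawler handles many requests concurrently on a single thread, while a multithreaded crawler runs blocking requests in several threads at once.

7.1 An Asynchronous Crawler with `asyncio` and `aiohttp`

aiohttp is not part of the install command in section 3; install it with pip install aiohttp before running this example.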

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def get_titles(url):
    html = await fetch_page(url)
    soup = BeautifulSoup(html, "lxml")
    titles = soup.select(".shrubbery .list-recent-events li h3")
    return [title.text for title in titles]

async def main():
    url = "https://www.python.org/"
    titles = await get_titles(url)
    for title in titles:
        print(title)

asyncio.run(main())
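The example above fetches a single URL, so it does not yet show any concurrency. The sketch below illustrates the usual pattern for many URLs: asyncio.gather runs the requests concurrently and an asyncio.Semaphore caps how many are in flight. To keep it runnable offline, fake_fetch simulates a request with asyncio.sleep; it, fetch_all, the limit of 3, and the example.com URLs are all assumptions for illustration. In practice fetch would wrap an aiohttp request that shares one ClientSession.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for an aiohttp request: waits briefly, then returns fake HTML."""
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"


async def fetch_all(urls, fetch, limit=3):
    """Fetch every URL concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(6)]
pages = asyncio.run(fetch_all(urls, fake_fetch))
print(len(pages))
```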

7.2 A Multithreaded Crawler with `threading`

import requests
import threading
from bs4 import BeautifulSoup

def fetch_page(url):
    response = requests.get(url)
    return response.text

def get_titles(url):
    html = fetch_page(url)
    soup = BeautifulSoup(html, "lxml")
    titles = soup.select(".shrubbery .list-recent-events li h3")
    return [title.text for title in titles]

def print_titles(url):
    titles = get_titles(url)
    for title in titles:
        print(title)

urls = ["https://www.python.org/", "https://www.djangoproject.com/"]

threads = []
for url in urls:
    thread = threading.Thread(target=print_titles, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
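Printing from several threads at once can interleave the output, and hand-managed threads do not return values. As an alternative sketch, the standard library's concurrent.futures.ThreadPoolExecutor manages the threads and collects each result. The names fetch_many and fake_fetch are assumptions; fake_fetch stands in for get_titles so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(urls, fetch, max_workers=4):
    """Run `fetch` for every URL on a thread pool; return {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, even if threads finish out of order.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    """Stand-in for get_titles(url), so the sketch runs offline."""
    return [f"title from {url}"]


results = fetch_many(
    ["https://www.python.org/", "https://www.djangoproject.com/"], fake_fetch
)
for url, titles in results.items():
    print(url, titles)
```

Passing get_titles in place of fake_fetch would run the real requests in parallel threads and print the results from the main thread only.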

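8. Anti-Scraping Measures and How to Handle Them

Websites may deploy anti-scraping measures such as User-Agent checks, IP-based rate limits, and CAPTCHAs. Each calls for a matching countermeasure.

Set a User-Agent: add a custom User-Agent header to requests so they resemble browser traffic.
Use proxy IPs: route requests through proxy servers to work around IP limits.
Handle CAPTCHAs: use OCR or a third-party service to recognize CAPTCHAs automatically.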
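The first two measures above can be configured once on a requests.Session. The sketch below also adds automatic retries with backoff, which helps when a site answers with rate-limit errors. The helper name build_session, the User-Agent string, the proxy address, and the retry values are assumptions for illustration; CAPTCHA handling is not covered here.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent, proxies=None):
    """Create a Session with a custom User-Agent, optional proxies, and retries."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    if proxies:
        session.proxies.update(proxies)

    # Retry a few times with exponential backoff on rate-limit and server errors.
    retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = build_session(
    user_agent="Mozilla/5.0 (compatible; example-crawler/0.1)",  # hypothetical value
    proxies={"https": "http://127.0.0.1:8080"},                  # hypothetical proxy
)
print(session.headers["User-Agent"])
```

Every request made through session.get() then carries the header, goes through the proxy, and is retried according to the policy.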

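9. Project Example: Scraping Douban Movie Top250

In this example, we use requests and BeautifulSoup to scrape the title, rating, and link of each film in Douban Movie Top250.

First, write two functions, one that sends the request and one that parses the page: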

import requests
from bs4 import BeautifulSoup

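def fetch_page(url):
    # Send a browser-like User-Agent; the site may reject the library default.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.text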

def parse_movies(html):
    soup = BeautifulSoup(html, "lxml")
    movies = soup.select(".grid_view li")
    result = []

    for movie in movies:
        title = movie.select_one(".title").text
        rating = movie.select_one(".rating_num").text
        url = movie.select_one(".info a")["href"]

        result.append({
            "title": title,
            "rating": rating,
            "url": url,
        })

    return result

Next, write a loop that iterates over all the pages and calls the functions above:

BASE_URL = "https://movie.douban.com/top250?start="

all_movies = []
for i in range(0, 250, 25):
    url = f"{BASE_URL}{i}"
    html = fetch_page(url)
    movies = parse_movies(html)
    all_movies.extend(movies)

Finally, print the scraped movie information:

for movie in all_movies:
    print(f"{movie['title']} - {movie['rating']} - {movie['url']}")

Running this code prints the title, rating, and link of each film in Douban Movie Top250.
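To keep the results rather than only print them, the list of dictionaries can be written out as CSV. The sketch below uses the standard library's csv module; the function name save_movies and the sample rows are assumptions, and an in-memory buffer stands in for a file so the example has no side effects.

```python
import csv
import io


def save_movies(movies, file):
    """Write a list of movie dicts to an open text file as CSV."""
    writer = csv.DictWriter(file, fieldnames=["title", "rating", "url"])
    writer.writeheader()
    writer.writerows(movies)


# Sample rows in the shape produced by parse_movies(); the values are made up.
sample = [
    {"title": "Movie A", "rating": "9.0", "url": "https://example.com/a"},
    {"title": "Movie B", "rating": "8.5", "url": "https://example.com/b"},
]

buffer = io.StringIO()
save_movies(sample, buffer)
print(buffer.getvalue())
```

To write a real file, pass a file opened with open("movies.csv", "w", newline="", encoding="utf-8") and the all_movies list instead of the buffer and sample rows.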

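10. Summary

Starting from scratch, this article showed how to write a web crawler in Python. We covered how crawlers work, the common libraries and tools, BeautifulSoup usage, the Scrapy framework, asynchronous and multithreaded crawling, and ways to handle anti-scraping measures. The hands-on project showed how to extract data from web pages and collect it for later use. Hopefully this article helps you learn web scraping with Python.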

From: https://blog.51cto.com/u_13237322/6148297
