Python Tutorial: Web Scraping


7.1 urllib

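urllib is a package in Python's standard library for reading data from the web. It is a request library: it sends HTTP requests, retrieves page content, and supports several HTTP methods, such as GET and POST.
The steps for reading a web page with urllib are: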

Import the request module from the urllib package.

import urllib.request

Call urlopen() to send the HTTP request and get a response object.

response = urllib.request.urlopen('http://www.example.com')

Read the response body. The read(), readline(), and readlines() methods are available.

html = response.read()

Decode the content to get a string. read() returns bytes, and 'utf-8' here assumes the page is UTF-8 encoded.

html = html.decode('utf-8')

Close the response object.

response.close()
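
The three read methods and the decoding step can be tried without a network connection. The sketch below uses io.BytesIO as a stand-in for the response object, on the assumption that a real response exposes the same file-like read()/readline()/readlines() interface; the HTML body is made up for illustration.

```python
import io

# Hypothetical response body; a real one would come from urlopen().
body = "<html>\n<p>café</p>\n</html>\n".encode("utf-8")

# Each call gets a fresh stream because reading consumes the data.
first_line = io.BytesIO(body).readline()   # bytes up to and including the first newline
all_lines = io.BytesIO(body).readlines()   # list of bytes objects, one per line
whole = io.BytesIO(body).read()            # the entire body as a single bytes object

print(first_line)             # b'<html>\n'
print(len(all_lines))         # 3
print(whole.decode("utf-8"))  # decoded text, with "café" readable again
```

All three methods return bytes, which is why the decoding step is needed before the content can be treated as text.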

Example:

import urllib.request
url = 'http://www.example.com'
response = urllib.request.urlopen(url)
html = response.read()
html = html.decode('utf-8')
print(html)
response.close()

The code above uses urllib to fetch the content of http://www.example.com and prints it.
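
The GET example can be extended in two directions: closing the response automatically, and sending a POST. The following is a minimal sketch, not a definitive implementation. The helper names build_post_request and fetch_text, the form field, and the 10-second timeout are assumptions made for illustration. Nothing is sent over the network unless fetch_text() is called.

```python
import urllib.parse
import urllib.request


def build_post_request(url, form_fields):
    """Build (but do not send) a POST request carrying URL-encoded form data."""
    data = urllib.parse.urlencode(form_fields).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def fetch_text(request_or_url, encoding="utf-8", timeout=10):
    """Send the request and return the decoded body.

    The with-block closes the response even if read() or decode() raises.
    """
    with urllib.request.urlopen(request_or_url, timeout=timeout) as response:
        return response.read().decode(encoding)


# Hypothetical form field, for illustration only.
req = build_post_request("http://www.example.com", {"q": "python"})
print(req.get_method())  # POST
print(req.data)          # b'q=python'
```

Calling fetch_text('http://www.example.com') performs the same GET as the example above, and fetch_text(req) would send the POST.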

7.2 Regular Expressions

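A regular expression (RegEx) is a pattern that describes a combination of characters to match in a string. In Python, the re module provides regular expression support. Web scrapers often use regular expressions to parse page content and extract the data they need.
The basic steps for using regular expressions are: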

Import the re module.

import re

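Write the regular expression pattern. The syntax covers character matching, quantifiers, grouping, and more.

Match with the functions provided by the re module. Common ones are:
- re.search(pattern, string): scans the string for the pattern and returns a match object for the first match, or None if there is no match.
- re.match(pattern, string): matches the pattern only at the start of the string and returns a match object, or None if it does not match there.
- re.findall(pattern, string): finds all non-overlapping matches in the string and returns them as a list.
- re.finditer(pattern, string): finds all non-overlapping matches in the string and returns an iterator of match objects.
- re.sub(pattern, repl, string): returns a copy of the string with every match replaced by repl.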
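
The differences between the functions listed above are easiest to see when they all run on the same input. The short sketch below uses a made-up string and pattern chosen only for illustration.

```python
import re

text = "id=42; id=7"
pattern = r"\d+"

at_start = re.match(pattern, text)           # None: the string does not start with a digit
first = re.search(pattern, text).group()     # first match anywhere in the string
every = re.findall(pattern, text)            # list of matched strings
spans = [m.span() for m in re.finditer(pattern, text)]  # match objects also carry positions
replaced = re.sub(pattern, "N", text)        # every match replaced

print(at_start)  # None
print(first)     # 42
print(every)     # ['42', '7']
print(spans)     # [(3, 5), (10, 11)]
print(replaced)  # id=N; id=N
```

Because re.match() and re.search() return None when nothing matches, it is safer to check the result before calling group() on it.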
Example:

import re
# Sample text
text = "Hello, my phone number is 123-456-7890."
# Regular expression pattern that matches a phone number
pattern = r'\d{3}-\d{3}-\d{4}'
# Use re.search() to look for a match
match = re.search(pattern, text)
# Print the match if one was found
if match:
    print("Found phone number:", match.group())
else:
    print("No phone number found.")
# Use re.findall() to find all matches
phone_numbers = re.findall(pattern, text)
print("Phone numbers found:", phone_numbers)

Output:

Found phone number: 123-456-7890
Phone numbers found: ['123-456-7890']

In this example, the regular expression \d{3}-\d{3}-\d{4} matches phone numbers in the format XXX-XXX-XXXX. re.search() finds the first match, while re.findall() finds all of them.
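
Since this section is about scraping, here is a sketch of the same idea applied to HTML. The snippet and the pattern are assumptions for illustration. The pattern only handles anchors written exactly in this form, and real pages vary far more, so regular expressions tend to be brittle for HTML, and a parser such as Beautiful Soup (section 7.3) is usually more robust.

```python
import re

# Hypothetical HTML fragment, for illustration only.
html = (
    '<a href="https://example.com/a">First</a> '
    '<a href="https://example.com/b">Second</a>'
)

# Two capture groups: the href value and the link text.
link_pattern = re.compile(r'<a\s+href="([^"]+)">([^<]*)</a>')

# With more than one group, findall() returns a list of tuples, one per match.
links = link_pattern.findall(html)
for url, label in links:
    print(label, "->", url)
```

Compiling the pattern once with re.compile() is a convenience when the same pattern is applied to many pages.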

7.3 Beautiful Soup
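Beautiful Soup is a Python library for extracting data from HTML and XML files. It parses page content so that the data we need is easy to pull out. Beautiful Soup works on top of parsers such as lxml and html5lib and provides a rich set of methods for navigating the parsed document.
The basic steps for using Beautiful Soup are: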

Install the Beautiful Soup library. If it is not installed yet, install it with pip:

pip install beautifulsoup4

Import the BeautifulSoup class.

from bs4 import BeautifulSoup

Load the HTML content into a Beautiful Soup object.

soup = BeautifulSoup(html_content, 'html.parser')

Here html_content is the HTML to parse and 'html.parser' names the parser, in this case the HTML parser built into Python.

Extract data with the methods Beautiful Soup provides. Common ones are:

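- soup.find(): finds the first matching tag, or returns None if there is none.
- soup.find_all(): finds all matching tags.
- soup.select(): finds tags using a CSS selector.
- tag.get_text(): returns the text inside a tag.

Example: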

from bs4 import BeautifulSoup
# Sample HTML content
html_content = """
<html>
<head>
<title>Example Web Page</title>
</head>
<body>
<h1>Welcome to Example Web Page</h1>
<p>This is a paragraph with some text.</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>
</body>
</html>
"""
# Load the HTML content into a Beautiful Soup object
soup = BeautifulSoup(html_content, 'html.parser')
# Extract the title text
title = soup.find('title').get_text()
print("Title:", title)
# Extract the text of every paragraph
paragraphs = soup.find_all('p')
for p in paragraphs:
    print("Paragraph:", p.get_text())
# Use a CSS selector to extract every item of the unordered list
list_items = soup.select('ul li')
for item in list_items:
    print("List item:", item.get_text())

Output:

Title: Example Web Page
Paragraph: This is a paragraph with some text.
List item: Item 1
List item: Item 2
List item: Item 3

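In this example, Beautiful Soup parses a simple HTML page and extracts the title, the paragraph text, and the items of the unordered list. Beautiful Soup offers a rich API for navigating and extracting page content.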
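
Scrapers often need tag attributes, such as link targets, in addition to text. The sketch below shows a few more calls from the same API; the HTML snippet, class names, and URLs are assumptions made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, for illustration only.
html_content = """
<div class="post">
  <a class="external" href="https://example.com/docs">Docs</a>
  <a href="/about">About</a>
  <img src="/logo.png">
</div>
"""

soup = BeautifulSoup(html_content, "html.parser")

# Tag attributes are read like dictionary entries.
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)  # ['https://example.com/docs', '/about']

# class_ (with a trailing underscore) filters by CSS class, because "class" is a Python keyword.
external = soup.find("a", class_="external")
print(external.get_text())  # Docs

# find() returns None when nothing matches, so check before calling methods on the result.
missing = soup.find("table")
print(missing is None)  # True

# select_one() is the single-result counterpart of select(); get() returns None for absent attributes.
img = soup.select_one("div.post img")
print(img["src"], img.get("alt"))  # /logo.png None
```

Using tag.get('name') instead of tag['name'] avoids a KeyError when an attribute may be missing.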

Source: https://blog.csdn.net/qq_44624290/article/details/140097793
