Learning Web Scraping Step by Step (3): Parsing Web Pages with Beautiful Soup
3.2 Parsing Web Pages with Beautiful Soup
3.2.1 Introduction to Beautiful Soup
- A simple library that provides navigation, searching, modification, and parsing of documents. It can replace regular expressions for many tasks, automatically converts input documents to Unicode and output documents to UTF-8, and greatly improves the efficiency of parsing web pages.
3.2.2 Parsers
- Parsers supported by Beautiful Soup:
Parser | Usage | Advantages | Disadvantages
---|---|---|---
Python standard library | BeautifulSoup(markup, "html.parser") | Built in, moderate speed, decent error tolerance | Poor error tolerance in versions before Python 2.7.3 / 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast, good error tolerance | Requires the lxml C library
lxml XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Fast, the only supported XML parser | Requires the lxml C library
html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance, parses documents the way a browser does, produces valid HTML5 | Slow, external Python dependency
lxml is recommended as the parser because it is more efficient. On versions before Python 2.7.3, or Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those Python versions is not stable enough. For example:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello</p>', "lxml")
print(soup.p)
print(soup.p.string)
print(soup.p.text)
print(type(soup.p.string))
print(type(soup.p.text))
The output is as follows:
<p>Hello</p>
Hello
Hello
<class 'bs4.element.NavigableString'>
<class 'str'>
3.2.3 Preparation
Install the parser and the library:
pip3 install lxml
pip3 install beautifulsoup4
3.2.4 Basic Usage
- Example
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a rel="nofollow" href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a rel="nofollow" href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a rel="nofollow" href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
# use lxml as the parser
soup = BeautifulSoup(html_doc, "lxml")
# pretty-print the parsed document
print(soup.prettify())
print(soup.title.string)
- First, the prettify method outputs the parsed document as a string with standard indentation.
- Then, soup.title.string outputs the text content of the title node in the HTML.
3.2.5 Node Selectors
- Still using the HTML document above, select a few nodes as follows:
print(soup.title)         # get the title node
print(soup.title.string)  # get the title text
print(soup.head)          # get the head node
print(soup.p)             # get the first p node
- Output:
<title>The Dormouse's story</title>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
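Note that tag-attribute access such as soup.p always returns only the first matching node in the document. A minimal sketch with illustrative markup (the standard-library parser is used here so no extra install is needed):

```python
from bs4 import BeautifulSoup

# Two <p> nodes, but attribute access returns only the first one.
soup = BeautifulSoup('<p>first</p><p>second</p>', 'html.parser')
print(soup.p)         # <p>first</p>
print(soup.p.string)  # first
```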
3.2.6 Extracting Information
- Continuing to extract from the code above.
print(soup.title.name)      # the tag name of title
print(soup.p.attrs)         # all attributes and their values
print(soup.p.attrs['name']) # the value of the name attribute
print(soup.p['name'])       # same value via shorthand indexing
print(soup.p['class'])      # the class value; note it is a list when there are multiple classes
print(soup.p.string)        # the text content
- Output
title
{'class': ['title'], 'name': 'dromouse'}
dromouse
dromouse
['title']
The Dormouse's story
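The reason class comes back as a list is that class is a multi-valued HTML attribute. A small sketch with illustrative markup (standard-library parser) showing the difference from a single-valued attribute such as id:

```python
from bs4 import BeautifulSoup

# class is multi-valued, so bs4 always returns it as a list;
# a single-valued attribute such as id comes back as a plain string.
soup = BeautifulSoup('<p class="title story" id="p1">Hi</p>', 'html.parser')
print(soup.p['class'])  # ['title', 'story']
print(soup.p['id'])     # p1
```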
- Nested selection
# nested selection
print(soup.head.title)         # the title node inside head
print(type(soup.head.title))   # its type
print(soup.head.title.string)  # its text
- Output
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
- Every node bs4 returns is of type bs4.element.Tag, so selections can be nested node by node.
3.2.7 Associated Selection
- First select a node, then use it as a base to select its child nodes, parent (ancestor) nodes, sibling nodes, and so on.
The contents attribute
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a rel="nofollow" href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a rel="nofollow" href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a rel="nofollow" href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
- The output is as follows:
['Once upon a time there were three little sisters; and their names were\n ', <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, ',\n ', <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and\n ', <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\n and they lived at the bottom of a well.\n ']
The children attribute
- The list above contains the direct children of the p node. The children attribute iterates over the same nodes.
print(soup.p.children)
for i, child in enumerate(soup.p.children):
print(i, child)
- The output is shown below. It confirms that a tag's children generator lets you loop over the tag's direct child nodes.
<list_iterator object at 0x000002063FB1F280>
0 Once upon a time there were three little sisters; and their names were
1 <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
2 ,
3 <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 and
5 <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 ;
and they lived at the bottom of a well.
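Because children mixes Tag nodes with NavigableString pieces of text, it is often useful to keep only the real child elements. A sketch under illustrative markup (standard-library parser; the isinstance filter is one common idiom, not the only way):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p><a href="#1">Elsie</a>, <a href="#2">Lacie</a> and text</p>'
soup = BeautifulSoup(html, 'html.parser')

# Keep only Tag children, dropping the interleaved text nodes.
tags = [child for child in soup.p.children if isinstance(child, Tag)]
print(tags)  # only the two <a> nodes
```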
The descendants attribute
- To get all descendant nodes, use the descendants attribute:
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
print(i, child)
- Output.
<generator object Tag.descendants at 0x0000023DC771F6F0>
0 Once upon a time there were three little sisters; and their names were
1 <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
2 Elsie
3 ,
4 <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
5 Lacie
6 and
7 <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
8 Tillie
9 ;
and they lived at the bottom of a well.
parent and parents
- Get the parent node and the ancestor nodes:
print(soup.a.parent)  # get the parent node
print(soup.a.parents) # get the ancestor nodes
- Output.
<p class="story">Once upon a time there were three little sisters; and their names were
<a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<generator object PageElement.parents at 0x000001BE589BF6F0>
- parents returns a generator.
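Walking that generator visits every ancestor up to the document object itself. A minimal sketch with illustrative markup (standard-library parser):

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# parents walks upward from the node, ending at the document object.
for parent in soup.a.parents:
    print(parent.name)  # p, body, html, [document]
```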
Sibling nodes: next_sibling and previous_sibling
The syntax is as follows:
print(soup.a.next_sibling)      # the next sibling of the node
print(soup.a.previous_sibling)  # the previous sibling of the node
print(soup.a.next_siblings)     # all following siblings
print(soup.a.previous_siblings) # all preceding siblings
- Extracting information
- If a single node is returned, call string, attrs, and similar properties directly to get its text and attributes. If the result is a generator over multiple nodes, convert it to a list first and then pick out the element you need. For example:
print(soup.a.next_sibling.string)
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])
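One subtlety worth showing: a sibling may be a bare text node rather than a tag, and the plural variants are generators that must be materialized before indexing. A sketch under illustrative markup (standard-library parser):

```python
from bs4 import BeautifulSoup

html = '<p><a id="l1">A</a>text<a id="l2">B</a><a id="l3">C</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# next_sibling here is the string 'text', not a tag.
print(soup.a.next_sibling)

# next_siblings is a generator, so convert it to a list before indexing.
siblings = list(soup.a.next_siblings)
print(siblings[-1].attrs['id'])  # id of the last following sibling
```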
3.2.8 Method Selectors
The find_all method
- Queries all elements that match the given conditions.
- Signature: find_all(name, attrs, recursive, text, **kwargs)
- The name parameter matches nodes by tag name. Example:
html = """
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))
print(soup.find_all(name='ul')[0])
- Output.
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
- find_all returns a list, and each element in it is still of type bs4.element.Tag. You can then loop over the results with for and query each one for its <li> child nodes.
for ul in soup.find_all(name='ul'):
print(ul.find_all(name='li'))
- This yields the <li> child nodes:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
- Iterate over each li node again to get its text content.
for ul in soup.find_all(name='ul'):
print(ul.find_all(name='li'))
for li in ul.find_all(name='li'):
print(li.string)
- The string property yields the text inside each <li>:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar
The limit parameter
soup.find_all("li", limit=1)
- limit caps the number of results returned; here only the first match is kept.
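A minimal sketch of the difference (illustrative markup, standard-library parser):

```python
from bs4 import BeautifulSoup

html = '<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# Without limit, find_all returns every match; limit=1 stops at the first.
print(len(soup.find_all('li')))      # 3
print(soup.find_all('li', limit=1))  # [<li>Foo</li>]
```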
The attrs parameter
html = """
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
- As the example shows, the attrs parameter takes a dictionary and the query returns a list. Instead of attrs you can also query directly by id, by class, or simply by tag name.
The class_ keyword argument (class is a reserved word in Python, hence the trailing underscore)
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
print(soup.find_all('li'))
The find method
html = """
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(soup.find(class_="element"))
print(soup.find('li'))
- The find method returns only the first match:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<li class="element">Foo</li>
<li class="element">Foo</li>
The text parameter
import re
html='''
<div class="panel">
<div class="panel-body">
<a>Hello,this is a link</a>
<a>Hello,this is a link , too</a>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
- The output is a list of the matching strings:
['Hello,this is a link', 'Hello,this is a link , too']
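Since Beautiful Soup 4.4 the same filter is also accepted under the parameter name string (text remains an alias). A sketch with illustrative markup, using the standard-library parser:

```python
import re
from bs4 import BeautifulSoup

html = '<div><a>Hello, this is a link</a><a>another line</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# Match text nodes against a compiled regular expression via string=...
print(soup.find_all(string=re.compile('link')))
```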
The get() method
Another important method is get(). Usage:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all("ul"):
    print(ul.get("class"))
- The output below lists the class value of every matching node. The same method can fetch any attribute, such as href, which makes it very practical.
['list']
['list', 'list-small']
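Unlike bracket indexing, get() returns None instead of raising KeyError when the attribute is missing. A small sketch with illustrative markup (standard-library parser):

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister">Elsie</a>'
soup = BeautifulSoup(html, 'html.parser')

# get() is safe for attributes that may be absent.
print(soup.a.get('href'))  # http://example.com/elsie
print(soup.a.get('id'))    # None
```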
3.2.9 CSS Selectors
The select method
- To use CSS selectors, just call the select method and pass in the appropriate selector.
html = """
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('li')[0])
print(type(soup.select('ul')[0]))
- Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<li class="element">Foo</li>
<class 'bs4.element.Tag'>
- select also supports nesting and attribute access.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
print(ul.select('li'))
print(ul['id'])
print(ul.attrs['id'])
- The output is as follows:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
list-1
list-1
[<li class="element">Foo</li>, <li class="element">Bar</li>]
list-2
list-2
The get_text() method
- Besides the string property, text can also be obtained with the get_text() method. Example:
for ul in soup.select('ul'):
for li in ul.select('li'):
print('string:'+li.string)
print('get_text:'+li.get_text())
- Output.
string:Foo
get_text:Foo
string:Bar
get_text:Bar
string:Jay
get_text:Jay
string:Foo
get_text:Foo
string:Bar
get_text:Bar
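The two are not interchangeable in general: string is None when a node has more than one child, while get_text() concatenates all descendant text. A minimal sketch with illustrative markup (standard-library parser):

```python
from bs4 import BeautifulSoup

# The <p> has two children ('Hello ' and <b>), so string cannot pick one.
soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')
print(soup.p.string)      # None
print(soup.p.get_text())  # Hello world
```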
From: https://blog.51cto.com/u_15930659/5991042