
Learning Web Scraping Step by Step (3): Parsing Web Pages with Beautiful Soup

Posted: 2023-01-05 14:38:25  Views: 41


3.2 Parsing Web Pages with Beautiful Soup

3.2.1 Introduction to Beautiful Soup

  • A simple library for navigating, searching, modifying, and parsing documents. It can partially replace regular expressions, and it automatically converts input documents to Unicode and output documents to UTF-8, which greatly improves the efficiency of parsing web pages.

3.2.2 Parsers

  • Parsers supported by Beautiful Soup:

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built in; moderate speed; good document tolerance | Poor tolerance in versions before Python 2.7.3 / 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; good document tolerance | Requires the C library
lxml XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only supported XML parser | Requires the C library
html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance; parses documents the way a browser does; generates valid HTML5 | Very slow; external Python dependency
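The tolerance differences in the table can be seen by feeding the same broken fragment to two parsers; a minimal sketch, assuming lxml is installed:

```python
from bs4 import BeautifulSoup

# The same unclosed fragment, repaired differently by two parsers.
broken = "<p>Hello<b>World"

# html.parser keeps just the fragment and closes the open tags.
print(BeautifulSoup(broken, "html.parser"))   # <p>Hello<b>World</b></p>
# lxml additionally wraps the fragment in <html><body>...</body></html>.
print(BeautifulSoup(broken, "lxml"))
```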
  • lxml is recommended as the parser because it is more efficient. In versions before Python 2.7.3 and, for Python 3, before 3.2.2, you must install lxml or html5lib, because the HTML parsing built into the standard library in those versions is not stable enough. For example:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup('<p>Hello</p>', "lxml")
    print(soup.p)
    print(soup.p.string)
    print(soup.p.text)
    print(type(soup.p.string))
    print(type(soup.p.text))
    

    The output is as follows:

    <p>Hello</p>
    Hello
    Hello
    <class 'bs4.element.NavigableString'>
    <class 'str'>	
    

3.2.3 Preparation

  Install lxml and Beautiful Soup:

pip3 install lxml
pip3 install beautifulsoup4

3.2.4 Basic Usage

  • Example:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
        <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a rel="nofollow" href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a rel="nofollow" href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a rel="nofollow" href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

# lxml as the parser
soup = BeautifulSoup(html_doc, "lxml")
# pretty-print the parsed document
print(soup.prettify())
print(soup.title.string)
  • First, the prettify method is called; it outputs the parsed document as a string with standard indentation.
  • Then soup.title.string is called to print the text content of the HTML title node.

3.2.5 Node Selectors

  • Still using the HTML code above, select a few nodes as follows:
print(soup.title)			# get the title node
print(soup.title.string)	# get the title text
print(soup.head)			# get the head node
print(soup.p)				# get the first p node
  • Output:
<title>The Dormouse's story</title>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

3.2.6 Extracting Information

  • Continuing to extract from the code above:
print(soup.title.name)          # get the name of the title tag
print(soup.p.attrs)             # get all attributes and values of the p node
print(soup.p.attrs['name'])     # get the value of the name attribute
print(soup.p['name'])           # shorthand for the same lookup
print(soup.p['class'])          # get the class value; note it is a list, since class can hold several values
print(soup.p.string)            # get the text content
  • Output:
title
{'class': ['title'], 'name': 'dromouse'}
dromouse
dromouse
['title']
The Dormouse's story
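The note about class above matters because class is a multi-valued attribute in HTML, so bs4 always returns it as a list; a minimal sketch with a hypothetical two-class tag:

```python
from bs4 import BeautifulSoup

# class may hold several values, so bs4 returns a list;
# a single-valued attribute such as id comes back as a plain string.
soup = BeautifulSoup('<p class="title story" id="p1">Hi</p>', "lxml")
print(soup.p["class"])  # ['title', 'story']
print(soup.p["id"])     # p1
```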
  • Nested selection
# nested selection
print(soup.head.title)          # the title node inside head
print(type(soup.head.title))    # its type
print(soup.head.title.string)   # its text content
  • Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
  • Every selection returns a bs4.element.Tag object, so selections can be nested node by node.

3.2.7 Relational Selection

  • First select a node, then use it as the base for selecting its children, parents (ancestors), siblings, and so on.

The contents attribute

html = """
<html>
	<head>
		<title>The Dormouse's story</title>
	</head>
    <body>        
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a rel="nofollow" href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a rel="nofollow" href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a rel="nofollow" href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>
	<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
  • The output is as follows:
['Once upon a time there were three little sisters; and their names were\n            ', <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, ',\n            ', <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and\n            ', <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\n            and they lived at the bottom of a well.\n        ']

The children attribute

  • contents returns a list of the node's direct children, including the <a> nodes. Next, the children attribute:
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
  • The result is as follows, showing that the children iterator lets you loop over a tag's direct children:
<list_iterator object at 0x000002063FB1F280>
0 Once upon a time there were three little sisters; and their names were
            
1 <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
2 ,
            
3 <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  and
            
5 <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 ;
            and they lived at the bottom of a well.

The descendants attribute

  • To get all descendant nodes, call the descendants attribute:
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
  • Output:
<generator object Tag.descendants at 0x0000023DC771F6F0>
0 Once upon a time there were three little sisters; and their names were
            
1 <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
2 Elsie
3 ,
            
4 <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
5 Lacie
6  and
            
7 <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
8 Tillie
9 ;
            and they lived at the bottom of a well.

parent and parents

  • Get the parent node and the ancestor nodes:
print(soup.a.parent)          # get the parent node
print(soup.a.parents)         # get the ancestor nodes (a generator)
  • Output:
<p class="story">Once upon a time there were three little sisters; and their names were
            <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
            <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
            <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>
<generator object PageElement.parents at 0x000001BE589BF6F0>
  • parents returns a generator.
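Because parents is a generator, one common pattern is to materialize it and inspect each ancestor's name; a minimal, self-contained sketch:

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a href="#">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "lxml")

# parents walks from the direct parent all the way up to the
# BeautifulSoup document object itself, whose name is '[document]'.
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```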

Sibling nodes: next_sibling and previous_sibling

The syntax is as follows:

print(soup.a.next_sibling)				# the next sibling node
print(soup.a.previous_sibling)			# the previous sibling node
print(soup.a.next_siblings)				# all following siblings (a generator)
print(soup.a.previous_siblings)		# all preceding siblings (a generator)
  • Extracting information
  • If a single node is returned, you can call string, attrs, and so on directly to get its text and attributes. If the result is a generator yielding multiple nodes, convert it to a list first and then take the element you need. For example:
print(soup.a.next_sibling.string)
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])
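The same list() trick works for the sibling generators; a minimal, self-contained sketch:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "lxml")

# next_siblings yields text nodes and tags alternately.
siblings = list(soup.a.next_siblings)
print(siblings)           # [',', <a id="link2">Lacie</a>, ' and ', <a id="link3">Tillie</a>]
print(siblings[1]["id"])  # link2
```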

3.2.8 Method Selectors

The find_all method

  • Queries for all elements that match the given criteria.
  • Signature: find_all(name, attrs, recursive, text, **kwargs)

The name parameter

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))
print(soup.find_all(name='ul')[0])
  • Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
  • The find_all call returns a list, and each element is still of type bs4.element.Tag, so you can loop over them and query each one again to get the <li> nodes:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
  • Each <ul> yields its <li> child nodes:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
  • Then iterate over each li node to get its text content:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
  • The string property gives the text inside each <li>:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

The limit parameter

soup.find_all("li", limit=1)
  • Limits the number of results returned; here only the first match is kept.
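A self-contained sketch of limit, together with the recursive argument from the signature above (recursive=False restricts the search to direct children only):

```python
from bs4 import BeautifulSoup

html = '<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, "lxml")

# limit stops the search after the given number of matches.
print(soup.find_all("li", limit=1))          # [<li>Foo</li>]

# recursive=False searches direct children only: the document's direct
# child is <html>, so no <li> is found at the top level.
print(soup.find_all("li", recursive=False))  # []
# On the <ul> tag itself, the <li> nodes are direct children.
print(soup.ul.find_all("li", recursive=False))
```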

  • The attrs parameter

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
  • As the example shows, the attrs parameter takes a dictionary, and the result comes back as a list. Without attrs, you can also query directly by id, by class, or simply by tag name, as follows.

The special class_ keyword

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
print(soup.find_all('li'))

The find method
html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(soup.find(class_="element"))
print(soup.find('li'))
  • The find method returns only the first match:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<li class="element">Foo</li>
<li class="element">Foo</li>
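One difference worth noting: when nothing matches, find_all returns an empty list, while find returns None, so it pays to guard before chaining attribute access; a minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello</p></div>", "lxml")

print(soup.find_all("span"))  # []   -- no matches, empty list
print(soup.find("span"))      # None -- no match, None

node = soup.find("span")
if node is not None:          # guard before calling .string etc.
    print(node.string)
```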
  • The text parameter matches against node text and also accepts a regular expression:
import re 
html='''
<div class="panel">
	<div class="panel-body">
		<a>Hello,this is a link</a>
		<a>Hello,this is a link , too</a>
	</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
  • The output is a list of the matching text strings:
['Hello,this is a link', 'Hello,this is a link , too']

get()方法

Another important method is get(). Its usage is as follows:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all("ul"):
    print(ul.get("class"))
  • The output below is the class list of each <ul>. The same method can also fetch attributes such as href, which is very handy.
['list']
['list', 'list-small']

3.2.9 CSS Selectors

The select method

  • To use CSS selectors, simply call the select method and pass in the corresponding CSS selector.
html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('li')[0])
print(type(soup.select('ul')[0]))
  • Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<li class="element">Foo</li>
<class 'bs4.element.Tag'>
  • Nesting and attribute access are supported as well:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
    print(ul['id'])
    print(ul.attrs['id'])
  • The output is as follows:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
list-1
list-1
[<li class="element">Foo</li>, <li class="element">Bar</li>]
list-2
list-2

The get_text() method

  • Besides the string property, text can also be retrieved with the get_text() method. For example:
for ul in soup.select('ul'):
    for li in ul.select('li'):
        print('string:'+li.string)
        print('get_text:'+li.get_text())
  • Output:
string:Foo
get_text:Foo
string:Bar
get_text:Bar
string:Jay
get_text:Jay
string:Foo
get_text:Foo
string:Bar
get_text:Bar
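The two calls agree above because each <li> holds a single text node; once a tag has several children, string returns None while get_text() concatenates all descendant text. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "lxml")

# Several child nodes: .string cannot pick one, so it is None.
print(soup.p.string)      # None
# get_text() joins every piece of descendant text.
print(soup.p.get_text())  # Once upon a time
```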

From: https://blog.51cto.com/u_15930659/5991042
