
Learning Web Scraping Step by Step (3): Parsing Web Pages with Beautiful Soup

Posted: 2023-01-05 14:38:25  Views: 41


3.2 Parsing Web Pages with Beautiful Soup

3.2.1 Introduction to Beautiful Soup

  • A simple library for navigating, searching, modifying, and parsing documents. It can partially replace regular expressions, and it automatically converts input documents to Unicode and output documents to UTF-8, which greatly improves the efficiency of parsing web pages.

3.2.2 Parsers

  • Parsers supported by Beautiful Soup:

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built in; moderate speed; good document tolerance | Poor tolerance in versions before Python 2.7.3 / 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; good document tolerance | Requires the C library
lxml XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only supported XML parser | Requires the C library
html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance; parses documents the way a browser does; generates valid HTML5 | Very slow; external Python dependency
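The tolerance differences in the table can be seen by feeding the same broken fragment to two parsers; a minimal sketch, assuming lxml is installed:

```python
from bs4 import BeautifulSoup

# The same unclosed fragment, repaired differently by two parsers.
broken = "<p>Hello<b>World"

# html.parser keeps just the fragment and closes the open tags.
print(BeautifulSoup(broken, "html.parser"))   # <p>Hello<b>World</b></p>
# lxml additionally wraps the fragment in <html><body>...</body></html>.
print(BeautifulSoup(broken, "lxml"))
```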
  • lxml is recommended as the parser because it is more efficient. In versions before Python 2.7.3 and, for Python 3, before 3.2.2, you must install lxml or html5lib, because the HTML parsing built into the standard library in those versions is not stable enough. For example:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup('<p>Hello</p>', "lxml")
    print(soup.p)
    print(soup.p.string)
    print(soup.p.text)
    print(type(soup.p.string))
    print(type(soup.p.text))
    

    The output is as follows:

    <p>Hello</p>
    Hello
    Hello
    <class 'bs4.element.NavigableString'>
    <class 'str'>	
    

3.2.3 Preparation

  Install lxml and Beautiful Soup:

pip3 install lxml
pip3 install beautifulsoup4

3.2.4 Basic Usage

  • Example:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
        <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a rel="nofollow" href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a rel="nofollow" href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a rel="nofollow" href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

# lxml as the parser
soup = BeautifulSoup(html_doc, "lxml")
# pretty-print the parsed document
print(soup.prettify())
print(soup.title.string)
  • First, the prettify method is called; it outputs the parsed document as a string with standard indentation.
  • Then soup.title.string is called to print the text content of the HTML title node.

3.2.5 Node Selectors

  • Still using the HTML code above, select a few nodes as follows:
print(soup.title)			# get the title node
print(soup.title.string)	# get the title text
print(soup.head)			# get the head node
print(soup.p)				# get the first p node
  • Output:
<title>The Dormouse's story</title>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

3.2.6 Extracting Information

  • Continuing to extract from the code above:
print(soup.title.name)          # get the name of the title tag
print(soup.p.attrs)             # get all attributes and values of the p node
print(soup.p.attrs['name'])     # get the value of the name attribute
print(soup.p['name'])           # shorthand for the same lookup
print(soup.p['class'])          # get the class value; note it is a list, since class can hold several values
print(soup.p.string)            # get the text content
  • Output:
title
{'class': ['title'], 'name': 'dromouse'}
dromouse
dromouse
['title']
The Dormouse's story
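The note about class above matters because class is a multi-valued attribute in HTML, so bs4 always returns it as a list; a minimal sketch with a hypothetical two-class tag:

```python
from bs4 import BeautifulSoup

# class may hold several values, so bs4 returns a list;
# a single-valued attribute such as id comes back as a plain string.
soup = BeautifulSoup('<p class="title story" id="p1">Hi</p>', "lxml")
print(soup.p["class"])  # ['title', 'story']
print(soup.p["id"])     # p1
```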
  • Nested selection
# nested selection
print(soup.head.title)          # the title node inside head
print(type(soup.head.title))    # its type
print(soup.head.title.string)   # its text content
  • Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
  • Every selection returns a bs4.element.Tag object, so selections can be nested node by node.

3.2.7 Relational Selection

  • First select a node, then use it as the base for selecting its children, parents (ancestors), siblings, and so on.

The contents attribute

html = """
<html>
	<head>
		<title>The Dormouse's story</title>
	</head>
    <body>        
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a rel="nofollow" href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a rel="nofollow" href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a rel="nofollow" href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>
	<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
  • The output is as follows:
['Once upon a time there were three little sisters; and their names were\n            ', <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, ',\n            ', <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and\n            ', <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\n            and they lived at the bottom of a well.\n        ']

The children attribute

  • contents returns a list of the node's direct children, including the <a> nodes. Next, the children attribute:
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
  • The result is as follows, showing that the children iterator lets you loop over a tag's direct children:
<list_iterator object at 0x000002063FB1F280>
0 Once upon a time there were three little sisters; and their names were
            
1 <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
2 ,
            
3 <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  and
            
5 <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 ;
            and they lived at the bottom of a well.

The descendants attribute

  • To get all descendant nodes, call the descendants attribute:
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
  • Output:
<generator object Tag.descendants at 0x0000023DC771F6F0>
0 Once upon a time there were three little sisters; and their names were
            
1 <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
2 Elsie
3 ,
            
4 <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
5 Lacie
6  and
            
7 <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
8 Tillie
9 ;
            and they lived at the bottom of a well.

parent and parents

  • Get the parent node and the ancestor nodes:
print(soup.a.parent)          # get the parent node
print(soup.a.parents)         # get the ancestor nodes (a generator)
  • Output:
<p class="story">Once upon a time there were three little sisters; and their names were
            <a rel="nofollow" class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
            <a rel="nofollow" class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
            <a rel="nofollow" class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>
<generator object PageElement.parents at 0x000001BE589BF6F0>
  • parents returns a generator.
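Because parents is a generator, one common pattern is to materialize it and inspect each ancestor's name; a minimal, self-contained sketch:

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a href="#">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "lxml")

# parents walks from the direct parent all the way up to the
# BeautifulSoup document object itself, whose name is '[document]'.
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```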

Sibling nodes: next_sibling and previous_sibling

The syntax is as follows:

print(soup.a.next_sibling)				# the next sibling node
print(soup.a.previous_sibling)			# the previous sibling node
print(soup.a.next_siblings)				# all following siblings (a generator)
print(soup.a.previous_siblings)		# all preceding siblings (a generator)
  • Extracting information
  • If a single node is returned, you can call string, attrs, and so on directly to get its text and attributes. If the result is a generator yielding multiple nodes, convert it to a list first and then take the element you need. For example:
print(soup.a.next_sibling.string)
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])
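The same list() trick works for the sibling generators; a minimal, self-contained sketch:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "lxml")

# next_siblings yields text nodes and tags alternately.
siblings = list(soup.a.next_siblings)
print(siblings)           # [',', <a id="link2">Lacie</a>, ' and ', <a id="link3">Tillie</a>]
print(siblings[1]["id"])  # link2
```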

3.2.8 Method Selectors

The find_all method

  • Queries for all elements that match the given criteria.
  • Signature: find_all(name, attrs, recursive, text, **kwargs)

The name parameter

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))
print(soup.find_all(name='ul')[0])
  • Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
  • The find_all call returns a list, and each element is still of type bs4.element.Tag, so you can loop over them and query each one again to get the <li> nodes:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
  • Each <ul> yields its <li> child nodes:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
  • Then iterate over each li node to get its text content:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
  • The string property gives the text inside each <li>:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

The limit parameter

soup.find_all("li", limit=1)
  • Limits the number of results returned; here only the first match is kept.
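A self-contained sketch of limit, together with the recursive argument from the signature above (recursive=False restricts the search to direct children only):

```python
from bs4 import BeautifulSoup

html = '<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, "lxml")

# limit stops the search after the given number of matches.
print(soup.find_all("li", limit=1))          # [<li>Foo</li>]

# recursive=False searches direct children only: the document's direct
# child is <html>, so no <li> is found at the top level.
print(soup.find_all("li", recursive=False))  # []
# On the <ul> tag itself, the <li> nodes are direct children.
print(soup.ul.find_all("li", recursive=False))
```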

  • The attrs parameter

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
  • As the example shows, the attrs parameter takes a dictionary, and the result comes back as a list. Without attrs, you can also query directly by id, by class, or simply by tag name, as follows.

The special class_ keyword

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
print(soup.find_all('li'))

The find method
html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(soup.find(class_="element"))
print(soup.find('li'))
  • The find method returns only the first match:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<li class="element">Foo</li>
<li class="element">Foo</li>
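One difference worth noting: when nothing matches, find_all returns an empty list, while find returns None, so it pays to guard before chaining attribute access; a minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello</p></div>", "lxml")

print(soup.find_all("span"))  # []   -- no matches, empty list
print(soup.find("span"))      # None -- no match, None

node = soup.find("span")
if node is not None:          # guard before calling .string etc.
    print(node.string)
```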
  • The text parameter matches against node text and also accepts a regular expression:
import re 
html='''
<div class="panel">
	<div class="panel-body">
		<a>Hello,this is a link</a>
		<a>Hello,this is a link , too</a>
	</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
  • The output is a list of the matching text strings:
['Hello,this is a link', 'Hello,this is a link , too']

get()方法

Another important method is get(). Its usage is as follows:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all("ul"):
    print(ul.get("class"))
  • The output below is the class list of each <ul>. The same method can also fetch attributes such as href, which is very handy.
['list']
['list', 'list-small']

3.2.9 CSS Selectors

The select method

  • To use CSS selectors, simply call the select method and pass in the corresponding CSS selector.
html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('li')[0])
print(type(soup.select('ul')[0]))
  • Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<li class="element">Foo</li>
<class 'bs4.element.Tag'>
  • Nesting and attribute access are supported as well:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
    print(ul['id'])
    print(ul.attrs['id'])
  • The output is as follows:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
list-1
list-1
[<li class="element">Foo</li>, <li class="element">Bar</li>]
list-2
list-2

The get_text() method

  • Besides the string property, text can also be retrieved with the get_text() method. For example:
for ul in soup.select('ul'):
    for li in ul.select('li'):
        print('string:'+li.string)
        print('get_text:'+li.get_text())
  • Output:
string:Foo
get_text:Foo
string:Bar
get_text:Bar
string:Jay
get_text:Jay
string:Foo
get_text:Foo
string:Bar
get_text:Bar
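The two calls agree above because each <li> holds a single text node; once a tag has several children, string returns None while get_text() concatenates all descendant text. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "lxml")

# Several child nodes: .string cannot pick one, so it is None.
print(soup.p.string)      # None
# get_text() joins every piece of descendant text.
print(soup.p.get_text())  # Once upon a time
```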

From: https://blog.51cto.com/u_15930659/5991042
