Parsing libraries: bs4

Tags: bs4, tag, find, html, bs, print, parsing, class

bs4.BeautifulSoup

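The first argument is the HTML source as a string. You can get it by sending a request with urlopen and calling read(), or by sending a request with requests.get() and reading its text attribute (text is a property, not a method).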

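The second argument names the HTML parser. You can use html.parser, which ships with Python, or lxml, which has to be installed separately.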
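The two arguments can be sketched as follows. This is a minimal example, assuming a made-up markup string in place of a real response, and html.parser so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# In real use the markup would come from a response, for example
# urllib.request.urlopen(url).read() or requests.get(url).text.
# A literal string (hypothetical content) keeps this sketch self-contained.
markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# First argument: the HTML source. Second argument: the parser name.
soup = BeautifulSoup(markup, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.title.string)    # Demo
```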

prettify()

Returns the HTML source as a string with standard indentation.
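A small sketch of what prettify() does, assuming a one-line snippet invented for the illustration: the compact markup comes back spread over several indented lines.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup
compact = "<div><p>Hi</p></div>"
soup = BeautifulSoup(compact, "html.parser")

# prettify() returns a string; each nested level goes on its own line
pretty = soup.prettify()
print(pretty)
print(len(compact.splitlines()), len(pretty.splitlines()))
```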

Attribute selection

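from bs4 import BeautifulSoup

# Note: the <p> tags and their class values below are an assumption.
# The sample needs them for bs.p and bs.p['class'] to resolve.
html = '''
<html lang="en">
<head><title>The Dormouse's story</title></head>
<body>
<p class="title">The Dormouse's story</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2"> Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

 ... 
</body>
</html>'''
bs = BeautifulSoup(html, 'lxml')
# Return the HTML with standard indentation
print(bs.prettify())
# Select a tag by name. This returns a bs4.element.Tag for the first match;
# printing it shows the tag together with its contents
print(bs.p)
print(bs.a)
# name is the tag's name; this prints p
print(bs.p.name)
# Attribute access (class is multi-valued, so the result is a list)
print(bs.p.attrs['class'])
print(bs.p['class'])
# Tag content
print(bs.p.string)
# Nested selection
print(bs.head.title)
# Related selection
# contents holds all direct children of body, stored in a list
print(bs.body.contents)
# children also yields all direct children, but as an iterator
for i, child in enumerate(bs.body.children):
    print(i, child)
# descendants yields every descendant node; it is a generator
print(bs.body.descendants)
# parent is the parent of the current node
print(bs.p.parent)
# parents yields every ancestor of the current node; it is a generator
print(bs.p.parents)
# next_sibling and previous_sibling are the next and previous sibling nodes
# next_siblings and previous_siblings yield all following and all preceding siblings (generators)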
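The sibling attributes can be sketched as follows. This is a minimal example, assuming a hand-written snippet whose ids mirror the sample above. The tags are placed with no whitespace between them, because text nodes (including plain whitespace) also count as siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three adjacent <a> tags with no text between them
html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

second = soup.find(id="link2")
print(second.next_sibling["id"])      # link3
print(second.previous_sibling["id"])  # link1

# next_siblings walks forward from the node
first = soup.find(id="link1")
following = [tag["id"] for tag in first.next_siblings]
print(following)  # ['link2', 'link3']

# previous_siblings walks backward, nearest sibling first
third = soup.find(id="link3")
preceding = [tag["id"] for tag in third.previous_siblings]
print(preceding)  # ['link2', 'link1']
```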

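Method selectors: find_all()

You can search by tag name, for example bs.find_all(name='p')

You can search by attributes, for example bs.find_all(attrs={'class': 'container'})

Common attributes can be passed directly as keyword arguments: bs.find_all(id='item-1'), bs.find_all(class_='container')

Because class is a reserved word in Python, the keyword is spelled class_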
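The four call styles above can be compared side by side. This is a sketch on made-up markup; the container class and item-1 id are borrowed from the text, everything else is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the illustration
html = """
<div class="container" id="item-1"><p>first</p></div>
<div class="container"><p>second</p></div>
<div class="other"><p>third</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all(name="p")
by_attrs = soup.find_all(attrs={"class": "container"})
by_id = soup.find_all(id="item-1")
by_class = soup.find_all(class_="container")

print(len(by_name))   # 3
print(len(by_attrs))  # 2
print(len(by_id))     # 1
# attrs={'class': ...} and class_=... select the same tags
print(by_attrs == by_class)  # True
```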

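You can search by text. The text argument can be a string or a compiled regular expression. A string matches text that is exactly equal to it; a regular expression matches text in which the pattern is found anywhere. Used on its own, text returns the matching strings (NavigableString objects) rather than tags; combined with a tag name, it returns the tags whose .string matches. Since Beautiful Soup 4.4.0 the argument is named string, and text is the older name.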
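A short sketch of text matching, assuming a made-up list and using the newer keyword name string (with the same values, text behaves the same way on versions that still accept it).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for the illustration
html = "<ul><li>apple</li><li>pineapple</li><li>pear</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node
exact = soup.find_all(string="apple")
print(exact)  # ['apple']

# A compiled pattern matches anywhere inside the text node
partial = soup.find_all(string=re.compile("apple"))
print(partial)  # ['apple', 'pineapple']

# Combined with a tag name, the result is tags instead of strings
tags = soup.find_all("li", string=re.compile("apple"))
print([tag.get_text() for tag in tags])  # ['apple', 'pineapple']
```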

recursive defaults to True, which searches all descendants. With recursive=False only direct children are examined.
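The effect of recursive is easiest to see on a nested structure. A minimal sketch, assuming made-up markup with one direct and one nested paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> is a direct child, the other is nested deeper
html = '<div id="outer"><p>direct</p><section><p>nested</p></section></div>'
soup = BeautifulSoup(html, "html.parser")
outer = soup.find(id="outer")

all_p = outer.find_all("p")                      # searches every descendant
direct_p = outer.find_all("p", recursive=False)  # direct children only

print(len(all_p))          # 2
print(len(direct_p))       # 1
print(direct_p[0].string)  # direct
```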

There are many other search methods. They are used like find_all and differ only in the part of the tree they search, for example

find()

find_parent() and find_parents()

find_next_sibling() and find_next_siblings()

find_previous_sibling() and find_previous_siblings()

find_next() and find_all_next()

find_previous() and find_all_previous()
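A few of these methods in action, as a sketch on made-up markup. The singular forms return the first match or None, while the plural forms return every match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the illustration
html = '<ul><li id="a">one</li><li id="b">two</li><li id="c">three</li></ul><p>end</p>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches
print(soup.find("li")["id"])  # a
print(soup.find("table"))     # None

b = soup.find(id="b")
print(b.find_parent("ul").name)               # ul
print(b.find_next_sibling("li")["id"])        # c
print(b.find_previous_sibling("li")["id"])    # a
print(b.find_next("p").string)                # end
print([li["id"] for li in b.find_previous_siblings("li")])  # ['a']
```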

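For find_parent() and find_parents(), the outermost tag is <html>, but the parent of <html> is the BeautifulSoup object that represents the whole document, so the number of parents is one more than you might expect.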
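The extra ancestor shows up when the parents of a tag are listed. A minimal sketch, assuming a three-level document made up for the illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has two tag ancestors, <body> and <html>
soup = BeautifulSoup("<html><body><p>text</p></body></html>", "html.parser")
p = soup.p

names = [parent.name for parent in p.parents]
print(names)  # ['body', 'html', '[document]']

# The last ancestor is the BeautifulSoup object itself
top = list(p.parents)[-1]
print(top is soup)  # True
```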

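CSS selectors

Use select() to filter tags; its argument is a CSS selector string.

Select tags with class panel-heading whose parent has class panel
Select all li tags under a ul
Select all tags with class element inside the tag whose id is list-2
Select the first ul tag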
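The four selections above might look like this. The selector strings and the markup are assumptions reconstructed from the descriptions, not code taken from the original post. Note that a space in a CSS selector matches any descendant, not only a direct child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup matching the class and id names in the descriptions
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
    <ul class="list" id="list-2"><li class="element">Baz</li><li>Qux</li></ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

headings = soup.select(".panel .panel-heading")  # class panel-heading inside class panel
items = soup.select("ul li")                     # every li under a ul
elements = soup.select("#list-2 .element")       # class element inside id list-2
first_ul = soup.select_one("ul")                 # first ul, same as soup.select("ul")[0]

print(len(headings))                      # 1
print(len(items))                         # 4
print([e.get_text() for e in elements])   # ['Baz']
print(first_ul["id"])                     # list-1
```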

From: https://www.cnblogs.com/wy12148/p/16777761.html
