Tags: python doc element BeautifulSoup HTML print import BeautifulSoup4 find

BeautifulSoup extracts data from HTML and XML documents.

Official site: https://www.crummy.com/software/BeautifulSoup/

Official documentation (Chinese translation):

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

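Installation

$ pip install beautifulsoup4

Or (bs4 on PyPI is a thin wrapper package that pulls in beautifulsoup4; the first form is the canonical one)

$ pip install bs4

Initialization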

from bs4 import BeautifulSoup

BeautifulSoup(markup="", features=None)

markup: the object to parse, either a file object or an HTML string
features: the parser to use
Returns a document object

from bs4 import BeautifulSoup

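# From a file object
soup = BeautifulSoup(open("test.html"))
# From a markup string
soup = BeautifulSoup("<html>data</html>")

If no parser is specified, BeautifulSoup picks the best parser library installed on the system and emits a warning, so the result can differ between environments.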

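The table below lists the main parsers with their advantages and disadvantages:

Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; decent speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Very fast; the only supported XML parser	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; requires an external Python package

lxml is the recommended parser because it is fast.

Specify the parser explicitly so that the code uses the same parser in every environment.

lxml must be installed separately: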

$ pip install lxml
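
As a small sketch of pinning the parser, the snippet below passes the parser name explicitly. It uses html.parser only so that it runs without extra packages, and the markup fragment is made up for illustration.

```python
from bs4 import BeautifulSoup

# Naming the parser explicitly avoids the "No parser was explicitly specified"
# warning and keeps the resulting tree the same on every machine.
markup = "<p>data</p>"  # illustrative fragment, not taken from test.html
soup = BeautifulSoup(markup, "html.parser")

# html.parser keeps the fragment as it is and does not wrap it in <html>/<body>
rendered = str(soup)
print(rendered)
```

With "lxml" the same fragment comes back wrapped in <html><body>, which is one reason to fix the parser choice instead of relying on whatever happens to be installed.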

Test file (test.html)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>首页</title>
</head>
<body>
<h1>欢迎您</h1>
<div id="main">
    <h3 class="title highlight"><a href="http://www.python.org">python</a>就是好用</h3>
    <div class="content">
        <p id='first'>字典</p>
        <p id='second'>列表</p>
        <input type="hidden" name="_csrf"
               value="7139e401481ef2f46ce98b22af4f4bed">
        <!-- comment -->
        <img id="bg1" src="http://www.baidu.com/">
        <img id="bg2" src="http://httpbin.org/">
    </div>
</div>
<p>bottom</p>
</body>
</html>

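The four object types

BeautifulSoup parses an HTML document into a tree in which every node is a Python object.

There are four kinds of object:

BeautifulSoup
Tag
NavigableString
Comment

The BeautifulSoup object

A BeautifulSoup object represents the whole document.

The Tag object

A Tag corresponds to a tag in the HTML.

A Tag has two commonly used attributes:

name: the name of the Tag, i.e. the tag name
attrs: a dictionary of the tag's attributes
- Multi-valued attributes: class can hold several values, as in <h3 class="title highlight">python就是好用</h3>, so its value is a list ({'class': ['title', 'highlight']})
- Attributes can be modified and deleted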

from bs4 import BeautifulSoup
from bs4.element import Tag

with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    # print(type(doc), doc)
    # Look up a tag, two ways
    # 1. find
    # t: Tag = doc.find('h3')
    # 2. attribute access
    t: Tag = doc.h3
    print(type(t), t)
    # Search further inside the tag (named a so it does not shadow the file handle f)
    a = t.find('a')
    # Read the text
    print(a.text)
    # Read an attribute, four ways
    # 1.
    print(t.get('class'))  # returns None if missing
    # 2.
    print(t['class'])  # raises KeyError if missing
    # 3.
    print(t.attrs.get('class'))  # returns None if missing
    # 4.
    print(t.attrs['class'])  # raises KeyError if missing

    print(t.a.get('href'))
    # Modify an attribute
    t.a['href'] = 'http://www.bing.com'
    print(t.a.get('href'))

    # Delete an attribute
    del t.a['href']

"""
Output
<class 'bs4.element.Tag'> <h3 class="title highlight"><a href="http://www.python.org">python</a>就是好用</h3>
python
['title', 'highlight']
['title', 'highlight']
['title', 'highlight']
['title', 'highlight']
http://www.python.org
http://www.bing.com
"""

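NavigableString

To get only the text inside a tag, without the markup, use NavigableString.

tag.string returns a NavigableString when the tag has exactly one child and that child is a string (or a single child tag that itself has a .string).

If the tag has more than one child, .string returns None.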

from bs4 import BeautifulSoup
from bs4.element import Tag

with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    # print(type(doc), doc)
    
    print(doc.h3.a.string)
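
The three cases of .string can be checked on a small fragment. This is a minimal sketch: the markup is invented for illustration, and html.parser is used so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Illustrative markup (not test.html): an <h3> with two children and a <p>
# whose only child is a <b> tag.
html = '<h3 class="title"><a href="#">python</a> rocks</h3><p><b>only</b></p>'
soup = BeautifulSoup(html, "html.parser")

a_string = soup.a.string    # <a> has a single string child
h3_string = soup.h3.string  # <h3> has two children (<a> and a string)
p_string = soup.p.string    # <p> has one child tag, which has one string

print(repr(a_string), h3_string, repr(p_string))
```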

The Comment object

An HTML comment; BeautifulSoup parses it into a Comment object.
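
A short sketch of how comments can be told apart from ordinary text, assuming a made-up fragment modelled on test.html and html.parser as the parser. Comment is a subclass of NavigableString, so an isinstance check is enough.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Illustrative fragment modelled on the test file
html = '<div class="content"><!-- comment --><p id="first">text</p></div>'
soup = BeautifulSoup(html, "html.parser")

first_child = soup.div.contents[0]

# Walk every node and separate comments from ordinary strings
comments = [n for n in soup.descendants if isinstance(n, Comment)]
plain_text = [
    n for n in soup.descendants
    if isinstance(n, NavigableString) and not isinstance(n, Comment)
]

print(type(first_child).__name__, comments, plain_text)
```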

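Traversing the document tree

Day-to-day work is mostly about finding the content you care about in the document tree, that is, walking its nodes. The examples use test.html from above.

Using Tag

doc.div finds the first div node, searching from the root
doc.div.p finds the first div from the root and returns it as a Tag, then finds the first p under that Tag and returns it as a Tag
doc.p shows that traversal is depth-first: it returns the p containing "字典", not the one containing "bottom".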

Traversing direct children

doc.div.contents returns all direct children, of any type, as a list
doc.div.children returns an iterator over contents, iter(self.contents)

from bs4 import BeautifulSoup
from bs4.element import Tag

with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    
    # Direct children, including text nodes
    # As an iterator
    print(*doc.children)
    # As a list
    print(doc.contents)

Traversing all descendants

doc.div.descendants yields every descendant of the first div, of any type; the iteration order is depth-first

from bs4 import BeautifulSoup
from bs4.element import Tag

with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    
    
    # All descendants, including text nodes
    print(*doc.descendants, sep='||')
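
To make the difference between children and descendants visible, here is a minimal sketch on an invented fragment. The label helper is hypothetical and only serves to print tags and strings uniformly.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative fragment (not test.html)
html = '<div><h3><a>python</a>rocks</h3><p>dict</p></div>'
soup = BeautifulSoup(html, "html.parser")


def label(node):
    """Hypothetical helper: show tags as <name> and strings as their text."""
    return f"<{node.name}>" if isinstance(node, Tag) else str(node)


# Direct children only
children = [label(n) for n in soup.div.children]
# Every descendant, depth-first: a tag is followed by its own subtree
descendants = [label(n) for n in soup.div.descendants]

print(children)
print(descendants)
```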

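Traversing strings

doc.div.string returns None, because .string requires doc.div to have exactly one child of type NavigableString, as in <div>only string</div>.

print(doc.div.string)  # None, because there is more than one child
print("".join(doc.div.strings))  # strings is a generator; the joined text keeps the surrounding whitespace
print("".join(doc.div.stripped_strings))  # stripped_strings is a generator; surrounding whitespace is removed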

Traversing ancestors

from bs4 import BeautifulSoup
from bs4.element import Tag

with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    
    # Parent
    print(doc.div.parent.name)
    # All ancestors
    print(*map(lambda x: x.name, doc.div.p.parents))

Traversing siblings

from bs4 import BeautifulSoup
from bs4.element import Tag

with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    
    # The next sibling node (here a text node, not a tag)
    print(doc.a.next_sibling)
    # All following sibling nodes
    print(*doc.div.h3.next_siblings, sep='|||')
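
Because next_sibling returns whatever node comes next, it is often a whitespace string rather than a tag. The sketch below, on an invented fragment, contrasts it with find_next_sibling, which skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

# Illustrative fragment: the two <p> tags are separated by a newline
html = '<div>\n<p id="first">dict</p>\n<p id="second">list</p>\n</div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="first")

raw_next = first.next_sibling             # the newline string between the tags
next_tag = first.find_next_sibling("p")   # the next sibling that is a <p> tag

print(repr(raw_next), next_tag)
```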

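Traversing other elements

next_element is the next object that was parsed (a string or a tag), whatever its relationship to the current node; this differs from next_sibling, which is the next sibling node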

from bs4 import BeautifulSoup
from bs4.element import Tag

with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    
    # The next parsed node, whatever the relationship
    print(doc.div.div.next_element)
    # All nodes after that one, whatever the relationship
    print(*doc.div.div.next_element.next_elements, sep='||||')

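Searching the document tree

The find family has many methods; see the documentation for the rest

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) -> list

In Beautiful Soup 4.4.0 and later the text argument is also available under the name string.

name

The official documentation calls this a filter; it filters tags. The argument can be one of the following types:

1. String

A tag name as a string; the tag name must match the whole string

# Tag name, exact match
print(doc.find_all('p'))

2. Regular expression object

Tag names are matched against the pattern of the compiled regular expression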

import re


# Regular expression, "contains" (the pattern is applied with search)
print(doc.find_all(re.compile('p|a')))
# Regular expression, exact match, OR
print(doc.find_all(re.compile('^(p|a)$')))
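
The difference between the two patterns shows up as soon as the document has a tag whose name merely contains p or a. A minimal sketch on an invented fragment:

```python
import re

from bs4 import BeautifulSoup

# Illustrative fragment: "span" contains both "p" and "a"
html = "<p>1</p><a>2</a><span>3</span>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: matches any tag name that contains p or a
contains = [t.name for t in soup.find_all(re.compile("p|a"))]
# Anchored pattern: the whole tag name must be p or a
exact = [t.name for t in soup.find_all(re.compile("^(p|a)$"))]

print(contains, exact)
```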

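3. List

# Exact match, OR
print(doc.find_all(['p', 'a']))

4. True or None

With True or None, find_all returns every node that is neither a string nor a comment, i.e. every Tag

print(list(map(lambda x: x.name, doc.find_all(True))))
print(list(map(lambda x: x.name, doc.find_all(None))))
print(list(map(lambda x: x.name, doc.find_all())))

In the source code, all three calls above do return Tag objects

5. Function

If the filters above cannot pick out the nodes you want, pass a function; it must accept exactly one argument.

If the function returns True, the current node matches; False means no match.

Example: return the tags whose class attribute has more than one value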

from bs4 import BeautifulSoup
from bs4.element import Tag

def fn(x: Tag):
    # if isinstance(x, Tag):
    #     # print(x.name)
    #     # print(x.attrs)
    #     for i, j in x.attrs.items():
    #         if isinstance(j, list) and len(j) > 1:
    #             print(x.name, '=============')
    #             return True
    #     return False
    return isinstance(x, Tag) and len(x.get('class', [])) > 1

with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    
    # find_all calls fn on every tag in the document
    print(doc.find_all(fn), '-----------')

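Keyword arguments

When an argument is passed by keyword and its name is not one of the named parameters of the find functions, it is collected into kwargs and treated as a tag attribute to search on.

The attribute value can be a string, a regular expression object, True, a list, or a function.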

import re
from bs4 import BeautifulSoup
from bs4.element import Tag


def fn(x):
    print(type(x), x)
    return x == 'bg1'


with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    # Search via attrs
    print(doc.find_all(attrs={'id': "first"}))
    # Search via kwargs
    # Tags with id='first'
    print(doc.find_all(id='first'))
    # Tags that have a src attribute
    print(doc.find_all(src=True))
    # Tags that have both src and id (AND)
    print(doc.find_all(id=True, src=True))
    # Tags whose id is 'second' or ends with a digit (OR)
    print(doc.find_all(id=['second', re.compile(r'\d$')]))
    # Tags whose id ends with a digit and that also have src (AND)
    print(doc.find_all(id=re.compile(r'\d$'), src=True))
    print('=' * 100)
    # A function filter: fn is called with each id value
    print(doc.find_all(id=fn))

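Special handling of the CSS class

class is a Python keyword, so use class_.

class is a multi-valued attribute; you can match any one of its values, or match the full attribute string exactly.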

import re
from bs4 import BeautifulSoup
from bs4.element import Tag

with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    # Tags whose class includes 'title'
    print(doc.find_all(class_='title'))
    # Regular expression
    print(doc.find_all(class_=re.compile('e$')))
    # Any single CSS class can be used
    print(doc.find_all(class_="title"))
    # Wrong order: not found
    print(doc.find_all(class_="highlight title"))
    # Same order: found, an exact match on the whole string
    print(doc.find_all(class_="title highlight"))
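
Since the multi-class string match depends on order, a CSS selector is one order-independent alternative. The following sketch uses an invented one-line fragment and assumes the select() support that ships with current bs4 releases.

```python
from bs4 import BeautifulSoup

# Illustrative fragment modelled on the <h3> in test.html
html = '<h3 class="title highlight">python</h3>'
soup = BeautifulSoup(html, "html.parser")

one_class = soup.find_all(class_="title")              # any single class matches
wrong_order = soup.find_all(class_="highlight title")  # string compared as a whole
same_order = soup.find_all(class_="title highlight")   # exact attribute string
any_order = soup.select("h3.highlight.title")          # both classes, any order

print(len(one_class), len(wrong_order), len(same_order), len(any_order))
```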

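The attrs parameter

attrs takes a dictionary whose keys are attribute names and whose values can be a string, a regular expression object, True, or a list. Several attributes can be given at once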

import re
from bs4 import BeautifulSoup
from bs4.element import Tag

with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    print(doc.find_all(attrs={'class': 'title'}))
    print(doc.find_all(attrs={'class': 'highlight'}))
    print(doc.find_all(attrs={'class': 'title highlight'}))
    print(doc.find_all(attrs={'id': True}))
    print(doc.find_all(attrs={'id': re.compile(r'\d$')}))
    print(list(map(lambda x: x.name, doc.find_all(attrs={
        'id': True, 'src': True
    }))))
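
attrs is also the way to search on attribute names that cannot be passed as keywords, such as name (which collides with the first parameter of find_all) or names containing a hyphen. A sketch on an invented fragment; the data-role attribute is hypothetical and does not appear in test.html.

```python
from bs4 import BeautifulSoup

# Illustrative fragment modelled on the hidden input in test.html;
# data-role is a made-up attribute for the hyphen case.
html = '<form><input type="hidden" name="_csrf" value="7139e401" data-role="token"></form>'
soup = BeautifulSoup(html, "html.parser")

# name= is the tag-name filter, so this looks for a tag called <_csrf>
by_keyword = soup.find_all(name="_csrf")
# attrs lets the HTML attribute called name be searched
by_attrs = soup.find_all(attrs={"name": "_csrf"})
# data-role is not a valid Python keyword, so it also goes through attrs
by_data = soup.find_all(attrs={"data-role": "token"})

print(by_keyword, by_attrs, by_data)
```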

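The text parameter

The text parameter searches the string content of the document; it accepts a string, a regular expression object, True, a list, or a function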

import re
from bs4 import BeautifulSoup
from bs4.element import Tag


def fn(t):
    return t != '\n'


with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    # All text nodes
    print(doc.find_all(text=True))
    # A string
    print(doc.find_all(text='字典'))
    # A function that filters the text
    print(doc.find_all(text=fn))
    # Text containing at least one non-newline character
    print(doc.find_all(text=re.compile('[^\n]')))
    # Text made up only of English letters and spaces
    print(doc.find_all(text=re.compile('^[ a-zA-Z]+$')))
    # Combined with a tag name: returns the matching tags
    print(doc.find_all('a', text=re.compile('^[ a-zA-Z]+$')))
    # Find the matching tags, then print their text
    x = doc.find_all('a', text=re.compile('^[ a-zA-Z]+$'))
    for i in x:
        i: Tag
        print(i.text)

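The limit parameter

Limits the number of results returned

print(doc.find_all(id=True, limit=3))  # the returned list has at most 3 results

The recursive parameter

By default the search recurses through all descendants; set it to False to search direct children only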
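
A minimal sketch of both parameters on an invented fragment with one direct and one nested p:

```python
from bs4 import BeautifulSoup

# Illustrative fragment: one <p> is a direct child of #main, the other is nested
html = '<div id="main"><p>direct</p><div><p>nested</p></div></div>'
soup = BeautifulSoup(html, "html.parser")
main = soup.find(id="main")

all_p = main.find_all("p")                      # searches every descendant
direct_p = main.find_all("p", recursive=False)  # direct children only
first_p = main.find_all("p", limit=1)           # stops after one result

print(len(all_p), len(direct_p), len(first_p))
```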

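Shorthand

find_all() is used so often that it can be left out: calling a tag or the document object directly is equivalent to calling find_all() on it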

import re
from bs4 import BeautifulSoup
from bs4.element import Tag



with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    print(doc('img'))  # list of all img tags; not the same as doc.img
    print(doc.img)  # the first img, depth-first
    print(doc.h3)
    print(doc.h3.find_all(text=True))  # list of text nodes
    print(doc.h3(text=True))  # equivalent to the line above
    print(doc('p', text=True))  # p tags whose .string is not None
    print(doc.find_all('img', attrs={'id': 'bg1'}))
    print(doc('img', attrs={'id': 'bg1'}))  # shorthand for find_all
    print(doc('img', attrs={'id': re.compile('1')}))

The find method

find(name, attrs, recursive, text, **kwargs)

The parameters are almost the same as for find_all.

On a match, find_all returns a list, while find returns a single element object.

On no match, find_all returns an empty list, while find returns None.

print(doc.find('img', attrs={'id': 'bg1'}).attrs.get('src', 'magedu'))
print(doc.find('img', attrs={'id': 'bg1'}).get('src'))  # shorthand for attrs.get
print(doc.find('img', attrs={'id': 'bg1'})['src'])
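
Because find returns None when nothing matches, chaining .get() or [...] directly onto it raises an error for a missing tag. One way to guard against that, sketched on an invented fragment (the video tag is deliberately absent):

```python
from bs4 import BeautifulSoup

# Illustrative fragment modelled on the first <img> in test.html
html = '<div><img id="bg1" src="http://www.baidu.com/"></div>'
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img", attrs={"id": "bg1"})
video = soup.find("video")       # no such tag: find returns None
videos = soup.find_all("video")  # no such tag: find_all returns an empty list

# Check for None before reading attributes
img_src = img.get("src") if img is not None else None
video_src = video.get("src") if video is not None else None

print(img_src, video_src, videos)
```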

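CSS selectors

As in jQuery, nodes can be looked up with CSS selectors

Use doc.select(); it supports most CSS selectors and returns a list. In CSS, a tag name is used as-is, a class name is prefixed with a dot (.), and an id is prefixed with a hash (#).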

import re
from bs4 import BeautifulSoup
from bs4.element import Tag


with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
    # By tag name
    print(doc.select('p'))
    # By id
    print(doc.select('#first'))
    # By class
    print(doc.select('.title'))
    # Grouping (OR), with a comma
    print(doc.select('#first,#bg1'))
    # Descendants, with a space
    print(doc.select('.content p'))
    # The second p
    print(doc.select('.content p:nth-of-type(2)'))
    # + selects the immediately following sibling
    print(doc.select('.content p:nth-of-type(1) + p'))
    # ~ selects any following sibling
    print(doc.select('.content p:nth-of-type(2) ~ input'))
    # Attributes
    print(doc.select('[src],a[href]'))
    print(doc.select('[id=bg2]'))
    # class
    # One of several values
    print(doc.select('[class~=title]'))
    # Contains
    print(doc.select('[class*=tit]'))
    # Starts with
    print(doc.select('[class^=tit]'))
    # Ends with
    print(doc.select('[class$=light]'))
    # Exact match
    print(doc.select('[class="title highlight"]'))
    # Has a src attribute
    print(doc.select('[src]'))
    # src equals "/" (no match in test.html)
    print(doc.select('[src="/"]'))
    # Exact match
    print(doc.select('[src="http://www.baidu.com/"]'))
    # Prefix match
    print(doc.select('[src^="http://www"]'))
    # Suffix match
    print(doc.select('[src$="com/"]'))
    # Substring match
    print(doc.select('img[src*="baidu"]'))
    # Substring match
    print(doc.select('img[src*=".com"]'))
    # A multi-valued attribute that includes title
    print(doc.select('[class~=title]'))
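
The space combinator selects descendants at any depth, while > restricts the match to direct children. A short sketch on an invented fragment, which also shows select_one, the single-result counterpart of select:

```python
from bs4 import BeautifulSoup

# Illustrative fragment: one <p> directly under .content, one nested deeper
html = (
    '<div class="content">'
    '<p id="first">dict</p>'
    '<section><p id="deep">nested</p></section>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

descendant_ids = [p["id"] for p in soup.select(".content p")]  # any depth
child_ids = [p["id"] for p in soup.select(".content > p")]     # direct children

first_match = soup.select_one(".content p")  # first match or None
no_match = soup.select_one(".missing")

print(descendant_ids, child_ids, first_match, no_match)
```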

Appendix: CSS selector reference

https://www.w3school.com.cn/cssref/css_selectors.asp

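Selector	Example	Description
`.class`	`.intro`	Selects all elements with class="intro".
`.class1.class2`	`.name1.name2`	Selects all elements whose class attribute contains both name1 and name2.
`.class1 .class2`	`.name1 .name2`	Selects all elements with class name2 that are descendants of an element with class name1.
`#id`	`#firstname`	Selects the element with id="firstname".
`*`	`*`	Selects all elements.
`element`	`p`	Selects all `<p>` elements.
`element.class`	`p.intro`	Selects all `<p>` elements with class="intro".
`element,element`	`div, p`	Selects all `<div>` elements and all `<p>` elements.
`element element`	`div p`	Selects all `<p>` elements inside `<div>` elements.
`element>element`	`div > p`	Selects all `<p>` elements whose parent is a `<div>` element.
`element+element`	`div + p`	Selects the first `<p>` element placed immediately after a `<div>` element.
`element1~element2`	`p ~ ul`	Selects every `<ul>` element that is preceded by a `<p>` element.
`[attribute]`	`[target]`	Selects all elements with a target attribute.
`[attribute=value]`	`[target=_blank]`	Selects all elements with target="_blank".
`[attribute~=value]`	`[title~=flower]`	Selects all elements whose title attribute contains the word "flower".
`[attribute|=value]`	`[lang|=en]`	Selects all elements whose lang attribute value is "en" or starts with "en-".
`[attribute^=value]`	`a[href^="https"]`	Selects every `<a>` element whose href attribute value starts with "https".
`[attribute$=value]`	`a[href$=".pdf"]`	Selects every `<a>` element whose href attribute value ends with ".pdf".
`[attribute*=value]`	`a[href*="w3schools"]`	Selects every `<a>` element whose href attribute value contains the substring "w3schools".
`:active`	`a:active`	Selects the active link.
`:after`	`p:after`	Inserts content after the content of each `<p>` element.
`:before`	`p:before`	Inserts content before the content of each `<p>` element.
`:checked`	`input:checked`	Selects every checked `<input>` element.
`:default`	`input:default`	Selects the default `<input>` element.
`:disabled`	`input:disabled`	Selects every disabled `<input>` element.
`:empty`	`p:empty`	Selects every `<p>` element that has no children (text nodes count as children).
`:enabled`	`input:enabled`	Selects every enabled `<input>` element.
`:first-child`	`p:first-child`	Selects every `<p>` element that is the first child of its parent.
`:first-letter`	`p:first-letter`	Selects the first letter of every `<p>` element.
`:first-line`	`p:first-line`	Selects the first line of every `<p>` element.
`:first-of-type`	`p:first-of-type`	Selects every `<p>` element that is the first `<p>` element of its parent.
`:focus`	`input:focus`	Selects the input element that has focus.
`:fullscreen`	`:fullscreen`	Selects the element that is in full-screen mode.
`:hover`	`a:hover`	Selects links that the mouse pointer is over.
`:in-range`	`input:in-range`	Selects input elements whose value is within the specified range.
`:indeterminate`	`input:indeterminate`	Selects input elements that are in an indeterminate state.
`:invalid`	`input:invalid`	Selects all input elements with an invalid value.
`:lang(language)`	`p:lang(it)`	Selects every `<p>` element whose lang attribute is "it" (Italian).
`:last-child`	`p:last-child`	Selects every `<p>` element that is the last child of its parent.
`:last-of-type`	`p:last-of-type`	Selects every `<p>` element that is the last `<p>` element of its parent.
`:link`	`a:link`	Selects all unvisited links.
`:not(selector)`	`:not(p)`	Selects every element that is not a `<p>` element.
`:nth-child(n)`	`p:nth-child(2)`	Selects every `<p>` element that is the second child of its parent.
`:nth-last-child(n)`	`p:nth-last-child(2)`	Same as above, counting from the last child.
`:nth-of-type(n)`	`p:nth-of-type(2)`	Selects every `<p>` element that is the second `<p>` element of its parent.
`:nth-last-of-type(n)`	`p:nth-last-of-type(2)`	Same as above, counting from the last child.
`:only-of-type`	`p:only-of-type`	Selects every `<p>` element that is the only `<p>` element of its parent; nothing is selected if there are several.
`:only-child`	`p:only-child`	Selects every `<p>` element that is the only child of its parent; nothing is selected if there are several.
`:optional`	`input:optional`	Selects input elements without a "required" attribute.
`:out-of-range`	`input:out-of-range`	Selects input elements whose value is outside the specified range.
`::placeholder`	`input::placeholder`	Selects the placeholder text of input elements that have a "placeholder" attribute.
`:read-only`	`input:read-only`	Selects input elements with the "readonly" attribute specified.
`:read-write`	`input:read-write`	Selects input elements without the "readonly" attribute.
`:required`	`input:required`	Selects input elements with the "required" attribute specified.
`:root`	`:root`	Selects the root element of the document.
`::selection`	`::selection`	Selects the portion of an element that the user has selected.
`:target`	`#news:target`	Selects the currently active #news element.
`:valid`	`input:valid`	Selects all input elements with a valid value.
`:visited`	`a:visited`	Selects all visited links.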

Getting text content

The point of finding a node is usually to extract its text; the HTML markup is generally not wanted, only the words

from bs4 import BeautifulSoup
from bs4.element import Tag


with open('test.html', encoding='utf8') as f:
    doc = BeautifulSoup(f, 'lxml')
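    # .string returns None unless the tag has exactly one child
    print(doc.div.string)
    # All text under the tag, including line breaks
    print(*doc.div.strings)
    # Without line breaks and empty strings
    print(*doc.div.stripped_strings)
    # text gives all the text as a single string
    print(type(doc.div.text), doc.div.text)
    # Joined with newlines, with surrounding whitespace and blank lines removed
    print(doc.div.get_text(separator='\n', strip=True))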
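
What separator and strip do is easiest to see on a small input. A sketch with an invented fragment, where each string under the tag is stripped, empty results are dropped, and the rest are joined with the separator:

```python
from bs4 import BeautifulSoup

# Illustrative fragment with newlines between the tags
html = '<div>\n<h3><a>python</a> rocks</h3>\n<p>dict</p>\n</div>'
soup = BeautifulSoup(html, "html.parser")

# Plain concatenation of every string, whitespace included
raw_text = soup.div.get_text()
# Strip each string, drop the empty ones, join with a newline
clean_text = soup.div.get_text(separator="\n", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```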

Exercise

from bs4 import BeautifulSoup
import requests

url = "https://www.douban.com/"
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'}
with requests.get(url=url, headers=headers) as res:
    doc = res.content
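    # Hand the fetched page to bs4 and parse it with lxml; the markup can be str or bytes
    doc = BeautifulSoup(doc, 'lxml')
    # Locate the links with a CSS selector and collect each tag's text with a list comprehension
    print([i.text for i in doc.select("div.mod ul li.rec_topics a")])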
    
    
"""
['豆瓣野生艺术摄影大赛', '不开火下厨指南', '凌晨一点钟，你在做什么？', '那些氛围感满满的灯', '我去过最奇特的一条小吃街', '咖啡翻车小剧场']
"""

From： https://www.cnblogs.com/guangdelw/p/17081318.html
