Tags: object, BeautifulSoup, find, soup, tag, document, print, web scraping, node

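I. Introduction to Traversing the Document Tree

1. Overview

In Beautiful Soup, traversing the document tree means visiting and manipulating the parts of an HTML or XML document, such as tags and the strings they contain.

Traversal, also called navigating the tree, means walking the nodes of a Document Object Model (DOM) according to a defined set of rules.

The DOM is a standard programming interface for XML and HTML documents; it represents a parsed document as a tree of nodes and objects.

During traversal you can read the current node, its attributes, and its children, parent, and siblings in order to inspect or modify the document.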

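2. Common document tree traversal algorithms

Common traversal algorithms for HTML or XML document trees include:

Depth-first traversal:
- Pre-order traversal: visit the root node first, then recursively traverse each subtree in pre-order.
- In-order traversal: traverse the left subtree, visit the root node, then traverse the right subtree. This order is defined for binary trees, so it rarely applies to document trees, where a node can have any number of children.
- Post-order traversal: traverse all subtrees first, then visit the root node.
Breadth-first traversal:
- Starting from the root node, traverse the tree level by level, visiting every node on one level before moving to the next.
Recursive traversal:
- Walk the tree with a recursive function. Recursion maps naturally onto depth-first traversal.
Iterative traversal:
- Walk the tree with a loop and an explicit stack (depth-first) or queue (breadth-first).

Choose the algorithm to suit the task. Depth-first traversal is commonly used to find a specific node or path, while breadth-first traversal suits shortest-path searches and level-by-level processing. Recursive traversal is usually concise and easy to read, but it is bounded by the recursion depth limit. Iterative traversal avoids that limit, which makes it a better fit for large or deeply nested document trees.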
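
The iterative depth-first and breadth-first traversals described above can be sketched on a Beautiful Soup tree as follows. This is a minimal illustration, not part of the library: the sample markup and the helper names tag_children, dfs_preorder, and bfs are made up for this example, and html.parser is assumed so that only bs4 needs to be installed.

```python
from collections import deque

from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample document, written without whitespace between tags
# so that the tree holds no whitespace-only text nodes.
html = "<div><p><b>x</b></p><ul><li>y</li><li>z</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")


def tag_children(node):
    """Return only the Tag children of a node, skipping text nodes."""
    return [child for child in node.contents if isinstance(child, Tag)]


def dfs_preorder(root):
    """Iterative depth-first (pre-order) traversal using an explicit stack."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node.name)
        # Push children in reverse so the leftmost child is popped first.
        stack.extend(reversed(tag_children(node)))
    return order


def bfs(root):
    """Iterative breadth-first traversal using a queue."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node.name)
        queue.extend(tag_children(node))
    return order


dfs_order = dfs_preorder(soup.div)
bfs_order = bfs(soup.div)
print(dfs_order)  # ['div', 'p', 'b', 'ul', 'li', 'li']
print(bfs_order)  # ['div', 'p', 'ul', 'b', 'li', 'li']
```

The only difference between the two functions is the container: a stack yields depth-first order, a queue yields level-by-level order.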

II. Syntax for Traversing the Document Tree

First import the BeautifulSoup class and pass the HTML document to its constructor, specifying a parser (lxml here).

from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="first_p"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

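The simplest way to navigate the document tree is to name the tag you want.

1. Accessing a specific Tag object

(1) Getting the tag name

The tag.name attribute returns the name of the current tag.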
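
A short sketch of reading and changing tag.name. The one-line markup is a cut-down, hypothetical version of the sample document, and html.parser is assumed so that only bs4 is required.

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
original_name = tag.name
print(original_name)  # p
print(soup.b.name)    # b
print(soup.name)      # [document]

# The name is writable: renaming the tag changes the generated markup.
tag.name = "div"
renamed = str(tag)
print(renamed)  # <div class="title"><b>The Dormouse's story</b></div>
```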

(2) Getting the tag attributes

The tag.attrs attribute returns the current tag's attributes as a dictionary.
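
A minimal sketch of working with tag.attrs, using one link from the sample document as hypothetical input and assuming html.parser. Note that class is a multi-valued attribute, so Beautiful Soup stores it as a list.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

attrs = link.attrs
print(attrs["href"])      # http://example.com/elsie
print(link["id"])         # link1 (a tag can be indexed like a dictionary)
print(link["class"])      # ['sister']
print(link.get("title"))  # None (missing attribute, no KeyError)
```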

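(3) Getting the tag content

① tag.string

The tag.string attribute returns the text inside the current tag. It only works when the tag has a single child; if the tag has several children, tag.string is None.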
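
The sketch below illustrates when tag.string returns text and when it returns None. The two paragraphs are hypothetical, shortened versions of the sample document, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time <a href="#">Elsie</a> lived.</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

b_string = soup.b.string
title_string = title_p.string  # the only child is <b>, so its string is used
story_string = story_p.string  # three children (text, <a>, text)

print(b_string)      # The Dormouse's story
print(title_string)  # The Dormouse's story
print(story_string)  # None
```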

② tag.strings

The tag.strings attribute yields the text of every descendant of the current tag. It returns a generator.

print(soup.p.strings)
# <generator object Tag._all_strings at 0x120eff370>
print(list(soup.p.strings))
# ["The Dormouse's story"]

③ tag.text

The tag.text attribute returns the text of every descendant of the current tag, concatenated into a single string.
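
A small sketch of tag.text on a tag with several children, where tag.string would be None. The markup is hypothetical and html.parser is assumed. get_text() is the method form of the same operation and accepts an optional separator.

```python
from bs4 import BeautifulSoup

html = '<p class="story">Once upon a time <a href="#">Elsie</a> and <a href="#">Lacie</a> lived.</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

text = p.text
print(text)             # Once upon a time Elsie and Lacie lived.
print(p.string)         # None
print(p.get_text("|"))  # Once upon a time |Elsie| and |Lacie| lived.
```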

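④ tag.stripped_strings

The tag.stripped_strings attribute yields the text of every descendant of the current tag with leading and trailing whitespace removed, skipping strings that consist only of whitespace. It returns a generator.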
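
To make the difference from tag.strings concrete, the sketch below compares the two on an indented, hypothetical paragraph (html.parser assumed).

```python
from bs4 import BeautifulSoup

html = """<p class="story">
  Once upon a time
  <a href="#">Elsie</a>,
  <a href="#">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.p.strings)             # keeps newlines and indentation
clean = list(soup.p.stripped_strings)  # trimmed, whitespace-only strings dropped

print(len(raw))  # 5
print(clean)     # ['Once upon a time', 'Elsie', ',', 'Lacie']
```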

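(4) Nested selection

Nested selection reaches a specific tag by chaining tag names from parent to child.
The examples below then read that tag's text through the text attribute.

print(soup.head.title.text)
# Output: The Dormouse's story

print(soup.body.a.text)
# Output: Elsie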

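2. Child nodes

(1) .contents

A tag's .contents attribute returns the tag's children as a list. The list can hold NavigableString objects as well as Tag objects; each string is a list element in its own right.
Note that even a lone newline \n takes up one position in .contents.
.contents only lists a node's direct children. If a child has children of its own, those grandchildren appear as part of the child rather than as separate elements.
Strings have no .contents attribute, because a string cannot have children.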

from bs4.element import Tag

tag_head = soup.head
head_children = tag_head.contents  # direct children of one node
print(tag_head)
print(head_children)

for p in soup.find_all("p"):  # direct children of every <p> tag
    print(p.contents)


for p in soup.find_all("p"):
    for node in p.contents:
        if isinstance(node, Tag):
            print("Tag object:", node)
        else:
            print("String object:", node)

Output (produced from an indented copy of the HTML, so whitespace-only strings and line breaks may differ from what the html_doc above yields):
"""
<head>
<title>The Dormouse's story</title>
</head>
['\n', <title>The Dormouse's story</title>, '\n']
[<b>The Dormouse's story</b>]
['Once upon a time there were three little sisters; and their names were \n    ', <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, ', \n    ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and \n    ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well.\n  ']
['...']
"""
"""
Tag object: <b>The Dormouse's story</b>
String object: Once upon a time there were three little sisters; and their names were 
    
Tag object: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
String object: , 
    
Tag object: <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
String object:  and 
    
Tag object: <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
String object: ; and they lived at the bottom of a well.
  
String object: ...
"""

(2) .children

.children returns an iterator over the tag's direct children (a list_iterator in the output below), not a list.
Loop over that iterator to visit each child in turn.

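tag_head = soup.head
head_children = tag_head.children    # iterator over the direct children of one node
print(tag_head)
print(head_children)
 
for p in soup.find_all("p"):    # direct children of every <p> tag
    children = p.children  # returned as an iterator, not a list
    print(children)
    print("--------")
    for node in children:
        print(node)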
 
"""
<head>
<title>The Dormouse's story</title>
</head>
<list_iterator object at 0x0000028AD1BB7470>
<list_iterator object at 0x0000028AD1BB74E0>
--------
<b>The Dormouse's story</b>
<list_iterator object at 0x0000028AD1BB7550>
--------
Once upon a time there were three little sisters; and their names were 
    
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
, 
    
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and 
    
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
; and they lived at the bottom of a well.
  
<list_iterator object at 0x0000028AD1BB74E0>
--------
...
"""

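(3) .descendants

The .contents and .children attributes cover only a tag's direct children. If a child has children of its own (grandchildren), they are returned as part of that child.
The .descendants attribute walks all of a tag's descendants recursively.
.descendants returns a generator rather than a list iterator, but it is used the same way as .children: loop over it.
The output contains NavigableString objects as well as Tag objects.
For example, <title>The</title> has one child, the string "The", so "The" is also a descendant of any tag that contains <title>. Calling .contents on <title> returns a list holding that string.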

from bs4.element import Tag

tag_body = soup.body
body_descendants = tag_body.descendants

# Uncomment to print every descendant, strings included (first output block below):
# for node in body_descendants:
#     print(node)

for node in body_descendants:
    if isinstance(node, Tag):
        print("Tag object:", node)


"""
<p class="title" name="first_p"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...
"""
"""
Tag object: <p class="title" name="first_p"><b>The Dormouse's story</b></p>
Tag object: <b>The Dormouse's story</b>
Tag object: <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Tag object: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Tag object: <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Tag object: <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tag object: <p class="story">...</p>
"""

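The difference between direct children and descendants can be checked on the <title> case mentioned in this section. A minimal sketch, assuming html.parser and a hypothetical one-line <head> fragment:

```python
from bs4 import BeautifulSoup

html = "<head><title>The Dormouse's story</title></head>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

direct = head.contents                # only <title>
children = list(head.children)        # same nodes, obtained through the iterator
descendants = list(head.descendants)  # <title> plus the string inside it

print(len(direct))       # 1
print(len(descendants))  # 2
print([type(node).__name__ for node in descendants])  # ['Tag', 'NavigableString']
```
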
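3. Parent nodes

The .parent attribute returns the current tag's parent node.
The .parents attribute yields all of the tag's ancestors, starting with its direct parent and ending with the top-level BeautifulSoup object.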

# Get the parent of the <b> tag
print(soup.b.parent)
print('---------')
# Get the text of the <b> tag's parent
print(soup.b.parent.text)
print('---------')
# .parents is a generator over all ancestors of <b>: parent, grandparent, and so on
print(soup.b.parents)
print('---------')
parents = soup.b.parents
for parent in parents:
    print(parent)
    
"""
<p class="title" name="first_p"><b>The Dormouse's story</b></p>
---------
The Dormouse's story
---------
<generator object PageElement.parents at 0x000001A37C9CC400>
---------
<p class="title" name="first_p"><b>The Dormouse's story</b></p>

<body>
<p class="title" name="first_p"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="first_p"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="first_p"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

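Printing only the names of the ancestors makes the order of .parents easier to see, and shows why the last two entries in the output above look identical: the final ancestor is the BeautifulSoup object itself, which prints like the <html> tag it wraps. A minimal sketch on a hypothetical document, assuming html.parser:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><b>The Dormouse's story</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.b.parent.name
ancestor_names = [ancestor.name for ancestor in soup.b.parents]

print(parent_name)     # p
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
print(soup.parent)     # None (the BeautifulSoup object has no parent)
```
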
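4. Sibling nodes

The .next_sibling attribute returns the next sibling node.
The .previous_sibling attribute returns the previous sibling node.
In addition, .next_siblings and .previous_siblings return generators that yield the following or preceding siblings one by one.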

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="first_p"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

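print(soup.a.next_sibling)  # the next sibling is a NavigableString (a comma plus a newline), not a tag
print("-------------")
print(soup.a.next_sibling.next_sibling)  # skip that string to reach the next <a> tag
print("-------------")
print(soup.a.previous_sibling.previous_sibling)  # None: the string before the first <a> has no previous sibling
print("-------------")
print(list(soup.a.next_siblings))  # all following siblings; next_siblings is a generator, so wrap it in list()
print("-------------")
print(soup.a.previous_siblings)  # prints the generator object itself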

"""
,

-------------
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
-------------
None
-------------
[',\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\nand they lived at the bottom of a well.']
-------------
<generator object PageElement.previous_siblings at 0x00000260D20FC400>
"""

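IV. Searching the Document Tree

1. Finding multiple elements: find_all()

find_all() is the Beautiful Soup method for finding every element in the document that matches a set of conditions. It can filter by tag name, attributes, text content, and more, and it returns a list of all matching elements.

The basic syntax of find_all() is:

find_all(name, attrs, recursive, string, limit, **kwargs)

name: the tag name to look for; it can be a string, a regular expression, a list, and so on.
- find_all(True) matches every tag and returns no string nodes.
- If none of the built-in filters fits, you can define a function that decides whether an element matches.
  - soup.find_all(name=has_class_but_no_id)
attrs: a dictionary of attributes to match, for example {"class": "header"}.
recursive: whether to search all descendants; defaults to True. With recursive=False, only direct children are examined.
string: the text content to match. Older Beautiful Soup versions call this argument text.
limit: the maximum number of results to return.
**kwargs: other keyword arguments, each treated as a filter on the attribute of that name (for example id="link1").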
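
The parameters listed above can be tried out with the sketch below. The markup is hypothetical, html.parser is assumed, and has_class_but_no_id is written out here as one possible definition of the filter function named in the list.

```python
import re

from bs4 import BeautifulSoup

html = """
<div class="content">
<p>This is a paragraph.</p>
<p class="special">This is a special paragraph.</p>
<p class="note" id="n1">This is a note.</p>
<a href="http://example.com/a" id="link1">A</a>
<b>bold</b>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Filter function: keep tags that have a class attribute but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


all_tags = [t.name for t in soup.find_all(True)]              # every tag, no strings
by_list = [t.name for t in soup.find_all(["a", "b"])]         # any name in the list
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]  # names starting with "b"
by_func = [t["class"] for t in soup.find_all(has_class_but_no_id)]
by_attrs = soup.find_all("p", attrs={"class": "note"})
by_kwarg = soup.find_all(id="link1")
limited = soup.find_all("p", limit=2)
top_level = soup.find_all("p", recursive=False)        # <p> is not a direct child of soup
direct = soup.div.find_all("p", recursive=False)       # but it is a direct child of <div>
by_string = soup.find_all(string="bold")               # matches text nodes, not tags

print(all_tags)        # ['div', 'p', 'p', 'p', 'a', 'b']
print(by_func)         # [['content'], ['special']]
print(len(limited))    # 2
print(len(top_level))  # 0
print(len(direct))     # 3
```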

Example:

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>Example</title>
</head>
<body>
<div class="content">
<p>This is a paragraph.</p>
<p class="special">This is a special paragraph.</p>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all <p> tags
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

# Find <p> tags whose class is "special"
special_paragraphs = soup.find_all('p', class_='special')
for p in special_paragraphs:
    print(p.text)

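In this example, find_all() finds every <p> tag in the document and then every <p> tag with the attribute class="special" (passed as the keyword argument class_, because class is a reserved word in Python). Each call returns a list of matching elements for further processing.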

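2. Finding a single element: find()

find() is the Beautiful Soup method for finding the first element in the document that matches a set of conditions. Unlike find_all(), it returns only the first match rather than a list of all matches.

The basic syntax of find() is:

find(name, attrs, recursive, string, **kwargs)

name: the tag name to look for; it can be a string, a regular expression, a list, and so on.
attrs: a dictionary of attributes to match, for example {"class": "header"}.
recursive: whether to search all descendants; defaults to True.
string: the text content to match. Older Beautiful Soup versions call this argument text.
**kwargs: other keyword arguments, each treated as a filter on the attribute of that name.

Example: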

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>Example</title>
</head>
<body>
<div class="content">
<p>This is a paragraph.</p>
<p class="special">This is a special paragraph.</p>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first <p> tag
first_paragraph = soup.find('p')
print(first_paragraph.text)

# Find the first <p> tag whose class is "special"
special_paragraph = soup.find('p', class_='special')
print(special_paragraph.text)

In this example, find() finds the first <p> tag in the document and then the first <p> tag with the attribute class="special". Each call returns the first matching element for further processing.

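3. Comparing find_all() and find()

find_all() and find() are the two most commonly used Beautiful Soup search methods. They differ mainly in the form and number of results they return:

find_all():
- Returns all matching elements as a list.
- Returns an empty list if nothing matches.
- Syntax: find_all(name, attrs, recursive, string, limit, **kwargs)
find():
- Returns the first matching element.
- Returns None if nothing matches.
- Syntax: find(name, attrs, recursive, string, **kwargs)

Summary of the differences:

find_all() returns a list of every match, whereas find() returns only the first match.
Use find() when you only care about the first matching element in the document, and find_all() when you need all of them.

In practice, pick whichever method fits the task at hand.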
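
The difference in return values matters most when nothing matches. The sketch below, on a hypothetical two-paragraph document parsed with html.parser, shows both cases and one way to guard against a None result.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")

all_p = soup.find_all("p")          # list of every match
first_p = soup.find("p")            # first match only
no_tables = soup.find_all("table")  # nothing matches: empty list
no_table = soup.find("table")       # nothing matches: None

print([p.text for p in all_p])  # ['one', 'two']
print(first_p.text)             # one
print(no_tables)                # []
print(no_table)                 # None

# find() gives the same element as find_all(..., limit=1) followed by [0].
same_first = soup.find_all("p", limit=1)[0] is first_p
print(same_first)  # True

# Check for None before reading attributes of a find() result.
table_text = no_table.text if no_table is not None else ""
```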

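V. Summary

Any Tag object supports the attributes and methods below (the entries that begin with "tag").

Purpose	Method	Return value	Description
Get the soup object	BeautifulSoup(markup, "parser")	soup object	The BeautifulSoup constructor parses an HTML or XML document and returns a soup object
Get a Tag object	soup.<tag name>	Tag object	Returns the first matching Tag object
	soup.find("tag name")	Tag object	Returns the first matching Tag object
	soup.find_all("tag name")	list	Returns a list of all matching Tag objects in the document
	soup.select("tag name")	list	Returns a list of all Tag objects in the document that match the CSS selector
Get a Tag's name	tag.name (writable)	string	Returns the tag's name. You can first call find_all() to collect all matching Tag objects
Get a Tag's attrs	tag.attrs (writable)	dictionary	Returns a dictionary of the tag's attributes. You can first call find_all() to collect all matching Tag objects
Get the NavigableString object	tag.string (writable)	NavigableString object	Returns the tag's NavigableString object. You can first call find_all() to collect all matching Tag objects
	tag.strings	generator	Yields every NavigableString under a node, including those of its descendants
Get a Tag's children	tag.contents	list	Returns all direct children of the node as a list, including NavigableString objects
	tag.children	iterator	Returns an iterator over the node's direct children, including NavigableString objects
Get a Tag's descendants	tag.descendants	generator	Returns a generator over all of the node's descendants, including NavigableString objects
Other	.parent		Gets an element's parent node
	.parents		Iterates over all of an element's ancestors
	.next_sibling and .previous_sibling		Get the next or previous sibling node
	.next_siblings and .previous_siblings		Iterate over the siblings after or before the current node
	.next_elements and .previous_elements		Move forward or backward through the document in parse order
	find_parents() and find_parent()		Search the ancestors of the current node
	find_next_siblings()		Returns all matching siblings after the current node
	find_previous_siblings()		Returns all matching siblings before the current node
	find_all_next()		Returns all matching nodes after the current node
	find_all_previous()		Returns all matching nodes before the current node
	new_string()		Creates a new text node that can then be added to the document
	insert()		Inserts an element at the given position
	insert_before()		Inserts content before the current tag or text node
	insert_after()		Inserts content after the current tag or text node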
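
A few of the methods from the table that are not demonstrated earlier can be sketched as follows. The markup is hypothetical and html.parser is assumed; select() relies on the CSS selector support that ships with current Beautiful Soup installations.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="link1")

parent_class = first.find_parent("p")["class"]
later_ids = [a["id"] for a in first.find_next_siblings("a")]
selected_ids = [a["id"] for a in soup.select("p.story > a")]

print(parent_class)  # ['story']
print(later_ids)     # ['link2', 'link3']
print(selected_ids)  # ['link1', 'link2', 'link3']

# new_string() only creates the text node; insert_after() attaches it to the tree.
separator = soup.new_string(", ")
first.insert_after(separator)
result = str(soup.p)
print(result)
# <p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a><a id="link3">Tillie</a></p>
```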

From: https://www.cnblogs.com/xiao01/p/18104296

Web Scraping with BeautifulSoup: Document Tree Operations