Tags: story 17 Lacie bs4 Dormouse soup Elsie select
Parsing with bs4
Install -- pip install bs4
Sample code - Alice in Wonderland
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
Pretty-printing the markup with bs4
from bs4 import BeautifulSoup
# use lxml as the parser
soup = BeautifulSoup(html_doc,"lxml")
# pretty-print the markup
print(soup.prettify())
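lxml is the recommended parser because it is faster; note that it is a separate package (pip install lxml). On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not robust enough.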
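The parser choice above can be made defensive. The following is a minimal sketch, assuming you want to prefer lxml but still run where it is not installed; the helper name make_soup is hypothetical and not part of bs4. It relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><head><title>Test</title></head><body><p>hi</p></body></html>")
title_text = soup.title.string
print(title_text)
```

Both parsers produce the same tree for well-formed markup like this; they can differ on broken markup, so pinning one parser keeps results reproducible.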
Ways to navigate the parsed document
print(soup.title)
# <title>The Dormouse's story</title>
print(soup.title.name)
# title
print(soup.title.string)
# The Dormouse's story
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>
print(soup.p['class'])
# ['title']
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Extracting the href attribute of every anchor (a) tag
for link in soup.find_all("a"):
    print(link.get("href"))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
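The same loop can be written as a list comprehension. This is a small sketch on a shortened, made-up variant of the Alice markup in which one link has no href, to show that get("href") returns None for a missing attribute while find_all("a", href=True) skips such tags.

```python
from bs4 import BeautifulSoup

html = """
<p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# .get() returns None when the attribute is absent.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only tags that actually carry an href attribute.
present_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(all_hrefs)
print(present_hrefs)
```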
Getting all the text content
print(soup.get_text())
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
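As the output shows, get_text() keeps the whitespace of the source markup. A minimal sketch of tidying it, using a short made-up fragment: get_text() accepts a separator and a strip flag, and stripped_strings yields the individual pieces.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Elsie, <a>Lacie</a> and <a>Tillie</a> </p>", "html.parser")

# Default: strings are concatenated exactly as they appear.
raw = soup.get_text()

# Strip each string, drop empty ones, and join with a single space.
clean = soup.get_text(" ", strip=True)

# The same stripped pieces as a list.
pieces = list(soup.stripped_strings)

print(repr(raw))
print(clean)
print(pieces)
```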
Navigating the tree
Using the Alice document as the example
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
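Child nodes
A Tag may contain several strings or other Tags; all of these are the Tag's children. Beautiful Soup provides many attributes for working with and iterating over children.
--- The simplest way to navigate the tree is to name the tag you want. To get the head tag, use soup.head: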
soup.head
>>> <head><title>The Dormouse's story</title></head>
soup.title
>>> <title>The Dormouse's story</title>
This shortcut can be chained: you can apply it again to any tag you get back.
--- The code below gets the first p tag inside the body tag:
soup.body.p
>>> <p class="title"><b>The Dormouse's story</b></p>
Attribute-style access returns only the first tag with that name:
soup.a
>>> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
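Dotted access returns one tag at a time. To look at all direct children of a tag, bs4 also offers .contents (a list) and .children (an iterator). A minimal sketch on a made-up fragment, separating child tags from child strings:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><b>one</b>two<i>three</i></p>", "html.parser")
p = soup.p

# .contents is a list of direct children, tags and strings alike.
child_count = len(p.contents)

# .children iterates over the same nodes; isinstance() tells tags from strings.
child_tags = [child.name for child in p.children if isinstance(child, Tag)]
child_strings = [str(child) for child in p.children if not isinstance(child, Tag)]

print(child_count, child_tags, child_strings)
```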
Parent nodes
Every tag or string has a parent: the tag that contains it.
parent
Use the parent attribute to get an element's parent. In the Alice document, the head tag is the parent of the title tag:</p>
<pre><code class="language-python">title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
</code></pre>
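<p>Besides .parent, bs4 provides .parents, which walks all the way up to the root. A minimal sketch on a cut-down document; the top-level BeautifulSoup object reports its name as [document] and has no parent of its own.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
title_tag = soup.title

# .parents yields each ancestor in turn, ending with the BeautifulSoup object.
ancestor_names = [parent.name for parent in title_tag.parents]

# The BeautifulSoup object itself is the root, so its parent is None.
root_parent = soup.parent

print(ancestor_names)
print(root_parent)
```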
<h4 id="兄弟节点">Sibling nodes</h4>
<blockquote>
<p>Consider this code:</p>
</blockquote>
<pre><code class="language-python">soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>", "lxml")
print(soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>
</code></pre>
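<p> --- The <b> tag and the <c> tag are at the same level: they are children of the same element, so <b> and <c> are siblings. When a document is pretty-printed, siblings share the same indentation level. The same relationship can be used in code.</p>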
<p><strong>next_sibling and previous_sibling</strong></p>
<blockquote>
<p><strong>Within the tree, use the next_sibling and previous_sibling attributes to reach an element's siblings:</strong></p>
</blockquote>
<pre><code class="language-python"># next sibling
soup.b.next_sibling
>>> <c>text2</c>
# previous sibling
soup.c.previous_sibling
>>> <b>text1</b>
</code></pre>
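<p>In real documents the next sibling of a tag is often a string (punctuation or a line break) rather than another tag. A minimal sketch on two links from the Alice markup, comparing next_sibling with find_next_sibling(), which skips ahead to the next matching tag:</p>

```python
from bs4 import BeautifulSoup

html = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.find(id="link1")

# The immediate sibling is the text between the two links.
raw_next = first_link.next_sibling

# find_next_sibling() ignores strings and returns the next matching tag.
next_tag = first_link.find_next_sibling("a")

print(repr(raw_next))
print(next_tag)
```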
<h3 id="搜索文档树">Searching the tree</h3>
<blockquote>
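<p>Beautiful Soup defines many search methods; this section focuses on two: <code>find()</code> and <code>find_all()</code>. The other methods take similar arguments and work in much the same way, so what you learn here carries over.</p>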
</blockquote>
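<p>The two methods differ mainly in what they return. A minimal sketch on a made-up fragment: find() gives the first match or None, while find_all() always gives a list, which is empty when nothing matches.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>',
    "html.parser",
)

first_a = soup.find("a")            # a single Tag
all_a = soup.find_all("a")          # a list of Tags

missing_one = soup.find("table")      # no match: None
missing_all = soup.find_all("table")  # no match: empty list

print(first_a["id"], [a["id"] for a in all_a])
print(missing_one, len(missing_all))
```

<p>Because find() can return None, check the result before indexing into it or reading attributes from it.</p>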
<p> <strong>Again using the Alice document as the example</strong></p>
<pre><code class="language-python">html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
</code></pre>
<h4 id="字符串">Strings</h4>
<blockquote>
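<p>The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches tags whose name is exactly that string. The example below finds every <b> tag in the document:</p>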
</blockquote>
<pre><code class="language-python">soup.find_all('b')
>>> [<b>The Dormouse's story</b>]
</code></pre>
<h4 id="列表">Lists</h4>
<blockquote>
<p>If you pass a list, Beautiful Soup returns anything that matches any item in the list. The code below finds every <a> tag and every <b> tag in the document:</p>
</blockquote>
<pre><code class="language-python">soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
</code></pre>
<h4 id="按css搜索">Searching by CSS class</h4>
<blockquote>
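<p>Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. Starting with Beautiful Soup 4.1.1, you can search for tags with a given CSS class through the class_ argument:</p>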
</blockquote>
<pre><code class="language-python">soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
</code></pre>
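<p>A tag may carry several classes at once. The sketch below, on a made-up fragment, shows that class_ matches when any one of the tag's classes equals the given name, and that passing attrs={"class": ...} is an equivalent spelling that avoids the trailing underscore.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="sister big" id="link1">Elsie</a>'
    '<a class="brother" id="link2">Bob</a>',
    "html.parser",
)

# class_ matches against each individual class of a multi-class tag.
by_keyword = soup.find_all("a", class_="sister")

# The same search expressed through the attrs dictionary.
by_attrs = soup.find_all("a", attrs={"class": "sister"})

# The class attribute is multi-valued, so it comes back as a list.
classes = by_keyword[0]["class"]

print([a["id"] for a in by_keyword], classes)
```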
<p>--- The limit argument</p>
<blockquote>
<p>Three tags in the document match the search, but only two are returned because limit caps the number of results:</p>
</blockquote>
<pre><code class="language-python">soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
</code></pre>
<h4 id="css选择器">CSS selectors</h4>
<blockquote>
<p>Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:</p>
</blockquote>
<pre><code class="language-python">soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
</code></pre>
<h5 id="-----通过tag标签逐层查找">--- Descending level by level by tag name</h5>
<pre><code class="language-python">soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]
</code></pre>
<h5 id="-----标签下的直接子标签">--- Direct children of a tag</h5>
<pre><code class="language-python">soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []
</code></pre>
<h5 id="----通过css类名查找">--- Finding by CSS class name</h5>
<pre><code class="language-python">soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
</code></pre>
<h5 id="----通过tag的id查找">--- Finding by tag id</h5>
<pre><code class="language-python">soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
</code></pre>
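<p>The selectors above can be combined, and select_one() is available when only the first match is wanted. A minimal sketch on two links from the Alice markup, assuming a bs4 release recent enough to include select_one(); it pairs each link's text with its href.</p>

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# Tag name, class and child combinator in a single selector.
pairs = [(a.get_text(), a["href"]) for a in soup.select("p.story > a.sister")]

# select_one() returns the first match, or None when nothing matches.
first_sister = soup.select_one("a.sister")
no_match = soup.select_one("a.brother")

print(pairs)
print(first_sister["id"], no_match)
```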
From: https://www.cnblogs.com/blog4lyh/p/16904853.html