3.3 Learning Web Scraping Step by Step (3): Parsing Web Pages with pyquery
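I had not planned to copy this part out, but after looking at how the library is used and how many important features it has, I am recording it here for my own reference; the book is thick and far less convenient than an app. The BeautifulSoup methods from the previous post were awkward in several places, so this post covers what pyquery offers, especially its CSS selectors.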
3.3.1 Preparation
- As usual, install the library first:
pip3 install pyquery
3.3.2 Initialization
- To parse HTML text with pyquery, first initialize the page as a PyQuery object.
Initializing from a string
# -*- coding: UTF-8 -*-
html = '''
<html>
<body>
<div id="container">
<ul class="list">
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
</ul>
</div>
</body>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))
- The code above passes the CSS selector li, which selects all li nodes.
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
Initializing from a URL
from pyquery import PyQuery as pq
doc = pq(url='https://cuiqingcai.com')
print(doc('title'))
- Output:
<title>静觅丨崔庆才的个人站点 - Python爬虫教程</title>
Initializing from a file
- Besides the two approaches above, you can also pass a local file name.
doc = pq(filename='demo.html')
print(doc('li'))
3.3.3 Basic CSS selectors
html = '''
<html>
<body>
<div id="container">
<ul class="list">
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
</ul>
</div>
</body>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))
print(type(doc('#container .list li')))
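- This code uses a CSS selector to pick the node whose id is container, then the node with class list inside it, and finally all li nodes inside that.
- Building on the code above, you can call the text method to get the content of each node.
for item in doc('#container .list li').items():
    print(item.text())
- Output:
first item
second item
third item
fourth item
fifth item
- This is clearly less work than writing regular expressions.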
3.3.4 Finding nodes
(1) Child nodes
- Continuing with the HTML above, call the find method with a CSS selector to search inside the selected node.
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)
- Output:
<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
</ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
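- find selects every matching node among all descendants, not just direct children, and the result is a PyQuery object.
- To search only direct children, use the children method:
items = doc('.list')
lis = items.children()
print(type(lis))
print(lis)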
(2) Parent nodes
- To get the parent node, replace children with parent.
items = doc('.list')
lis = items.parent()
print(type(lis))
print(lis)
- The result is the enclosing div and its content.
<class 'pyquery.pyquery.PyQuery'>
<div id="container">
<ul class="list">
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
</ul>
</div>
- To get ancestor nodes, use the parents method.
items = doc('.list')
lis = items.parents()
print(type(lis))
print(lis)
- This method returns the parent and every ancestor above it. To narrow the result to one particular ancestor, pass a CSS selector. For example:
# -*- coding: UTF-8 -*-
html = '''
<html>
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-inactive"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
lis = items.parents('.wrap')
print(type(lis))
print(lis)
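- With the selector in place, the other ancestor nodes no longer appear in the result.
(3) Sibling nodes
- Continuing the example, sibling nodes are obtained with the siblings method.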
# -*- coding: UTF-8 -*-
html = '''
<html>
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
lis = li.siblings()
print(type(lis))
print(lis)
- As expected, every sibling except the third item itself is selected.
<class 'pyquery.pyquery.PyQuery'>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
- To pick out particular siblings, apply a CSS selector, as in the ancestor lookup above.
print(lis('.active'))
- Output:
<class 'pyquery.pyquery.PyQuery'>
<li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
3.3.5 Traversing nodes
- Continuing with the example above:
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(lis)
for li in lis:
    print(li, type(li))
- items() returns a generator; iterating over it yields each node as a PyQuery object.
<generator object PyQuery.items at 0x0000021F4CB48190>
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li> <class 'pyquery.pyquery.PyQuery'>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li> <class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li> <class 'pyquery.pyquery.PyQuery'>
<li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li> <class 'pyquery.pyquery.PyQuery'>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li> <class 'pyquery.pyquery.PyQuery'>
(1) Extracting information
- Scraping a page mostly comes down to extracting attributes and text.
(2) Getting attributes
```python
# -*- coding: UTF-8 -*-
html = '''
<html>
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a, type(a))
print(a.attr('href'))
print(a.attr.href)
```
- Output:
<a rel="nofollow" href="link3.html">third item</a> <class 'pyquery.pyquery.PyQuery'>
link3.html
link3.html
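- Both forms of attr return the attribute we want. When the selection contains several nodes, however, attr returns only the first node's attribute, so you have to iterate:
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('a')
for item in a.items():
    print(item.attr.href)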
- Output:
link1.html
link2.html
link3.html
link4.html
link5.html
(3) Getting text
- There are two methods: text() and html().
from pyquery import PyQuery as pq
doc = pq(html)
third = doc('.item-0.active')
print(third)
print(third.text())
print(third.html())
ul = doc('.list')
print(ul.text())
- This produces the following four results.
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
third item
<a rel="nofollow" href="link3.html">third item</a>
first item second item third item fourth item fifth item
- The first is the full markup of the current li node.
- The second is the text content of the current li node.
- The third is the HTML inside the current li node.
- The fourth is all the text under the list node.
3.3.6 Manipulating nodes
(1) addClass and removeClass
- Add removeClass and addClass calls to the code above to see their effect.
doc = pq(html)
third = doc('.item-0.active')
print(third)
third.removeClass('active')
print(third)
third.addClass('active')
print(third)
- The output speaks for itself.
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-0"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
(2) attr, text, and html
```python
# -*- coding: UTF-8 -*-
html = '''
<html>
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
third = doc('.item-0.active')
print(third)
third.attr('name','link')
print(third)
third.text('changed item')
print(third)
third.html('aaaaaaaaaa没有third了')
print(third)
```
- Output:
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-0 active" name="link"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-0 active" name="link">changed item</li>
<li class="item-0 active" name="link">aaaaaaaaaa没有third了</li>
(3) remove
- Here is the code:
html = '''
<div class="wrap">
Hello,World
<p>This is a paragraph.</p>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
- Output:
Hello,World This is a paragraph.
- What if we only want Hello,World?
- Use the remove method to delete the p node first.
wrap.find('p').remove()
print(wrap.text())
- With the p node gone, text() returns only Hello,World.
3.3.7 Pseudo-class selectors
# -*- coding: UTF-8 -*-
html = '''
<html>
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0"><a rel="nofollow" href="link1.html">first item</a></li>
<li class="item-1"><a rel="nofollow" href="link2.html">second item</a></li>
<li class="item-0 active"><a rel="nofollow" href="link3.html">third item</a></li>
<li class="item-1 active"><a rel="nofollow" href="link4.html">fourth item</a></li>
<li class="item-0"><a rel="nofollow" href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')
print(li)
li = doc('li:nth-child(2n)')
print(li)
li = doc('li:contains(second)')
print(li)
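- The code above selects, in order: the first li node, the last li node, the second li node, the li nodes after the third one (index greater than 2), the li nodes at even positions, and the li node containing the text second.
3.3.8 Summary
- pyquery has many more powerful features; see the official documentation for details.
- http://pyquery.readthedocs.io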