1. Installing the BeautifulSoup library
First check which packages are currently installed in your Python environment.
Then install beautifulsoup.
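The usual commands for these two steps are `pip list` and `pip install beautifulsoup4`. The same check can be sketched from inside Python with the standard library. The helper name `installed_version` is made up for this illustration and is not part of any library.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The distribution is named "beautifulsoup4"; the importable module is "bs4".
version = installed_version("beautifulsoup4")
if version is None:
    print("beautifulsoup4 is not installed; run: pip install beautifulsoup4")
else:
    print("beautifulsoup4", version)
```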
2. Official beautifulsoup documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
3. The main parsers for beautifulsoup
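The parsers most commonly used with beautifulsoup are `html.parser` (bundled with Python), `lxml`, and `html5lib` (both separate installs). The sketch below feeds the same broken fragment to each one and skips any parser that is not installed. The fragment and the `results` dictionary are illustrative only.

```python
from bs4 import BeautifulSoup, FeatureNotFound

snippet = "<p>unclosed paragraph"
results = {}

# html.parser ships with Python; lxml and html5lib are optional third-party installs.
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(snippet, parser)
    except FeatureNotFound:
        print(f"{parser}: not installed")
        continue
    results[parser] = str(soup)
    print(f"{parser}: {soup}")
```

Each parser repairs invalid markup in its own way, so the printed trees can differ. This tutorial uses `html.parser` throughout because it needs no extra installation.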
4. The beautifulsoup find functions
Find the title of the HTML document
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# .string returns the text inside the <title> tag
title_text = bs.title.string
print(title_text)

# Dot access (bs.title) returns only the first matching element
title_tag = bs.title
print("title_tag:" + str(title_tag))
Output: the title text, bobby基本信息, followed by the complete tag, <title>bobby基本信息</title>.
Find a div element in the HTML
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag only
first_div = bs.find("div")
print("first_div:" + str(first_div))
Output: the first div in the document, i.e. the div with id="info", printed with all of its contents.
Find all p elements in the HTML
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# find_all() returns every matching element
p_tags = bs.find_all("p")
print("p:" + str(p_tags))
for p in p_tags:
    print(p.string)
Output: the list of all six p tags, then the text of each one in document order: 讲师信息, 年龄:29, 姓名:bobby, 工作年限:7年, 职位:python开发工程师, 课程信息.
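The difference between `find` and `find_all`, including what each returns when nothing matches, can be sketched on a much smaller document. The three-paragraph snippet here is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p><p>three</p></div>", "html.parser")

first_p = soup.find("p")               # first match only
all_p = soup.find_all("p")             # every match, in document order
two_p = soup.find_all("p", limit=2)    # stop searching after two matches

missing = soup.find("table")           # no match: find() returns None
missing_all = soup.find_all("table")   # no match: find_all() returns an empty list

print(first_p.get_text())                    # one
print([p.get_text() for p in all_p])         # ['one', 'two', 'three']
print(len(two_p), missing, list(missing_all))  # 2 None []
```

Because `find` can return None, it is worth checking the result before accessing attributes such as `.string` on it.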
Find elements by id
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# Filter by id only: any tag whose id is "info"
div_tag4 = bs.find(id="info")
print("div_tag4:" + str(div_tag4))

# Filter by tag name and id together
div_tag5 = bs.find_all("div", id="info")
print("div_tag5:" + str(div_tag5))
Output: find returns the div with id="info" itself; find_all returns a list containing that one div.
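An id lookup can also be written as a CSS selector with `select_one` and `select`, which beautifulsoup provides alongside `find`. This is an alternative the original examples do not use, shown here on a reduced, made-up document.

```python
from bs4 import BeautifulSoup

doc = """
<div id="info">
  <div class="teacher_info">
    <p class="age">29</p>
    <p class="name">bobby</p>
  </div>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

by_find = soup.find(id="info")          # keyword-argument filter
by_select = soup.select_one("#info")    # equivalent CSS selector

print(by_find is by_select)             # True: both return the same Tag object
name_text = soup.select_one("#info p.name").get_text()
print(name_text)                        # bobby
```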
Match elements with a regular expression
import re
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info-955">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# A raw string keeps \d from being treated as a string escape
div_tag = bs.find("div", id=re.compile(r"info-\d+"))
print(div_tag)
Output: the div whose id is info-955, printed with all of its contents.
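A compiled pattern passed as an attribute filter is applied as a search, so it matches anywhere in the attribute value unless it is anchored. The sketch below anchors the pattern with `^` and `$`. The three `div` elements are invented for the example.

```python
import re
from bs4 import BeautifulSoup

doc = """
<div id="info-955">teacher</div>
<div id="info-abc">not numeric</div>
<div id="footer">footer</div>
"""
soup = BeautifulSoup(doc, "html.parser")

# Anchored pattern: the whole id must be "info-" followed by digits
pattern = re.compile(r"^info-\d+$")
matches = soup.find_all("div", id=pattern)
matched_ids = [tag["id"] for tag in matches]
print(matched_ids)  # ['info-955']
```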
Locate an element by its text string
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info-955">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# string= matches text nodes, so the result is the text itself, not the enclosing tag
text_node = bs.find(string="django打造在线教育")
print(text_node)
Output: the matched string, django打造在线教育.
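Since `find(string=...)` returns a text node rather than a tag, a common next step is to walk up with `.parent` to reach the enclosing elements. A minimal sketch on a made-up two-row table (the example.com URLs are placeholders):

```python
import re
from bs4 import BeautifulSoup

doc = """
<table>
  <tr><td>django course</td><td><a href="https://example.com/78">visit</a></td></tr>
  <tr><td>scrapy course</td><td><a href="https://example.com/92">visit</a></td></tr>
</table>
"""
soup = BeautifulSoup(doc, "html.parser")

# string= also accepts a compiled pattern for partial matches
text_node = soup.find(string=re.compile("scrapy"))
cell = text_node.parent    # the enclosing <td>
row = cell.parent          # the enclosing <tr>
link = row.find("a")       # the link in the same table row

print(cell.name, row.name)  # td tr
print(link["href"])         # https://example.com/92
```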
Print the tag names of child elements in the DOM tree
import re
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info-955">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")
div_tag = bs.find("div", id=re.compile(r"info-\d+"))

# .contents is the list of direct children; text nodes have name None and are skipped
children = div_tag.contents
for child in children:
    if child.name:
        print(child.name)

# .descendants walks the whole subtree: children, grandchildren, and so on
descendants = div_tag.descendants
for descendant in descendants:
    if descendant.name:
        print(descendant.name)
Output: first the tag names of the direct children (p, div, p, table), then the tag names of every descendant in the subtree.
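The distinction between direct children and all descendants is easier to see on a tiny tree. The sketch below uses an invented three-tag fragment and filters out text nodes with `isinstance` instead of testing `.name`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = "<div id='box'>\n<p>a</p>\n<ul><li>b</li></ul>\n</div>"
soup = BeautifulSoup(doc, "html.parser")
box = soup.find(id="box")

# Direct children only (whitespace text nodes are filtered out)
child_names = [node.name for node in box.children if isinstance(node, Tag)]
# Every tag in the subtree, depth-first
descendant_names = [node.name for node in box.descendants if isinstance(node, Tag)]
print(child_names)        # ['p', 'ul']
print(descendant_names)   # ['p', 'ul', 'li']

# find_all(True, recursive=False) is another way to get the direct child tags
direct = [tag.name for tag in box.find_all(True, recursive=False)]
print(direct)             # ['p', 'ul']
```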
Print the tag names of the parent elements in the DOM tree
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info-955">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# .parents yields every ancestor, from the nearest one up to the document root
parents = bs.find("p", {"class": "name"}).parents
for parent in parents:
    print(parent.name)
Output: the ancestor names from the innermost outward: div, div, body, html, [document].
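When only one particular ancestor is needed, `find_parent` accepts the same filters as `find` and avoids looping over `.parents` by hand. A short sketch on a reduced, made-up document:

```python
from bs4 import BeautifulSoup

doc = """
<body><div id="info"><div class="teacher_info"><p class="name">bobby</p></div></div></body>
"""
soup = BeautifulSoup(doc, "html.parser")
name_tag = soup.find("p", class_="name")

ancestor_names = [parent.name for parent in name_tag.parents]
print(ancestor_names)  # ['div', 'div', 'body', '[document]']

nearest_div = name_tag.find_parent("div")            # closest enclosing div
outer_div = name_tag.find_parent("div", id="info")   # closest div with id="info"
print(nearest_div["class"], outer_div["id"])         # ['teacher_info'] info
```

The last entry, `[document]`, is the name of the BeautifulSoup object itself, which sits at the root of the tree.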
Print the sibling nodes in the DOM tree
Print the text of the following sibling nodes
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info-955">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# .next_siblings yields every following node at the same level, text nodes included
next_siblings = bs.find("p", {"class": "age"}).next_siblings
for sibling in next_siblings:
    print(sibling.string)
Output: the text of each following sibling (姓名:bobby, 工作年限:7年, 职位:python开发工程师), separated by blank lines because the whitespace between the tags is itself a sibling text node.
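To skip the whitespace text nodes and get only the following tags, `find_next_siblings` can be used in place of the raw `.next_siblings` generator. The three-paragraph fragment below is a made-up reduction of the page above.

```python
from bs4 import BeautifulSoup

doc = """
<div>
  <p class="age">29</p>
  <p class="name">bobby</p>
  <p class="position">developer</p>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")
age_tag = soup.find("p", class_="age")

# .next_siblings includes the whitespace between the tags
raw_count = len(list(age_tag.next_siblings))
print(raw_count)  # 5: three whitespace text nodes and two <p> tags

# find_next_siblings("p") returns only the matching tags
following = [tag.get_text() for tag in age_tag.find_next_siblings("p")]
print(following)  # ['bobby', 'developer']
```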
Print the text of the preceding sibling nodes
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info-955">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# .previous_siblings yields preceding nodes at the same level, nearest first
previous_siblings = bs.find("p", {"class": "name"}).previous_siblings
for sibling in previous_siblings:
    print(sibling.string)
Output: the text of each preceding sibling, nearest first: a whitespace text node, then 年龄:29, then the text node containing Python全栈工程师.
To reach the immediately preceding sibling tag with previous_sibling, the newline between the two tags has to be removed
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info-955">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p><p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# With no whitespace between the two <p> tags, previous_sibling is the age tag itself
previous_sibling = bs.find("p", {"class": "name"}).previous_sibling
print(previous_sibling.string)
Note: in this HTML the newline between the age and name p tags has been removed. With the newline in place, previous_sibling is the whitespace text node between the tags, so nothing visible is printed.
Output: 年龄:29
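Editing the HTML is not always possible. As an alternative to removing the newline, `find_previous_sibling` skips text nodes and returns the nearest preceding tag that matches. A minimal sketch on an invented fragment that keeps the newline between the tags:

```python
from bs4 import BeautifulSoup

doc = """
<div>
  <p class="age">29</p>
  <p class="name">bobby</p>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")
name_tag = soup.find("p", class_="name")

# The raw attribute returns the whitespace text node between the two tags
raw_prev = name_tag.previous_sibling
print(raw_prev.strip() == "")   # True

# find_previous_sibling("p") returns the nearest preceding <p> tag instead
prev_tag = name_tag.find_previous_sibling("p")
print(prev_tag.get_text())      # 29
```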
Get the attribute values of a tag
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info-955">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")
name_tag = bs.find("p", {"class": "name"})

# A tag's attributes can be read like dictionary entries, or with .get()
print(name_tag["class"])
print(name_tag.get("class"))
Output: ['name'] printed twice, because the class attribute is returned as a list.
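The two access styles differ when the attribute is missing: indexing raises `KeyError`, while `.get()` returns None or a supplied default. A small sketch using one of the course links from the page plus a made-up anchor without an `href`:

```python
from bs4 import BeautifulSoup

doc = """
<a href="https://coding.imooc.com/class/78.html">django</a>
<a>no link</a>
"""
soup = BeautifulSoup(doc, "html.parser")
first, second = soup.find_all("a")

print(first["href"])             # https://coding.imooc.com/class/78.html
print(first.attrs)               # all attributes of the tag as a dict
print(second.get("href"))        # None: .get() does not raise for a missing attribute
print(second.get("href", "#"))   # "#" is used as the fallback value
print(second.has_attr("href"))   # False

# second["href"] would raise KeyError because the attribute is absent
```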
Multi-valued attributes
from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>bobby基本信息</title>
    <script src="jquery-3.5.1.min.js"></script>
</head>
<body>
<div id="info-955">
    <p style="color: blue">讲师信息</p>
    <div class="teacher_info">
        Python全栈工程师
        <p class="age">年龄:29</p>
        <p class="name bobbyname" data-bind="bobby">姓名:bobby</p>
        <p class="work_years">工作年限:7年</p>
        <p class="position">职位:python开发工程师</p>
    </div>
    <p style="color:aquamarine">课程信息</p>
    <table class="courses">
        <tbody><tr><th>课程名称</th> <th>讲师</th> <th>地址</th> </tr>
        <tr> <td>django打造在线教育</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/78.html">访问</a></td> </tr>
        <tr> <td>python高级编程</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/200.html">访问</a></td> </tr>
        <tr> <td>scrapy分布式爬虫</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/92.html">访问</a></td> </tr>
        <tr> <td>diango rest framework打造生鲜电商</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/131.html">访问</a></td> </tr>
        <tr> <td>tornado从入门到精通</td> <td>bobby</td> <td><a href="https://coding.imooc.com/class/290.html">访问</a></td> </tr>
        </tbody></table>
</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# Searching for one class matches a tag that carries several classes
name_tag = bs.find("p", {"class": "name"})

# class may hold several values, so it comes back as a list
print(name_tag["class"])
print(name_tag.get("class"))

# data-bind is an ordinary attribute and comes back as a plain string
print(name_tag["data-bind"])
print(name_tag.get("data-bind"))
Output: ['name', 'bobbyname'] printed twice, then bobby printed twice.
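Only attributes that HTML defines as multi-valued, such as `class`, are split into lists. Others, including `id`, stay as one string even if they contain spaces. The sketch below shows this, along with `get_attribute_list` and the `multi_valued_attributes=None` constructor option, which assumes a reasonably recent beautifulsoup4 release. The `id="a b"` value is contrived for the demonstration.

```python
from bs4 import BeautifulSoup

doc = '<p class="name bobbyname" data-bind="bobby" id="a b">bobby</p>'

soup = BeautifulSoup(doc, "html.parser")
tag = soup.p
print(tag["class"])      # ['name', 'bobbyname']: class is multi-valued
print(tag["id"])         # a b: id is single-valued and kept as one string
print(tag["data-bind"])  # bobby

# get_attribute_list() returns a list whether or not the attribute is multi-valued
print(tag.get_attribute_list("data-bind"))  # ['bobby']

# multi_valued_attributes=None disables the splitting entirely
plain = BeautifulSoup(doc, "html.parser", multi_valued_attributes=None)
print(plain.p["class"])  # name bobbyname
```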
From: https://www.cnblogs.com/longlyseul/p/18199675