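Crawler 2 (Page Parsing and Data Extraction)
HTML pages are commonly processed with XPath: the HTML is first parsed into an XML-style document tree, and XPath expressions are then used to locate nodes or elements in that tree.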
I. HTML and XML
II. XPath
1. XPath path expressions
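The most common path expressions can be tried out directly with lxml. The block below is a minimal sketch; the sample markup and the variable names are hypothetical and serve only as illustration.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
SAMPLE = """
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
  </ul>
</div>
"""

tree = etree.HTML(SAMPLE)

# //li : every <li> element anywhere in the document
all_items = tree.xpath('//li')

# //li/a/@href : the href attribute of each <a> directly under an <li>
hrefs = tree.xpath('//li/a/@href')

# //li[1] : the first <li> among its siblings (XPath positions start at 1)
first_class = tree.xpath('//li[1]/@class')

# //li[last()] : the last <li> among its siblings
last_text = tree.xpath('//li[last()]/a/text()')

# //a[@href="link2.html"] : <a> elements whose href equals the given value
second_text = tree.xpath('//a[@href="link2.html"]/text()')

print(len(all_items), hrefs, first_class, last_text, second_text)
```

Note that `xpath()` always returns a list here, even when only one node matches, so a single value has to be taken with an index such as `[0]`.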
III. The lxml library
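from lxml import etree

# text is assumed to be a string holding the HTML source
html = etree.HTML(text)  # parse the string into an HTML element tree
# print(etree.tostring(html))  # serialize the tree; lxml fills in missing tags
result = html.xpath('//li/a/text()')  # text of the <a> elements directly under <li> elements
r = html.xpath('//li[contains(@class,"item-")]')  # <li> elements whose class attribute contains "item-"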
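To see what these calls return, the same statements can be run against a small fragment. This is only a sketch: the fragment is hypothetical, and it deliberately omits the html/body wrapper and one closing li tag so that the repair done by `etree.HTML` becomes visible.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper, and the last <li> is left unclosed.
text = '''
<ul>
  <li class="item-0"><a href="link1.html">first</a></li>
  <li class="item-1"><a href="link2.html">second</a></li>
  <li class="other"><a href="link3.html">third</a>
</ul>
'''

html = etree.HTML(text)

# tostring() returns bytes by default; encoding='unicode' yields a str instead.
fixed = etree.tostring(html, encoding='unicode')

# Text of every <a> that is a direct child of an <li>.
result = html.xpath('//li/a/text()')

# <li> elements whose class attribute contains the substring "item-".
r = html.xpath('//li[contains(@class,"item-")]')
matched_classes = [li.get('class') for li in r]

print(fixed)
print(result)
print(matched_classes)
```

The serialized output should show the added html and body elements and a closing tag for the third li, while the `contains()` query skips the li whose class is "other".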
Case study: downloading the images from a Baidu Tieba page
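import requests
from lxml import etree

index_url = 'https://tieba.baidu.com/p/5475267611'
response = requests.get(index_url).text  # fetch the page source as text
selector = etree.HTML(response)  # parse the source into an element tree
# collect the src attribute of every <img> whose class is exactly "BDE_Image"
image_urls = selector.xpath('//img[@class="BDE_Image"]/@src')
offset = 0
for image_url in image_urls:
    image_content = requests.get(image_url).content  # raw image bytes
    # save the images as 0.jpg, 1.jpg, ... in the current directory
    with open('{}.jpg'.format(offset), 'wb') as f:
        f.write(image_content)
    offset = offset + 1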
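One possible variation of the script above separates URL extraction from downloading, so that the XPath step can be checked without network access. This is a sketch, not the original author's method: the function names `extract_image_urls`, `fetch_bytes`, and `save_images`, the `fetch` parameter, and the 10-second timeout are all assumptions introduced for illustration.

```python
from pathlib import Path

import requests
from lxml import etree


def extract_image_urls(page_html):
    """Return the src of every <img class="BDE_Image"> in the page source."""
    selector = etree.HTML(page_html)
    return selector.xpath('//img[@class="BDE_Image"]/@src')


def fetch_bytes(url, timeout=10):
    """Download a URL and return its body; raise on HTTP error statuses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


def save_images(image_urls, out_dir, fetch=fetch_bytes):
    """Save each image as 0.jpg, 1.jpg, ... in out_dir; return the paths written."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for offset, image_url in enumerate(image_urls):
        target = out_path / '{}.jpg'.format(offset)
        # fetch is injectable so the function can be exercised without a network.
        target.write_bytes(fetch(image_url))
        saved.append(target)
    return saved
```

With these helpers, the whole job could be run as `save_images(extract_image_urls(requests.get(index_url).text), 'images')`, where `'images'` is an arbitrary output directory. Whether the page still serves images under the `BDE_Image` class is not something this sketch can guarantee.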
From: https://www.cnblogs.com/dxmstudy/p/18159758