7 - Python Data Parsing: Parsing with XPath

Date: 2024-12-19 10:29:46
Tags: xpath, python, parsing


Preface

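The response data that Python receives comes in several types:
    1. Bytes (images, video, music, ...): res.content
    2. JSON data (parsed into a dict): res.json()
    3. HTML-structured data (parsed with regular expressions or XPath expressions): res.text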

I. Installing the lxml module

pip install lxml
lxml is a parser library for HTML (and XML) documents.
With the parser you can extract the target data you want from HTML data.

II. Usage steps

1. Import the library and get the tree object

from lxml import etree
tree = etree.HTML(res.text)  # the return value is an Element object

The code is as follows (example):

from lxml import etree
import requests

url = 'https://cs.lianjia.com/ershoufang/rs/'
headers = {
'accept-encoding':'gzip, deflate, br, zstd',
'accept-language':'zh-CN,zh;q=0.9,en;q=0.8',
'cache-control':'no-cache',
'connection':'keep-alive',
# Paste the cookie string copied from your own browser session here.
# (The original value was a long session-specific string broken across two lines, which is a syntax error.)
'cookie':'<your cookie string>',
'host':'cs.lianjia.com',
'pragma':'no-cache',
'sec-ch-ua':'"Chromium";v="128", "Not;A=Brand";v="24", "Google Chrome";v="128"',
'sec-ch-ua-mobile':'?0',
'sec-ch-ua-platform':'"Windows"',
'sec-fetch-dest':'document',
'sec-fetch-mode':'navigate',
'sec-fetch-site':'none',
'sec-fetch-user':'?1',
'upgrade-insecure-requests':'1',
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36',
}
res = requests.get(url,headers=headers)
# Before parsing, make sure the data you received is actually correct
print(res.text)
tree = etree.HTML(res.text)  # HTML() takes the HTML string; the return value is an Element object

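2. Parsing the HTML data: XPath usage rules

Call the xpath method on the tree:
tree.xpath('XPath expression')
For expressions that select nodes, the xpath method returns a list.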
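As a minimal sketch of the call above, the snippet below parses a small hand-written HTML string instead of a live response, so it runs without any network access. The sample markup and the variable names are assumptions made for illustration only.

```python
from lxml import etree

# Hypothetical sample page standing in for res.text
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div id="box"><a href="/a">first</a><a href="/b">second</a></div>
  </body>
</html>
"""

tree = etree.HTML(html)          # parse the string into an Element tree
links = tree.xpath('//a')        # node query: a list of Element objects
missing = tree.xpath('//table')  # nothing matches: an empty list, not None

print(type(links), len(links))
print(missing)
```

Because a query with no match gives back an empty list, it is worth checking the list before indexing into it with [0].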

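(1) Using slashes

print(tree.xpath('/html')) # [the html element object]
A leading single slash: the XPath expression must start matching from the root tag.
print(tree.xpath('/html/head/title'))
A non-leading single slash: exactly one level down. Here title is a direct child of head, which is a direct child of html (/html/title would match nothing, because title is not a direct child of html).
print(tree.xpath('//title'))
A leading double slash: finds title anywhere in the document, no matter whose child it is.
print(tree.xpath('//div//a'))
A non-leading double slash stands for descendants (children, grandchildren, great-grandchildren, ...).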
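The four slash forms can be compared side by side on a small document. This is only a sketch: the HTML string below is made up for the demonstration and is not taken from the Lianjia page.

```python
from lxml import etree

# Hypothetical sample page
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div><p><a href="/deep">nested link</a></p></div>
    <a href="/top">top-level link</a>
  </body>
</html>
"""
tree = etree.HTML(html)

root = tree.xpath('/html')                  # absolute path starting at the root
title_abs = tree.xpath('/html/head/title')  # one level per single slash
title_any = tree.xpath('//title')           # anywhere in the document
child_links = tree.xpath('//div/a')         # a must be a direct child of div: no match here
desc_links = tree.xpath('//div//a')         # a anywhere below div: one match

print(root[0].tag, title_abs[0].text)
print(len(title_any), len(child_links), len(desc_links))
```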

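(2) Finding tags by attribute

An opening tag has the form <tag attribute-name="attribute-value">.

Usage: tag-name[@attribute-name="attribute-value"]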

print(len(tree.xpath('//a[@data-log_index="1"]')))
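A short sketch of attribute predicates on made-up markup follows. One point worth seeing in practice: @class="..." compares against the whole attribute string, so a tag carrying several class names only matches when the full string is given, or when contains() is used instead. The sample HTML is an assumption for illustration.

```python
from lxml import etree

# Hypothetical sample fragment
html = """
<div>
  <a data-log_index="1" href="/house/1">House 1</a>
  <a data-log_index="2" href="/house/2">House 2</a>
  <a href="/about">About</a>
  <div class="totalPrice totalPrice2">120</div>
</div>
"""
tree = etree.HTML(html)

first = tree.xpath('//a[@data-log_index="1"]')  # attribute equals a value
with_attr = tree.xpath('//a[@data-log_index]')  # attribute merely present

exact_partial = tree.xpath('//div[@class="totalPrice"]')           # whole string differs: no match
exact_full = tree.xpath('//div[@class="totalPrice totalPrice2"]')  # whole string equal: match
by_contains = tree.xpath('//div[contains(@class, "totalPrice")]')  # substring test: match

print(len(first), len(with_attr))
print(len(exact_partial), len(exact_full), len(by_contains))
```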

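(3) Getting tag text: text()

This gets the text content of every matching span tag:
print(tree.xpath("//div[@class='unitPrice']/span/text()"))
/text() takes only the text directly inside the tag; //text() also takes the text of all descendant tags.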
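The difference between /text() and //text() is easiest to see on a tag that mixes its own text with child tags. The fragment below is a made-up stand-in, not the real page markup.

```python
from lxml import etree

# Hypothetical sample fragment
html = """
<div class="unitPrice"><span>12,345 yuan/sqm</span></div>
<div class="positionInfo">Area: <a href="/x">Riverside</a> - <a href="/y">Downtown</a></div>
"""
tree = etree.HTML(html)

unit = tree.xpath("//div[@class='unitPrice']/span/text()")
direct = tree.xpath("//div[@class='positionInfo']/text()")     # only the div's own text nodes
all_text = tree.xpath("//div[@class='positionInfo']//text()")  # the div's text plus its descendants' text

print(unit)
print(direct)
print(''.join(all_text))
```

Joining the //text() result with ''.join(...) is the same step the worked example later uses to rebuild one string from several text pieces.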

(4) Getting a tag's attribute value: @attribute-name

print(tree.xpath("//img[@class='lj-lazy']/@src"))
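A small sketch of attribute extraction on invented markup: /@src at the end of the expression returns the attribute values as strings, while calling .get() on an element object is an alternative when you already hold the element.

```python
from lxml import etree

# Hypothetical sample fragment
html = """
<img class="lj-lazy" src="/img/1.jpg" alt="House 1">
<img class="lj-lazy" src="/img/2.jpg" alt="House 2">
<img class="logo" src="/img/logo.png">
"""
tree = etree.HTML(html)

srcs = tree.xpath("//img[@class='lj-lazy']/@src")  # list of attribute values (strings)
logo = tree.xpath("//img[@class='logo']")[0]       # an element object
logo_src = logo.get('src')                         # read an attribute from the element

print(srcs)
print(logo_src)
```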

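(5) Getting just one particular a tag object

There are two ways to do this:

1 - Fetch all matches, then index into the list:
tree.xpath("//div[@id='box']//a")[0]  # list indexing returns the element object itself
2 - Select by position inside the XPath expression; positions start at 1:
tree.xpath("(//div[@id='box']//a)[1]")  # still returns a list (with at most one element)
Note the parentheses: without them, //a[1] matches every a that is the first a within its own parent, so it can return several elements.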
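The sketch below contrasts Python list indexing with XPath positions, including the unparenthesized //a[1] form, on a made-up fragment where the links sit under two different parents.

```python
from lxml import etree

# Hypothetical sample fragment: three links spread over two parent tags
html = """
<div id="box">
  <p><a href="/1">one</a><a href="/2">two</a></p>
  <p><a href="/3">three</a></p>
</div>
"""
tree = etree.HTML(html)

links = tree.xpath("//div[@id='box']//a")
first_by_index = links[0]                          # Python index, starts at 0, gives an element

per_parent = tree.xpath("//div[@id='box']//a[1]")  # first a inside each parent
overall = tree.xpath("(//div[@id='box']//a)[1]")   # first a of the whole result set

print(first_by_index.text)
print([a.text for a in per_parent])
print([a.text for a in overall])
```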

(6) When the target tag has no attributes of its own, how can it be located more precisely?

Locate it through its parent tag:
print(len(tree.xpath("//div[@class='unitPrice']/span")))

III. Worked example

Using the Lianjia site (lianjia.com) as the example:

The code is as follows (example):


import requests
from lxml import etree
headers = {
'accept-encoding':'gzip, deflate, br, zstd',
'accept-language':'zh-CN,zh;q=0.9,en;q=0.8',
'cache-control':'no-cache',
'connection':'keep-alive',
# Paste the cookie string copied from your own browser session here.
# (The original value was a long session-specific string broken across two lines, which is a syntax error.)
'cookie':'<your cookie string>',
'host':'cs.lianjia.com',
'pragma':'no-cache',
'sec-ch-ua':'"Chromium";v="128", "Not;A=Brand";v="24", "Google Chrome";v="128"',
'sec-ch-ua-mobile':'?0',
'sec-ch-ua-platform':'"Windows"',
'sec-fetch-dest':'document',
'sec-fetch-mode':'navigate',
'sec-fetch-site':'none',
'sec-fetch-user':'?1',
'upgrade-insecure-requests':'1',
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36',
}
count = 1
for page in range(1,11):
    url = f'https://cs.lianjia.com/ershoufang/pg{page}/'
    res = requests.get(url, headers=headers)
    tree = etree.HTML(res.text)
    lis = tree.xpath('//ul[@class="sellListContent"]/li')
    for li in lis:
        # On the first iteration, li is the first li element (the first listing)
        # A leading . means "the current element"
        # Title; skip list items that contain no title link instead of raising IndexError
        title = li.xpath('.//div[@class="title"]/a/text()')
        if not title:
            continue
        title = title[0]
        # Price
        price = li.xpath('.//div[@class="totalPrice totalPrice2"]//text()')
        price = ''.join(price).replace(' ', '')
        # House layout
        houseInfo = li.xpath('.//div[@class="houseInfo"]/text()')
        houseInfo = ''.join(houseInfo)
        # Address information
        positionInfo = li.xpath('.//div[@class="positionInfo"]//text()')
        positionInfo = ''.join(positionInfo).replace(' ', '')
        print(count, title, price, houseInfo, positionInfo)
        count += 1
    print(f'Finished page {page}')
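The per-listing loop relies on relative paths that start with a dot. The offline sketch below mirrors that pattern on a hand-written list (the markup only imitates the class names used above and is an assumption) and shows what happens when the dot is left out.

```python
from lxml import etree

# Hypothetical markup that imitates the structure queried in the loop above
html = """
<ul class="sellListContent">
  <li><div class="title"><a>Flat A</a></div><div class="totalPrice totalPrice2"><span>120</span><i>wan</i></div></li>
  <li><div class="title"><a>Flat B</a></div><div class="totalPrice totalPrice2"><span>95</span><i>wan</i></div></li>
</ul>
"""
tree = etree.HTML(html)

rows = []
for li in tree.xpath('//ul[@class="sellListContent"]/li'):
    # './/' searches only inside the current li
    title = li.xpath('.//div[@class="title"]/a/text()')[0]
    price = ''.join(li.xpath('.//div[@class="totalPrice totalPrice2"]//text()')).replace(' ', '')
    rows.append((title, price))

# Without the leading dot, '//' restarts from the document root,
# so the query returns the titles of every listing, not just this li's
first_li = tree.xpath('//ul[@class="sellListContent"]/li')[0]
leaked = first_li.xpath('//div[@class="title"]/a/text()')

print(rows)
print(leaked)
```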


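Summary

One special case to watch for:
Sometimes a tag can be found correctly in the browser's Elements panel, yet the same XPath expression finds nothing in Python.
The content shown in the Elements panel is assembled from the responses to many requests (and from JavaScript that runs afterwards), whereas Python sends a single request. Write your XPath against the response content that Python actually received.

This article is a brief introduction to using XPath. XPath is a navigator for web page data: a concise syntax with a lot of selecting power.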
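One way to check for this situation is to run the same expression against the raw response text before trusting what the Elements panel shows. The sketch below uses two hand-written strings to stand in for "what the server sent" and "what the browser shows after scripts ran"; the helper name has_nodes and both strings are hypothetical.

```python
from lxml import etree

def has_nodes(html_text, expression):
    """Return True if the XPath expression matches anything in the given HTML text."""
    tree = etree.HTML(html_text)
    return bool(tree.xpath(expression))

# Hypothetical raw response: the list is not in the HTML yet
raw_response = '<html><body><div id="app"></div><script>/* fills #app later */</script></body></html>'
# Hypothetical DOM as the browser shows it after scripts have run
rendered_dom = '<html><body><div id="app"><ul class="list"><li>item</li></ul></div></body></html>'

expression = '//ul[@class="list"]/li'
print(has_nodes(raw_response, expression))
print(has_nodes(rendered_dom, expression))
```

If the expression only matches the rendered version, the data most likely arrives through a separate request, which can usually be found in the browser's Network panel.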

From: https://blog.csdn.net/2401_87633706/article/details/144473636
