Web Scraping: Data Parsing, Part 2


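1. Combining XPath with paginated scraping

We continue with the second-hand housing example from last time, where we used XPath to scrape each listing's title, house details, total price, unit price, and approximate address. Here we combine that with paginated scraping to collect more data.

When we open the second-hand housing page it shows page 1 by default, and the request URL is https://cs.lianjia.com/ershoufang/pg1/. If we click the page-2 button at the bottom of the page, the request URL becomes https://cs.lianjia.com/ershoufang/pg2/, as shown:

Page 1:

Page 2:

So the request URL follows the pattern https://cs.lianjia.com/ershoufang/pg{page number}/.

That makes things easy: we can apply the paginated-scraping technique we learned earlier.

Paginated scraping has come up several times before. If you are not familiar with it, have a look at my previous posts.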

from lxml import etree

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    'cookie': "lianjia_uuid=0741e41c-75be-4e7b-9bd0-ee4002203371; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22192752ebbdd1116-0ed7d9ff3e4d3e-26001051-1474560-192752ebbded87%22%2C%22%24device_id%22%3A%22192752ebbdd1116-0ed7d9ff3e4d3e-26001051-1474560-192752ebbded87%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; _ga=GA1.2.1019355060.1728542865; _jzqa=1.489454228011104260.1728542851.1728548018.1728557075.4; _jzqx=1.1728543854.1728557075.1.jzqsr=hip%2Elianjia%2Ecom|jzqct=/.-; _qzja=1.151913534.1728542860751.1728548018319.1728557075265.1728557075265.1728557908126.0.0.0.7.4; _ga_4JBJY7Y7MX=GS1.2.1728557086.3.1.1728557928.0.0.0; lianjia_ssid=f3c8b7c1-375c-4f14-a4c9-acbb379c4cb4; hip=3q1TIMAzuiGCFUUH5zDyksWsjn9m0gEdHu5fF1eVR7-AhrbKmrDXh1c00aDV_L4EOrtKOjOc529AmjCFX-9Cm-5xl1Tc3u5atBBAzWdOdVtFwdT0zN7_tTl0zrqIyxFCzHN-K5dZCDEFX6PljnmkIBqgC5er6ldLmXVRCPJjLW-BVhHj9Au9XKg3Zg%3D%3D; select_city=430100; Hm_lvt_46bf127ac9b856df503ec2dbf942b67e=1728542848,1728543853,1728557075,1728707798; HMACCOUNT=7AB3E94A75916BE3; srcid=eyJ0Ijoie1wiZGF0YVwiOlwiYmIwMmYyYzU1ODZjMjNhZGJjOGVmZTZmYmEyYzVlODRjNTgwZjJmZGZlZTU4MzJhYmM0OTFiMDJiOGVhNDQxNTY4N2NlOWU4ZDQ3OTMxN2ZhYjFlMTczZTg5NzI1ZDg0YjQxZGY4ZWFlOGIxYzg3YzU2MjFlNTZlMWI0OWJjMzI3NmExOTlmOTY0YzhmOWE2ZWFhYWU2NTUyYjAzMmJjNWJiMjNkYmNiODRmNTBhYjg5NmNlOTNmNTA0MmY0ODdkNjg2MDQ5YTk5ODRmNGNmOTUwODkxNmVmOTZjMTdjYmI2MmZmYTI1NDBlYTZkOWU5MDMxNTk4ZjYyZjJlMDk3Y1wiLFwia2V5X2lkXCI6XCIxXCIsXCJzaWduXCI6XCJkZTgyMTgyMlwifSIsInIiOiJodHRwczovL2NzLmxpYW5qaWEuY29tL2Vyc2hvdWZhbmcvcGcxLyIsIm9zIjoid2ViIiwidiI6IjAuMSJ9; Hm_lpvt_46bf127ac9b856df503ec2dbf942b67e=1728709146"
}
count = 0
for page in range(1, 6):
    url = f'https://cs.lianjia.com/ershoufang/pg{page}/'
    res = requests.get(url, headers=headers)
    tree = etree.HTML(res.text)
    lis = tree.xpath('//ul[@class="sellListContent"]/li')  # one <li> per listing, 30 per page
    # print(lis)
    # print(len(lis))
    for li in lis:
        # On the first iteration, li is the element for the first listing.
        # li.xpath() then matches only inside that <li>,
        # provided the expression starts with '.', which means "the current node".
        title = li.xpath('.//div[@class="title"]/a/text()')[0]
        price = li.xpath('.//div[@class="totalPrice totalPrice2"]//text()')
        # [' ', '220', '万'] -> '220万' (万 = 10,000 yuan)
        price = ''.join(price).strip()
        # Unit price, address, and layout information
        # Unit price
        unitPrice = li.xpath('.//div[@class="unitPrice"]/span/text()')[0]
        # Address
        info_ls = li.xpath('.//div[@class="positionInfo"]//text()')
        info_str = ''.join(info_ls)
        info_str = info_str.replace(' ', '')
        # Layout
        houseInfo = li.xpath('.//div[@class="houseInfo"]/text()')[0]
        count += 1
        print(count, title, price, unitPrice, info_str, houseInfo)

    print(f'Page {page}: all data fetched successfully')
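
The comments in the listing above stress starting each inner expression with '.'. The following is a minimal sketch of why that matters, run against a small made-up HTML fragment (not the real page) so it needs no network access:

```python
from lxml import etree

# Made-up fragment that only mimics the shape of the listing page.
html = """
<ul class="sellListContent">
  <li><div class="title"><a>House A</a></div></li>
  <li><div class="title"><a>House B</a></div></li>
</ul>
"""
tree = etree.HTML(html)
first_li = tree.xpath('//ul[@class="sellListContent"]/li')[0]

# './/' searches below the current <li>, so only its own title matches.
relative_titles = first_li.xpath('.//div[@class="title"]/a/text()')
# '//' searches from the document root, even when called on the <li>.
absolute_titles = first_li.xpath('//div[@class="title"]/a/text()')

print(relative_titles)  # ['House A']
print(absolute_titles)  # ['House A', 'House B']
```

Without the leading '.', every iteration would see the titles of all listings, and indexing with [0] would return the first listing's title each time.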

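Here we paginate with a plain for loop rather than an infinite loop that stops once no more data comes back. If you want to explore the while True approach, where you check whether a page still returns data to decide whether to keep crawling, that is left for you to try; the idea is the same as in the earlier scraping posts.

Note: the cookie changes over time. If the scraped data is incomplete or something else looks wrong, replace the cookie. You can find it in the request itself.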
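
One possible shape for the while True strategy mentioned above is sketched here. The names crawl_until_empty, fetch_page, and max_pages are hypothetical, and fake_fetch is a stand-in for "request the page and run the XPath", so the loop logic can be tried without touching the site:

```python
from typing import Callable, List


def crawl_until_empty(fetch_page: Callable[[int], List[str]],
                      max_pages: int = 100) -> List[str]:
    """Collect items page by page until a page returns nothing."""
    results: List[str] = []
    page = 1
    while True:
        items = fetch_page(page)
        # Stop on an empty page; max_pages is a safety cap against endless loops.
        if not items or page > max_pages:
            break
        results.extend(items)
        page += 1
    return results


def fake_fetch(page: int) -> List[str]:
    """Hypothetical fetcher: two pages of data, then nothing."""
    data = {1: ['a', 'b'], 2: ['c']}
    return data.get(page, [])


print(crawl_until_empty(fake_fetch))  # ['a', 'b', 'c']
```

In a real crawler, fetch_page would wrap requests.get and tree.xpath and return the list of li elements for that page.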

Finding the cookie:

Open the headers of the corresponding request, scroll down to cookie, and copy the cookie together with its value into the headers dict in the code.
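
Because the cookie has to be swapped out regularly, it can help to keep it out of the source file. The sketch below is only one option, assuming you store the copied cookie in an environment variable; the variable name LIANJIA_COOKIE and the helper build_headers are made up for illustration:

```python
import os

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36')


def build_headers(cookie: str) -> dict:
    """Build request headers; the cookie entry is added only when one is given."""
    headers = {'user-agent': USER_AGENT}
    if cookie:
        headers['cookie'] = cookie
    return headers


# LIANJIA_COOKIE is a hypothetical variable name; set it to the value copied from the browser.
headers = build_headers(os.environ.get('LIANJIA_COOKIE', ''))
```

Updating the cookie then means changing the environment variable rather than editing the script.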

Result:

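2. Finding housing by city with XPath

The URL is https://www.lianjia.com/city/. We first send a request to this URL and define a dictionary to hold the city data. As before, pagination uses a simple for loop.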

import requests
from lxml import etree

city_name_input = input('Enter the city whose listings you want to search: ')  # e.g. 长沙
# Fetch the list of all cities
city_url = 'https://www.lianjia.com/city/'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    'cookie': 'lianjia_uuid=0741e41c-75be-4e7b-9bd0-ee4002203371; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22192752ebbdd1116-0ed7d9ff3e4d3e-26001051-1474560-192752ebbded87%22%2C%22%24device_id%22%3A%22192752ebbdd1116-0ed7d9ff3e4d3e-26001051-1474560-192752ebbded87%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; _ga=GA1.2.1019355060.1728542865; _jzqx=1.1728543854.1728557075.1.jzqsr=hip%2Elianjia%2Ecom|jzqct=/.-; _ga_4JBJY7Y7MX=GS1.2.1728557086.3.1.1728557928.0.0.0; hip=3q1TIMAzuiGCFUUH5zDyksWsjn9m0gEdHu5fF1eVR7-AhrbKmrDXh1c00aDV_L4EOrtKOjOc529AmjCFX-9Cm-5xl1Tc3u5atBBAzWdOdVtFwdT0zN7_tTl0zrqIyxFCzHN-K5dZCDEFX6PljnmkIBqgC5er6ldLmXVRCPJjLW-BVhHj9Au9XKg3Zg%3D%3D; select_city=430100; Hm_lvt_46bf127ac9b856df503ec2dbf942b67e=1728542848,1728543853,1728557075,1728707798; HMACCOUNT=7AB3E94A75916BE3; lianjia_ssid=1f3035d0-cd6a-4fb7-8f3f-f936defe9d17; Hm_lpvt_46bf127ac9b856df503ec2dbf942b67e=1728714289; _jzqa=1.489454228011104260.1728542851.1728557075.1728714311.5; _jzqc=1; _jzqckmp=1; _qzja=1.1533727166.1728714310904.1728714310904.1728714310905.1728714310904.1728714310905.0.0.0.1.1; _qzjc=1; _qzjto=1.1.0; srcid=eyJ0Ijoie1wiZGF0YVwiOlwiYmIwMmYyYzU1ODZjMjNhZGJjOGVmZTZmYmEyYzVlODRjMTY3MmI3ZmM5ZmI5YmMxZGQzNDQwMzFhYjZmZmQ4MDIzOWY1NTM0ZDJhNmY0MzQ5YzRkZmRhOWVkZGJkNjVhMTNiNDA5MGY1ZmY2ZDNlZGMwYzM4YjFkZmEwNDNmMzhlMjVjNTgxOGJiNWRhMGE4N2ZkNGNjYjhlYzMxN2Q2NmY5ZDFjMTVjMGY0OWQ0OWYwZWNiMDAxZjYxZGMxOWQ0NGIxMDZkZmQ3ZGM0OTkzMzJiYTQwYzVkM2RmYTU4ZWFjYWRjZjYzYzA4MDkwNjgxM2EzZmQxYjg3OTQ0MmYxMFwiLFwia2V5X2lkXCI6XCIxXCIsXCJzaWduXCI6XCJkMGRmMTgyM1wifSIsInIiOiJodHRwczovL3d3dy5saWFuamlhLmNvbS9jaXR5LyIsIm9zIjoid2ViIiwidiI6IjAuMSJ9; _jzqb=1.1.10.1728714311.1; _qzjb=1.1728714310904.1.0.0.0; _gid=GA1.2.1870308340.1728714317; '
              '_ga_TJZVFLS7KV=GS1.2.1728714317.1.0.1728714317.0.0.0; _ga_WLZSQZX7DE=GS1.2.1728714317.1.0.1728714317.0.0.0'
}
city_res = requests.get(city_url, headers=headers)
# Parse each city's name and URL
# print(city_res.text)
tree = etree.HTML(city_res.text)
# Write the XPath according to the page hierarchy
lis = tree.xpath('//div[@class="city_province"]/ul/li')
# print(len(lis))
# Empty dictionary to hold the city data
city_dict = {}

Extract each city's name and URL, and record them in the dictionary.

for li in lis:
    # City name
    city_name = li.xpath('./a/text()')[0]
    # City URL
    city_url_2 = li.xpath('./a/@href')[0]
    # if city_name_input == city_name:
    # #     send the request   dict = {city name: city URL}
    # print(city_name,city_url_2)
    # Adding an entry to a dictionary: dict_name[key] = value
    city_dict[city_name] = city_url_2
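
To see the name-to-URL mapping in isolation, here is a small sketch that runs the same two XPath expressions against a made-up fragment shaped like the city page. The two cities and their URLs are sample values chosen for illustration, not scraped data:

```python
from lxml import etree

# Made-up fragment mimicking the structure targeted by the XPath in the article.
html = """
<div class="city_province"><ul>
  <li><a href="https://cs.lianjia.com/">长沙</a></li>
  <li><a href="https://bj.lianjia.com/">北京</a></li>
</ul></div>
"""
tree = etree.HTML(html)

# Build {city name: city URL} in one pass over the <li> elements.
city_dict = {
    li.xpath('./a/text()')[0]: li.xpath('./a/@href')[0]
    for li in tree.xpath('//div[@class="city_province"]/ul/li')
}

print(city_dict['长沙'])          # https://cs.lianjia.com/
print(city_dict.get('Atlantis'))  # None, the key does not exist
```

Using .get() instead of [] is a convenient way to handle a city that is not in the dictionary without raising KeyError.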

Check whether the entered city name is in the dictionary. (Pagination works the same way as before.)

# Check whether the entered city name is in the dictionary
# '长沙' in city_dict tests whether the key exists
count = 0
if city_name_input in city_dict:
    # Look up the city's URL in the dictionary by the entered name
    # (read the value by its key)
    city_url = city_dict[city_name_input]
    print(city_url)
    # Send the requests
    # city_url looks like https://cs.lianjia.com/
    # so each page is {city_url}ershoufang/pg{page}/
    for page in range(1, 6):  # pagination
        city_res = requests.get(f'{city_url}ershoufang/pg{page}/', headers=headers)
        # Parse the data
        tree = etree.HTML(city_res.text)
        lis = tree.xpath('//ul[@class="sellListContent"]/li')  # one <li> per listing, 30 per page
        # print(lis)
        # print(len(lis))
        for li in lis:
            # On the first iteration, li is the element for the first listing.
            # li.xpath() then matches only inside that <li>,
            # provided the expression starts with '.', which means "the current node".
            title = li.xpath('.//div[@class="title"]/a/text()')[0]
            price = li.xpath('.//div[@class="totalPrice totalPrice2"]//text()')
            # [' ', '220', '万'] -> '220万' (万 = 10,000 yuan)
            price = ''.join(price).strip()
            # Unit price, address, and layout information
            # Unit price
            unitPrice = li.xpath('.//div[@class="unitPrice"]/span/text()')[0]
            # Address
            info_ls = li.xpath('.//div[@class="positionInfo"]//text()')
            info_str = ''.join(info_ls)
            info_str = info_str.replace(' ', '')
            # Layout
            houseInfo = li.xpath('.//div[@class="houseInfo"]/text()')[0]
            count += 1
            print(count, "\nTitle:", title, "\nTotal price:", price, "\nUnit price:", unitPrice, "\nApproximate location:", info_str, "\nHouse info:", houseInfo)

        print(f'Page {page}: all data fetched successfully')
else:
    print('No housing data for this city')
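
The listing above builds each page URL with an f-string, which works because the city URLs end in a slash. As a slightly more defensive sketch (listing_url is a hypothetical helper, not part of the original code), the standard library's urljoin can be used so a missing trailing slash does not break the URL:

```python
from urllib.parse import urljoin


def listing_url(city_url: str, page: int) -> str:
    """Return the second-hand listing URL for one page of a city site."""
    # Normalise the base so it always ends with exactly one '/'.
    base = city_url if city_url.endswith('/') else city_url + '/'
    return urljoin(base, f'ershoufang/pg{page}/')


print(listing_url('https://cs.lianjia.com/', 2))  # https://cs.lianjia.com/ershoufang/pg2/
print(listing_url('https://cs.lianjia.com', 2))   # https://cs.lianjia.com/ershoufang/pg2/
```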

Result:


3. A second way to parse data: BeautifulSoup

Before using BeautifulSoup we need to install the third-party library. Open cmd and run:
pip install beautifulsoup4
Once it is installed, import the module in code:
# Import the BeautifulSoup class
from bs4 import BeautifulSoup
Now we can create and use a BeautifulSoup object.

Load the HTML (we reuse the HTML file saved last time):
with open('链家.html', 'r', encoding='utf-8') as f:
    html_code = f.read()
bs = BeautifulSoup(html_code, 'lxml')
Getting tag objects (bs4)

bs.tag_name returns a tag object
print(bs.title)
# Only the first matching tag is returned
print(bs.div)
Note: this form only ever returns the first matching tag.

bs.find(tag_name) returns a tag object
print(bs.find('title'))
bs.find(tag_name, attribute=value), for example class="title". To filter by class, use class_, because class is a reserved word in Python.
print(bs.find('div',class_='title'))
Add a condition such as href="https://cs.lianjia.com/ershoufang/104113837527.html" to filter tags by href.
print(bs.find('a',href="https://cs.lianjia.com/ershoufang/104113837527.html"))
bs.find_all(tag_name) returns a list-like object; just treat it as a list. (findAll is the older spelling of the same method.)
print(bs.find_all('title'))
print(bs.find_all('a',href="https://cs.lianjia.com/ershoufang/104113837527.html"))
Get every div with class="title" on the page:
print(bs.find_all('div',class_='title'))
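
The difference between find and find_all is easiest to see on a tiny document. This sketch uses a made-up HTML string and the built-in 'html.parser' (an assumption made so it runs without the saved 链家.html file or lxml):

```python
from bs4 import BeautifulSoup

# Made-up HTML, not taken from the real page.
html = """
<div class="title"><a href="/house/1.html">House A</a></div>
<div class="title"><a href="/house/2.html">House B</a></div>
<div class="other">not a title</div>
"""
bs = BeautifulSoup(html, 'html.parser')

first = bs.find('div', class_='title')      # first match only
every = bs.find_all('div', class_='title')  # all matches, list-like
missing = bs.find('div', class_='nope')     # no match

print(first.a.text)  # House A
print(len(every))    # 2
print(missing)       # None
```

Note that find returns None when nothing matches, so accessing an attribute on the result without checking can raise AttributeError.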
Selecting by hierarchy

bs.select(css_selector) returns a list-like object; just treat it as a list
print(bs.select('.totalPrice>span'))
Common usage:
'''
ID selector
<div id="box"></div>
#id_value  ->  #box
bs.select("#box")

Class selector
<div id="box" class="abc"></div>
.class_value  ->  .abc
bs.select(".abc")


Tag name selector
<div id="box" class="abc"></div>
tag name  ->  div
bs.select("div")


Child selector (direct children only)
<div  class="abc">
    <div>
        <div></div>
    </div>
</div>
.abc>div>div
bs.select(".abc>div>div")

Descendant selector (any depth)
<div  class="abc">
    <div>
        <div><div><div></div></div></div>
    </div>
</div>
.abc div
bs.select(".abc div")



Summary:
bs.tag_name
bs.find(tag_name, attribute=value)
bs.find_all(tag_name, attribute=value)
bs.select(css_selector)

'''
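
The child and descendant selectors in the cheat sheet above are easy to confuse. The following sketch counts the matches for each on a made-up fragment (the class names level1 and level2 are invented for illustration, and 'html.parser' is used so it runs without lxml):

```python
from bs4 import BeautifulSoup

# Made-up fragment with two levels of nested <div>.
html = """
<div id="box" class="abc">
  <div class="level1">
    <div class="level2"></div>
  </div>
</div>
"""
bs = BeautifulSoup(html, 'html.parser')

children = bs.select('.abc > div')  # direct children only
descendants = bs.select('.abc div') # every nested div, at any depth
by_id = bs.select('#box')           # ID selector

print([d['class'] for d in children])     # [['level1']]
print([d['class'] for d in descendants])  # [['level1'], ['level2']]
print(len(by_id))                         # 1
```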
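Getting a tag attribute: tag[attribute_name]
# Get a tag attribute: tag[attribute_name]
# .title>a means an <a> directly inside an element with class "title", e.g. <div class="title"><a href="www.baidu.com">link</a></div>
a = bs.select('.title>a')
# print(a)
for i in a:
    # href is an attribute
    print(i['href'])
Result: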

4. Using BeautifulSoup to parse the second-hand housing data

Get the li tags for all listings

from bs4 import BeautifulSoup

with open('链家.html', 'r', encoding='utf-8') as f:
    html_code = f.read()
bs = BeautifulSoup(html_code, 'lxml')

# Get the li tags for all listings
lis = bs.select('.sellListContent>li')

Then pull the fields we want out of each li tag in turn.

count = 1
for li in lis:
    # On the first iteration, li is the first <li> tag
    # Title
    title = li.select('.title>a')[0].text
    # Total price (万 = 10,000 yuan)
    price = li.select('.totalPrice>span')[0].text + '万'
    # Approximate location
    info = li.select('.positionInfo')[0].text.replace(' ', '')
    # House info
    houseInfo = li.select('.houseInfo')[0].text.replace(' ', '')
    # Unit price
    unitPrice = li.select('.unitPrice>span')[0].text.replace(',', '')
    print(count, title, price, info, houseInfo, unitPrice)
    count += 1

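The approach mirrors the XPath version: first select the large container elements, then pull the fields you want from inside each one. This keeps the fields of one listing together and avoids mismatched data.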
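
The container-first idea also makes it straightforward to cope with listings that lack a field. The sketch below is an illustration on made-up HTML, not the author's code: text_or_default is a hypothetical helper built on select_one, and the second listing deliberately has no price.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the second listing has no price element.
html = """
<ul class="sellListContent">
  <li><div class="title"><a href="/1.html">House A</a></div>
      <div class="totalPrice"><span>220</span></div></li>
  <li><div class="title"><a href="/2.html">House B</a></div></li>
</ul>
"""
bs = BeautifulSoup(html, 'html.parser')


def text_or_default(parent, selector, default=''):
    """Return the stripped text of the first match inside parent, or a default."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else default


rows = []
for li in bs.select('.sellListContent > li'):
    rows.append({
        'title': text_or_default(li, '.title > a'),
        'price': text_or_default(li, '.totalPrice > span', default='N/A'),
        'href': li.select_one('.title > a').get('href'),
    })

for row in rows:
    print(row)
```

Because every field is looked up inside its own li, a missing price in one listing cannot shift the prices of the listings after it.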

Complete code:

from bs4 import BeautifulSoup

with open('链家.html', 'r', encoding='utf-8') as f:
    html_code = f.read()
bs = BeautifulSoup(html_code, 'lxml')

# Get the li tags for all listings
lis = bs.select('.sellListContent>li')
count = 1
for li in lis:
    # On the first iteration, li is the first <li> tag
    title = li.select('.title>a')[0].text
    price = li.select('.totalPrice>span')[0].text + '万'
    info = li.select('.positionInfo')[0].text.replace(' ', '')
    houseInfo = li.select('.houseInfo')[0].text.replace(' ', '')
    unitPrice = li.select('.unitPrice>span')[0].text.replace(',', '')
    print(count, title, price, info, houseInfo, unitPrice)
    count += 1

Result:


BeautifulSoup and XPath achieve the same result. In practice, pick whichever you are more comfortable with to scrape the data you need.

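5. Hands-on exercise:

Scrape horoscope information: sign name, date range, overall fortune, love fortune, career and study fortune, wealth fortune, and health fortune, then save the data to a JSON file.

Given: the zodiac image URLs and the request URL

Zodiac image URLs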

img_url = ["http://43.143.122.8/img/白羊座.png", "http://43.143.122.8/img/金牛座.png", "http://43.143.122.8/img/双子座.png",
           "http://43.143.122.8/img/巨蟹座.png",
           "http://43.143.122.8/img/狮子座.png", "http://43.143.122.8/img/处女座.png", "http://43.143.122.8/img/天秤座.png",
           "http://43.143.122.8/img/天蝎座.png",
           "http://43.143.122.8/img/射手座.png", "http://43.143.122.8/img/摩羯座.png", "http://43.143.122.8/img/水瓶座.png",
           "http://43.143.122.8/img/双鱼座.png"]  # zodiac image URLs

Request URL

url = "https://www.1212.com/luck/"

Complete code:

import requests
from bs4 import BeautifulSoup
import urllib3

urllib3.disable_warnings()

headers = {"User-Agent": "Mozilla/5.0(Windows NT 10.0;"
                         "Win64;x64)AppleWebKit/537.36"
                         "(KHTML,like Gecko)Chrome/71.0.3578.9"
                         "8 Safari/537.76", "Connection": "close"}

url = 'https://www.1212.com/luck/'  # index page listing all twelve signs
res = requests.get(url, headers=headers, timeout=40, stream=True, verify=False)
res.encoding = "utf-8"
soup = BeautifulSoup(res.text, 'lxml')
url_list = []

file = open("C:/Users/Administrator/Desktop/zodiac_signs.json", "w", encoding='utf-8')


def write_json_file(L1, L2, L3, L4, L5, L6, L7, L8):
    """
    Write the scraped content to the JSON file
    :param L1: name_list
    :param L2: date_time_list
    :param L3: img_url
    :param L4: comprehensiveFortunes_list
    :param L5: loveFortunes_list
    :param L6: careersAndStudies_list
    :param L7: wealthFortune_list
    :param L8: healthFortunes_list
    """
    for i in range(len(url_list)):
        if i == 0:
            file.write('{\n\t"code":200,\n')
            file.write('\t"msg":"success",\n')
            file.write('\t"total":'+str(len(url_list))+',\n')
            file.write('\t"data":[\n')
        file.write('\t\t{\n\t\t\t"id":' + str(i+1) + ",\n")
        file.write('\t\t\t"name":' + '"' + str(L1[i]) + '"' + ",\n")
        file.write('\t\t\t"dateTime":' + '"' + str(L2[i]) + '"' + ",\n")
        file.write('\t\t\t"imgUrl":' + '"' + str(L3[i]) + '"' + ",\n")
        file.write('\t\t\t"comprehensiveFortunes":' + '"' + str(L4[i]) + '"' + ",\n")
        file.write('\t\t\t"loveFortunes":' + '"' + str(L5[i]) + '"' + ",\n")
        file.write('\t\t\t"careersAndStudies":' + '"' + str(L6[i]) + '"' + ",\n")
        file.write('\t\t\t"wealthFortune":' + '"' + str(L7[i]) + '"' + ",\n")
        file.write('\t\t\t"healthFortunes":' + '"' + str(L8[i]) + '"' + "\n")
        if i != len(url_list) - 1:
            file.write("\t\t},\n")
        else:
            file.write("\t\t}\n")
            file.write("\t]\n")
            file.write("}")


for i in range(12):
    url_list.append("https://www.1212.com" +
                    soup.find_all('div', class_="daily-luck-body")[0].find_all_next('ul')[0].find_all_next('li')[
                        i].find_all_next('a')[0].get("href"))

res.close()

name_list = []  # sign names
date_time_list = []  # sign date ranges
img_url = ["http://43.143.122.8/img/白羊座.png", "http://43.143.122.8/img/金牛座.png", "http://43.143.122.8/img/双子座.png",
           "http://43.143.122.8/img/巨蟹座.png",
           "http://43.143.122.8/img/狮子座.png", "http://43.143.122.8/img/处女座.png", "http://43.143.122.8/img/天秤座.png",
           "http://43.143.122.8/img/天蝎座.png",
           "http://43.143.122.8/img/射手座.png", "http://43.143.122.8/img/摩羯座.png", "http://43.143.122.8/img/水瓶座.png",
           "http://43.143.122.8/img/双鱼座.png"]  # zodiac image URLs
comprehensiveFortunes_list = []  # overall fortune
loveFortunes_list = []  # love fortune
careersAndStudies_list = []  # career and study fortune
wealthFortune_list = []  # wealth fortune
healthFortunes_list = []  # health fortune
for i in range(len(url_list)):
    # Request and parse the detail page of the i-th sign
    res2 = requests.get(url_list[i], headers=headers, timeout=40, verify=False)
    res2.encoding = "utf-8"
    soup2 = BeautifulSoup(res2.text, 'lxml')
    res2.close()
    # The date range is on the index page, which is already parsed into `soup`,
    # so it does not need to be requested again on every iteration.
    name_list.append(soup2.find_all('div', class_="xzxzmenu")[0].find_all_next('p')[0].find_all_next('em')[0].text)
    date_time_list.append(
        soup.find_all('div', class_="daily-luck-body")[0].find_all_next('ul')[0].find_all_next('li')[i].find_all_next(
            'div', class_="con")[0].find_all_next('span', class_="time")[0].text.replace("(", "").replace(")", ""))
    comprehensiveFortunes_list.append(
        soup2.find_all('div', class_="infro-list")[0].find_all_next('dl')[0].find_all_next('div', class_="jzbox")[
            0].text)
    loveFortunes_list.append(
        soup2.find_all('div', class_="infro-list")[0].find_all_next('dl')[1].find_all_next('div', class_="jzbox")[
            0].text)
    careersAndStudies_list.append(
        soup2.find_all('div', class_="infro-list")[0].find_all_next('dl')[2].find_all_next('div', class_="jzbox")[
            0].text)
    wealthFortune_list.append(
        soup2.find_all('div', class_="infro-list")[0].find_all_next('dl')[3].find_all_next('div', class_="jzbox")[
            0].text)
    healthFortunes_list.append(
        soup2.find_all('div', class_="infro-list")[0].find_all_next('dl')[4].find_all_next('div', class_="jzbox")[
            0].text)
print(name_list)
print(date_time_list)
print(comprehensiveFortunes_list)
print(loveFortunes_list)
print(careersAndStudies_list)
print(wealthFortune_list)
print(healthFortunes_list)
write_json_file(name_list, date_time_list, img_url, comprehensiveFortunes_list, loveFortunes_list,
                careersAndStudies_list, wealthFortune_list, healthFortunes_list)
file.close()
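
The write_json_file function above assembles the JSON text by hand, which can produce invalid JSON if a scraped string contains a double quote or a line break. As an alternative sketch (not the author's method), the standard json module can do the escaping. The helper build_payload and the sample values are made up for illustration, and only a few of the fields are shown:

```python
import json


def build_payload(names, dates, overall):
    """Assemble the same top-level structure: code, msg, total, data."""
    data = [
        {'id': i + 1, 'name': n, 'dateTime': d, 'comprehensiveFortunes': t}
        for i, (n, d, t) in enumerate(zip(names, dates, overall))
    ]
    return {'code': 200, 'msg': 'success', 'total': len(data), 'data': data}


# Sample values; the text deliberately contains a quote and a newline.
payload = build_payload(['白羊座'], ['3.21-4.19'], ['He said "go".\nSecond line.'])

# ensure_ascii=False keeps Chinese text readable; indent='\t' matches the tab layout.
text = json.dumps(payload, ensure_ascii=False, indent='\t')
print(text)
```

To write to disk, json.dump(payload, f, ensure_ascii=False, indent='\t') with a file opened using encoding='utf-8' produces the same text.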

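That is everything for Data Parsing, Part 2. If anything is unclear, leave your question in the comments; discussion is welcome!
If I got something wrong, please point it out or get in touch. Let's keep working and improving together.
Learning is a long process: we need to keep studying and digesting what we learn. When something is unclear or a concept feels fuzzy, sort it out promptly, or the problems pile up and the gaps grow wider.
Life is a long road, and Bailu (白鹭) will keep you company along the way!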

Source: https://blog.csdn.net/m0_55297736/article/details/143030583
