
Python Learning Diary: XPath Practice

Date: 2023-08-29 17:01:40
Tags: xpath info Python resp num fromPageTitle print import diary

import requests
from lxml import etree
import re
import random
import traceback
from time import sleep

# url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8700291432374701138&ipn=rj&ct=201326592&is=&fp=result&fr=ala&word=%E8%A1%A8%E6%83%85%E5%8C%85&queryWord=%E8%A1%A8%E6%83%85%E5%8C%85&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=&expermode=&nojc=&isAsync=&pn=390&rn=30&gsm=186'
headers = {
    'Cookie':'winWH=%5E6_1920x963; BDIMGISLOGIN=0; BDqhfp=%E8%A1%A8%E6%83%85%E5%8C%85%26%26NaN-1undefined%26%268772%26%2614; BIDUPSID=47D1A97F74FE4D84D9C060A7E9D9623C; PSTM=1688450494; BAIDUID=64354928A148308F322D02D378FB19A4:FG=1; BAIDUID_BFESS=64354928A148308F322D02D378FB19A4:FG=1; ZFY=r0Ch4DZ4vzKkjKsCTr20yTyvBoJZR:BJjX3:AbIpxAvCs:C; BA_HECTOR=05812la52la52l80008505891ieo2c31p; PSINO=1; H_PS_PSSID=36548_39226_39223_39193_39199_39240_39233_26350_39238_39138_39224_39137_22157_39100; delPer=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; BDRCVFR[tox4WRQ4-Km]=mk3SLVN4HKm; BDRCVFR[A24tJn4Wkd_]=mk3SLVN4HKm',
    'Host':'image.baidu.com',
    'Referer':'https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gb18030&word=%B1%ED%C7%E9%B0%FC&fr=ala&ala=1&alatpl=normal&pos=0&',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
}
for i in range(1350,1440,30):
    sleep(random.uniform(1,2))
    # gsm is sent as the page offset (pn) in hexadecimal, without the 0x prefix
    num = hex(i)[2:]
    url = 'https://image.baidu.com/search/acjson'
    # Query-string parameters; requests URL-encodes them, including the Chinese keyword
    params = {
        'tn': 'resultjson_com',
        'logid': '8700291432374701138',
        'ipn': 'rj',
        'ct': '201326592',
        'is': '',
        'fp': 'result',
        'fr': 'ala',
        'word': '表情包',
        'queryWord': '表情包',
        'cl': '2',
        'lm': '-1',
        'ie': 'utf-8',
        'oe': 'utf-8',
        'adpicid': '',
        'st': '',
        'z': '',
        'ic': '',
        'hd': '',
        'latest': '',
        'copyright': '',
        's': '',
        'se': '',
        'tab': '',
        'width': '',
        'height': '',
        'face': '',
        'istype': '',
        'qc': '',
        'nc': '',
        'expermode': '',
        'nojc': '',
        'isAsync': '',
        'pn': str(i),
        'rn': '30',
        'gsm': num,
    }
    # For a GET request the parameters go in params=; data= would send them as a request body
    resp = requests.get(url, headers=headers, params=params)

    resp_json = resp.json()
    resp_urls = resp_json['data']

    for resp_url in resp_urls:
        sleep(random.uniform(0,1))
        try:
            fromPageTitle = resp_url['fromPageTitle']
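            # Strip characters that are invalid or awkward in file names.
            # "-" is escaped so that "\n-_" is not read as a character range.
            fromPageTitle = re.sub(r'[\\/:*?"<>|\n\-_ ]','',fromPageTitle)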
            fromPageTitle = fromPageTitle[0:15]
            middleURL = resp_url['middleURL']
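            # re.split with a capturing group keeps the word tokens; name[-10] is used as the
            # file extension, which only works while middleURL keeps the same layout
            name = re.split(r'(\w+)',middleURL)
            info = requests.get(middleURL)
            # print(name[-10],fromPageTitle,middleURL)
            with open(str(i)+fromPageTitle+'.'+name[-10],'wb') as f:
                f.write(info.content)
            print(str(i) + fromPageTitle + ' downloaded')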
        except Exception as e:
            traceback.print_exc()
            continue

This is the Baidu meme (表情包) image downloader I wrote yesterday.
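The file-naming step in the downloader above can be isolated into two small helpers. This is only a sketch: `safe_filename` and `guess_extension` are hypothetical names, and `guess_extension` assumes the image URL carries its format in an `f=` parameter (for example `...&f=JPEG?w=500`), which is an assumption about the URL layout rather than a documented guarantee. It falls back to a default extension when that parameter is missing.

```python
import re


def safe_filename(title, max_len=15):
    """Remove characters that are invalid or awkward in file names, then truncate."""
    cleaned = re.sub(r'[\\/:*?"<>|\s\-_]', '', title)
    return cleaned[:max_len]


def guess_extension(image_url, default='jpg'):
    """Return the value of an f= parameter in lower case, or default if absent.

    Assumption: the URL contains something like '&f=JPEG'.
    """
    match = re.search(r'[?&]f=(\w+)', image_url)
    return match.group(1).lower() if match else default


if __name__ == '__main__':
    # Hypothetical URL used only to exercise the helpers
    url = 'https://example.com/it/u=1,2&fm=253&f=JPEG?w=500&h=500'
    print(safe_filename('funny cat / meme: part-1') + '.' + guess_extension(url))
```

Compared with indexing a fixed position in the split URL, a named parameter plus a fallback keeps the download from failing when the URL has a different number of tokens.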

import traceback
# # for i in range(30,600,30):
# #     num = hex(i)
# #     num = num[2:]
# #     print(num)
# # from time import sleep
# # import random
# # for i in range(0,10):
# #     # x = random.uniform(1,2)
# #     sleep(random.uniform(0,1))
# #     print(random.uniform(0,1))
# for i in range(0,5):
#     try:
#         n =1/ i
#         print(n)
#     except Exception as e:
#         traceback.print_exc()


import requests
import csv
from lxml import etree
data = []
for i in range(0,250,25):
    url = f'https://movie.douban.com/top250?start={i}&filter='
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    }
    resp = requests.get(url , headers=headers)
    html = etree.HTML(resp.text)
    html_info = html.xpath('//ol[@class="grid_view"]/li')
    # print(html_info)

    for info in html_info:
        # print(info)
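        dic = {}
        dic['href'] = info.xpath('.//div[@class="hd"]/a/@href')[0]
        dic['title'] = info.xpath('.//div[@class="pic"]/a/img/@alt')[0]
        # On the page layout this script targets, span[4] holds the number of ratings
        # and span[2] the rating score
        dic['pingjiashu'] = info.xpath('.//div[@class="star"]/span[4]/text()')[0]
        dic['content'] = info.xpath('.//div[@class="star"]/span[2]/text()')[0]
        data.append(dic)
    # i is the start offset (0, 25, 50, ...), so convert it to a 1-based page number
    print(f'Downloaded page {i // 25 + 1}')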

# Open with 'w' so that re-running the script does not append a second header and duplicate rows
with open('dbmovie.csv','w',encoding='utf-8',newline='') as f:
    w = csv.DictWriter(f,fieldnames=['title','href','pingjiashu','content'])
    w.writeheader()
    w.writerows(data)

XPath practice
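In the script above, every `info.xpath(...)[0]` raises `IndexError` if an entry lacks the expected node, which would abort the whole run. The sketch below shows one way to guard each lookup with a default. To stay runnable without third-party packages it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, instead of lxml; with lxml the equivalent guard would check whether the list returned by `xpath()` is empty. The sample markup, `text_or_default`, and `parse_items` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one results page (not real site data)
SAMPLE = """
<body>
  <ol class="grid_view">
    <li>
      <div class="hd"><a href="https://example.com/subject/1/">Movie A</a></div>
      <div class="star"><span/><span>9.0</span><span/><span>1000 ratings</span></div>
    </li>
    <li>
      <div class="hd"><a href="https://example.com/subject/2/">Movie B</a></div>
      <div class="star"><span/><span>8.5</span></div>
    </li>
  </ol>
</body>
"""


def text_or_default(node, path, default=''):
    """Return the stripped text of the first match for path, or default."""
    found = node.find(path)
    if found is None or found.text is None:
        return default
    return found.text.strip()


def parse_items(markup):
    """Extract one dict per <li>, tolerating missing child nodes."""
    root = ET.fromstring(markup)
    rows = []
    for li in root.findall(".//ol[@class='grid_view']/li"):
        link = li.find(".//div[@class='hd']/a")
        rows.append({
            'href': link.get('href') if link is not None else '',
            'content': text_or_default(li, ".//div[@class='star']/span[2]"),
            'pingjiashu': text_or_default(li, ".//div[@class='star']/span[4]"),
        })
    return rows


if __name__ == '__main__':
    for row in parse_items(SAMPLE):
        print(row)
```

The second sample entry has no fourth `span`, so its `pingjiashu` field falls back to an empty string instead of stopping the loop.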

From: https://blog.51cto.com/u_2469839/7278171
