
03. Taobao


CSV storage
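The spider below borrows Scrapy's `CsvItemExporter` to write plain dicts to disk without running a full Scrapy project. A minimal sketch of just that pattern, on its own (the file name and record here are made up for illustration):

from scrapy.exporters import CsvItemExporter

# Hypothetical file name and record, only to show the exporter lifecycle.
with open('demo.csv', 'wb') as f:  # the exporter expects a binary file object
    exporter = CsvItemExporter(file=f, encoding='utf-8')
    exporter.start_exporting()     # must be called before export_item()
    exporter.export_item({'raw_title': 'sample', 'view_price': '9.90'})
    exporter.finish_exporting()    # flush pending rows before the file closes

The spider follows the same start/export/finish sequence, spread across `__init__`, `parse`, and `__del__`.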

import requests
import re
import json
from scrapy.exporters import CsvItemExporter
from concurrent.futures import ThreadPoolExecutor, as_completed


class TBSpider(object):
    executor = ThreadPoolExecutor(max_workers=8)

    def __init__(self):
        # Product name to search for
        self.user_input = '香水'

        # Request headers: Taobao search pages require a real User-Agent and
        # a logged-in Cookie; fill these in before running
        self.headers = {

        }

        # Running row counter for the exported records
        self.num = 1

        # Set up CSV storage; the exporter writes to a binary file object
        self.file = open(f'{self.user_input}.csv', 'wb')
        self.exporter = CsvItemExporter(file=self.file, include_headers_line=False, encoding='gbk')
        self.exporter.start_exporting()

    """发送请求,获取响应"""
    def parse_start_url(self):
        all_tasks = []
        for i in range(1, 100):
            start_url = f'https://s.taobao.com/search?spm=a21bo.jianhua.201867-main.5.1c1611d9bwOdR9&q={self.user_input}&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=-8&ntoffset=-8&p4ppushleft=2%2C48&s={i*44}'
            try:
                # 发起请求
                response = requests.get(url=start_url, headers=self.headers).content.decode('utf-8')
                task = self.executor.submit(self.parse_start_url, i, 2)
                all_tasks.append(task)

                for result in as_completed(all_tasks):
                    exception = result.exception()
                    if exception:

                        self.parse(response)
            except BaseException as e:
                print(e)

    def parse(self, response):
        # The item data is embedded in the page as a JS object: g_page_config = {...};
        matches = re.findall('g_page_config = (.*?)g_srp_loadCss', response, re.S)
        if matches:
            # Strip the trailing characters left before g_srp_loadCss
            data = matches[0][0:-6]
            json_dict = json.loads(data)

            li = json_dict['mods']['itemlist']['data']['auctions']
            if li:
                for i in li:
                    # Title
                    raw_title = i['raw_title']
                    # Price
                    view_price = i['view_price']
                    # Shipping origin
                    item_loc = i['item_loc']
                    # Comment count
                    comment_count = i['comment_count']
                    # Shop name
                    nick = i['nick']
                    # Shop link (protocol-relative, so prepend https: if missing)
                    shop_link = i['shopLink']
                    if 'https:' not in shop_link:
                        shop_link = 'https:' + shop_link

                    # Assemble the record
                    detail_data = {
                        'num': self.num,
                        'raw_title': raw_title,
                        'view_price': view_price,
                        'item_loc': item_loc,
                        'comment_count': comment_count,
                        'nick': nick,
                        'shopLink': shop_link,
                    }
                    self.num += 1
                    print(detail_data)

                    self.exporter.export_item(detail_data)

    def __del__(self):
        # Close out the CSV file when the spider object is destroyed
        self.exporter.finish_exporting()
        self.file.close()


if __name__ == '__main__':
    s = TBSpider()
    s.parse_start_url()
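The extraction in `parse` hinges on the `g_page_config` blob that Taobao embeds in the search page. A quick way to sanity-check the regex and the `[0:-6]` trim offline, against a made-up snippet rather than a real response (the HTML fragment and field values below are illustrative only, not actual Taobao output):

import re
import json

# Illustrative stand-in for a real search-results page
sample = '''<script>g_page_config = {"mods": {"itemlist": {"data": {"auctions": [
    {"raw_title": "demo", "view_price": "1.00", "item_loc": "上海",
     "comment_count": "0", "nick": "demo-shop", "shopLink": "//shop.example.com"}
]}}}};
    g_srp_loadCss</script>'''

data = re.findall('g_page_config = (.*?)g_srp_loadCss', sample, re.S)[0]
# data[0:-6] drops the ";\n" and indentation captured before g_srp_loadCss
auctions = json.loads(data[0:-6])['mods']['itemlist']['data']['auctions']
print(auctions[0]['raw_title'])  # -> demo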

From: https://www.cnblogs.com/modly/p/16907576.html
