Concurrent crawler: saving data with motor

Date: 2024-10-12 18:21:19

Tags: item, car, self, get, crawler, concurrency, result, motor, id

import redis
import chardet
import hashlib
import asyncio
import aiohttp
from lxml import etree
from fake_useragent import UserAgent
from motor.motor_asyncio import AsyncIOMotorClient


class CarSpider:
    user_agent = UserAgent()
    redis_client = redis.Redis()

    def __init__(self):
        self.url = 'https://www.che168.com/china/a0_0msdgscncgpi1ltocsp{}exf4x0/?pvareaid=102179#currengpostion'
        self.api_url = 'https://cacheapigo.che168.com/CarProduct/GetParam.ashx?specid={}'
        # Motor collection handle; created in main() so it binds to the running event loop
        self.mongo_client = None

    # Fetch the car ids listed on one result page
    async def get_car_id(self, page, session):
        async with session.get(self.url.format(page), headers={'User-Agent': self.user_agent.random}) as response:
            content = await response.read()
            encoding = chardet.detect(content)['encoding']

            if encoding == 'GB2312' or encoding == 'ISO-8859-1':
                result = content.decode('gbk')
                tree = etree.HTML(result)
                id_list = tree.xpath('//ul[@class="viewlist_ul"]/li/@specid')
                if id_list:
                    # Create one task per car id to fetch its details.
                    # asyncio.create_task uses the running loop, so no global `loop` is needed.
                    tasks = [asyncio.create_task(self.get_car_info(spec_id, session)) for spec_id in id_list]
                    await asyncio.wait(tasks)
                else:
                    print('No ids found...')
            else:
                print('Unexpected page...')

    # Fetch the details of one car
    async def get_car_info(self, spec_id, session):
        async with session.get(self.api_url.format(spec_id), headers={'User-Agent': self.user_agent.random}) as response:
            result = await response.json()
            if result['result'].get('paramtypeitems'):
                item = dict()
                item['name'] = result['result']['paramtypeitems'][0]['paramitems'][0]['value']
                item['price'] = result['result']['paramtypeitems'][0]['paramitems'][1]['value']
                item['brand'] = result['result']['paramtypeitems'][0]['paramitems'][2]['value']
                item['altitude'] = result['result']['paramtypeitems'][1]['paramitems'][2]['value']
                item['breadth'] = result['result']['paramtypeitems'][1]['paramitems'][1]['value']
                item['length'] = result['result']['paramtypeitems'][1]['paramitems'][0]['value']
                await self.save_car_info(item)
            else:
                print('No data available...')

    # Deduplication: fingerprint an item with MD5
    @staticmethod
    def get_md5(dict_item):
        md5 = hashlib.md5()
        md5.update(str(dict_item).encode('utf-8'))
        return md5.hexdigest()

    # Save the item
    async def save_car_info(self, item):
        md5_hash = self.get_md5(item)
        # SADD returns 1 if the hash is new and 0 if it is already in the set.
        # Note: this is the synchronous redis client, so the call briefly blocks the event loop.
        redis_result = self.redis_client.sadd('car:filter', md5_hash)
        if redis_result:
            await self.mongo_client.insert_one(item)
            print('Inserted:', item)
        else:
            print('Duplicate item...')

    async def main(self):
        # Create the Motor collection inside the running loop
        self.mongo_client = AsyncIOMotorClient('localhost', 27017)['py_spider']['car_info']
        try:
            async with aiohttp.ClientSession() as session:
                tasks = [asyncio.create_task(self.get_car_id(page, session)) for page in range(1, 101)]
                await asyncio.wait(tasks)
        finally:
            # Close the Redis connection once the crawl is finished
            self.redis_client.close()


if __name__ == '__main__':
    car_spider = CarSpider()
    asyncio.run(car_spider.main())
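
The deduplication step in the listing hashes `str(dict_item)`, so the fingerprint depends on the order in which the keys were inserted. Below is a minimal sketch of an order-independent variant, not the original author's method. The names `fingerprint` and `InMemoryFilter` are hypothetical; `InMemoryFilter` only stands in for the Redis set `car:filter` so that the example runs without a Redis server.

```python
import hashlib
import json


def fingerprint(item: dict) -> str:
    """Return an MD5 hex digest of the item that does not depend on key order."""
    # sort_keys makes the serialization deterministic; ensure_ascii=False keeps
    # non-ASCII values as-is before they are encoded as UTF-8.
    payload = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(payload.encode('utf-8')).hexdigest()


class InMemoryFilter:
    """Hypothetical stand-in for the Redis set 'car:filter'."""

    def __init__(self):
        self._seen = set()

    def add(self, digest: str) -> bool:
        """Mimic SADD: True if the digest is new, False if it was already present."""
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


seen = InMemoryFilter()
first = {'name': 'Example A', 'price': '10.00'}
same_fields_reordered = {'price': '10.00', 'name': 'Example A'}
different = {'name': 'Example B', 'price': '12.50'}

# Only items whose fingerprint is new would be passed on to insert_one().
results = [seen.add(fingerprint(item)) for item in (first, same_fields_reordered, different)]
print(results)  # [True, False, True]
```

In the crawler itself, the `seen.add(...)` call corresponds to `redis_client.sadd('car:filter', md5_hash)`, which keeps the filter across runs.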

From: https://www.cnblogs.com/kojya/p/18461167
