
Crawling Kuaishou (KS) comments with the feapder framework: the recursive approach

Posted: 2024-06-07 13:37:35 · Views: 24

Tags: crawling, http, pcursor, feapder, 8089, ks, photoId, 80, data

import random
import re
import time

import feapder
from feapder.db.mysqldb import MysqlDB


def is_number(string):
    """Return True if the string looks like a number (integer, float, or scientific notation)."""
    pattern = re.compile(r'^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$')
    return bool(pattern.match(string))

class AirSpiderDemo(feapder.AirSpider):

    photoId = "3x2nh3ssaispdie"

    db = MysqlDB()

    def start_requests(self):
        url = "https://www.kuaishou.com/graphql"
        data = {
            "operationName": "commentListQuery",
            "variables": {
                "photoId": self.photoId,
                "pcursor": ""
            },
            "query": "query commentListQuery($photoId: String, $pcursor: String) {\n  visionCommentList(photoId: $photoId, pcursor: $pcursor) {\n    commentCount\n    pcursor\n    rootComments {\n      commentId\n      authorId\n      authorName\n      content\n      headurl\n      timestamp\n      likedCount\n      realLikedCount\n      liked\n      status\n      authorLiked\n      subCommentCount\n      subCommentsPcursor\n      subComments {\n        commentId\n        authorId\n        authorName\n        content\n        headurl\n        timestamp\n        likedCount\n        realLikedCount\n        liked\n        status\n        authorLiked\n        replyToUserName\n        replyTo\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n"
        }
        yield feapder.Request(url, json=data, method="POST", download_midware=self._midware)


    def _midware(self, request):
        # Free proxies copied from the original post; public proxies like these
        # expire quickly, so replace the list with your own pool before running.
        ip_list = [
            {'http': '23.226.117.85:8080'}, {'http': '117.69.233.218:8089'}, {'http': '212.127.93.185:8081'},
            {'http': '45.12.30.112:80'}, {'http': '190.82.91.205:999'}, {'http': '94.73.239.124:55443'},
            {'http': '144.255.49.43:9999'}, {'http': '125.99.34.94:8080'}, {'http': '183.164.242.66:8089'},
            {'http': '47.92.6.221:8089'}, {'http': '114.231.41.157:8888'}, {'http': '60.174.1.93:8089'},
            {'http': '103.143.197.19:8080'}, {'http': '194.182.163.117:3128'}, {'http': '117.69.236.113:8089'},
            {'http': '190.107.236.169:999'}, {'http': '104.129.198.41:8800'}, {'http': '119.93.145.82:3128'},
            {'http': '152.66.208.22:80'}, {'http': '190.111.214.234:8181'}, {'http': '104.129.198.217:8800'},
            {'http': '31.197.253.254:48678'}, {'http': '85.62.218.250:3128'}, {'http': '112.243.88.8:9000'},
            {'http': '46.40.6.201:7777'}, {'http': '103.48.68.36:83'}, {'http': '179.1.110.230:8080'},
            {'http': '181.31.225.234:3128'}, {'http': '41.33.66.237:1976'}, {'http': '183.164.243.43:8089'},
            {'http': '137.74.65.101:80'}, {'http': '106.75.86.143:1080'}, {'http': '121.41.87.128:80'},
            {'http': '118.27.33.17:8118'}, {'http': '36.6.144.34:8089'}, {'http': '91.243.192.17:3128'},
            {'http': '112.53.184.170:9091'}, {'http': '119.252.171.50:8080'}, {'http': '193.41.88.58:53281'},
            {'http': '202.80.43.204:8080'}, {'http': '183.166.137.201:41122'}, {'http': '36.6.145.73:8089'},
            {'http': '117.57.93.94:8089'}, {'http': '212.174.242.114:8080'}, {'http': '36.6.145.224:8089'},
            {'http': '183.164.242.146:8089'}, {'http': '92.249.122.108:61778'}, {'http': '88.255.102.37:8080'},
            {'http': '91.226.92.7:80'}, {'http': '115.236.55.186:10100'}, {'http': '198.50.237.23:80'},
            {'http': '104.18.24.139:80'}, {'http': '172.67.181.107:80'}, {'http': '92.249.113.194:55443'},
            {'http': '141.95.241.100:80'}, {'http': '63.239.220.109:8080'}, {'http': '41.220.104.65:8080'},
            {'http': '20.219.180.149:3129'}, {'http': '198.211.117.231:80'}, {'http': '117.69.232.127:8089'},
            {'http': '91.226.92.19:80'}, {'http': '105.112.95.133:8080'}, {'http': '117.71.155.108:8089'},
            {'http': '187.60.219.4:3128'}, {'http': '117.71.149.100:8089'}, {'http': '117.71.133.79:8089'},
        ]
        request.proxies = random.choice(ip_list)
        request.headers = {
            "Origin": "https://www.kuaishou.com",
            "Referer": "https://www.kuaishou.com/short-video/3xn9n6gnnva545m",
            "Sec-Fetch-Site": "same-origin",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        }
        request.cookies = {
            "did": "web_544f660eef074162a0abfbc3bfabbba8",
            "didv": "1717656796000",
            "kpf": "PC_WEB",
            "clientid": "3",
            "kpn": "KUAISHOU_VISION"
        }
        return request

    def parse(self, request, response):
        print(response)
        data = response.json
        # print(data)
        pcursor_id = data['data']['visionCommentList']['pcursor']
        print(pcursor_id)

        rootComments = data['data']['visionCommentList']['rootComments']
        contentlist = []
        for i_dict in rootComments:
            items = {}
            items['content'] = i_dict['content']
            contentlist.append(items)
        print(contentlist)

        # Create the matching table in your database first; only after the table
        # exists can the data be inserted ("ks_table" is the target table name).
        # self.db.add_batch_smart("ks_table", contentlist)

        # Sleep for a moment so the crawler is less likely to be detected.
        random_float = round(random.uniform(1.0, 2.0), 1)
        time.sleep(random_float)

        # If the pcursor in the response is numeric, request the next page;
        # otherwise the comments are exhausted and the spider stops.
        if is_number(pcursor_id):
            url = "https://www.kuaishou.com/graphql"
            data = {
                "operationName": "commentListQuery",
                "variables": {
                    "photoId": self.photoId,
                    "pcursor": pcursor_id
                },
                "query": "query commentListQuery($photoId: String, $pcursor: String) {\n  visionCommentList(photoId: $photoId, pcursor: $pcursor) {\n    commentCount\n    pcursor\n    rootComments {\n      commentId\n      authorId\n      authorName\n      content\n      headurl\n      timestamp\n      likedCount\n      realLikedCount\n      liked\n      status\n      authorLiked\n      subCommentCount\n      subCommentsPcursor\n      subComments {\n        commentId\n        authorId\n        authorName\n        content\n        headurl\n        timestamp\n        likedCount\n        realLikedCount\n        liked\n        status\n        authorLiked\n        replyToUserName\n        replyTo\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n"
            }

            yield feapder.Request(url, json=data, method="POST", callback=self.parse, download_midware=self._midware)
        else:
            print("No more pcursor; crawl finished.")
            return




if __name__ == "__main__":
    AirSpiderDemo(thread_count=1).start()
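The spider above pages through comments by recursively yielding a new request carrying the previous response's pcursor, stopping when the cursor is no longer numeric. The same control flow can be sketched as a plain loop, decoupled from feapder. This is a minimal illustration, not the post's actual code: `crawl_comments` and `fake_fetcher` are hypothetical names, and the fake fetcher only simulates the pcursor handoff between pages (no network involved).

```python
import re


def is_number(string):
    # Same numeric check the spider applies to the pcursor.
    return bool(re.match(r'^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$', string))


def crawl_comments(page_fetcher, photo_id):
    """Collect comment contents page by page until the pcursor is non-numeric.

    page_fetcher(photo_id, pcursor) must return the parsed JSON of one
    commentListQuery response; it stands in for the real GraphQL POST.
    """
    contents, pcursor = [], ""
    while True:
        comment_list = page_fetcher(photo_id, pcursor)['data']['visionCommentList']
        contents.extend(c['content'] for c in comment_list['rootComments'])
        pcursor = comment_list['pcursor']
        if not is_number(pcursor):  # e.g. a sentinel string once comments run out
            return contents


# Fake two-page fetcher that demonstrates the pcursor handoff between pages.
def fake_fetcher(photo_id, pcursor):
    pages = {
        "":     {"pcursor": "1001",    "rootComments": [{"content": "first"}]},
        "1001": {"pcursor": "no_more", "rootComments": [{"content": "second"}]},
    }
    return {"data": {"visionCommentList": pages[pcursor]}}


print(crawl_comments(fake_fetcher, "3x2nh3ssaispdie"))  # ['first', 'second']
```

Against the real endpoint, `page_fetcher` would perform the same GraphQL POST as `start_requests`, including the proxy, header, and cookie handling done in `_midware`.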

From: https://www.cnblogs.com/Lhptest/p/18237029

    特性StarRocksHiveClickHouseTiDB数据存储列存储(ColumnarStorage)行存储(RowStorage)列存储(ColumnarStorage)混合存储(行存储和列存储)查询性能高低高高主要用途实时分析(Real-timeAnalytics)大数据批处理(BatchProcessing)实时分析(Real-timeAnalytics)OLTP与O......