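I. Introduction
This post uses the automation tool playwright to collect the username, gender, comment text, and IP location of the users in the comment section below a Bilibili video.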
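II. Approach
Open the video page and look at the Network panel: the comments are delivered in requests named "main?oid=XXXX", and new ones keep arriving as the page is scrolled down.
So we only need to simulate the user's scrolling and register a listener that captures each comment response during the scroll, saving the content as it arrives until the bottom of the comment section is reached. The captured JSON is then parsed.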
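    def monitor_response(self, res):
        # Keep only the responses of the comment API.
        if "api.bilibili.com/x/v2/reply/wbi/main?oid=" in res.url:
            data = res.json()["data"]["replies"]
            # The last 10 characters of the URL serve as the key for this batch.
            index = res.url[-10:]
            self.comments_message[index] = data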
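If `data` or `replies` is ever missing or null in a response, indexing it directly raises inside the listener. A minimal sketch of a more defensive extraction, assuming the payload shape `{"data": {"replies": [...]}}` used in the snippet above; the helper name `extract_replies` is hypothetical and the payloads are made up for illustration:

```python
def extract_replies(payload):
    """Return the list of replies in a parsed response body, or [] if absent.

    Assumes the shape {"data": {"replies": [...]}}; `extract_replies` is a
    hypothetical helper, not part of playwright.
    """
    if not isinstance(payload, dict):
        return []
    data = payload.get("data")
    if not isinstance(data, dict):
        return []
    replies = data.get("replies")
    return replies if isinstance(replies, list) else []


# Made-up payloads, for illustration only.
print(len(extract_replies({"data": {"replies": [{"id": 1}, {"id": 2}]}})))  # prints 2
print(extract_replies({"data": {"replies": None}}))  # prints []
print(extract_replies({"data": None}))  # prints []
```

Inside `monitor_response`, the call would then read `data = extract_replies(res.json())`.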
III. Complete code
First, open the Bilibili home page and log in manually.
(A similar step is described in section 二(二)②Ⅱ of the earlier post "python+playwright爬取招聘网站_进击no猪排的技术博客_51CTO博客".)
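For `connect_over_cdp` to work, a Chromium-based browser has to be running already with remote debugging enabled on the port the script connects to (12345 in the complete script). A sketch of how such a browser might be started from Python; the executable path and the profile directory are assumptions that must be adapted to the local machine, and the helper names are hypothetical:

```python
import subprocess

# Assumptions: adjust both paths for your machine.
CHROME_PATH = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
PROFILE_DIR = r"C:\playwright_profile"  # separate profile that keeps the login
DEBUG_PORT = 12345  # must match the port passed to connect_over_cdp


def build_chrome_command(chrome_path, port, profile_dir):
    """Build the command line that starts Chrome with remote debugging on."""
    return [
        chrome_path,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",
    ]


def launch_chrome():
    """Start the browser; log in to Bilibili by hand in the window that opens."""
    command = build_chrome_command(CHROME_PATH, DEBUG_PORT, PROFILE_DIR)
    return subprocess.Popen(command)


# Show the command without starting anything.
print(build_chrome_command(CHROME_PATH, DEBUG_PORT, PROFILE_DIR))
```

Calling `launch_chrome()` once, logging in manually, and leaving that window open should let the scraper attach to the logged-in session.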
Fetching the comment JSON file
# bilibi_comments.py
import json

from playwright.sync_api import sync_playwright


class comments_links_scrapy():
    def __init__(self):
        self.url = input("Paste the video URL here: ")
        self.comments_message = {}

    def monitor_response(self, res):
        # Keep only the responses of the comment API.
        if "api.bilibili.com/x/v2/reply/wbi/main?oid=" in res.url:
            # Guard against a null "data" or "replies" field.
            data = (res.json().get("data") or {}).get("replies") or []
            # The last 10 characters of the URL serve as the key for this batch.
            index = res.url[-10:]
            print(res.url)
            self.comments_message[index] = data

    def scrapy(self):
        try:
            with sync_playwright() as p:
                # Attach to the browser that is already running and logged in.
                browser = p.chromium.connect_over_cdp('http://localhost:12345/')
                page = browser.contexts[0].pages[0]
                page.on("response", self.monitor_response)
                page.goto(self.url)
                page.wait_for_timeout(3000)
                previous_height = 0
                # Scroll to the bottom until the page height stops growing.
                while True:
                    current_height = page.evaluate('document.documentElement.scrollHeight')
                    if previous_height < current_height:
                        previous_height = current_height
                        page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
                        page.wait_for_timeout(1000)
                    else:
                        break
                # Mode "w": appending a second JSON object would make the file unparsable.
                with open("comments.json", mode="w", encoding="utf-8") as fp:
                    json.dump(self.comments_message, fp, ensure_ascii=False)
                browser.close()
        except Exception as e:
            # Report the error instead of silently discarding it.
            print(f"Scraping failed: {e}")


if __name__ == '__main__':
    test = comments_links_scrapy()
    test.scrapy()
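The scroll loop in this script stops the first time the page height fails to grow within one second, so a slow response can end the crawl early. One possible variation, offered as a sketch rather than a tested fix, tolerates a few idle rounds before giving up. The function `scroll_until_stable` and its parameters are hypothetical; it takes plain callables so the logic can be tried without a browser:

```python
def scroll_until_stable(get_height, scroll_down, wait, max_idle_rounds=3):
    """Scroll until the height has not grown for `max_idle_rounds` rounds.

    get_height  -- callable returning the current page height
    scroll_down -- callable that scrolls to the bottom of the page
    wait        -- callable that pauses so new comments can load
    Returns the number of scrolls performed.
    """
    previous_height = 0
    idle_rounds = 0
    scrolls = 0
    while idle_rounds < max_idle_rounds:
        current_height = get_height()
        if current_height > previous_height:
            previous_height = current_height
            idle_rounds = 0  # progress was made, reset the counter
        else:
            idle_rounds += 1
        scroll_down()
        wait()
        scrolls += 1
    return scrolls


# Dry run with a fake page: the height stalls once, grows again, then stops.
heights = iter([1000, 2000, 2000, 3000, 3000, 3000, 3000])
count = scroll_until_stable(lambda: next(heights), lambda: None, lambda: None)
print(count)  # prints 7
```

With playwright, the three callables could wrap the calls already used in the script: `lambda: page.evaluate('document.documentElement.scrollHeight')`, `lambda: page.evaluate('window.scrollTo(0, document.body.scrollHeight)')`, and `lambda: page.wait_for_timeout(1000)`.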
Parsing the JSON file and saving it as a CSV file
# get_comments.py
import csv
import json

with open("comments.json", "r", encoding="utf-8") as f:
    data = json.load(f)

with open('comments.csv', 'w', newline='', encoding='utf-8-sig') as f2:
    colnames = ['username', 'gender', 'comment', 'ip_location']
    writer = csv.DictWriter(f2, fieldnames=colnames)
    writer.writeheader()
    # data maps each batch key to the list of replies captured in that response.
    for item in data:
        for i in data[item] or []:
            writer.writerow({
                'username': i["member"]["uname"],
                'gender': i["member"]["sex"],
                'comment': i["content"]["message"],
                # Fall back to "" if "location" is absent; [5:] drops the leading 5-character label.
                'ip_location': i["reply_control"].get("location", "")[5:],
            })
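Direct indexing raises `KeyError` as soon as one reply lacks a field. A small sketch of a more forgiving row builder, assuming the same field names as the script above and a location value that looks like `IP属地:<region>` (which is what the `[5:]` slice suggests). `reply_to_row` is a hypothetical helper, its dictionary keys are placeholders to be matched to the CSV column names, and the sample reply is made up:

```python
import re

# Matches the assumed label with either a full-width or an ASCII colon.
IP_LABEL = re.compile(r"^IP属地[::]\s*")


def reply_to_row(reply):
    """Flatten one reply dict into a CSV row, tolerating missing fields."""
    member = reply.get("member") or {}
    content = reply.get("content") or {}
    control = reply.get("reply_control") or {}
    location = control.get("location") or ""
    return {
        "username": member.get("uname", ""),
        "gender": member.get("sex", ""),
        "comment": content.get("message", ""),
        "ip_location": IP_LABEL.sub("", location),
    }


# Made-up sample data, for illustration only.
sample = {
    "member": {"uname": "demo_user", "sex": "保密"},
    "content": {"message": "hello"},
    "reply_control": {"location": "IP属地:广东"},
}
print(reply_to_row(sample))
print(reply_to_row({"member": {"uname": "no_location"}}))
```

Stripping the label with a pattern instead of a fixed slice keeps the value intact when the label is missing.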
(This is purely a practice exercise; the scraping process seems somewhat unstable.)
From: https://blog.51cto.com/goku0623/9256984