Two days of writing, referencing articles from other experts and feeling my way forward, and I finally got a crawler working that can scrape Bilibili comment sections. I'm wrecked……
Acknowledgements:
SmartCrane
马哥python说
The program provides the following features:
① Given a Bilibili video link, it scrapes the comment section's comments, IP locations, user IDs, like counts, and reply counts, and saves them to a CSV file in the current directory named after the video title.
② From the video link it extracts the video title and the oid.
Notes:
① The User-Agent and cookie have been stripped from the code; you must fill them in before running. If the cookie is not written into headers, the IP location cannot be scraped. Thanks to 马哥python说 for this. (A minimal sketch of reading both values from environment variables follows after these notes.)
② If you run into this error:
comment = s['data']['replies'][i]
~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
it is because, as you move to later pages of the comment section, the replies arrays for later pages no longer hold as many items (unlike earlier pages with 17 or 18 entries, the last page may only have 4 or 5). For example, with pages = 21 and 401 comments in total, each of the first 20 pages has 20 replies while page 21 has only one, so indexing s['data']['replies'][2] on the last page goes out of range. (In practice the author found this error does not affect the CSV that is finally written.)
Solution: shrink the page range in main so that only the earlier pages are fetched. (In the listing below, parseHtml also sidesteps this by iterating only over the replies actually returned on each page.)
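As mentioned in note ①, the script expects you to paste your own cookie and User-Agent into headers. One minimal sketch, purely illustrative, is to read them from environment variables so they never end up in the source file (the names BILI_COOKIE and BILI_UA below are my own invention, not something the script defines):

import os

headers = {
    # export BILI_COOKIE="SESSDATA=...; bili_jct=..." and BILI_UA="Mozilla/5.0 ..." in your shell first;
    # both variable names are only examples
    'user-agent': os.environ.get('BILI_UA', ''),
    'cookie': os.environ.get('BILI_COOKIE', ''),
    'referer': 'https://www.bilibili.com/',
}

The remaining header fields from the full listing below can be kept as they are.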
import requests
import json
import time
import re
# header row of the CSV; scraped rows are appended after it
CommentList = [['IP属地', '昵称', 'ID', '发表时间', '评论内容', '点赞数', '回复数']]
headers = {
    'authority': 'api.bilibili.com',
    'accept': 'application/json, text/plain, */*',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'cookie': "paste your browser cookie here",
    # refresh the cookie from time to time, otherwise the IP location cannot be retrieved
    'origin': 'https://www.bilibili.com',
    'referer': 'https://www.bilibili.com/video/BV1FG4y1Z7po/?spm_id_from=333.337.search-card.all.click&vd_source=69a50ad969074af9e79ad13b34b1a548',
    'sec-ch-ua': '"Chromium";v="106", "Microsoft Edge";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'user-agent': 'paste your User-Agent here'
}
def fetchURL(url):
    # request one page of the comment API and return the raw JSON text
    return requests.get(url=url, headers=headers).text
def parseHtml(html):
    # json.loads turns the response text (str) into a dict
    s = json.loads(html)
    replies = s['data']['replies'] or []
    # iterate over the replies actually returned, so short final pages cannot go out of range
    for comment in replies:
        # the location string looks like 'IP属地:xx'; [5:] strips the prefix,
        # and the .get fallback returns '' when no location is present (e.g. cookie missing)
        location = comment.get('reply_control', {}).get('location', '')[5:]
        uname = comment['member']['uname']
        ID = comment['member']['mid']
        ctime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(comment['ctime']))  # time the comment was posted
        content = comment['content']['message']
        like = comment['like']
        rcount = comment['rcount']
        CommentList.append([location, uname, ID, ctime, content, like, rcount])
def Write2Csv(List, title):
    import pandas as pd
    dataframe = pd.DataFrame(List)
    filepath = title + '.csv'   # saved in the current directory, named after the video title
    # header=False because the header row is already the first element of List
    dataframe.to_csv(filepath, index=False, sep=',', header=False, encoding='utf_8_sig')
def GetTitle(url):
    # pull the video title out of the <h1 title="..."> tag on the video page
    page_text = requests.get(url=url, headers=headers).text
    ex = '<h1 title="(.*?)".*?</h1>'
    title = re.findall(ex, page_text, re.S)[0]
    return title
def GetOid(url):
    # the numeric oid (aid) is embedded in window.__INITIAL_STATE__ on the video page
    page_text = requests.get(url=url, headers=headers).text
    ex = '</script><script>window.__INITIAL_STATE__={"aid":(.*?),"bvid":'
    # e.g. </script><script>window.__INITIAL_STATE__={"aid":269261816,"bvid"
    oid = re.findall(ex, page_text, re.S)[0]
    return oid
if __name__ == '__main__':
    print('Please input the url of the video:')
    temp_url = input()
    title = GetTitle(temp_url)
    oid = GetOid(temp_url)
    # type=1 means video comments; pn is the page number
    url0 = 'https://api.bilibili.com/x/v2/reply?type=1&oid=' + oid + '&sort=2&pn='
    print('Wait……')
    for i in range(1, 20):
        url = url0 + str(i)
        html = fetchURL(url)
        parseHtml(html)
        if i % 5 == 0:
            time.sleep(1)   # pause every 5 pages to avoid hitting the API too fast
    Write2Csv(CommentList, title)
    print('Success!\n')
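The page range in main is hard-coded as range(1, 20), which is exactly what the note above asks you to tune by hand. A small sketch of computing it instead, reusing fetchURL and json from the listing above and assuming the first page of the reply API carries a data.page object with count (number of root comments) and size (comments per page); the field names are an assumption to verify against a real response, and GetPageCount is a helper name of my own:

import math

def GetPageCount(oid):
    # fetch page 1 only and derive the page count from data.page (assumed fields: count, size)
    first = json.loads(fetchURL('https://api.bilibili.com/x/v2/reply?type=1&oid=' + oid + '&sort=2&pn=1'))
    page = first['data']['page']
    return math.ceil(page['count'] / page['size'])

With that helper, the loop in main could become for i in range(1, GetPageCount(oid) + 1), and you no longer have to guess how many pages the comment section has.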