Two days of writing, referencing articles from other experts and feeling my way forward, and I finally got a crawler working that can scrape Bilibili comment sections. I'm wrecked……
Acknowledgements:
SmartCrane
马哥python说
The program provides the following features:
① Given a Bilibili video link, it scrapes the comment section's comments, IP locations, user IDs, like counts, and reply counts, and saves them to a CSV file in the current directory named after the video title.
② From the video link it extracts the video title and the oid.
Notes:
① The User-Agent and cookie have been stripped from the code; you must fill them in before running. If the cookie is not written into headers, the IP location cannot be scraped. Thanks to 马哥python说 for this. (A minimal sketch of reading both values from environment variables follows after these notes.)
② If you run into this error:
comment = s['data']['replies'][i]
~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
it is because, as you move to later pages of the comment section, the replies arrays for later pages no longer hold as many items (unlike earlier pages with 17 or 18 entries, the last page may only have 4 or 5). For example, with pages = 21 and 401 comments in total, each of the first 20 pages has 20 replies while page 21 has only one, so indexing s['data']['replies'][2] on the last page goes out of range. (In practice the author found this error does not affect the CSV that is finally written.)
Solution: shrink the page range in main so that only the earlier pages are fetched. (In the listing below, parseHtml also sidesteps this by iterating only over the replies actually returned on each page.)
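As mentioned in note ①, the script expects you to paste your own cookie and User-Agent into headers. One minimal sketch, purely illustrative, is to read them from environment variables so they never end up in the source file (the names BILI_COOKIE and BILI_UA below are my own invention, not something the script defines):

import os

headers = {
    # export BILI_COOKIE="SESSDATA=...; bili_jct=..." and BILI_UA="Mozilla/5.0 ..." in your shell first;
    # both variable names are only examples
    'user-agent': os.environ.get('BILI_UA', ''),
    'cookie': os.environ.get('BILI_COOKIE', ''),
    'referer': 'https://www.bilibili.com/',
}

The remaining header fields from the full listing below can be kept as they are.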
import requests
import json
import time
import re
# header row of the CSV; scraped rows are appended after it
CommentList = [['IP属地', '昵称', 'ID', '发表时间', '评论内容', '点赞数', '回复数']]
headers = {
    'authority': 'api.bilibili.com',
    'accept': 'application/json, text/plain, */*',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'cookie': "paste your browser cookie here",
    # refresh the cookie from time to time, otherwise the IP location cannot be retrieved
    'origin': 'https://www.bilibili.com',
    'referer': 'https://www.bilibili.com/video/BV1FG4y1Z7po/?spm_id_from=333.337.search-card.all.click&vd_source=69a50ad969074af9e79ad13b34b1a548',
    'sec-ch-ua': '"Chromium";v="106", "Microsoft Edge";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'user-agent': 'paste your User-Agent here'
}
def fetchURL(url):
    # request one page of the comment API and return the raw JSON text
    return requests.get(url=url, headers=headers).text
def parseHtml(html):
    # json.loads turns the response text (str) into a dict
    s = json.loads(html)
    replies = s['data']['replies'] or []
    # iterate over the replies actually returned, so short final pages cannot go out of range
    for comment in replies:
        # the location string looks like 'IP属地:xx'; [5:] strips the prefix,
        # and the .get fallback returns '' when no location is present (e.g. cookie missing)
        location = comment.get('reply_control', {}).get('location', '')[5:]
        uname = comment['member']['uname']
        ID = comment['member']['mid']
        ctime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(comment['ctime']))  # time the comment was posted
        content = comment['content']['message']
        like = comment['like']
        rcount = comment['rcount']
        CommentList.append([location, uname, ID, ctime, content, like, rcount])
def Write2Csv(List, title):
    import pandas as pd
    dataframe = pd.DataFrame(List)
    filepath = title + '.csv'   # saved in the current directory, named after the video title
    # header=False because the header row is already the first element of List
    dataframe.to_csv(filepath, index=False, sep=',', header=False, encoding='utf_8_sig')
def GetTitle(url):
    # pull the video title out of the <h1 title="..."> tag on the video page
    page_text = requests.get(url=url, headers=headers).text
    ex = '<h1 title="(.*?)".*?</h1>'
    title = re.findall(ex, page_text, re.S)[0]
    return title
def GetOid(url):
    # the numeric oid (aid) is embedded in window.__INITIAL_STATE__ on the video page
    page_text = requests.get(url=url, headers=headers).text
    ex = '</script><script>window.__INITIAL_STATE__={"aid":(.*?),"bvid":'
    # e.g. </script><script>window.__INITIAL_STATE__={"aid":269261816,"bvid"
    oid = re.findall(ex, page_text, re.S)[0]
    return oid
if __name__ == '__main__':
    print('Please input the url of the video:')
    temp_url = input()
    title = GetTitle(temp_url)
    oid = GetOid(temp_url)
    # type=1 means video comments; pn is the page number
    url0 = 'https://api.bilibili.com/x/v2/reply?type=1&oid=' + oid + '&sort=2&pn='
    print('Wait……')
    for i in range(1, 20):
        url = url0 + str(i)
        html = fetchURL(url)
        parseHtml(html)
        if i % 5 == 0:
            time.sleep(1)   # pause every 5 pages to avoid hitting the API too fast
    Write2Csv(CommentList, title)
    print('Success!\n')
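The page range in main is hard-coded as range(1, 20), which is exactly what the note above asks you to tune by hand. A small sketch of computing it instead, reusing fetchURL and json from the listing above and assuming the first page of the reply API carries a data.page object with count (number of root comments) and size (comments per page); the field names are an assumption to verify against a real response, and GetPageCount is a helper name of my own:

import math

def GetPageCount(oid):
    # fetch page 1 only and derive the page count from data.page (assumed fields: count, size)
    first = json.loads(fetchURL('https://api.bilibili.com/x/v2/reply?type=1&oid=' + oid + '&sort=2&pn=1'))
    page = first['data']['page']
    return math.ceil(page['count'] / page['size'])

With that helper, the loop in main could become for i in range(1, GetPageCount(oid) + 1), and you no longer have to guess how many pages the comment section has.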