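Scraping Douban Movie Review Data
Created: 2024-08-12
Code for scraping Douban movie review data: cover images, titles, review text, and the full content from each review's detail page.
1. Complete Code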
'''
https://movie.douban.com/review/best/
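Scrape the cover, title, and review text
Scrape the complete review text, i.e. the full text shown after clicking "expand"
Scrape the data from the current review's detail page
Scrape multiple pages of reviews: cover, title, complete review text, and the detail-page data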
'''
import json
import re
import requests
from lxml import etree
url = 'https://movie.douban.com/review/best/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'
}
response = requests.get(url, headers=header)
response.encoding = 'utf-8'
html = response.text
tree = etree.HTML(html)
fm_urls = tree.xpath('//div[@class="review-list chart "]//a[@class="subject-img"]/img/@src')
bt_list = tree.xpath('//div[@class="main-bd"]/h2/a/text()')
# for fm, bt in zip(fm_urls, bt_list):
#     res = requests.get(fm, headers=header)
#     with open('./imgs/' + bt + '.jpg', 'wb') as f:
#         f.write(res.content)
#     print(bt + ' saved')
details_list = tree.xpath('//div[@class="main-bd"]/h2/a/@href')
details_urls = []
for i in details_list:
    num = re.findall(r'\d+', i)[0]
    details_url = f'https://movie.douban.com/j/review/{num}/full'
    details_urls.append(details_url)
# Example: https://movie.douban.com/j/review/15980218/full
for bt, details_url in zip(bt_list, details_urls):
    response = requests.get(details_url, headers=header)
    response.encoding = response.apparent_encoding
    data = json.loads(response.text)
    # print(data['body'])
    datatree = etree.HTML(data['body'])
    details = datatree.xpath('//text()')
    # print(details)
    detailedInfo = '\n' + bt + '\n' + ''.join(details)
    with open('detail.txt', 'a+', encoding='utf-8') as f:
        f.write(detailedInfo)
    print(f'{bt}: full review content downloaded')
# exit()
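2. Code Walkthrough
2.1 Basic setup
Import the required libraries, then set the URL of the Douban review page and a request header whose User-Agent mimics a real browser.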
import json
import re
import requests
from lxml import etree
url = 'https://movie.douban.com/review/best/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'
}
2.2 Send the request and get the page's HTML
response = requests.get(url, headers=header)
response.encoding = 'utf-8'
html = response.text
tree = etree.HTML(html)
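The request above sets no timeout and does not check the HTTP status, so a blocked or failed request would hand an error page to the parser. A minimal sketch of a more defensive fetch might look like the following. The helper name `fetch_html`, the `HEADER` constant, and the 10-second timeout are assumptions for illustration, not part of the original script.

```python
import requests

HEADER = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'
}


def fetch_html(url, session=None, timeout=10):
    """Hypothetical helper: GET a page and return its text, raising on HTTP errors."""
    # The requests module and a requests.Session both provide get(), so either can be used.
    client = session or requests
    response = client.get(url, headers=HEADER, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx responses
    response.encoding = 'utf-8'   # same encoding assumption as the original script
    return response.text
```

Passing a `requests.Session()` as `session` would reuse one connection across the list page and the detail requests; calling `fetch_html(url)` alone behaves like a plain `requests.get`.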
2.3 Use XPath to extract the cover image URLs and titles
fm_urls = tree.xpath('//div[@class="review-list chart "]//a[@class="subject-img"]/img/@src')
bt_list = tree.xpath('//div[@class="main-bd"]/h2/a/text()')
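To see what these two expressions select, they can be run against a small inline document. The markup below is made up and only mirrors the structure the expressions expect; it is not a copy of Douban's real page, which may differ or change.

```python
from lxml import etree

# Simplified, invented markup that mirrors the structure the XPath expressions rely on.
SAMPLE_HTML = '''
<div class="review-list chart ">
  <div class="main review-item">
    <a class="subject-img" href="https://example.com/subject/1/">
      <img src="https://example.com/cover1.jpg">
    </a>
    <div class="main-bd">
      <h2><a href="https://example.com/review/111/">First title</a></h2>
    </div>
  </div>
</div>
'''

tree = etree.HTML(SAMPLE_HTML)

# Same expressions as in the script: @class="..." compares the whole attribute string,
# so the trailing space in "review-list chart " has to match exactly.
fm_urls = tree.xpath('//div[@class="review-list chart "]//a[@class="subject-img"]/img/@src')
bt_list = tree.xpath('//div[@class="main-bd"]/h2/a/text()')

# contains() only needs a substring of the class attribute, so it tolerates extra class names.
loose = tree.xpath('//div[contains(@class, "review-list")]//a[contains(@class, "subject-img")]/img/@src')

print(fm_urls, bt_list, loose)
```

If the exact-match expression starts returning an empty list on the live page, the `contains()` form is one thing to try, since a changed class attribute would break the exact comparison.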
2.4 Save the cover images
# open() does not create folders, so ./imgs/ must already exist
for fm, bt in zip(fm_urls, bt_list):
    res = requests.get(fm, headers=header)
    with open('./imgs/' + bt + '.jpg', 'wb') as f:
        f.write(res.content)
    print(bt + ' saved')
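Using a review title directly as a file name can fail when the title contains characters such as `/`, `:` or `?`, and the target folder has to exist before writing. One possible way to handle both is sketched below. The helper names `safe_filename` and `cover_path`, and the choice of `_` as the replacement character, are assumptions for illustration.

```python
import re
from pathlib import Path


def safe_filename(title, replacement='_'):
    """Hypothetical helper: replace characters that are invalid in common file systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip()
    return cleaned or 'untitled'  # fall back if nothing usable is left


def cover_path(title, folder='./imgs'):
    """Build the target path for a cover image, creating the folder if it is missing."""
    directory = Path(folder)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / (safe_filename(title) + '.jpg')
```

In the loop, `open(cover_path(bt), 'wb')` would then take the place of `open('./imgs/' + bt + '.jpg', 'wb')`.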
2.5 Get the review detail-page URLs and build the full request links
details_list = tree.xpath('//div[@class="main-bd"]/h2/a/@href')
details_urls = []
for i in details_list:
    num = re.findall(r'\d+', i)[0]
    details_url = f'https://movie.douban.com/j/review/{num}/full'
    details_urls.append(details_url)
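`re.findall(r'\d+', i)[0]` takes the first run of digits anywhere in the link and raises `IndexError` when the link contains no digits. A slightly stricter variant, sketched here with a hypothetical helper `full_review_url`, anchors the pattern on the `/review/` path segment and returns `None` for links that do not match.

```python
import re

# Capture only the digits that directly follow "/review/".
REVIEW_ID = re.compile(r'/review/(\d+)')


def full_review_url(href):
    """Hypothetical helper: map a review page link to the /j/review/<id>/full endpoint."""
    match = REVIEW_ID.search(href)
    if match is None:
        return None  # the link does not look like a review page
    return f'https://movie.douban.com/j/review/{match.group(1)}/full'


links = [
    'https://movie.douban.com/review/15980218/',
    'https://movie.douban.com/about',
]
print([full_review_url(link) for link in links])
```

Links that yield `None` can then be skipped before any request is made.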
2.6 Fetch the full content of each review and save it
for bt, details_url in zip(bt_list, details_urls):
    response = requests.get(details_url, headers=header)
    response.encoding = response.apparent_encoding
    data = json.loads(response.text)
    # print(data['body'])
    datatree = etree.HTML(data['body'])
    details = datatree.xpath('//text()')
    # print(details)
    detailedInfo = '\n' + bt + '\n' + ''.join(details)
    with open('detail.txt', 'a+', encoding='utf-8') as f:
        f.write(detailedInfo)
    print(f'{bt}: full review content downloaded')
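The header comment of the script lists multi-page scraping as a goal, but the code only requests the first list page. A rough sketch of generating the list-page URLs follows. It assumes the list is paginated with a `start` offset query parameter and 20 reviews per page; both are assumptions that should be checked against the live site, and `page_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/review/best/'


def page_urls(pages, page_size=20):
    """Build list-page URLs, assuming a `start` offset parameter (unverified assumption)."""
    urls = []
    for page in range(pages):
        query = urlencode({'start': page * page_size})
        urls.append(f'{BASE_URL}?{query}')
    return urls


for page_url in page_urls(3):
    print(page_url)
```

Each generated URL could then go through the same steps as above (request, XPath extraction, detail fetch), ideally with a short `time.sleep` between requests to keep the load on the site low.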