使用 Python 爬取豆瓣电影 Top250 多页数据
创建时间:2024-08-11
一、完整代码
'''
抓取单贞数据 中的评分 简介 评价人数
将上面的改为多页
https://movie.douban.com/top250?start=0
'''
import requests
from lxml import etree
header = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'
}
url = 'https://movie.douban.com/top250?start=0'
pag_num = int(input('请输入需要抓取的页数,一页25条数据,共10页:')) * 25 - 25
if pag_num <= 0:
print('输入错误!!')
exit(-1)
for num in range(0, pag_num + 25, 25):
url = f'https://movie.douban.com/top250?start={num}'
# print(url)
res = requests.get(url, headers=header)
# print(res.status_code)
res.encoding = res.apparent_encoding
tree = etree.HTML(res.text)
movies_name = tree.xpath('//ol[@class="grid_view"]//img/@alt')
movies_start = tree.xpath('//ol[@class="grid_view"]//span[@class="rating_num"]/text()')
movies_intro = tree.xpath('//ol[@class="grid_view"]//span[@class="inq"]//text()')
movies_evaluators = tree.xpath('//ol[@class="grid_view"]//div[@class="star"]/span[4]/text()')
for name, start, intro, evaluator in zip(movies_name, movies_start, movies_intro, movies_evaluators):
movie = name + ',' + start + ',' + intro + ',' + evaluator + '\n'
print(f'{name}-----下载成功~~~~')
with open('top250.txt', 'a+', encoding='utf-8') as file:
file.write(movie)
1.1 效果
二、代码解析
2.1 设置了请求的头部信息
模拟正常的浏览器访问,以避免被网站识别为爬虫并阻止访问。
header = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'
}
2.2 参数设置
根据输入,通过循环构建不同的 URL 来访问不同的页面。
pag_num = int(input('请输入需要抓取的页数,一页 25 条数据,共 10 页:')) * 25 - 25
if pag_num <= 0:
print('输入错误!!')
exit(-1)
for num in range(0, pag_num + 25, 25):
url = f'https://movie.douban.com/top250?start={num}'
2.3 解析数据
在获取到每个页面的 HTML 内容后,使用 lxml 的 etree 模块解析页面,并通过 xpath 表达式提取出电影的名称、评分、简介和评价人数等信息。
res = requests.get(url, headers=header)
res.encoding = res.apparent_encoding
tree = etree.HTML(res.text)
movies_name = tree.xpath('//ol[@class="grid_view"]//img/@alt')
movies_start = tree.xpath('//ol[@class="grid_view"]//span[@class="rating_num"]/text()')
movies_intro = tree.xpath('//ol[@class="grid_view"]//span[@class="inq"]//text()')
movies_evaluators = tree.xpath('//ol[@class="grid_view"]//div[@class="star"]/span[4]/text()')
2.4 合并保存
将提取到的每部电影的信息组合成一条记录,并保存到一个名为 top250.txt
的文件中。
for name, start, intro, evaluator in zip(movies_name, movies_start, movies_intro, movies_evaluators):
movie = name + ',' + start + ',' + intro + ',' + evaluator + '\n'
print(f'{name}-----下载成功~~~~')
with open('top250.txt', 'a+', encoding='utf-8') as file:
file.write(movie)
标签:25,name,start,Python,爬取,movies,num,Top250,class
From: https://www.cnblogs.com/suifeng2000/p/18353320