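This is based on someone else's blog post. The original had quite a few bugs and its output file was hard to read, so I optimized, debugged, and tested it, and am republishing it here.
PS: Watch how often you run it. Requests that are too frequent will get your IP banned =.=
Original post:
https://www.cnblogs.com/lweiser/p/11042658.html#5136708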
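As a rough illustration of the note about IP bans, here is a minimal sketch of a fetch helper that backs off when a request is refused. It assumes that a block shows up as a non-200 status code (for example 403 or 429), which is not confirmed for Douban. The name `fetch_with_backoff` and its parameters are hypothetical and are not part of the script in this post.

```python
import random
import time


def fetch_with_backoff(get, url, max_tries=4, base_delay=2.0, sleep=time.sleep):
    """Call get(url) until it returns HTTP 200, waiting longer after each failure.

    `get` is any callable that returns an object with `status_code` and `text`
    attributes (a requests.get call wrapped with headers would fit).
    `sleep` is injectable so the helper can be tried without really waiting.
    """
    for attempt in range(max_tries):
        response = get(url)
        if response.status_code == 200:
            return response.text
        if attempt < max_tries - 1:
            # Exponential backoff: roughly 2 s, 4 s, 8 s, ... plus up to 1 s of jitter.
            # Assumption: a non-200 status means "slow down", not a permanent error.
            delay = base_delay * (2 ** attempt) + random.random()
            sleep(delay)
    raise RuntimeError(f"giving up on {url} after {max_tries} attempts")
```

With `requests`, one possible call is `fetch_with_backoff(functools.partial(requests.get, headers=headers, timeout=10), url)`. The random pause between pages is still worth keeping; backoff only helps once the site has already started refusing requests.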
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Home page:
https://movie.douban.com/top250
Method: GET
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36
Regular expression:
# PS: captures rank, detail-page URL, image URL, title, director, release year, rating, number of ratings
# Some movies have incomplete information, so the regex no longer captures the cast or the summary.
'<div class="item">.*?<em class="">(.*?)</em>.*?href="(.*?)">.*?src="(.*?)" class="">.*?<span class="title">(.*?)</span>.*?<div class="bd">.*?导演:(.*?)&nbsp;.*?<br>(.*?)&nbsp;.*?</p>.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价',
"""
"""
URL of each page:
Page 1: https://movie.douban.com/top250
Page 2: https://movie.douban.com/top250?start=25&filter=
Page 3: https://movie.douban.com/top250?start=50&filter=
.....
Page 9: https://movie.douban.com/top250?start=200&filter=
Page 10: https://movie.douban.com/top250?start=225&filter=
"""
import random
import re
import time

import requests


def get250Mov():
    # Request headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3770.90 Safari/537.36"
    }
    base_url = "https://movie.douban.com/top250?start={}&filter="
    file_name = "douban.csv"
    separator = "\t,"  # WPS seems to accept only "," as the CSV delimiter =.=
    start_num = 0
    # Header row ("w" mode also truncates any previous output file)
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(
            f"No.{separator}Title{separator}Director{separator}Release year{separator}Rating{separator}Votes{separator}Detail URL{separator}Image URL\n"
        )
    for i in range(10):
        url = base_url.format(start_num)
        start_num += 25
        # print("debug:======", url)
        # 1. Send the request to Douban
        response = requests.get(url, headers=headers)
        # 2. Extract the data with a regular expression
        # Captured: rank, detail-page URL, image URL, title, director, release year, rating, number of ratings
        # The summary is not captured yet: some movies have none, and the regex would then fail on them
        # "导演:" (director) and "人评价" (people rated) are literal text on the page, so they stay in Chinese
        movie_content_list = re.findall(
            '<div class="item">.*?<em class="">(.*?)</em>.*?href="(.*?)">.*?src="(.*?)" class="">.*?'
            '<span class="title">(.*?)</span>.*?<div class="bd">.*?导演:(.*?)&nbsp;.*?<br>(.*?)&nbsp;.*?</p>.*?'
            '<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价',
            response.text,
            re.S,
        )
        # Unpack the fields of each movie
        for movie_content in movie_content_list:
            (
                class_no,
                detail_url,
                image_url,
                movie_name,
                director,
                movie_time,
                movie_grade,
                number,
            ) = movie_content
            detail_data = (
                f"{class_no}{separator}{movie_name}{separator}{trans(director)}{separator}{trans(movie_time)}"
                f"{separator}{movie_grade}{separator}{number}{separator}{detail_url}{separator}{image_url}\n"
            )
            # Append the row to the output file
            with open(file_name, "a", encoding="utf-8") as f:
                f.write(detail_data)
            # print(detail_data)
        print("{}% completed.".format((i + 1) * 10))
        # Random pause of 0 to 3 seconds between pages to avoid hammering the site
        time.sleep(random.random() * 3)


def trans(s):
    # Replace any leftover &nbsp; entities and trim surrounding whitespace
    return s.replace("&nbsp;", " ").strip()


if __name__ == "__main__":
    get250Mov()
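The script above skips the summary because one big regex fails on any movie that lacks a field. One possible way around that, sketched below, is to split the page into per-movie chunks first and run a small regex per field, so a missing field simply becomes an empty string. The sketch also writes rows with the standard `csv` module, which quotes values containing commas. The names `parse_items`, `to_csv`, `FIELDS`, and `SAMPLE` are hypothetical, the sample HTML is made up for illustration, and the `<span class="inq">` marker for the one-line summary is an assumption about Douban's markup rather than something taken from the original script.

```python
import csv
import io
import re

FIELDS = ["rank", "title", "director", "year", "rating", "votes", "quote"]

# Made-up sample in the assumed page layout; the second movie has no summary.
SAMPLE = '''
<ol>
<div class="item"><em class="">1</em>
<span class="title">Movie A</span>
<div class="bd"><p class="">
导演: Director A&nbsp;&nbsp;&nbsp;主演: Actor A<br>
1994&nbsp;/&nbsp;US&nbsp;/&nbsp;Drama</p>
<span class="rating_num" property="v:average">9.7</span>
<span>100人评价</span>
<span class="inq">A one-line quote.</span></div></div>
<div class="item"><em class="">2</em>
<span class="title">Movie B, Part 2</span>
<div class="bd"><p class="">
导演: Director B<br>
1993&nbsp;/&nbsp;CN</p>
<span class="rating_num" property="v:average">9.6</span>
<span>90人评价</span></div></div>
</ol>
'''


def _first(pattern, text):
    """Return the first captured group, cleaned up, or "" when nothing matches."""
    match = re.search(pattern, text, re.S)
    return match.group(1).replace("&nbsp;", " ").strip() if match else ""


def parse_items(html):
    """Parse each movie block on its own so one missing field cannot break the rest."""
    rows = []
    # The first piece is whatever precedes the first item, so skip it.
    for chunk in html.split('<div class="item">')[1:]:
        rows.append({
            "rank": _first(r'<em class="">(.*?)</em>', chunk),
            "title": _first(r'<span class="title">(.*?)</span>', chunk),
            "director": _first(r'导演:\s*(.*?)(?:&nbsp;|<br>)', chunk),
            "year": _first(r'<br>\s*(\d{4})', chunk),
            "rating": _first(r'<span class="rating_num"[^>]*>(.*?)</span>', chunk),
            "votes": _first(r'<span>(\d+)人评价', chunk),
            "quote": _first(r'<span class="inq">(.*?)</span>', chunk),  # assumed marker
        })
    return rows


def to_csv(rows):
    """Render rows as CSV text; the csv module quotes values that contain commas."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(parse_items(SAMPLE)))
```

To save to disk instead, the same writer can be pointed at a file opened with `newline=""`. With this approach the `"\t,"` separator is no longer needed, since commas inside a title are quoted rather than treated as column breaks.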
From: https://www.cnblogs.com/joyer/p/17014523.html