Environment: Python 3.9 or above; IDE: PyCharm.
Today we put the requests module into practice with two exercises: scraping the basic info of the Douban Movie Top 250, and scraping the titles and download links of the "2023必看热片" (2023 must-see hits) section on Movie Heaven (dy2018). Details below:
'''
Scrape the basic info of the Douban Movie Top 250

Approach:
1. Fetch the page source
2. Write a regex and extract the data from the page
3. Save the data
'''
import requests
import re

f = open("top250.csv", mode='w', encoding='utf-8')

# Looking at the URL reveals how the current page relates to the next and
# previous ones (start jumps by 25), so we can loop through all ten pages
for p in range(1, 11):
    url = f"https://movie.douban.com/top250?start={(p - 1) * 25}"  # must be an f-string, otherwise start is the literal text "(p-1)*25"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
    }
    resp = requests.get(url, headers=headers)
    # resp.encoding = 'utf-8'  # uncomment if the page comes back garbled
    pageSource = resp.text
    # print(pageSource)

    # Build the regex; re.S lets . in the pattern match newlines too.
    # The page source uses &nbsp; after the director name and after the year.
    obj = re.compile(r'<div class="item">.*?<span class="title">(?P<moviename>.*?)</span>'
                     r'.*?<p class="">.*?导演: (?P<daoyan>.*?)&nbsp;'
                     r'.*?<br>(?P<year>.*?)&nbsp;.*?<span class="rating_num" property="v:average">'
                     r'(?P<score>.*?)</span>.*?<span>(?P<number>.*?)人评价</span>', re.S)

    # Run the regex over the page
    result = obj.finditer(pageSource)
    for item in result:
        moviename = item.group("moviename")
        daoyan = item.group("daoyan")
        year = item.group("year").strip()  # strip whitespace from both ends
        score = item.group("score")
        number = item.group("number")
        # If this feels crude, switch to the csv module for writing (see the sketch below)
        f.write(f'{moviename},{daoyan},{year},{score},{number}\n')
        # print(moviename, daoyan, year, score, number)
    resp.close()
    print(f"Douban Top 250 page {p} done!")
f.close()  # close the file only after all pages have been written
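The comment in the loop above mentions swapping the hand-rolled f.write for the csv module. Here is a minimal sketch of that variant, with dummy values standing in for the scraped fields: csv.writer takes care of quoting commas inside a field, which the plain f.write version would break on.

import csv

f = open("top250.csv", mode='w', encoding='utf-8', newline='')  # newline='' avoids blank rows on Windows
writer = csv.writer(f)
writer.writerow(["moviename", "daoyan", "year", "score", "number"])  # optional header row
# Inside the scraping loop you would replace f.write(...) with:
#     writer.writerow([moviename, daoyan, year, score, number])
writer.writerow(["示例片名", "示例导演", "1994", "9.0", "100"])  # dummy row just to show the call
f.close()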
'''
Scrape the titles and download links of the "2023必看热片" section on Movie Heaven

Approach:
1. Extract the URL behind each movie on the home page
   1) Grab the HTML block of the "2023必看热片" section
   2) Pull the href values out of that block
2. Visit each child page and extract the movie title and download link
   1) Fetch the child page source
   2) Extract the data
'''
import re
import requests

url = "https://www.dy2018.com/"
resp = requests.get(url)
resp.encoding = 'gbk'  # the site is GBK-encoded, not UTF-8
# print(resp.text)

# 1. Extract the HTML block of the "2023必看热片" section
obj1 = re.compile(r'2023必看热片.*?<ul>(?P<html>.*?)</ul>', re.S)
result1 = obj1.search(resp.text)
html = result1.group("html")
# print(html)

# 2. Extract the href value from each <a> tag
obj2 = re.compile(r"<li><a href='(?P<href>.*?)' title")
result2 = obj2.finditer(html)

# Regex for the child page: title sits after "◎片 名", download link in the bgcolor table cell
obj3 = re.compile(r'<div id="Zoom">.*?◎片 名(?P<movie>.*?)<br />'
                  r'.*?<td style="WORD-WRAP: break-word" bgcolor="#fdfddf"><a href="(?P<download>.*?)">', re.S)

for item in result2:
    # print(item.group("href"))
    # Build the child page URL: the href starts with "/", so drop the
    # trailing "/" from the base (see the urljoin sketch below for a sturdier way)
    child_url = url.strip("/") + item.group("href")
    child_resp = requests.get(child_url)
    child_resp.encoding = 'gbk'
    # print(child_resp.text)
    result3 = obj3.search(child_resp.text)
    print(result3.group("movie"))
    print(result3.group("download"))
    child_resp.close()
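The child-page URL above is glued together with url.strip("/"), which works here only because every href starts with "/". A small sketch of the sturdier standard-library alternative, urllib.parse.urljoin; the sample hrefs are made up for illustration:

from urllib.parse import urljoin

base = "https://www.dy2018.com/"
# urljoin resolves relative hrefs against the base and leaves absolute URLs untouched
print(urljoin(base, "/i/12345.html"))               # -> https://www.dy2018.com/i/12345.html (hypothetical href)
print(urljoin(base, "https://example.com/a.html"))  # absolute hrefs pass through unchanged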
A toothache set off a headache today, brutal!!