Steps
First, use pip to install the required requests and beautifulsoup4 libraries (if pip downloads slowly, you can point it at a domestic mirror index instead of the default PyPI site).
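For example, a typical install looks like the commands below; the Tsinghua mirror URL is just one common choice, shown here as an illustration, and any PyPI mirror works the same way:

pip install requests beautifulsoup4
# if the default index is slow, switch to a domestic mirror, e.g.:
pip install requests beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple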
Then analyze the Maoyan movie site: use requests to fetch each page of the board, use BeautifulSoup to locate and extract the movie names, and finally write the output to a txt file.
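The extraction step relies on the fact that each movie title on the board page sits inside a <p class="name"> element. A minimal sketch of that parsing logic on a static snippet (the HTML below is a simplified stand-in, not Maoyan's actual markup):

from bs4 import BeautifulSoup

# simplified stand-in for one entry of the board page
html = '<dd><p class="name"><a href="/films/1">霸王别姬</a></p></dd>'
soup = BeautifulSoup(html, 'html.parser')
for p in soup.find_all('p', {'class': 'name'}):
    print(p.text)  # prints the movie title inside the <a> tag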
The implementation is as follows:
import requests
from bs4 import BeautifulSoup

def movie(url):
    # browser-like headers make the request look less like a bot
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/58.0.3029.110 Safari/537.3",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
    }
    # print("craw html:", url)
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    # each movie title on the board page sits in a <p class="name"> element
    outputs = soup.find_all('p', {'class': 'name'})
    # print(outputs)
    # append mode, so titles from all ten pages accumulate in one file
    with open("output.txt", "a", encoding="utf-8") as file:
        for output in outputs:
            file.write(output.text + "\n")

# the Top 100 board shows 10 movies per page; the offset parameter steps by 10
for a in range(0, 100, 10):
    url = f"https://www.maoyan.com/board/4?offset={a}"
    movie(url)
Problems encountered
The board is paginated, so you have to work out the pattern in each page's URL and loop over it, fetching the contents of all 10 pages in turn to complete the output. Sometimes the scrape returns nothing, or only part of the content; in that case you need to open the site in a browser, pass its verification check, and then scrape again (see the sketch below).
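One way to cope with those empty responses is to treat a page with no <p class="name"> elements as a sign that Maoyan served its verification page, and retry after a delay. A minimal sketch of that idea, assuming the same page structure as above; movie_with_retry and the retry/backoff values are hypothetical, not part of the original script:

import time
import requests
from bs4 import BeautifulSoup

def movie_with_retry(url, retries=3):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/58.0.3029.110 Safari/537.3"
    }
    for attempt in range(retries):
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'html.parser')
        names = soup.find_all('p', {'class': 'name'})
        if names:
            # real content came back
            return [n.text for n in names]
        # an empty result usually means the anti-crawler page was served;
        # back off and try again (a manual browser verification may still be needed)
        time.sleep(2 * (attempt + 1))
    return []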