Start with a single page:
import re

import requests
from lxml import etree

url = 'https://*********.com/top250?start=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Cookie': '3',
}
response = requests.get(url, headers=headers)
data = response.content.decode()  # bytes -> str (UTF-8 by default)
tree = etree.HTML(data)
# Each <div class="hd"> holds the title block of one movie
div_list = tree.xpath('//div[@class="hd"]')
# print(div_list)
for div in div_list:
    # Join every text node under the div, then strip all whitespace
    title = re.sub(r'\s', '', ''.join(div.xpath('.//text()')))
    print(title)
# Output
霸王别姬/再见,我的妾/FarewellMyConcubine[可播放]
阿甘正传/ForrestGump/福雷斯特·冈普[可播放]
泰坦尼克号/Titanic/铁达尼号(港/台)[可播放]
这个杀手不太冷/Léon/终极追杀令(台)/杀手莱昂[可播放]
千与千寻/千と千尋の神隠し/神隐少女(台)/千与千寻的神隐
美丽人生/Lavitaèbella/一个快乐的传说(港)/LifeIsBeautiful[可播放]
星际穿越/Interstellar/星际启示录(港)/星际效应(台)[可播放]
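
To see why each title comes out as one string joined by "/", it can help to run the same join-and-strip step on a small inline snippet. The sketch below is only an illustration under assumptions: the markup in SAMPLE is a hypothetical, simplified version of one list entry (the real page structure may differ), and the standard library's xml.etree.ElementTree stands in for lxml so that it runs offline without third-party packages. The helper name extract_titles is made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one list entry (an assumption, not
# copied from the real page). \xa0 is a non-breaking space.
SAMPLE = (
    '<ol>'
    '<li><div class="hd"><a href="#">\n'
    '  <span class="title">阿甘正传</span>\n'
    '  <span class="title">\xa0/\xa0Forrest Gump</span>\n'
    '  <span class="other">\xa0/\xa0福雷斯特·冈普</span>\n'
    '</a>\n'
    '<span class="playable">[可播放]</span></div></li>'
    '</ol>'
)


def extract_titles(markup):
    """Return one whitespace-free title string per <div class="hd">."""
    root = ET.fromstring(markup)
    titles = []
    for div in root.findall(".//div[@class='hd']"):
        # itertext() yields every text fragment under the div,
        # playing the role of the './/text()' XPath used with lxml.
        raw = ''.join(div.itertext())
        # In str patterns \s matches Unicode whitespace, including \xa0,
        # so newlines, indentation and non-breaking spaces all disappear.
        titles.append(re.sub(r'\s', '', raw))
    return titles


print(extract_titles(SAMPLE))
```

Note that stripping every whitespace character also removes the spaces inside English titles, which is why "Forrest Gump" shows up as "ForrestGump" in the output above. If that matters, something like re.sub(r'\s+', ' ', raw).strip() would keep single spaces instead.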