Basic Python web scraping
A scraper based on BeautifulSoup:
Step 1: import the packages:
```python
import requests
from bs4 import BeautifulSoup
```
Step 2: disguise the request (set a User-Agent header):
```python
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0'}
```
You can find your own User-Agent in the browser: press F12 -> Network -> Request Headers.
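A quick way to confirm the header is actually being sent (my own addition; it uses the public httpbin echo service as an assumed test endpoint):

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 ...'}  # same header as above, truncated here
r = requests.get("https://httpbin.org/headers", headers=headers, timeout=10)
print(r.json()["headers"]["User-Agent"])     # echoes back the User-Agent the server received
```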
Step 3: fetch the page, set the encoding (just in case), and build the BeautifulSoup object:
```python
response = requests.get("", headers=headers)  # fill in the target URL
response.encoding = 'utf-8'
html = BeautifulSoup(response.text, "html.parser")
```
For the parser, the first one (`html.parser`) is enough.
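Optionally (this is not in the original steps), you can check the HTTP status and let requests guess the charset before parsing; a minimal sketch, with an assumed example URL:

```python
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 ...'}        # same header as above, truncated here
response = requests.get("https://example.com", headers=headers, timeout=10)  # placeholder URL
response.raise_for_status()                        # stop early on 4xx/5xx responses
response.encoding = response.apparent_encoding     # let requests guess the charset if the page is not UTF-8
html = BeautifulSoup(response.text, "html.parser")
```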
Step 4: inspect the page source of the site you want to scrape to decide what to search for:
```python
all_results = html.findAll("tag_name", attrs={'attribute_name': 'attribute_value'})  # findAll is the legacy alias of find_all
```
For example, on a cnblogs blog the post titles are `<a>` tags with the class `postTitle2 vertical-middle` (the original post shows this in a screenshot of the page source); those are the values used in the complete code below.
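An equivalent lookup (not used in the original post) is BeautifulSoup's CSS-selector API; a small sketch that reuses the `html` object from Step 3 and assumes the same cnblogs title links:

```python
# select() takes a CSS selector; a.postTitle2.vertical-middle matches <a> tags
# carrying both classes, i.e. the same elements findAll(...) returns above
all_results = html.select("a.postTitle2.vertical-middle")
for link in all_results:
    print(link.get_text(), link.get("href"))  # title text and the post URL
```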
Step 5: iterate over the results and print only the text inside each tag:
```python
for title in all_results:
    title1 = title.get_text()
    print(title1)
```
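If you want the titles cleaned and collected rather than just printed, `get_text(strip=True)` trims the surrounding whitespace; a small variation on the loop above, reusing `all_results`:

```python
# strip=True removes leading/trailing whitespace and newlines around the text
titles = [tag.get_text(strip=True) for tag in all_results]
print(titles)
```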
Example:
Randomly pick a lucky blogger as the target.
Complete code:
```python
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0'}

# loop over the pages
for i in range(1, 20):
    response = requests.get(f"https://www.cnblogs.com/xxxxxxxxx?page={i}", headers=headers)
    response.encoding = 'utf-8'
    html = BeautifulSoup(response.text, "html.parser")
    all_results = html.findAll("a", attrs={'class': 'postTitle2 vertical-middle'})
    for title in all_results:
        title1 = title.get_text()
        print(title1)
```
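One practical refinement (my addition, not part of the original code): pause between page requests and skip failed responses, so the loop is gentler on the site and does not try to parse error pages. A sketch using the same placeholder blog URL:

```python
import time

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0'}

for i in range(1, 20):
    response = requests.get(f"https://www.cnblogs.com/xxxxxxxxx?page={i}", headers=headers, timeout=10)
    if response.status_code != 200:          # skip pages that fail instead of parsing an error page
        print(f"page {i} returned {response.status_code}, skipping")
        continue
    response.encoding = 'utf-8'
    html = BeautifulSoup(response.text, "html.parser")
    for title in html.find_all("a", attrs={'class': 'postTitle2 vertical-middle'}):
        print(title.get_text(strip=True))
    time.sleep(1)                             # be polite: wait a second between pages
```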
Result: the post titles from pages 1 to 19 are printed one per line (output screenshot omitted).