Scraping Qiushibaike hot images with regular expressions:
Step 1: find the URL.
Inspecting the request headers shows this is a GET request.
Now let's write the code:
import requests
import re, os, time
First we import the libraries above.
If any of them are missing, install them with pip install; a domestic PyPI mirror can speed up the download.
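For example, installing the one third-party dependency from a domestic mirror (the Tsinghua mirror URL below is just one common choice; any PyPI mirror works):

```shell
# requests is the only third-party package; re, os, and time are in the standard library.
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple
```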
The main function:
def get_url(page):
    url = "https://www.qiushibaike.com/imgrank/page/{}/".format(page)
    # url = "https://www.qiushibaike.com/imgrank/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
    }
    response = requests.get(url, headers=headers).text
This fetches the HTML of the whole page.
Next we use regular expressions.
We can first copy the page's elements to a local string and test the regex against it, for example:
import re
html = '<div class="content"><span>妈妈带狗子出门,忘记带雨衣了——FB:J L</span></div>'
res = '<div class="content".*?<span>(.*?)</span>.*?</div>'
resp = re.findall(res, html, re.S)
print(resp)
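A quick sketch of why the `re.S` flag matters here: real page source spans multiple lines, and without `re.S` the `.` in the pattern does not match newlines (the HTML below is made up for the demonstration):

```python
import re

# Hypothetical multi-line snippet, like real page source.
html = '<div class="content">\n<span>\nhello world\n</span>\n</div>'
pattern = '<div class="content".*?<span>(.*?)</span>.*?</div>'

print(re.findall(pattern, html))        # [] -- '.' stops at newlines
print(re.findall(pattern, html, re.S))  # ['\nhello world\n']
```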
Once the patterns work locally, we apply them to the page.
Here are the regexes:
    res = '<div class="thumb".*?<img src="(.*?)" alt.*?</div>'
    resp = '<div class="content".*?<span>(.*?)</span>.*?</div>'
    picture = re.findall(res, response, re.S)
    title = re.findall(resp, response, re.S)
    # print(title)
    # print(picture)  # print to test
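To sanity-check the image pattern, we can run it against a hand-written thumb block (the HTML below is a made-up approximation of the page's structure):

```python
import re

# Hypothetical markup mimicking one "thumb" block on the page.
html = ('<div class="thumb">'
        '<a href="/article/1">'
        '<img src="//pic.qiushibaike.com/system/pictures/1.jpg" alt="demo">'
        '</a></div>')
res = '<div class="thumb".*?<img src="(.*?)" alt.*?</div>'
print(re.findall(res, html, re.S))  # ['//pic.qiushibaike.com/system/pictures/1.jpg']
```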
    for img, title in zip(picture, title):
        img_url = "https:" + img
        title_name = title.strip()
        # print(title)
        # print(img_url, title)
        time.sleep(1)
        response = requests.get(url=img_url, headers=headers)
        with open(path + '/%s.jpg' % title_name, 'wb') as f:  # path is defined in the full code below
            f.write(response.content)
        print(title + " image downloaded!")
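One caveat with using the post title as a filename: titles can contain characters that are illegal in filenames (`/`, `:`, `?`, and so on), which would make the `open()` call fail. A minimal sketch of a sanitizer (the `safe_name` helper is my own addition, not part of the original script):

```python
import re

def safe_name(title):
    # Collapse whitespace and filename-illegal characters into '_'.
    return re.sub(r'[\\/:*?"<>|\s]+', '_', title.strip())

print(safe_name('demo: a/b title'))  # demo_a_b_title
```

With this helper, `title_name = safe_name(title)` would replace the plain `strip()` call.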
Full code:
import requests
import re, os, time

def get_url(page):
    path = "./糗事百科"
    url = "https://www.qiushibaike.com/imgrank/page/{}/".format(page)
    # url = "https://www.qiushibaike.com/imgrank/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
    }
    response = requests.get(url, headers=headers).text
    res = '<div class="thumb".*?<img src="(.*?)" alt.*?</div>'
    resp = '<div class="content".*?<span>(.*?)</span>.*?</div>'
    picture = re.findall(res, response, re.S)
    title = re.findall(resp, response, re.S)
    # print(title)
    # print(picture)
    if not os.path.exists(path):
        os.mkdir(path)
    for img, title in zip(picture, title):
        img_url = "https:" + img
        title_name = title.strip()
        # print(title)
        # print(img_url, title)
        time.sleep(1)
        response = requests.get(url=img_url, headers=headers)
        with open(path + '/%s.jpg' % title_name, 'wb') as f:
            f.write(response.content)
        print(title + " image downloaded!")

if __name__ == '__main__':
    for i in range(1, 13):
        time.sleep(2)
        get_url(i)
The for loop above handles pagination.
That completes the regex-based Qiushibaike scraper.
Result screenshot: