Using Regular Expressions with Web Crawlers
Crawler example 1:
# Yet another example
import requests  # import the HTTP library first; its functions cannot be called otherwise
import re

# What lets this request succeed is the changed User-Agent field
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.289 Safari/537.36"
}  # set the request headers to mimic a browser

response = requests.get("https://github.com/", headers=headers)  # GET request, passing the headers argument
print(response.text)  # print the full source of the page

pattern = '<div class="(.*?)">(.*?)</div>'  # regular expression
result = re.findall(pattern=pattern, string=response.text)
print(result)
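One thing to keep in mind with the pattern above: by default `.` does not match a newline, so a `<div>` whose content spans several lines is skipped by `re.findall`. Below is a minimal sketch of the difference that `re.DOTALL` makes. The HTML fragment is a made-up sample for illustration, not GitHub's actual markup.

```python
import re

# Hypothetical HTML fragment for illustration; not taken from github.com.
sample_html = '''<div class="one">single line</div>
<div class="two">
spans lines
</div>'''

pattern = '<div class="(.*?)">(.*?)</div>'

# Default flags: '.' stops at newlines, so only the single-line div matches.
single_line = re.findall(pattern, sample_html)

# re.DOTALL lets '.' match newlines too, so the multi-line div is also captured.
multi_line = re.findall(pattern, sample_html, flags=re.DOTALL)

print(single_line)
print(multi_line)
```

With two capture groups, `re.findall` returns a list of `(class, content)` tuples, which is why the results print as pairs.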
Whitelist of sites for Python crawlers: https://www.pythonanywhere.com/whitelist/
Crawler example 2:
# Yet another example
import requests  # import the HTTP library first; its functions cannot be called otherwise
import re

# What lets this request succeed is the changed User-Agent field
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}  # set the request headers to mimic a browser

response = requests.get("https://www.pythonanywhere.com/whitelist/", headers=headers)  # GET request, passing the headers argument
# print(response.text)  # print the full source of the page

pattern1 = '<td style="width:20ex;">(.*?)</td>'  # regular expression
# pattern2 = '<link rel="(.*?)">'
result1 = re.findall(pattern=pattern1, string=response.text)
# result2 = re.findall(pattern=pattern2, string=response.text)
# print(result1)
# print()
# print(result2)
for res in result1:
    print(res)  # one captured table cell per line
Output:
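The loop in example 2 prints one captured cell per line. As a rough offline illustration, the same pattern can be run against an inline string. The table fragment and the domain names below are hypothetical and are not copied from the whitelist page, whose real markup may differ.

```python
import re

# Hypothetical table fragment; the real whitelist page may be structured differently.
sample_html = (
    '<tr><td style="width:20ex;">example.com</td><td>note</td></tr>'
    '<tr><td style="width:20ex;">example.org</td><td>note</td></tr>'
)

# Compile once so the same pattern can be reused on several pages.
cell_pattern = re.compile(r'<td style="width:20ex;">(.*?)</td>')

# With a single capture group, findall returns a plain list of strings.
domains = cell_pattern.findall(sample_html)
for domain in domains:
    print(domain)
```

Because the pattern depends on the exact `style` attribute, it stops matching as soon as the page's inline styling changes, so an empty result is worth checking for.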
From: https://www.cnblogs.com/longlyseul/p/18123942