Web crawler overview
Crawlers
Good: a crawler that does not damage the resources of the site it crawls
Evil: a crawler that interferes with a site's normal operation (grabbing tickets, sniping flash sales, taking the site's resources down)
The spear and shield of crawling
Anti-crawling mechanisms: measures a site deploys to keep crawlers out
Anti-anti-crawling strategies: the countermeasures a crawler uses to get around those mechanisms
The robots.txt protocol: a "gentleman's agreement" in which a site states which pages may and may not be crawled
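Python's standard library can check a site's robots.txt before crawling; a minimal sketch using urllib.robotparser (the Baidu URL is only an illustration):

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt
rp = RobotFileParser("https://www.baidu.com/robots.txt")
rp.read()
# Ask whether an arbitrary User-Agent may fetch a given path
print(rp.can_fetch("*", "https://www.baidu.com/s?wd=python"))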
The first crawler
from urllib.request import urlopen

url = "http://www.baidu.com"
resp = urlopen(url)
# Read the raw bytes of the page and decode them as UTF-8
content = resp.read().decode("utf-8")

# Save the page source to a local file, then read it back to check the result
open("test.html", mode="w", encoding="utf-8").write(content)
print(open("test.html", mode="r", encoding="utf-8").read())
The requests module
requests has to be installed first:
pip install requests
If the download is too slow, you can point pip at a different package index (a mirror).
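For example, pip's -i / --index-url option selects another index; the Tsinghua mirror below is just one commonly used choice, not the only option:

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple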
# requests test
import requests

url = "http://www.baidu.com"
rest = requests.get(url)
# Tell requests how to decode the body before reading .text
rest.encoding = "utf-8"
print(rest.text)
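Instead of hard-coding "utf-8", the response can report its status and guess its own encoding; a minimal sketch of the same request:

import requests

url = "http://www.baidu.com"
rest = requests.get(url)
print(rest.status_code)  # 200 means the request succeeded
# Let requests infer the encoding from the body instead of hard-coding it
rest.encoding = rest.apparent_encoding
print(rest.text[:200])   # first part of the page source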
GET requests
import requests

content = input("Enter what you want to search for: ")
url = f"https://www.sogou.com/web?query={content}"
headers = {
    # Add a User-Agent header so the request looks like it comes from a browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.39"
}
rest = requests.get(url, headers=headers)
print(rest.text)
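Instead of splicing the query into the URL with an f-string, requests can build and URL-encode the query string itself through the params argument; a sketch of the same search:

import requests

content = input("Enter what you want to search for: ")
url = "https://www.sogou.com/web"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.39"
}
# requests appends and URL-encodes the query string for us
rest = requests.get(url, params={"query": content}, headers=headers)
print(rest.request.url)  # the fully assembled URL
print(rest.text)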
POST requests
import requests

url = "https://fanyi.baidu.com/sug"
# The form field name "kw" is what this endpoint expects
content = {
    "kw": input("Enter the word to translate: ")
}
# data= sends the dictionary as a form-encoded request body
rest = requests.post(url, data=content)
print(rest.json())
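rest.json() fails when the body is not valid JSON; a hedged sketch that checks the Content-Type header first (the "spider" test word is just a placeholder, not part of the original post):

import requests

url = "https://fanyi.baidu.com/sug"
content = {"kw": "spider"}  # placeholder word instead of input()
rest = requests.post(url, data=content)

# Only parse the body as JSON when the server says it sent JSON
if "application/json" in rest.headers.get("Content-Type", ""):
    print(rest.json())
else:
    print(rest.status_code, rest.text[:200])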