Hands-on
There are two ways to change the User-Agent of a request:
Method 1:
Add the USER_AGENT setting directly in settings.py, for example:
USER_AGENT = 'scrapyhttpbindemo (+http://www.yourdomain.com)'
Method 2:
# 1. Define the following class in middlewares.py
import random


class RandomUserAgentMiddleware(object):
    def __init__(self):
        # Pool of User-Agent strings to choose from
        self.user_agents = [
            'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36',
            'jemter'
        ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a random entry for every request
        request.headers['User-Agent'] = random.choice(self.user_agents)

# 2. Add the following to settings.py
DOWNLOADER_MIDDLEWARES = {
'scrapyhttpbindemo.middlewares.RandomUserAgentMiddleware': 543,
}
# before:
'''
text: {
  "args": {},
  "data": "{\"age\": \"26\", \"name\": \"wangcai\"}",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Content-Length": "32",
    "Content-Type": "application/json",
    "Host": "www.httpbin.org",
    "User-Agent": "Scrapy/2.7.1 (+https://scrapy.org)",
    "X-Amzn-Trace-Id": "Root=1-63953c12-0558fede42d9a64d5a632a8b"
  },
  "json": {
    "age": "26",
    "name": "wangcai"
  },
  "origin": "120.229.34.25",
  "url": "https://www.httpbin.org/post"
}
'''
# after: the User-Agent has changed:
'''
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Host": "www.httpbin.org",
    "User-Agent": "jemter",
    "X-Amzn-Trace-Id": "Root=1-6395d166-219f615432bf128f1bcac524"
  },
  "origin": "120.229.34.25",
  "url": "https://www.httpbin.org/get"
}
'''
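Summary:
Method 1 is generally recommended because it is simpler. If you need more flexibility, for example a different User-Agent for each request, use a Downloader Middleware.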
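
As one illustration of that flexibility, the middleware could read its User-Agent pool from settings.py instead of hard-coding it. This is only a minimal sketch, not part of the original project: the setting name USER_AGENT_LIST is a hypothetical name chosen for this example, while from_crawler and settings.getlist are the standard Scrapy hooks for passing settings into a middleware.

```python
import random


class RandomUserAgentMiddleware:
    """Pick a random User-Agent per request from a configurable pool."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a hypothetical setting name used only in this sketch;
        # it would be defined in settings.py as a list of strings.
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # With an empty pool, leave the header untouched so the
        # USER_AGENT setting (or Scrapy's default) still applies.
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue processing the request.
        return None
```

With this variant, settings.py would carry both the DOWNLOADER_MIDDLEWARES entry shown earlier and a USER_AGENT_LIST list, so the pool can be changed without touching middlewares.py.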