process_start_requests
spider:
import scrapy
from scrapy import Request
from scrapyspidermiddlerwaredemo.items import DemoItem


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['www.httpbin.org']
    start_url = 'https://www.httpbin.org/get'

    def start_requests(self):
        # Issue five start requests, each with a different query value
        for i in range(5):
            url = f'{self.start_url}?query={i}'
            yield Request(url, callback=self.parse)

    def parse(self, response):
        print(response.text)
process_start_requests: processes the spider's start requests
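class CustomizeMiddlerware(object):
    def process_start_requests(self, start_requests, spider):
        '''
        Process the spider's start requests.
        :param start_requests: iterable of the spider's start requests
        :param spider: the spider the requests belong to
        :return: iterable of (modified) requests
        '''
        for request in start_requests:
            # The start URLs already carry ?query=..., so append with '&'
            url = request.url
            url += '&name=tom'
            # replace() returns a new Request with the updated URL
            request = request.replace(url=url)
            yield request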
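For the middleware to take effect it has to be enabled in the project settings. A minimal sketch, assuming the class lives in a middlewares.py module inside the scrapyspidermiddlerwaredemo package (the module path and the priority value 543 are assumptions, not taken from the original post):

```python
# settings.py (sketch)
# Assumption: CustomizeMiddlerware is defined in
# scrapyspidermiddlerwaredemo/middlewares.py.
SPIDER_MIDDLEWARES = {
    # The number sets the middleware's position in the chain;
    # 543 is only an example value.
    'scrapyspidermiddlerwaredemo.middlewares.CustomizeMiddlerware': 543,
}
```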
Output: the request URLs have been modified.
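Appending '&name=tom' by string concatenation only works because every start URL already contains a query string. A slightly more defensive sketch using the standard library is shown here; the helper name add_query_param is hypothetical and not part of Scrapy or the original post:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_query_param(url, key, value):
    """Return url with key=value appended to its query string."""
    parts = urlsplit(url)
    # Keep the existing parameters (including blank ones) and add the new one
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Works whether or not the URL already has a query string
with_query = add_query_param('https://www.httpbin.org/get?query=0', 'name', 'tom')
without_query = add_query_param('https://www.httpbin.org/get', 'name', 'tom')
print(with_query)
print(without_query)
```

Inside the middleware, this could be used as request.replace(url=add_query_param(request.url, 'name', 'tom')).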
process_spider_input
HttpbinSpider - parse() method:
    def parse(self, response):
        print(f'response status:{response.status}')
        print(response.text)
CustomizeMiddlerware - process_spider_input:
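class CustomizeMiddlerware(object):
    def process_start_requests(self, start_requests, spider):
        '''
        Process the spider's start requests.
        :param start_requests: iterable of the spider's start requests
        :param spider: the spider the requests belong to
        :return: iterable of (modified) requests
        '''
        for request in start_requests:
            url = request.url
            url += '&name=tom'
            request = request.replace(url=url)
            yield request

    def process_spider_input(self, response, spider):
        '''
        Process the Response: change its status code.
        :return: None, so the response continues on to the spider
        '''
        response.status = 201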
Output: the response status code has been modified.
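process_spider_input is expected to return None (or raise an exception); any change is made on the response object itself. The sketch below exercises the same method outside of Scrapy. The SimpleNamespace object is only a stand-in for a real response, on the assumption that nothing but its status attribute matters here:

```python
from types import SimpleNamespace


class CustomizeMiddlerware:
    def process_spider_input(self, response, spider):
        # Mutate the response in place; the implicit None return
        # lets processing continue to the spider callback.
        response.status = 201


# Stand-in for a Scrapy response (assumption: only .status is needed)
fake_response = SimpleNamespace(status=200)
returned = CustomizeMiddlerware().process_spider_input(fake_response, spider=None)
print(fake_response.status, returned)
```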
process_spider_output
parse:
    def parse(self, response):
        print(f'response status:{response.status}')
        # Build the item from the JSON body returned by httpbin
        item = DemoItem(**response.json())
        yield item
CustomizeMiddlerware - process_spider_output:
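class CustomizeMiddlerware(object):
    def process_spider_output(self, response, result, spider):
        '''
        Process the Item objects returned by the spider.
        :param response: the response the spider processed
        :param result: iterable of items and requests yielded by the spider
        :param spider: the spider that produced the result
        :return: iterable of items and requests
        '''
        for i in result:
            if isinstance(i, DemoItem):
                i['origin'] = None
            # Yield every result so that non-item results are not dropped
            yield i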
Output: the origin field of each DemoItem is now None.
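The behaviour of process_spider_output can be checked without running a crawl. In this sketch DemoItem is a stand-in dict subclass (an assumption for illustration; the real class comes from the project's items module), and a plain string plays the role of a non-item result such as a request:

```python
class DemoItem(dict):
    """Stand-in for the project's DemoItem (assumption: dict-like access is enough)."""


class CustomizeMiddlerware:
    def process_spider_output(self, response, result, spider):
        for i in result:
            if isinstance(i, DemoItem):
                i['origin'] = None
            # Yield everything, so non-item results pass through untouched
            yield i


results = [
    DemoItem(origin='1.2.3.4', url='https://www.httpbin.org/get'),
    'not-an-item',
]
processed = list(CustomizeMiddlerware().process_spider_output(None, results, None))
print(processed)
```

Keeping the yield outside the isinstance check matters: if it sat inside the if block, anything that is not a DemoItem would silently disappear from the spider's output.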
Tags: process, url, Middlerware, Spider, spider, start, Scrapy, requests, response
From: https://www.cnblogs.com/czzz/p/16988344.html