Scrapy Middleware
Learning objectives:
- Apply: using middleware in Scrapy to set a random User-Agent
- Understand: using proxy IPs in Scrapy
1. Classification and role of Scrapy middleware
1.1 Types of Scrapy middleware
Depending on where it sits in the Scrapy data flow, middleware is divided into:
- Downloader middleware
- Spider middleware
1.2 The role of Scrapy middleware
- The main job is to do extra processing while the crawl runs, for example retrying non-200 responses by rebuilding the Request object and yielding it back to the engine (a minimal sketch of this idea follows at the end of this section).
- It can also swap out or otherwise process headers and cookies.
- Any other behaviour required by the business can be implemented here as well.
By default, Scrapy keeps both kinds of middleware in the single file middlewares.py.
Spider middleware is used in the same way as downloader middleware; downloader middleware is the one you will use most often.
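As an illustration of the retry idea mentioned above, here is a minimal sketch (not part of the project built below; the class name is made up for the example) of a downloader middleware whose process_response re-issues any non-200 response:

class RetryNon200Middleware:
    def process_response(self, request, response, spider):
        if response.status != 200:
            # rebuild the request; returning a Request from process_response
            # sends it back to the scheduler instead of the spider
            new_request = request.copy()
            new_request.dont_filter = True  # the URL was already requested, so skip the dedup filter
            return new_request
        return response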
2. How to use downloader middleware
Next we will improve the spider and use a downloader middleware to learn how middleware works. Writing a Downloader Middleware is just like writing a pipeline: define a class, then enable it in settings.
Default methods of a Downloader Middleware: in the middleware class you sometimes need to override the methods that handle the request or the response.
- process_request(self, request, spider): [this is the most commonly used method]
  - Called for every request that passes through the downloader middleware.
  - Return None: the request continues (no return statement also means None); the request object is passed on to the downloader, or via the engine to the process_request methods of the other downloader middlewares. [If every downloader middleware returns None, the request is finally handed to the downloader.]
  - Return a Response object: no download happens; the response is handed back to the engine.
  - Return a Request object: the request object is handed to the scheduler to be requested later.
- process_response(self, request, response, spider):
  - Called when the downloader finishes the HTTP request and passes the response to the engine.
  - Return a Response: it is passed, via the engine, to the spider for parsing or to the process_response methods of the other downloader middlewares.
  - Return a Request object: it is handed to the scheduler to be requested again; the remaining process_response methods are not called.
- process_exception(self, request, exception, spider):
  - Called when an exception is raised while handling the request.
  - For example, if the current request has been detected as a crawler, you can switch to a proxy:

    def process_exception(self, request, exception, spider):
        request.meta['proxy'] = 'http://ip:port'  # placeholder for a real proxy address
        request.dont_filter = True  # requests are deduplicated by default; this one was already sent, so mark it as not to be filtered
        return request  # re-issue the corrected request
- Enable the middleware in settings.py; the smaller the priority value, the earlier it runs [registered the same way as pipelines].
- The spider parameter is the instance of the spider class; you can use it to access the spider's attributes, e.g. spider.name (a short sketch follows below).
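For example, a minimal sketch (the class name and the header value are purely illustrative) that uses the spider argument to apply a tweak only to one spider:

class OnlyForWyMiddleware:
    def process_request(self, request, spider):
        # the spider argument is the running spider instance, so its attributes are available here
        if spider.name == 'wy':
            request.headers['Referer'] = 'http://news.163.com/'
        return None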
2.1 A simple use of the middleware
spiders.py (spider code):
class WySpider(scrapy.Spider):
    name = 'wy'
    start_urls = ['http://www.baidu.com/']  # a valid URL
    # start_urls = ['http://www.baidu123.com/']  # an invalid URL

    def parse(self, response, **kwargs):
        pass
settings.py (enable the middleware)
DOWNLOADER_MIDDLEWARES = {
    'wangyi.middlewares.WangyiDownloaderMiddleware': 543,
}
middlewares.py
from scrapy import signals


class WangyiDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        print('process_request')
        return None

    def process_response(self, request, response, spider):
        print('process_response')
        return response

    def process_exception(self, request, exception, spider):
        print('process_exception')
        return request

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Run it and watch the output.
Now modify spiders.py to use the invalid URL and run it again: you will see it loop endlessly, because process_exception keeps returning the same request (one way to cap this is sketched below).
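The loop happens because every failed download raises an exception, process_exception returns the request, the request fails again, and so on. A minimal sketch of one way to cap the retries, replacing process_exception in the middleware above and using a made-up meta key (my_retry_times is not a Scrapy built-in):

    def process_exception(self, request, exception, spider):
        retries = request.meta.get('my_retry_times', 0)  # how often this request was already retried
        if retries < 3:
            new_request = request.copy()
            new_request.meta['my_retry_times'] = retries + 1
            new_request.dont_filter = True
            return new_request
        # give up: returning None lets Scrapy's default exception handling take over
        return None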
3. A downloader middleware that sets a random User-Agent
3.1 Add a list of User-Agents in settings.py
USER_AGENTS_LIST = [
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
]
3.2 Fill in the code in middlewares.py
from scrapy import signals
from scrapy.http import HtmlResponse
from wangyi.settings import USER_AGENTS_LIST
import random
import time


class WangyiDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # set a random User-Agent
        ua = random.choice(USER_AGENTS_LIST)
        request.headers['User-Agent'] = ua
        return None

    def process_response(self, request, response, spider):
        # get the selenium driver created in the spider
        driver = spider.driver
        # only render with selenium when the current url is one of the 国内/国际 listing pages
        if request.url in spider.page_url:
            driver.get(request.url)
            driver.execute_script('window.scrollBy(0, document.body.scrollHeight)')
            time.sleep(1)
            driver.execute_script('window.scrollBy(0, document.body.scrollHeight)')
            time.sleep(1)
            text = driver.page_source  # the fully rendered page
            # replace the original response with the rendered one
            return HtmlResponse(url=request.url, body=text, encoding='UTF-8', request=request)
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
3.3 The spider code in wy.py
import scrapy
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options


class WySpider(scrapy.Spider):
    name = 'wy'
    # allowed_domains = ['news.163.com/domestic']
    start_urls = ['http://news.163.com/domestic/']

    # selenium setup: all Chrome arguments go into one Options object
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument('--disable-gpu')
    # make the browser harder to detect as automation
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
    driver = Chrome(options=chrome_options)

    href_index = [1, 2]  # only the links at index 1 and 2 are the 国内/国际 urls
    page_url = []  # the 国内/国际 urls that selenium should render

    def parse(self, response, **kwargs):
        # grab the 国内/国际 urls
        href_list = response.xpath('/html/body/div/div[3]/div[2]/div[2]/div/ul/li/a/@href').extract()
        for i in range(len(href_list)):
            # only keep the link if its index is in href_index, i.e. the 国内/国际 urls
            if i in self.href_index:
                url = href_list[i]
                self.page_url.append(url)
                yield scrapy.Request(url, callback=self.parse_detail)

    # handle the 国内/国际 listing pages
    def parse_detail(self, response, **kwargs):
        print(response.text)
        detail_url = response.xpath('/html/body/div/div[3]/div[3]/div[1]/div[1]/div/ul/li/div/div/a/@href').extract()
        for url in detail_url:
            yield scrapy.Request(url, callback=self.parse_detail_con)

    def parse_detail_con(self, response, **kwargs):
        title = response.xpath('//*[@id="container"]/div[1]/h1/text()').extract_first()
        con = response.xpath('//*[@id="content"]/div[2]//text()').extract_first()
        data = {'title': title, 'con': con}
        print(data)
        yield data
3.4 Enable the custom downloader middleware in settings; this works the same way as for pipelines.
DOWNLOADER_MIDDLEWARES = {
    'wangyi.middlewares.WangyiDownloaderMiddleware': 543,
}
3.5 Run the spider and observe what happens.
4. Using proxy IPs
4.1 Approach
- Where to add the proxy: add a proxy field to request.meta.
- Get a proxy IP and assign it to request.meta['proxy']; the proxy can come from
  - a proxy pool, picked at random, or
  - a web API of a proxy provider, requested on the fly (a sketch of the web-API variant follows this list).
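A minimal sketch of the web-API variant; the URL and the response format are assumptions, since every provider returns proxies differently:

import requests

def get_proxy_from_api():
    # hypothetical extraction API that returns one 'ip:port' per request
    resp = requests.get('http://proxy-provider.example.com/get_ip')
    return 'http://' + resp.text.strip()

class ApiProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = get_proxy_from_api()
        return None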
4.2 Implementation
import random

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # proxies can be defined in settings.py or fetched from a proxy provider's web API
        proxy = random.choice(proxies)
        # proxy = 'http://192.168.1.1:8118'
        request.meta['proxy'] = proxy
        return None  # the return statement can also be omitted
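If the list lives in settings.py, a minimal sketch could look like this (the setting name PROXY_LIST is an assumption, not a Scrapy built-in):

# settings.py
PROXY_LIST = [
    'http://192.168.1.1:8118',
    'http://192.168.1.2:8118',
]

# middlewares.py
import random
from wangyi.settings import PROXY_LIST

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXY_LIST)
        return None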
4.3 Checking whether a proxy IP works
When proxies are in use, you can check how the proxy performed in the downloader middleware's process_response() method; if the proxy cannot be used, switch to a different one (a fuller sketch follows the code below).
class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        # response.status is an int, so compare with numbers, not strings
        if response.status != 200 and response.status != 302:
            # deal with the proxy here, e.g. remove it from the pool
            return request
        return response
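A slightly fuller sketch of that idea; the PROXY_LIST setting and the retry details are assumptions for illustration:

import random
from wangyi.settings import PROXY_LIST

class ProxyMiddleware(object):
    def __init__(self):
        self.proxies = list(PROXY_LIST)

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxies)
        return None

    def process_response(self, request, response, spider):
        if response.status not in (200, 302):
            bad = request.meta.get('proxy')
            if bad in self.proxies and len(self.proxies) > 1:
                self.proxies.remove(bad)  # drop the failing proxy from the pool
            new_request = request.copy()
            new_request.dont_filter = True  # already requested once, so skip the dedup filter
            return new_request  # re-schedule; process_request will pick a fresh proxy
        return response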
4.4 Buying and using Kuaidaili (快代理)
- Open the site and click 购买代理 (buy proxy).
- Choose the type of proxy you want to buy.
- Taking the tunnel proxy (隧道代理) as an example, click buy.
- After buying, click 文档中心 (documentation center).
- In the docs, select 隧道代理 (tunnel proxy).
- Scroll down and pick the module you will use the proxy with; since we use the tunnel from Scrapy, choose Python-Scrapy.
- Find middlewares.py in their sample code.
- Copy the middleware class from the sample into your own project's middlewares.py.
- Follow their steps to enable the middleware and fill in your username, password and other details (a generic sketch of what such a middleware does follows below).
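The vendor's sample code is the authoritative version, but in general a tunnel-proxy middleware just points request.meta['proxy'] at the tunnel endpoint and adds proxy authentication. A generic sketch with placeholder host, username and password (all assumptions):

from w3lib.http import basic_auth_header

class TunnelProxyMiddleware(object):
    # placeholders: replace with the tunnel address and account details from your provider
    proxy_url = 'http://tunnel.example.com:15818'
    username = 'your_username'
    password = 'your_password'

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url
        # HTTP Basic auth for the proxy
        request.headers['Proxy-Authorization'] = basic_auth_header(self.username, self.password)
        return None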
Summary
Using middleware:
- Fill in the middleware code:
  - process_request(self, request, spider):
    - Called for every request that passes through the downloader middleware.
    - Return None: the request continues.
    - Return a Response object: no download; the response is returned to the engine.
    - Return a Request object: the request is handed to the scheduler to be requested later.
  - process_response(self, request, response, spider):
    - Called when the downloader finishes the HTTP request and passes the response to the engine.
    - Return a Response: passed on to the remaining process_response methods and then to the spider.
    - Return a Request object: handed to the scheduler to be requested again.
- Enable the middleware in settings.py via DOWNLOADER_MIDDLEWARES = {...}.
5. Crawling NetEase news (网易新闻)
5.1 Setting up
- scrapy startproject wangyi
- cd wangyi
- scrapy genspider wy https://news.163.com/
5.2 Analysis before crawling
We want to crawl the 国内 (domestic), 国际 (international), 军事 (military) and 航空 (aviation) sections.
- Analysis: the data for 国内 and the other sections is loaded dynamically; it does not come back with the initial request.
- There are two ways to solve this:
  - use selenium together with the spider to render the page and extract the data (the approach used below), or
  - find the URL that loads the dynamic data and crawl it directly; you can paste that URL into the browser and request it to confirm that it returns the data.
5.3 Code and configuration
- The settings file, settings.py:
# Scrapy settings for wangyi project
BOT_NAME = 'wangyi'
SPIDER_MODULES = ['wangyi.spiders']
NEWSPIDER_MODULE = 'wangyi.spiders'

# default request header
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'

# list used for swapping in a random request header
USER_AGENTS_LIST = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
]

LOG_LEVEL = 'ERROR'
ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    'wangyi.middlewares.WangyiDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'wangyi.pipelines.WangyiPipeline': 300,
}
- The spider code, wy.py:
import scrapy
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options


class WySpider(scrapy.Spider):
    name = 'wy'
    # allowed_domains = ['news.163.com/domestic']
    start_urls = ['http://news.163.com/domestic/']

    # selenium setup: all Chrome arguments go into one Options object
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument('--disable-gpu')
    # make the browser harder to detect as automation
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
    driver = Chrome(options=chrome_options)

    href_index = [1, 2]  # only the links at index 1 and 2 are the 国内/国际 urls
    page_url = []  # the 国内/国际 urls that selenium should render

    def parse(self, response, **kwargs):
        # grab the 国内/国际 urls
        href_list = response.xpath('/html/body/div/div[3]/div[2]/div[2]/div/ul/li/a/@href').extract()
        for i in range(len(href_list)):
            # only keep the link if its index is in href_index, i.e. the 国内/国际 urls
            if i in self.href_index:
                url = href_list[i]
                self.page_url.append(url)
                yield scrapy.Request(url, callback=self.parse_detail)

    # handle the 国内/国际 listing pages
    def parse_detail(self, response, **kwargs):
        print(response.text)
        detail_url = response.xpath('/html/body/div/div[3]/div[3]/div[1]/div[1]/div/ul/li/div/div/a/@href').extract()
        for url in detail_url:
            yield scrapy.Request(url, callback=self.parse_detail_con)

    def parse_detail_con(self, response, **kwargs):
        title = response.xpath('//*[@id="container"]/div[1]/h1/text()').extract_first()
        con = response.xpath('//*[@id="content"]/div[2]//text()').extract_first()
        data = {'title': title, 'con': con}
        print(data)
        yield data
- middlewares.py:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse


class WangyiSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


from wangyi.settings import USER_AGENTS_LIST
import random
import time


class WangyiDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # set a random User-Agent
        ua = random.choice(USER_AGENTS_LIST)
        request.headers['User-Agent'] = ua
        return None

    def process_response(self, request, response, spider):
        # get the selenium driver created in the spider
        driver = spider.driver
        # only render with selenium when the current url is one of the 国内/国际 listing pages
        if request.url in spider.page_url:
            driver.get(request.url)
            driver.execute_script('window.scrollBy(0, document.body.scrollHeight)')
            time.sleep(1)
            driver.execute_script('window.scrollBy(0, document.body.scrollHeight)')
            time.sleep(1)
            text = driver.page_source  # the fully rendered page
            # replace the original response with the rendered one
            return HtmlResponse(url=request.url, body=text, encoding='UTF-8', request=request)
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
# Define here the models for your spider middleware # # See documentation in: # https://docs.scrapy.org/en/latest/topics/spider-middleware.html from scrapy import signals # useful for handling different item types with a single interface from itemadapter import is_item, ItemAdapter from scrapy.http import HtmlResponse class WangyiSpiderMiddleware: # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the spider middleware does not modify the # passed objects. @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s def process_spider_input(self, response, spider): # Called for each response that goes through the spider # middleware and into the spider. # Should return None or raise an exception. return None def process_spider_output(self, response, result, spider): # Called with the results returned from the Spider, after # it has processed the response. # Must return an iterable of Request, or item objects. for i in result: yield i def process_spider_exception(self, response, exception, spider): # Called when a spider or process_spider_input() method # (from other spider middleware) raises an exception. # Should return either None or an iterable of Request or item objects. pass def process_start_requests(self, start_requests, spider): # Called with the start requests of the spider, and works # similarly to the process_spider_output() method, except # that it doesn’t have a response associated. # Must return only requests (not items). for r in start_requests: yield r def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name) from wangyi.settings import USER_AGENTS_LIST import random import time class WangyiDownloaderMiddleware: # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the downloader middleware does not modify the # passed objects. @classmethod def from_crawler(cls, crawler): # This method is used by Scrapy to create your spiders. s = cls() crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) return s def process_request(self, request, spider): # 设定随机UA ua = random.choice(USER_AGENTS_LIST) request.headers['User-Agent'] = ua return None def process_response(self, request, response, spider): # 获取selenium对象 driver = spider.driver # 判断当前请求的url是否为 国内\国际的url 是则执行selenium 否则不处理 if request.url in spider.page_url: driver.get(request.url) driver.execute_script(f'window.scrollBy(0, document.body.scrollHeight)') time.sleep(1) driver.execute_script(f'window.scrollBy(0, document.body.scrollHeight)') time.sleep(1) text = driver.page_source # 获取页面内容 # 篡改响应 return HtmlResponse(url=request.url, body=text, encoding='UTF-8', request=request) return response def process_exception(self, request, exception, spider): pass def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name)