
Scrapy Framework: Custom Extensions

Time: 2023-12-08 15:26:41
Tags: engine, custom, framework, self, spider, telnet, signals, scrapy, crawler

When writing a custom extension, you use signals to register your own operations at specific points in the crawl.

Source code walkthrough:

from scrapy.extensions.telnet import TelnetConsole  # follow this import to read the TelnetConsole source

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
   'scrapy.extensions.telnet.TelnetConsole': None,
   # 'test002.extensions.MyExtend':300,
}
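
To make the meaning of the values in EXTENSIONS concrete, here is a minimal pure-Python sketch. It is a toy model, not Scrapy's actual loader: it assumes only that project entries override the built-in defaults, that a value of None disables an entry, and that the integers are used for ordering. The helper name `enabled_in_order` and the `base` dict are hypothetical stand-ins.

```python
def enabled_in_order(base, custom):
    """Merge two {path: order} dicts, drop disabled entries, sort by order."""
    merged = {**base, **custom}  # project settings override the defaults
    active = {path: order for path, order in merged.items() if order is not None}
    return sorted(active, key=active.get)


# Hypothetical defaults standing in for the framework's built-in extensions.
base = {
    'scrapy.extensions.telnet.TelnetConsole': 0,
    'scrapy.extensions.logstats.LogStats': 0,
}
custom = {
    'scrapy.extensions.telnet.TelnetConsole': None,  # None disables the entry
    'test002.extensions.MyExtend': 300,
}

result = enabled_in_order(base, custom)
print(result)
```

In this model the telnet console disappears from the result, while the custom extension is kept and sorted after the remaining default.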

The TelnetConsole class:

class TelnetConsole(protocol.ServerFactory):

    def __init__(self, crawler):
        if not crawler.settings.getbool('TELNETCONSOLE_ENABLED'):
            raise NotConfigured
        if not TWISTED_CONCH_AVAILABLE:
            raise NotConfigured
        self.crawler = crawler
        self.noisy = False
        self.portrange = [int(x) for x in crawler.settings.getlist('TELNETCONSOLE_PORT')]
        self.host = crawler.settings['TELNETCONSOLE_HOST']
        self.crawler.signals.connect(self.start_listening, signals.engine_started)
        self.crawler.signals.connect(self.stop_listening, signals.engine_stopped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_listening(self):
        self.port = listen_tcp(self.portrange, self.host, self)
        h = self.port.getHost()
        logger.debug("Telnet console listening on %(host)s:%(port)d",
                     {'host': h.host, 'port': h.port},
                     extra={'crawler': self.crawler})

    def stop_listening(self):
        self.port.stopListening()

    def protocol(self):
        telnet_vars = self._get_telnet_vars()
        return telnet.TelnetTransport(telnet.TelnetBootstrapProtocol,
            insults.ServerProtocol, manhole.Manhole, telnet_vars)

    def _get_telnet_vars(self):
        # Note: if you add entries here also update topics/telnetconsole.rst
        telnet_vars = {
            'engine': self.crawler.engine,
            'spider': self.crawler.engine.spider,
            'slot': self.crawler.engine.slot,
            'crawler': self.crawler,
            'extensions': self.crawler.extensions,
            'stats': self.crawler.stats,
            'settings': self.crawler.settings,
            'est': lambda: print_engine_status(self.crawler.engine),
            'p': pprint.pprint,
            'prefs': print_live_refs,
            'hpy': hpy,
            'help': "This is Scrapy telnet console. For more info see: " \
                "https://doc.scrapy.org/en/latest/topics/telnetconsole.html",
        }
        self.crawler.signals.send_catch_log(update_telnet_vars, telnet_vars=telnet_vars)
        return telnet_vars
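
The guard at the top of `__init__` is worth isolating: an extension can refuse to load by raising NotConfigured. The sketch below models that pattern in plain Python so it runs without Scrapy. The local `NotConfigured` class stands in for `scrapy.exceptions.NotConfigured`, and `FakeCrawler`, `OptionalExtension`, `build_extensions`, and the `MYEXT_ENABLED` setting are all hypothetical names used only for illustration.

```python
class NotConfigured(Exception):
    """Local stand-in for scrapy.exceptions.NotConfigured."""


class FakeCrawler:
    """Hypothetical minimal crawler: it only carries a settings dict."""

    def __init__(self, settings):
        self.settings = settings


class OptionalExtension:
    def __init__(self, crawler):
        # Same guard shape as TelnetConsole.__init__: opt out when disabled.
        if not crawler.settings.get('MYEXT_ENABLED', False):
            raise NotConfigured
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)


def build_extensions(classes, crawler):
    """Instantiate each class, skipping the ones that raise NotConfigured."""
    loaded = []
    for cls in classes:
        try:
            loaded.append(cls.from_crawler(crawler))
        except NotConfigured:
            continue
    return loaded


on = build_extensions([OptionalExtension], FakeCrawler({'MYEXT_ENABLED': True}))
off = build_extensions([OptionalExtension], FakeCrawler({'MYEXT_ENABLED': False}))
print(len(on), len(off))
```

Raising inside the constructor keeps the enable/disable decision next to the extension itself, so the settings file only needs a boolean flag.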

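Analysis:

self.start_listening and self.stop_listening are handler methods, and you can define your own in the same way.

signals.engine_started and signals.engine_stopped are the signals those handlers are bound to.

In other words, crawler.signals.connect registers an operation on a chosen signal.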
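
The connect-then-fire idea can be sketched with a toy dispatcher in plain Python. This is only a model of the mechanism, not how Scrapy implements its signal manager: `ToySignalManager` and its `send` method are hypothetical, and the two lambdas stand in for the start_listening and stop_listening handlers.

```python
from collections import defaultdict

# Signals are unique sentinel objects that serve as dictionary keys.
engine_started = object()
engine_stopped = object()


class ToySignalManager:
    def __init__(self):
        self._receivers = defaultdict(list)

    def connect(self, receiver, signal):
        """Register a callable to run whenever `signal` is sent."""
        self._receivers[signal].append(receiver)

    def send(self, signal, **kwargs):
        """Call every receiver registered for `signal`, in connection order."""
        return [receiver(**kwargs) for receiver in self._receivers[signal]]


calls = []
signals = ToySignalManager()
signals.connect(lambda: calls.append('start_listening'), engine_started)
signals.connect(lambda: calls.append('stop_listening'), engine_stopped)

signals.send(engine_started)  # the "engine" announces that it started
signals.send(engine_stopped)  # ... and later that it stopped
print(calls)
```

The extension never calls its own handlers; it only registers them, and whoever sends the signal decides when they run.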

Finding the available signals:

Open the signals module:

engine_started = object()
engine_stopped = object()
spider_opened = object()
spider_idle = object()
spider_closed = object()
spider_error = object()
request_scheduled = object()
request_dropped = object()
response_received = object()
response_downloaded = object()
item_scraped = object()
item_dropped = object()

# for backwards compatibility
stats_spider_opened = spider_opened
stats_spider_closing = spider_closed
stats_spider_closed = spider_closed

item_passed = item_scraped

request_received = request_scheduled
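
A short sketch of why bare `object()` instances are enough here, assuming nothing beyond standard Python semantics: each call creates a distinct object, and a backwards-compatibility alias is just a second name for the same object, so a handler registered under one name is found under the other. The `handlers` dict and the `'on_close'` value are hypothetical.

```python
spider_opened = object()
spider_closed = object()
stats_spider_closed = spider_closed  # alias: a second name, same object

# A registry keyed by signal, as a dispatcher might keep internally.
handlers = {spider_closed: 'on_close'}

looked_up = handlers[stats_spider_closed]  # the alias hits the same key
distinct = spider_opened is not spider_closed
same = stats_spider_closed is spider_closed
print(looked_up, distinct, same)
```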

Based on the source above, we can write our own extension:

from scrapy import signals


class MyExtend:

    def __init__(self, crawler):
        self.crawler = crawler
        # Hang handlers on the hooks:
        # register an operation on each chosen signal.
        self.crawler.signals.connect(self.start, signals.engine_started)
        self.crawler.signals.connect(self.close, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start(self):
        print('signals.engine_started')

    def close(self):
        # spider_closed also sends spider and reason; add them as parameters if needed.
        print('signals.spider_closed')

Then enable the extension in the settings file (the class above is assumed to live in test002/extensions.py):

from scrapy.extensions.telnet import TelnetConsole

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
   # 'scrapy.extensions.telnet.TelnetConsole': None,
   'test002.extensions.MyExtend':300,
}
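
The key in EXTENSIONS is a dotted path string rather than a class object, so something has to turn 'test002.extensions.MyExtend' into the class before from_crawler can be called. The sketch below shows one way to resolve such a path using only the standard library. The helper name `load_by_path` is hypothetical, and this is an illustration of the idea, not Scrapy's own loading code; a standard-library class is used as the target so the example runs anywhere.

```python
from collections import OrderedDict
from importlib import import_module


def load_by_path(path):
    """Resolve 'package.module.Name' to the object called Name."""
    module_path, _, name = path.rpartition('.')
    module = import_module(module_path)  # imports 'package.module'
    return getattr(module, name)         # fetches 'Name' from that module


cls = load_by_path('collections.OrderedDict')
print(cls is OrderedDict)
```

A misspelled module or class name in the path therefore surfaces as an import or attribute error at load time, which is a useful first thing to check when an extension does not seem to run.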

From: https://www.cnblogs.com/huangm1314/p/10440203.html
