scrapyd
Introduction to scrapyd
Scrapyd is an application for deploying and running Scrapy crawler projects, developed by the Scrapy developers. Its main usage and purposes are as follows:
Usage:
Install the Scrapyd server: install Scrapyd with pip, then start the Scrapyd service from the command line.
Install the Scrapyd client: likewise install the Scrapyd client with pip, so you can use it to deploy and manage your Scrapy projects.
Configure the Scrapy project: go to the root directory of your Scrapy project and edit the scrapy.cfg file. In its [deploy] section you specify the URL of the Scrapyd server and the project name.
Publish the project: use the Scrapyd client command to publish your Scrapy project to the Scrapyd server, specifying the target server and the project name.
Purposes:
Deploying Scrapy projects: Scrapyd lets you upload Scrapy projects to a server through a simple JSON API, which makes deployment simple and efficient.
Running and controlling spiders: once a project is deployed to the Scrapyd server, you can control spiders through the API, including starting, stopping and scheduling crawl jobs.
Managing multiple projects and versions: Scrapyd can manage several Scrapy projects, and each project can have multiple uploaded versions, although only the latest version is used; this helps keep projects updated and iterated.
Listening for and handling crawl requests: Scrapyd runs as a daemon that listens for requests to run spiders and starts a separate process for each spider it runs. It also supports running several processes in parallel, configurable as needed.
In short, Scrapyd provides a convenient and efficient way to deploy and run Scrapy crawler projects, making it easier for developers to manage and control their crawl jobs.
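The JSON API mentioned above can be driven from any HTTP client. A minimal sketch, assuming Scrapyd is running locally on its default port 6800 and that a project called "tianya" with a spider called "ty" has already been deployed (both names are only placeholders for this example):
import requests

SCRAPYD = "http://127.0.0.1:6800"

# schedule.json starts a spider run and returns a job id
resp = requests.post(f"{SCRAPYD}/schedule.json",
                     data={"project": "tianya", "spider": "ty"})
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# listjobs.json lists pending/running/finished jobs of a project
jobs = requests.get(f"{SCRAPYD}/listjobs.json",
                    params={"project": "tianya"}).json()
print(jobs["running"], jobs["finished"])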
deploy
New in version 0.11.
Syntax: scrapy deploy [ <target:project> | -l <target> | -L ]
Requires a project: yes
Deploys the project to a Scrapyd server. See "Deploying your project".
Incremental crawling with scrapy
# How to avoid crawling duplicates
1. Use Python's built-in set to deduplicate.
2. Recommended: use a Redis set to deduplicate.
With Redis there are two deduplication schemes:
1. By URL. Advantage: simple. Drawback: if the content behind a URL (the detail page) is updated later, you may miss some data.
2. By data. Advantage: accurate. Drawback: if the data volume is huge, it is very hard on Redis (a sketch of this scheme follows the code block below).
First initialize the Redis connection as self.red in the spider's __init__ method, then check every detail URL against a Redis set before requesting it:
# self.red is a redis.Redis(...) connection created in the spider's __init__
# sadd returns 1 if the url was newly added to the set, 0 if it was already there
result = self.red.sadd("tianya:ty:detail:url", detail_url)
if result:
    # unseen url -> go crawl the detail page
    yield scrapy.Request(
        url=detail_url,
        callback=self.parse_detail
    )
else:
    print("already crawled")
Redis-related settings
# redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 301  # optional
}
# redis connection settings
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
REDIS_DB = 13
REDIS_PARAMS = {
    "password": 'chen'
}
# scrapy_redis settings
# scrapy-redis configuration (fixed values)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True  # if True, keep the request queue in Redis when the spider closes; if False, discard it
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
Because Redis memory is limited, the default fingerprint-set filter stops being practical once the amount of data to crawl becomes very large, so a more space-efficient filter is needed:
Bloom filter
When we need to deduplicate, the usual options are:
- store the URLs directly in a set
- store hashed URLs (request fingerprints), which is what plain Scrapy does by default
- store hashed requests in Redis, which is what scrapy-redis does by default; if there are very, very many requests, the pressure on Redis is heavy
- use a Bloom filter
How a Bloom filter works: internally it is essentially an improved bitmap. What is a bitmap? Suppose we prepare a bit array in advance; each piece of source data is run through a hash function that yields a number, we use that number as an index into the array and set the bit at that position to 1. A Bloom filter applies several hash functions per item, so a lookup can give a false positive ("probably seen before") but never a false negative.
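A toy in-memory illustration of that idea (this is not the scrapy_redis_bloomfilter implementation, just a minimal sketch with made-up sizes):
class TinyBloom:
    def __init__(self, size_bits=2 ** 20, hash_count=6):
        self.size = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits // 8)  # the bitmap

    def _positions(self, value: str):
        # derive several indexes from one value by salting the hash
        for seed in range(self.hash_count):
            yield hash(f"{seed}:{value}") % self.size

    def add(self, value: str):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value: str):
        # every bit set -> "probably seen"; any bit unset -> definitely new
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

bf = TinyBloom()
bf.add("https://example.com/page/1")
print("https://example.com/page/1" in bf)  # True
print("https://example.com/page/2" in bf)  # almost certainly False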
Using a Bloom filter with scrapy-redis is very easy. You could write the Bloom filter logic yourself, but I recommend simply using a third-party package:
# install the Bloom filter package
pip install scrapy_redis_bloomfilter
# dedup class: to use the BloomFilter, replace DUPEFILTER_CLASS with this one
DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"
# number of hash functions, 6 by default, can be changed
BLOOMFILTER_HASH_NUMBER = 6
# bit parameter of the Bloom filter, 30 by default (2^30 bits, about 128 MB of Redis memory), good for deduplicating on the order of 100 million items
BLOOMFILTER_BIT = 30
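A quick check of the memory figure quoted above for the default bit parameter:
bits = 1 << 30                 # BLOOMFILTER_BIT = 30 -> 2**30 bit positions
print(bits / 8 / 1024 / 1024)  # 128.0 -> about 128 MB for the bitmap in Redis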
Using scrapy-splash
Install docker
yum install docker
Configure a docker registry mirror
vim /etc/docker/daemon.json
# write ONE of the following mirror configurations into the file
{
    "registry-mirrors": ["https://cr.console.aliyun.com"]
}
# or
{
    "registry-mirrors": ["https://registry.docker-cn.com/"]
}
systemctl start docker
docker ps
Install splash
- Pull the splash image
docker pull scrapinghub/splash
(the splash image is about 2 GB)
- Run splash
docker run -p 8050:8050 scrapinghub/splash
- Open a browser and visit splash at http://<server-ip>:8050
If that page loads, splash has started successfully.
Notes on fixing errors
# 1. first disable SELinux
# 2. disable the firewall
# 3. refresh yum and mind which docker version gets installed
sudo yum remove docker \
docker-client \
docker-client-latest \
docker-common \
docker-latest \
docker-latest-logrotate \
docker-logrotate \
docker-engine
sudo yum install -y yum-utils \
device-mapper-persistent-data \
lvm2
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install docker-ce docker-ce-cli containerd.io
sudo systemctl start docker
sudo systemctl enable docker
sudo docker run hello-world
Splash rendering API
Key endpoint (connecting to Splash): http://192.168.86.151:8050/render.html?url=<url to render>[&wait=1&timeout=1]
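Called from Python this looks roughly as follows (the Splash host is the one used in this setup; the target URL is only an example):
import requests

SPLASH = "http://192.168.86.151:8050"
resp = requests.get(f"{SPLASH}/render.html",
                    params={"url": "https://news.163.com/",
                            "wait": 1,       # seconds to wait after the page loads
                            "timeout": 10})  # overall render timeout in seconds
print(resp.status_code, len(resp.text))      # resp.text is the rendered HTML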
Crawling NetEase news: a Lua script that scrolls the page down
function main(splash, args)
    assert(splash:go("https://news.163.com/"))
    assert(splash:wait(2))
    -- prepare a JS function in advance (preloaded)
    get_btn_display = splash:jsfunc([[
        function(){
            return document.getElementsByClassName("load_more_btn")[0].style.display
        }
    ]])
    while(true)
    do
        splash:runjs("document.getElementsByClassName('load_more_btn')[0].scrollIntoView(true)")
        splash:select(".load_more_btn"):mouse_click()
        splash:wait(1)
        -- check whether load_more_btn has become display:none
        display_value = get_btn_display()
        if(display_value == 'none')
        then
            break
        end
    end
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
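Outside of Scrapy, a script like this can be tested directly against Splash's execute endpoint, which takes the script text in the lua_source argument. A minimal sketch, assuming the Lua script above was saved to a file (the file name is hypothetical; the host is the one used in this setup):
import requests

lua_source = open("scroll_163.lua", encoding="utf-8").read()  # the Lua script shown above
resp = requests.post("http://192.168.86.151:8050/execute",
                     json={"lua_source": lua_source,
                           "timeout": 90})  # give the scrolling loop enough time
data = resp.json()  # keys match the Lua return table: html, png, har
print(len(data["html"]))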
Using Splash together with the Scrapy framework requires the following settings.
First install the package: pip install scrapy_splash
# scrapy_splash
# URL of the rendering service; replace with your own
SPLASH_URL = 'http://192.168.86.151:8050'
# downloader middlewares, these are required
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# optional
# SPIDER_MIDDLEWARES = {
#     'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
# }
# dedup filter, this is required
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Splash-aware HTTP cache; leave it commented out if you don't need it
# HTTPCACHE_ENABLED = True  # enable this if you want to use the HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
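With those settings in place, a spider yields SplashRequest instead of scrapy.Request. A minimal sketch using the default render.html endpoint (the spider name and URL are only examples; the full Lua-based spider for this project follows further below):
import scrapy
from scrapy_splash import SplashRequest

class DemoSpider(scrapy.Spider):
    name = "splash_demo"

    def start_requests(self):
        # the request is routed through SPLASH_URL; the default endpoint
        # is render.html and args are forwarded to Splash
        yield SplashRequest("https://news.163.com/",
                            callback=self.parse,
                            args={"wait": 1})

    def parse(self, response, **kwargs):
        # response.text is the HTML after JavaScript rendering
        self.logger.info("rendered page length: %d", len(response.text))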
When the virtual machine's IP suddenly disappears after startup
1. Option 1
This is a conflict with the NetworkManager service; the easy fix is to simply stop the NetworkManager service:
First:
# centos 7
systemctl stop NetworkManager
systemctl disable NetworkManager
Then restart the server by running: reboot
After the reboot, check again whether the problem is solved:
systemctl restart network
*** Using distributed crawling + Splash + Bloom filter together
1. Step 1: create your own dupefilter file, dupefilter.py
from scrapy_redis.dupefilter import RFPDupeFilter as BaseRFPDupeFilter
from scrapy.utils.url import canonicalize_url
from scrapy.utils.request import request_fingerprint
from scrapy_splash.utils import dict_hash
from copy import deepcopy
import logging
import time
from scrapy_redis_bloomfilter.defaults import BLOOMFILTER_HASH_NUMBER, BLOOMFILTER_BIT, DUPEFILTER_DEBUG
from scrapy_redis_bloomfilter import defaults
from scrapy_redis.connection import get_redis_from_settings
from scrapy_redis_bloomfilter.bloomfilter import BloomFilter
# from scrapy_redis.dupefilter import RFPDupeFilter as BaseDupeFilter
logger = logging.getLogger(__name__)
# TODO: the part below is copied from the scrapy_splash dupefilter source
def splash_request_fingerprint(request, include_headers=None):
    """ Request fingerprint which takes 'splash' meta key into account """
    fp = request_fingerprint(request, include_headers=include_headers)
    if 'splash' not in request.meta:
        return fp
    splash_options = deepcopy(request.meta['splash'])
    args = splash_options.setdefault('args', {})
    if 'url' in args:
        args['url'] = canonicalize_url(args['url'], keep_fragments=True)
    return dict_hash(splash_options, fp)
# This filter is compatible with both Splash and Redis:
# - it inherits from the scrapy_redis RFPDupeFilter,
# - the Splash fingerprinting code is copied in,
# - the Bloom filter code is copied in.
# TODO note: the parent class must be the one from scrapy_redis
#      (from scrapy_redis.dupefilter import RFPDupeFilter as BaseRFPDupeFilter)
class MydupeFilter(BaseRFPDupeFilter):
    # TODO: the part below is copied from the scrapy_redis_bloomfilter dupefilter source
    logger = logger

    def __init__(self, server, key, debug, bit, hash_number):
        """Initialize the duplicates filter.
        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key Where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.
        """
        self.server = server
        self.key = key
        self.debug = debug
        self.bit = bit
        self.hash_number = hash_number
        self.logdupes = True
        self.bf = BloomFilter(server, self.key, bit, hash_number)
    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.
        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.
        Parameters
        ----------
        settings : scrapy.settings.Settings
        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.
        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG', DUPEFILTER_DEBUG)
        bit = settings.getint('BLOOMFILTER_BIT', BLOOMFILTER_BIT)
        hash_number = settings.getint('BLOOMFILTER_HASH_NUMBER', BLOOMFILTER_HASH_NUMBER)
        return cls(server, key=key, debug=debug, bit=bit, hash_number=hash_number)
    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.
        Parameters
        ----------
        crawler : scrapy.crawler.Crawler
        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.
        """
        instance = cls.from_settings(crawler.settings)
        return instance
    @classmethod
    def from_spider(cls, spider):
        """Returns instance from crawler.
        Parameters
        ----------
        spider :
        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.
        """
        settings = spider.settings
        server = get_redis_from_settings(settings)
        dupefilter_key = settings.get("SCHEDULER_DUPEFILTER_KEY", defaults.SCHEDULER_DUPEFILTER_KEY)
        key = dupefilter_key % {'spider': spider.name}
        debug = settings.getbool('DUPEFILTER_DEBUG', DUPEFILTER_DEBUG)
        bit = settings.getint('BLOOMFILTER_BIT', BLOOMFILTER_BIT)
        hash_number = settings.getint('BLOOMFILTER_HASH_NUMBER', BLOOMFILTER_HASH_NUMBER)
        print(key, bit, hash_number)
        instance = cls(server, key=key, debug=debug, bit=bit, hash_number=hash_number)
        return instance
    def request_seen(self, request):
        """Returns True if request was already seen.
        Parameters
        ----------
        request : scrapy.http.Request
        Returns
        -------
        bool
        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        if self.bf.exists(fp):
            return True
        self.bf.insert(fp)
        return False
    def log(self, request, spider):
        """Logs given request.
        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider
        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False
        spider.crawler.stats.inc_value('bloomfilter/filtered', spider=spider)
    # TODO: the part below is copied from the scrapy_splash dupefilter source
    def request_fingerprint(self, request):
        return splash_request_fingerprint(request)
2. Step 2: in the Scrapy settings file, set:
SPLASH_URL = 'http://192.168.86.151:8050'
# downloader middlewares, these are required
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# optional
# SPIDER_MIDDLEWARES = {
#     'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
# }
# swap in our own dupefilter, compatible with Redis, Splash and the Bloom filter at the same time
DUPEFILTER_CLASS = 'news.dupefilter.MydupeFilter'
# redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 301  # optional
}
# redis connection settings
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
REDIS_DB = 12
# scrapy_redis settings (fixed values)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True  # if True, keep the request queue in Redis when the spider closes; if False, discard it
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # the Redis-based dedup logic (replaced by the custom filter above)
# Bloom filter
# dedup class: to use the plain BloomFilter, replace DUPEFILTER_CLASS with this one
# DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"
# number of hash functions, 6 by default, can be changed
BLOOMFILTER_HASH_NUMBER = 6
# bit parameter of the Bloom filter, 30 by default (2^30 bits, about 128 MB of Redis memory), good for roughly 100 million items
BLOOMFILTER_BIT = 30
3. Step 3: notes on the Scrapy spider code
- RedisSpider: the spider class must inherit from RedisSpider (from scrapy_redis.spiders import RedisSpider), as in the spider below
import scrapy
from scrapy_splash.request import SplashRequest
from scrapy_redis.spiders import RedisSpider
from ..items import NewsItem

lua_source = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(2))
    -- prepare a JS function in advance (preloaded)
    get_btn_display = splash:jsfunc([[
        function(){
            return document.getElementsByClassName("load_more_btn")[0].style.display;
        }
    ]])
    while(true)
    do
        splash:runjs("document.getElementsByClassName('load_more_btn')[0].scrollIntoView(true)")
        splash:select(".load_more_btn"):mouse_click()
        splash:wait(1)
        -- check whether load_more_btn has become display:none
        display_value = get_btn_display()
        if display_value == 'none' then
            break
        end
    end
    return {html = splash:html()}
end
"""


class WangyiSpider(RedisSpider):
    name = 'wangyi'
    allowed_domains = ['163.com']
    start_urls = ['https://news.163.com/']
    # redis_key = "wangyi:news:start_urls"

    # override start_requests
    def start_requests(self):
        yield SplashRequest(
            url=self.start_urls[0],
            callback=self.parse,
            endpoint="execute",  # the endpoint selects which Splash service to call; "execute" runs the Lua script
            args={
                "lua_source": lua_source
            },
            dont_filter=True
        )

    def parse(self, response, **kwargs):
        divs = response.xpath('//div[@class="ndi_main"]/div[@class="data_row news_article clearfix "]')
        for div in divs:
            href = div.xpath('./div/div[@class="news_title"]/h3/a/@href').extract_first()
            title = div.xpath('./div/div[@class="news_title"]/h3/a/text()').extract_first()
            xw = NewsItem()
            xw['title'] = title  # the link text is the headline
            xw['url'] = href     # the href attribute is the article url
            print(xw)
            yield xw
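If you switch to the commented-out redis_key line instead of start_urls, the crawl is kicked off by pushing a start URL into that Redis list from any client. A minimal sketch, assuming the same Redis settings as above:
import redis

r = redis.Redis(host="127.0.0.1", port=6379, db=12)
# a RedisSpider waits until a start URL appears under its redis_key
r.lpush("wangyi:news:start_urls", "https://news.163.com/")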