First time using Scrapy: a 503 Service Unavailable error

Date: 2024-03-23 15:45:40

Tags: full, Unavailable, Service, canshuhref, url, parse, item, scrapy, error

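My graduation project is a Hadoop-based recommendation system for electronic products.

The system needs a large amount of product information, so I crawl data from ZOL (Zhongguancun Online), which does not have anti-scraping measures like JD.com's.

When crawling pages with a Scrapy spider, I could get data from some pages, but other pages failed with 503 Service Unavailable.

Part of the code: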

    def parse(self, response):
        if response.status == 503:
            # Handle a 503 response, e.g. wait a while and then resend the request
            yield scrapy.Request(response.url, callback=self.parse, dont_filter=True, meta=response.meta, priority=0)
        else:
            for li in response.xpath('//*[@id="J_PicMode"]/li'):
                item = {}

                item['href'] = li.xpath('a/@href').get()
                if item['href'] is not None:
                    full_url = 'https://detail.zol.com.cn' + item['href']
                    # print(full_url)
                    # Create a new request and set parse_details as its callback
                    yield scrapy.Request(full_url, callback=self.parse_details, meta={'item': item})
                    # time.sleep(0.5)

    def parse_details(self, response):
        # Work on the detail page here; XPath or other methods can be used to extract the needed information
        # For example:
        item = response.meta['item']
        # item['imghref'] = response.xpath('//*[@id="_j_tag_nav"]/ul/li[3]/a/@href').get()
        # if item['imghref'] is not None:
        #     full_imghref_url = 'https://detail.zol.com.cn' + item['imghref']
        #     # print(full_imghref_url)
        #     # Create a new request and set parse_imghref as its callback
        #     yield scrapy.Request(full_imghref_url, callback=self.parse_imghref, meta={'item': item})

        item['canshuhref'] = response.xpath('//*[@id="_j_tag_nav"]/ul/li[2]/a/@href').get()
        if item['canshuhref'] is not None:
            full_canshuhref_url = 'https://detail.zol.com.cn/' + item['canshuhref']

            # Create a new request and set parse_canshuhref as its callback
            yield scrapy.Request(full_canshuhref_url, callback=self.parse_canshuhref, meta={'item': item})
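
A side note on the `response.status == 503` branch in the excerpt above, offered as a hedged sketch rather than a fix. With Scrapy's default middlewares, a 503 response is first retried and then filtered out before it reaches a spider callback, so that branch would only run if the spider allows the status, for example through a `handle_httpstatus_list` attribute (the excerpt does not show one, so this is an assumption). Also, `time.sleep` blocks Scrapy's event loop; the `DOWNLOAD_DELAY` setting is the usual way to space out requests. The values below are illustrative only, and the names `HANDLE_HTTPSTATUS_LIST` and `CUSTOM_SETTINGS` are hypothetical holders for the real spider attributes:

```python
# Hypothetical values for illustration; not tuned for any particular site.

# Meant for the spider's `handle_httpstatus_list` class attribute, so that
# 503 responses are passed to callbacks instead of being filtered out.
HANDLE_HTTPSTATUS_LIST = [503]

# Meant for the spider's `custom_settings` class attribute.
CUSTOM_SETTINGS = {
    "DOWNLOAD_DELAY": 0.5,         # base delay in seconds between requests to the same site
    "AUTOTHROTTLE_ENABLED": True,  # let Scrapy adapt the delay to server response times
    "RETRY_TIMES": 5,              # maximum retries in addition to the first attempt
}
```

In a spider class these would be written as `handle_httpstatus_list = [503]` and `custom_settings = {...}`.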

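This is just a brief record. After a lot of searching I still could not solve the problem: crawling a thousand pages yielded only a little over 300 records.

Today, while crawling images, I happened to spot the problem in the code: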

full_canshuhref_url = 'https://detail.zol.com.cn/' + item['canshuhref']

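In my Python code, every address obtained through XPath is incomplete: it contains only the path, while the leading part is a fixed domain. I built each request URL by string concatenation as shown above, and that is where the problem came from.

The href values returned by XPath all start with /, and when assembling the URL I prepended https://detail.zol.com.cn/, which already ends with a slash.

The URL therefore became https://detail.zol.com.cn//cell_phone/index1494545.shtml

As you can see, the two parts join with a // in the middle.

A URL like this does open in a browser, but Scrapy received a 503, kept retrying, and in the end collected only a small amount of data.

(I am new to Scrapy and my function names are not conventional; this post only records a problem I ran into while coding.)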
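
One way to avoid this kind of mistake is to resolve the href against the base URL instead of concatenating strings. The following is a minimal sketch using the standard library's `urljoin`; the constant `BASE_URL` and the helper `build_url` are hypothetical names introduced here for illustration, not part of the original spider:

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the concatenation in the post.
BASE_URL = "https://detail.zol.com.cn/"


def build_url(href: str) -> str:
    """Resolve an href against BASE_URL without doubling the slash."""
    return urljoin(BASE_URL, href)


# An href that starts with "/" no longer produces "//" after the host.
full_url = build_url("/cell_phone/index1494545.shtml")
print(full_url)
```

Inside a spider callback, Scrapy's `response.urljoin(href)` performs the same kind of resolution relative to the URL of the current response, assuming the links point to the same host as the page being parsed.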

From: https://www.cnblogs.com/woaixing711/p/18091162
