
Scrapy image scraping error: Missing scheme in request url


A quick note on a pitfall I hit while scraping images with Scrapy.

  1. Target site: 站长素材 (chinaz) images, https://sc.chinaz.com/tupian
  2. My core source code (the item pipeline)
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    import scrapy
    # useful for handling different item types with a single interface
    from itemadapter import ItemAdapter
    # import the built-in image-handling pipeline class
    from scrapy.pipelines.images import ImagesPipeline

    class ImagesPipeLine(ImagesPipeline):
        def get_media_requests(self, item, info):
            # issue one download request per image URL carried by the item
            img_path = item['img_path']
            yield scrapy.Request(url=img_path)

        def file_path(self, request, response=None, info=None):
            # use the last segment of the URL as the stored file name
            img_name = request.url.split('/')[-1]
            return img_name

        def item_completed(self, results, item, info):
            # pass the item on to any later pipelines unchanged
            return item
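
    As the template comment above notes, this pipeline only takes effect once it is registered in the ITEM_PIPELINES setting, and ImagesPipeline also needs an IMAGES_STORE directory to save files into. A minimal sketch of the matching settings.py (the module path "imgsPro.pipelines" and the storage directory are my assumptions; the post does not show the settings file):

    # settings.py -- sketch only; adjust the module path to your own project
    ITEM_PIPELINES = {
        "imgsPro.pipelines.ImagesPipeLine": 300,   # the custom pipeline defined above
    }
    IMAGES_STORE = "./imgs"   # directory where ImagesPipeline stores the downloaded images
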
  3. The following error was raised
        self._line = linecache.getline(self.filename, self.lineno).strip()
      File "D:\install\anaconda3\lib\linecache.py", line 30, in getline
        lines = getlines(filename, module_globals)
      File "D:\install\anaconda3\lib\linecache.py", line 46, in getlines
        return updatecache(filename, module_globals)
      File "D:\install\anaconda3\lib\linecache.py", line 137, in updatecache
        lines = fp.readlines()
      File "D:\install\anaconda3\lib\codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 288: invalid start byte
    Call stack:
      File "D:\install\anaconda3\Scripts\scrapy-script.py", line 10, in <module>
        sys.exit(execute())
      File "D:\install\anaconda3\lib\site-packages\scrapy\cmdline.py", line 154, in execute
        _run_print_help(parser, _run_command, cmd, args, opts)
      File "D:\install\anaconda3\lib\site-packages\scrapy\cmdline.py", line 109, in _run_print_help
        func(*a, **kw)
      File "D:\install\anaconda3\lib\site-packages\scrapy\cmdline.py", line 162, in _run_command
        cmd.run(args, opts)
      File "D:\install\anaconda3\lib\site-packages\scrapy\commands\crawl.py", line 27, in run
        self.crawler_process.start()
      File "D:\install\anaconda3\lib\site-packages\scrapy\crawler.py", line 348, in start
        reactor.run(installSignalHandlers=False)  # blocking call
      File "D:\install\anaconda3\lib\site-packages\twisted\internet\base.py", line 1315, in run
        self.mainLoop()
      File "D:\install\anaconda3\lib\site-packages\twisted\internet\base.py", line 1325, in mainLoop
        reactorBaseSelf.runUntilCurrent()
      File "D:\install\anaconda3\lib\site-packages\twisted\internet\base.py", line 991, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "D:\install\anaconda3\lib\site-packages\twisted\internet\task.py", line 680, in _tick
        taskObj._oneWorkUnit()
      File "D:\install\anaconda3\lib\site-packages\twisted\internet\task.py", line 526, in _oneWorkUnit
        result = next(self._iterator)
      File "D:\install\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 86, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
      File "D:\install\anaconda3\lib\site-packages\scrapy\core\scraper.py", line 207, in _process_spidermw_output
        dfd.addBoth(self._itemproc_finished, output, response, spider)
      File "D:\install\anaconda3\lib\site-packages\twisted\internet\defer.py", line 538, in addBoth
        return self.addCallbacks(
      File "D:\install\anaconda3\lib\site-packages\twisted\internet\defer.py", line 477, in addCallbacks
        self._runCallbacks()
      File "D:\install\anaconda3\lib\site-packages\twisted\internet\defer.py", line 857, in _runCallbacks
        current.result = callback(  # type: ignore[misc]
      File "D:\install\anaconda3\lib\site-packages\scrapy\core\scraper.py", line 267, in _itemproc_finished
        logger.log(*logformatter_adapter(logkws), extra={'spider': spider},
    Message: 'Error processing %(item)s'
    Arguments: {'item': {'img_path': '//scpic1.chinaz.net/files/default/imgs/2023-08-14/d0ef55a2d710556e_s.jpg'}}
    --- Logging error ---
  4. Troubleshooting

    Honestly, I could not make sense of the traceback above and had no idea where to start; there was simply too much output. So I modified my code, wrapped the part that might fail in a try/except block, and printed the exception, which produced a much more concise message. The modified code and its output are shown below:

    def get_media_requests(self, item, info):
        try:
            # grab the image URL from the item and issue the request
            img_path = item['img_path']
            yield scrapy.Request(img_path)
        except Exception as e:
            print(e)
            pass

Printed output:

Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/175877a01a833b7b_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-14/d0ef55a2d710556e_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-05/df791032f53f5179_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-05/c3cbafa5efd7c9ab_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-09/83640b0005a19b14_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-07/da2193a1cc52969b_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/4283b939393b18f6_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/8aaae68f0126a080_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/7a4ea6cc5a00bd3b_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/e898b7f60eed9d1f_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-14/e241fbe4d407735c_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-13/0a126e9116b6e42a_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/c2f8d261e10a52d8_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/cc7e32b1e8668549_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/707accf28feb2e19_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/30755627eaa7e8a9_s.png
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/cfb35c6a5806f448_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/5c66bd93443af90d_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/256cfb52805042c2_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/0fc78e89f09b6ee1_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/89c3c26897434973_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-13/68f8aa9f8e3d2721_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-14/51476979c441e36d_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-13/c4ec043560a2d430_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/e1c7a0385d287cf7_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/2c58476097790381_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/bd272dac73b8076e_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-12/047e9c5da4f3bf35_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-09/785dc12b836a4c30_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-16/ab27815136279b59_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/a61a30d1cc7a57f5_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/895682e8d1955b77_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/20728bc7977fd9db_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-14/a6d94e3ba56f8e24_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-13/6f58296053538a39_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-14/a7f68bc50f78ebb9_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/f1a3803dfe4137ce_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/ba608ba73e84a56d_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-15/fa99e71609dcfa67_s.jpg
Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-09/c2d2991420a43430_s.jpg
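
  Each of these lines comes from Scrapy's Request constructor, which raises a ValueError whenever the URL it receives has no scheme. A quick way to reproduce the error in isolation (this snippet is my own illustration, not part of the original project):

    import scrapy

    # a protocol-relative URL carries no scheme, so Request rejects it
    scrapy.Request("//scpic1.chinaz.net/files/default/imgs/2023-08-14/d0ef55a2d710556e_s.jpg")
    # ValueError: Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-14/d0ef55a2d710556e_s.jpg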

  The log message "Missing scheme in request url: //scpic1.chinaz.net/files/default/imgs/2023-08-09/c2d2991420a43430_s.jpg" says something is missing from the URL. At first glance the URL "//scpic1.chinaz.net/files/default/imgs/2023-08-09/c2d2991420a43430_s.jpg" looks fine, and when I pasted it into a browser it displayed the image normally. Still, something felt off about the address: it is in fact a protocol-relative URL, starting with "//" and omitting the scheme, which is exactly what a Scrapy Request refuses to accept. After consulting a number of references, the fix is to prepend "http:"/"https:" to the URL. The modified code is as follows:

    def get_media_requests(self, item, info):
        try:
            # the scraped address is protocol-relative ("//scpic1.chinaz.net/..."),
            # so prepend the scheme to turn it into a complete URL
            img_path = "https:" + item['img_path']
            yield scrapy.Request(img_path)
        except Exception as e:
            print(e)
            pass

  After this change the spider runs again and downloads the images normally. Lesson learned: make sure the URLs you feed to Scrapy are as complete as possible, scheme included. A more general alternative is sketched below.
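
  Rather than hard-coding "https:", a more general option is to resolve the scraped address against the page URL before it goes into the item; urljoin gives a protocol-relative URL the scheme of its base URL. This is my own sketch, not code from the original project (inside a spider callback, response.urljoin(src) does the same thing):

    from urllib.parse import urljoin

    # a protocol-relative URL ("//host/path") borrows its scheme from the base URL
    full_url = urljoin("https://sc.chinaz.com/tupian",
                       "//scpic1.chinaz.net/files/default/imgs/2023-08-14/d0ef55a2d710556e_s.jpg")
    print(full_url)
    # https://scpic1.chinaz.net/files/default/imgs/2023-08-14/d0ef55a2d710556e_s.jpg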

From: https://www.cnblogs.com/xlisteven/p/17637099.html
