
Scrapy Framework Advanced Guide: Proxy Setup, Request Optimization, and a Hands-On Lianjia Project


Scrapy framework

Adding proxies

Paid proxy IP pool

middlewares.py

import logging

import aiohttp


# Proxy IP pool: fetch a random proxy from a local pool service for each request
class ProxyMiddleware(object):
    proxypool_url = 'http://127.0.0.1:5555/random'
    logger = logging.getLogger('middlewares.proxy')

    async def process_request(self, request, spider):
        async with aiohttp.ClientSession() as client:
            response = await client.get(self.proxypool_url)
            if response.status != 200:
                return
            proxy = await response.text()
            self.logger.debug(f'set proxy {proxy}')
            request.meta['proxy'] = f'http://{proxy}'

settings.py

DOWNLOADER_MIDDLEWARES = {
    "demo.middlewares.DemoDownloaderMiddleware": 543,
    "demo.middlewares.ProxyMiddleware": 544
}
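
Note: a coroutine-style process_request like the one above only works when Scrapy runs on the asyncio Twisted reactor. A minimal sketch of the extra settings.py line this requires (recent Scrapy versions):

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"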

Tunnel proxy

import base64

proxyUser = "1140169503666491392"
proxyPass = "7RmCwS8r"
proxyHost = "http-short.xiaoxiangdaili.com"
proxyPort = "10010"

proxyServer = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass
}
proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")


# Tunnel proxy: every request goes out through a fixed gateway that rotates the exit IP
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Connection"] = "close"
        request.headers["Proxy-Authorization"] = proxyAuth
        # provider-specific header: switch the exit IP every 10 seconds instead of every 60
        request.headers["Proxy-Switch-Ip"] = True

Retry mechanism

settings.py

# Retry settings
RETRY_ENABLED = True  # retries must be enabled for RETRY_TIMES to take effect
RETRY_TIMES = 5  # set to however many retries you want
# The following line is optional
# RETRY_HTTP_CODES = [500, 502, 503, 504, 408]

Overriding the built-in retry middleware

middlewares.py

The subclass below adapts the _retry method from Scrapy's retry.py (get_retry_request is available in Scrapy ≥ 2.5): before each retry it pulls a fresh proxy from the pool. Since _retry is synchronous, the proxy is fetched with a blocking requests call rather than aiohttp; the subclass name is illustrative:

import logging

import requests
from scrapy.downloadermiddlewares.retry import RetryMiddleware, get_retry_request


class ProxyRetryMiddleware(RetryMiddleware):
    proxypool_url = 'http://127.0.0.1:5555/random'
    logger = logging.getLogger('middlewares.proxy')

    def _retry(self, request, reason, spider):
        max_retry_times = request.meta.get("max_retry_times", self.max_retry_times)
        priority_adjust = request.meta.get("priority_adjust", self.priority_adjust)
        # swap in a new proxy IP for the retried request
        response = requests.get(self.proxypool_url)
        if response.status_code == 200:
            proxy = response.text
            self.logger.debug(f'retry with proxy {proxy}')
            request.meta['proxy'] = f'http://{proxy}'
            # for tunnel proxies, refresh Proxy-Authorization here instead
        return get_retry_request(
            request,
            reason=reason,
            spider=spider,
            max_retry_times=max_retry_times,
            priority_adjust=priority_adjust,
        )
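
To activate it, disable the stock retry middleware and register the subclass in settings.py (module path assumed to match the earlier examples):

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    "demo.middlewares.ProxyRetryMiddleware": 550,
}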

Miscellaneous tips

Two ways to make requests in Scrapy

  1. GET request

    import scrapy
    yield scrapy.Request(begin_url, self.first)
    
  2. POST request

    from scrapy import FormRequest  # commonly used for logins in Scrapy
    formdata = {'username': 'wangshang', 'password': 'a706486'}
    yield FormRequest(
        url='http://172.16.10.119:8080/bwie/login.do',
        formdata=formdata,
        callback=self.after_login,
    )
    

    Use case: a POST request that carries an encrypted token, where we need to forge the POST request and reproduce the token, as sketched below.
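
    A minimal sketch of that pattern, assuming the token is embedded in a hidden form field (the field name and callback names are hypothetical):

    def parse_login_page(self, response):
        # hypothetical: read the (encrypted) token embedded in the login form
        token = response.xpath('//input[@name="token"]/@value').get()
        yield FormRequest.from_response(
            response,
            formdata={'username': 'wangshang', 'password': 'a706486', 'token': token},
            callback=self.after_login,
        )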

Per-spider custom settings in Scrapy

settings.py

custom_settings_for_centoschina_cn = {
    'DOWNLOADER_MIDDLEWARES': {
        'questions.middlewares.QuestionsDownloaderMiddleware': 543,
    },
    'ITEM_PIPELINES': {
        'questions.pipelines.QuestionsPipeline': 300,
    },
    'MYSQL_URI': '124.221.206.17',
    # 'MYSQL_URI': '43.143.155.25',
    'MYSQL_DB': 'mydb',
    'MYSQL_USER': 'root',
    'MYSQL_PASSWORD': '123456',
}

Spider code

import scrapy
from questions.settings import custom_settings_for_centoschina_cn
from questions.items import QuestionsItem
from lxml import etree


class CentoschinaCnSpider(scrapy.Spider):
    name = 'centoschina.cn'
    # allowed_domains = ['centoschina.cn']
    custom_settings = custom_settings_for_centoschina_cn
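
Because custom_settings is merged into the crawler's settings at startup, the pipeline can read the MySQL values through from_crawler. A minimal sketch (the constructor signature is illustrative):

class QuestionsPipeline:
    def __init__(self, mysql_uri, mysql_db, mysql_user, mysql_password):
        self.mysql_uri = mysql_uri
        self.mysql_db = mysql_db
        self.mysql_user = mysql_user
        self.mysql_password = mysql_password

    @classmethod
    def from_crawler(cls, crawler):
        # reads the values defined in custom_settings_for_centoschina_cn
        return cls(
            mysql_uri=crawler.settings.get('MYSQL_URI'),
            mysql_db=crawler.settings.get('MYSQL_DB'),
            mysql_user=crawler.settings.get('MYSQL_USER'),
            mysql_password=crawler.settings.get('MYSQL_PASSWORD'),
        )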

Three ways to add headers

  1. Default headers in settings.py

    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
       "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
       "Accept-Language": "en",
    }
    
  2. Per-request headers

    headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'
        }
    
        def start_requests(self):
            start_url = "https://2024.ip138.com/"
            for n in range(5):
                # dont_filter=True disables the framework's built-in duplicate-request filtering
                yield scrapy.Request(start_url, self.get_info, dont_filter=True, headers=A2024Ip138Spider.headers)
    
  3. Adding headers in a downloader middleware

     def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
            # add the header
            request.headers[
                'user-agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None
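
    This only takes effect if the middleware class is enabled in DOWNLOADER_MIDDLEWARES, as in the settings shown at the top of this article.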
    
    

Precedence: 3 > 2 > 1 (middleware headers override per-request headers, which override the settings.py defaults)

Carrying data on the request and reading it from the response

    def start_requests(self):
        start_url = "https://2024.ip138.com/"
        for n in range(5):
            # dont_filter=True disables the framework's built-in duplicate-request filtering
            yield scrapy.Request(start_url, self.get_info, dont_filter=True, headers=A2024Ip138Spider.headers,
                                 meta={'page': n})

    def get_info(self, response):
        # print(response.text)
        print(response.meta['page'])
        ip = response.xpath('/html/body/p[1]/a[1]/text()').extract_first()
        print(ip)
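
On Scrapy ≥ 1.7 the same thing can be done with cb_kwargs, which keeps user data separate from framework-managed meta keys such as proxy; a minimal sketch:

    yield scrapy.Request(start_url, self.get_info, dont_filter=True, cb_kwargs={'page': n})

    def get_info(self, response, page):
        print(page)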

Lianjia (Scrapy project)

Project notes: the site does not ban IPs, so no proxy is needed.

Core code

import scrapy


class TjLianjiaSpider(scrapy.Spider):
    name = "tj_lianjia"

    # allowed_domains = ["ffffffffff"]
    # start_urls = ["https://ffffffffff"]
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page = 1

    def start_requests(self):
        start_url = 'https://tj.lianjia.com/ershoufang/pg{}/'.format(self.page)
        yield scrapy.Request(start_url, self.get_info)

    def get_info(self, response):
        lis = response.xpath('//li[@class="clear LOGVIEWDATA LOGCLICKDATA"]')
        # stop paginating once a page comes back with no listings
        if not lis:
            return
        for li in lis:
            title = li.xpath('div[1]/div[@class="title"]/a/text()').extract_first()
            totalprice = ''.join(li.xpath('div[1]/div[@class="priceInfo"]/div[1]//text()').extract())
            print(title, totalprice)
        self.page += 1
        next_href = 'https://tj.lianjia.com/ershoufang/pg{}/'.format(self.page)
        yield scrapy.Request(next_href, self.get_info)
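
A minimal way to run the spider programmatically, equivalent to running scrapy crawl tj_lianjia from the project root:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("tj_lianjia")
process.start()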


From: https://blog.csdn.net/m0_74087660/article/details/141267556
