Web Crawlers - IPS99 Tech Sharing

Tags: self, middleware, crawler, item, print, div

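What is a crawler

A web crawler (also called a web spider or web robot) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.

When we enter a URL and send a request to a server, the server returns data (in formats such as HTML or JSON).

When the response is an HTML file, we can parse it further to extract the data we want.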

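1. Requesting data from the server

We can use Python's requests library to simulate a browser request.

The data returned turns out to be the same as the data the browser receives.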
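A minimal sketch of this step, assuming the third-party requests package is installed. The helper names build_request and fetch_html and the header value are illustrative choices, not part of the original post.

```python
import requests

# Assumed browser-like header; some sites reject the library's default User-Agent.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"
    ),
}


def build_request(url):
    """Prepare a GET request with browser-like headers, without sending it."""
    return requests.Request("GET", url, headers=HEADERS).prepare()


def fetch_html(url, timeout=10):
    """Send the prepared request and return the response body as text."""
    with requests.Session() as session:
        response = session.send(build_request(url), timeout=timeout)
        response.raise_for_status()  # raise an exception on 4xx/5xx status codes
        return response.text
```

Calling fetch_html("https://movie.douban.com/top250") would then return the page's HTML as a string, provided the site accepts the request.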

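2. Extracting the data we want from the response

Suppose we want the inq data (the one-line quote shown for each movie). We can extract it with a regular expression. If you are not comfortable with regular expressions, you can parse the data with Python's BeautifulSoup library instead.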

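from bs4 import BeautifulSoup

# Sample document from the BeautifulSoup documentation
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')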

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
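The regular-expression route mentioned earlier could look roughly like the sketch below. The HTML fragment is a hand-written assumption about how an inq element is marked up, and extract_quotes is a hypothetical helper name.

```python
import re

# Hypothetical fragment; the real page markup may differ.
SAMPLE_HTML = """
<p class="quote"><span class="inq">Hope sets you free.</span></p>
<p class="quote"><span class="inq">Life is like a box of chocolates.</span></p>
"""

# Non-greedy group captures the text between the opening and closing span tags.
INQ_PATTERN = re.compile(r'<span class="inq">(.*?)</span>', re.S)


def extract_quotes(html):
    """Return the text of every <span class="inq"> element found in html."""
    return [text.strip() for text in INQ_PATTERN.findall(html)]


quotes = extract_quotes(SAMPLE_HTML)
print(quotes)
```

With BeautifulSoup the same selection can be written as soup.find_all("span", class_="inq"), which tends to be more robust than a regular expression when attributes or whitespace vary.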

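With these two steps we can obtain the data we want.

The Scrapy framework

This framework integrates the request and parsing functions described above, and it also handles downloading and saving data.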

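(1) Scheduler

Simply put, the scheduler can be thought of as a priority queue of URLs (the addresses or links of the pages to crawl). It decides which URL to crawl next and removes duplicate URLs so that no work is wasted. Users can customize the scheduler to suit their needs.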
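The idea of a de-duplicating priority queue can be sketched in plain Python. This is a toy model, not Scrapy's actual scheduler; ToyScheduler and its method names are made up for illustration.

```python
import heapq
import itertools


class ToyScheduler:
    """Priority queue of URLs that ignores URLs it has already seen."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker: equal priorities stay FIFO

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it is a duplicate."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated: higher values pop first.
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def next_url(self):
        """Pop the next URL to crawl, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("https://example.com/a")
scheduler.enqueue("https://example.com/b", priority=10)
scheduler.enqueue("https://example.com/a")  # duplicate, ignored
order = [scheduler.next_url(), scheduler.next_url(), scheduler.next_url()]
print(order)
```

The higher-priority URL comes out first, the duplicate is never queued twice, and an empty queue yields None.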

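(2) Downloader

The downloader carries the heaviest load of all the components; it downloads resources from the network at high speed. Scrapy's downloader code is not very complicated, yet it is efficient, mainly because it is built on Twisted, an efficient asynchronous model (in fact, the whole framework is built on this model).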
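To illustrate why an asynchronous model helps, here is a rough analogy using the standard-library asyncio. This is a swapped-in stand-in, not Twisted and not Scrapy's downloader; fake_download simulates network latency with a sleep instead of touching the network.

```python
import asyncio


async def fake_download(url, delay=0.1):
    """Stand-in for a network call; awaiting the sleep lets other downloads proceed."""
    await asyncio.sleep(delay)
    return f"<html>body of {url}</html>"


async def download_all(urls):
    """Start every download at once and collect the results in input order."""
    return await asyncio.gather(*(fake_download(url) for url in urls))


urls = [f"https://example.com/page/{n}" for n in range(5)]
pages = asyncio.run(download_all(urls))
print(len(pages))
```

Because the five simulated downloads wait concurrently, the total time is close to one delay rather than five, which is the property an asynchronous downloader relies on.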

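(3) Spider

The spider is the part users care about most. Users write their own spiders (using regular expressions or similar syntax) to extract the information they need from specific pages, namely the so-called items. For example, XPath can be used to extract the information of interest.
Users can also extract links from a page so that Scrapy goes on to crawl the next page.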

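(4) Item Pipeline

The item pipeline receives the data passed on by the spider for further processing, for example validating items, removing unwanted information, storing items in a database (persistence), or writing them to a text file.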

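(5) Scrapy Engine

The Scrapy engine is the core of the framework. It handles the data flow of the whole system and triggers events, and it controls the scheduler, the downloader, and the spiders. In effect, the engine is like a computer's CPU: it controls the entire process.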

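(6) Middleware

Scrapy has many middlewares, such as downloader middlewares and spider middlewares. They act like filters that sit between components, intercept the data flow, and apply special processing.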
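A downloader middleware is an ordinary class with hook methods. The sketch below sets a User-Agent header on each outgoing request; the class name and the User-Agent strings are assumptions made for illustration.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: pick a User-Agent for every outgoing request."""

    # Hypothetical pool of User-Agent strings.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None lets Scrapy keep processing the request as usual.
        return None
```

To enable such a class, it would be registered under DOWNLOADER_MIDDLEWARES in settings.py with an order value, for example "text.middlewares.RandomUserAgentMiddleware": 543 (the module path assumes the class lives in this project's middlewares.py).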

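Workflow

1) The start URLs in the spider are built into Request objects → spider middleware → engine → scheduler;
2) The scheduler hands the Requests on → engine → downloader middleware → downloader;
3) The downloader sends the request and obtains a Response → downloader middleware → engine → spider middleware → spider;
4) The spider extracts URLs and assembles them into Request objects → spider middleware → engine → scheduler, then step 2 repeats;
5) The spider extracts data → engine → pipeline, which processes and saves the data.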
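The five steps can be mimicked with a toy loop in plain Python. Nothing here is Scrapy code: FAKE_WEB is a made-up in-memory "web", the middlewares are left out, and the function names are illustrative.

```python
from collections import deque

# Hypothetical in-memory web: URL -> (items on the page, links on the page)
FAKE_WEB = {
    "/page/1": (["item-a", "item-b"], ["/page/2"]),
    "/page/2": (["item-c"], []),
}


def download(url):
    """Downloader: return the 'response' for a URL."""
    return FAKE_WEB[url]


def spider_parse(response):
    """Spider: yield scraped items and follow-up requests."""
    items, links = response
    for item in items:
        yield ("item", item)
    for link in links:
        yield ("request", link)


def run_engine(start_urls):
    """Engine: move requests, responses and items between the components."""
    scheduler = deque(start_urls)        # step 1: start URLs reach the scheduler
    seen = set(start_urls)
    stored = []                          # stands in for the item pipeline
    while scheduler:
        url = scheduler.popleft()        # step 2: scheduler -> downloader
        response = download(url)         # step 3: response -> spider
        for kind, value in spider_parse(response):
            if kind == "request" and value not in seen:
                seen.add(value)
                scheduler.append(value)  # step 4: new request -> scheduler
            elif kind == "item":
                stored.append(value)     # step 5: item -> pipeline
    return stored


stored = run_engine(["/page/1"])
print(stored)
```

The loop ends when the scheduler has no requests left, which is also when a real crawl finishes.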

Implementation

A project mainly consists of __init__.py, items.py, middlewares.py, pipelines.py, and settings.py, plus a custom main spider file.

items.py

This file defines the Field attributes of your item.

import scrapy


class TextItem(scrapy.Item):
    # One Field per value the spider scrapes for each movie
    rank = scrapy.Field()
    title = scrapy.Field()
    point = scrapy.Field()
    quote = scrapy.Field()

settings.py

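# Do not consult robots.txt before crawling (the generated project template sets this to True)
ROBOTSTXT_OBEY = False

# Headers sent with every request, made to look like a desktop browser
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enabled pipelines and their order; lower numbers run first
ITEM_PIPELINES = {
    "text.pipelines.TextPipeline": 300,
}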
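As a hedged illustration of how these settings can be extended: the fragment below adds a second, hypothetical pipeline (text.pipelines.CleanPipeline does not exist in the original project) and a download delay. The values are arbitrary examples.

```python
# Wait about one second between requests to the same site (example value).
DOWNLOAD_DELAY = 1

# With several pipelines, items pass through them in ascending order of the number.
ITEM_PIPELINES = {
    "text.pipelines.CleanPipeline": 200,  # hypothetical cleanup step, runs first
    "text.pipelines.TextPipeline": 300,   # then the pipeline that writes the file
}
```

Each item would go through CleanPipeline before reaching TextPipeline, so cleanup happens before anything is written to disk.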

pipelines.py

This file processes the scraped items.

import json


class TextPipeline:
    """Write every scraped item to txt.json as one JSON array."""

    def open_spider(self, spider):
        # open() raises an exception on failure, so no extra check is needed
        self.f = open('txt.json', 'w', encoding='utf-8')
        self.f.write('[\n')
        self.first = True

    def process_item(self, item, spider):
        # Write the separator before every item except the first, so the file
        # never ends with a trailing comma that has to be truncated afterwards.
        if not self.first:
            self.f.write(',\n')
        self.first = False
        content = json.dumps(dict(item), ensure_ascii=False)
        self.f.write('\t' + content)
        return item

    def close_spider(self, spider):
        # Close the JSON array and release the file handle
        self.f.write('\n]')
        self.f.close()

doubanScrapy.py

import os

import scrapy

from text.items import TextItem


class DoubanscrapySpider(scrapy.Spider):
    name = "doubanScrapy"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    # Counters kept on the spider instead of in module-level globals.
    # The original did not show how i and count were initialized;
    # starting at 1 and 0 is an assumption.
    next_rank = 1
    page_count = 0

    def parse(self, response):
        self.page_count += 1
        rows = response.xpath('//ol[@class="grid_view"]/li')
        # Create the "Rank" folder if it does not exist yet
        # (nothing in the code shown here writes to it)
        os.makedirs("Rank", exist_ok=True)
        for row in rows:
            title = row.xpath('./div/div[@class="info"]/div/a/span/text()').extract_first()
            point = row.xpath('./div/div[@class="info"]/div[@class="bd"]/div/span[@class="rating_num"]/text()').extract_first()
            quote = row.xpath('./div/div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()').extract_first()

            item = TextItem()
            item['title'] = title
            item['rank'] = str(self.next_rank)
            # Append the unit "分" (points); keep None when no rating was found
            item['point'] = (point + "分") if point is not None else None
            item['quote'] = quote  # None when the movie has no quote
            self.next_rank += 1
            yield item

From: https://www.cnblogs.com/imomi3/p/17369839.html
