Basic Scrapy crawler usage & running cmd commands from Python

Tags: python self cmd dushu item scrapy div response

1. Installation

pip install scrapy

2. Running Scrapy and its architecture

1. Creating and running a project

Create the project

aaa@localhost pyspace % scrapy startproject demo1
New Scrapy project 'demo1', using template directory '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/scrapy/templates/project', created in:
    /Users/aaa/app/pyspace/demo1

You can start your first spider with:
    cd demo1
    scrapy genspider example example.com

Project layout

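1. spiders: the folder that holds the spider files, which implement the core crawling logic. New spiders (.py files) go here; one is created below.
2. items.py: defines the data structures (items).
3. middlewares.py: middlewares, e.g. for proxies.
4. pipelines.py: pipeline file for post-processing the scraped data. The template contains a single class, but you can define several. Each pipeline gets a priority in settings.py, conventionally an integer from 0 to 1000 (the generated template uses 300); the lower the value, the earlier it runs.
5. settings.py: configuration, e.g. whether to obey the robots protocol, the User-Agent, and so on.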
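
As a small illustration of the priority rule in item 4, here is a minimal sketch in plain Python (no Scrapy needed). The pipeline class names are hypothetical; only the rule "lower value runs first" comes from the text above.

```python
# Hypothetical ITEM_PIPELINES setting: dotted class path -> priority.
# The class names are made up for illustration.
ITEM_PIPELINES = {
    "demo1.pipelines.SaveToFilePipeline": 300,
    "demo1.pipelines.CleanFieldsPipeline": 100,
    "demo1.pipelines.DropDuplicatesPipeline": 200,
}

# Items pass through the pipelines in ascending priority order.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for position, path in enumerate(run_order, start=1):
    print(position, path, ITEM_PIPELINES[path])
```

With these values an item would be cleaned first, de-duplicated second, and saved last.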

Create the spider file

Enter the project folder, then create the spider:

aaa@localhost pyspace % cd demo1
aaa@localhost demo1 % ls
demo1    scrapy.cfg
aaa@localhost demo1 % scrapy genspider baidu baidu.com
Created spider 'baidu' using template 'basic' in module:
  demo1.spiders.baidu

After running the command above, a new baidu.py appears in the project's spiders directory. After editing, its content is:

import scrapy


class BaiduSpider(scrapy.Spider):
    # Name of the spider
    name = 'baidu'
    # Allowed domains (no http:// prefix needed here)
    allowed_domains = ['baidu.com']
    # Start URLs, i.e. the first URLs to be requested
    start_urls = ['http://baidu.com/']

    # Callback for the start_urls requests; `response` is the returned response object,
    # roughly comparable to response = urllib.request.urlopen(url)
    def parse(self, response):
        print("======")

Run the spider above

Syntax:

scrapy crawl <spider name>

Note the robots protocol mentioned above: it can be understood as a convention stating what may and may not be crawled. Visiting

https://www.baidu.com/robots.txt shows the relevant rules.
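
To make the idea of "what may and may not be crawled" concrete, here is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content and the example.com URLs are made up for illustration and are not Baidu's real rules.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration; not Baidu's real rules.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if the sample rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/book/1.html"))
print(is_allowed("https://example.com/private/data.html"))
```

A crawler that obeys the protocol performs a check of this kind before each request and skips the disallowed URLs.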

1) Stop obeying the robots protocol by editing settings.py

ROBOTSTXT_OBEY = True

Change the setting above to False, or simply comment it out.

2) Run the baidu spider

aaa@localhost demo1 % pwd
/Users/aaa/app/pyspace/demo1
aaa@localhost demo1 % scrapy crawl baidu

3) The output shows the message we printed

Modify the code to locate the 百度一下 (Baidu search) button element

import scrapy


class BaiduSpider(scrapy.Spider):
    # Name of the spider
    name = 'baidu'
    # Allowed domains (no http:// prefix needed here)
    allowed_domains = ['baidu.com']
    # Start URLs, i.e. the first URLs to be requested
    start_urls = ['http://baidu.com/']

    # Callback for the start_urls requests; `response` is the returned response object,
    # roughly comparable to response = urllib.request.urlopen(url)
    def parse(self, response):
        # print("======")
        # The response body as a string
        # print(response.text)
        print("******")
        # The response body as binary data
        # print(response.body)

        # response.xpath parses the response content with an XPath expression.
        # It returns a scrapy.selector.unified.SelectorList
        subList = response.xpath('//*[@id="su"]')
        print(subList)
        print(subList.__class__)
        # Index into the list to get the first Selector, or call extract_first() directly.
        # extract() and extract_first() return the data of the matched element as a str
        print(subList[0].extract())
        print(subList[0].extract().__class__)
        print(subList.extract_first())
        print(subList.extract_first().__class__)
        # .get() is equivalent to .extract_first()
        # print(subList.get())
        # For example, read the button's value attribute directly
        print(response.xpath('//*[@id="su"]/@value').extract_first())

You can of course also use a CSS selector (or parse response.text with bs4):

subList = response.css('#su')

1) Run it again

scrapy crawl baidu

2) Result

******
[<Selector xpath='//*[@id="su"]' data='<input type="submit" id="su" value="百...'>]
<class 'scrapy.selector.unified.SelectorList'>
<input type="submit" id="su" value="百度一下" class="bg s_btn">
<class 'str'>
<input type="submit" id="su" value="百度一下" class="bg s_btn">
<class 'str'>
百度一下
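
The difference between selecting an element and reading one of its attributes can also be tried offline. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup, so it is a stand-in for Scrapy's selectors and not the same API. The snippet is a simplified copy of the button markup printed above.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the button markup printed above.
SNIPPET = '<form><input type="submit" id="su" value="百度一下" class="bg s_btn"/></form>'

root = ET.fromstring(SNIPPET)

# Selecting the element (comparable to response.xpath('//*[@id="su"]')).
button = root.find(".//*[@id='su']")

# Reading one attribute of it (comparable to appending /@value in XPath).
value = button.get("value")

print(button.tag)
print(value)
```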

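2. Architecture and basic principles

1. Architecture

1. Engine: runs automatically and needs no attention; it coordinates all request objects and hands them to the downloader.
2. Downloader: receives request objects from the engine and fetches the data.
3. Spiders: define the crawling behavior and the sites to crawl.
4. Scheduler: queues requests according to its own scheduling rules.
5. Pipelines: process Items in a defined order. Think of this as the data-processing stage; saving to a database or writing to a file usually happens here.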
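
The division of labor between these components can be sketched as a toy program. This is a deliberately simplified, single-threaded model in plain Python, not Scrapy's actual implementation; the fake pages and all function names are made up for illustration.

```python
from collections import deque

# Fake "web": URL -> page body. Entirely made up for illustration.
FAKE_WEB = {
    "http://example.com/1": "page one",
    "http://example.com/2": "page two",
}


def downloader(url):
    """Downloader: turn a request (here just a URL) into a response."""
    return {"url": url, "body": FAKE_WEB[url]}


def spider_parse(response):
    """Spider: turn a response into items."""
    yield {"url": response["url"], "length": len(response["body"])}


def pipeline(item, storage):
    """Pipeline: post-process and store each item."""
    storage.append(item)
    return item


def engine(start_urls):
    """Engine: move data between scheduler, downloader, spider and pipeline."""
    scheduler = deque(start_urls)  # Scheduler: a simple FIFO queue here
    storage = []
    while scheduler:
        url = scheduler.popleft()
        response = downloader(url)
        for item in spider_parse(response):
            pipeline(item, storage)
    return storage


items = engine(list(FAKE_WEB))
print(items)
```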

2. How it works

[Figure: Scrapy workflow diagram]

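3. A Scrapy example

1. Crawling dushu.com

We crawl the book information in the essays category (散文随笔) of dushu.com. The first page is:

https://www.dushu.com/book/1163_1.html

This needs CrawlSpider, which lets you define rules for extracting the links on a page that match them and then continue crawling those links. The link extraction rules are:

allow=() regular expression(s); extract links that match
deny=() regular expression(s); reject links that match
allow_domains=() allowed domains
deny_domains=() denied domains
restrict_xpaths=() extract links from the regions matched by the XPath
restrict_css=() extract links from the regions matched by the CSS selector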
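
How allow and deny patterns filter links can be tried with the re module alone. This is a rough sketch of the matching idea, not LinkExtractor itself; the link list and the deny pattern are made up for illustration.

```python
import re

# Candidate links as they might appear on a listing page (made up for illustration).
links = [
    "https://www.dushu.com/book/1163_2.html",
    "https://www.dushu.com/book/1163_13.html",
    "https://www.dushu.com/book/1175_1.html",
    "https://www.dushu.com/news/",
]

allow = re.compile(r"/book/1163_\d+\.html")
deny = re.compile(r"_13\.html$")  # hypothetical deny rule

# A link is kept when it matches allow and does not match deny.
kept = [url for url in links if allow.search(url) and not deny.search(url)]
print(kept)

# An unescaped dot matches any character, so the looser pattern also accepts this odd URL.
loose = re.compile(r"/book/1163_\d+.html")
odd_url = "/book/1163_2xhtml"
print(bool(loose.search(odd_url)), bool(allow.search(odd_url)))
```

The last two lines show why it is safer to escape the dot before html in such patterns.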

2. Creating and running the project

Create the project

scrapy startproject dushu
cd dushu
scrapy genspider -t crawl read_dushu www.dushu.com
# List the available spiders
aaa@localhost dushu % scrapy list
read_dushu

Modify the code

1) Edit read_dushu.py: for the item data structure, use the simplest option, a dict

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ReadDushuSpider(CrawlSpider):
    name = 'read_dushu'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1163_1.html']

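    '''
    follow controls whether links are followed, i.e. whether this rule is also applied to the pages it extracts.
    False: the rule applies only to the current page
    True: the rule is applied again to the crawled pages, so more pages get crawled (later pages expose higher page numbers; the first page only shows up to 13 and the rest as ...)
    '''
    rules = (
        Rule(LinkExtractor(allow=r'/book/1163_\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@class="bookslist"]/ul/li/div')
        for div in div_list:
            # Each item is a plain dict; create a new one per book so yielded items do not share state
            item = {}
            # data-original means the image is lazy-loaded, so the src attribute cannot be used
            item['src'] = div.xpath('./div/a/img/@data-original').extract_first()
            item['name'] = div.xpath('./div/a/img/@alt').extract_first()
            item['author'] = div.xpath('./p[1]/a[1]/text()|./p[1]/text()').extract_first()
            yield item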
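
The effect of follow can be simulated without any network access. The sketch below is a toy model in plain Python, not Scrapy code: the page numbers and the link graph are made up, and duplicate filtering is reduced to a set.

```python
from collections import deque

# Made-up pagination graph: each listing page links to a few neighbouring pages.
PAGE_LINKS = {
    1: [2, 3],
    2: [1, 3, 4],
    3: [2, 4, 5],
    4: [3, 5],
    5: [4],
}


def crawl(start_page, follow):
    """Return the set of pages reached by applying the rule."""
    seen = set()
    queue = deque(PAGE_LINKS[start_page])  # links extracted from the start page
    while queue:
        page = queue.popleft()
        if page in seen:
            continue  # already handled, skip the duplicate
        seen.add(page)  # the callback would run for this page
        if follow:
            queue.extend(PAGE_LINKS[page])  # apply the rule again on this page
    return seen


print(sorted(crawl(1, follow=False)))
print(sorted(crawl(1, follow=True)))
```

With follow=False only the pages linked from the start page are reached; with follow=True the crawl keeps expanding until no new page numbers appear.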

2) Edit pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
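from itemadapter import ItemAdapter
import json


class DushuPipeline:
    '''
    open_spider / close_spider are each called only once; they are typically used to open and release resources
    '''

    def open_spider(self, spider):
        self.fp = open('dushu.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line so the file can be parsed later
        self.fp.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()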
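
The call order of the three hooks can be checked outside Scrapy. The following is a minimal sketch with a stand-in class and a hand-written driver; the class name, the temporary file, and the sample books are assumptions for illustration, and the stand-in writes one JSON object per line so the file can be read back.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Stand-in pipeline with the same three hooks as DushuPipeline."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(json.dumps(item, ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()


def run(items):
    """Drive the hooks in lifecycle order, then read the file back."""
    path = os.path.join(tempfile.mkdtemp(), "books.jsonl")
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)  # once, before the first item
    for item in items:
        pipeline.process_item(item, spider=None)  # once per item
    pipeline.close_spider(spider=None)  # once, after the last item
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp]


books = run([{"name": "Book A", "author": "X"}, {"name": "Book B", "author": "Y"}])
print(books)
```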

3) Edit settings.py: stop obeying the robots protocol and enable the pipeline

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

...

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'dushu.pipelines.DushuPipeline': 300,
}

Run the project

scrapy crawl read_dushu

Change the item: use an Item class instead of the dict type

1) Edit items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DushuItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    src = scrapy.Field()
    author = scrapy.Field()

2) Edit spiders/read_dushu.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dushu.items import DushuItem


class ReadDushuSpider(CrawlSpider):
    name = 'read_dushu'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1163_1.html']

    '''
    follow controls whether links are followed, i.e. whether this rule is also applied to the pages it extracts.
    False: the rule applies only to the current page
    True: the rule is applied again to the crawled pages, so more pages get crawled (later pages expose higher page numbers; the first page only shows up to 13 and the rest as ...)
    '''
    rules = (
        Rule(LinkExtractor(allow=r'/book/1163_\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@class="bookslist"]/ul/li/div')
        for div in div_list:
            # data-original means the image is lazy-loaded, so the src attribute cannot be used
            src = div.xpath('./div/a/img/@data-original').extract_first()
            name = div.xpath('./div/a/img/@alt').extract_first()
            author = div.xpath('./p[1]/a[1]/text()|./p[1]/text()').extract_first()
            # Build a DushuItem instead of a plain dict
            yield DushuItem(src=src, name=name, author=author)

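3. Extending the project: also crawl the price from the book detail page

The goal is to also scrape the price shown after clicking a book on dushu.com.

Edit items.py to add a price field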

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DushuItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    src = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()

Edit spiders/read_dushu.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dushu.items import DushuItem


class ReadDushuSpider(CrawlSpider):
    name = 'read_dushu'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1163_1.html']

    '''
    follow controls whether links are followed, i.e. whether this rule is also applied to the pages it extracts.
    False: the rule applies only to the current page
    True: the rule is applied again to the crawled pages, so more pages get crawled (later pages expose higher page numbers; the first page only shows up to 13 and the rest as ...)
    '''
    rules = (
        Rule(LinkExtractor(allow=r'/book/1163_\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@class="bookslist"]/ul/li/div')
        for div in div_list:
            # data-original means the image is lazy-loaded, so the src attribute cannot be used
            src = div.xpath('./div/a/img/@data-original').extract_first()
            name = div.xpath('./div/a/img/@alt').extract_first()
            author = div.xpath('./p[1]/a[1]/text()|./p[1]/text()').extract_first()
            url = div.xpath('./div/a/@href').extract_first()
            # The href is relative, so resolve it against the current page URL
            url = response.urljoin(url)
            # Pass the fields collected so far to the detail-page callback via meta
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name, 'src': src, 'author': author})

    def parse_second(self, response):
        # Take the first matching span text on the detail page as the price
        price = response.xpath('//div[@class="book-details"]//span/text()').get()
        name = response.meta['name']
        src = response.meta['src']
        author = response.meta['author']
        yield DushuItem(src=src, name=name, author=author, price=price)
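
Turning the href taken from the listing page into an absolute detail-page URL can be sketched with the standard library's urljoin, which handles both relative and already-absolute links. The href values below are hypothetical; the real ones on the site may differ.

```python
from urllib.parse import urljoin

page_url = "https://www.dushu.com/book/1163_1.html"

# Hypothetical href values; the real ones on the site may differ.
relative_href = "/book/12345678/"
absolute_href = "https://www.dushu.com/book/12345678/"

from_relative = urljoin(page_url, relative_href)
from_absolute = urljoin(page_url, absolute_href)

print(from_relative)
print(from_absolute)
```

Both calls give the same absolute URL, whereas plain string concatenation would only be correct for the relative form.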

Test run

4. Extending further: add a pipeline that downloads the images

Pillow needs to be installed

pip install pillow

Edit pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
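from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import json
import scrapy


class DushuPipeline:
    '''
    open_spider / close_spider are each called only once; they are typically used to open and release resources
    '''

    def open_spider(self, spider):
        self.fp = open('dushu.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line so the file can be parsed later
        self.fp.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()


class ImgsPipLine(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Request the cover image and carry the item along in meta
        yield scrapy.Request(url=item['src'], meta={'item': item})

    # Only the file name is returned here; the storage directory is set by IMAGES_STORE in settings.py.
    # Newer Scrapy versions pass the item as a keyword argument, hence the item=None parameter.
    def file_path(self, request, response=None, info=None, *, item=None):
        print("******")
        item = request.meta['item']
        filePath = item['name']
        return filePath

    def item_completed(self, results, item, info):
        return item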

Edit settings.py to add the related configuration

LOG_LEVEL = "WARNING"
IMAGES_STORE = './result'   # directory where the images are saved
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # These two request headers are required; without the referer, image requests fail with 403.
    'referer': 'https://www.dushu.com/book/1163_11.html',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36'
}

ITEM_PIPELINES = {
   'dushu.pipelines.DushuPipeline': 300,
   'dushu.pipelines.ImgsPipLine': 301,
}

Test

After running, a result directory is created in the project root and the jpg images are downloaded into it.
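
The role of the two required headers can be illustrated outside Scrapy with the standard library. This sketch only builds a request object and never sends it; the image URL is hypothetical, and the header values are the ones from the settings above.

```python
import urllib.request

# Header values taken from the settings above.
HEADERS = {
    "Referer": "https://www.dushu.com/book/1163_11.html",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
    ),
}


def build_image_request(image_url):
    """Build (but do not send) a request that carries both headers."""
    return urllib.request.Request(image_url, headers=HEADERS)


# Hypothetical image URL, for illustration only.
request = build_image_request("https://img.example.com/cover.jpg")
print(request.header_items())
```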

5. Changing the Scrapy log level

Edit settings.py

LOG_LEVEL = "WARNING"
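
Scrapy's logging is built on Python's standard logging module and uses the same level names, so the effect of a WARNING threshold can be sketched with the standard library alone. The logger name below is hypothetical.

```python
import logging

logger = logging.getLogger("demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)

# Levels in ascending severity: DEBUG < INFO < WARNING < ERROR < CRITICAL.
visible = [
    name
    for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")
    if logger.isEnabledFor(getattr(logging, name))
]
print(visible)
```

Only messages at WARNING level or above pass the threshold, which is why the crawl output becomes much quieter.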

6. Writing a main script to launch Scrapy

Method 1

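from scrapy.cmdline import execute
import sys
import os

# Make sure the directory containing this script (the project root) is on the module search path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# Equivalent to running `scrapy crawl read_dushu` on the command line
execute(["scrapy", "crawl", "read_dushu"])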

Method 2

import os

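# os.system cannot capture the console output; it just runs the command and returns its exit status, where 0 means success
# retValue = os.system("ipconfig")
# print(retValue)

# os.popen can capture the console output; it returns a file-like object
# 'r' is the mode argument: the pipe is opened for reading
# retValue = os.popen('ipconfig', 'r')
# res = retValue.read()
# for line in res.splitlines():
#     print(line)
# retValue.close()

# Run a scrapy command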
retValue = os.popen('scrapy list', 'r')
res = retValue.read()
for line in res.splitlines():
    print(line)
retValue.close()
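
As an alternative to os.system and os.popen, the standard library's subprocess.run returns both the exit status and the captured output in one call. The sketch below uses a portable stand-in command so it runs anywhere; for the project above the argument list would be ["scrapy", "list"], which is an assumption about how you would adapt it. The helper name run_command is made up.

```python
import subprocess
import sys


def run_command(args):
    """Run a command and return (exit_code, stdout_lines)."""
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, completed.stdout.splitlines()


# A portable stand-in command; for the project above it would be ["scrapy", "list"].
code, lines = run_command([sys.executable, "-c", "print('read_dushu')"])
print(code, lines)
```

Passing the command as a list avoids going through the shell, so arguments do not need quoting.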

References:

https://docs.scrapy.org/en/latest/topics/commands.html

https://docs.scrapy.org/en/latest/topics/architecture.html

From: https://blog.51cto.com/u_12826294/5782622