Scrapy Module
1 Scrapy Introduction
Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archiving. It provides:
- Full-site data crawling
- XPath support
- Asynchronous downloading
- High-performance persistent storage
- Distributed crawling
Official site: Scrapy | A Fast and Powerful Scraping and Web Crawling Framework (https://scrapy.org/)
1.1 Installation
# Twisted is an event-driven networking engine written in Python; Scrapy is built on top of Twisted
pip install twisted
# Install Scrapy
pip install scrapy
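A quick way to confirm the installation succeeded is the version command (listed among the global commands below):

# verify the installation
scrapy version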
1.2 Scrapy Global Commands
Usage:
  scrapy <command> [options] [args]
Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
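For example, the shell command opens an interactive console against a page, which is handy for trying out XPath expressions before putting them into a spider. A minimal illustration, assuming https://scrapy.org as the target URL (the title shown is indicative):

# open an interactive console against a page
scrapy shell https://scrapy.org
# inside the shell, test selectors against the downloaded response
>>> response.xpath('//title/text()').extract_first()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'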
1.3 Scrapy Project Commands
Usage:
  scrapy <command> [options] [args]
Available commands:
  check   Check spider contracts
  crawl   Run a spider
  edit    Edit spider
  list    List available spiders
  parse   Parse URL (using its spider) and print the results
2 Working with Scrapy
2.1 Creating a Project
# Create the project
scrapy startproject <scrapyPJname>
# Create a spider file
cd <scrapyPJname>
scrapy genspider <spiderName> www.xxx.com
# Run the spider
scrapy crawl <spiderName>
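startproject generates a skeleton roughly like the following (<scrapyPJname> is the placeholder name used above; recent Scrapy versions also include middlewares.py):

<scrapyPJname>/
    scrapy.cfg                # deployment configuration
    <scrapyPJname>/
        __init__.py
        items.py              # item definitions
        middlewares.py        # downloader / spider middlewares
        pipelines.py          # item pipelines
        settings.py           # project settings
        spiders/              # spider files created by genspider
            __init__.py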
2.2 Configuring the Project
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 ' \
             'Safari/537.36'
## Do not obey the robots.txt protocol
ROBOTSTXT_OBEY = False
## Logging
LOG_LEVEL = 'ERROR'
LOG_FILE = 'log.txt'
# 300 is the priority; the smaller the number, the higher the priority
ITEM_PIPELINES = {
    'scrapyPJ01.pipelines.Scrapypj01Pipeline': 300,
}
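Beyond the options above, a few other standard Scrapy settings are commonly tuned as well; these lines are not part of the original configuration and the values are only suggestions:

# settings.py (optional extras, adjust as needed)
CONCURRENT_REQUESTS = 16   # number of concurrent requests (default is 16)
DOWNLOAD_DELAY = 0.5       # delay in seconds between requests to the same site
DEFAULT_REQUEST_HEADERS = {
    'Accept-Language': 'zh-CN,zh;q=0.9',
}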
2.3 Data Parsing
extract(): use when the list returned by xpath() contains multiple elements; it returns all of them as a list of strings
extract_first(): use when only the single (first) element is needed; it returns that string directly
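A minimal, self-contained sketch of the difference, using a hand-written HTML snippet instead of a real response:

from scrapy import Selector

sel = Selector(text='<ul><li>first</li><li>second</li></ul>')
# extract(): returns every matching string as a list
sel.xpath('//li/text()').extract()        # ['first', 'second']
# extract_first(): returns only the first matching string (or None if nothing matches)
sel.xpath('//li/text()').extract_first()  # 'first'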
2.4 Persistent Storage
Via terminal command:
- Only the return value of the parse method can be written to a disk file
- scrapy crawl first -o file.csv
Via pipelines: pipelines.py
- Workflow:
  - 1. Parse the data
  - 2. Define the corresponding fields in the item class
  - 3. Encapsulate the parsed data into an item object, e.g. item['p']
  - 4. Submit the item object to the pipeline
  - 5. The process_item method of the pipeline class receives the item and persists it in whatever form is needed
  - 6. Enable the pipeline in the settings file
- Details:
  - Each pipeline class in the pipelines file represents storage to one particular kind of target.
  - If multiple pipeline classes are defined, the item submitted by the spider is handed to the pipeline class with the highest priority first (see the settings sketch after this list).
  - Returning item from process_item passes the item on to the next pipeline class in line.
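A sketch of how priorities look when several pipeline classes are registered; the class names mirror those used in the examples later in this post:

# settings.py
ITEM_PIPELINES = {
    # lower number = higher priority; the item visits this class first
    'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
    # ... and is passed here next, because process_item returned the item
    'scrapypj01.pipelines.MysqlPipeline': 301,
}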
3 Examples
3.1 Persisting via Terminal Command
ctspider.py
import scrapy


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        data_list = []
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            # Note: the elements of the list returned by xpath are Selector objects; the string data we want
            # is stored inside those objects and must be pulled out with extract()
            # title = div.xpath('./div/div/div[1]/a/text()')  # [<Selector xpath='./div/div/div[1]/a/text()' data='泽连斯基何以当选《时代》2022年度人物?'>]
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            # When the list returned by xpath has several Selector objects, index into it and call extract()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()  # '知世'
            content = div.xpath('./div[1]/div/div[1]/div[3]/text()').extract_first()  # 美国《时代》杂志将乌克兰总统泽连斯基及“乌克兰精神”评为2022年度风云人...
            # Collect each record as a dict; parse returns the full list so it can be exported from the terminal
            data = {
                'title': title,
                'author': author,
                'content': content
            }
            data_list.append(data)
        return data_list
scrapy crawl ctspider -o ctresult.csv
3.2 Introducing Items
items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # Field is a catch-all field type that can hold any kind of data
    title = scrapy.Field()
    author = scrapy.Field()
ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    # Persist via terminal command (scrapy crawl ... -o ...)
    def parse(self, response):
        title = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div/div[1]/div/div/div[1]/a/text()').extract_first()
        author = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div/div[1]/div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
        # Instantiate an item object and fill in its fields
        ctitem = items.Scrapypj01Item()
        ctitem['title'] = title
        ctitem['author'] = author
        return ctitem
scrapy crawl ctspider -o ctspider.csv
3.3 Pipeline-Based Persistent Storage: pipelines.py
items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
pipelines.py: dedicated to persistent storage
# Pipeline for writing items to a local text file
class Scrapypj01Pipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./ctresult.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        author = item['author']
        data = '{0},{1}\n'.format(title, author)
        self.fp.write(data)
        print(data, 'written successfully')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()
settings.py
ITEM_PIPELINES = {
    'scrapypj01.pipelines.Scrapypj01Pipeline': 300,
}
ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            # Instantiate an item object and fill in its fields
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author
            # Submit the item object to the pipeline
            yield ctitem
3.4 MySQL-Based Persistent Storage
items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
pipelines.py: dedicated to persistent storage
# MySQL
import pymysql


# Pipeline for writing items to a local text file
class Scrapypj01Pipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./ctresult.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        author = item['author']
        data = '{0},{1}\n'.format(title, author)
        self.fp.write(data)
        print(data, 'written successfully')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()


# Pipeline for writing items to MySQL
class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('Spider started')
        self.conn = pymysql.connect(host='10.1.1.8', port=3306, user='root', password='Admin@123', db='spiderdb')

    def process_item(self, item, spider):
        sql = 'insert into ctinfo values(%s,%s)'
        data = (item['title'], item['author'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql, data)
            self.conn.commit()
        except Exception as error:
            print(error)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.cursor.close()
        self.conn.close()
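The insert statement above assumes a ctinfo table with two string columns already exists in spiderdb. A hypothetical schema that would match it, created once before crawling (column names and sizes are only an assumption, not from the original notes):

# create_table.py -- hypothetical one-off helper, run before the crawl
import pymysql

conn = pymysql.connect(host='10.1.1.8', port=3306, user='root',
                       password='Admin@123', db='spiderdb')
with conn.cursor() as cursor:
    # two columns matching 'insert into ctinfo values(%s,%s)'
    cursor.execute(
        'create table if not exists ctinfo('
        'title varchar(255), author varchar(64))'
    )
conn.commit()
conn.close()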
settings.py
ITEM_PIPELINES = {
    'scrapypj01.pipelines.MysqlPipeline': 301,
}
ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            # Instantiate an item object and fill in its fields
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author
            # Submit the item object to the pipeline
            yield ctitem
3.5 Redis-Based Persistent Storage
items.py
import scrapy


class Scrapypj01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
pipelines.py: dedicated to persistent storage
# Redis
import json

from redis import Redis


class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('Spider started')
        self.conn = Redis(host='10.1.1.8', port=6379, password='Admin@123')

    def process_item(self, item, spider):
        # Item objects cannot be pushed directly by redis-py; serialize to JSON first
        self.conn.lpush('ctlist', json.dumps(dict(item), ensure_ascii=False))
        return item
settings.py
ITEM_PIPELINES = {
    'scrapypj01.pipelines.RedisPipeline': 302,
}
ctspider.py
import scrapy
import scrapypj01.items as items


class CtspiderSpider(scrapy.Spider):
    name = 'ctspider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        div_list = response.xpath('/html/body/main/div/div/div[1]/div/div[2]/div[1]/div')
        for div in div_list:
            title = div.xpath('./div/div/div[1]/a/text()').extract_first()
            author = div.xpath('./div[1]/div/div[2]/div[1]/span[1]/span/text()')[0].extract()
            # Instantiate an item object and fill in its fields
            ctitem = items.Scrapypj01Item()
            ctitem['title'] = title
            ctitem['author'] = author
            # Submit the item object to the pipeline
            yield ctitem
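Once the crawl has run, the stored items can be read back out of the Redis list. A minimal sketch; the list name ctlist and the connection details follow the RedisPipeline above, and json.loads matches the json.dumps used in process_item:

import json
from redis import Redis

conn = Redis(host='10.1.1.8', port=6379, password='Admin@123')
# read every item pushed by RedisPipeline (lpush puts the newest first)
for raw in conn.lrange('ctlist', 0, -1):
    item = json.loads(raw)
    print(item['title'], item['author'])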