1. Installing Scrapy
Installing Scrapy with conda on Windows 7:
conda search scrapy
conda install scrapy=2.8.0
Add C:\Program Files\Anaconda3\envs\my_env3.8\Scripts to the PATH environment variable
so that the scrapy command can be used in cmd.
cmd must be restarted for this to take effect.
Reference: 性能相关及Scrapy笔记 (博客园, 武沛齐).
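With the install in place, the project used in the later sections can be created with Scrapy's CLI (the project name Test1 matches the imports used below, and the spider name douban matches section 3):
scrapy startproject Test1
cd Test1
scrapy genspider douban movie.douban.com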
2. Scrapy project configuration
- If the site being crawled uses a trusted certificate (supported by default), add the following to settings.py:
DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
-
When running, the log shows:
2023-12-12 20:40:05 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://space.bilibili.com/3493119047764705/video>
The spider is being blocked by robots.txt ("Forbidden by robots.txt").
Set ROBOTSTXT_OBEY to False in settings.py so that Scrapy ignores the robots protocol; crawling then proceeds normally:
ROBOTSTXT_OBEY = False
-
Scrapy returns a 403 error when crawling Douban?
Set USER_AGENT in settings.py to impersonate a browser:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
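Both settings can also be scoped to a single spider instead of the whole project, via the custom_settings class attribute; a minimal sketch:
class DoubanSpider(scrapy.Spider):
    name = "douban"
    # per-spider overrides of settings.py
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    }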
3. A first Scrapy application
douban.py:
import scrapy
from scrapy import Selector
from Test1.items import MovieItem

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        sel = Selector(response)
        list_items = sel.css('#content > div > div.article > ol > li')
        for item in list_items:
            movie_item = MovieItem()
            movie_item['title'] = item.css('span.title::text').extract_first()
            movie_item['rating_num'] = item.css('span.rating_num::text').extract_first()
            movie_item['subject'] = item.css('span.inq::text').extract_first()
            yield movie_item
items.py:
import scrapy

class MovieItem(scrapy.Item):
    title = scrapy.Field()
    rating_num = scrapy.Field()
    subject = scrapy.Field()
Run: scrapy crawl douban -o douban.csv
This exports the scraped data to douban.csv.
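The same export can be configured in settings.py instead of on the command line; a sketch using the FEEDS setting (assuming Scrapy >= 2.4, where the overwrite option is available; the install above uses 2.8.0):
FEEDS = {
    'douban.csv': {
        'format': 'csv',       # also: 'json', 'jsonlines', 'xml', ...
        'encoding': 'utf8',
        'overwrite': True,     # replace the file on each run instead of appending
    },
}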
4. Crawling multiple pages
import scrapy
from scrapy import Selector, Request
from scrapy.http import HtmlResponse
from Test1.items import MovieItem

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response: HtmlResponse):
        sel = Selector(response)
        list_items = sel.css('#content > div > div.article > ol > li')
        for item in list_items:
            movie_item = MovieItem()
            movie_item['title'] = item.css('span.title::text').extract_first()
            movie_item['rating_num'] = item.css('span.rating_num::text').extract_first()
            movie_item['subject'] = item.css('span.inq::text').extract_first()
            yield movie_item
        # print(response.text)
        hrefs_list = sel.css('div.paginator > a::attr(href)')
        for href in hrefs_list:
            url_params = href.extract()
            url = response.urljoin(url_params)
            yield Request(url=url)
There is a bug here:
When crawling page 2, the pagination links include page 1 again ("https://movie.douban.com/top250?start=0&filter="), so page 1 gets crawled a second time. Because page 1 was first requested as https://movie.douban.com/top250, Scrapy's duplicate filter sees two different URLs and cannot deduplicate them.
Fix 1: start_urls = ["https://movie.douban.com/top250?start=0&filter="]
Fix 2 (recommended): don't parse the pagination URLs at all; build the requests directly in start_requests:
...
class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    # start_urls = ["https://movie.douban.com/top250?start=0&filter="]

    def start_requests(self):
        for page in range(10):
            yield Request(url=f'https://movie.douban.com/top250?start={page * 25}&filter=')

    def parse(self, response: HtmlResponse):
        ...
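A third option, not from the original notes, is to follow only the "next page" link so each URL is requested exactly once; a sketch to drop into the end of parse() (the span.next selector is an assumption about Douban's pagination markup):
# at the end of parse(), instead of looping over all paginator links
next_href = sel.css('span.next > a::attr(href)').extract_first()
if next_href:
    yield Request(url=response.urljoin(next_href))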
6. Writing the data to Excel
Install openpyxl: pip install openpyxl
pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import openpyxl

class Test1Pipeline:
    def __init__(self):
        self.wb = openpyxl.Workbook()  # create a workbook
        # self.wb.create_sheet()  # would create an additional worksheet
        self.ws = self.wb.active  # the default worksheet
        self.ws.title = 'Top250'
        self.ws.append(('标题', '评分', '主题'))  # header row

    def close_spider(self, spider):
        self.wb.save('电影数据.xlsx')

    def process_item(self, item, spider):
        title = item.get('title', '')
        rating = item.get('rating_num') or ''
        subject = item.get('subject', '')
        self.ws.append((title, rating, subject))
        return item  # returning the item lets it reach later pipelines and the log output
In settings.py, uncomment the pipeline configuration. 300 is the priority: the lower the number, the earlier the pipeline runs.
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "Test1.pipelines.Test1Pipeline": 300,
}
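For example, once the MySQL pipeline from section 7 below exists, both can be enabled and will run in priority order (a sketch; DbPipeline is the class defined in section 7):
ITEM_PIPELINES = {
    "Test1.pipelines.Test1Pipeline": 300,  # runs first (lower number = higher priority)
    "Test1.pipelines.DbPipeline": 400,     # receives the item that Test1Pipeline returned
}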
7. Writing the data to MySQL
Create the database: CREATE DATABASE Test1Spider;
Switch to it: USE test1spider;
Dropping a table: the DROP TABLE statement removes an existing table:
-- drop the table only if it exists
DROP TABLE IF EXISTS mytable;
-- drop the table directly, without checking whether it exists
DROP TABLE mytable;
Create the table (note: when pasting into the mysql client, indent each column definition with two spaces rather than a tab):
CREATE TABLE IF NOT EXISTS `tb_top_movie`(
  `mov_id` INT UNSIGNED AUTO_INCREMENT COMMENT '编号',
  `title` VARCHAR(50) NOT NULL COMMENT '标题',
  `rating` DECIMAL(3,1) NOT NULL COMMENT '评分',
  `subject` VARCHAR(200) DEFAULT '' COMMENT '主题',
  PRIMARY KEY (`mov_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Top电影表';
This avoids ERROR 1064 (42000): You have an error in your SQL syntax.
pipelines.py:
import pymysql

class DbPipeline:
    def __init__(self):
        self.conn = pymysql.connect(host='localhost',
                                    port=3306,
                                    user='root',
                                    password='123456',
                                    database='test1spider',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        title = item.get('title', '')
        rating = item.get('rating_num') or 0
        subject = item.get('subject', '')
        self.cursor.execute(
            'insert into tb_top_movie (title, rating, subject) values (%s, %s, %s)',
            (title, rating, subject)
        )
        return item
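Hardcoding credentials in the pipeline is brittle. A sketch of reading them from settings.py instead, via Scrapy's from_crawler hook (the MYSQL_* names are made-up custom settings, not built-in Scrapy ones):
import pymysql

class DbPipeline:
    def __init__(self, host, user, password, database):
        self.conn = pymysql.connect(host=host, port=3306, user=user,
                                    password=password, database=database,
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to construct the pipeline; crawler.settings reads settings.py
        s = crawler.settings
        return cls(s.get('MYSQL_HOST', 'localhost'),
                   s.get('MYSQL_USER', 'root'),
                   s.get('MYSQL_PASSWORD', ''),
                   s.get('MYSQL_DATABASE', 'test1spider'))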
Batch writes:
import pymysql

class DbPipeline:
    def __init__(self):
        self.conn = pymysql.connect(host='localhost',
                                    port=3306,
                                    user='root',
                                    password='123456',
                                    database='test1spider',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()
        self.data = []

    def close_spider(self, spider):
        if len(self.data) > 0:
            self._write_to_db()  # flush whatever is left in the buffer
        self.conn.close()

    def process_item(self, item, spider):
        title = item.get('title', '')
        rating = item.get('rating_num') or 0
        subject = item.get('subject', '')
        self.data.append((title, rating, subject))
        if len(self.data) >= 100:  # flush every 100 items
            self._write_to_db()
        return item  # return the item so the next pipeline can continue processing it

    def _write_to_db(self):
        self.cursor.executemany(
            'insert into tb_top_movie (title, rating, subject) values (%s, %s, %s)',
            self.data
        )
        self.conn.commit()
        self.data.clear()
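If a batch insert fails mid-crawl, the buffered rows are silently lost. A sketch of _write_to_db with basic error handling (design choice: the buffer is cleared only after a successful commit, so the rows survive for a retry):
def _write_to_db(self):
    try:
        self.cursor.executemany(
            'insert into tb_top_movie (title, rating, subject) values (%s, %s, %s)',
            self.data
        )
        self.conn.commit()
    except pymysql.MySQLError:
        self.conn.rollback()  # undo the partial batch
        raise
    else:
        self.data.clear()  # keep the rows on failure, discard them on success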
8. Miscellaneous
class TestSpider(scrapy.Spider):
    ...
    def parse(self, response: HtmlResponse):  # PyCharm warns here: this overrides the parent method but its signature does not match
        ...
Fix:
class TestSpider(scrapy.Spider):
    ...
    def parse(self, response: HtmlResponse, **kwargs):  # the parent method takes **kwargs
        ...
From: https://www.cnblogs.com/zhlforhe/p/18014682