1. Custom file-storage implementation:
(1). spidertest/pipelines.py:
import codecs
import json


class JsonPipeline(object):
    # Custom JSON file export
    def __init__(self):
        # Open the JSON output file
        self.file = codecs.open('test.json', 'w', encoding="utf-8")

    def process_item(self, item, spider):
        # Without ensure_ascii=False, Chinese and other non-ASCII characters
        # would be written as \uXXXX escapes instead of readable text
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    # Close the file when the spider closes; the pipeline hook Scrapy calls
    # automatically is close_spider (a method named spider_closed would not be called)
    def close_spider(self, spider):
        self.file.close()
Note:
①. codecs handles the encoding details more simply than the built-in open.
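For comparison, on Python 3 the built-in open also accepts an encoding argument, so the codecs.open call above could be written without codecs. A minimal standalone sketch (assuming Python 3; file name as above):

# Equivalent on Python 3: pass the encoding directly to open
file = open('test.json', 'w', encoding="utf-8")
file.write('{"title": "测试"}\n')
file.close()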
(2). spidertest/settings.py:
ITEM_PIPELINES = {
    'spidertest.pipelines.JsonPipeline': 2,
}
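For context, here is a minimal sketch of the item side that would feed this pipeline; the item class and field names (TestItem, title, url) are hypothetical and not from the original post:

# spidertest/items.py (hypothetical example item)
import scrapy

class TestItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

# In a spider callback, each yielded item passes through process_item above, e.g.:
#     yield TestItem(title=response.css('title::text').get(), url=response.url)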
2. Using Scrapy's built-in exporter:
spidertest/pipelines.py:
from scrapy.exporters import JsonItemExporter


class JsonExporterPipeline(object):
    # Export the JSON file via the JsonItemExporter provided by Scrapy
    def __init__(self):
        self.file = open('test.json', 'wb')  # the exporter needs a binary-mode file
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()  # stop exporting
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
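As with the custom pipeline in section 1, this pipeline presumably also has to be enabled in spidertest/settings.py before it runs (the priority value here is arbitrary):

ITEM_PIPELINES = {
    'spidertest.pipelines.JsonExporterPipeline': 2,
}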
Note:
①. The exporters module also provides CSV and XML exporters.
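As a sketch of that, a CSV variant of the same pipeline using CsvItemExporter (the class name and output file name are illustrative, not from the original post):

from scrapy.exporters import CsvItemExporter


class CsvExporterPipeline(object):
    # Same pattern as the JSON exporter pipeline above, but writing CSV
    def __init__(self):
        self.file = open('test.csv', 'wb')  # binary mode, like the JSON exporter
        self.exporter = CsvItemExporter(self.file, encoding="utf-8")
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item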
From: https://blog.51cto.com/u_16251183/7556518