Create a project
scrapy startproject project_name
Create a spider
scrapy genspider spider_name domain
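For example, assuming the project is named scrapy04 (the name that appears in ITEM_PIPELINES later in this post) and the spider is named zol and targets the ZOL wallpaper site desk.zol.com.cn (both names are assumptions for illustration):

scrapy startproject scrapy04
cd scrapy04
scrapy genspider zol desk.zol.com.cn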
Pillow is required (ImagesPipeline depends on it)
pip install pillow
Error: twisted.python.failure.Failure OpenSSL.SSL.Error
Solution: downgrade cryptography
pip uninstall cryptography
pip install cryptography==36.0.2
Code (file name: spider_name.py)
Use XPath to locate the image URL and its title, then yield them to the item pipeline
import scrapy


class ZolSpider(scrapy.Spider):
    name = "spider_name"
    allowed_domains = ["domain"]
    start_urls = ["image_page_url"]

    def parse(self, response):
        # the large preview image and the <h3> title on the page
        url = response.xpath('//img[@id="bigImg"]/@src').get()
        name = response.xpath('string(//h3)').get()
        yield {'image_url': url, 'name': name}
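Before wiring up the pipeline, the two XPath expressions can be sanity-checked in scrapy shell against a real wallpaper page (the URL below is just a placeholder):

# open an interactive shell on a placeholder page URL
scrapy shell "https://desk.zol.com.cn/bizhi/placeholder.html"
>>> response.xpath('//img[@id="bigImg"]/@src').get()
>>> response.xpath('string(//h3)').get()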
File name: pipelines.py
Image pipeline: from scrapy.pipelines.images import ImagesPipeline
The get_media_requests(self, item, info) method:
- sends the request that downloads the image
- passes the file name along (the item is still available in file_path)
Override file_path(self, request, response=None, info=None, *, item=None):
- to change the file name and the save path
import re

from scrapy.http.request import Request
from scrapy.pipelines.images import ImagesPipeline


class MyImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # request the single image URL yielded by the spider
        return Request(item['image_url'])

    def file_path(self, request, response=None, info=None, *, item=None):
        # strip whitespace and brackets from the title and replace "/" so it is a valid file name
        # name = item.get('name').strip().replace('\r\n\t\t', '').replace('(', '').replace(')', '').replace('/', '_')
        name = re.sub('/', '_', re.sub(r'[\s()]', '', item.get('name')))
        return f'{name}.jpg'
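Optionally, ImagesPipeline also exposes an item_completed(results, item, info) hook that can be overridden in MyImagePipeline to verify the download and record where the file was saved; a minimal sketch, where the image_path key is an assumption and not part of the original item:

from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MyImagePipeline(ImagesPipeline):
    # get_media_requests and file_path stay the same as above

    def item_completed(self, results, item, info):
        # results is a list of (success, detail) tuples, one per media request
        saved = [detail['path'] for ok, detail in results if ok]
        if not saved:
            raise DropItem('image download failed')
        item['image_path'] = saved[0]  # path relative to IMAGES_STORE; key name is hypothetical
        return item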
Modify settings.py
Save path for the downloaded images
IMAGES_STORE = r"./imgs"
Enable MyImagePipeline in the item pipelines (the number is its priority; lower values run first)
ITEM_PIPELINES = {
    "scrapy04.pipelines.MyImagePipeline": 300,
}
Set a request header (User-Agent)
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"