The Scrapy documentation gives the following example of how to use an authenticated session in Scrapy, logging in with form data:
```python
class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # continue scraping with authenticated session...
```
But is there a way to perform this authentication with a client certificate instead?
Unfortunately, Scrapy does not support client-certificate login directly out of the box. Scrapy's networking is built on Twisted (not on the requests library), and its default download handler does not present a client certificate during the TLS handshake.
However, certificate-based login can be achieved in a few ways:
1. Use a custom Downloader Middleware:
   - Create a custom Downloader Middleware that handles loading the certificate and preparing the request.
   - In the middleware's `process_request` method, build a `requests.Session` with the certificate loaded, and attach the Session object to the request's `meta` attribute.
   - Adapt `LoginSpider` so that its callback retrieves the Session object from the request's `meta` attribute and uses it for the requests that must present the certificate.

Example code:
```python
import ssl

import requests
import scrapy
from scrapy.http import FormRequest
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager


class SSLCertAdapter(HTTPAdapter):
    """Custom adapter that presents a client certificate on HTTPS connections."""

    def __init__(self, cert_file, key_file, *args, **kwargs):
        # Must be set before super().__init__, which calls init_poolmanager().
        self.cert_file = cert_file
        self.key_file = key_file
        super(SSLCertAdapter, self).__init__(*args, **kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **pool_kwargs):
        # Build an SSL context that loads the client certificate and key.
        context = ssl.create_default_context()
        context.load_cert_chain(certfile=self.cert_file, keyfile=self.key_file)
        # HTTPAdapter expects the pool manager to be assigned, not returned.
        self.poolmanager = PoolManager(
            num_pools=connections,
            maxsize=maxsize,
            block=block,
            ssl_context=context,
            **pool_kwargs
        )

class CertMiddleware(object):
    """Downloader middleware that attaches a certificate-backed Session
    to requests flagged with meta['use_cert']."""

    def __init__(self, settings):
        self.cert_file = settings.get('CERT_FILE')
        self.key_file = settings.get('KEY_FILE')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        if request.meta.get('use_cert'):
            session = requests.Session()
            session.mount('https://', SSLCertAdapter(self.cert_file, self.key_file))
            # Expose the Session to spider callbacks via meta.
            request.meta['cert_session'] = session
        # Returning None lets Scrapy download the request as usual.
        return None

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['https://www.example.com/users/login.php']

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login,
            meta={'use_cert': True}
        )

    def after_login(self, response):
        session = response.meta['cert_session']
        # Use the Session for follow-up requests that must present the
        # certificate, e.g. session.get('https://www.example.com/...')
```
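One caveat about this sketch: because `process_request` returns `None`, Scrapy still downloads the login request itself, without the certificate; the Session stored in `meta` only covers the follow-up requests you issue manually from callbacks. If the login endpoint itself requires the certificate, the middleware can instead short-circuit Scrapy's downloader by returning a Response it builds from a requests call (returning a Response from `process_request` is standard downloader-middleware behavior). A rough, GET-only drop-in replacement for the `process_request` above; a `FormRequest` would additionally need its method, body, and headers forwarded:

```python
from scrapy.http import HtmlResponse

def process_request(self, request, spider):
    if request.meta.get('use_cert'):
        session = requests.Session()
        session.mount('https://', SSLCertAdapter(self.cert_file, self.key_file))
        request.meta['cert_session'] = session
        # Download through the Session so the certificate is presented,
        # then hand the result back to Scrapy, skipping its own downloader.
        r = session.get(request.url)
        return HtmlResponse(url=request.url, body=r.content,
                            encoding='utf-8', request=request)
    return None
```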
settings.py:
```python
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CertMiddleware': 543,
}

CERT_FILE = 'path/to/your/cert.pem'
KEY_FILE = 'path/to/your/key.pem'
```
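Side note: in many cases the custom adapter is unnecessary, because requests has built-in client-certificate support via its `cert` option; the adapter route is mainly useful when you need full control over the underlying `ssl.SSLContext`. A minimal sketch, reusing the placeholder paths from the settings above:

```python
import requests

session = requests.Session()
# requests accepts a (cert, key) tuple and presents it on every request
session.cert = ('path/to/your/cert.pem', 'path/to/your/key.pem')
```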
2. Use another Python library that supports client certificates:
   - Libraries such as `urllib3` or `http.client` can load a client certificate directly; you can build and send the login request manually with one of them, as sketched below.
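A minimal `urllib3` sketch of this option, reusing the placeholder certificate paths and the login form from the example above:

```python
import ssl
import urllib3

# SSL context that presents the client certificate during the handshake
context = ssl.create_default_context()
context.load_cert_chain(certfile='path/to/your/cert.pem',
                        keyfile='path/to/your/key.pem')

http = urllib3.PoolManager(ssl_context=context)

# Send the login form as application/x-www-form-urlencoded
resp = http.request(
    'POST',
    'https://www.example.com/users/login.php',
    fields={'username': 'john', 'password': 'secret'},
    encode_multipart=False,
)
print(resp.status)
```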
Please note that the code samples above are for reference only and will need to be adapted to your actual situation (certificate paths, URLs, and form fields).
Tags: python, web-scraping, scrapy, certificate From: 19338894