[Python] Crawler Notes: ConnectionResetError(10054)

Tags: Python, ConnectionResetError, 10054, crawler, 537.36, sleep, requests, com

0x01

While batch-downloading images from a website, I ran into a typical problem:

requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

At first I assumed this was an anti-crawling measure on the site, so I added a User-Agent header.

[Reference: python requests error Connection aborted ConnectionResetError RemoteDisconnected, how to fix - whatday's blog - CSDN]

import random
import requests

user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15",
]

# Pick a random User-Agent for this run
headers = {'User-Agent': random.choice(user_agent_list)}

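I also added a pause after each image download with time.sleep(3), but the problem persisted: an exception was raised every few links. Suspecting that the connection was not being released after each response = requests.get(url, headers=headers), I added response.close(), which made no difference. [Note: this usage is not correct. "Does requests.get in Python need a close afterwards? - Zhihu (zhihu.com)" advises that when no shared session is specified, a requests.get() call does not need close to release resources.]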

0x02

Next I tried ignoring the exception in code:

try:
    r = requests.get(url, headers=headers)
    with open(myPath, 'wb') as f:
        f.write(r.content)
except requests.exceptions.RequestException as e:
    pass

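[Note: ConnectionError is a subclass of RequestException.] The program did not continue as expected; it simply hung. One article attributes this to requests having no read timeout by default. More precisely, requests applies no timeout at all unless one is passed explicitly, so a stalled connection neither raises an error nor lets the program move on. Setting the timeout parameter fixes this. [Reference: python requests timeout: handling a hang with no response - Jianshu (jianshu.com)] With the read timeout in place, almost every subsequent page failed to download, yet opening the image links directly in a browser still worked.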
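A minimal sketch of the timeout fix described above, not the author's actual code. The function name fetch_image, the injectable get parameter (there only so the function can be exercised without a network), and the timeout values are all assumptions for illustration. The timeout tuple is (connect timeout, read timeout) in seconds.

```python
import requests


def fetch_image(url, headers=None, timeout=(3.05, 10), get=requests.get):
    """Return the response body as bytes, or None if the request fails.

    fetch_image and its defaults are hypothetical; adjust them to the site.
    """
    try:
        # Both the connect and the read phase are bounded, so a stalled
        # server raises a Timeout instead of blocking forever.
        response = get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        return response.content
    except requests.exceptions.RequestException as exc:
        # Report the failure instead of silently passing, so stalls are visible.
        print(f"request failed for {url}: {exc}")
        return None
```

Logging the exception rather than using a bare pass makes it much easier to tell a timeout from a reset connection when many downloads fail in a row.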

0x03

So I browsed StackOverflow and found three explanations:

① The server and client disagree about a timeout

This can be caused by the two sides of the connection disagreeing over whether the connection timed out or not during a keepalive. (Your code tries to reused the connection just as the server is closing it because it has been idle for too long.) You should basically just retry the operation over a new connection. (I'm surprised your library doesn't do this automatically.)

[Reference: twitter - python: [Errno 10054] An existing connection was forcibly closed by the remote host - Stack Overflow]

② Request rate: the server treats the traffic as malicious

The web server actively rejected your connection. That's usually because it is congested, has rate limiting or thinks that you are launching a denial of service attack. If you get this from a server, you should sleep a bit before trying again. In fact, if you don't sleep before retry, you are a denial of service attack. The polite thing to do is implement a progressive sleep of, say, (1,2,4,8,16,32) seconds.

[Reference: python - How to solve the 10054 error - Stack Overflow]
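The progressive sleep suggested in that answer could look roughly like this. It is a sketch under assumptions: call_with_backoff, its parameters, and the choice of OSError as the default retryable exception are illustrative, not part of any library. The sleep parameter is injectable only so the behaviour can be checked without actually waiting.

```python
import time


def call_with_backoff(operation, delays=(1, 2, 4, 8, 16, 32),
                      retry_on=(OSError,), sleep=time.sleep):
    """Call operation(); on failure wait 1, 2, 4, ... seconds and try again.

    Hypothetical helper. ConnectionResetError is a subclass of OSError,
    so the default retry_on covers error 10054 raised at the socket level.
    """
    for delay in delays:
        try:
            return operation()
        except retry_on:
            # Back off before the next attempt so the server is not hammered.
            sleep(delay)
    # Final attempt after the last delay; any exception now propagates.
    return operation()
```

When wrapping a requests call, something like call_with_backoff(lambda: requests.get(url, timeout=10), retry_on=(requests.exceptions.RequestException,)) would be one way to use it.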

③ Header issues

Besides User-Agent, it may also be necessary to set headers such as Accept, Accept-Encoding, Content-Type, and Connection.
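As an illustration of the header approach, the sketch below sets several browser-like headers once on a requests.Session so that every request sent through it carries them. The helper name build_session and the specific header values are assumptions; which headers a given site actually checks is unknown. Content-Type is left out because it describes a request body, which a plain GET does not have.

```python
import requests


def build_session(user_agent):
    """Return a Session that sends browser-like headers (hypothetical helper)."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": user_agent,
        # Illustrative values only; copy real ones from the browser's dev tools.
        "Accept": "image/webp,image/*,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
    })
    return session
```

A Session also reuses the underlying connection between requests, which is relevant to the keep-alive behaviour discussed in explanation ①.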

0x04

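Since I had already tried both sleep and User-Agent, and the browser could still open the links while the exceptions were being raised, I concluded that the cause was the server-client timeout disagreement. There are several ways to retry the request:

① Recursive call (the function calls itself again on failure)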

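def download_image(i, url, retries=3):
    # headers and mypath are defined by the surrounding script
    try:
        r = requests.get(url, headers=headers, timeout=10)
        with open(mypath, 'wb') as f:
            f.write(r.content)
    except requests.exceptions.RequestException:
        # Retry by calling the function again; the retries cap prevents unbounded recursion
        if retries > 0:
            download_image(i, url, retries - 1)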

② while loop

Simply wrap the try-except in a while loop.
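One way the while-loop variant might look is sketched below. The name get_with_retries, the attempt cap, and the fixed delay are assumptions for illustration. The get and sleep parameters are injectable only so the loop can be tested without network access or real waiting.

```python
import time

import requests


def get_with_retries(url, headers=None, max_attempts=5, delay=3,
                     get=requests.get, sleep=time.sleep):
    """Return the response bytes, or None after max_attempts failures.

    Hypothetical helper; the cap keeps the loop from spinning forever.
    """
    attempt = 0
    while attempt < max_attempts:
        attempt += 1
        try:
            r = get(url, headers=headers, timeout=10)
            r.raise_for_status()
            return r.content
        except requests.exceptions.RequestException:
            # Pause before the next attempt, but not after the last one.
            if attempt < max_attempts:
                sleep(delay)
    return None
```

The caller can then write the returned bytes to disk only when the result is not None, which keeps failed downloads from producing empty files.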

③ Set max_retries (built into the session, via the max_retries parameter of HTTPAdapter)
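A rough sketch of the max_retries option, assuming the Retry class from urllib3 (which ships as a dependency of requests). The helper name make_retrying_session and the retry counts, backoff factor, and status codes are illustrative choices, not values taken from the original post.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=5, backoff_factor=0.5):
    """Return a Session whose adapters retry failed requests (hypothetical helper)."""
    retry = Retry(
        total=total,                    # overall cap on retries
        backoff_factor=backoff_factor,  # growing pause between attempts
        status_forcelist=(429, 500, 502, 503, 504),  # also retry these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Mount the adapter for both schemes so every request uses the policy.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Requests made with session.get(url, timeout=10) then go through the retry policy without any explicit loop in the calling code.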

④ The retry module

Python Crawler Examples (3): Error retries and timeout handling - Zhihu (zhihu.com)

0x05

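Advanced use of time.sleep:

Python crawlers often need to sleep to avoid IP bans: time sleep - DN_XIAOXIAO's blog - CSDN

About keep-alive:

python-requests must be used as follows to maintain keep-alive - 黄传通's blog - CSDN

Anti-crawling & counter-anti-crawling:

About exceptions: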

From: https://www.cnblogs.com/victorique-de-blois/p/16988559.html