An Anti-Crawler Mechanism Based on Encoding

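I ran into an anti-crawler mechanism on a GBK-encoded web page. Some of its request parameters are percent-encoded as GBK and others as UTF-8, and certain "safe" characters are left unencoded, all of which got in the way of crawling.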

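Tip: when the parameters look correct but the data still cannot be fetched, compare response.request.headers and response.request.body with the request the browser sends. The problem is usually easy to spot that way.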
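One way to make that comparison concrete is to check, parameter by parameter, which charset each percent-encoded value was produced with. The following is a minimal sketch using only the standard library. The helper name guess_param_encodings, the parameter names, and both query strings are made up for illustration, and the rule "not valid UTF-8 means GBK" is an assumption that fits this site, not a general detector.

```python
from urllib.parse import unquote_to_bytes


def guess_param_encodings(raw_query):
    """Guess, per parameter, which charset a raw query string was percent-encoded with."""
    report = {}
    for pair in raw_query.split('&'):
        name, _, value = pair.partition('=')
        raw = unquote_to_bytes(value)  # percent escapes -> raw bytes, no charset assumed
        if raw.isascii():
            report[name] = 'ascii'
            continue
        try:
            raw.decode('utf-8')
            report[name] = 'utf-8'
        except UnicodeDecodeError:
            # Assumption: on this site, anything that is not valid UTF-8 is GBK.
            report[name] = 'gbk'
    return report


# Hypothetical query strings: one copied from the browser's network panel,
# one taken from the request the script actually sent.
browser_query = 'action=%B5%BC%B3%F6&keyword=%E5%AF%BC%E5%87%BA&page=1'
script_query = 'action=%B5%BC%B3%F6&keyword=%B5%BC%B3%F6&page=1'

browser_report = guess_param_encodings(browser_query)
script_report = guess_param_encodings(script_query)
mismatched = [k for k in browser_report if browser_report[k] != script_report.get(k)]

print(browser_report)
print('parameters encoded differently from the browser:', mismatched)
```

Any parameter listed as mismatched is a candidate for the wrong charset on the script side.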

1. Python's URL encoding functions

There are also the general-purpose methods decode (bytes to a Unicode str) and encode (str to bytes).

import requests
from urllib.parse import urlencode  # encodes a whole dict at once
from requests.utils import quote, unquote  # encode and decode a single string

# Encodes the whole dict in one go, which is inconvenient when only some parameters need a different encoding.
Signature:
urlencode(
    query,
    doseq=False,
    safe='',  # characters that are left unencoded
    encoding=None,
    errors=None,
    quote_via=quote_plus,
)
Docstring:
Encode a dict or sequence of two-element tuples into a URL query string.

If any values in the query arg are sequences and doseq is true, each
sequence element is converted to a separate parameter.

If the query arg is a sequence of two-element tuples, the order of the
parameters in the output will match the order of parameters in the
input.

The components of a query arg may each be either a string or a bytes type.

The safe, encoding, and errors parameters are passed down to the function
specified by quote_via (encoding and errors only if a component is a str).

# Encode a string
Signature: requests.utils.quote(string, safe='/', encoding=None, errors=None)
Docstring:
quote('abc def') -> 'abc%20def'

# Decode a string
Signature: requests.utils.unquote(string, encoding='utf-8', errors='replace')
Docstring:
Replace %xx escapes by their single-character equivalent. The optional
encoding and errors parameters specify how to decode percent-encoded
sequences into Unicode characters, as accepted by the bytes.decode()
method.
By default, percent-encoded sequences are decoded with UTF-8, and invalid
sequences are replaced by a placeholder character.

unquote('abc%20def') -> 'abc def'.
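Because urlencode applies one encoding and one safe set to the whole dict, a query string with per-parameter encodings has to be assembled by hand. Below is a minimal sketch of that idea. build_query and the parameter names action and dept are hypothetical, chosen only to show one UTF-8 parameter that keeps "+" unencoded next to one GBK parameter.

```python
from urllib.parse import quote


def build_query(params):
    """Build a query string from (name, value, encoding, safe) tuples."""
    parts = []
    for name, value, encoding, safe in params:
        encoded_value = quote(value, safe=safe, encoding=encoding)
        parts.append(quote(name, safe='') + '=' + encoded_value)
    return '&'.join(parts)


# Hypothetical parameters: each one carries its own charset and safe characters.
params = [
    ('action', '+导出+++', 'utf-8', '+'),  # UTF-8, '+' left as-is
    ('dept', '导出', 'gbk', ''),           # GBK, nothing extra left unencoded
]

query = build_query(params)
print(query)  # action=+%E5%AF%BC%E5%87%BA+++&dept=%B5%BC%B3%F6
```

The resulting string can then be appended to the URL directly instead of being passed through urlencode, which would re-encode every value the same way.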

2. Simulation: the wrong encoding

A safe character is also set here: without safe='+', the "+" would be percent-encoded as %2B.

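# Case: the request is wrongly encoded with the page encoding (GBK)
# 1. Original request data ('导出' means "export")
req1 = '+导出+++'
print('1. Original request data:', req1)
# 2. Data actually sent after the client percent-encodes it
req2 = quote(req1, encoding='gbk', safe='+')
print('2. Data sent after client-side encoding:', req2)
# 3. The server receives req2 and decodes it with the page encoding (GBK) before handing it to the backend
req3 = unquote(req2, encoding='gbk')
print('3. Data the backend receives after server-side decoding:', req3)
# 4. The backend re-encodes with GBK to recover the raw data the client sent
req4 = quote(req3, encoding='gbk', safe='+')
print('4. Raw client data recovered by the backend:', req4)
# 5. The backend decodes the raw data with the encoding it expects (UTF-8)
req5 = unquote(req4, encoding='utf8')
print('5. Final data after decoding as UTF-8 (garbled):', req5)

The backend ends up with garbled data. It can validate the result and refuse to process any request that was not encoded the expected way.

1. Original request data: +导出+++
2. Data sent after client-side encoding: +%B5%BC%B3%F6+++
3. Data the backend receives after server-side decoding: +导出+++
4. Raw client data recovered by the backend: +%B5%BC%B3%F6+++
5. Final data after decoding as UTF-8 (garbled): +����+++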
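The validation mentioned above could look roughly like the following. This is only a guess at what such a backend might do, not the site's actual code: backend_accepts is a hypothetical helper that decodes the raw percent-encoded value strictly as UTF-8 and rejects anything that fails.

```python
from urllib.parse import quote, unquote_to_bytes


def backend_accepts(raw_value):
    """Return the decoded text if raw_value is valid percent-encoded UTF-8, else None."""
    try:
        # errors='strict' raises instead of inserting U+FFFD placeholders
        return unquote_to_bytes(raw_value).decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        return None


gbk_request = quote('+导出+++', encoding='gbk', safe='+')     # what a naive crawler sends
utf8_request = quote('+导出+++', encoding='utf-8', safe='+')  # what the browser sends

print(backend_accepts(gbk_request))   # None: the request is silently dropped
print(backend_accepts(utf8_request))  # +导出+++
```

A crawler that encodes every parameter with the page charset would therefore be filtered out without any explicit error.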

3. Simulation: the correct encoding

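# 1. Original request data ('导出' means "export")
req1 = '+导出+++'
print('1. Original request data:', req1)
# 2. Data actually sent after the client percent-encodes it as UTF-8
req2 = quote(req1, encoding='utf8', safe='+')
print('2. Data sent after client-side encoding:', req2)
# 3. The server receives req2 and decodes it with the page encoding (GBK) before handing it to the backend
req3 = unquote(req2, encoding='gbk')
print('3. Data the backend receives after server-side decoding:', req3)
# 4. The backend re-encodes with GBK to recover the raw data the client sent
req4 = quote(req3, encoding='gbk', safe='+')
print('4. Raw client data recovered by the backend:', req4)
# 5. The backend decodes the raw data with the encoding it expects (UTF-8)
req5 = unquote(req4, encoding='utf8')
print('5. Final data after decoding as UTF-8:', req5)

1. Original request data: +导出+++
2. Data sent after client-side encoding: +%E5%AF%BC%E5%87%BA+++
3. Data the backend receives after server-side decoding: +瀵煎嚭+++
4. Raw client data recovered by the backend: +%E5%AF%BC%E5%87%BA+++
5. Final data after decoding as UTF-8: +导出+++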
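The same round trip can be followed at the byte level with plain encode and decode, which shows why the garbled intermediate text in step 3 is harmless here. This is a small sketch of steps 2 to 5 without the percent-encoding layer. Note that the GBK decode and re-encode recovers the original bytes only when the UTF-8 bytes happen to form valid GBK sequences, as they do for this value; that is not guaranteed for arbitrary text.

```python
text = '导出'

wire_bytes = text.encode('utf-8')         # step 2: bytes the client puts on the wire
mojibake = wire_bytes.decode('gbk')       # step 3: server misreads them as GBK
recovered_bytes = mojibake.encode('gbk')  # step 4: backend undoes the GBK decode
final = recovered_bytes.decode('utf-8')   # step 5: backend reads the bytes as UTF-8

print(wire_bytes.hex().upper())       # E5AFBCE587BA
print(mojibake)                       # garbled intermediate text
print(recovered_bytes == wire_bytes)  # True
print(final)                          # 导出
```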

From： https://www.cnblogs.com/q-q56731526/p/17328901.html
