首页 > 编程语言 >python学习 爬取亚马逊网页,失败后。修改HTTP报文头部后成功!

python学习 爬取亚马逊网页,失败后。修改HTTP报文头部后成功!

时间:2022-11-16 10:32:44浏览次数:45  
标签:HTTP cn python t0 爬取 amazon ue requests true


通过修改HTTP报文头部,来成功获取网页内容!

 

 

python
import requests
r = requests.get("https://www.amazon.cn/gp/product/B01M8L5Z3Y")
r.status_code
r.encoding

 

 

>>> import requests
>>> r = requests.get("https://www.amazon.cn/gp/product/B01M8L5Z3Y")
>>> r.status_code
503
>>> r.encoding
'ISO-8859-1'
>>>
>>> r.encoding = r.apparent_encoding
>>> r.text
'<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->\n<!--[if IE 7]>    <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->\n<!--[if IE 8]>    <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n<title dir="ltr">Amazon CAPTCHA</title>\n<meta name="viewport" content="width=device-width">\n<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">\n<script>\n\nif (true === true) {\n    var ue_t0 = (+ new Date()),\n        ue_csm = window,\n        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },\n        ue_furl = "fls-cn.amazon.cn",\n        ue_mid = "AAHKV2X7AFYLW",\n        ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],\n        ue_sn = "opfcaptcha.amazon.cn",\n        ue_id = \'T33DQKKSXC8PZYZ4E5NQ\';\n}\n</script>\n</head>\n<body>\n\n<!--\n        To discuss automated access to Amazon data please contact [email protected].\n        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.\n-->\n\n<!--\nCorreios.DoNotSend\n-->\n\n<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">\n\n    <div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">\n\n        <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>\n\n        <div class="a-box a-alert a-alert-info a-spacing-base">\n            <div class="a-box-inner">\n                <i class="a-icon a-icon-alert"></i>\n                <h4>请输入您在下方看到的字符</h4>\n                <p class="a-last">抱歉,我们只是想确认一下当前访问者并非自动程序。为了达到最佳效果,请确保您浏览器上的 Cookie 已启用。</p>\n                </div>\n            </div>\n\n            <div class="a-section">\n\n                <div class="a-box a-color-offset-background">\n                    <div class="a-box-inner a-padding-extra-large">\n\n                        <form method="get" action="/errors/validateCaptcha" name="">\n                            <input type=hidden name="amzn" value="FPQ3hpXcLlqNYaQR2w10gA==" /><input type=hidden name="amzn-r" value="&#047;gp&#047;product&#047;B01M8L5Z3Y" />\n                            <div class="a-row a-spacing-large">\n                                <div class="a-box">\n                                    <div class="a-box-inner">\n                                        <h4>请输入您在这个图片中看到的字符:</h4>\n                                        <div class="a-row a-text-center">\n                                            <img src="https://images-na.ssl-images-amazon.com/captcha/cucusdhr/Captcha_bxyxvfyusz.jpg">\n                                        </div>\n                                        <div class="a-row a-spacing-base">\n                                            <div class="a-row">\n                                                <div class="a-column a-span6">\n                                                    <label for="captchacharacters">输入字符</label>\n                                                </div>\n                                                <div class="a-column a-span6 a-span-last a-text-right">\n                                                    <a οnclick="window.location.reload()">换一张图</a>\n                                                </div>\n                                            </div>\n                                            <input autocomplete="off" spellcheck="false" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text">\n                                        </div>\n                                    </div>\n                                </div>\n                            </div>\n\n                            <div class="a-section a-spacing-extra-large">\n\n                                <div class="a-row">\n                                    <span class="a-button a-button-primary a-span12">\n                                        <span class="a-button-inner">\n                                            <button type="submit" class="a-button-text">继续购物</button>\n                                        </span>\n                                    </span>\n                                </div>\n\n                            </div>\n                        </form>\n\n                    </div>\n                </div>\n\n            </div>\n\n        </div>\n\n        <div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>\n\n        <div class="a-text-center a-spacing-small a-size-mini">\n            <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_claim?ie=UTF8&nodeId=200347160">使用条件</a>\n            <span class="a-letter-space"></span>\n            <span class="a-letter-space"></span>\n            <span class="a-letter-space"></span>\n            <span class="a-letter-space"></span>\n            <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=200347130">隐私声明</a>\n        </div>\n\n        <div class="a-text-center a-size-mini a-color-secondary">\n          &copy; 1996-2015, Amazon.com, Inc. or its affiliates\n          <script>\n           if (true === true) {\n             document.write(\'<img src="https://fls-cn.amaz\'+\'on.cn/\'+\'1/oc-csi/1/OP/requestId=T33DQKKSXC8PZYZ4E5NQ&js=1" />\');\n           };\n          </script>\n          <noscript>\n            <img src="https://fls-cn.amazon.cn/1/oc-csi/1/OP/requestId=T33DQKKSXC8PZYZ4E5NQ&js=0" />\n          </noscript>\n        </div>\n    </div>\n    <script>\n    if (true === true) {\n        var elem = document.createElement("script");\n        elem.src = "https://images-cn.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";\n        document.getElementsByTagName(\'head\')[0].appendChild(elem);\n    }\n    </script>\n</body></html>\n'
>>>

 

 

#说明不是网络错误,但是不可以访问!对网络爬虫的限制。一个是http的头,另外一个就是协议!
查看一下请求的头部!
>>> r.request.headers
{'User-Agent': 'python-requests/2.20.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>>#User-Agent': 'python-requests/2.20.1   诚实的告知对方我是“爬虫!”

更改头部信息!
>>> k = {'user-agent':'Mozilla/5.0'}

#构造键字对!  用来修改头部!  Mozilla/5.0是浏览器的头!就是伪装成一个浏览器去进行访问!
然后!
>>> url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
>>> r = requests.get(url,headers = k)
>>>
>>> r = requests.get(url,headers = k)
>>> r.request.headers
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>>#已经变成: Mozilla/5.0

>>> r.status_code
200    #成功!
>>>
>>> r.status_code
200
>>> r.text[:10000]
'<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->\n<!--[if IE 7]>    <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->\n<!--[if IE 8]>    <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n<title dir="ltr">Amazon CAPTCHA</title>\n<meta name="viewport" content="width=device-width">\n<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">\n<script>\n\nif (true === true) {\n    var ue_t0 = (+ new Date()),\n        ue_csm = window,\n        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },\n        ue_furl = "fls-cn.amazon.cn",\n        ue_mid = "AAHKV2X7AFYLW",\n        ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],\n        ue_sn = "opfcaptcha.amazon.cn",\n        ue_id = \'Q7EMQ0MRT0KYC536Q05R\';\n}\n</script>\n</head>\n<body>\n\n<!--\n        To discuss automated access to Amazon data please contact [email protected].\n        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.\n-->\n\n<!--\nCorreios.DoNotSend\n-->\n\n<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">\n\n    <div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">\n\n        <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>\n\n        <div class="a-box a-alert a-alert-info a-spacing-base">\n            <div class="a-box-inner">\n                <i class="a-icon a-icon-alert"></i>\n                <h4>请è¾\x93å\x85¥æ\x82¨å\x9c¨ä¸\x8bæ\x96¹ç\x9c\x8bå\x88°ç\x9a\x84å\xad\x97符</h4>\n                <p class="a-last">æ\x8a±æ\xad\x89ï¼\x8cæ\x88\x91们å\x8fªæ\x98¯æ\x83³ç¡®è®¤ä¸\x80ä¸\x8bå½\x93å\x89\x8d访é\x97®è\x80\x85并é\x9d\x9eè\x87ªå\x8a¨ç¨\x8båº\x8fã\x80\x82为äº\x86è¾¾å\x88°æ\x9c\x80ä½³æ\x95\x88æ\x9e\x9cï¼\x8c请确ä¿\x9dæ\x82¨æµ\x8fè§\x88å\x99¨ä¸\x8aç\x9a\x84 Cookie å·²å\x90¯ç\x94¨ã\x80\x82</p>\n                </div>\n            </div>\n\n            <div class="a-section">\n\n                <div class="a-box a-color-offset-background">\n                    <div class="a-box-inner a-padding-extra-large">\n\n                        <form method="get" action="/errors/validateCaptcha" name="">\n                            <input type=hidden name="amzn" value="byevvYbW69v2h8EgJ6MuPw==" /><input type=hidden name="amzn-r" value="&#047;gp&#047;product&#047;B01M8L5Z3Y" />\n                            <div class="a-row a-spacing-large">\n                                <div class="a-box">\n                                    <div class="a-box-inner">\n                                        <h4>请è¾\x93å\x85¥æ\x82¨å\x9c¨è¿\x99个å\x9b¾ç\x89\x87ä¸\xadç\x9c\x8bå\x88°ç\x9a\x84å\xad\x97符ï¼\x9a</h4>\n                                        <div class="a-row a-text-center">\n                                            <img src="https://images-na.ssl-images-amazon.com/captcha/qamfifum/Captcha_biitgjptru.jpg">\n                                        </div>\n                                        <div class="a-row a-spacing-base">\n                                            <div class="a-row">\n                                                <div class="a-column a-span6">\n                                                    <label for="captchacharacters">è¾\x93å\x85¥å\xad\x97符</label>\n                                                </div>\n                                                <div class="a-column a-span6 a-span-last a-text-right">\n                                                    <a οnclick="window.location.reload()">æ\x8d¢ä¸\x80å¼\xa0å\x9b¾</a>\n                                                </div>\n                                            </div>\n                                            <input autocomplete="off" spellcheck="false" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text">\n                                        </div>\n                                    </div>\n                                </div>\n                            </div>\n\n                            <div class="a-section a-spacing-extra-large">\n\n                                <div class="a-row">\n                                    <span class="a-button a-button-primary a-span12">\n                                        <span class="a-button-inner">\n                                            <button type="submit" class="a-button-text">继ç»\xadè´\xadç\x89©</button>\n                                        </span>\n                                    </span>\n                                </div>\n\n                            </div>\n                        </form>\n\n                    </div>\n                </div>\n\n            </div>\n\n        </div>\n\n        <div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>\n\n        <div class="a-text-center a-spacing-small a-size-mini">\n            <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_claim?ie=UTF8&nodeId=200347160">使ç\x94¨æ\x9d¡ä»¶</a>\n            <span class="a-letter-space"></span>\n            <span class="a-letter-space"></span>\n            <span class="a-letter-space"></span>\n            <span class="a-letter-space"></span>\n            <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=200347130">é\x9a\x90ç §\x81声æ\x98\x8e</a>\n        </div>\n\n        <div class="a-text-center a-size-mini a-color-secondary">\n          &copy; 1996-2015, Amazon.com, Inc. or its affiliates\n          <script>\n           if (true === true) {\n             document.write(\'<img src="https://fls-cn.amaz\'+\'on.cn/\'+\'1/oc-csi/1/OP/requestId=Q7EMQ0MRT0KYC536Q05R&js=1" />\');\n           };\n          </script>\n          <noscript>\n            <img src="https://fls-cn.amazon.cn/1/oc-csi/1/OP/requestId=Q7EMQ0MRT0KYC536Q05R&js=0" />\n          </noscript>\n        </div>\n    </div>\n    <script>\n    if (true === true) {\n        var elem = document.createElement("script");\n        elem.src = "https://images-cn.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";\n        document.getElementsByTagName(\'head\')[0].appendChild(elem);\n    }\n    </script>\n</body></html>\n'
>>>
#已经是正常的文本啦!

 

 

就是通过修改HTTP报文头部,来成功获取网页内容!

标签:HTTP,cn,python,t0,爬取,amazon,ue,requests,true
From: https://blog.51cto.com/dream666uping/5855072

相关文章

  • BIG T 下学期选修_python作业
    一共提交了10次作业3.1:importrandomb_Set=[]c_Set=[]i=1000while(i):b_Set.append(random.randint(0,10000))i=i-1i=1000while(i):c_Set.append(random.ran......
  • hc - python
    Grafanaalerts健康检查我们可以在Grafana的panel中设置alert当报警触发,我们有另外的程序会捕捉到它,并通过创建jira工单的方式,通报给相应的Team去处理为了能成功......
  • python课本学习-第五章
    一、列表的概念1、列表的创建列表是由一组任意类型的值组合而成的序列,组成列表的值称为元素,每个元素之间用逗号隔开。列表中的元素是可变的#列表类似于c++中的数组,数......
  • Python - request 报错:raise RemoteDisconnected("Remote end closed connection with
    2022-11-1521:46:20,261INFO[get_data.py(get_product_mode:46)]-当前page======>:255Traceback(mostrecentcalllast):File"D:\jlc_auto_test\fa_search_tes......
  • Python locust工具使用详解
    今年负责部门的人员培养工作,最近在部门内部分享和讲解了locust这个工具,今天再博客园记录下培训细节。相信你看完博客,一定可以上手locust这个性能测试框架了。一、简介1......
  • python神经网络编程
    计算机系统:输入->(计算)->输出建立模型可以模拟事情的运作神经网络的基本思想:持续细化误差值。大的误差需要大的修正值,小的误差需要小的修正值。尝试得到一个答案,并多......
  • Python 文本文件拖上转自适应图片 - 学习笔记(2022.11.16)
    Python文本文件拖上转自适应图片功能:1、支持拖拽执行2、文本文件转为自适应尺寸图片1importre2importos3importsys4importtime5fromPI......
  • 如何正确遵守 Python 代码规范
    前言无规矩不成方圆,代码亦是如此,本篇文章将会介绍一些自己做项目时遵守的较为常用的Python代码规范。命名大小写模块名写法:module_name包名写法:package_nam......
  • python 中 循环结构同时传入多个参数
     001、>>>list1=[("aa",100,400),("bb","kk","yy"),(33,400,500)]>>>fori,j,kinlist1:##利用列表、元组同时传入多个参数...print(i,......
  • python 中统计每一个字符串中每一个字符出现的次数
     001、直接使用字典进行统计>>>str1="aaaabbcdddefff"##测试字符串>>>dict1=dict()>>>foriinstr1:##利用条件分支进行判断.........