Tags: getting started, get, res, crawler, data, print, requests, com

Getting Started with Web Crawlers

Overview of crawlers

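# Crawler: also called a spider or web spider

# Underlying principle:
	-Most software today sends and receives its data over HTTP requests
    	-web pages on the PC
        -mobile apps
    -A crawler simulates those HTTP requests to fetch data from someone else's server
    -Getting past anti-scraping measures: every site defends itself differently, so this part can get complicated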
    
    
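# How a crawler works
	-Send an HTTP request [requests, selenium] ---> third-party server ---> parse the wanted data out of the server's response [selenium, bs4] ---> store it (file, Excel, MySQL, Redis, MongoDB, ...)
    -scrapy: a dedicated crawler framework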
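The request, parse, store pipeline described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the HTML string stands in for a downloaded response body, and `LinkParser`, `store_links` and the `links` table are hypothetical names chosen for this example, not part of requests, bs4 or scrapy.

```python
import sqlite3
from html.parser import HTMLParser

# Stand-in for the body of an HTTP response (assumed sample data).
SAMPLE_HTML = """
<html><body>
  <a href="/news/1">First story</a>
  <a href="/news/2">Second story</a>
</body></html>
"""


class LinkParser(HTMLParser):
    """Collects (href, text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only keep text that appears inside an <a> tag with an href.
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def store_links(conn, links):
    """Persist the parsed rows in a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, title TEXT)")
    conn.executemany("INSERT INTO links VALUES (?, ?)", links)
    conn.commit()


parser = LinkParser()
parser.feed(SAMPLE_HTML)            # parse step
conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
store_links(conn, parser.links)     # store step
rows = conn.execute("SELECT href, title FROM links").fetchall()
print(rows)
```

In a real crawler the sample string would be replaced by the text of a response, and the parsing step would more likely use bs4 than a hand-written parser.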
    
    
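# Are crawlers legal?
	-Robots protocol: a site can publish a robots.txt file at its root path stating which paths crawlers may and may not fetch. Not every site has one, and the file is a convention rather than an access control, but a well-behaved crawler follows it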
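As a rough sketch of how a crawler can honor robots.txt, the standard-library `urllib.robotparser` module can evaluate the rules before a URL is fetched. The rules, the `my-crawler` user agent and the example.com URLs below are assumptions made up for this example; a real crawler would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules; a real site serves these at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse rules from text instead of fetching them

allowed_article = rp.can_fetch("my-crawler", "https://example.com/articles/1")
allowed_admin = rp.can_fetch("my-crawler", "https://example.com/admin/users")
print(allowed_article, allowed_admin)
```

Checking `can_fetch` before each request keeps the crawler inside the paths the site has declared as open.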
    
    
    
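# Baidu: one giant crawler
	-When you type a query into the Baidu search box and press Enter, the results come from Baidu's own database
    -Baidu crawls pages and links across the internet non-stop and stores what it finds in that database
    -Clicking a result takes you to the real address of the page
    -Core problem: search, i.e. finding the wanted data within a massive data set
    -SEO: improving a site's ranking in the unpaid (organic) results
    -SEM: paying for keywords to rank higher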

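Sending GET requests with the requests module

requests is a module for sending HTTP requests from code. It is not only for crawlers; we also use it to call third-party APIs.

Install the requests module
	pip install requests
	Under the hood it is a wrapper around urllib3, a third-party library that pip installs alongside it (urllib3 is not part of the standard library)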

import requests
res = requests.get('http://139.196.6.104:8080/api/v1/home/banner/')
print(res.text)  # the HTTP response body as text

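Passing parameters with a GET request

1. Put the parameters directly in the URL
	Append the search terms after the ? as an already percent-encoded query string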
    res = requests.get('https://image.baidu.com/search/index?ct=201326592&tn=baiduimage&word=%E7%BE%8E%E5%A5%B3%E5%9B%BE%E7%89%87')
    print(res.text)

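2. Pass them through the params argument; requests percent-encodes the values and appends them to the URL
res = requests.get('https://www.baidu.com/s',params={
    'wd':'黑丝',
    'name':'liuruiqiaiheisi',
})
print(res.text)
	The URL actually requested: https://www.baidu.com/s?wd=黑丝&name=liuruiqiaiheisi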

3. URL encoding and decoding
    https://www.baidu.com/s?wd=%E9%BB%91%E4%B8%9D&name=liuruiqiaiheisi
    from urllib import parse

    res = parse.quote('飞浆')  # percent-encode non-ASCII text
    print(res)
    res = parse.unquote('%E9%BB%91%E4%B8%9D')  # decode back to the original text
    print(res)
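To make the encoding step concrete, here is a small sketch using only `urllib.parse`. It approximates what happens to a params dict before it is appended to the URL; it is not requests' own code, and the parameter values are sample data chosen for this example.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit

params = {"wd": "爬虫", "name": "demo"}  # assumed sample parameters

# Encode the whole dict into a query string.
query = urlencode(params)
url = "https://www.baidu.com/s?" + query
print(url)

# Decoding reverses the process: split off the query and parse it back.
decoded = parse_qs(urlsplit(url).query)
print(decoded)

# quote/unquote do the same for a single value.
round_trip = unquote(quote("爬虫"))
print(round_trip)
```

`parse_qs` returns a list per key because a query string may repeat the same key several times.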

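Sending request headers

Every HTTP request carries headers, and some sites inspect particular headers as an anti-scraping measure.


1. Sending data in the headers: if a site does not respond normally, the request probably does not look enough like a browser
A common cause is a missing client type in the request headers
User-Agent: the client type, e.g. desktop browser, mobile browser, crawler, script, scrapy, ... It is usually forged to look like a browser
Referer: the address the request came from, e.g. Referer: https://www.lagou.com/gongsi/
    To log in by calling the login API directly, note that a real user can only do this from the login page; if the Referer is missing, the site may treat the request as malicious and reject it
    Also used for image hotlink protection
Cookie: the cookie issued after authentication; sending it is equivalent to being logged in
header={
    # client type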
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}
res=requests.get('https://dig.chouti.com/',headers=header)
print(res.text)

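Sending cookies

1. Carrying cookies in a request
Method 1: put the cookie directly in the request headers
Simulating an upvote
data = {
    'linkId': '37000580'
}
header = {
    # some sites require the client type
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
    # the cookie copied from a logged-in browser session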
    'Cookie': 'deviceId=web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiI3NWI4ZjRmZS01M2YzLTRmNTItYWZmYy0xMzJmNTI0NDc4ZGQiLCJleHBpcmUiOiIxNjY4NTE2Mjc0Nzk1In0.qVqo-yds7ztYO6GZMPj-9FhXs8mxpMoulet8pkBBS44; __snaker__id=qbS7wNid03A6wZYz; _9755xjdesxxd_=32; YD00000980905869%3AWM_TID=srMkZxB0puhEVVQQURPFTvwQsk4KdXrD; Hm_lvt_03b2668f8e8699e91d479d62bc7630f1=1669172761; YD00000980905869%3AWM_NI=Wr8grMbEIKMPaXeGz06juobxTgQcKdwyT5l6%2Fha83tA4ZYLtKm27VQBFSqxOVbXYvm6jI5toMnKobvQEwnb72R9kJulPdZWOPRLfZx5sRpMribfOPvzQybrBCZLgcnyiRm8%3D; YD00000980905869%3AWM_NIKE=9ca17ae2e6ffcda170e2e6eed9c173b8b28d83ca6da6eb8ba7d45e978e8f83c5478bb9adb0b38091b2ad84f62af0fea7c3b92a95aef888f147a29b87acd05ca388aba8f93cf6b6bb90e17498b99f84ca70b5a79cd6ef3eedb8afd9b28083ec96bacf8089b3fed8d55b829a9d92fb4596be98baf567f6b687d6f7658fb28590e17e939bb98fe13df28ef78cc97fb090acbbd645948e8487d974f7ad9e91d07aabbeaab0d37eba8fb783f14a81868488e266bbb79e8ebb37e2a3; gdxidpyhxdE=8SjOSxwPqpl1iX5v22%5CDkciK8k%2B%2FHcINc%2Fmmqp%2Ft6s0uEGllDXiEpolQZAlpusA9faYGgZzkYqVKYrI%5C5L3lERe12QKBpkA0u7Q7v9%5CA%2B4houpHgn1NZRLZ%5C2r95sox8vvclyeM486hxGWAWcpqSokP9KMvEJrLzq8%5CJ%5C7CoNNMaQTBz%3A1669202978861; token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjZHVfNTMyMDcwNzg0NjAiLCJleHBpcmUiOiIxNjcxNzk0MTI2ODI5In0.BPQ8kqfWZuIs1_cdwnLxbc_8LBsBOxMqwKqqH12-jTo; Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1=1669203149',
}
# Voting submits form data, so it is sent as a POST
res = requests.post('https://dig.chouti.com/link/vote',data=data,headers=header)
print(res.text)


Method 2: pass the cookie through the cookies argument
data = {
    'linkId': '37000580'
}
header = {
    # some sites require the client type
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
# The cookie is passed separately, as a dict or CookieJar, instead of inside the headers
res = requests.post('https://dig.chouti.com/link/vote', data=data, headers=header, cookies={'key': 'value'})
print(res.text)
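The two methods differ only in the shape of the cookie: a raw header string versus a dict. When a cookie string has been copied from the browser, a small helper can convert it into the dict form expected by the cookies argument. `cookie_header_to_dict` is a hypothetical helper written for this example, and the cookie values are made up.

```python
def cookie_header_to_dict(raw):
    """Split 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for pair in raw.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only, since values may contain '=' themselves.
        key, value = pair.strip().split("=", 1)
        cookies[key] = value
    return cookies


raw_cookie = "deviceId=abc123; token=xyz=="  # made-up values
cookies = cookie_header_to_dict(raw_cookie)
print(cookies)
```

The resulting dict can then be passed as `cookies=cookies` instead of putting the raw string under the `Cookie` header.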

Sending POST requests

1. Sending a POST request
data = {
    'username': '2385086332.com',
    'password': 'jsoeph520',
    'captcha': '4561',
    'remember': '1',
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
}
res = requests.post('http://www.aa7a.cn/user.php', data=data)
print(res.text)
print(res.cookies)  # cookies set by the response; after a successful login these are the logged-in cookies. RequestsCookieJar can be used like a dict

Visit the home page, carrying the cookie
res2 = requests.get('http://www.aa7a.cn/', cookies=res.cookies)
# Without cookies=res.cookies the site treats the request as not logged in:
# res2 = requests.get('http://www.aa7a.cn/')
print('[email protected]' in res2.text)  # True only if the page shows the logged-in account


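2. Sending data with a POST request: data={} or json={} (on a DRF backend, print request.data to see what arrives)
data=dict is sent with the default encoding: urlencoded (form data)
json=dict is sent JSON-encoded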
res = requests.post('http://www.aa7a.cn/user.php', json={})
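The difference between data= and json= comes down to how the same dict is serialized into the request body. The sketch below reproduces the two body formats with the standard library; it approximates what is sent rather than calling requests itself, and the payload is sample data. With data= the Content-Type is application/x-www-form-urlencoded, with json= it is application/json.

```python
import json
from urllib.parse import urlencode

payload = {"username": "alice", "remember": "1"}  # assumed sample data

form_body = urlencode(payload)   # body format used for data=payload
json_body = json.dumps(payload)  # body format used for json=payload

print(form_body)
print(json_body)
```

A backend that only parses form data will not see fields sent as JSON, and the other way round, so the choice has to match what the server expects.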


3. Using requests.session: use it just like requests, but it maintains cookies automatically across calls
session=requests.session()
data = {
    'username': '2385086332.com',
    'password': 'jsoeph520',
    'captcha': '4561',
    'remember': '1',
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
}
res = session.post('http://www.aa7a.cn/user.php', data=data)
res2 = session.get('http://www.aa7a.cn/')
print('[email protected]' in res2.text)

The Response object

import requests

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
}
response = requests.get('https://www.jianshu.com', params={'name': 'lqz', 'age': 19},headers=header)
# Response attributes
print(response.text)  # response body as text
print(response.content)  # response body as raw bytes
print(response.status_code)  # HTTP status code
print(response.headers)  # response headers
print(response.cookies)  # cookies set by the response (a CookieJar object)
print(response.cookies.get_dict())  # the cookies as a plain dict
print(response.cookies.items())  # all cookie (key, value) pairs
print(response.url)  # final URL of the request
print(response.history)  # if the request was redirected, the intermediate responses
print(response.encoding)  # encoding used to decode response.text
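The relationship between content, encoding and text is easiest to see with plain bytes: text is the content decoded with the response's encoding, so a wrong encoding produces garbled characters. The sketch below imitates this without a network call; the GBK bytes are an assumed stand-in for the content of a GBK-encoded page.

```python
# Stand-in for the raw bytes of a response from a GBK-encoded page (assumed data).
raw = "爬虫入门".encode("gbk")

# Decoding with the wrong encoding yields replacement characters.
wrong = raw.decode("utf-8", errors="replace")
# Decoding with the page's real encoding recovers the text.
right = raw.decode("gbk")

print(wrong)
print(right)
```

With requests, the equivalent fix for garbled text is to set the response's `encoding` attribute (for example to `'gbk'`, or to the value of `apparent_encoding`) before reading `text`.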

Fetching binary data

1. Fetching binary data: an image
    res = requests.get('https://lmg.jj20.com/up/allimg/1111/06301Q05335/1P630105335-6-1200.jpg')
    with open('美女.jpg','wb') as f:
        f.write(res.content)

    
2. Fetching a video: the file may be large, so stream the response and write it to disk chunk by chunk
    res = requests.get('https://vd3.bdstatic.com/mda-mjsefrtuhsny5p00/sc/cae_h264/1635330522837563915/mda-mjsefrtuhsny5p00.mp4', stream=True)
    with open('美女.mp4','wb') as f:
        for chunk in res.iter_content(chunk_size=1024):
            f.write(chunk)
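The idea behind chunked writing is that only one small piece of the file is in memory at a time. The sketch below shows the same loop on in-memory streams so it runs without a network; `copy_in_chunks` is a hypothetical helper, and the `BytesIO` objects stand in for the response body and the output file.

```python
import io


def copy_in_chunks(src, dst, chunk_size=1024):
    """Copy src to dst one chunk at a time; return the number of bytes written."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes means the source is exhausted
            break
        dst.write(chunk)
        total += len(chunk)
    return total


source = io.BytesIO(b"x" * 5000)  # stand-in for a large response body
target = io.BytesIO()             # stand-in for the output file
written = copy_in_chunks(source, target)
print(written)
```

A larger chunk size means fewer write calls at the cost of more memory per chunk.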

Parsing JSON

1. With a separated front end and back end, the backend usually returns its data as JSON
	Parsing the JSON response
    res = requests.get(
        'https://api.map.baidu.com/place/v2/search?ak=6E823f587c95f0148c19993539b99295&region=%E4%B8%8A%E6%B5%B7&query=%E8%82%AF%E5%BE%B7%E5%9F%BA&output=json')
    print(res.text)
    print(type(res.text))
    print(res.json()['results'][0]['name'])
    print(type(res.json()))
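The difference between the text and the parsed result can be tried offline with the `json` module. The body below is made up and only shaped like a place-search response; the real API's fields may differ.

```python
import json

# Made-up response body; field names are assumptions for this example.
body = '{"status": 0, "results": [{"name": "KFC (Example Road)", "address": "1 Example Road"}]}'

data = json.loads(body)  # str -> dict, the same conversion res.json() performs
# Use .get so a missing key yields an empty list instead of raising KeyError.
names = [item.get("name") for item in data.get("results", [])]

print(type(body), type(data))
print(names)
```

Indexing with `['results'][0]` as in the snippet above raises an error when the result list is empty, so looping over the list is the safer pattern.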

From: https://www.cnblogs.com/joseph-bright/p/16919822.html
