A Site for Practicing Web Scraping
Crawler - Basics 1
The goal of a crawler is simple: get the data you want. This level gives you a website with some numbers on it; enter the sum of those numbers into the answer box to pass the level.
The task itself is fairly simple: scrape the data from a single page, then sum it.
The code is as follows:
# Enter the sum of these numbers into the answer box to pass this level.
import requests
from lxml import etree
import re

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
}
# Variable that will hold the running total
result = 0
# Create a Session so cookies persist across requests
s = requests.Session()
url = 'http://www.glidedsky.com/login'
# Send a GET request to fetch the login page
res = s.get(url, headers=header).text
# Parse the login page with a regex to extract the CSRF token
token = re.findall('<input type="hidden" name="_token" value="(.*?)">', res)[0]
# Pack the token and the account credentials into a dict
data = {
    '_token': token,
    "email": "1xxxx3.com",
    "password": "xxxx",
}
# POST the form data to log in, and grab the resulting page source
req = s.post(url=url, data=data, headers=header).text
# Use XPath to find the link to the challenge
html = etree.HTML(req)
text_url = html.xpath('//td[@class="col-8"]/a/@href')[0]
# GET the challenge page
req_url = s.get(url=text_url, headers=header).text
# Parse out the link to the challenge detail page and build the full URL
math_url = 'http://www.glidedsky.com' + re.findall('<a href="(.*?)" target="_blank"', req_url)[0]
# GET the challenge detail page
text = s.get(url=math_url, headers=header).text
# Parse the page and extract the numbers
tree = etree.HTML(text)
# Locate the target divs; text() grabs all text nodes under the node
for text_data in tree.xpath('//div[@class="row"]/div/text()'):
    # Strip whitespace from both ends of the string
    num = int(text_data.strip())
    result += num
print(result)
Result:
251077
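One caveat: depending on the markup, the XPath //div[@class="row"]/div/text() can also match whitespace-only text nodes, and int() would crash on those. Below is a slightly more defensive version of the summing step, a sketch rather than part of the original solution (sum_page is just an illustrative helper name):

from lxml import etree

def sum_page(html_text):
    # Sum every numeric text node in the challenge divs, skipping blanks
    tree = etree.HTML(html_text)
    total = 0
    for text_data in tree.xpath('//div[@class="row"]/div/text()'):
        cleaned = text_data.strip()
        if cleaned.isdigit():  # skip whitespace-only or non-numeric nodes
            total += int(cleaned)
    return total

# result = sum_page(text)  # 'text' is the detail-page source fetched above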
Crawler - Basics 2
A crawler often cannot get all the data it needs from a single page; it has to visit a large number of pages to finish the job.
Here is a site with the same task, sum all the numbers, except this time they are split across 1000 pages.
# Sum the numbers spread across multiple pages and enter the total into the answer box to pass this level.
# http://www.glidedsky.com/level/web/crawler-basic-2?page=1
import requests
import re

# Build the URL of every page
URL = []
for i in range(1, 1001):
    URL.append(f'http://www.glidedsky.com/level/web/crawler-basic-2?page={i}')
# Request headers and cookie
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win32; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4240.183 Safari/537.36",
    'cookie': 'glidedsky_session=eyJpdiI6ImxaOVlSOHdKSko1SExCV1dzVjVubkE9PSIsInZhbHVlIjoiNlA0VDRFSU5rMHdWQk1DOUk3TzE0SkR4eGxMT3JYR1dPOFliV3V5M1d0UXVsS1huZklIU0UxSHVRdCtaTHMzSyIsIm1hYyI6IjllNzdjNzJhNWI4YTFmNzdjMmE1YmM3ZDk1ZWFhYjRjMjY2NDM0OTEyN2Q3OTI3ODAwYzEyNTBhNjNjODY0OGMifQ%3D%3D; expires=Wed, 24-Feb-2021 09:36:53 GMT; Max-Age=7200; path=/; httponly'
}
# Fetch the data page by page
num = []
n = 0
for url in URL:
    html = requests.get(url, headers=head).text
    n += 1
    print(n)  # progress indicator
    num.append(re.findall(r'<div class="col-md-1">[\s]*(.*)[\s]*</div>', html))
# Sum every number on every page
count = 0
for page_nums in num:
    for value in page_nums:
        count += int(value)
print(count)
Result:
3320304
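Requesting 1000 pages one after another is slow. The same work can be spread over a handful of threads with the standard library; the sketch below is an alternative, not the original solution, and fetch_page_sum / crawl_all are illustrative names (the headers/cookie dict and the worker count are assumptions to tune for the site's tolerance):

import re
from concurrent.futures import ThreadPoolExecutor
import requests

NUM_RE = re.compile(r'<div class="col-md-1">\s*(\d+)\s*</div>')

def fetch_page_sum(page, headers):
    # Fetch one page of the challenge and return the sum of its numbers
    url = f'http://www.glidedsky.com/level/web/crawler-basic-2?page={page}'
    html = requests.get(url, headers=headers, timeout=10).text
    return sum(int(n) for n in NUM_RE.findall(html))

def crawl_all(headers, workers=8):
    # 8 workers is an arbitrary, polite choice for a practice site
    with ThreadPoolExecutor(max_workers=workers) as pool:
        page_sums = pool.map(lambda p: fetch_page_sum(p, headers), range(1, 1001))
    return sum(page_sums)

# total = crawl_all(head)  # 'head' is the headers/cookie dict from the solution above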
Crawler - Font Anti-Scraping 1
This level hides the real numbers behind a custom web font: the digit characters in the HTML source are not the digits the font renders on screen, so each page's font file has to be downloaded and decoded to recover the true values.
# Sum the numbers spread across multiple pages and enter the total into the answer box to pass this level.
# http://www.glidedsky.com/level/web/crawler-font-puzzle-1
import requests
import re
import base64
from fontTools.ttLib import TTFont

# Build the URL of every page
URL = []
for i in range(1, 1001):
    URL.append(f'http://www.glidedsky.com/level/web/crawler-font-puzzle-1?page={i}')
# e.g. 'http://www.glidedsky.com/level/web/crawler-font-puzzle-1?page=3'
# Request headers and cookie
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win32; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4240.183 Safari/537.36",
    'cookie': 'glidedsky_session=eyJpdiI6ImxaOVlSOHdKSko1SExCV1dzVjVubkE9PSIsInZhbHVlIjoiNlA0VDRFSU5rMHdWQk1DOUk3TzE0SkR4eGxMT3JYR1dPOFliV3V5M1d0UXVsS1huZklIU0UxSHVRdCtaTHMzSyIsIm1hYyI6IjllNzdjNzJhNWI4YTFmNzdjMmE1YmM3ZDk1ZWFhYjRjMjY2NDM0OTEyN2Q3OTI3ODAwYzEyNTBhNjNjODY0OGMifQ%3D%3D; expires=Wed, 24-Feb-2021 09:36:53 GMT; Max-Age=7200; path=/; httponly'
}
jg = []   # decoded numbers from every page
sss = 0   # page counter for progress output
for url in URL:
    sss += 1
    print(sss)
    # Fetch the page
    html = requests.get(url, headers=head).text
    # Each page embeds its own font as a base64 string; decode it
    base = base64.b64decode(re.findall(r'base64,(.*)\) format', html)[0])
    # Write the decoded font bytes to a file
    with open('1.ttf', 'wb') as f:
        f.seek(0)
        f.truncate()
        f.write(base)
    # Build the digit-mapping table from the font
    font = TTFont('1.ttf')
    a = font.getGlyphOrder()  # e.g. ['.notdef', 'eight', 'four', 'zero', 'five', 'one', 'six', 'seven', 'nine', 'two', 'three']
    a.pop(0)  # drop '.notdef'
    dic = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
    d = {dic[q]: a.index(q) for q in a}  # e.g. {7: 0, 6: 1, 0: 2, 9: 3, 5: 4, 8: 5, 4: 6, 3: 7, 1: 8, 2: 9}
    # Remap every digit found in the HTML through the table
    bh = re.findall(r'<div class="col-md-1">[\s]*(.*)[\s]*</div>', html)  # e.g. ['426', '438', '016', '575', ...]
    for i in bh:
        decoded = ''
        for ch in i:
            decoded += str(d[int(ch)])
        jg.append(int(decoded))
count = 0
for num in jg:
    count += int(num)
print(count)
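The mapping above leans on getGlyphOrder(), which only works if the glyph IDs happen to line up with the code points of '0' through '9'. The font's cmap table states the code-point-to-glyph relationship explicitly, so reading it directly is a bit more robust. A sketch under that assumption (build_table and decode_digits are illustrative names, and whether this table or its inverse is needed depends on how the puzzle generates each font):

from fontTools.ttLib import TTFont

NAME_TO_DIGIT = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4,
                 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}

def build_table(font_path):
    # Map each digit character appearing in the HTML to the digit the font draws for it
    font = TTFont(font_path)
    cmap = font.getBestCmap()  # dict of code point -> glyph name
    table = {}
    for code, glyph_name in cmap.items():
        ch = chr(code)
        if ch.isdigit() and glyph_name in NAME_TO_DIGIT:
            table[ch] = str(NAME_TO_DIGIT[glyph_name])
    return table

def decode_digits(html_digits, table):
    # '426' in the page source may render as a completely different number on screen
    return int(''.join(table[ch] for ch in html_digits))

# table = build_table('1.ttf')
# real_value = decode_digits('426', table)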