Experiment 1
Assignment ①
(1) Experiment 1-1
from bs4 import BeautifulSoup
import urllib.request

url = "http://www.shanghairanking.cn/rankings/bcur/2020"
res = urllib.request.urlopen(url)
data = res.read().decode()  # fetch the page source
soup = BeautifulSoup(data, "lxml")
i = 1
for tr in soup.find('tbody').children:  # print the first 17 universities
    if i > 17:
        break
    a = tr('a')
    tds = tr('td')
    print("Rank: {:<5} University: {:<10} Province/City: {:<10} Type: {:<5} Total score: {:<8}".format(
        tds[0].text.strip(), a[0].string.strip(), tds[2].text.strip(),
        tds[3].text.strip(), tds[4].text.strip()))
    i = i + 1
(2) Reflections
BeautifulSoup together with urllib.request makes it straightforward to extract data from a web page, so we can pull out exactly the data we need.
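One pitfall worth noting: tbody.children can yield whitespace text nodes as well as <tr> tags, and calling tr('td') on a text node fails. Below is a minimal sketch of a guard, reusing the URL from the code above; the isinstance filter is my own addition, not part of the assignment code.

from bs4 import BeautifulSoup
from bs4.element import Tag
import urllib.request

url = "http://www.shanghairanking.cn/rankings/bcur/2020"
soup = BeautifulSoup(urllib.request.urlopen(url).read().decode(), "lxml")

# keep only real <tr> tags, then slice off the first 17 rows
rows = [tr for tr in soup.find('tbody').children if isinstance(tr, Tag)]
for tr in rows[:17]:
    tds = tr('td')
    print(tds[0].text.strip(), tds[2].text.strip())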
Assignment ②
(1) Experiment 1-2: extracting backpack data from Dangdang with BeautifulSoup
from bs4 import BeautifulSoup
import urllib.parse
import urllib.request

url = 'http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input&'
n = int(input("Enter the first page to crawl: "))
m = int(input("Enter the last page to crawl: "))
# page range to query
for i in range(n, m + 1):
    print("-------------- Crawling backpack data on page {} --------------".format(i))
    page = urllib.parse.urlencode({"page_index": i})  # pagination parameter
    page_url = url + page  # build a fresh URL each pass rather than appending to the base repeatedly
    res = urllib.request.urlopen(page_url)
    data = res.read().decode('gbk')  # gbk is a superset of gb2312, so it decodes more pages safely
    soup = BeautifulSoup(data, 'lxml')
    goods = soup.select('p[class="name"] > a[title]')  # product names
    price = soup.select('span[class="price_n"]')       # product prices
    for j in range(min(len(goods), len(price))):       # a page may list fewer than 60 items
        print(goods[j].text + ":" + "\t" + price[j].text)
(2) Experiment 1-2: extracting backpack data from Dangdang with urllib.request and re
import re
import urllib.parse
import urllib.request

url = 'http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input&'
start = int(input("Enter the first page to crawl: "))
end = int(input("Enter the last page to crawl: "))
cnt = 1  # running count of products printed
for i in range(start, end + 1):
    page = urllib.parse.urlencode({'page_index': i})
    page_url = url + page  # build a fresh URL each pass rather than appending to the base repeatedly
    res = urllib.request.urlopen(page_url)
    data = res.read().decode("gbk")  # gbk is a superset of gb2312
    price = re.findall('<span class="price_n">¥(.*?)</', data)  # product prices, as a list
    goods = re.findall('<a title=" (.*?)" href="//product.dangdang.com/', data)  # product names, as a list
    for j in range(len(price)):
        print(cnt, price[j], goods[2 * j + 1])  # the title pattern matches twice per product, hence the 2*j+1 index
        cnt += 1
(3) Reflections
In terms of searching, BeautifulSoup has to navigate down to the right tags, while the re library can jump to the target data fairly quickly; each approach has its own strengths.
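To make the comparison concrete, here is the same price extraction done both ways on a small hand-written HTML fragment; the fragment is invented for illustration, with class names that mirror the Dangdang markup above.

import re
from bs4 import BeautifulSoup

html = ('<p class="name"><a title="Backpack A" href="#">Backpack A</a></p>'
        '<span class="price_n">¥59.00</span>')

# BeautifulSoup: walk the tag/attribute structure
soup = BeautifulSoup(html, "lxml")
bs_prices = [s.text for s in soup.select('span[class="price_n"]')]

# re: match the surrounding text pattern directly
re_prices = re.findall('<span class="price_n">(.*?)</span>', html)

print(bs_prices, re_prices)  # both print ['¥59.00']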
Assignment ③
(1) Experiment 1-3
from bs4 import BeautifulSoup
import urllib.request
import requests

url = "https://news.fzu.edu.cn/info/1011/31611.htm"  # article URL
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.36'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
data = response.read().decode()  # read the page source
soup = BeautifulSoup(data, 'lxml')
pic_jpg = soup.find_all('p', class_='vsbcontent_img')  # paragraphs that hold the images
pic_list = []
for i in range(len(pic_jpg)):  # collect the image URLs into a list
    if i % 2 != 0:  # the even-indexed matches carry no <img>, so keep only the odd ones
        pic = pic_jpg[i]
        img = pic.find('img')['src']
        img = "https://news.fzu.edu.cn" + img
        pic_list.append(img)
        print(img)
for i in range(len(pic_list)):
    picture = requests.get(pic_list[i])
    with open(rf"D:\python\exercise\数据采集\实验1-3\images{i + 1}.jpg", "wb") as f:  # raw string keeps the backslashes literal
        f.write(picture.content)  # save the image to disk
    print(f"Image {i + 1} downloaded")
(2) Reflections
Downloading the images is rather slow. The page itself only contains two images, yet the scraped results included two None entries, which I had to skip with an if statement.
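The two None entries can also be filtered out up front by checking each paragraph for an <img> child rather than relying on index parity. A minimal sketch, assuming the same soup object as in the code above:

pic_list = []
for p in soup.find_all('p', class_='vsbcontent_img'):
    img = p.find('img')  # may be None when the <p> holds no image
    if img is None or not img.get('src'):
        continue  # skip entries without a usable src
    pic_list.append("https://news.fzu.edu.cn" + img['src'])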