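NBA Player Data Scraping in Practice
To complete this task, you need to understand:
Introduction to data acquisition
Introduction to data acquisition
URL: https://www.basketball-reference.com/players/a/
Request headers:
After opening the site, wait for the page to finish loading, press F12 or right-click and choose Inspect, find the request for the page a/, and use its request header information to set the request headers in your code.
(Figure: request headers)
The main page of the site looks like this:
(Figure: main page)
We need to collect the basic data and some detailed data for every player whose surname starts with A through Z.
The links to the main player pages for surnames A-Z follow this pattern: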
# Page A
https://www.basketball-reference.com/players/a/
# Page B
https://www.basketball-reference.com/players/b/
# Page C
https://www.basketball-reference.com/players/c/
...
As we can see, the only part that changes is the suffix a/, b/, c/, and so on; this pattern is all we need to know.
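The URL pattern above can be sketched in a few lines of Python. The names BASE_URL and build_index_urls are hypothetical and are not part of the task's code file; this is only one way to generate the 26 index pages.

```python
import string

# Common prefix of every index page, taken from the URLs listed above.
BASE_URL = "https://www.basketball-reference.com/players/"


def build_index_urls(base_url=BASE_URL):
    """Return one index-page URL per surname initial, a through z."""
    return [f"{base_url}{letter}/" for letter in string.ascii_lowercase]


# Example: the first three generated URLs are the A, B, and C pages.
first_three = build_index_urls()[:3]
```

Calling build_index_urls() gives a list that the entry function could loop over, requesting one index page per letter.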
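Basic data is the data taken from the main page, as shown below:
(Figure: basic data)
From the main page we need each player's name, position, and height. Some players have an empty value in the weight column, so parsing that column with XPath yields a list that no longer lines up with the other columns; we therefore take the weight from the player detail page instead.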
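The misalignment described above can be reproduced on a tiny hand-written table. The sketch below uses the standard-library xml.etree.ElementTree as a stand-in for lxml, and the table rows and data-stat attribute names are illustrative assumptions rather than the site's real markup. The row-wise variant is shown only for contrast; the task itself takes the weight from the detail page.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the index table (assumed structure, made-up rows).
SAMPLE = """
<table>
  <tr><th data-stat="player">Player A</th><td data-stat="weight">210</td></tr>
  <tr><th data-stat="player">Player B</th><td data-stat="weight"></td></tr>
  <tr><th data-stat="player">Player C</th><td data-stat="weight">250</td></tr>
</table>
"""
ROOT = ET.fromstring(SAMPLE.strip())


def column_wise(root):
    """Collect each column on its own, skipping empty cells.

    Skipping empty cells mimics a text-node query, which returns nothing
    for a cell that has no text.
    """
    names = [th.text for th in root.iter("th")]
    weights = [td.text for td in root.findall(".//td[@data-stat='weight']") if td.text]
    return names, weights


def row_wise(root):
    """Walk the table row by row so each value stays with its player."""
    rows = []
    for tr in root.iter("tr"):
        name = tr.find("th").text
        weight = tr.find("td[@data-stat='weight']").text or ""
        rows.append((name, weight))
    return rows


names, weights = column_wise(ROOT)   # 3 names but only 2 weights
misaligned = list(zip(names, weights))
aligned = row_wise(ROOT)
```

Here misaligned pairs Player B with 250, which is Player C's weight, while aligned keeps an empty string for Player B.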
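On the main page, press F12 or right-click and choose Inspect to view the link to a player's detail page:
(Figure: obtaining the player detail-page link)
The suffix of the detail-page link can be read directly from the page source.
On the player detail page we need to collect the player's detailed data:
(Figure: detailed data)
Programming Requirements
Following the content and requirements in the introduction to data acquisition, open the code file window on the right and fill in code in the Begin to End regions (specifically: the global variables, the entry function, and the main-page parsing function parse) to complete the program.
Starting from the given URL, crawl the data for all players with surnames A-Z and collect the fields listed below. Finally, save the crawled data as a CSV file at ./nba_data.csv with the encoding utf-8-sig.
Base URL: https://www.basketball-reference.com/players/a/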
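As a minimal sketch of the output requirement, the helper below writes rows to a CSV file with the utf-8-sig encoding, which prepends a byte order mark so that spreadsheet tools such as Excel can detect UTF-8 and display names like Ayón correctly. The function name save_rows is hypothetical; the task's own code file opens the file as a global variable instead.

```python
import csv


def save_rows(path, rows):
    """Write rows to a CSV file encoded as UTF-8 with a leading BOM."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as fp:
        csv.writer(fp).writerows(rows)
```

For example, save_rows("./nba_data.csv", [["ayongu01", "Gustavo Ayón", "C"]]) would write a single row to the required path.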
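All fields to collect are listed below:
| Field | Meaning | How to obtain |
| --- | --- | --- |
| id | Player id | Parsed from the detail-page link, e.g. for https://www.basketball-reference.com/players/a/abdelal01.html the id is abdelal01 |
| info_url | URL of the player's detail page | Player column of the main-page table |
| player_name | Player name | Player column of the main-page table |
| player_pos | Position | Pos column of the main-page table |
| player_ht | Height (feet-inches) | Ht column of the main-page table |
| player_wt | Weight (lb) | Player detail page (the weight column on the main page contains empty values) |
| player_age | Player age | Player detail page; loaded dynamically, so it must be computed manually (the format must match the website's) |
| country | Nationality | Player detail page (uppercase) |
| college | College attended | Player detail page |
| high_school | High school attended | Player detail page |
| rank_year | Rank within class year | Player detail page |
| draft | Draft information | Player detail page |
| draft_date | Draft date | Player detail page |
| work_year | Experience | Player detail page |
| team_count | Number of teams played for | Player detail page |
| last_team_name | Last team played for | Player detail page |
| season | Season | Player detail page |
| games_count | Games played | Player detail page |
| PTS | Points per game | Player detail page |
| TRB | Rebounds per game | Player detail page |
| AST | Assists per game | Player detail page |
| FG | Field goal percentage | Player detail page |
| FG3 | Three-point percentage | Player detail page |
| FT | Free throw percentage | Player detail page |
| EFG | Effective field goal percentage | Player detail page |
| PER | Player efficiency rating | Player detail page |
| WS | Win shares | Player detail page |
| firstTime | Debut date | Season column of the table on the detail page: the date of the first game played, found via the first season's link |
| lastTime | Retirement date | Season column of the table on the detail page: the date of the last game played, found via the last season's link |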
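The id and info_url fields can both be derived from the link suffix found on the main page. The sketch below uses only the standard library; the names SITE, absolute_info_url, and player_id_from_url are hypothetical, and a regular expression would work just as well.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# Site root, taken from the base URL given in the task.
SITE = "https://www.basketball-reference.com"


def absolute_info_url(href):
    """Turn a link suffix such as /players/a/abdelal01.html into a full URL."""
    return urljoin(SITE, href)


def player_id_from_url(info_url):
    """Return the file name of the detail page without its .html extension."""
    path = urlparse(info_url).path            # "/players/a/abdelal01.html"
    filename = posixpath.basename(path)       # "abdelal01.html"
    return posixpath.splitext(filename)[0]    # "abdelal01"
```

With the example from the table, player_id_from_url("https://www.basketball-reference.com/players/a/abdelal01.html") returns abdelal01.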
Part of the collected data looks like this:
avdijde01,https://www.basketball-reference.com/players/a/avdijde01.html,Deni Avdija,G-F,6-9,210lb,21-284d,IL,,,,"Washington Wizards, 1st round (9th pick, 9th overall), 2020 NBA Draft","December 23, 2020",2 years,1,Washington Wizards,Career,136,7.6,5.1,1.7,42.6,31.6,72.9,50.2,10.0,3.6,2020-12-23,2022-04-10
averibi01,https://www.basketball-reference.com/players/a/averibi01.html,Bird Averitt,G,6-1,170lb,68-144d,US,Pepperdine,Hopkinsville,,"Portland Trail Blazers, 4th round (3rd pick, 55th overall), 1973 NBA Draft","October 21, 1976",5 years,5,Buffalo Braves,Career,366,12.1,1.9,2.9,40.5,24.9,74.3,41.2,12.5,3.9,1973-10-10,1978-04-09
averywi01,https://www.basketball-reference.com/players/a/averywi01.html,William Avery,G,6-2,197lb,43-67d,US,Duke,,,"Minnesota Timberwolves, 1st round (14th pick, 14th overall), 1999 NBA Draft","November 13, 1999",3 years,1,Minnesota Timberwolves,Career,142,2.7,0.7,1.4,33.0,25.5,71.4,37.8,7.3,-0.9,1999-11-05,2002-03-17
awtrede01,https://www.basketball-reference.com/players/a/awtrede01.html,Dennis Awtrey,C,6-10,235lb,74-234d,US,Santa Clara,San Jose,,"Philadelphia 76ers, 3rd round (12th pick, 46th overall), 1970 NBA Draft","October 14, 1970",12 years,6,Portland Trail Blazers,Career,733,4.8,4.6,2.0,45.9,-,65.2,45.9,10.5,22.4,1970-10-14,1981-12-19
ayayijo01,https://www.basketball-reference.com/players/a/ayayijo01.html,Joel Ayayi,G,6-5,180lb,22-223d,FR,Gonzaga,Paris,,,"October 25, 2021",1 year,1,Washington Wizards,Career,7,0.3,0.4,0.6,16.7,0.0,-,16.7,3.4,0.0,2021-10-20,2022-03-06
ayongu01,https://www.basketball-reference.com/players/a/ayongu01.html,Gustavo Ayón,C,6-10,250lb,37-196d,MX,,,,,"January 1, 2012",3 years,4,Atlanta Hawks,Career,135,4.7,4.4,1.3,53.6,0.0,50.4,53.6,14.7,5.1,2011-12-26,2014-04-16
pendeje02,https://www.basketball-reference.com/players/p/pendeje02.html,Jeff Ayres,F,6-9,240lb,35-168d,US,Arizona State,Rancho Cucamonga,,"Sacramento Kings, 2nd round (1st pick, 31st overall), 2009 NBA Draft","December 22, 2009",6 years,4,Los Angeles Clippers,Career,237,2.9,2.7,0.4,55.3,40.0,77.6,55.5,12.4,6.3,2009-10-27,2016-04-13
aytonde01,https://www.basketball-reference.com/players/a/aytonde01.html,Deandre Ayton,C,6-11,250lb,24-83d,BS,Arizona,Phoenix,2017 (3),"Phoenix Suns, 1st round (1st pick, 1st overall), 2018 NBA Draft","October 17, 2018",4 years,1,Phoenix Suns,Career,236,16.3,10.5,1.6,59.9,25.0,75.4,60.2,20.7,24.6,2018-10-17,2022-04-10
azubuke01,https://www.basketball-reference.com/players/a/azubuke01.html,Kelenna Azubuike,G,6-5,220lb,38-302d,GB,Kentucky,Tulsa,2002 (31),,"January 2, 2007",5 years,2,Dallas Mavericks,Career,208,10.5,4.0,1.1,45.9,40.9,77.0,51.9,14.3,9.3,2007-01-02,2012-04-26
azubuud01,https://www.basketball-reference.com/players/a/azubuud01.html,Udoka Azubuike,C,6-10,280lb,23-27d,NG,Kansas,Jacksonville,2016 (33),"Utah Jazz, 1st round (27th pick, 27th overall), 2020 NBA Draft","December 23, 2020",2 years,1,Utah Jazz,Career,32,3.0,2.6,0.0,70.7,-,66.7,70.7,15.9,0.8,2020-12-23,2022-04-10
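In the sample rows above, player_age appears as whole years followed by the days since the last birthday, for example 21-284d. Because the field table says the age must be computed manually, here is a minimal sketch of that calculation. It assumes the birth date has already been parsed from the detail page into a datetime.date; the function names are hypothetical, and the handling of a 29 February birthday is an assumption, since the site's own rule is not stated in the source.

```python
import datetime


def _birthday_in(birth, year):
    """Return the birthday in the given year.

    A 29 February birthday falls back to 1 March in non-leap years
    (an assumption, not confirmed by the source).
    """
    try:
        return birth.replace(year=year)
    except ValueError:
        return datetime.date(year, 3, 1)


def format_age(birth, today):
    """Format an age as '<years>-<days>d', e.g. '22-10d'."""
    years = today.year - birth.year
    last_birthday = _birthday_in(birth, today.year)
    if last_birthday > today:
        # This year's birthday has not happened yet.
        years -= 1
        last_birthday = _birthday_in(birth, today.year - 1)
    days = (today - last_birthday).days
    return f"{years}-{days}d"
```

In the crawler, today would typically be datetime.date.today(), so the stored value depends on the day the script runs.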
If running the program raises the following exception:
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='www.basketball-reference.com', port=443):
Max retries exceeded with url: /players/a/ (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f43a73029e8>,
'Connection to www.basketball-reference.com timed out. (connect timeout=30)'))
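Solution:
Add a headers parameter to the request in the entry function to set the request headers. To get the header contents, open the site, enter the browser's developer tools, and look up the Request Headers of the page a/.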
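The fix described above might look like the following sketch. The header values are placeholders and should be replaced with the ones copied from your own browser; the names HEADERS and fetch are hypothetical and are not taken from the task's code file.

```python
# Placeholder values (assumptions): copy the real ones from the browser's
# Request Headers panel for the page a/.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch(url, headers=HEADERS, timeout=30):
    """Request a page with explicit headers and raise on HTTP errors."""
    # requests is third-party; it is imported here so that this sketch can
    # be loaded even where the package is not installed.
    import requests

    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response
```

The entry function could then call fetch("https://www.basketball-reference.com/players/a/") and pass response.text to the parser.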
The code is as follows:
import sys
import csv
import datetime
import re
import string
import time
from lxml import etree
import requests
# TODO: global variables
##################### Begin #####################
save_fp = open("./nba_data.csv", "w", encoding="utf-8-sig", newline="")  # open the output file
##################### End #####################
csv_writer = csv.writer(save_fp)  # create the csv writer
start_time = int(time.time())  # record the start time
main_response = None  # holds the main page content
main_count = 0  # current loop count for the main page
total_count = 0  # total loop count for the current page
# Parse the page
# Get the player detail-page link, name,