
[Python Crawler Course Project] 2022-23 UEFA Champions League: Plotting Player Data Bar Charts and a Word Cloud

Date: 2022-12-22 17:45:51

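I. Background of the Topic

1. Background: The 2022-23 UEFA Champions League is the 68th season of Europe's top club football competition organised by UEFA, and the 31st under the Champions League name. The final is scheduled for 10 June 2023 at the Atatürk Olympic Stadium in Istanbul, Turkey. The stadium had originally been chosen to host the 2021 Champions League final, but that plan was changed because of the COVID-19 pandemic in Turkey. The winners of the 2022-23 Champions League qualify automatically for the 2023-24 Champions League group stage and also earn a place in the 2023 UEFA Super Cup, where they face the winners of the 2022-23 UEFA Europa League.

2. Purpose: Because of the pandemic in Europe, many football matches have recently had to be suspended, and I suspect plenty of fans are stuck at home with nothing to watch. As a fan myself, I had a sudden idea: if we look at Europe's world-class players through data analysis, does each of them really live up to the reputation?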

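II. Design of the Topic-Focused Web Crawler

1. Name of the crawler

[Python Crawler Course Project] 2022-23 UEFA Champions League: Plotting Player Data Bar Charts and a Word Cloud

2. Content crawled and data characteristics

The crawler collects player data, including position, age, minutes played, goals, and shots, and presents it through data visualization.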

 

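3. Overview of the crawler design (implementation approach and technical difficulties)

Implementation approach:

1. Data collection. The data comes from http://www.tzuqiu.cc/matchPlayerStatistics/querysStat.json?

2. Data cleaning: locate and extract the required fields, then save them.

3. Load the saved data, draw the word cloud, and produce further visualizations.

Technical difficulties:

1. Locating the right nodes.

2. Using data visualization flexibly.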
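The data-collection step relies on a long query string. As a rough sketch only, the same parameters can be generated from a short table instead of being written out by hand. The field names below are copied from the query string used later in this post; the function name `build_params` and the example competition ID `7` are made up for illustration, and the cache-busting `_` parameter is left out on the assumption that the server does not require it.

```python
from urllib.parse import urlencode

# (data, name) pairs copied from the query string used in this post.
COLUMNS = [
    ("id", ""),
    ("playerFormat", ""),
    ("appsFormat", "psp.appsCP"),
    ("minsFormat", "psp.minsCP"),
    ("goalsFormat", "psp.goalsCP"),
    ("assistsFormat", "psp.assistsCP"),
    ("cardsFormat", "psp.cardsCP"),
    ("passSuccFormat", "psp.passSuccCP"),
    ("bigChanceCreatedFormat", "psp.bigChanceCreatedCP"),
    ("aerialWonFormat", "psp.aerialWonCP"),
    ("mansFormat", "psp.mansCP"),
    ("rateFormat", "psp.rateCP"),
]


def build_params(competition_id, season, stage, start=0, length=1):
    """Return the query parameters as a list of (key, value) pairs.

    Hypothetical helper: it only mirrors the parameters visible in the
    hand-written URL, nothing about the server is assumed beyond that.
    """
    params = []
    for i, (data, name) in enumerate(COLUMNS):
        prefix = f"columns[{i}]"
        params += [
            (f"{prefix}[data]", data),
            (f"{prefix}[name]", name),
            (f"{prefix}[searchable]", "true"),
            # In the original URL only the columns with a name are orderable.
            (f"{prefix}[orderable]", "true" if name else "false"),
            (f"{prefix}[search][value]", ""),
            (f"{prefix}[search][regex]", "false"),
        ]
    params += [
        ("start", str(start)),
        ("length", str(length)),
        ("search[value]", ""),
        ("search[regex]", "false"),
        ("extra_param[competitionId]", str(competition_id)),
        ("extra_param[orderCdnReq]", "true"),
        ("extra_param[season]", season),
        ("extra_param[stageName]", str(stage)),
    ]
    return params


# Example with a made-up competition ID; prints the encoded query string.
print(urlencode(build_params(7, "22/23", 1)))
```

The returned list can also be handed to `requests.get(url, params=...)`, which encodes it in the same way.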

III. Structural Analysis of the Target Page

1. Find the required data. Open the website, right-click, and view the page source.
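To make the node-hunting step concrete, here is a small sketch of pulling competition IDs out of link addresses. The HTML snippet is invented for illustration; only the element id `competition-league-list` and the `competitions/<id>/show.do` link pattern are taken from the crawler code in this post. It uses the standard-library `re` module rather than lxml so that it runs without extra packages.

```python
import re

# Hypothetical markup shaped like the list the crawler looks for.
SAMPLE_HTML = """
<div id="competition-league-list"><ul>
  <li><a href="/competitions/1/show.do">League A</a></li>
  <li><a href="/competitions/25/show.do">League B</a></li>
</ul></div>
"""

# Capture the ID between "competitions/" and "/show.do", plus the link text.
LINK_RE = re.compile(r'href="[^"]*competitions/([^/"]+)/show\.do"[^>]*>([^<]*)<')


def extract_competitions(html):
    """Return (competition_id, name) pairs found in the given HTML text."""
    return [(cid, name.strip()) for cid, name in LINK_RE.findall(html)]


for cid, name in extract_competitions(SAMPLE_HTML):
    print(cid, name)
```

On the real page the markup may differ, so the pattern should be checked against the actual source before relying on it.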

 

 

 

2. Data cleaning

Import the required libraries:

import os

import time

 

import requests

from lxml import etree

import json

import openpyxl

import re

from matplotlib import pyplot as plt

from wordcloud import WordCloud

from pyecharts import options as opts

from pyecharts.charts import Bar

from pyecharts.commons.utils import JsCode

from pyecharts.globals import ThemeType

import pandas as np  # note: pandas is aliased as np here (used below as np.read_excel); pd is the usual alias

import seaborn as sns

Main crawler code

# Requests were being blocked, so add request headers

headers = {

    'Accept': '*/*',

    'Accept-Encoding': 'gzip, deflate, br',

    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',

    'Connection': 'keep-alive',

    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36 QIHU 360SE'

}

 

url = "http://www.tzuqiu.cc/matchPlayerStatistics/querysStat.json?columns[0][data]=id&columns[0][name]=&columns[0][" \

      "searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][" \

      "regex]=false&columns[1][data]=playerFormat&columns[1][name]=&columns[1][searchable]=true&columns[1][" \

      "orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][" \

      "data]=appsFormat&columns[2][name]=psp.appsCP&columns[2][searchable]=true&columns[2][orderable]=true&columns[" \

      "2][search][value]=&columns[2][search][regex]=false&columns[3][data]=minsFormat&columns[3][" \

      "name]=psp.minsCP&columns[3][searchable]=true&columns[3][orderable]=true&columns[3][search][value]=&columns[3][" \

      "search][regex]=false&columns[4][data]=goalsFormat&columns[4][name]=psp.goalsCP&columns[4][" \

      "searchable]=true&columns[4][orderable]=true&columns[4][search][value]=&columns[4][search][" \

      "regex]=false&columns[5][data]=assistsFormat&columns[5][name]=psp.assistsCP&columns[5][" \

      "searchable]=true&columns[5][orderable]=true&columns[5][search][value]=&columns[5][search][" \

      "regex]=false&columns[6][data]=cardsFormat&columns[6][name]=psp.cardsCP&columns[6][searchable]=true&columns[6][" \

      "orderable]=true&columns[6][search][value]=&columns[6][search][regex]=false&columns[7][" \

      "data]=passSuccFormat&columns[7][name]=psp.passSuccCP&columns[7][searchable]=true&columns[7][" \

      "orderable]=true&columns[7][search][value]=&columns[7][search][regex]=false&columns[8][" \

      "data]=bigChanceCreatedFormat&columns[8][name]=psp.bigChanceCreatedCP&columns[8][searchable]=true&columns[8][" \

      "orderable]=true&columns[8][search][value]=&columns[8][search][regex]=false&columns[9][" \

      "data]=aerialWonFormat&columns[9][name]=psp.aerialWonCP&columns[9][searchable]=true&columns[9][" \

      "orderable]=true&columns[9][search][value]=&columns[9][search][regex]=false&columns[10][" \

      "data]=mansFormat&columns[10][name]=psp.mansCP&columns[10][searchable]=true&columns[10][" \

      "orderable]=true&columns[10][search][value]=&columns[10][search][regex]=false&columns[11][" \

      "data]=rateFormat&columns[11][name]=psp.rateCP&columns[11][searchable]=true&columns[11][" \

      "orderable]=true&columns[11][search][value]=&columns[11][search][regex]=false&start={{start}}&length={{" \

      "length}}&search[value]=&search[regex]=false&extra_param[competitionId]={{competitionId}}&extra_param[" \

      "orderCdnReq]=true&_=1670340557541"

 

html_url = "http://www.tzuqiu.cc"

response = requests.get(html_url, headers=headers)

# Parse the response into an HTML element tree

html = etree.HTML(response.text)

wb = openpyxl.Workbook()

ul = html.xpath('//*[@id="competition-league-list"]/ul')[0]

ws = wb.active

ws.title = "球员数据表"

ws.append(

    ['球员', '所属队伍', '位置', '年龄', '出场时间(分钟)', '进球',

     '射门数',

     '场均射正', '场均过人', '场均被侵犯', '场均越位', '场均被抢断', '场均失误', '助攻', '红牌', '黄牌',

     '场均拦截',

     '场均造越位', '场均犯规', '致命失误', '场均传球', '场均关键传球', '场均传中', '场均长传',

     '场均直塞', '传球成功率(%)', '创造机会', '评分'])

#  --------------------------------Fetch the data------------------------------------------

 

for li in ul.getchildren():

    a = li.getchildren()[0]

    href = a.get("href")  # link for this competition

    competition_name = a.xpath("string(.)")  # competition name text

    competition_id = re.findall(r"competitions/(.*?)/show\.do", href)[0]  # competition ID taken from the link

 

    # Only fetch the first 5 rounds

    for stage in range(5):

 

        local_url = url.replace("{{start}}", "0").replace("{{length}}", "1").replace("{{competitionId}}",

                                                                                     str(competition_id))  # build the API URL (length=1 just to read the total)

        local_url = local_url + "&extra_param[season]=" + '22/23' + "&extra_param[stageName]=" + str(stage + 1)

        res = requests.get(local_url, headers=headers)

        data = json.loads(res.text)

 

        records_total = data['recordsTotal']  # total number of records

        # Build the API URL again, this time asking for every record

        local_url = url.replace("{{start}}", "0").replace("{{length}}", str(records_total)).replace(

            "{{competitionId}}",

            str(competition_id))

        local_url = local_url + "&extra_param[season]=" + '22/23' + "&extra_param[stageName]=" + str(stage + 1)

        # Request again to fetch all the data

        res = requests.get(local_url, headers=headers)

        data = json.loads(res.text)

        if len(data['data']) == 0:

            continue

        data = data['data']

 

        for each in data:

            # Filter out players with no goals or assists (disabled)

            # if each['goals'] == 0 and each['assists'] == 0:

            #     continue

 

            if 'playerMainPosition' not in each:

                each['playerMainPosition'] = '无'

            if 'age' not in each:

                each['age'] = '无'

 

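            # NOTE: each['fouled'] fills three columns below ('场均射正', '场均被侵犯', '场均犯规').

            # Only '场均被侵犯' (times fouled) matches that key; the other two probably need

            # different keys from the API response, which are not confirmed here.

            temp_list = [each['playerName'], each['teamName'], each['playerMainPosition'],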

                         each['age'], each['mins'], each['goals'], each['shots'],

                         each['fouled'], each['dribbles'], each['fouled'], each['offsides'], each['disp'],

                         each['unsTouches'], each['assists'], each['redCards'], each['yelCards'],

                         each['interceptions'], each['offsideWon'], each['fouled'],

                         each['errorsSum'], each['passes'], each['keyPasses'],

                         each['crosses'],

                         each['longBall'], each['thBall'], each['passSucc'],

                         each['bigChanceCreated'], each['rate']]

 

            ws.append(temp_list)

        time.sleep(1)  # pause between requests so the server does not ban our IP

 

wb.save('22-23赛季世界顶级联赛球员数据表.xlsx')
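The loop above makes two requests per round: a probe with `length=1` to read `recordsTotal`, then a second request for that many rows. A minimal sketch of that pattern as a reusable function follows. The names `fetch_all_rows`, `fetch`, and `build_url` are hypothetical, and the downloader is passed in as a parameter so the logic can be tried without touching the network.

```python
import json
import time


def fetch_all_rows(fetch, build_url, delay=1.0):
    """Two-step fetch: probe for the total, then request every row.

    fetch(url) must return the response body as JSON text, and
    build_url(length) must return the API URL for that page size.
    Both are supplied by the caller (assumptions of this sketch).
    """
    probe = json.loads(fetch(build_url(1)))
    total = probe.get("recordsTotal", 0)
    if total == 0:
        return []  # nothing recorded for this round
    time.sleep(delay)  # be polite between the two requests
    full = json.loads(fetch(build_url(total)))
    return full.get("data", [])


# Offline demonstration with a fake server that knows three players.
def fake_fetch(url):
    length = int(url.split("length=")[1])
    rows = [{"playerName": f"player{i}"} for i in range(3)]
    return json.dumps({"recordsTotal": 3, "data": rows[:length]})


rows = fetch_all_rows(fake_fetch, lambda n: f"http://example.invalid/?length={n}", delay=0)
print(len(rows))
```

With the real site, `fetch` could be something like `lambda u: requests.get(u, headers=headers, timeout=10).text`; using `.get()` on the parsed JSON avoids a `KeyError` if the server returns an unexpected body.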

After running:

 

 

 

Open the file 22-23赛季世界顶级联赛球员数据表:

 

 

 

Drawing the word cloud

wb = openpyxl.load_workbook('22-23赛季世界顶级联赛球员数据表.xlsx')

sheet_names = wb.sheetnames

 

 

def generate_image(frequencies, name):

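    # msyh.ttc is the Microsoft YaHei font on Windows; on other systems point font_path at another CJK font

    wordcloud = WordCloud(font_path="C:/Windows/Fonts/msyh.ttc",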

                          background_color="white",

                          width=1920, height=1080)

    # Generate the word cloud from the frequency data

    wordcloud.generate_from_frequencies(frequencies)

    # Save the word cloud as a PNG file

 

    wordcloud.to_file('%s.png' % name)

goal = {}

ws = wb.active

for row in ws.values:

    if row[0] == '球员':

        pass

    else:

        goal[row[0]] = float(row[5])

generate_image(goal, "球员进球数词云")
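One thing to watch in the loop above: the crawler writes one row per player per round, so a player who appears in several rounds overwrites their own entry in `goal`, and only the last row counts. A hedged sketch of summing the rows instead is shown below. The helper name `total_goals` is made up, and summing is only right if each row holds the figures for that round alone; if the site reports cumulative totals, taking the maximum would be the safer choice.

```python
from collections import defaultdict


def total_goals(rows):
    """Sum goals per player from (name, goals) pairs.

    Rows with a missing name or goal value are skipped, and players
    whose total is zero are dropped so they do not clutter the cloud.
    """
    totals = defaultdict(float)
    for name, goals in rows:
        if name is None or goals in (None, ""):
            continue
        totals[name] += float(goals)
    return {name: g for name, g in totals.items() if g > 0}


# Made-up sample rows: (player, goals in one round).
sample = [("A", 2), ("B", 0), ("A", 1), ("C", None)]
print(total_goals(sample))
```

The resulting dictionary can be passed straight to `generate_image`, for example by building the pairs with `(row[0], row[5])` for every row after the header.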

 

The PNG image is generated:

 

 

 

Word cloud of player goals

 

 

 

3. Data visualization

wb = openpyxl.load_workbook('22-23赛季世界顶级联赛球员数据表.xlsx')

 

df = np.read_excel('22-23赛季世界顶级联赛球员数据表.xlsx',"球员数据表")

goals = df['进球'].values.tolist()

players = df["球员"].values.tolist()

# Remove entries with zero goals

count = len(goals)

remove_count = 0

i = 0

while count - remove_count > i:

    if goals[i] == 0:

        goals.pop(i)

        players.pop(i)

        remove_count += 1

        i -= 1

    i += 1

c = (

    Bar(init_opts=opts.InitOpts(width="1600px",

                                height="720px", theme=ThemeType.LIGHT))

    .add_xaxis(players)

    .add_yaxis("进球数", goals)

    # Set the chart title, then render to an HTML file

    .set_global_opts(

        title_opts={"text": "进球数"}

    )

    .render("球员进球数.html")

)
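The `while` loop above removes zero-goal entries by popping from both lists while adjusting the index. As an alternative sketch (not the code used in this post), the two lists can be filtered together in one pass, which avoids the index bookkeeping. The function name `drop_zero` is hypothetical.

```python
def drop_zero(players, goals):
    """Return (players, goals) with every zero-goal entry removed."""
    kept = [(p, g) for p, g in zip(players, goals) if g != 0]
    if not kept:
        return [], []
    names, values = zip(*kept)
    return list(names), list(values)


# Made-up sample data to show the effect.
players, goals = drop_zero(["A", "B", "C"], [2, 0, 1])
print(players, goals)
```

The filtered lists can then be passed to `add_xaxis` and `add_yaxis` exactly as before.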

 

 

 

Plotting the goals versus shots comparison

     

 

 

 

Regression of passes per game against pass success rate
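The regression plot is drawn with `sns.lmplot`, which by default fits a straight line by ordinary least squares. To show what that line is, here is a small pure-Python sketch of the same kind of fit. The function name `fit_line` and the sample numbers are invented for illustration and are not values from the crawled data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more points, with xs and ys of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

With the real data, `xs` would be the '场均传球' column and `ys` the '传球成功率(%)' column; a positive slope would suggest that players who pass more also tend to complete a higher share of their passes.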

 

 

 

IV. Complete Source Code

import os

import time

 

import requests

from lxml import etree

import json

import openpyxl

import re

from matplotlib import pyplot as plt

from wordcloud import WordCloud

from pyecharts import options as opts

from pyecharts.charts import Bar

from pyecharts.commons.utils import JsCode

from pyecharts.globals import ThemeType

import pandas as np  # note: pandas is aliased as np here (used below as np.read_excel); pd is the usual alias

import seaborn as sns

 

 

# Requests were being blocked, so add request headers

headers = {

    'Accept': '*/*',

    'Accept-Encoding': 'gzip, deflate, br',

    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',

    'Connection': 'keep-alive',

    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36 QIHU 360SE'

}

 

url = "http://www.tzuqiu.cc/matchPlayerStatistics/querysStat.json?columns[0][data]=id&columns[0][name]=&columns[0][" \

      "searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][" \

      "regex]=false&columns[1][data]=playerFormat&columns[1][name]=&columns[1][searchable]=true&columns[1][" \

      "orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][" \

      "data]=appsFormat&columns[2][name]=psp.appsCP&columns[2][searchable]=true&columns[2][orderable]=true&columns[" \

      "2][search][value]=&columns[2][search][regex]=false&columns[3][data]=minsFormat&columns[3][" \

      "name]=psp.minsCP&columns[3][searchable]=true&columns[3][orderable]=true&columns[3][search][value]=&columns[3][" \

      "search][regex]=false&columns[4][data]=goalsFormat&columns[4][name]=psp.goalsCP&columns[4][" \

      "searchable]=true&columns[4][orderable]=true&columns[4][search][value]=&columns[4][search][" \

      "regex]=false&columns[5][data]=assistsFormat&columns[5][name]=psp.assistsCP&columns[5][" \

      "searchable]=true&columns[5][orderable]=true&columns[5][search][value]=&columns[5][search][" \

      "regex]=false&columns[6][data]=cardsFormat&columns[6][name]=psp.cardsCP&columns[6][searchable]=true&columns[6][" \

      "orderable]=true&columns[6][search][value]=&columns[6][search][regex]=false&columns[7][" \

      "data]=passSuccFormat&columns[7][name]=psp.passSuccCP&columns[7][searchable]=true&columns[7][" \

      "orderable]=true&columns[7][search][value]=&columns[7][search][regex]=false&columns[8][" \

      "data]=bigChanceCreatedFormat&columns[8][name]=psp.bigChanceCreatedCP&columns[8][searchable]=true&columns[8][" \

      "orderable]=true&columns[8][search][value]=&columns[8][search][regex]=false&columns[9][" \

      "data]=aerialWonFormat&columns[9][name]=psp.aerialWonCP&columns[9][searchable]=true&columns[9][" \

      "orderable]=true&columns[9][search][value]=&columns[9][search][regex]=false&columns[10][" \

      "data]=mansFormat&columns[10][name]=psp.mansCP&columns[10][searchable]=true&columns[10][" \

      "orderable]=true&columns[10][search][value]=&columns[10][search][regex]=false&columns[11][" \

      "data]=rateFormat&columns[11][name]=psp.rateCP&columns[11][searchable]=true&columns[11][" \

      "orderable]=true&columns[11][search][value]=&columns[11][search][regex]=false&start={{start}}&length={{" \

      "length}}&search[value]=&search[regex]=false&extra_param[competitionId]={{competitionId}}&extra_param[" \

      "orderCdnReq]=true&_=1670340557541"

 

html_url = "http://www.tzuqiu.cc"

response = requests.get(html_url, headers=headers)

# Parse the response into an HTML element tree

html = etree.HTML(response.text)

wb = openpyxl.Workbook()

ul = html.xpath('//*[@id="competition-league-list"]/ul')[0]

ws = wb.active

ws.title = "球员数据表"

ws.append(

    ['球员', '所属队伍', '位置', '年龄', '出场时间(分钟)', '进球',

     '射门数',

     '场均射正', '场均过人', '场均被侵犯', '场均越位', '场均被抢断', '场均失误', '助攻', '红牌', '黄牌',

     '场均拦截',

     '场均造越位', '场均犯规', '致命失误', '场均传球', '场均关键传球', '场均传中', '场均长传',

     '场均直塞', '传球成功率(%)', '创造机会', '评分'])

#  --------------------------------Fetch the data------------------------------------------

 

for li in ul.getchildren():

    a = li.getchildren()[0]

    href = a.get("href")  # link for this competition

    competition_name = a.xpath("string(.)")  # competition name text

    competition_id = re.findall(r"competitions/(.*?)/show\.do", href)[0]  # competition ID taken from the link

 

    # Only fetch the first 5 rounds

    for stage in range(5):

 

        local_url = url.replace("{{start}}", "0").replace("{{length}}", "1").replace("{{competitionId}}",

                                                                                     str(competition_id))  # build the API URL (length=1 just to read the total)

        local_url = local_url + "&extra_param[season]=" + '22/23' + "&extra_param[stageName]=" + str(stage + 1)

        res = requests.get(local_url, headers=headers)

        data = json.loads(res.text)

 

        records_total = data['recordsTotal']  # total number of records

        # Build the API URL again, this time asking for every record

        local_url = url.replace("{{start}}", "0").replace("{{length}}", str(records_total)).replace(

            "{{competitionId}}",

            str(competition_id))

        local_url = local_url + "&extra_param[season]=" + '22/23' + "&extra_param[stageName]=" + str(stage + 1)

        # Request again to fetch all the data

        res = requests.get(local_url, headers=headers)

        data = json.loads(res.text)

        if len(data['data']) == 0:

            continue

        data = data['data']

 

        for each in data:

            # Filter out players with no goals or assists (disabled)

            # if each['goals'] == 0 and each['assists'] == 0:

            #     continue

 

            if 'playerMainPosition' not in each:

                each['playerMainPosition'] = '无'

            if 'age' not in each:

                each['age'] = '无'

 

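            # NOTE: each['fouled'] fills three columns below ('场均射正', '场均被侵犯', '场均犯规').

            # Only '场均被侵犯' (times fouled) matches that key; the other two probably need

            # different keys from the API response, which are not confirmed here.

            temp_list = [each['playerName'], each['teamName'], each['playerMainPosition'],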

                         each['age'], each['mins'], each['goals'], each['shots'],

                         each['fouled'], each['dribbles'], each['fouled'], each['offsides'], each['disp'],

                         each['unsTouches'], each['assists'], each['redCards'], each['yelCards'],

                         each['interceptions'], each['offsideWon'], each['fouled'],

                         each['errorsSum'], each['passes'], each['keyPasses'],

                         each['crosses'],

                         each['longBall'], each['thBall'], each['passSucc'],

                         each['bigChanceCreated'], each['rate']]

 

            ws.append(temp_list)

        time.sleep(1)  # pause between requests so the server does not ban our IP

 

wb.save('22-23赛季世界顶级联赛球员数据表.xlsx')

 

#  --------------------------------Generate the word cloud------------------------------------------

 

wb = openpyxl.load_workbook('22-23赛季世界顶级联赛球员数据表.xlsx')

sheet_names = wb.sheetnames

 

 

def generate_image(frequencies, name):

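    # msyh.ttc is the Microsoft YaHei font on Windows; on other systems point font_path at another CJK font

    wordcloud = WordCloud(font_path="C:/Windows/Fonts/msyh.ttc",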

                          background_color="white",

                          width=1920, height=1080)

    # Generate the word cloud from the frequency data

    wordcloud.generate_from_frequencies(frequencies)

    # Save the word cloud as a PNG file

 

    wordcloud.to_file('%s.png' % name)

 

 

goal = {}

ws = wb.active

for row in ws.values:

    if row[0] == '球员':

        pass

    else:

        goal[row[0]] = float(row[5])

generate_image(goal, "球员进球数词云")

 

#  --------------------------------Build the goals bar chart------------------------------------------

 

 

wb = openpyxl.load_workbook('22-23赛季世界顶级联赛球员数据表.xlsx')

 

df = np.read_excel('22-23赛季世界顶级联赛球员数据表.xlsx',"球员数据表")

goals = df['进球'].values.tolist()

players = df["球员"].values.tolist()

# Remove entries with zero goals

count = len(goals)

remove_count = 0

i = 0

while count - remove_count > i:

    if goals[i] == 0:

        goals.pop(i)

        players.pop(i)

        remove_count += 1

        i -= 1

    i += 1

c = (

    Bar(init_opts=opts.InitOpts(width="1600px",

                                height="720px", theme=ThemeType.LIGHT))

    .add_xaxis(players)

    .add_yaxis("进球数", goals)

    # Set the chart title, then render to an HTML file

    .set_global_opts(

        title_opts={"text": "进球数"}

    )

    .render("球员进球数.html")

)

 

# --------------------------------Build the shots versus goals comparison------------------------------------------

 

 

df = np.read_excel('22-23赛季世界顶级联赛球员数据表.xlsx',"球员数据表")

goals = df['进球'].values.tolist()

shots = df['射门数'].values.tolist()

players = df["球员"].values.tolist()

# Remove entries where both shots and goals are zero

count = len(goals)

remove_count = 0

i = 0

while count - remove_count > i:

    if goals[i] == 0 and shots[i] == 0:

        goals.pop(i)

        shots.pop(i)

        players.pop(i)

        remove_count += 1

        i -= 1

    i += 1

 

c = (

    Bar(init_opts=opts.InitOpts(width="1600px",

                                height="720px", theme=ThemeType.LIGHT))

    .add_xaxis(players)

    .add_yaxis("进球数", goals, stack="stack1", category_gap="50%")

    .add_yaxis("射门数", shots, stack="stack1", category_gap="50%")

    # Set the chart title, then render to an HTML file

    .set_global_opts(

        title_opts={"text": "进球与射门数对比"}

    )

    .render("进球与射门数对比.html")

)

 

plt_data = df

plt_data.head()

 

sns.set()

 

plt.rcParams['font.sans-serif'] = ['SimHei']

plt.grid()

 

sns.lmplot(x='场均传球', y='传球成功率(%)', data=plt_data)

plt.savefig("场均传球数与传球成功率的回归.png", dpi=300)

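V. Summary

1. Summary

Through this course project, we gained a clear picture of how the world's top players perform.

2. Goals

I reached the goals I had set. Visualizing the crawled data makes it fairly easy to see the strengths of the world's top players in different areas, as well as the relative strength of the teams.

3. Suggestions for myself

(1) In a project, whenever you use a new package or a new piece of middleware, be sure to read the official documentation.

Technical articles online are a mixed bag, and the main domestic search engine is not much good, so searching the web for a reliable technical article is like panning for gold in a cesspit.

(2) Spend more time on programming learning platforms such as CSDN to strengthen your fundamentals and broaden your horizons.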

From: https://www.cnblogs.com/raozhaoqi/p/16999252.html
