Big Data Analysis: Analyzing and Predicting World Cup Matches


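(1) Background: The World Cup is one of the largest international football tournaments in the world and attracts hundreds of millions of viewers. For fans and sports enthusiasts, predicting match results is an interesting and challenging task. Predictions of match results can help fans form betting strategies, serve as a reference when watching matches, and help assess the performance of teams and players.

(2) Approach: Download the relevant datasets from the web, tidy them, and preprocess the data in a Python environment. Finally, build a World Cup match prediction model with the decision tree algorithm and test it.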

Data source: Kaggle, https://www.kaggle.com/

(3) Implementation steps:

1. Download the dataset

2. Import the third-party libraries

3. Load the three data files and drop rows with missing values
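As a minimal sketch of this step, the snippet below reads a tiny in-memory CSV and drops incomplete rows. The rows are made up for illustration and only a few column names are borrowed from the match data; the real files are read from disk with the same `pd.read_csv` and `dropna` calls.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a file such as WorldCupMatches.csv
# (the rows are made up; the last line is an empty record).
csv_text = """Year,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name
1930,France,4,1,Mexico
1930,USA,3,0,Belgium
,,,,
"""

matches = pd.read_csv(io.StringIO(csv_text))
print(len(matches))          # rows read, including the empty record
matches = matches.dropna()   # drop every row with at least one missing value
print(len(matches))          # rows that remain
```

Because `dropna()` with default arguments removes a row if any single field is missing, it is worth comparing the row counts before and after to see how much data is discarded.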

4. Data analysis

A brief look at which teams have won the World Cup.

5. Data preprocessing

Next, create a dictionary that stores the team names.
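A minimal sketch of how such a dictionary might be built, assuming ids are handed out in order of first appearance. The toy `matches` frame is hypothetical; only the column names 'Home Team Name' and 'Away Team Name' come from the dataset.

```python
import pandas as pd

# Toy stand-in for the matches table (hypothetical rows, real column names).
matches = pd.DataFrame({
    'Home Team Name': ['France', 'USA', 'Brazil'],
    'Away Team Name': ['Mexico', 'Belgium', 'France'],
})

# Assign ids in order of first appearance, checking the home team
# before the away team of each match.
team_name = {}
for home, away in zip(matches['Home Team Name'], matches['Away Team Name']):
    for name in (home, away):
        if name not in team_name:
            team_name[name] = len(team_name)

print(team_name)
```

These ids are arbitrary labels. A model that treats them as numeric magnitudes sees an ordering between teams that carries no football meaning, which is worth keeping in mind when reading the results.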

Drop the unnecessary columns and count how many times each team has won the championship.
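The title count can be sketched on a few made-up rows as follows. The `cups` frame here is hypothetical; the 'Winner' column and the 'Germany FR' spelling are the ones used in the source code.

```python
import pandas as pd

# Hypothetical excerpt of the cups summary; only the 'Winner' column is used.
cups = pd.DataFrame({
    'Winner': ['Uruguay', 'Italy', 'Germany FR', 'Brazil', 'Germany', 'Brazil'],
})

# Fold the historical name into 'Germany' before counting titles.
championships = cups['Winner'].replace({'Germany FR': 'Germany'}).value_counts()

print(championships.get('Brazil', 0))    # a team with titles
print(championships.get('Croatia', 0))   # a team without titles falls back to 0
```

Passing a default to `Series.get` avoids a separate check for teams that never won.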

Define a function that determines the winning team.
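As an alternative to a row-by-row function, the same labels can be computed for the whole table at once. This is a sketch with `np.select` on made-up scores, not the approach used in the source code; the label encoding (0 = draw, 1 = home win, 2 = away win) is the one from the source.

```python
import numpy as np
import pandas as pd

# Made-up scores; the column names match the dataset.
goals = pd.DataFrame({
    'Home Team Goals': [4, 1, 0],
    'Away Team Goals': [1, 1, 2],
})

home, away = goals['Home Team Goals'], goals['Away Team Goals']
# 0 = draw, 1 = home win, 2 = away win (the default covers the remaining case).
goals['Winner'] = np.select([home == away, home > away], [0, 1], default=2)

print(goals)
```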

Replace the team names with their ids from the team_name dictionary.

6. Prepare the X and y data for modeling

Shuffle the data.
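Before shuffling, the source code for this step also doubles the data by mirroring each match: the home and away columns are swapped and the result label is flipped. A minimal sketch on made-up rows, using index-based swapping instead of the column-by-column copy in the source:

```python
import numpy as np
from sklearn.utils import shuffle

# Columns: home id, away id, home titles, away titles (made-up values).
X = np.array([[0, 1, 2, 0],
              [2, 3, 0, 5]], dtype='float64')
y = np.array([1, 0])  # 0 = draw, 1 = home win, 2 = away win

# Mirror every match: swap the home/away columns ...
X_mirror = X[:, [1, 0, 3, 2]]
# ... and swap labels 1 and 2 while leaving draws (0) untouched.
y_mirror = np.where(y == 1, 2, np.where(y == 2, 1, y))

X_all = np.concatenate((X, X_mirror), axis=0)
y_all = np.concatenate((y, y_mirror))

# shuffle applies the same permutation to both arrays.
X_all, y_all = shuffle(X_all, y_all, random_state=0)
print(X_all)
print(y_all)
```

One caveat: each mirrored row carries the same information as its original, so a random split made after mirroring can place a match in the training set and its mirror in the test set.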

7. Build the model

Train an SVM (support vector machine) model and print its evaluation metrics.

The decision tree model scored higher on these metrics, so the decision tree algorithm was chosen as the final model.
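A comparison like this is only meaningful if both models are fitted on the training split and scored on rows they have not seen. The sketch below shows that procedure on synthetic data; the features and labels are random stand-ins, so the scores themselves say nothing about football.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the match features and the 0/1/2 labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(300, 4)).astype('float64')
y = rng.integers(0, 3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    'svm': SVC(kernel='rbf', class_weight='balanced'),
    'tree': DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # fit on the training split only
    scores[name] = model.score(X_test, y_test)   # accuracy on held-out rows

print(scores)
```

If a model is instead fitted on the full data set, a fully grown decision tree can memorize the test rows and report an inflated score.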

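8. Model prediction

Matches from the closing rounds of the 2022 Qatar World Cup, around the semifinals, are used here to check the model.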
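When reading class probabilities from a fitted scikit-learn classifier, the columns of `predict_proba` follow the order of `classes_`. A minimal sketch on made-up training rows (a stand-in model, not the one trained in this write-up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up training rows: home id, away id, home titles, away titles.
X = np.array([[0, 1, 2, 0],
              [1, 0, 0, 2],
              [2, 3, 1, 1],
              [3, 2, 1, 1]], dtype='float64')
y = np.array([1, 2, 0, 0])  # 0 = draw, 1 = home win, 2 = away win

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# predict_proba returns one column per entry of model.classes_, in that order.
proba = model.predict_proba(np.array([[0, 1, 2, 0]], dtype='float64'))[0]
by_label = dict(zip(model.classes_.tolist(), proba.tolist()))
print(by_label)
```

Pairing the probabilities with `classes_` avoids relying on positional indices, which matters if some label never occurs in the training data.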

The prediction was correct.

Unfortunately, this prediction was wrong.

(4) Source code of the experiment

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')
# Load the data
matches = pd.read_csv('WorldCupMatches.csv')
players = pd.read_csv('WorldCupPlayers.csv')
cups = pd.read_csv('WorldCupsSummary.csv')
# Drop rows with missing values
matches = matches.dropna()
players = players.dropna()
cups = cups.dropna()

matches.head()
players.head()
# Number of World Cup titles per winning team
plt.figure(figsize=(12,6))
sns.countplot(x='Winner',data=cups)
plt.show()
# Replace 'German DR' and 'Germany FR' with 'Germany', and 'Soviet Union' with 'Russia'
def replace_name(df):
    if df['Home Team Name'] in ['German DR', 'Germany FR']:
        df['Home Team Name'] = 'Germany'
    elif df['Home Team Name'] == 'Soviet Union':
        df['Home Team Name'] = 'Russia'
    if df['Away Team Name'] in ['German DR', 'Germany FR']:
        df['Away Team Name'] = 'Germany'
    elif df['Away Team Name'] == 'Soviet Union':
        df['Away Team Name'] = 'Russia'
    return df

matches = matches.apply(replace_name, axis='columns')
matches.head()
# Build a dictionary that maps each team name to an integer id
team_name = {}
index = 0
for idx, row in matches.iterrows():
    name = row['Home Team Name']
    if name not in team_name:
        team_name[name] = index
        index += 1
    name = row['Away Team Name']
    if name not in team_name:
        team_name[name] = index
        index += 1

team_name
# Drop the unnecessary columns (axis must be passed by keyword in recent pandas versions)
dropped_matches = matches.drop(['Datetime', 'Stadium', 'Referee', 'Assistant 1', 'Assistant 2', 'RoundID',
                                'Home Team Initials', 'Away Team Initials', 'Half-time Home Goals', 'Half-time Away Goals',
                                'Attendance', 'City', 'MatchID', 'Stage'], axis=1)
# Count how many times each team has won the World Cup
championships = cups['Winner'].map(lambda p: 'Germany' if p == 'Germany FR' else p).value_counts()
championships
# Add 'Home Team Championship' and 'Away Team Championship': the number of World Cup titles of each side
dropped_matches['Home Team Championship'] = 0
dropped_matches['Away Team Championship'] = 0

def count_championship(df):
    # Teams that never won a title keep the default of 0
    df['Home Team Championship'] = championships.get(df['Home Team Name'], 0)
    df['Away Team Championship'] = championships.get(df['Away Team Name'], 0)
    return df

dropped_matches = dropped_matches.apply(count_championship, axis='columns')
dropped_matches.head()
# Determine the result of each match: home win = 1, away win = 2, draw = 0
dropped_matches['Winner'] = '-'
def find_winner(df):
    if int(df['Home Team Goals']) == int(df['Away Team Goals']):
        df['Winner'] = 0
    elif int(df['Home Team Goals']) > int(df['Away Team Goals']):
        df['Winner'] = 1
    else:
        df['Winner'] = 2
    return df

dropped_matches = dropped_matches.apply(find_winner, axis='columns')
dropped_matches.head()
# Replace team names with their ids from the team_name dictionary
def replace_team_name_by_id(df):
    df['Home Team Name'] = team_name[df['Home Team Name']]
    df['Away Team Name'] = team_name[df['Away Team Name']]
    return df

teamid_matches = dropped_matches.apply(replace_team_name_by_id, axis='columns')
teamid_matches.head()
# Drop the columns that are no longer needed
teamid_matches = teamid_matches.drop(['Year', 'Home Team Goals', 'Away Team Goals'], axis=1)
teamid_matches.head()
X = teamid_matches[['Home Team Name', 'Away Team Name', 'Home Team Championship', 'Away Team Championship']]
X = np.array(X).astype('float64')
# Augment the data: swap the home and away team ids, swap their title counts, then flip the result label
_X = X.copy()
_X[:, 0] = X[:, 1]
_X[:, 1] = X[:, 0]
_X[:, 2] = X[:, 3]
_X[:, 3] = X[:, 2]
y = dropped_matches['Winner']
# The array is already 1-D, so no reshape with a hard-coded row count is needed
y = np.array(y).astype('int')
_y = y.copy()
for i in range(len(_y)):
    if _y[i] == 1:
        _y[i] = 2
    elif _y[i] == 2:
        _y[i] = 1
X = np.concatenate((X, _X), axis=0)
y = np.concatenate((y, _y))
print(X)
print(y)
# Shuffle the data, then split it into a training set and a test set
X, y = shuffle(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train an SVM (support vector machine) model
svm_model = SVC(kernel='rbf', class_weight='balanced', probability=True)
# Fit on the training split only, so the test-set score is not inflated by rows seen during training
svm_model.fit(X_train, y_train)
print("Predicting on the test set")
y_pred = svm_model.predict(X_test)
print(svm_model.score(X_test, y_test))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred, labels=range(3)))
# Build the decision tree model
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier()
# Fit on the training split only, as for the SVM
tree_model.fit(X_train, y_train)
print("Predicting on the test set")
y_pred = tree_model.predict(X_test)
print(tree_model.score(X_test, y_test))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Prediction function: takes two team names and reports the probability of each team winning and of a draw
def prediction(team1, team2):
    id1 = team_name[team1]
    id2 = team_name[team2]
    championship1 = championships.get(team1, 0)
    championship2 = championships.get(team2, 0)
    x = np.array([id1, id2, championship1, championship2]).astype('float64')
    x = np.reshape(x, (1, -1))
    # Probabilities come from svm_model (swap in tree_model to use the decision tree);
    # columns follow the label order 0 = draw, 1 = team1 wins, 2 = team2 wins
    _y = svm_model.predict_proba(x)[0]
    text = ('Chance for '+team1+' to beat '+team2+' is {}\nChance for '+team2+' to beat '+team1+' is {}\nChance for '+team1+' and '+team2+' draw is {}').format(_y[1]*100, _y[2]*100, _y[0]*100)
    return _y[0], text
# Predict England vs France
prob, text = prediction('England', 'France')
print(text)
# Predict Argentina vs Croatia
prob, text = prediction('Argentina', 'Croatia')
print(text)
# Predict France vs Morocco
prob, text = prediction('France', 'Morocco')
print(text)
# Predict Croatia vs Morocco
prob, text = prediction('Croatia', 'Morocco')
print(text)
# Predict Argentina vs France
prob, text = prediction('Argentina', 'France')
print(text)

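(5) Summary: Through this hands-on Python project I learned a lot of new material, gained experience, and narrowed the gap between theory and practice. Going forward, I will keep studying to strengthen my theoretical grounding, practice more deeply, and improve my overall competence.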

From： https://www.cnblogs.com/yu02-L/p/17461889.html
