MATH6182辅导、辅导Python语言程序

2022.12.15 - 首页 >> CS

MATH6182 Coursework (2022-23)
December 1, 2022
1 Football Performance Analysis
1.1 Purpose
The aim of this coursework is to extend your basic Python knowledge and analyse a large data set
consisting of thousands of entries. Within this framework, you will need to use your programming
and analysis skills to calculate some vital statistics for individual football teams or leagues and
visualise the results appropriately. The ultimate challenge is to solve a constrained optimisation
problem to predict the outcome of individual football matches based on the available statistical
data and using CVXPY. All exercises must be performed and presented in a Jupyter notebook.
1.2 Data
The required data set can be found on Blackboard, it is called understat_per_game.csv. In
general, it contains statistical quantities of individual teams per game of each season from 2014 to
2019 for 6 UEFA leagues:
In this framework, there are the following standard parameters:
league - Name of the league,
season - Starting year of the season,
team - Name of the team,
h_a - Flag that indicates a home (“h”) or an away game (“a”),
result - Flag that indicates a win (“w”), a draw (“d”) or a lose (“l”),
date - The date and time at which the game took place,
wins - Flag that indicates that the team won the game,
draws - Flag that indicates that the game was a draw,
loses - Flag that indicates that the team lost the game,
scored - Number of goals scored in the game by the team,
missed - Number of goals missed in the game by the team,
pts - Number of points collected for this game by the team.
Moreover, the following additional metrics are available:
1
xG - Expected goals metric (statistical measure for quality of chances created and conceded),
xG_diff - Difference between actual goals scored and expected goals,
npxG - Expected goals without penalties and own goals,
xGA - Expected goals against,
xGA_diff - Difference between actual goals missed and expected goals against,
npxGA - Expected goals against without penalties and own goals,
npxGD - Difference between “for” and “against” xG without penalties and own goals,
ppda_coef - Passes allowed per defensive action in the opposition half (power of pressure),
oppda_coef - Opponent ppda_coef (power of opponent’s pressure),
deep - Passes completed within an estimated 20 yards of goal (crosses excluded),
deep_allowed - Opponent deep,
xpts - Expected points,
xpts_diff - Difference between actual and expected points.
The data set can be loaded into the Jupyter notebook based on the following code.
[1]: import pandas as pd
understat_per_game_data = pd.read_csv('understat_per_game.csv')
understat_per_game_data.shape
[1]: (24580, 29)
The data set contains around 24,500 rows and 29 columns.
[2]: understat_per_game_data.head(4)
[2]: league season h_a xG xGA npxG npxGA deep \
0 Bundesliga 2014 h 2.57012 1.198420 2.57012 1.198420 5
1 Bundesliga 2014 a 1.50328 1.307950 1.50328 1.307950 10
2 Bundesliga 2014 h 1.22987 0.310166 1.22987 0.310166 13
3 Bundesliga 2014 a 1.03519 0.203118 1.03519 0.203118 6
deep_allowed scored … ppda_coef ppda_att ppda_def oppda_coef \
0 4 2 … 9.625000 231 24 21.850000
1 1 1 … 4.756098 195 41 17.695652
2 3 2 … 5.060606 167 33 16.961538
3 2 0 … 4.423077 115 26 9.446809
oppda_att oppda_def team xG_diff xGA_diff xpts_diff
0 437 20 Bayern Munich 0.57012 0.198420 -0.6514
1 407 23 Bayern Munich 0.50328 0.307950 0.5143
2 441 26 Bayern Munich -0.77013 0.310166 -0.8412
3 444 47 Bayern Munich 1.03519 0.203118 1.1367
[4 rows x 29 columns]
2 Tasks
The coursework is divided into 5 tasks:
2
1. Create your own Data Set.
2. Exploratory Data Analysis and Data Manipulation.
3. Performance Analysis for Individual Teams.
4. Performance Analysis for Individual Leagues.
5. Match Prediction.
Each of these tasks has a weighting of 20% for the final grade.
Within this framework, Jürgen Klopp, as an experienced team manager, has been hired to lead the
entire data analysis project. Therefore, he will give you further instructions for all tasks throughout
the coursework.
2.1 Task 1: Create your own Data Set
Since Jürgen Klopp has decided that only data analysts with certain first letters of the first name
can view the data of a team in a certain league, we will all analyse different subsets. Hence, we
first have to make sure that we have a suitable data set for our analyses. So, depending on the first
letter of your first name (1), you will be assigned to one of the following leagues, whose data you
will need to add to your own data set:
?{{{{
?{{{{?
Bundesliga, if 1 ∈ {,,,}
EPL, if 1 ∈ {, ,,}
La Liga, if 1 ∈ {, ,,}
Ligue 1, if 1 ∈ {,,, }
RFPL, if 1 ∈ {,, , }
Serie A, if 1 ∈ {, ,,, , }
However, since your team will also play internationally next season, Jürgen says that we should
add the data of another league to our data set in order to be able to analyse our future opponents
as well. According to Klopp’s calculations, the first letter of your surname (2) decides the league
of one of the teams you will play against in the group stage. So, depending on the first letter of
your surname, add the data of another league to your own data set:

Serie A, if 2 ∈ {,,,}
Bundesliga, if 2 ∈ {, ,,}
EPL, if 2 ∈ {, ,,}
La Liga, if 2 ∈ {,,, }
Ligue 1, if 2 ∈ {,, , }
RFPL, if 2 ∈ {, ,,, , }
It should be noted that Jürgen Klopp has found some errors in his calculations and it may therefore
happen that you receive the same league in the second allocation as in the first allocation. In this
case, Klopp cannot give you any further information about your future opponents and you will have
to randomly select one of the remaining leagues instead, based on a draw with equal probabilities.
3
When doing so, do not forget to set a seed that corresponds to the last three digits of your student
ID before the random selection.
In addition, Jürgen believes that your personal preferences will also have a great influence on the
division of the teams for the group stage. Therefore, arrange the leagues according to your personal
preferences and save them as a list of strings. In first place should be the league you like best, and
so it goes on to the last league. In this context, check whether your first preference is already part
of the data set. If not, add it to your record. Otherwise, continue with the same check for your
second and third preference. This ensures that your data set now contains the matches of football
clubs from three different leagues. Next, also check whether your last preference is part of the data
set. If so, draw a random number in the interval [0, 1] and round it to zero decimals. Do not forget
to set a seed based on the first three digits of your student ID beforehand. If the rounded number
is a “1”, Klopp says that it is very unlikely that you will have to play against a team from that
league. Therefore, you should remove the data from your record and add the data from one of the
remaining unselected leagues based on a random selection with equal probabilities.
Finally, Klopp believes that some outsiders have manipulated the data in an unfavourable way.
Therefore, he proposes to remove a certain number of rows from the data set to get the final subset
with unmanipulated data entries. In this context, you need to draw a number from the interval
[0, 500] based on a random selection with equal probabilities that tells you how many manipulated
rows you have in your dataset and should therefore remove them. Again, do not forget to set a
seed beforehand that corresponds to your entire student ID. Note that you must also randomise
the indices of the entries you are removing.
Jürgen Klopp expects you to define a function that outputs your final data set. Your function
should have four arguments: The first is a pandas DataFrame, the second is your student ID, the
third and fourth are the first letters of your first and last name respectively and the last is the list
of your ordered preferences. At the beginning of the function, please also check that all the entries
in the function are in the correct format, i.e. check that
df - is a non-epmty pandas DataFrame.
student_id - is a string or an integer consisting of 8 letters or individual numbers.
first_name_start_letter - is a string consisting of 1 letter.
last_name_start_letter - is a string consistig of 1 letter.
ordered_preferences - is a list of six strings.
If at least one of the previously defined checks fails, issue an error message, exist the function and
correct the inputs so that you can successfully apply the function. In this framework, you might
find it helpful to inform yourself about the possibility to raise a TypeError. The final dataset can
be named as mydataset or with a name of your choice.
Please use the structure shown below to set up your function. If you do not comment your code,
Jürgen will be mad at you and you will not get full points for this task. Comments in the code are
a very important part of programming so that other programmers can understand your code.
[3]: import numpy as np
def new_dataset(df,
student_id,
first_name_start_letter,
4
last_name_start_letter,
ordered_preferences):
"""
Outputs your individual data set.
:param df: A pandas DataFrame that contains the
"understat_per_game_data.csv" data.
:param student_id: An integer or a string (whatever is
easier for you) that corresponds to
your student ID.
:param: first_name_start_letter: A string that contains the first
letter of your first name.
:param: last_name_start_letter: A string that contains the first
letter of your last name.
:param: ordered_preferences: A list that contains your ordered
preferences regarding the different
leagues.
:return: A pandas DataFrame containing an individual subset of the
df DataFrame.
"""
# 1. Check if function inputs are valid
# [...]
# 2. Subset data based on the starting letter of your first name
# [...]
# 3. Subset data based on the starting letter of your last name
np.random.seed(XXX)
# [...]
# 4. Check your first three preferences
# [...]
# 5. Check your last preference
np.random.seed(XXX)
# [...]
# 6. Remove random amount of rows
np.random.seed(XXXXXXXX)
# [...]
return my_dataset # output the new data set.
5
2.2 Task 2: Exploratory Data Analysis & Data Manipulations
2.2.1 Subtask 2.1
After you have received the relevant data set for your work, the first thing Jürgen Klopp wants to
do is to carry out a detailed review of the data properties. He is still a little afraid that outsiders
might have falsified the data so that our analyses lead to false conclusions. Therefore, as good data
analysts, we check whether we have read in the data correctly and whether there are any suspicious
observations. In this regard, write a function that checks the following properties:
1. All columns are formatted as indicated in the data description file (data_description.csv).
2. The data set does not contain any missing values.
If you find missing values, remove the corresponding rows from the data set. Jürgen says that these
instances have most likely been manipulated by an outsider. In general, your function should have
two inputs corresponding to the data set and the data description file, and it should also have two
outputs: A Pandas DataFrame with three columns [ID of the check, short description of the check
and a flag indicating whether the check failed (1) or passed (0)] and the cleaned version of your
data set.
Do not forget to check the function inputs as already done in the previous task.
Please use the structure shown below to set up your function and print out the head of both
outputs. If you do not comment your code, Jürgen will be mad at you and you will not get full
points for this task. Comments in the code are a very important part of programming so that other
programmers can understand your code.
[4]: def cleaned_dataset(df, data_descrip):
"""
Outputs the cleaned version of your individual data set.
:param df: A pandas DataFrame with your individual data.
:param data_descrip: A pandas DataFrame with the data description.
:return: A pandas DataFrame with a summary of the performed checks and
a pandas DataFrame with the cleaned input data.
"""
# 1. Check if function inputs are valid
# [...]
# 2. Creates an empty table for the summary of your checks
# [...]
# 3. Checks if all columns are in the correct format
# [...]
# 4. Checks if the data contains missing values
# [...]
return check_summary, my_dataset # output.
6
2.2.2 Subtask 2.2
Jürgen Klopp is now sure that we have good data quality, but he still thinks that we need to
do more data preparation to facilitate our later analyses. Therefore, he asks you to also include
transformed versions of certain columns in the data set. Within this framework, please do the
following data manipulations:
1. Add another variable to the data set called “h_a_transformed” which corresponds to a nu-
merical representation of the “h_a” variable, i.e. an “h” is decoded as 1 and an “a” as 0.
2. Based on the “date” column, add three more variables to the data set called “year”, “month”
and “match_time”. The first two should correspond to the year and month when the match
took place and the third should represent the time when the match started.
3. Based on the “wins”, “draws”, “loses” and “pts” columns, for each team calculate the total
number of previous wins, draws, loses and points during the respective season and add each
of these variables to the data set. The columns should be named based on the following
structure: “cumulative_wins”, “cumulative_pts”, […].
Please use the structure shown below to set up your code and at the end print out the head of
your data set. If you do not comment your code, Jürgen will be mad at you and you will not get
full points for this task. Comments in the code are a very important part of programming so that
other programmers can understand your code.
[5]: # Transformation of "h_a" column
# [...]
# Transformation of "date" column
# [...]
# Transformation of "wins", "draws", "loses" and "pts" column
# [...]
# Print out the first five rows of your data set
# [...]
2.2.3 Subtask 2.3
Finally, to get a first impression of the distributions of the different numerical variables, Klopp would
like to see their summary statistics. Therefore, print the following information for all numerical
columns in a combined table: Count, Mean, Standard Deviation, Minimum, 25th Quantile, 50th
Quantile, 75th Quantile and Maximum.
At the end print out the head of your data set. If you do not comment your code, Jürgen will
be mad at you and you will not get full points for this task. Comments in the code are a very
important part of programming so that other programmers can understand your code.
7
2.3 Task 3: Performance Analysis on Individual Teams
2.3.1 Subtask 3.1
Jürgen Klopp now thinks that you are ready for your first performance analysis. Because you are
such a talented data analyst, you have offers from all the clubs that are in your data set and you
are in the comfortable position of deciding which club you want to sign with.
After good negotiations with your chosen club, you are happy to start the analysis. The first thing
Jürgen wants to know is how well your team has done in recent seasons, depending on whether
they have played at home or away. He also asks you to assess whether there are some trends in the
summarised data.
Recall that each row of the data set represents the game statistics of a certain football team. Hence,
you have to write a function that subsets the data based on the team you picked and creates a
DataFrame that summarizes the following information grouped by the season and the type of the
game (i.e., home or away):
The count of wins, draws and loses and,
The average goals, expected goals and expected goals against.
Then the function should also create a plot with four line graphs as subplots to visualise the results
and to simplify later interpretation:
The subplots on the left side should contain three lines showing the evolution of the number
of wins, draws and defeats in home and away matches (y-axis) over the seasons (x-axis).
The subplots on the right side should also contain three lines showing the development of
average goals, expected goals and expected goals conceded for home and away matches (y-
axis) over the seasons (x-axis).
Do not forget to include a suitable legend explaining the individual lines.
At the end the plot should look like this (you can vary the colors if you want to):
8
Do not forget to check the function inputs as already done in the previous tasks. Interpret the
results for Jürgen, e.g., the total number of wins at home has increased over the seasons and […].
Please use the structure shown below to set up your function and print out the head of the output.
If you do not comment your code, Jürgen will be mad at you and you will not get full points for this
task. Comments in the code are a very important part of programming so that other programmers
can understand your code.
[6]: import matplotlib.pyplot as plt
def team_performance(df, team_name):
"""
Outputs the summary table for subtask 3.1.
:param df: A pandas DataFrame with your individual data.
:param own_team: A string corresponding to the name of your team.
:return: A pandas DataFrame with the summary table for your team.
"""
# 1. Check if function inputs are valid
#[...]
9
# 2. Compute statistics
# [...]
# 3. Plot the results
# [...]
return summary_tab
# Print out head of summary tab
2.3.2 Subtask 3.2
Jürgen Klopp now has an idea of how well our team played during last seasons, but he also wants
to be prepared for the next game. It will be a regular league match, so you have to pick another
team from the same league. In this context, he asks you to analyse the tactical behaviour of our
next opponent. Based on his experience, Klopp knows that a team usually changes its tactics over
the course of a season, so you need to group the data by the year and month in which the matches
took place. Jürgen also knows that the team managers of our opponent have changed frequently
in the past and therefore only the data of the last available season shows the tactical behaviour of
the current coach. Based on the previous information, write a function that subsets the data and
groups it accordingly, then it should calculate the following quantities:
The average power of pressure, power of opponent’s pressure, deep and deep allowed.
Then create a line plot to visualise the results and to simplify later interpretation:
The plot should contain four lines showing the evolution of the average power of pressure,
power of opponent’s pressure, deep and deep allowed (y-axis) over the season (x-axis).
At the end the plot should look like this (you can pick different colors if you want to):
10
Do not forget to check the function inputs as already done in the previous tasks. Interpret the
results for Jürgen, e.g., over the season the opponent team has increased its power of pressure and
[…].
Please use the structure shown below to set up your function and print out the head of the output.
If you do not comment your code, Jürgen will be mad at you and you will not get full points for this
task. Comments in the code are a very important part of programming so that other programmers
can understand your code.
[7]: def opponent_performance(df, opponent_name, season):
"""
Outputs the summary table for subtask 3.2.
:param df: A pandas DataFrame with your individual data.
:param opponent_name: A string corresponding to the name of your opponent?
?team.
:param season: An integer corresponding to the starting year of the?
?desired season.
:return: A pandas DataFrame with the summary table for the opponent team.
"""
# 1. Check if function inputs are valid
# [...]
# 2. Summarize and prepare the data for plotting
# [...]
# 3. Plot the results
# [...]
return summary_tab
# Print out head of summary table
2.3.3 Subtask 3.3
Finally, Jürgen Klopp asks you for one more favour. He feels that in recent years our team’s
performance and that of our opponents has depended on the starting time of the game, i.e. our
team has won more games in the evening than at noon. For this purpose, you need to write a
function that creates a plot that contains two different bar charts as subplots:
One for our team and
One for our next opponent.
The two plots should present the following information:
On the x-axis should be the distinct starting times of the games and
On the y-axis the count of wins, draws and loses.
11
At the end the plot should look like this (you can pick different colors if you want to):
Do not forget to check the function inputs as already done in the previous tasks. Interpret the
results and check whether Jürgen was right with his feeling, e.g., our opponent normally wins more
games at noon and […].
Please use the structure shown below to set up your function and print out the head of the output.
If you do not comment your code, Jürgen will be mad at you and you will not get full points for this
task. Comments in the code are a very important part of programming so that other programmers
can understand your code.
[8]: def game_time_performance(df, team_name, opponent_name, season):
"""
Outputs the two summary tables for subtask 3.3.
:param df: A pandas DataFrame with your individual data.
:param team_name: A string corresponding to the name of your own team.
:param opponent_name: A string corresponding to the name of your opponent?
team.
:param season: An integer corresponding to the starting year of the?
desired season.
:return: A pandas DataFrame with the summary table for your own team,
a pandas DataFrame with the summary table for the opponent team.
"""
# 1. Check if function inputs are valid
# [...]
# 2. Prepare the data for plotting
# [...]
# 3. Plot the results
12
# [...]
return summary_tab_own, summary_tab_opp
# Print out head of summary tables
2.4 Task 4: Performance Analysis on Individual Leagues
Since our team will play in the Champions League next season, Jürgen Klopp also wants to be
prepared for international matches. Therefore, he is interested in a tool that gives him the same
analysis as in subtasks 3.1 and 3.2, but for a specific league rather than for an individual team. He
wants to use the tool to evaluate the overall strength of the league and to get an idea of what kind
of football is played in this country.
Therefore, you need to write a function for Klopp that receives a data frame and a league name as
inputs and outputs the two summary tables as in subtasks 3.1 and 3.2 for the appropriate league.
Finally, apply the function to all the leagues in your data set and interpret the results. In particular,
try to find differences in the tactical behaviour of the teams in the individual leagues, e.g. in the
Bundesliga the home teams have on average a higher pressure strength than in the other available
leagues.
Do not forget to check the function inputs as already done in the previous tasks. Please use the
structure shown below to set up your function and print out the head of the output. If you do
not comment your code, Jürgen will be mad at you and you will not get full points for this task.
Comments in the code are a very important part of programming so that other programmers can
understand your code.
[9]: def performance_analysis(df, league_name, season):
"""
Outputs the summary tables of subtasks 3.2 and 3.3 for an individual league?
and season.
In the framework of this function also the corresponding plots are printed.
:param df: A pandas DataFrame with your individual data.
:param league_name: A string corresponding to a league name.
:param season: An integer corresponding to the starting year of the
desired season.
:return: A pandas DataFrame with the summary table of subtask 3.2,
a pandas DataFrame with the summary table of subtask 3.3.
"""
# 1. Check if function inputs are valid
# [...]
# 2. Prepare the data for plotting
# [...]
13
# 3. Plot the results
# [...]
return summary_tab_1, summary_tab_2
# Compute and print out results
# [...]
2.5 Task 5: Game Prediction
Ultimately, Jürgen Klopp wants to have a model that predicts whether a team will win a match
based on the available match statistics. His idea is to apply this model during a match so that he
can change the tactical behaviour of our team in the right way. He knows that you are an expert
in linear programming and that you have used the library PuLP a lot. He has heard about the
linear Support Vector Machine and wants to challenge your data analyst skills a little. Therefore,
he gives you the task of calculating such a model for the given data set using CVXPY. CVXPY is
another optimisation library in Python, similar to PuLP, but it is designed for solving general
convex optimisation problems. You might find helpful information about the new library on this
webpage: https://www.cvxpy.org/.
The basic idea of a linear Support Vector Machine (SVM) is to find the maximum-margin hyper-
plane that separates two clases, i.e., wins and no wins, while allwoing for some violations controlled
by a parameter . A graphical representation based on two independet variables can be found
below (Source: Chapter 5 of the Python Data Science Handbook by Jake VanderPlas):
This general idea then can be formulated as a soft-margin classification problem with the following
objective function that we want to minimize:

max(0, 1 (wx )) + ‖w‖2
where ∈ ? represents the total number of games, x ∈ ? a vector containing the ∈ ?
independent variables available for a single game ∈ {1,… , }, ∈ {1,?1} the decoded target
variable (i.e., 1 for a win and -1 for not a win), w ∈ ? the vector containing the parameters for
the independent variables, ∈ ? is the intercept and > 0 is the previously mentioned penalty
14
parameter controlling the trade-off between a larger margin and the correct classification of the
instances.
By introducing individual slack variables = max(0, 1? (wx?)) for each instance the pre-
vioulsy defined objective function can be transformed into the following constrainted minimization
problem
Read the general documentation of CVXPY and solve the optimisation task in such a way that the
team leader gets his tool. Print your solution and the optimal unconstrainted objective function
value. Compare the predictions with the actual values and print the percentage of correctly classified
matches.
Note that predictions can be made by using the sign function:
sign(wx ? )
You should use the following columns as independet variables: h_a_transformed, xG, npxG,
npxGA, deep, deep_allowed, missed, npxGD, ppda_coef, ppda_att, ppda_def, oppda_coef,
oppda_att, oppda_def, xGA_diff, cumulative_wins, cumulative_draws, cumulative_loses and
cumulative_pts. Further, to train your model you should use 80% of the available data that you
randomly select. In addition, you must standardise all variables beforehand, as the Support Vector
Machine is sensitive to the scaling of the data. Therefore, subtract the empirical mean from each
variable and then divide it by the empirical standard deviation. The remaining data should be
used as testing data to check the out-of-sample performance. Do not forget to apply the same
standardization to the testing data.
If you are interested in getting a deeper insight into how the Linear Support Vector Machine works,
take a look at Chapter 5 of the Python Data Science Handbook by Jake VanderPlas.
Please use the structure shown below to set up your code and print out the value of objective
function, the percentage of correctly classified training and testing observations. If you do not
comment your code, Jürgen will be mad at you and you will not get full points for this task.
Comments in the code are a very important part of programming so that other programmers can
understand your code.
Hint: CVXPY needs all inputs as numpy.array. For the C parameter you can try different values
and see which gives you the best objective function value, a good starting point might be values
between (0, 1].