# MATH6182辅导、辅导Python语言程序

- 首页 >> CS MATH6182 Coursework (2022-23)

December 1, 2022

1 Football Performance Analysis

1.1 Purpose

The aim of this coursework is to extend your basic Python knowledge and analyse a large data set

consisting of thousands of entries. Within this framework, you will need to use your programming

and analysis skills to calculate some vital statistics for individual football teams or leagues and

visualise the results appropriately. The ultimate challenge is to solve a constrained optimisation

problem to predict the outcome of individual football matches based on the available statistical

data and using CVXPY. All exercises must be performed and presented in a Jupyter notebook.

1.2 Data

The required data set can be found on Blackboard, it is called understat_per_game.csv. In

general, it contains statistical quantities of individual teams per game of each season from 2014 to

2019 for 6 UEFA leagues:

In this framework, there are the following standard parameters:

league - Name of the league,

season - Starting year of the season,

team - Name of the team,

h_a - Flag that indicates a home (“h”) or an away game (“a”),

result - Flag that indicates a win (“w”), a draw (“d”) or a lose (“l”),

date - The date and time at which the game took place,

wins - Flag that indicates that the team won the game,

draws - Flag that indicates that the game was a draw,

loses - Flag that indicates that the team lost the game,

scored - Number of goals scored in the game by the team,

missed - Number of goals missed in the game by the team,

pts - Number of points collected for this game by the team.

Moreover, the following additional metrics are available:

1

xG - Expected goals metric (statistical measure for quality of chances created and conceded),

xG_diff - Difference between actual goals scored and expected goals,

npxG - Expected goals without penalties and own goals,

xGA - Expected goals against,

xGA_diff - Difference between actual goals missed and expected goals against,

npxGA - Expected goals against without penalties and own goals,

npxGD - Difference between “for” and “against” xG without penalties and own goals,

ppda_coef - Passes allowed per defensive action in the opposition half (power of pressure),

oppda_coef - Opponent ppda_coef (power of opponent’s pressure),

deep - Passes completed within an estimated 20 yards of goal (crosses excluded),

deep_allowed - Opponent deep,

xpts - Expected points,

xpts_diff - Difference between actual and expected points.

The data set can be loaded into the Jupyter notebook based on the following code.

[1]: import pandas as pd

understat_per_game_data = pd.read_csv('understat_per_game.csv')

understat_per_game_data.shape

[1]: (24580, 29)

The data set contains around 24,500 rows and 29 columns.

[2]: understat_per_game_data.head(4)

[2]: league season h_a xG xGA npxG npxGA deep \

0 Bundesliga 2014 h 2.57012 1.198420 2.57012 1.198420 5

1 Bundesliga 2014 a 1.50328 1.307950 1.50328 1.307950 10

2 Bundesliga 2014 h 1.22987 0.310166 1.22987 0.310166 13

3 Bundesliga 2014 a 1.03519 0.203118 1.03519 0.203118 6

deep_allowed scored … ppda_coef ppda_att ppda_def oppda_coef \

0 4 2 … 9.625000 231 24 21.850000

1 1 1 … 4.756098 195 41 17.695652

2 3 2 … 5.060606 167 33 16.961538

3 2 0 … 4.423077 115 26 9.446809

oppda_att oppda_def team xG_diff xGA_diff xpts_diff

0 437 20 Bayern Munich 0.57012 0.198420 -0.6514

1 407 23 Bayern Munich 0.50328 0.307950 0.5143

2 441 26 Bayern Munich -0.77013 0.310166 -0.8412

3 444 47 Bayern Munich 1.03519 0.203118 1.1367

[4 rows x 29 columns]

2 Tasks

The coursework is divided into 5 tasks:

2

1. Create your own Data Set.

2. Exploratory Data Analysis and Data Manipulation.

3. Performance Analysis for Individual Teams.

4. Performance Analysis for Individual Leagues.

5. Match Prediction.

Each of these tasks has a weighting of 20% for the final grade.

Within this framework, Jürgen Klopp, as an experienced team manager, has been hired to lead the

entire data analysis project. Therefore, he will give you further instructions for all tasks throughout

the coursework.

2.1 Task 1: Create your own Data Set

Since Jürgen Klopp has decided that only data analysts with certain first letters of the first name

can view the data of a team in a certain league, we will all analyse different subsets. Hence, we

first have to make sure that we have a suitable data set for our analyses. So, depending on the first

letter of your first name (1), you will be assigned to one of the following leagues, whose data you

will need to add to your own data set:

?{{{{

?{{{{?

Bundesliga, if 1 ∈ {,,,}

EPL, if 1 ∈ {, ,,}

La Liga, if 1 ∈ {, ,,}

Ligue 1, if 1 ∈ {,,, }

RFPL, if 1 ∈ {,, , }

Serie A, if 1 ∈ {, ,,, , }

However, since your team will also play internationally next season, Jürgen says that we should

add the data of another league to our data set in order to be able to analyse our future opponents

as well. According to Klopp’s calculations, the first letter of your surname (2) decides the league

of one of the teams you will play against in the group stage. So, depending on the first letter of

your surname, add the data of another league to your own data set:

Serie A, if 2 ∈ {,,,}

Bundesliga, if 2 ∈ {, ,,}

EPL, if 2 ∈ {, ,,}

La Liga, if 2 ∈ {,,, }

Ligue 1, if 2 ∈ {,, , }

RFPL, if 2 ∈ {, ,,, , }

It should be noted that Jürgen Klopp has found some errors in his calculations and it may therefore

happen that you receive the same league in the second allocation as in the first allocation. In this

case, Klopp cannot give you any further information about your future opponents and you will have

to randomly select one of the remaining leagues instead, based on a draw with equal probabilities.

3

When doing so, do not forget to set a seed that corresponds to the last three digits of your student

ID before the random selection.

In addition, Jürgen believes that your personal preferences will also have a great influence on the

division of the teams for the group stage. Therefore, arrange the leagues according to your personal

preferences and save them as a list of strings. In first place should be the league you like best, and

so it goes on to the last league. In this context, check whether your first preference is already part

of the data set. If not, add it to your record. Otherwise, continue with the same check for your

second and third preference. This ensures that your data set now contains the matches of football

clubs from three different leagues. Next, also check whether your last preference is part of the data

set. If so, draw a random number in the interval [0, 1] and round it to zero decimals. Do not forget

to set a seed based on the first three digits of your student ID beforehand. If the rounded number

is a “1”, Klopp says that it is very unlikely that you will have to play against a team from that

league. Therefore, you should remove the data from your record and add the data from one of the

remaining unselected leagues based on a random selection with equal probabilities.

Finally, Klopp believes that some outsiders have manipulated the data in an unfavourable way.

Therefore, he proposes to remove a certain number of rows from the data set to get the final subset

with unmanipulated data entries. In this context, you need to draw a number from the interval

[0, 500] based on a random selection with equal probabilities that tells you how many manipulated

rows you have in your dataset and should therefore remove them. Again, do not forget to set a

seed beforehand that corresponds to your entire student ID. Note that you must also randomise

the indices of the entries you are removing.

Jürgen Klopp expects you to define a function that outputs your final data set. Your function

should have four arguments: The first is a pandas DataFrame, the second is your student ID, the

third and fourth are the first letters of your first and last name respectively and the last is the list

of your ordered preferences. At the beginning of the function, please also check that all the entries

in the function are in the correct format, i.e. check that

df - is a non-epmty pandas DataFrame.

student_id - is a string or an integer consisting of 8 letters or individual numbers.

first_name_start_letter - is a string consisting of 1 letter.

last_name_start_letter - is a string consistig of 1 letter.

ordered_preferences - is a list of six strings.

If at least one of the previously defined checks fails, issue an error message, exist the function and

correct the inputs so that you can successfully apply the function. In this framework, you might

find it helpful to inform yourself about the possibility to raise a TypeError. The final dataset can

be named as mydataset or with a name of your choice.

Please use the structure shown below to set up your function. If you do not comment your code,

Jürgen will be mad at you and you will not get full points for this task. Comments in the code are

a very important part of programming so that other programmers can understand your code.

[3]: import numpy as np

def new_dataset(df,

student_id,

first_name_start_letter,

4

last_name_start_letter,

ordered_preferences):

"""

Outputs your individual data set.

:param df: A pandas DataFrame that contains the

"understat_per_game_data.csv" data.

:param student_id: An integer or a string (whatever is

easier for you) that corresponds to

your student ID.

:param: first_name_start_letter: A string that contains the first

letter of your first name.

:param: last_name_start_letter: A string that contains the first

letter of your last name.

:param: ordered_preferences: A list that contains your ordered

preferences regarding the different

leagues.

:return: A pandas DataFrame containing an individual subset of the

df DataFrame.

"""

# 1. Check if function inputs are valid

# [...]

# 2. Subset data based on the starting letter of your first name

# [...]

# 3. Subset data based on the starting letter of your last name

np.random.seed(XXX)

# [...]

# 4. Check your first three preferences

# [...]

# 5. Check your last preference

np.random.seed(XXX)

# [...]

# 6. Remove random amount of rows

np.random.seed(XXXXXXXX)

# [...]

return my_dataset # output the new data set.

5

2.2 Task 2: Exploratory Data Analysis & Data Manipulations

2.2.1 Subtask 2.1

After you have received the relevant data set for your work, the first thing Jürgen Klopp wants to

do is to carry out a detailed review of the data properties. He is still a little afraid that outsiders

might have falsified the data so that our analyses lead to false conclusions. Therefore, as good data

analysts, we check whether we have read in the data correctly and whether there are any suspicious

observations. In this regard, write a function that checks the following properties:

1. All columns are formatted as indicated in the data description file (data_description.csv).

2. The data set does not contain any missing values.

If you find missing values, remove the corresponding rows from the data set. Jürgen says that these

instances have most likely been manipulated by an outsider. In general, your function should have

two inputs corresponding to the data set and the data description file, and it should also have two

outputs: A Pandas DataFrame with three columns [ID of the check, short description of the check

and a flag indicating whether the check failed (1) or passed (0)] and the cleaned version of your

data set.

Do not forget to check the function inputs as already done in the previous task.

Please use the structure shown below to set up your function and print out the head of both

outputs. If you do not comment your code, Jürgen will be mad at you and you will not get full

points for this task. Comments in the code are a very important part of programming so that other

programmers can understand your code.

[4]: def cleaned_dataset(df, data_descrip):

"""

Outputs the cleaned version of your individual data set.

:param df: A pandas DataFrame with your individual data.

:param data_descrip: A pandas DataFrame with the data description.

:return: A pandas DataFrame with a summary of the performed checks and

a pandas DataFrame with the cleaned input data.

"""

# 1. Check if function inputs are valid

# [...]

# 2. Creates an empty table for the summary of your checks

# [...]

# 3. Checks if all columns are in the correct format

# [...]

# 4. Checks if the data contains missing values

# [...]

return check_summary, my_dataset # output.

6

2.2.2 Subtask 2.2

Jürgen Klopp is now sure that we have good data quality, but he still thinks that we need to

do more data preparation to facilitate our later analyses. Therefore, he asks you to also include

transformed versions of certain columns in the data set. Within this framework, please do the

following data manipulations:

1. Add another variable to the data set called “h_a_transformed” which corresponds to a nu-

merical representation of the “h_a” variable, i.e. an “h” is decoded as 1 and an “a” as 0.

2. Based on the “date” column, add three more variables to the data set called “year”, “month”

and “match_time”. The first two should correspond to the year and month when the match

took place and the third should represent the time when the match started.

3. Based on the “wins”, “draws”, “loses” and “pts” columns, for each team calculate the total

number of previous wins, draws, loses and points during the respective season and add each

of these variables to the data set. The columns should be named based on the following

structure: “cumulative_wins”, “cumulative_pts”, […].

Please use the structure shown below to set up your code and at the end print out the head of

your data set. If you do not comment your code, Jürgen will be mad at you and you will not get

full points for this task. Comments in the code are a very important part of programming so that

other programmers can understand your code.

[5]: # Transformation of "h_a" column

# [...]

# Transformation of "date" column

# [...]

# Transformation of "wins", "draws", "loses" and "pts" column

# [...]

# Print out the first five rows of your data set

# [...]

2.2.3 Subtask 2.3

Finally, to get a first impression of the distributions of the different numerical variables, Klopp would

like to see their summary statistics. Therefore, print the following information for all numerical

columns in a combined table: Count, Mean, Standard Deviation, Minimum, 25th Quantile, 50th

Quantile, 75th Quantile and Maximum.

At the end print out the head of your data set. If you do not comment your code, Jürgen will

be mad at you and you will not get full points for this task. Comments in the code are a very

important part of programming so that other programmers can understand your code.

7

2.3 Task 3: Performance Analysis on Individual Teams

2.3.1 Subtask 3.1

Jürgen Klopp now thinks that you are ready for your first performance analysis. Because you are

such a talented data analyst, you have offers from all the clubs that are in your data set and you

are in the comfortable position of deciding which club you want to sign with.

After good negotiations with your chosen club, you are happy to start the analysis. The first thing

Jürgen wants to know is how well your team has done in recent seasons, depending on whether

they have played at home or away. He also asks you to assess whether there are some trends in the

summarised data.

Recall that each row of the data set represents the game statistics of a certain football team. Hence,

you have to write a function that subsets the data based on the team you picked and creates a

DataFrame that summarizes the following information grouped by the season and the type of the

game (i.e., home or away):

The count of wins, draws and loses and,

The average goals, expected goals and expected goals against.

Then the function should also create a plot with four line graphs as subplots to visualise the results

and to simplify later interpretation:

The subplots on the left side should contain three lines showing the evolution of the number

of wins, draws and defeats in home and away matches (y-axis) over the seasons (x-axis).

The subplots on the right side should also contain three lines showing the development of

average goals, expected goals and expected goals conceded for home and away matches (y-

axis) over the seasons (x-axis).

Do not forget to include a suitable legend explaining the individual lines.

At the end the plot should look like this (you can vary the colors if you want to):

8

Do not forget to check the function inputs as already done in the previous tasks. Interpret the

results for Jürgen, e.g., the total number of wins at home has increased over the seasons and […].

Please use the structure shown below to set up your function and print out the head of the output.

If you do not comment your code, Jürgen will be mad at you and you will not get full points for this

task. Comments in the code are a very important part of programming so that other programmers

can understand your code.

[6]: import matplotlib.pyplot as plt

def team_performance(df, team_name):

"""

Outputs the summary table for subtask 3.1.

:param df: A pandas DataFrame with your individual data.

:param own_team: A string corresponding to the name of your team.

:return: A pandas DataFrame with the summary table for your team.

"""

# 1. Check if function inputs are valid

#[...]

9

# 2. Compute statistics

# [...]

# 3. Plot the results

# [...]

return summary_tab

# Print out head of summary tab

2.3.2 Subtask 3.2

Jürgen Klopp now has an idea of how well our team played during last seasons, but he also wants

to be prepared for the next game. It will be a regular league match, so you have to pick another

team from the same league. In this context, he asks you to analyse the tactical behaviour of our

next opponent. Based on his experience, Klopp knows that a team usually changes its tactics over

the course of a season, so you need to group the data by the year and month in which the matches

took place. Jürgen also knows that the team managers of our opponent have changed frequently

in the past and therefore only the data of the last available season shows the tactical behaviour of

the current coach. Based on the previous information, write a function that subsets the data and

groups it accordingly, then it should calculate the following quantities:

The average power of pressure, power of opponent’s pressure, deep and deep allowed.

Then create a line plot to visualise the results and to simplify later interpretation:

The plot should contain four lines showing the evolution of the average power of pressure,

power of opponent’s pressure, deep and deep allowed (y-axis) over the season (x-axis).

At the end the plot should look like this (you can pick different colors if you want to):

10

Do not forget to check the function inputs as already done in the previous tasks. Interpret the

results for Jürgen, e.g., over the season the opponent team has increased its power of pressure and

[…].

Please use the structure shown below to set up your function and print out the head of the output.

If you do not comment your code, Jürgen will be mad at you and you will not get full points for this

task. Comments in the code are a very important part of programming so that other programmers

can understand your code.

[7]: def opponent_performance(df, opponent_name, season):

"""

Outputs the summary table for subtask 3.2.

:param df: A pandas DataFrame with your individual data.

:param opponent_name: A string corresponding to the name of your opponent?

?team.

:param season: An integer corresponding to the starting year of the?

?desired season.

:return: A pandas DataFrame with the summary table for the opponent team.

"""

# 1. Check if function inputs are valid

# [...]

# 2. Summarize and prepare the data for plotting

# [...]

# 3. Plot the results

# [...]

return summary_tab

# Print out head of summary table

2.3.3 Subtask 3.3

Finally, Jürgen Klopp asks you for one more favour. He feels that in recent years our team’s

performance and that of our opponents has depended on the starting time of the game, i.e. our

team has won more games in the evening than at noon. For this purpose, you need to write a

function that creates a plot that contains two different bar charts as subplots:

One for our team and

One for our next opponent.

The two plots should present the following information:

On the x-axis should be the distinct starting times of the games and

On the y-axis the count of wins, draws and loses.

11

At the end the plot should look like this (you can pick different colors if you want to):

Do not forget to check the function inputs as already done in the previous tasks. Interpret the

results and check whether Jürgen was right with his feeling, e.g., our opponent normally wins more

games at noon and […].

Please use the structure shown below to set up your function and print out the head of the output.

If you do not comment your code, Jürgen will be mad at you and you will not get full points for this

task. Comments in the code are a very important part of programming so that other programmers

can understand your code.

[8]: def game_time_performance(df, team_name, opponent_name, season):

"""

Outputs the two summary tables for subtask 3.3.

:param df: A pandas DataFrame with your individual data.

:param team_name: A string corresponding to the name of your own team.

:param opponent_name: A string corresponding to the name of your opponent?

team.

:param season: An integer corresponding to the starting year of the?

desired season.

:return: A pandas DataFrame with the summary table for your own team,

a pandas DataFrame with the summary table for the opponent team.

"""

# 1. Check if function inputs are valid

# [...]

# 2. Prepare the data for plotting

# [...]

# 3. Plot the results

12

# [...]

return summary_tab_own, summary_tab_opp

# Print out head of summary tables

2.4 Task 4: Performance Analysis on Individual Leagues

Since our team will play in the Champions League next season, Jürgen Klopp also wants to be

prepared for international matches. Therefore, he is interested in a tool that gives him the same

analysis as in subtasks 3.1 and 3.2, but for a specific league rather than for an individual team. He

wants to use the tool to evaluate the overall strength of the league and to get an idea of what kind

of football is played in this country.

Therefore, you need to write a function for Klopp that receives a data frame and a league name as

inputs and outputs the two summary tables as in subtasks 3.1 and 3.2 for the appropriate league.

Finally, apply the function to all the leagues in your data set and interpret the results. In particular,

try to find differences in the tactical behaviour of the teams in the individual leagues, e.g. in the

Bundesliga the home teams have on average a higher pressure strength than in the other available

leagues.

Do not forget to check the function inputs as already done in the previous tasks. Please use the

structure shown below to set up your function and print out the head of the output. If you do

not comment your code, Jürgen will be mad at you and you will not get full points for this task.

Comments in the code are a very important part of programming so that other programmers can

understand your code.

[9]: def performance_analysis(df, league_name, season):

"""

Outputs the summary tables of subtasks 3.2 and 3.3 for an individual league?

and season.

In the framework of this function also the corresponding plots are printed.

:param df: A pandas DataFrame with your individual data.

:param league_name: A string corresponding to a league name.

:param season: An integer corresponding to the starting year of the

desired season.

:return: A pandas DataFrame with the summary table of subtask 3.2,

a pandas DataFrame with the summary table of subtask 3.3.

"""

# 1. Check if function inputs are valid

# [...]

# 2. Prepare the data for plotting

# [...]

13

# 3. Plot the results

# [...]

return summary_tab_1, summary_tab_2

# Compute and print out results

# [...]

2.5 Task 5: Game Prediction

Ultimately, Jürgen Klopp wants to have a model that predicts whether a team will win a match

based on the available match statistics. His idea is to apply this model during a match so that he

can change the tactical behaviour of our team in the right way. He knows that you are an expert

in linear programming and that you have used the library PuLP a lot. He has heard about the

linear Support Vector Machine and wants to challenge your data analyst skills a little. Therefore,

he gives you the task of calculating such a model for the given data set using CVXPY. CVXPY is

another optimisation library in Python, similar to PuLP, but it is designed for solving general

convex optimisation problems. You might find helpful information about the new library on this

webpage: https://www.cvxpy.org/.

The basic idea of a linear Support Vector Machine (SVM) is to find the maximum-margin hyper-

plane that separates two clases, i.e., wins and no wins, while allwoing for some violations controlled

by a parameter . A graphical representation based on two independet variables can be found

below (Source: Chapter 5 of the Python Data Science Handbook by Jake VanderPlas):

This general idea then can be formulated as a soft-margin classification problem with the following

objective function that we want to minimize:

max(0, 1 (wx )) + ‖w‖2

where ∈ ? represents the total number of games, x ∈ ? a vector containing the ∈ ?

independent variables available for a single game ∈ {1,… , }, ∈ {1,?1} the decoded target

variable (i.e., 1 for a win and -1 for not a win), w ∈ ? the vector containing the parameters for

the independent variables, ∈ ? is the intercept and > 0 is the previously mentioned penalty

14

parameter controlling the trade-off between a larger margin and the correct classification of the

instances.

By introducing individual slack variables = max(0, 1? (wx?)) for each instance the pre-

vioulsy defined objective function can be transformed into the following constrainted minimization

problem

Read the general documentation of CVXPY and solve the optimisation task in such a way that the

team leader gets his tool. Print your solution and the optimal unconstrainted objective function

value. Compare the predictions with the actual values and print the percentage of correctly classified

matches.

Note that predictions can be made by using the sign function:

sign(wx ? )

You should use the following columns as independet variables: h_a_transformed, xG, npxG,

npxGA, deep, deep_allowed, missed, npxGD, ppda_coef, ppda_att, ppda_def, oppda_coef,

oppda_att, oppda_def, xGA_diff, cumulative_wins, cumulative_draws, cumulative_loses and

cumulative_pts. Further, to train your model you should use 80% of the available data that you

randomly select. In addition, you must standardise all variables beforehand, as the Support Vector

Machine is sensitive to the scaling of the data. Therefore, subtract the empirical mean from each

variable and then divide it by the empirical standard deviation. The remaining data should be

used as testing data to check the out-of-sample performance. Do not forget to apply the same

standardization to the testing data.

If you are interested in getting a deeper insight into how the Linear Support Vector Machine works,

take a look at Chapter 5 of the Python Data Science Handbook by Jake VanderPlas.

Please use the structure shown below to set up your code and print out the value of objective

function, the percentage of correctly classified training and testing observations. If you do not

comment your code, Jürgen will be mad at you and you will not get full points for this task.

Comments in the code are a very important part of programming so that other programmers can

understand your code.

Hint: CVXPY needs all inputs as numpy.array. For the C parameter you can try different values

and see which gives you the best objective function value, a good starting point might be values

between (0, 1].

December 1, 2022

1 Football Performance Analysis

1.1 Purpose

The aim of this coursework is to extend your basic Python knowledge and analyse a large data set

consisting of thousands of entries. Within this framework, you will need to use your programming

and analysis skills to calculate some vital statistics for individual football teams or leagues and

visualise the results appropriately. The ultimate challenge is to solve a constrained optimisation

problem to predict the outcome of individual football matches based on the available statistical

data and using CVXPY. All exercises must be performed and presented in a Jupyter notebook.

1.2 Data

The required data set can be found on Blackboard, it is called understat_per_game.csv. In

general, it contains statistical quantities of individual teams per game of each season from 2014 to

2019 for 6 UEFA leagues:

In this framework, there are the following standard parameters:

league - Name of the league,

season - Starting year of the season,

team - Name of the team,

h_a - Flag that indicates a home (“h”) or an away game (“a”),

result - Flag that indicates a win (“w”), a draw (“d”) or a lose (“l”),

date - The date and time at which the game took place,

wins - Flag that indicates that the team won the game,

draws - Flag that indicates that the game was a draw,

loses - Flag that indicates that the team lost the game,

scored - Number of goals scored in the game by the team,

missed - Number of goals missed in the game by the team,

pts - Number of points collected for this game by the team.

Moreover, the following additional metrics are available:

1

xG - Expected goals metric (statistical measure for quality of chances created and conceded),

xG_diff - Difference between actual goals scored and expected goals,

npxG - Expected goals without penalties and own goals,

xGA - Expected goals against,

xGA_diff - Difference between actual goals missed and expected goals against,

npxGA - Expected goals against without penalties and own goals,

npxGD - Difference between “for” and “against” xG without penalties and own goals,

ppda_coef - Passes allowed per defensive action in the opposition half (power of pressure),

oppda_coef - Opponent ppda_coef (power of opponent’s pressure),

deep - Passes completed within an estimated 20 yards of goal (crosses excluded),

deep_allowed - Opponent deep,

xpts - Expected points,

xpts_diff - Difference between actual and expected points.

The data set can be loaded into the Jupyter notebook based on the following code.

[1]: import pandas as pd

understat_per_game_data = pd.read_csv('understat_per_game.csv')

understat_per_game_data.shape

[1]: (24580, 29)

The data set contains around 24,500 rows and 29 columns.

[2]: understat_per_game_data.head(4)

[2]: league season h_a xG xGA npxG npxGA deep \

0 Bundesliga 2014 h 2.57012 1.198420 2.57012 1.198420 5

1 Bundesliga 2014 a 1.50328 1.307950 1.50328 1.307950 10

2 Bundesliga 2014 h 1.22987 0.310166 1.22987 0.310166 13

3 Bundesliga 2014 a 1.03519 0.203118 1.03519 0.203118 6

deep_allowed scored … ppda_coef ppda_att ppda_def oppda_coef \

0 4 2 … 9.625000 231 24 21.850000

1 1 1 … 4.756098 195 41 17.695652

2 3 2 … 5.060606 167 33 16.961538

3 2 0 … 4.423077 115 26 9.446809

oppda_att oppda_def team xG_diff xGA_diff xpts_diff

0 437 20 Bayern Munich 0.57012 0.198420 -0.6514

1 407 23 Bayern Munich 0.50328 0.307950 0.5143

2 441 26 Bayern Munich -0.77013 0.310166 -0.8412

3 444 47 Bayern Munich 1.03519 0.203118 1.1367

[4 rows x 29 columns]

2 Tasks

The coursework is divided into 5 tasks:

2

1. Create your own Data Set.

2. Exploratory Data Analysis and Data Manipulation.

3. Performance Analysis for Individual Teams.

4. Performance Analysis for Individual Leagues.

5. Match Prediction.

Each of these tasks has a weighting of 20% for the final grade.

Within this framework, Jürgen Klopp, as an experienced team manager, has been hired to lead the

entire data analysis project. Therefore, he will give you further instructions for all tasks throughout

the coursework.

2.1 Task 1: Create your own Data Set

Since Jürgen Klopp has decided that only data analysts with certain first letters of the first name

can view the data of a team in a certain league, we will all analyse different subsets. Hence, we

first have to make sure that we have a suitable data set for our analyses. So, depending on the first

letter of your first name (1), you will be assigned to one of the following leagues, whose data you

will need to add to your own data set:

?{{{{

?{{{{?

Bundesliga, if 1 ∈ {,,,}

EPL, if 1 ∈ {, ,,}

La Liga, if 1 ∈ {, ,,}

Ligue 1, if 1 ∈ {,,, }

RFPL, if 1 ∈ {,, , }

Serie A, if 1 ∈ {, ,,, , }

However, since your team will also play internationally next season, Jürgen says that we should

add the data of another league to our data set in order to be able to analyse our future opponents

as well. According to Klopp’s calculations, the first letter of your surname (2) decides the league

of one of the teams you will play against in the group stage. So, depending on the first letter of

your surname, add the data of another league to your own data set:

Serie A, if 2 ∈ {,,,}

Bundesliga, if 2 ∈ {, ,,}

EPL, if 2 ∈ {, ,,}

La Liga, if 2 ∈ {,,, }

Ligue 1, if 2 ∈ {,, , }

RFPL, if 2 ∈ {, ,,, , }

It should be noted that Jürgen Klopp has found some errors in his calculations and it may therefore

happen that you receive the same league in the second allocation as in the first allocation. In this

case, Klopp cannot give you any further information about your future opponents and you will have

to randomly select one of the remaining leagues instead, based on a draw with equal probabilities.

3

When doing so, do not forget to set a seed that corresponds to the last three digits of your student

ID before the random selection.

In addition, Jürgen believes that your personal preferences will also have a great influence on the

division of the teams for the group stage. Therefore, arrange the leagues according to your personal

preferences and save them as a list of strings. In first place should be the league you like best, and

so it goes on to the last league. In this context, check whether your first preference is already part

of the data set. If not, add it to your record. Otherwise, continue with the same check for your

second and third preference. This ensures that your data set now contains the matches of football

clubs from three different leagues. Next, also check whether your last preference is part of the data

set. If so, draw a random number in the interval [0, 1] and round it to zero decimals. Do not forget

to set a seed based on the first three digits of your student ID beforehand. If the rounded number

is a “1”, Klopp says that it is very unlikely that you will have to play against a team from that

league. Therefore, you should remove the data from your record and add the data from one of the

remaining unselected leagues based on a random selection with equal probabilities.

Finally, Klopp believes that some outsiders have manipulated the data in an unfavourable way.

Therefore, he proposes to remove a certain number of rows from the data set to get the final subset

with unmanipulated data entries. In this context, you need to draw a number from the interval

[0, 500] based on a random selection with equal probabilities that tells you how many manipulated

rows you have in your dataset and should therefore remove them. Again, do not forget to set a

seed beforehand that corresponds to your entire student ID. Note that you must also randomise

the indices of the entries you are removing.

Jürgen Klopp expects you to define a function that outputs your final data set. Your function

should have four arguments: The first is a pandas DataFrame, the second is your student ID, the

third and fourth are the first letters of your first and last name respectively and the last is the list

of your ordered preferences. At the beginning of the function, please also check that all the entries

in the function are in the correct format, i.e. check that

df - is a non-epmty pandas DataFrame.

student_id - is a string or an integer consisting of 8 letters or individual numbers.

first_name_start_letter - is a string consisting of 1 letter.

last_name_start_letter - is a string consistig of 1 letter.

ordered_preferences - is a list of six strings.

If at least one of the previously defined checks fails, issue an error message, exist the function and

correct the inputs so that you can successfully apply the function. In this framework, you might

find it helpful to inform yourself about the possibility to raise a TypeError. The final dataset can

be named as mydataset or with a name of your choice.

Please use the structure shown below to set up your function. If you do not comment your code,

Jürgen will be mad at you and you will not get full points for this task. Comments in the code are

a very important part of programming so that other programmers can understand your code.

[3]: import numpy as np

def new_dataset(df,

student_id,

first_name_start_letter,

4

last_name_start_letter,

ordered_preferences):

"""

Outputs your individual data set.

:param df: A pandas DataFrame that contains the

"understat_per_game_data.csv" data.

:param student_id: An integer or a string (whatever is

easier for you) that corresponds to

your student ID.

:param: first_name_start_letter: A string that contains the first

letter of your first name.

:param: last_name_start_letter: A string that contains the first

letter of your last name.

:param: ordered_preferences: A list that contains your ordered

preferences regarding the different

leagues.

:return: A pandas DataFrame containing an individual subset of the

df DataFrame.

"""

# 1. Check if function inputs are valid

# [...]

# 2. Subset data based on the starting letter of your first name

# [...]

# 3. Subset data based on the starting letter of your last name

np.random.seed(XXX)

# [...]

# 4. Check your first three preferences

# [...]

# 5. Check your last preference

np.random.seed(XXX)

# [...]

# 6. Remove random amount of rows

np.random.seed(XXXXXXXX)

# [...]

return my_dataset # output the new data set.

5

2.2 Task 2: Exploratory Data Analysis & Data Manipulations

2.2.1 Subtask 2.1

After you have received the relevant data set for your work, the first thing Jürgen Klopp wants to

do is to carry out a detailed review of the data properties. He is still a little afraid that outsiders

might have falsified the data so that our analyses lead to false conclusions. Therefore, as good data

analysts, we check whether we have read in the data correctly and whether there are any suspicious

observations. In this regard, write a function that checks the following properties:

1. All columns are formatted as indicated in the data description file (data_description.csv).

2. The data set does not contain any missing values.

If you find missing values, remove the corresponding rows from the data set. Jürgen says that these

instances have most likely been manipulated by an outsider. In general, your function should have

two inputs corresponding to the data set and the data description file, and it should also have two

outputs: A Pandas DataFrame with three columns [ID of the check, short description of the check

and a flag indicating whether the check failed (1) or passed (0)] and the cleaned version of your

data set.

Do not forget to check the function inputs as already done in the previous task.

Please use the structure shown below to set up your function and print out the head of both

outputs. If you do not comment your code, Jürgen will be mad at you and you will not get full

points for this task. Comments in the code are a very important part of programming so that other

programmers can understand your code.

[4]: def cleaned_dataset(df, data_descrip):

"""

Outputs the cleaned version of your individual data set.

:param df: A pandas DataFrame with your individual data.

:param data_descrip: A pandas DataFrame with the data description.

:return: A pandas DataFrame with a summary of the performed checks and

a pandas DataFrame with the cleaned input data.

"""

# 1. Check if function inputs are valid

# [...]

# 2. Creates an empty table for the summary of your checks

# [...]

# 3. Checks if all columns are in the correct format

# [...]

# 4. Checks if the data contains missing values

# [...]

return check_summary, my_dataset # output.

6

2.2.2 Subtask 2.2

Jürgen Klopp is now sure that we have good data quality, but he still thinks that we need to

do more data preparation to facilitate our later analyses. Therefore, he asks you to also include

transformed versions of certain columns in the data set. Within this framework, please do the

following data manipulations:

1. Add another variable to the data set called “h_a_transformed” which corresponds to a nu-

merical representation of the “h_a” variable, i.e. an “h” is decoded as 1 and an “a” as 0.

2. Based on the “date” column, add three more variables to the data set called “year”, “month”

and “match_time”. The first two should correspond to the year and month when the match

took place and the third should represent the time when the match started.

3. Based on the “wins”, “draws”, “loses” and “pts” columns, for each team calculate the total

number of previous wins, draws, loses and points during the respective season and add each

of these variables to the data set. The columns should be named based on the following

structure: “cumulative_wins”, “cumulative_pts”, […].

Please use the structure shown below to set up your code and at the end print out the head of

your data set. If you do not comment your code, Jürgen will be mad at you and you will not get

full points for this task. Comments in the code are a very important part of programming so that

other programmers can understand your code.

[5]: # Transformation of "h_a" column

# [...]

# Transformation of "date" column

# [...]

# Transformation of "wins", "draws", "loses" and "pts" column

# [...]

# Print out the first five rows of your data set

# [...]

2.2.3 Subtask 2.3

Finally, to get a first impression of the distributions of the different numerical variables, Klopp would

like to see their summary statistics. Therefore, print the following information for all numerical

columns in a combined table: Count, Mean, Standard Deviation, Minimum, 25th Quantile, 50th

Quantile, 75th Quantile and Maximum.

At the end print out the head of your data set. If you do not comment your code, Jürgen will

be mad at you and you will not get full points for this task. Comments in the code are a very

important part of programming so that other programmers can understand your code.

7

2.3 Task 3: Performance Analysis on Individual Teams

2.3.1 Subtask 3.1

Jürgen Klopp now thinks that you are ready for your first performance analysis. Because you are

such a talented data analyst, you have offers from all the clubs that are in your data set and you

are in the comfortable position of deciding which club you want to sign with.

After good negotiations with your chosen club, you are happy to start the analysis. The first thing

Jürgen wants to know is how well your team has done in recent seasons, depending on whether

they have played at home or away. He also asks you to assess whether there are some trends in the

summarised data.

Recall that each row of the data set represents the game statistics of a certain football team. Hence,

you have to write a function that subsets the data based on the team you picked and creates a

DataFrame that summarizes the following information grouped by the season and the type of the

game (i.e., home or away):

The count of wins, draws and loses and,

The average goals, expected goals and expected goals against.

Then the function should also create a plot with four line graphs as subplots to visualise the results

and to simplify later interpretation:

The subplots on the left side should contain three lines showing the evolution of the number

of wins, draws and defeats in home and away matches (y-axis) over the seasons (x-axis).

The subplots on the right side should also contain three lines showing the development of

average goals, expected goals and expected goals conceded for home and away matches (y-

axis) over the seasons (x-axis).

Do not forget to include a suitable legend explaining the individual lines.

At the end the plot should look like this (you can vary the colors if you want to):

8

Do not forget to check the function inputs as already done in the previous tasks. Interpret the

results for Jürgen, e.g., the total number of wins at home has increased over the seasons and […].

Please use the structure shown below to set up your function and print out the head of the output.

If you do not comment your code, Jürgen will be mad at you and you will not get full points for this

task. Comments in the code are a very important part of programming so that other programmers

can understand your code.

[6]: import matplotlib.pyplot as plt

def team_performance(df, team_name):

"""

Outputs the summary table for subtask 3.1.

:param df: A pandas DataFrame with your individual data.

:param own_team: A string corresponding to the name of your team.

:return: A pandas DataFrame with the summary table for your team.

"""

# 1. Check if function inputs are valid

#[...]

9

# 2. Compute statistics

# [...]

# 3. Plot the results

# [...]

return summary_tab

# Print out head of summary tab

2.3.2 Subtask 3.2

Jürgen Klopp now has an idea of how well our team played during last seasons, but he also wants

to be prepared for the next game. It will be a regular league match, so you have to pick another

team from the same league. In this context, he asks you to analyse the tactical behaviour of our

next opponent. Based on his experience, Klopp knows that a team usually changes its tactics over

the course of a season, so you need to group the data by the year and month in which the matches

took place. Jürgen also knows that the team managers of our opponent have changed frequently

in the past and therefore only the data of the last available season shows the tactical behaviour of

the current coach. Based on the previous information, write a function that subsets the data and

groups it accordingly, then it should calculate the following quantities:

The average power of pressure, power of opponent’s pressure, deep and deep allowed.

Then create a line plot to visualise the results and to simplify later interpretation:

The plot should contain four lines showing the evolution of the average power of pressure,

power of opponent’s pressure, deep and deep allowed (y-axis) over the season (x-axis).

At the end the plot should look like this (you can pick different colors if you want to):

10

Do not forget to check the function inputs as already done in the previous tasks. Interpret the

results for Jürgen, e.g., over the season the opponent team has increased its power of pressure and

[…].

Please use the structure shown below to set up your function and print out the head of the output.

If you do not comment your code, Jürgen will be mad at you and you will not get full points for this

task. Comments in the code are a very important part of programming so that other programmers

can understand your code.

[7]: def opponent_performance(df, opponent_name, season):

"""

Outputs the summary table for subtask 3.2.

:param df: A pandas DataFrame with your individual data.

:param opponent_name: A string corresponding to the name of your opponent?

?team.

:param season: An integer corresponding to the starting year of the?

?desired season.

:return: A pandas DataFrame with the summary table for the opponent team.

"""

# 1. Check if function inputs are valid

# [...]

# 2. Summarize and prepare the data for plotting

# [...]

# 3. Plot the results

# [...]

return summary_tab

# Print out head of summary table

2.3.3 Subtask 3.3

Finally, Jürgen Klopp asks you for one more favour. He feels that in recent years our team’s

performance and that of our opponents has depended on the starting time of the game, i.e. our

team has won more games in the evening than at noon. For this purpose, you need to write a

function that creates a plot that contains two different bar charts as subplots:

One for our team and

One for our next opponent.

The two plots should present the following information:

On the x-axis should be the distinct starting times of the games and

On the y-axis the count of wins, draws and loses.

11

At the end the plot should look like this (you can pick different colors if you want to):

Do not forget to check the function inputs as already done in the previous tasks. Interpret the

results and check whether Jürgen was right with his feeling, e.g., our opponent normally wins more

games at noon and […].

Please use the structure shown below to set up your function and print out the head of the output.

If you do not comment your code, Jürgen will be mad at you and you will not get full points for this

task. Comments in the code are a very important part of programming so that other programmers

can understand your code.

[8]: def game_time_performance(df, team_name, opponent_name, season):

"""

Outputs the two summary tables for subtask 3.3.

:param df: A pandas DataFrame with your individual data.

:param team_name: A string corresponding to the name of your own team.

:param opponent_name: A string corresponding to the name of your opponent?

team.

:param season: An integer corresponding to the starting year of the?

desired season.

:return: A pandas DataFrame with the summary table for your own team,

a pandas DataFrame with the summary table for the opponent team.

"""

# 1. Check if function inputs are valid

# [...]

# 2. Prepare the data for plotting

# [...]

# 3. Plot the results

12

# [...]

return summary_tab_own, summary_tab_opp

# Print out head of summary tables

2.4 Task 4: Performance Analysis on Individual Leagues

Since our team will play in the Champions League next season, Jürgen Klopp also wants to be

prepared for international matches. Therefore, he is interested in a tool that gives him the same

analysis as in subtasks 3.1 and 3.2, but for a specific league rather than for an individual team. He

wants to use the tool to evaluate the overall strength of the league and to get an idea of what kind

of football is played in this country.

Therefore, you need to write a function for Klopp that receives a data frame and a league name as

inputs and outputs the two summary tables as in subtasks 3.1 and 3.2 for the appropriate league.

Finally, apply the function to all the leagues in your data set and interpret the results. In particular,

try to find differences in the tactical behaviour of the teams in the individual leagues, e.g. in the

Bundesliga the home teams have on average a higher pressure strength than in the other available

leagues.

Do not forget to check the function inputs as already done in the previous tasks. Please use the

structure shown below to set up your function and print out the head of the output. If you do

not comment your code, Jürgen will be mad at you and you will not get full points for this task.

Comments in the code are a very important part of programming so that other programmers can

understand your code.

[9]: def performance_analysis(df, league_name, season):

"""

Outputs the summary tables of subtasks 3.2 and 3.3 for an individual league?

and season.

In the framework of this function also the corresponding plots are printed.

:param df: A pandas DataFrame with your individual data.

:param league_name: A string corresponding to a league name.

:param season: An integer corresponding to the starting year of the

desired season.

:return: A pandas DataFrame with the summary table of subtask 3.2,

a pandas DataFrame with the summary table of subtask 3.3.

"""

# 1. Check if function inputs are valid

# [...]

# 2. Prepare the data for plotting

# [...]

13

# 3. Plot the results

# [...]

return summary_tab_1, summary_tab_2

# Compute and print out results

# [...]

2.5 Task 5: Game Prediction

Ultimately, Jürgen Klopp wants to have a model that predicts whether a team will win a match

based on the available match statistics. His idea is to apply this model during a match so that he

can change the tactical behaviour of our team in the right way. He knows that you are an expert

in linear programming and that you have used the library PuLP a lot. He has heard about the

linear Support Vector Machine and wants to challenge your data analyst skills a little. Therefore,

he gives you the task of calculating such a model for the given data set using CVXPY. CVXPY is

another optimisation library in Python, similar to PuLP, but it is designed for solving general

convex optimisation problems. You might find helpful information about the new library on this

webpage: https://www.cvxpy.org/.

The basic idea of a linear Support Vector Machine (SVM) is to find the maximum-margin hyper-

plane that separates two clases, i.e., wins and no wins, while allwoing for some violations controlled

by a parameter . A graphical representation based on two independet variables can be found

below (Source: Chapter 5 of the Python Data Science Handbook by Jake VanderPlas):

This general idea then can be formulated as a soft-margin classification problem with the following

objective function that we want to minimize:

max(0, 1 (wx )) + ‖w‖2

where ∈ ? represents the total number of games, x ∈ ? a vector containing the ∈ ?

independent variables available for a single game ∈ {1,… , }, ∈ {1,?1} the decoded target

variable (i.e., 1 for a win and -1 for not a win), w ∈ ? the vector containing the parameters for

the independent variables, ∈ ? is the intercept and > 0 is the previously mentioned penalty

14

parameter controlling the trade-off between a larger margin and the correct classification of the

instances.

By introducing individual slack variables = max(0, 1? (wx?)) for each instance the pre-

vioulsy defined objective function can be transformed into the following constrainted minimization

problem

Read the general documentation of CVXPY and solve the optimisation task in such a way that the

team leader gets his tool. Print your solution and the optimal unconstrainted objective function

value. Compare the predictions with the actual values and print the percentage of correctly classified

matches.

Note that predictions can be made by using the sign function:

sign(wx ? )

You should use the following columns as independet variables: h_a_transformed, xG, npxG,

npxGA, deep, deep_allowed, missed, npxGD, ppda_coef, ppda_att, ppda_def, oppda_coef,

oppda_att, oppda_def, xGA_diff, cumulative_wins, cumulative_draws, cumulative_loses and

cumulative_pts. Further, to train your model you should use 80% of the available data that you

randomly select. In addition, you must standardise all variables beforehand, as the Support Vector

Machine is sensitive to the scaling of the data. Therefore, subtract the empirical mean from each

variable and then divide it by the empirical standard deviation. The remaining data should be

used as testing data to check the out-of-sample performance. Do not forget to apply the same

standardization to the testing data.

If you are interested in getting a deeper insight into how the Linear Support Vector Machine works,

take a look at Chapter 5 of the Python Data Science Handbook by Jake VanderPlas.

Please use the structure shown below to set up your code and print out the value of objective

function, the percentage of correctly classified training and testing observations. If you do not

comment your code, Jürgen will be mad at you and you will not get full points for this task.

Comments in the code are a very important part of programming so that other programmers can

understand your code.

Hint: CVXPY needs all inputs as numpy.array. For the C parameter you can try different values

and see which gives you the best objective function value, a good starting point might be values

between (0, 1].