AM11 Individual Assignment 2024

Predict whether a U$ - PRICE INDEX will have a positive return the next day, treating the task as a classification problem.

# -----------------------------------------------------------------------------

CLASS LABELS, y (stored in 'target.pkl'):

buy (+1 i.e. positive return), no trade (0)

The label has been shifted 1 day ahead, i.e. the provided label y (from 'target.pkl') is the result of the operation: y = y_original.shift(-1)

Example: for date 2000-01-03, the corresponding label indicates what happened to the Index the next day, i.e. +1 means that on the next day, 2000-01-04, the index went up. This label is stored against 2000-01-03 so that a day-ahead prediction can be learned.

Summary: you do not need to shift the label, both X and y are already aligned.
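The alignment described above can be illustrated on toy data. This is only a sketch: the return values are invented for illustration, and the way `y_original` is built from returns is an assumption, since the brief only states that the shift was applied.

```python
import pandas as pd

# Hypothetical toy data: index returns on five consecutive business days.
dates = pd.bdate_range("2000-01-03", periods=5)
returns = pd.Series([-0.004, 0.010, 0.007, -0.002, 0.001], index=dates)

# Same-day label (assumed construction): 1 if the index went up that day, else 0.
y_original = (returns > 0).astype(int)

# Shift by -1 so that each date stores the outcome of the NEXT day.
y = y_original.shift(-1)

# The last date has no "next day" in the sample, so its label is NaN.
print(y)
```

Here the label stored against 2000-01-03 reflects the return on 2000-01-04, which is why X and y can be used together without any further shifting.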

# -----------------------------------------------------------------------------

DATA (X, stored in 'design_matrix.pkl'):

The design matrix (X) is provided for you & contains TIME SERIES DATA:

•   returns of world indices

•   returns of commodity indices

•   global interest rates

•   bonds

•   technical indicators data

•   dummy variables

The data does not contain any features representing individual stocks.

The data has already been cleaned: there are no missing values, and price data has been converted to returns to make it stationary (a necessary requirement when making predictions on financial data).

Any column that contains only 0 or 1 is a dummy variable.
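A minimal sketch of how the dummy columns could be identified with pandas, following the rule above. The helper name `find_dummy_columns` and the toy column names are hypothetical. Note that a column containing only zeros would also be flagged by this rule.

```python
import pandas as pd

def find_dummy_columns(X: pd.DataFrame) -> list:
    """Return the names of columns whose values are all 0 or 1."""
    return [col for col in X.columns if X[col].isin([0, 1]).all()]

# Hypothetical toy design matrix (the real feature names are anonymised).
X_demo = pd.DataFrame({
    "f1": [0.01, -0.02, 0.005],   # continuous feature
    "f2": [1, 0, 1],              # dummy stored as integers
    "f3": [0.0, 1.0, 0.0],        # dummy stored as floats
    "f4": [0, 2, 1],              # not a dummy: contains a 2
})

dummy_cols = find_dummy_columns(X_demo)
continuous_cols = [c for c in X_demo.columns if c not in dummy_cols]
print(dummy_cols, continuous_cols)
```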

Features are in different units.

Note that the feature names have been anonymised because this is a real-life dataset used for a trading algorithm.

# -----------------------------------------------------------------------------

TASKS: Carry out the following steps for your systematic algorithm analysis:

1) Split the data into train, validate, test sets

- Keep the LAST 3 months of data for TEST data: i.e. 65 most recent days

- Split the rest into 80% TRAIN (1st part of the data) and 20% VALIDATION (the remaining part of the data, excluding the test data)

Think carefully about the kind of data you're working with: TIME SERIES DATA. Will your data need to be randomly shuffled during the train/val/test split?
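One way to sketch the split in step 1 is with positional slicing, which keeps the rows in time order and involves no shuffling. The function name `chronological_split` and the toy data are hypothetical. For the real data one would presumably load the files first, e.g. with `pd.read_pickle`, assuming they hold pandas objects sorted by date.

```python
import numpy as np
import pandas as pd

def chronological_split(X, y, n_test=65, train_frac=0.8):
    """Split time-ordered data into train/validation/test without shuffling."""
    n_rest = len(X) - n_test              # rows left after holding out the test set
    n_train = int(n_rest * train_frac)    # first 80% of the remaining rows
    train = (X.iloc[:n_train], y.iloc[:n_train])
    val = (X.iloc[n_train:n_rest], y.iloc[n_train:n_rest])
    test = (X.iloc[n_rest:], y.iloc[n_rest:])
    return train, val, test

# Hypothetical stand-in for the real design matrix and target.
idx = pd.bdate_range("2020-01-01", periods=1065)
X_toy = pd.DataFrame(np.arange(1065 * 2).reshape(1065, 2), index=idx)
y_toy = pd.Series(np.arange(1065) % 2, index=idx)

(X_train, y_train), (X_val, y_val), (X_test, y_test) = chronological_split(X_toy, y_toy)
print(len(X_train), len(X_val), len(X_test))
```

Because the slices are contiguous, every training date precedes every validation date, and every validation date precedes every test date.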

2) You will need to decide between competing models, i.e. which one should be used for your trading decisions: Support Vector Machine (SVM), Artificial Neural Network (ANN), Recurrent Neural Network (RNN), or Convolutional Neural Network (CNN). You will also need to tune each model's hyperparameters using validation data.

You should also decide whether to use the data in its original form or to reduce it using PCA first. You may wish to investigate whether a particular window size of the data is best for training the trading model (e.g. instead of using the full 20 years' worth of data, should you use 10, 5, 2 etc.?). Based on the performance on the validation data, you should select the best data preprocessing (original vs PCA dimensionally reduced features) and the best model. Subsequently, find the final test performance on the unseen test data set. Suggested steps:

2a) Run each classifier to observe validation data performance:

- tune the hyperparameters of the classifiers of your choice (at minimum obtain 2 competing models: an SVM model and a Neural Network of your choice) on the validation data set.

- Think carefully about which measure you will base your optimal hyperparameter selection on. When putting on a trade we DO NOT WANT TO LOSE MONEY. At the same time, if there are 125 trading opportunities in a period of time (say a year), we do not want to trade on only, say, 10 of them, as the algorithm would be quite stagnant; we want to recognise a reasonable number of trades correctly. The balance between recognising trades correctly and not losing money is a fine one and needs to be considered.
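One possible way to express this trade-off, offered as an assumption rather than as the required measure, is to look at the precision of the buy class (the share of trades taken that were winners), its recall (the share of opportunities that were taken), and an F-beta score with beta below 1, which weights precision more heavily. The helper `buy_metrics` and the choice `beta=0.5` are illustrative only.

```python
import numpy as np

def buy_metrics(y_true, y_pred, beta=0.5):
    """Precision, recall and F-beta for the buy class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # trades that made money
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # trades that did not
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed opportunities
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta**2 * precision + recall
    fbeta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "fbeta": fbeta, "n_trades": tp + fp}

# Hypothetical validation labels and predictions.
scores = buy_metrics([1, 1, 1, 0, 0, 1], [1, 0, 1, 1, 0, 0])
print(scores)
```

Reporting `n_trades` alongside the score makes it easy to spot a model that scores well only because it almost never trades.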

2b) Dimensionality of your Design Matrix is rather high (>500 features). It may potentially benefit the classifier if the dimensionality of your data is reduced.

Investigate whether it is better to use the original data or the PCA reduced data.

Decide which features should be reduced with PCA and obtain the scores:

- are there any columns with many 0s? Will these be helpful to the PCA? If not, will you remove them altogether, or do you think they’ll be useful for the classifier?

- plot the correlation matrix as a heatmap to decide if PCA will help. Report your findings from observing the correlation matrix in a docstring in your Python script.

- should data be centered ahead of PCA on this data set?

- should data be scaled ahead of PCA on this data set?
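As a numeric companion to the checks above, the sketch below computes the share of zeros per column and the share of feature pairs whose absolute correlation exceeds a threshold. The helper name, the toy columns and the 0.8 threshold are assumptions. The heatmap itself could then be drawn from `corr` with a plotting library of your choice.

```python
import numpy as np
import pandas as pd

def correlation_summary(X: pd.DataFrame, threshold=0.8):
    """Return the correlation matrix and the share of highly correlated pairs."""
    corr = X.corr().to_numpy()
    upper = corr[np.triu_indices_from(corr, k=1)]  # each pair counted once
    share_high = float(np.mean(np.abs(upper) > threshold))
    return corr, share_high

# Hypothetical toy features: "b" nearly duplicates "a", "c" is independent,
# "d" is a sparse dummy that is zero most of the time.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_demo = pd.DataFrame({
    "a": base,
    "b": base + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
    "d": (rng.random(200) < 0.05).astype(int),
})

zero_share = (X_demo == 0).mean()          # fraction of zeros in each column
corr, share_high = correlation_summary(X_demo)
print(zero_share)
print(share_high)
```

A large share of strongly correlated pairs suggests redundancy that PCA can compress, whereas mostly weak correlations suggest PCA will need many components to retain the variance.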

Using variance analysis, decide how many PCs to keep. Note that you are free to use as many PCs as you see fit, e.g. you may wish to keep more or fewer PCs than the 60-80% rule of thumb we discussed in class.

Obtain a new design matrix based on joining features from:

- the scores obtained from the PCA

- any original design matrix's features you may have decided to set aside (and not process through the PCA but keep for analysis)
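The PCA steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the prescribed method: the data are centred and scaled using statistics from the training rows only, PCA is done via an SVD, the number of components is chosen by a cumulative explained-variance target, and the scores are joined with a dummy column that was set aside. The function names, the toy data and the 0.9 target are all hypothetical.

```python
import numpy as np

def fit_pca(X_train, var_target=0.8):
    """Fit centring, scaling and PCA on the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0                        # guard against constant columns
    Z = (X_train - mean) / std
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)            # variance ratio of each PC
    # Smallest number of PCs whose cumulative variance reaches the target.
    k = int(np.searchsorted(np.cumsum(explained), var_target)) + 1
    k = min(k, len(s))
    return {"mean": mean, "std": std, "components": vt[:k], "explained": explained}

def pca_scores(model, X):
    """Project rows of X onto the retained principal components."""
    return ((X - model["mean"]) / model["std"]) @ model["components"].T

# Hypothetical toy data: six continuous features driven by two latent factors,
# plus one dummy column that is kept out of the PCA.
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 2))
loadings = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
X_cont = factors @ loadings + 0.05 * rng.normal(size=(300, 6))
dummy = rng.integers(0, 2, size=(300, 1))

model = fit_pca(X_cont[:240], var_target=0.9)  # first 240 rows play the role of TRAIN
scores = pca_scores(model, X_cont)             # same transform applied to all rows
X_new = np.hstack([scores, dummy])             # PCA scores joined with the dummy
print(model["components"].shape, X_new.shape)
```

Fitting the mean, scale and components on the training rows and then reusing them for validation and test rows avoids leaking information from later periods into the transform.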

2c) You have been given more than 20 years' worth of data. You may wish to investigate which window of time is best suited for the prediction task using the validation data set (e.g. should you use the 10 most recent years for prediction? 5? 2? etc.).
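A minimal sketch of how a training window could be selected, assuming the training data carry a `DatetimeIndex`. The helper `last_n_years` and the toy data are hypothetical. Keeping the validation set fixed while only the training window changes is one way to make the windows comparable.

```python
import pandas as pd

def last_n_years(X_train, y_train, years):
    """Keep only the most recent `years` of the training data."""
    cutoff = X_train.index.max() - pd.DateOffset(years=years)
    mask = X_train.index > cutoff
    return X_train.loc[mask], y_train.loc[mask]

# Hypothetical 20-year training set on business days.
idx = pd.bdate_range("2000-01-03", "2019-12-31")
X_train_demo = pd.DataFrame({"f1": range(len(idx))}, index=idx)
y_train_demo = pd.Series([i % 2 for i in range(len(idx))], index=idx)

windows = {}
for years in (10, 5, 2):
    X_w, y_w = last_n_years(X_train_demo, y_train_demo, years)
    windows[years] = (X_w, y_w)   # fit and score each candidate model here
    print(years, len(X_w))
```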

3) Once you choose your optimal data (which duration of dataset worked best, if you chose to examine this, whether you will use PCA reduced data or not, etc.):

- using your best model (with its optimal hyperparameters), combine the train and validation set into a single set and re-train the model using this train_val data.

- Test the performance of your classifier on the unseen test data.

- For the best chosen model, report in a docstring: train, validate (for best hyperparameters), and test set results. Show the classification report such that accuracy, precision and recall of your classifier are visible.
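The final step might look like the following sketch with scikit-learn, using an SVM as the example model. The random stand-in data and the values in `best_params` are placeholders, not results; in practice the hyperparameters would be the ones selected on the validation data.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in data: 500 time-ordered rows, 10 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

n_test = 65
X_train_val, y_train_val = X[:-n_test], y[:-n_test]   # train + validation combined
X_test, y_test = X[-n_test:], y[-n_test:]             # most recent 65 rows

# Placeholder hyperparameters (assumed, not tuned).
best_params = {"C": 1.0, "kernel": "rbf", "gamma": "scale"}

# Scaling lives inside the pipeline, so it is fitted on train_val only.
final_model = make_pipeline(StandardScaler(), SVC(**best_params))
final_model.fit(X_train_val, y_train_val)

y_pred = final_model.predict(X_test)
report = classification_report(
    y_test, y_pred, target_names=["no trade (0)", "buy (+1)"]
)
print(report)
```

The printed report shows accuracy together with per-class precision and recall, and can be pasted into a docstring next to the code.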

# -----------------------------------------------------------------------------

INSTRUCTIONS:

- You must document your code at all times

- Feel free to use GPT, but be aware that plain copy-paste will not give you marks. You must understand your code and demonstrate an understanding of how to select among competing models, and of the principles of building machine learning algorithms.

- Be sure to provide results at each stage of your analysis as a docstring (i.e. simply copy-paste the output of your code into a docstring next to significant sections where you obtain results/calculate variables, e.g. what optimal hyperparameters were learnt for each classifier?). This makes grading easier.

- If you are using a random number generator in any of your work, be sure to fix its seed so that your results are reproducible during marking.
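A minimal sketch of fixing the seeds for Python's `random` module and NumPy's global generator. The helper name `set_seed` and the seed value 42 are arbitrary choices. Deep learning frameworks keep their own generators and would need to be seeded separately (for example `torch.manual_seed` in PyTorch or `tf.random.set_seed` in TensorFlow, if those libraries are used).

```python
import random
import numpy as np

SEED = 42  # arbitrary value; any fixed integer works

def set_seed(seed=SEED):
    """Seed the Python and NumPy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)

set_seed()
first_draw = np.random.rand(3)

set_seed()
second_draw = np.random.rand(3)

# Re-seeding reproduces exactly the same numbers.
print(np.array_equal(first_draw, second_draw))
```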

GRADING:

- We will look at your methodology and demonstration of understanding of the techniques used.

- We will pay attention to the performance of your model; however, it won’t be the main factor in deciding whether you have done a good job. A thorough understanding of how to set up machine learning model selection, and of how to use the validation data to find the best hyperparameters, will have higher priority.



