F20ML 2020-21: Coursework 2 [25%]
In this coursework, we will look at three different topics: effectively preprocessing and feature
engineering a text corpus, implementing the Perceptron Algorithm and implementing the
Gradient Descent algorithm for a Linear Classifier in order to perform binary classification.
The task we will attempt is sentiment classification of sentences from movie reviews, using a
classic corpus by Pang and Lee (2004) [1].
The coursework consists of two parts: i) this description, ii) an accompanying folder with
the dataset and several .py files containing the necessary scaffolding for you to implement
the missing parts in order to answer the questions below. You will NOT have to submit a
document report in .pdf or .doc but instead you will enter your responses (upload the
completed .py files in a zipped folder and a couple of plot images) in the
corresponding Test created on Vision. Read the instructions in the last Section
‘Deliverables’ below for more details.
Note: We will use the existing implementations for text preprocessing and vectorisation from
the Python library scikit-learn, and we will implement the Perceptron and Gradient Descent
Algorithms from scratch.
Introduction
You have just started working in the advertisement department of a media conglomerate as
a Data Science Consultant. One of your team’s jobs is to sieve through reviews written by
users about TV shows, movies, and radio broadcasts on your online platform, and moderate
them based on polarity (positive/negative sentiment). So far the team has been doing this
job...manually! :(
The data scientist in you cannot tolerate this anymore…
You decide to take some action and build a classifier that reads each review sentence by
sentence and assigns a positive or negative class. Hm… that sounds familiar! You
immediately think of one of those classic datasets from Pang and Lee on sentiment
classification (movie reviews from rotten tomatoes) and download it.
...you have already started your favourite IDE and have imported scikit-learn and numpy!
[1] Bo Pang and Lillian Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. Proceedings of the ACL, 2004.
Structure of the Code
We strongly suggest you create a project via Spyder IDE or PyCharm. The structure of the
code should look like this:
a) The main entry point (where the __main__ block of code is implemented) is inside
sentiment_classifier.py. There we get to load the various dataset splits via
load_dataset(), create different vector representations of our inputs via
compute_features(), create an instance of our own perceptron algorithm
implementation (Perceptron class in models/perceptron.py), train via
Perceptron.train() and predict via Perceptron.predict() with it, as well
as create an instance of our own gradient descent algorithm for a linear model using
the squared loss and L2-norm regulariser, train via GD.train() and predict via
GD.predict().
b) The dataset reader class ReviewsDataset is inside
readers/reviews_dataset.py. This is very similar to (actually simpler than) what
we’ve seen in the previous coursework.
c) Our implementation of the perceptron algorithm goes inside the class Perceptron
of models/perceptron.py (surprise, surprise!). You will need to finish the
implementation of Perceptron.train() and Perceptron.predict(); Part II explains how.
d) Our implementation of the Gradient Descent Algorithm goes inside the class GD of
models/gd.py. You will need to finish the implementation of GD.train() and
GD.predict(). It’s going to be very similar to Perceptron; Part III explains how.
Part I - Feature Engineering [7 marks]
The dataset (train/dev/test splits) should already be in place inside the source code project
directory data/. A short description of the dataset can be found here.
Load the data inside __main__ in sentiment_classifier.py using the dataset loader
provided for you (the function load_dataset() in sentiment_classifier.py). Have a
look at the object structure by printing out a few examples, by inspecting with the
debugger, and by looking at the code (ReviewsDataset in readers/reviews_dataset.py);
a small inspection snippet follows the note below. Each split should consist of three
arrays: X, X_truecased and y.
- X is a list of examples. Each example is itself a list containing the words of one
sentence, lower-cased.
- X_truecased is identical to X but the sentences have been “true-cased”
automatically, i.e., we have tried to recover the original case of words (e.g., beginning
of sentence, proper names, etc).
- y is a list of labels [positive, negative], one for each example.
Note: The dataset has already been tokenised by its authors, so you don’t need to use
sent_tokenize() from NLTK.
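For illustration, here is a minimal inspection snippet. This is a sketch only: the assumption that load_dataset() returns one ReviewsDataset object per split, and the names train/dev/test, may not match the scaffolding exactly.

# Hypothetical snippet inside __main__ of sentiment_classifier.py.
# Assumes load_dataset() returns one ReviewsDataset per split; adapt the
# names to whatever the scaffolding actually returns.
train, dev, test = load_dataset()
print(len(train.X), "training examples")
print(train.X[0])            # list of lower-cased tokens of the first sentence
print(train.X_truecased[0])  # the same sentence with original casing recovered
print(train.y[0])            # 'positive' or 'negative'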
OK! Let’s start thinking of good feature representations!
We will do this in two steps: encoding features, and more advanced representation features.
Note: We will be using the terms features and feature templates interchangeably; you can
think of the latter as a “function” that generates a family of similar-looking features
(e.g., unigram occurrence is a feature template that results in a feature vector of
vocabulary size, with binary values for the words (unigrams) present in a sentence).
One standard process (based on our intuition and expectation of what each feature
should contribute) is the following (use MultinomialNB as your main classifier
without any parameters; this is an implementation of Naive Bayes that is known to work
very well for text classification tasks such as this one):
for every feature template in list_of_features do
    create feature vector representation
    train classifier -> predict on dev -> compute F-1 score
choose best feature template based on F-1 score
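As a concrete illustration, here is a minimal sketch of that loop with scikit-learn. Joining each token list back into a single string, the split names (train, dev) and the pos_label value are assumptions about the data; the vectoriser settings below are placeholders, not the required templates.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# The dataset stores each example as a list of tokens; the vectorisers expect
# strings, so one option is to join the tokens back together.
train_texts = [" ".join(tokens) for tokens in train.X]
dev_texts = [" ".join(tokens) for tokens in dev.X]

feature_templates = {
    "binary unigrams (baseline)": CountVectorizer(binary=True),
    "term frequencies": CountVectorizer(),
    "tf-idf": TfidfVectorizer(),
}

for name, vectorizer in feature_templates.items():
    train_vecs = vectorizer.fit_transform(train_texts)  # fit vocabulary on train only
    dev_vecs = vectorizer.transform(dev_texts)           # reuse that vocabulary
    clf = MultinomialNB().fit(train_vecs, train.y)
    preds = clf.predict(dev_vecs)
    print(name, f1_score(dev.y, preds, pos_label="positive"))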
As you might realise, this seems a reasonable strategy, but it doesn’t take into account
combinations of feature templates. This is where your intuition should come into play.
Which feature templates should you try to combine (i.e., apply in your Pipeline together)
and expect to get an increase in F-1 score?
Of course, you could try all of the combinations in a Grid Search fashion (using
GridSearchCV). But that could take several hours...
Note: What we recommend and ask you to implement is the bare minimum. However, this
part is a bit more research-y, so feel free to experiment with anything else that comes to
mind and you feel capable of implementing!
[implement compute_features() in sentiment_classifier.py]
Encoding Features
1. Implement each of the following feature templates; you should only need to use
CountVectorizer and TfidfVectorizer (check Week 5’s Lab Sheet and the
documentation for how to use and set the corresponding parameters). Report
directly on the Vision test (Coursework 2) the Accuracy, provide a very
brief justification of why you think the Accuracy changed, and comment on any
feature combinations you try out. The bare minimum you are requested to try
are the templates in (a-e); a configuration sketch follows the list below.
Notes:
- The first step is the simplest vector representation and will be considered
our baseline, i.e., the point of reference for all future experiments. Hopefully,
everything else should (in theory) have a higher score. If not, either our
intuition didn’t work (don’t be surprised how often this happens), or we have a
bug! Don’t worry about the former case, but obviously double-check that you
don’t have a bug first, before moving on.
- You will also have to decide on combinations of features, i.e., which feature
templates to leave “on” before going to the next one. Don’t worry about
making an exhaustive search; use your intuition. Of course, you will notice
that some of the feature templates are mutually exclusive.
a. Represent only word occurrences with binary values (baseline)
b. Remove stop words (these are the most common words in the English
language).
c. Represent term frequencies of words.
Another refinement on top of term frequencies is to downscale weights for words that
occur in many examples in the corpus and are therefore less informative than those
that occur only in a smaller portion of the corpus. This downscaling is called tf–idf for
“Term Frequency times Inverse Document Frequency”.
d. Represent words using tf-idf.
Sometimes using low-frequency words can hurt the performance of our classifier, as
there aren’t many examples that contain them. The solution is to completely
remove them from our vocabulary before creating any features. The question to ask
is: which ones should we remove? A common technique is to apply a threshold on
the number of occurrences and remove any word whose count does not exceed it. E.g.,
if we set threshold=1 we will remove all words from our vocabulary that occur only once.
e. Apply a threshold to low-frequency words. Also, report the size of the
vocabulary that remains. Consider the threshold as a hyperparameter. Hint:
Since this is a fairly small dataset don’t try too big thresholds!
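A minimal configuration sketch for templates (a-e) is shown below. The specific parameter values (e.g. min_df=2) are illustrative only; note that min_df thresholds the number of documents a word appears in, which is close to, but not exactly, the raw-occurrence threshold described above.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

templates = {
    "a) binary unigrams (baseline)": CountVectorizer(binary=True),
    "b) binary, stop words removed": CountVectorizer(binary=True, stop_words="english"),
    "c) term frequencies":           CountVectorizer(),
    "d) tf-idf":                     TfidfVectorizer(),
    "e) frequency threshold":        CountVectorizer(binary=True, min_df=2),
}

# For template (e), the remaining vocabulary size can be reported with:
#   vec = templates["e) frequency threshold"].fit(train_texts)
#   print(len(vec.vocabulary_))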
Representation Features
2. Most likely the following feature templates should be used in combination with the
ones you tried above. Because, by design, they attempt to capture more
“context”, in principle they should add to our overall performance. However, since
they increase the feature space immensely, you might not have enough data in the
end to attain the results you want. How will you find out? By reporting the
Accuracy, providing a small justification of why you think the Accuracy changed,
and commenting on any feature combinations you try out. Report directly on the
Vision test (Coursework 2) similarly to Part I - Question 1.
a. Represent sentences as a bag of bigrams, or as a bag of trigrams, instead of
unigrams (which is what all of the Encoding Features were doing).
b. Try also combinations of n-gram templates (see the sketch below).
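A sketch of the corresponding vectoriser settings; the ranges shown are illustrative, and whether to keep binary values, counts or tf-idf here is your choice.

from sklearn.feature_extraction.text import CountVectorizer

bigrams     = CountVectorizer(binary=True, ngram_range=(2, 2))  # bigrams only
trigrams    = CountVectorizer(binary=True, ngram_range=(3, 3))  # trigrams only
uni_bigrams = CountVectorizer(binary=True, ngram_range=(1, 2))  # unigrams + bigrams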
Report Results
Time to make a decision. Which feature combination should you choose?
3. Collect all your results and present them in a table (one row per experiment,
one column containing the Accuracy score). Choose the best feature combination and
briefly justify your answer.
Part II - Perceptron Algorithm [9 marks]
[Implement the (rest of the) body of train(), init_parameters() and predict() in
models/perceptron.py]
Now it’s finally time to get your hands dirty: let’s actually implement a Machine Learning
algorithm from scratch! Hint: Take some time to familiarise yourself with the scaffolding
implementation of the Perceptron class, and how it gets initialised from main() in
sentiment_classifier.py, before you start implementing!
1. You will have to implement both the training and prediction methods for the
Perceptron Algorithm according to the following pseudocode (You don’t have to
answer any part on the Vision test; we will inspect the code you will upload.)
Algorithm 1: train(train_X, train_y, dev_X, dev_y)
1: init_parameters() // initialise parameters
2: for epoch = 1 … num_epochs do // for every epoch
3: preds ← []
4: for all (x, y) ∈ train_X, train_y do
5: a ← ∑_{d=1}^{D} w_d x_d + b // compute activation for this example
6: ŷ ← sign(a) // predict the label
7: preds ← preds + ŷ // append the predicted label
8: if y·a ≤ 0 then // we made a mistake
9: w ← w + yx // update weights
10: b ← b + y // update bias
11: end if
12: end for
13: end for
Algorithm 2: init_parameters()
1: w_d ← 0, for all d = 1 … D // initialise weights
2: b ← 0 // initialise bias term
Algorithm 3: predict(X)
1: preds ← []
2: for all x ∈ X do
3: a ← ∑_{d=1}^{D} w_d x_d + b // compute activation for this example
4: ŷ ← sign(a) // predict the label
5: preds ← preds + ŷ // append the predicted label
6: end for
7: return preds
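For orientation, a minimal NumPy translation of Algorithms 1-3 is sketched below. It assumes the inputs are the sparse matrices produced in Part I and that the labels have already been mapped to -1/+1; the class layout is simplified and will not match the scaffolding in models/perceptron.py exactly.

import numpy as np
from sklearn.utils.extmath import safe_sparse_dot

class Perceptron:
    def __init__(self, num_epochs=10):
        self.num_epochs = num_epochs
        self.w = None
        self.b = 0.0

    def init_parameters(self, num_features):
        self.w = np.zeros(num_features)   # w_d <- 0 for all d = 1..D
        self.b = 0.0                      # b <- 0

    def train(self, train_X, train_y):
        self.init_parameters(train_X.shape[1])
        for epoch in range(self.num_epochs):
            for i in range(train_X.shape[0]):
                x = train_X[i]                                  # sparse row (1, D)
                y = train_y[i]                                  # must be -1 or +1
                a = safe_sparse_dot(x, self.w)[0] + self.b      # activation
                if y * a <= 0:                                  # mistake -> update
                    self.w += y * x.toarray().ravel()           # w <- w + y*x
                    self.b += y                                 # b <- b + y

    def predict(self, X):
        activations = safe_sparse_dot(X, self.w) + self.b       # shape (N,)
        return np.where(activations >= 0, 1, -1)                # sign(); ties -> +1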
Now is the time to take your algorithm for a spin… Use as input vectors (train_X and dev_X)
the best feature representation from Part I, in order to train your perceptron.
2. Perform hyperparameter tuning.
a. Plot the training and development accuracy curves (in a single plot) against
your hyperparameter values (Just upload the file).
b. Briefly explain what is going on in this plot.
Note: Think about what is missing from the pseudocode in Algorithm 1 in order to
perform this step; a plotting sketch follows below.
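One way to produce the plot is sketched below. The train_acc_history and dev_acc_history lists are hypothetical attributes you would add yourself, e.g. by computing the accuracy on the training and dev sets at the end of every epoch.

import matplotlib.pyplot as plt

epochs = range(1, len(model.train_acc_history) + 1)
plt.plot(epochs, model.train_acc_history, label="train")
plt.plot(epochs, model.dev_acc_history, label="dev")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.savefig("perceptron_accuracy.png")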
Extra Features
Randomly shuffling the training set before you start a new epoch usually helps
(especially when dealing with irrelevant features) and makes the model converge faster.
3. (a) Implement shuffling, compute train/dev accuracy curves and add them to the plot
from the question above. (Upload a new file.) (b) Answer very briefly (in one line)
whether it helps.
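One simple way to implement the shuffling (a sketch, assuming train_X is a SciPy sparse matrix and train_y a NumPy array of -1/+1 labels):

import numpy as np

# At the start of every epoch, reorder examples and labels in unison.
perm = np.random.permutation(train_X.shape[0])
train_X = train_X[perm]      # row re-indexing is supported on CSR matrices
train_y = train_y[perm]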
A few useful tips (Don’t skip them!):
- Use the method safe_sparse_dot from scikit-learn (we’ve done the import for
you) when computing dot products with the input vector x. The reason is that the
Vectorizer classes you will use in Part I to create feature vectors output a sparse
vector for every example (i.e., it only contains values for the non-zero elements).
- Be careful when computing the labels! Remember that you should output either -1
(negative class), or +1 (positive class).
- Debug your ML algorithm! Don’t just implement the pseudocode… Here are a few
typical steps (based on our experience):
- Inspect the value of the activation and the values of the weights/bias after a
perceptron update step on a single example. Do they do the right thing?
- Try first running your algorithm for 1 epoch; does it give you any errors? Is the
accuracy reasonable?
- What is the accuracy of a majority classifier (simply count the occurrences of
each class, pick the largest and divide by the total number of examples)? Does your
model do better than that? (We know it should; a sketch follows this list.)
- Overfit your model on a very small dataset. Use as training set a very small
portion of train_vecs and make sure your accuracy is increasing. Use the
following dataset sizes: 1, 2, 5, 10, 100. For sizes 1 and 2 (and maybe bigger too)
your accuracy should be 100%. If not, you probably have a bug!
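A sketch of the majority-class baseline (assuming train_y and dev_y hold the -1/+1 labels as lists or arrays):

from collections import Counter

# Always predict the most frequent label seen in training.
majority_label, _ = Counter(train_y).most_common(1)[0]
baseline_acc = sum(1 for y in dev_y if y == majority_label) / len(dev_y)
print("majority-class baseline accuracy:", baseline_acc)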
Part III - Gradient Descent Algorithm [9 marks]
[Implement the (rest of the) body of train(), init_parameters() and predict() in
models/gd.py]
Let’s try to implement a different algorithm: Gradient Descent for a Linear Model using the
Squared Loss and L2-norm Regulariser. Note: Do not expect it to be as optimised as the
scikit-learn implementation (SGDClassifier). This is the reason why we will only train on a
smaller portion of the dataset (10000 examples).
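For reference, one standard form of this objective, written in the notation of the Perceptron pseudocode above, is given below; the exact formulation on the Week 4 slides may differ (e.g., in constant factors or in where λ is placed).

L(w, b) = ∑_{n=1}^{N} (w·x_n + b − y_n)² + λ ‖w‖²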
1. Compute the gradients with respect to the weight vector w and the bias term b. Enter
just the resulting expressions in your response (you don’t need to show the derivation steps).
2. Implement the Gradient Descent Algorithm by adapting the pseudocode found in
Slides 14 and 27 of Week 4. Note: Careful! Slide 27 has an implementation using the
Exponential Loss.
Use as input vectors (train_X and dev_X) the best feature representation from Part I, in order
to train your linear model.
3. Perform hyperparameter tuning. Answer the following questions on the Vision Test.
a. Plot the training and development accuracy curves (in a single plot) against
max_iter for 3 combinations of eta and λ. (Upload the file with the plot that
should contain exactly 6 curves).
b. Briefly explain what is going on in this plot.
c. Plot the training and development average loss curves (in a single plot)
against max_iter for the same combinations as in Part III - 3a above. The
average loss is simply the total loss over all examples in the dataset divided
by the number of examples.
Tips: This is a bit more convoluted since you have 3 hyperparameters (welcome to
the real world of Machine Learning!): max_iter, eta (η) and lambda (λ). You could try
a full grid search but it is going to be very slow. We advise you to set max_iter to
something low (e.g., 10 or 20) and leave it there. Then try with moderate eta (e.g.,
0.0001) and no regularisation (λ=0) and progressively increase λ (don’t go too far).
Try big changes to eta (e.g., 0.001, 0.0001, 0.00001) and do not try all the
intermediate values. When experimenting do not be afraid to “kill” an experiment
early if you see the accuracy not increasing (or loss not decreasing)… Your biggest
enemy here is overshooting (take my word for it, it’s very common!). Finally, at this
stage just print out the accuracies/losses (so you can “kill” the experiment if it
misbehaves) before going to the plots.
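A sketch of that manual search; the GD constructor arguments (max_iter, eta, lam) and the recorded history attributes are assumptions about how you choose to structure your own class in models/gd.py.

for eta, lam in [(1e-3, 0.0), (1e-4, 0.0), (1e-4, 0.1)]:
    model = GD(max_iter=20, eta=eta, lam=lam)
    model.train(train_X, train_y, dev_X, dev_y)
    print("eta=%g, lambda=%g -> train acc %.3f, dev acc %.3f"
          % (eta, lam, model.train_acc_history[-1], model.dev_acc_history[-1]))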
The moment of truth...
4. Finally, use the best Perceptron model from Part II and the best Linear Model from
Part III to predict on the test set. Answer the following questions (very briefly, a few
lines each) on the Vision Test:
a. What are their accuracies?
b. Is the comparison fair?
c. How do the two models compare to each other?
Deliverables
You will have to submit the Test in ‘Assignments > Coursework 2 - Test’ on Vision. There,
you should (i) upload the completed source code in a .zip file (PLEASE don’t include
the dataset files!); it should be named Coursework_2_H0XXXXXXX.zip, replacing the Xs
with your Student ID; (ii) enter your answers to the questions outlined in the
description above; (iii) upload a few plot images as part of your answers, clearly
indicated within the test. It is advisable to properly comment your code. Note: Any
3rd-party source used must be properly cited (see also below).
Important Notes!
- The assignment counts for 25% of the course assessment.
- You are permitted to discuss the coursework with your classmates and of course with
me and the teaching assistant. However, coursework reports must be written in your
own words and the accompanying code must be your own. If some text or code in the
coursework has been taken from other sources, these sources must be properly
referenced. In particular, for pieces of code you get from the web (e.g., from
StackOverflow), you should minimally provide the link where you found it as an inline
comment in your code. Failure to reference work that has been obtained
from other sources or to copy the words and/or code of another student is
plagiarism and if detected, this will be reported to the School's Discipline Committee.
If a student is found guilty of plagiarism, the penalty could involve voiding the course.
- You should never give hard or soft copies of your coursework report or code to
another student. You must always refuse any request from another student for a
copy of your report and/or code. Sharing a coursework report and/or code with
another student is collusion, and if detected, this will be reported to the School's
Discipline Committee. If found guilty of collusion, the penalty could involve voiding
the course. You can find more information about the University’s policy on Plagiarism
here: https://www.hw.ac.uk/uk/students/studies/examinations/plagiarism.htm
- Pay special attention to all the Labs as they should provide lots of insight as to how
to tackle most of the questions in this coursework. The Final Lab is specifically
designed for you to ask any questions related to the coursework. Also, CHECK THE
DOCUMENTATION OF scikit-learn when you are in doubt.
- The test should be submitted by 15:00 UK time on Friday 13 November. You will
find the coursework under the ‘Assessment’ section of the course Vision page. No
individual extensions are permitted under any circumstances. Students who submit
after the deadline but within 5 working days of the deadline will be awarded
0.7 × (awarded coursework mark). Submissions that are more than 5 days late will
receive 0 marks.
- You will receive the final mark, answers, sample code solution and cohort-wide
feedback no later than 15 working days after the deadline.