CMSC 678辅导、Java,Python编程辅导
- 首页 >> Web Assignment 2
CMSC 678 — Introduction to Machine Learning
Due Wednesday March 9th, 11:59 AM
Item Summary
Assigned Monday February 21
Due Wednesday March 9th
Topic Loss Functions and Multiclass Classification (I)
Points 95 (+15 Extra Credit)
In this assignment you will become comfortable with the math of loss-based optimization; and implement,
experiment with, and compare multiple types of multiclass classifiers.
You are to complete this assignment on your own: that is, the code and writeup you submit must be
entirely your own. However, you may discuss the assignment at a high level with other students or on the
discussion board. Note at the top of your assignment who you discussed this with or what resources you
used (beyond course staff, any course materials, or public Piazza discussions).
The following table gives the overall point breakdown for this assignment.
Theory App.: Classification
Question 1 2 3 4 5
Points 20 35 5 35 15 (EC)
This handout may be lengthy, but think of it as both a tutorial and assignment. I provide a lot of explanation
and re-summarizing of course concepts.
What To Turn In Turn in a PDF writeup that answers the questions; turn in all requested code and models
necessary to replicate your results. Turn in models by serializing the learned models and parameters; your
programming language likely natively supports this (e.g., Python’s pickle, or Pytorch’s own format). Be
sure to include specific instructions on how to build/compile (if necessary) and run your code. Answer the
following questions in long-form. Provide any necessary analyses and discussion of your results.
How To Submit Submit the assignment on the submission site under “Assignment 2.”
https://www.csee.umbc.edu/courses/graduate/678/spring22/submit.
Be sure to select “Assignment 2.”
1
Theory
1. (20 points) For 1 ≤ i ≤ N, let xi ∈ R
2 be a two-dimensional input, and yi ∈ {−1, +1} the binary
label for xi
.
(a) Describe, in both words and with a concrete example, when the perceptron algorithm would be
guaranteed to converge (find an optimal solution to the training set). For the concrete example,
give N ≥ 5 data points {(xi
, yi)} for which the perceptron algorithm would converge.
(b) Describe, in words and with a concrete example, when the perceptron algorithm would not be
guaranteed to converge. For the concrete example, give N ≥ 5 data points {(xi
, yi)} for which
the perceptron algorithm would not converge.
(c) Let the dataset D be
i = xi yi
1 (-1, 1) 1
2 (-1, -1) -1
3 (0.5, 0.5) 1
4 (1, -1) 1
5 (0.5, -1) -1
Run, by hand, the perceptron algorithm on these five data points. Record your computations
for each time step (data point examined, activation score for that data point and current weight
vector, the predicted value, and the new weights) in the following table.
(t)
to have the correct dimensionality. The strange indexing x(t%N)+1
simply says to iterate through the data in the order provided above. Note the second column of
each row should equal the last column of the previous row.
2. (35 points) First, let f(x) be a K-dimensional feature vector, i.e., f(x) ∈ R
K. The particular meaning
of x doesn’t matter for this question: it could be an image, or a text document, or collection of robotic
sensor readings. In class, we discussed how to turn a (binary) linear model into a probabilistic (binary)
classifier: given vector weights w ∈ R
K, we pass our linear scoring model w
|f(x) through the
sigmoid function σ(z) = 1
1+exp (−z)
and use the following decision function:
(
output “class 1” if σ(w
|f(x)) ≥ 0.5
output “class 2” if σ(w
|f(x)) < 0.5.
[Eq-1]
This question considers the multi-class extension of that model.
Second, assume we have L classes, with corresponding weights θl (1 ≤ l ≤ L) per class. We can
group these weights into a matrix θ ∈ R
LxK. Notice that each θl
is a K-dimensional vector.
Consider a probabilistic linear classification model p(ˆy|x),
(a) Argue that a binary instantiation (i.e., L = 2) of model [Eq-2] provides the same decision
function as [Eq-1]. In arguing this, you do not need to prove it in a strict mathematical sense,
but you need to make it clear that you understand it. (Imagine how you would explain it to a
classmate.)
We showed in class that optimizing the MAP estimate is the “right” thing to do when we want to
minimize 0-1 classification loss. Remember that log is monotonic: if f(a) < f(b), then log f(a) <
log f(b). As such, this is also called maximizing the conditional log-likelihood:
the conditional log-likelihood
[Eq-3]
the maximum value (across possible y arguments).
It is sometimes beneficial to interpret a correct label y
in a one-hot format y~?. This one-hot vector y~?
is a vector the size of Y: each coordinate y~?[j] corresponds to one of the j items of Y. In a one-hot
format, all entries are 0 except for the coordinate corresponding to y. As an example, consider the
case of rolling a 1, 2, 3, 4, 5, or 6 (represented by y) from a weighted six-sided die (where this die
outcome represents our label), given some input x: to represent a correct roll of y = 4, a reasonable
one-hot equivalent is y~? = (0, 0, 0, 1, 0, 0). We can use this one-hot format to define the crossentropy
loss, which is the dot product between y~? and the vector of log conditional probability values
log p(y|xi):
(b) Show that, for the probabilistic classifier [Eq-2], maximizing the conditional log-likelihood
([Eq-3]) is the same as minimizing cross-entropy loss ([Eq-4]).1
(c) Given N data points {(x1, y1), . . . ,(xN , yN )}, formulate the empirical risk objective for [Eq-2]
using cross-entropy loss. You can think of this as filling out the following template,
where you must specify
• opt: the optimization goal (e.g., max, min, or something else)
• : the variables or values being optimized
• : the function (and its arguments) being optimized.
(d) Derive the gradient.
1You may be wondering what are the predicted items in cross-entropy loss. The cross-entropy loss measures the correctness of
probabilities corresponding to particular classes, against hard, “gold standard” (correct) judgments. Therefore, the predicted items
are actually probabilities, from which you can easily get a single discrete prediction.
Classification: Implementation, Experimentation, and Discussion
The next three questions lead you through building, experimenting with, reporting on the performance of,
and analyzing a multiclass perceptron classifier. The core deliverables for these questions are:
1. any implementations, scripts, and serialized model files needed to compile and
run your code; and
2. a written report discussing
• your implementations, including any design decisions or difficulties you
experienced [from questions 3-4];
• evidence and analysis of internal development/experimentation on the perceptron
model on the development set [from question 4]; and
• (optionally, for extra credit) results and analysis of a full comparison on
the test set [from 5].
The data we’ll be using is the MNIST digit dataset: in total, there are 70,000 “images” of handwritten
digits (each between 0-9). The task is to predict the digit y ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} from a 28x28 input
grayscale image x. 60,000 of these are allocated for training and 10,000 are allocated for testing. Do not
use the 10,000 testing portion until question 5.
Get the data from https://www.csee.umbc.edu/courses/graduate/678/spring22/
materials/mnist_rowmajor.jsonl.gz. This is a gzipped file, where each line is a JSON object.
Each line has three keys: “image” (a row-major representation of each image), “label” (an int, 0-9), and
“split” (train or test). Each image x is represented as a 784 (= 28 × 28) one-dimensional array (row-major
refers to how each image x should be interpreted in memory).
First, get acquainted with the data: make sure that you can read it properly and you are sufficiently
familiar with the storage format so you can easily use these files in later questions.2 Second, you are to
split the 60,000 training set into an internal training set (int-train) and an internal development set
(int-dev). How you split (e.g., randomly, stratified sampling, taking the first M), and the resulting sizes
of int-train and int-dev are up to you.
3. (5 points) Implement a “most frequent” baseline classifier. This baseline identifies the label that was
most common in training (int-train) and always produces that label during evaluation (regardless
of the input).
(a) In your report, document the implementation (this should be fairly straight-forward).
(b) In your report, report how well this baseline does on int-dev.
2Notice that each component of x is a float (0 ≤ xi ≤ 1). If you want to visualize the images, you can apply the sign function
(with τ = 0) and print its output in a 28x28 grid
signτ
(xi) = (
1 xi > τ
0 xi ≤ τ.
[Eq-6]
Note that [Eq-6] provides a binary representation of the input x.
You do not have to do the following but you may find it helps you get a handle on the data: visualize the representations
(raw/original, binary, or with arbitrary τ ) by creating a heatmap/tile plot. This way you can “look” at your data and verify that the
labels line up to their respective images (and make sense!).
4. (35 points) Implement a multiclass perceptron. Train it on int-train and evaluate it on int-dev.
For this question, you may use computation accelerators (e.g., blas, numpy, MATLAB) but you may
not use any existing perceptron implementation.
(a) In your report, discuss your implementation and convergence criteria (this should also be fairly
brief).
(b) In your report, discuss the concrete, quantifiable ways you validated your implementation was
correct. One such way could be to run it on the toy data in Question 1, though there are other
ways too.
(c) In your report, experiment with at least three different configurations, where you train each
configuration on int-train and then evaluate on int-test. Document this internal development
progress through quantifiable and readable means, i.e., graphs and/or tables showing
how different model configurations perform on int-dev. A configuration could include learning
biases or not, the convergence criteria, or the input featurization, among others.
Implementation Hint: Training a multiclass perceptron follows the training for a binary perceptron
as discussed in class, with the following four changes: first, rather than there being a single weight
vector w, there is now a weight vector wj per class j. Second, a single prediction is obtained from
the maximum of a vector of predictions ~y, using each of the class-specific weight vectors. That is,
~y[j] = w
|
j
x and the predicted value y = arg maxj ~y[j]. Third, if a prediction is incorrect, the correct
weights wy
? are increased by x, and the mispredicted weights wy are decreased by x. Fourth, a
per-class bias replaces the single bias term in the binary perceptron algorithm.
5. (Extra Credit: 15 points) Using the best perceptron configuration found in the previous question,
train new a perceptron model on the entire, original train set; also “re-train” a most-frequent classifier
baseline. At the end of this training, you should have two models trained on the entire 60,000
training set. Evaluate these models on the original 10,000 image test set. Do not experiment with
different configuration values here (that’s what your int-dev set was for!). Include these evaluation
results in your report and discuss the results: were they surprising? How similar were the test results
to the best internal development results?
CMSC 678 — Introduction to Machine Learning
Due Wednesday March 9th, 11:59 AM
Item Summary
Assigned Monday February 21
Due Wednesday March 9th
Topic Loss Functions and Multiclass Classification (I)
Points 95 (+15 Extra Credit)
In this assignment you will become comfortable with the math of loss-based optimization; and implement,
experiment with, and compare multiple types of multiclass classifiers.
You are to complete this assignment on your own: that is, the code and writeup you submit must be
entirely your own. However, you may discuss the assignment at a high level with other students or on the
discussion board. Note at the top of your assignment who you discussed this with or what resources you
used (beyond course staff, any course materials, or public Piazza discussions).
The following table gives the overall point breakdown for this assignment.
Theory App.: Classification
Question 1 2 3 4 5
Points 20 35 5 35 15 (EC)
This handout may be lengthy, but think of it as both a tutorial and assignment. I provide a lot of explanation
and re-summarizing of course concepts.
What To Turn In Turn in a PDF writeup that answers the questions; turn in all requested code and models
necessary to replicate your results. Turn in models by serializing the learned models and parameters; your
programming language likely natively supports this (e.g., Python’s pickle, or Pytorch’s own format). Be
sure to include specific instructions on how to build/compile (if necessary) and run your code. Answer the
following questions in long-form. Provide any necessary analyses and discussion of your results.
How To Submit Submit the assignment on the submission site under “Assignment 2.”
https://www.csee.umbc.edu/courses/graduate/678/spring22/submit.
Be sure to select “Assignment 2.”
1
Theory
1. (20 points) For 1 ≤ i ≤ N, let xi ∈ R
2 be a two-dimensional input, and yi ∈ {−1, +1} the binary
label for xi
.
(a) Describe, in both words and with a concrete example, when the perceptron algorithm would be
guaranteed to converge (find an optimal solution to the training set). For the concrete example,
give N ≥ 5 data points {(xi
, yi)} for which the perceptron algorithm would converge.
(b) Describe, in words and with a concrete example, when the perceptron algorithm would not be
guaranteed to converge. For the concrete example, give N ≥ 5 data points {(xi
, yi)} for which
the perceptron algorithm would not converge.
(c) Let the dataset D be
i = xi yi
1 (-1, 1) 1
2 (-1, -1) -1
3 (0.5, 0.5) 1
4 (1, -1) 1
5 (0.5, -1) -1
Run, by hand, the perceptron algorithm on these five data points. Record your computations
for each time step (data point examined, activation score for that data point and current weight
vector, the predicted value, and the new weights) in the following table.
(t)
to have the correct dimensionality. The strange indexing x(t%N)+1
simply says to iterate through the data in the order provided above. Note the second column of
each row should equal the last column of the previous row.
2. (35 points) First, let f(x) be a K-dimensional feature vector, i.e., f(x) ∈ R
K. The particular meaning
of x doesn’t matter for this question: it could be an image, or a text document, or collection of robotic
sensor readings. In class, we discussed how to turn a (binary) linear model into a probabilistic (binary)
classifier: given vector weights w ∈ R
K, we pass our linear scoring model w
|f(x) through the
sigmoid function σ(z) = 1
1+exp (−z)
and use the following decision function:
(
output “class 1” if σ(w
|f(x)) ≥ 0.5
output “class 2” if σ(w
|f(x)) < 0.5.
[Eq-1]
This question considers the multi-class extension of that model.
Second, assume we have L classes, with corresponding weights θl (1 ≤ l ≤ L) per class. We can
group these weights into a matrix θ ∈ R
LxK. Notice that each θl
is a K-dimensional vector.
Consider a probabilistic linear classification model p(ˆy|x),
(a) Argue that a binary instantiation (i.e., L = 2) of model [Eq-2] provides the same decision
function as [Eq-1]. In arguing this, you do not need to prove it in a strict mathematical sense,
but you need to make it clear that you understand it. (Imagine how you would explain it to a
classmate.)
We showed in class that optimizing the MAP estimate is the “right” thing to do when we want to
minimize 0-1 classification loss. Remember that log is monotonic: if f(a) < f(b), then log f(a) <
log f(b). As such, this is also called maximizing the conditional log-likelihood:
the conditional log-likelihood
[Eq-3]
the maximum value (across possible y arguments).
It is sometimes beneficial to interpret a correct label y
in a one-hot format y~?. This one-hot vector y~?
is a vector the size of Y: each coordinate y~?[j] corresponds to one of the j items of Y. In a one-hot
format, all entries are 0 except for the coordinate corresponding to y. As an example, consider the
case of rolling a 1, 2, 3, 4, 5, or 6 (represented by y) from a weighted six-sided die (where this die
outcome represents our label), given some input x: to represent a correct roll of y = 4, a reasonable
one-hot equivalent is y~? = (0, 0, 0, 1, 0, 0). We can use this one-hot format to define the crossentropy
loss, which is the dot product between y~? and the vector of log conditional probability values
log p(y|xi):
(b) Show that, for the probabilistic classifier [Eq-2], maximizing the conditional log-likelihood
([Eq-3]) is the same as minimizing cross-entropy loss ([Eq-4]).1
(c) Given N data points {(x1, y1), . . . ,(xN , yN )}, formulate the empirical risk objective for [Eq-2]
using cross-entropy loss. You can think of this as filling out the following template,
where you must specify
• opt: the optimization goal (e.g., max, min, or something else)
• : the variables or values being optimized
• : the function (and its arguments) being optimized.
(d) Derive the gradient.
1You may be wondering what are the predicted items in cross-entropy loss. The cross-entropy loss measures the correctness of
probabilities corresponding to particular classes, against hard, “gold standard” (correct) judgments. Therefore, the predicted items
are actually probabilities, from which you can easily get a single discrete prediction.
Classification: Implementation, Experimentation, and Discussion
The next three questions lead you through building, experimenting with, reporting on the performance of,
and analyzing a multiclass perceptron classifier. The core deliverables for these questions are:
1. any implementations, scripts, and serialized model files needed to compile and
run your code; and
2. a written report discussing
• your implementations, including any design decisions or difficulties you
experienced [from questions 3-4];
• evidence and analysis of internal development/experimentation on the perceptron
model on the development set [from question 4]; and
• (optionally, for extra credit) results and analysis of a full comparison on
the test set [from 5].
The data we’ll be using is the MNIST digit dataset: in total, there are 70,000 “images” of handwritten
digits (each between 0-9). The task is to predict the digit y ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} from a 28x28 input
grayscale image x. 60,000 of these are allocated for training and 10,000 are allocated for testing. Do not
use the 10,000 testing portion until question 5.
Get the data from https://www.csee.umbc.edu/courses/graduate/678/spring22/
materials/mnist_rowmajor.jsonl.gz. This is a gzipped file, where each line is a JSON object.
Each line has three keys: “image” (a row-major representation of each image), “label” (an int, 0-9), and
“split” (train or test). Each image x is represented as a 784 (= 28 × 28) one-dimensional array (row-major
refers to how each image x should be interpreted in memory).
First, get acquainted with the data: make sure that you can read it properly and you are sufficiently
familiar with the storage format so you can easily use these files in later questions.2 Second, you are to
split the 60,000 training set into an internal training set (int-train) and an internal development set
(int-dev). How you split (e.g., randomly, stratified sampling, taking the first M), and the resulting sizes
of int-train and int-dev are up to you.
3. (5 points) Implement a “most frequent” baseline classifier. This baseline identifies the label that was
most common in training (int-train) and always produces that label during evaluation (regardless
of the input).
(a) In your report, document the implementation (this should be fairly straight-forward).
(b) In your report, report how well this baseline does on int-dev.
2Notice that each component of x is a float (0 ≤ xi ≤ 1). If you want to visualize the images, you can apply the sign function
(with τ = 0) and print its output in a 28x28 grid
signτ
(xi) = (
1 xi > τ
0 xi ≤ τ.
[Eq-6]
Note that [Eq-6] provides a binary representation of the input x.
You do not have to do the following but you may find it helps you get a handle on the data: visualize the representations
(raw/original, binary, or with arbitrary τ ) by creating a heatmap/tile plot. This way you can “look” at your data and verify that the
labels line up to their respective images (and make sense!).
4. (35 points) Implement a multiclass perceptron. Train it on int-train and evaluate it on int-dev.
For this question, you may use computation accelerators (e.g., blas, numpy, MATLAB) but you may
not use any existing perceptron implementation.
(a) In your report, discuss your implementation and convergence criteria (this should also be fairly
brief).
(b) In your report, discuss the concrete, quantifiable ways you validated your implementation was
correct. One such way could be to run it on the toy data in Question 1, though there are other
ways too.
(c) In your report, experiment with at least three different configurations, where you train each
configuration on int-train and then evaluate on int-test. Document this internal development
progress through quantifiable and readable means, i.e., graphs and/or tables showing
how different model configurations perform on int-dev. A configuration could include learning
biases or not, the convergence criteria, or the input featurization, among others.
Implementation Hint: Training a multiclass perceptron follows the training for a binary perceptron
as discussed in class, with the following four changes: first, rather than there being a single weight
vector w, there is now a weight vector wj per class j. Second, a single prediction is obtained from
the maximum of a vector of predictions ~y, using each of the class-specific weight vectors. That is,
~y[j] = w
|
j
x and the predicted value y = arg maxj ~y[j]. Third, if a prediction is incorrect, the correct
weights wy
? are increased by x, and the mispredicted weights wy are decreased by x. Fourth, a
per-class bias replaces the single bias term in the binary perceptron algorithm.
5. (Extra Credit: 15 points) Using the best perceptron configuration found in the previous question,
train new a perceptron model on the entire, original train set; also “re-train” a most-frequent classifier
baseline. At the end of this training, you should have two models trained on the entire 60,000
training set. Evaluate these models on the original 10,000 image test set. Do not experiment with
different configuration values here (that’s what your int-dev set was for!). Include these evaluation
results in your report and discuss the results: were they surprising? How similar were the test results
to the best internal development results?