辅导S6140编程、辅导Python,C++程序
- 首页 >> Database作业 S6140: Machine Learning
Homework Assignment # 4
Assigned: 11/19/2022 Due: 12/03/2022, 11:59pm, through Canvas
Four problems, 150 points in total. Good luck!
Prof. Predrag Radivojac, Northeastern University
Problem 1. (40 points) Consider two classification concepts given in Figure 1, where x ∈ X = [?6, 6] ×
[?4, 4], y ∈ Y = {?1,+1} and p(y|x) ∈ {0, 1} is defined in the drawing.
Figure 1: Two concepts where examples that fall within any of the three 3 × 3 (panel A) or 1 × 1 (panel
B) squares are labeled positive and the remaining examples (outside each of the squares but within X ) are
labeled negative. The position of the point x = (x1, x2) in the upper left-hand corner for each square is
shown in the picture. Consider horizontal axis to be x1 and vertical axis as x2.
Your experiments in this question will rely on generating a data set of size n ∈ {250, 1000, 10000} drawn from
a uniform distribution in X and labeled according to the rules from Figure 1; e.g., P (Y = 1|x) = 0.97 if x
that was randomly drawn is inside any of the three squares in either of the two panels, and P (Y = 1|x) = 0.01
otherwise (notice that this introduces a certain amount of asymmetric noise in the data). The goal of the
following two problems will be to train and evaluate classifiers created from the data generated in this way.
You can use any library you want in this assignment and do programming in Python, MATLAB, or R. Your
code should be easy to run for each question and sub-question below so that we can replicate your results
to the maximum extent possible.
Consider single-output feed-forward neural networks with one or two hidden layers such that the number of
hidden neurons in each layer is h1 ∈ {1, 4, 12} and h2 ∈ {0, 3}, respectively, with h2 = 0 meaning that there
is no second hidden layer. Consider one of the standard objective functions as your optimization criterion
and use early stopping and regularization as needed. Consider a hyperbolic tangent activation function in
each neuron and the output but you are free to experiment with others if you’d like to. For each of the
architectures, defined by a parameter combination (h1, h2), evaluate the performance of each model using
classification accuracy, balanced accuracy, and area under the ROC curve as your two performance criteria.
To evaluate the performance of your models use cross-validation. However, to evaluate the performance of
performance evaluation, generate another very large data set on a fine grid in X . Then use the predictions
1
2 Homework Assignment # 4
from your trained model on all these points to determine the “true” performance. You can threshold your
predictions in the middle of your prediction range (i.e., at 0.5 if you are predicting between 0 and 1) to
determine binary predictions of your models and to then compare those with true class labels you generated
on the fine grid.
Provide meaningful comments about all aspects of this exercise (performance results for different network
architectures, accuracy in and of cross-validation, true accuracy when you disregard class-label noise, run
time, etc.). The comments should not just re-state the results but rather capture trends and give reasoning
as to why certain behavior was observed.
Problem 2. (35 points) Matrix factorization with applications. Consider an n× d real-valued data matrix
X = (xT1 ,x
T
2 , . . . ,x
T
n ). We will attempt to approximate this matrix using the following factorization
X? = UVT
where U = (uT1 ,u
T
2 , . . . ,u
T
n ) is an n × k and V = (vT1 ,vT2 , . . . ,vTd ) is a d × k matrix of real numbers, and
where 0 < k < n, d is a user-set parameter. Notice that the value xij in X can be approximated by u
T
i vj
and that xTi , the i-th row of X, can be approximated by u
T
i V
T , giving x?i = Vui. We will formulate the
matrix factorization process as the following minimization
min
U,V
∑
i,j
(xij ? uTi vj)2 + λ(
∑
i
||ui||2 +
∑
j
||vj ||2)
which minimizes the sum-of-squared-errors between real values xij and reconstructed values x?ij = u
T
i vj .
The regularization parameter λ ≥ 0 is user-selected. This problem can be directly solved using gradient
descent, but we will attempt a slightly different approach. To do this, we can see that for a fixed V we can
find optimal vectors ui by minimizing
||Vui ? xi||2 + λ · ||ui||2
which can be solved in a closed form using OLS regression for every i. We can equivalently express the
optimization for vectors vj and find the solution for every j. Then, we can alternate these steps until
convergence. This procedure is called the Alternating Least Squares (ALS) algorithm for matrix factorization.
It has the following steps:
Initialize U and V
repeat
for i = 1 to n
ui = formula #1
end
for j = 1 to d
vj = formula #2
end
until convergence
where you are expected to derive formula #1 and formula #2.
a) (15 points) Derive the optimization steps for the ALS algorithm by finding formula #1 and formula
#2 in the pseudocode listed above.
b) (20 points) Consider now that some values in X are missing (e.g., the rows of X are users and the
columns of X are movie ratings, when available) and that we are interested in carrying out matrix
completion using matrix factorization presented above. We would like to use the ALS algorithm, but
the problem is that we must exclude all missing values from optimization. Derive now a modified ALS
algorithm (formulas #1 and #2) to adapt it for matrix completion. Hint: consider adding an indicator
matrix W to the optimization process, where wij = 1 if xij is available and wij = 0 otherwise.
Homework Assignment # 4 3
Problem 3. (15 points) Prove representational equivalence of a three-layer neural network with linear
activation function in all neurons and a single-layer layer neural network with the same activation function.
Assume a single-output network.
Problem 4. (60 points) Assess the role of biased data on binary classification performance. This will be
accomplished using synthetic data. First, consider the following class-conditional distributions
p(x|Y = y) =
m∑
k=1
wky · N (μky,Σky),
where 0 < wky ≤ 1,
∑m
k=1 wky = 1, x ∈ R2, y ∈ {0, 1} and m ≥ 1. Then, generate a training set D of nD
input-output pairs to satisfy
p(x) = p(x|Y = 0)P (Y = 0) + p(x|Y = 1)P (Y = 1),
where P (Y = 1) ∈ (0, 1). After the training set is generated, create different biased test sets on which you
will evaluate the quality of your training and performance estimation. The test set T will consist of nT
input-output examples generated as follows:
a) (20 points) Your test data is unbiased. That is, T is constructed from pˉ(x, y) = p(x, y).
b) (20 points) Your test data is non-representative according to the label-shift bias model; i.e.,
pˉ(x) = p(x|Y = 0)Pˉ (Y = 0) + p(x|Y = 1)Pˉ (Y = 1),
Note that the class-conditionals are unchanged in the test data, but that the class-priors are different;
i.e., Pˉ (Y = 0) ?= P (Y = 0).
c) (20 points) Your test data is non-representative according to the covariate-shift bias model; i.e.,
pˉ(x) = P (Y = 0|x)pˉ(x) + P (Y = 1|x)pˉ(x),
where the posteriors are unchanged but the marginal data distributions are distinct; i.e., pˉ(x) ?= p(x).
Make sure that you clearly explain how you ensured that this condition holds.
The goal of the exercise is to train your models and estimate their accuracy on D but then use test set
T to check the performance of your trained models “in the wild”. You can select nD, nT , and all model
parameters manually to allow for stable estimation (e.g., nD and nT can be large), but note that the area
under the ROC curve (AUC) of your best classifier on the unbiased data in part (a) should not exceed 0.95
or be below 0.65. Then train at least 3 different prediction models to evaluate their sensitivity to data biases.
You must use logistic regression, naive Bayes, and at least one non-linear neural network with at least two
layers. You can use other models as well.
Use 10-fold cross-validation protocol on D and AUC to estimate the performance of your model. Then give
AUC on the biased test set. Repeat your data generation and evaluation multiple times to verify that the test
performance is within confidence intervals of the estimated performance. Discuss your results, visualize data
distributions (the data is 2D) and potentially (not mandatory) use data sets with different level of overlap
between class-conditional distributions (between positives and negatives) to strengthen your conclusions on
the impact of problem difficulty on the impact of bias.
4 Homework Assignment # 4
Directions and Policies
Submit a single package containing all answers, results and code. Your submission package should be
compressed and named firstnamelastname.zip (e.g., predragradivojac.zip). In your package there should
be a single pdf file named main.pdf that will contain answers to all questions, all figures, and all relevant
results. Your solutions and answers must be typed1 and make sure that you type your name and Northeastern
username (email) on top of the first page of the main.pdf file. The rest of the package should contain all
code that you used. The code should be properly organized in folders and subfolders, one for each question
or problem. All code, if applicable, should be turned in when you submit your assignment as it may be
necessary to demo your programs to the teaching assistants. Use Matlab, Python or R.
Unless there are legitimate circumstances, late assignments will be accepted up to 5 days after the due date
and graded using the following rules:
on time: your score × 1
1 day late: your score × 0.9
2 days late: your score × 0.7
3 days late: your score × 0.5
4 days late: your score × 0.3
5 days late: your score × 0.1
For example, this means that if you submit 3 days late and get 80 points for your answers, your total number
of points will be 80 × 0.5 = 40 points.
All assignments are individual, except when collaboration is explicitly allowed. All text must be be your
own or, for group assignments (when explicitly permitted), by the members of the group. All
sources used for problem solution must be acknowledged; e.g., web sites, books, research papers,
personal communication with people, etc. Academic honesty is taken seriously! For detailed information see
Office of Student Conduct and Conflict Resolution.
1We recommend Latex; in particular, TexShop-MacTeX combination for a Mac and TeXnicCenter-MiKTex combination
on Windows. An easy way to start with Latex is to use the freely available Lyx. You can also use Microsoft Word or other
programs that can display formulas professionally.
Homework Assignment # 4
Assigned: 11/19/2022 Due: 12/03/2022, 11:59pm, through Canvas
Four problems, 150 points in total. Good luck!
Prof. Predrag Radivojac, Northeastern University
Problem 1. (40 points) Consider two classification concepts given in Figure 1, where x ∈ X = [?6, 6] ×
[?4, 4], y ∈ Y = {?1,+1} and p(y|x) ∈ {0, 1} is defined in the drawing.
Figure 1: Two concepts where examples that fall within any of the three 3 × 3 (panel A) or 1 × 1 (panel
B) squares are labeled positive and the remaining examples (outside each of the squares but within X ) are
labeled negative. The position of the point x = (x1, x2) in the upper left-hand corner for each square is
shown in the picture. Consider horizontal axis to be x1 and vertical axis as x2.
Your experiments in this question will rely on generating a data set of size n ∈ {250, 1000, 10000} drawn from
a uniform distribution in X and labeled according to the rules from Figure 1; e.g., P (Y = 1|x) = 0.97 if x
that was randomly drawn is inside any of the three squares in either of the two panels, and P (Y = 1|x) = 0.01
otherwise (notice that this introduces a certain amount of asymmetric noise in the data). The goal of the
following two problems will be to train and evaluate classifiers created from the data generated in this way.
You can use any library you want in this assignment and do programming in Python, MATLAB, or R. Your
code should be easy to run for each question and sub-question below so that we can replicate your results
to the maximum extent possible.
Consider single-output feed-forward neural networks with one or two hidden layers such that the number of
hidden neurons in each layer is h1 ∈ {1, 4, 12} and h2 ∈ {0, 3}, respectively, with h2 = 0 meaning that there
is no second hidden layer. Consider one of the standard objective functions as your optimization criterion
and use early stopping and regularization as needed. Consider a hyperbolic tangent activation function in
each neuron and the output but you are free to experiment with others if you’d like to. For each of the
architectures, defined by a parameter combination (h1, h2), evaluate the performance of each model using
classification accuracy, balanced accuracy, and area under the ROC curve as your two performance criteria.
To evaluate the performance of your models use cross-validation. However, to evaluate the performance of
performance evaluation, generate another very large data set on a fine grid in X . Then use the predictions
1
2 Homework Assignment # 4
from your trained model on all these points to determine the “true” performance. You can threshold your
predictions in the middle of your prediction range (i.e., at 0.5 if you are predicting between 0 and 1) to
determine binary predictions of your models and to then compare those with true class labels you generated
on the fine grid.
Provide meaningful comments about all aspects of this exercise (performance results for different network
architectures, accuracy in and of cross-validation, true accuracy when you disregard class-label noise, run
time, etc.). The comments should not just re-state the results but rather capture trends and give reasoning
as to why certain behavior was observed.
Problem 2. (35 points) Matrix factorization with applications. Consider an n× d real-valued data matrix
X = (xT1 ,x
T
2 , . . . ,x
T
n ). We will attempt to approximate this matrix using the following factorization
X? = UVT
where U = (uT1 ,u
T
2 , . . . ,u
T
n ) is an n × k and V = (vT1 ,vT2 , . . . ,vTd ) is a d × k matrix of real numbers, and
where 0 < k < n, d is a user-set parameter. Notice that the value xij in X can be approximated by u
T
i vj
and that xTi , the i-th row of X, can be approximated by u
T
i V
T , giving x?i = Vui. We will formulate the
matrix factorization process as the following minimization
min
U,V
∑
i,j
(xij ? uTi vj)2 + λ(
∑
i
||ui||2 +
∑
j
||vj ||2)
which minimizes the sum-of-squared-errors between real values xij and reconstructed values x?ij = u
T
i vj .
The regularization parameter λ ≥ 0 is user-selected. This problem can be directly solved using gradient
descent, but we will attempt a slightly different approach. To do this, we can see that for a fixed V we can
find optimal vectors ui by minimizing
||Vui ? xi||2 + λ · ||ui||2
which can be solved in a closed form using OLS regression for every i. We can equivalently express the
optimization for vectors vj and find the solution for every j. Then, we can alternate these steps until
convergence. This procedure is called the Alternating Least Squares (ALS) algorithm for matrix factorization.
It has the following steps:
Initialize U and V
repeat
for i = 1 to n
ui = formula #1
end
for j = 1 to d
vj = formula #2
end
until convergence
where you are expected to derive formula #1 and formula #2.
a) (15 points) Derive the optimization steps for the ALS algorithm by finding formula #1 and formula
#2 in the pseudocode listed above.
b) (20 points) Consider now that some values in X are missing (e.g., the rows of X are users and the
columns of X are movie ratings, when available) and that we are interested in carrying out matrix
completion using matrix factorization presented above. We would like to use the ALS algorithm, but
the problem is that we must exclude all missing values from optimization. Derive now a modified ALS
algorithm (formulas #1 and #2) to adapt it for matrix completion. Hint: consider adding an indicator
matrix W to the optimization process, where wij = 1 if xij is available and wij = 0 otherwise.
Homework Assignment # 4 3
Problem 3. (15 points) Prove representational equivalence of a three-layer neural network with linear
activation function in all neurons and a single-layer layer neural network with the same activation function.
Assume a single-output network.
Problem 4. (60 points) Assess the role of biased data on binary classification performance. This will be
accomplished using synthetic data. First, consider the following class-conditional distributions
p(x|Y = y) =
m∑
k=1
wky · N (μky,Σky),
where 0 < wky ≤ 1,
∑m
k=1 wky = 1, x ∈ R2, y ∈ {0, 1} and m ≥ 1. Then, generate a training set D of nD
input-output pairs to satisfy
p(x) = p(x|Y = 0)P (Y = 0) + p(x|Y = 1)P (Y = 1),
where P (Y = 1) ∈ (0, 1). After the training set is generated, create different biased test sets on which you
will evaluate the quality of your training and performance estimation. The test set T will consist of nT
input-output examples generated as follows:
a) (20 points) Your test data is unbiased. That is, T is constructed from pˉ(x, y) = p(x, y).
b) (20 points) Your test data is non-representative according to the label-shift bias model; i.e.,
pˉ(x) = p(x|Y = 0)Pˉ (Y = 0) + p(x|Y = 1)Pˉ (Y = 1),
Note that the class-conditionals are unchanged in the test data, but that the class-priors are different;
i.e., Pˉ (Y = 0) ?= P (Y = 0).
c) (20 points) Your test data is non-representative according to the covariate-shift bias model; i.e.,
pˉ(x) = P (Y = 0|x)pˉ(x) + P (Y = 1|x)pˉ(x),
where the posteriors are unchanged but the marginal data distributions are distinct; i.e., pˉ(x) ?= p(x).
Make sure that you clearly explain how you ensured that this condition holds.
The goal of the exercise is to train your models and estimate their accuracy on D but then use test set
T to check the performance of your trained models “in the wild”. You can select nD, nT , and all model
parameters manually to allow for stable estimation (e.g., nD and nT can be large), but note that the area
under the ROC curve (AUC) of your best classifier on the unbiased data in part (a) should not exceed 0.95
or be below 0.65. Then train at least 3 different prediction models to evaluate their sensitivity to data biases.
You must use logistic regression, naive Bayes, and at least one non-linear neural network with at least two
layers. You can use other models as well.
Use 10-fold cross-validation protocol on D and AUC to estimate the performance of your model. Then give
AUC on the biased test set. Repeat your data generation and evaluation multiple times to verify that the test
performance is within confidence intervals of the estimated performance. Discuss your results, visualize data
distributions (the data is 2D) and potentially (not mandatory) use data sets with different level of overlap
between class-conditional distributions (between positives and negatives) to strengthen your conclusions on
the impact of problem difficulty on the impact of bias.
4 Homework Assignment # 4
Directions and Policies
Submit a single package containing all answers, results and code. Your submission package should be
compressed and named firstnamelastname.zip (e.g., predragradivojac.zip). In your package there should
be a single pdf file named main.pdf that will contain answers to all questions, all figures, and all relevant
results. Your solutions and answers must be typed1 and make sure that you type your name and Northeastern
username (email) on top of the first page of the main.pdf file. The rest of the package should contain all
code that you used. The code should be properly organized in folders and subfolders, one for each question
or problem. All code, if applicable, should be turned in when you submit your assignment as it may be
necessary to demo your programs to the teaching assistants. Use Matlab, Python or R.
Unless there are legitimate circumstances, late assignments will be accepted up to 5 days after the due date
and graded using the following rules:
on time: your score × 1
1 day late: your score × 0.9
2 days late: your score × 0.7
3 days late: your score × 0.5
4 days late: your score × 0.3
5 days late: your score × 0.1
For example, this means that if you submit 3 days late and get 80 points for your answers, your total number
of points will be 80 × 0.5 = 40 points.
All assignments are individual, except when collaboration is explicitly allowed. All text must be be your
own or, for group assignments (when explicitly permitted), by the members of the group. All
sources used for problem solution must be acknowledged; e.g., web sites, books, research papers,
personal communication with people, etc. Academic honesty is taken seriously! For detailed information see
Office of Student Conduct and Conflict Resolution.
1We recommend Latex; in particular, TexShop-MacTeX combination for a Mac and TeXnicCenter-MiKTex combination
on Windows. An easy way to start with Latex is to use the freely available Lyx. You can also use Microsoft Word or other
programs that can display formulas professionally.