代做Statistical Machine Learning GR5241 Spring 2023代做Python语言

2024.03.26 - 首页 >> CS

Statistical Machine Learning GR5241

Spring 2023

Homework 3

Due: 04/11/2023 by 11:59pm

Homework submission: Please submit your homework as a pdf in Canvas.

Problem 1 (PCA, 10 points)

Consider a random vector X e Rp with mean E[X] = 0 and variance-covariance matrix Cov[X] = Σ . Note that Σ is a p X p dimensional matrix. Denote the jth eigenvalue and jth eigenvector of Σ as λj and ϕj , respectively. Define the random score vector Z as

Z = ΦTX,

where Φ is the rotation matrix with its columns being the eigenvectors ϕj , i.e.,

Φ = (ϕ1 | ϕ2 | · · · | ϕp )

Perform the following task:

Show that the variance-covariance matrix of random score vector Z is

'( 0(λ1)

ΣZ = ' .

' .

( 0(.)

···(···) 0(0) )'

. . . '

···(.) p )

Problem 2 (Kernels [30 points])

Please see the background information provided below. The required tasks for problem 2 are described later.

Recall the definition of a kernel: A function k : Rd X Rd 一 R is called a kernel on Rd if there is some function ϕ : Rd 一 F into some space F with scalar product〈·, ·) F such that

k(x, x′ ) =〈ϕ(x),ϕ(x′ )）F for all x, x′ e Rd.

How do you prove k is a valid kernel? Two recommendations (RI) and (R2) follow:

RI: The first recommendation is a direct application of Mercer’s theorem.

. To prove k(x, x′ ) is a valid kernel, it’s sufficient to show that k(x, x′ ) is positive-definite, i.e., for any function f : Rd 一 R,

lRd ×Rd k1 (x, x′ )f(x)f(x′ )dxdx′ ≥ 0.

. Note that a valid kernel k(x, x′ ) must always be symmetric. Before showing positive-definiteness, make sure to check that k(x, x′ ) symmetric. This step is often glossed over because a theorist would never consider a non-symmetric kernel, and hence prove its validity.

RII: The second recommendation uses the definition of a kernel function directly. Note that it is often convenient to invoke Mercer’s theorem.

. To prove k(x, x\ ) is a valid kernel, you can directly identify the function ϕ : Rd 一 F such that

k(x, x\ ) =〈ϕ(x),ϕ(x\ )）F for all x, x\ e Rd.

For simple examples, the function ϕ is often easy to identify.

. For more complicated cases, it’s convenient to invoke Mercer’s theorem and exploit the decomposition:

k(x, x\ ) = λj ϕj (x)ϕj (x\ )

Required tasks: (2.i) - (2.vii)

Let x, x\ e Rd and assume that k1 (x, x\ ) and k2 (x, x\ ) are valid kernels. Note that you can assume both k1 , k2 are symmetric and positive-definite. Show the following kernels are also valid:

2.i [4 points] Prove that the scalar product k* (x, x\ ) =〈x, x\）is a valid kernel. What is the ϕ function for the scalar product? This is the most basic case.

2.ii [4 points] Show k(x, x\ ) = k1 (x, x\ ) + k2 (x, x\ ) is a valid kernel. Let ϕ1 and ϕ2 be associated with k1 and k2 respectively. Construct the ϕ function for k using ϕ1 and ϕ2 .

2.iii [4 points] For real a > 0, show k* (x, x\ ) = ak1 (x, x\ ) is a valid kernel and identify the ϕ function.

2.iv [4 points] For any function g : Rd 一 R, show k* (x, x\ ) = g(x)k1 (x, x\ )g(x\ ) is a valid kernel.

2.v [4 points] Show k* (x, x\ ) = k1 (x, x\ )k2 (x, x\ ) is a valid kernel. What is the ϕ function corresponding to k*?

2.vi [5 points] Show that the radial basis kernel projects vectors to an infinite dimensional space. Construct the ϕ function corresponding to radial basis kernel?

2.vii [5 points] This problem is an application of the radial basis SVM and there is no required proof. Using the dataset “HW3Problem2-Spring2023.csv”, run the radial basis SVM using 4 bandwidths. Plot the resulting decision boundaries with the training data. Use a built-in function from R or Python to solve tis problem, .e.g., svm().

Problem 3 Gradient descent in Regression[20 points]

The goal of this question is to estimate the least squares regression model using gradient descent as opposed to the traditional least squares equation. Let d represent the number of features and n be the sample size of the training data.

Q(β) = (yi — (β1 xi1 + β2 xi2 + ··· + βd xi,d ))2 ,

'(β(β)2(1))'

where β = ' . '

(d )

For simplicity, the intercept (or bias) term is dropped but this is not recommended in practice. Note that you can easily minimize Q(β) using the R code lm(y…X1+X2+ ··· +Xd-1,data=data). The “-1” forces the intercept to be zero.

Required tasks: (3.i) - (3.iv)

3.i [5 points] Derive a closed form expression for ΔQ(β) and write the gradient descent update function in terms of your expression ΔQ(β).

3.ii [5 points] Consider the dataset “auto-mpg.data” . The dataset is available at AutoMPG.data. You will first download the data. Use the attributes displacement, horsepower, acceleration as predictors for mpg. Choosing learning rate η = 0.01?, run the gradient descent algorithm for t = 1000 iterations. Break the algorithm using threshold ϵ = .01, where

' ΔQ(β(t))j ' < .01, for all, j = 1, 2,...,d.

Note that ' ΔQ(β(t))j ' represents the absolute value of the gradient for coefficient j. Did your algorithm converge? Describe whether or not your algorithm converged using a plot of mean square prediction error vs the iteration number.

3.iii [5 points] Repeat part 2 using learning rate η = 0.001 and η = 0.0001. Did your algorithm converge when using these two learning rates? Plot the mean square prediction error vs iteration number using 3 different learning rate η = 0.01, 0.001, 0.0001 in the same chart.

3.iv [5 points] From your observations in part 3 comment on the effect of learning rate in the convergence of gradient descent. Would you prefer to work with a small learning rate or large learning rate when solving a regression using gradient descent. Justify your choice.

Problem 4 (Training the Linearly Separable SVM [25 points])

Figure 1: Problem 4 linearly separable data.

Set-up and background:

The goal of this question is to manually code an optimization method for estimating a linearly separable SVM. We consider a simple case with two features x1 , x2 . Denote vH = (w1 w2 )T and denote the full parameter vector as β = (c w1 w2 )T . The dataset of interest ”HW3Problem4-Spring2023.csv” is displayed in figure 1 and consists of n = 100 training cases with response labels y = 1 or y = -1. For a linearly separable SVM, you must solve the optimization problem:

vH(mi){IvH I}

yi (〈vH , xi）+ c) ≥ 1, for i = 1, . . . , 100

Note: We swapped “ - c” with “c” as compared to the expression presented in the class notes. The end result does not matter but it’s easier to keep track of the signs when using c.

How to solve this problem:

One method of solving the SVM minimization problem is to define a new objective Q(vH , c) as follows:

Q(vH , c) =

L(yi ,〈vH , xi）+ c) + 2 IvH I2

(L(yi ,〈vH , xi）+ c) + IvH I2(2))

Qi (vH , c),

where λ > 0 is a tuning parameter and L(y, f) is the hinge loss function defined by L(y, f) = max{0, 1 - yf}. Also note that Qi (vH , c) is the objective at case i, i.e.,

L(yi ,〈vH , xi）+ c) = L(yi , vH(T)xi + c)

= max{0, 1 - yi (vH(T)xi + c)}

= max{0, 1 - yi (w1 xi1 + w2 xi2 + c)}

= max{0, 1 - zi },

where zi = yi (w1 xi1 + w2 xi2 + c). The other component can also be simplified:

2 IvH I2 = 2 (w1 + w2 )

The goal is to minimize Q(vH , c). However, the hinge loss function g(z) = max{0, 1 - z} is not differentiable at the point z = 1. Hence the traditional gradient descent algorithm is not well defined in this exercise. Fortunately the function g(z) = max{0, 1 - z} is differentiable for every point except z = 1, and we can analytically solve the gradient for the two cases z < 1 and z > 1.

Required tasks: (4.i) - (4.iv)

4.i [5 points] Construct a plot over the the range -1 ≤ z ≤ 3 which shows shows hinge loss g(z) = max{0, 1 - z} as a function of z. This is an easy problem. Students should notice that hinge loss is not differentiable at z = 1.

4.ii [5 points] Consider the objective defined for case i: Qi (vH , c). Derive an expression for ΔQi (vH , c)

assuming zi < 1. Also derive an expression for ΔQi (vH , c) assuming zi > 1.

4.iii [10 points] Linearity of gradients allows us to compute the full gradient:

ΔQ(vH , c) = ΔQi (vH , c).

Task: Using the training data ”HW3Problem4-Spring2023.csv”, estimate the SVM model by running the gradient descent algorithm:

β (t+1) := β(t) - ηt ΔQi (β(t))

Here we denote the full parameter vector as β = (c w1 w2 )T . The key difference between the above algorithm and regular gradient descent is that you must check whether zi < 1 or zi > 1 for each case i = 1, 2, . . . , 100. I recommend choosing

ηt = and λ ≈ .25.

You are welcome to try different values of λ and compare your estimated model to the R function svm().

4.iv [5 points] Construct a plot of x1 versus x2 with your estimated linear decision boundary from part 4.iii. Also display your estimated coefficients

β(ˆ) =（ˆ(c) ˆ(w)1 ˆ(w)2 )T

A few notes follow:

(a) This particular problem is known as The Pegasos Algorithm. The development presented here is not formal but the key ideas still hold true.

(b) You must use R, Python or another programming language to manually solve problem 4.iii. Don’t just run the svm() function and call it a day!

(d) You must choose a threshold ϵ > 0 to break your algorithm or just run it for several iterations.

Problem 5 (SVM Dual Optimization, 15 points)

Consider the primal optimization problem for the SVM classifier:

v(m)c(n)∥v∥, subject to yi (⟨v, xi ⟩ − c) ≥ 1

Recall that the response values yi are labeled {−1, 1}, the vectors v, xi ∈ Rp and the norm of v is defined by ∥v∥ = ( ⟨v, v⟩ = √vTv. Notice that the above optimization problem can automatically be re-expressed as

v(m)c(n){f(v, c) }, subject to g(v, c) ≤ 0, (1)

o(w)pti(he)m(re)iza(f)n(c))pro(=)lem(∥v)∥is(2) e(a)xp(nd)rd(c))by(=) 1 − yi (⟨v, xi ⟩ − c). Using the Lagrange multiplier method, the duel convex

α(ma)x {v(m)c(n){L(v,c,α) }}, subject to αi ≥ 0, i = 1, 2, . . . ,n, (2)

where the Lagrangian is defined by

L(v,c,α) = f(v, c) + αigi (v, c).

Note that the KKT conditions are implied from the above expression, specifically from, min{L(v,c,α)} . Also note that the vector of Lagrange multipliers is denoted α = (α1 α2 ... αn )T .

Required tasks: (5.i) - (5.iii)

5.i [5 points] Derive the dual optimization objective using Lagrange multipliers as opposed to using Barycentric coordinates.

5.i [5 points]Solve the minimization problem v(m)c(n){L(v,c,λ) }. In this problem, you do not have to solve the actual optimization problem. Instead, you need to simplify Equation (2) so that it yields the famous duel convex problem for SVMs.

5.i [5 points] Use the matrix derivatives and inner product relations to simplify the expression as much as possible.