# STAT 525辅导、C/C++程序语言辅导

- 首页 >> Algorithm 算法 Multivariate methods for scale matrices

STAT 525/625

Multivariate Analysis

A. R. de Leon

Department of Mathematics and Statistics

University of Calgary

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 1/20

Principal components analysis: Motivation

Principal components analysis (PCA) is a statistical methodology often used for dimension

reduction. Given a random vector Xp×1 = (X1, · · · ,Xn)>, PCA seeks a small number p′ p (i.e.,

p′ much less than p) of linear combinations

Yk = `>k X =

p

∑

j=1

`k jX j

of the RVs X1, · · · ,Xp in X 3 the linear combinations Y1, · · · ,Yp′

1 explain “most” of the variation in X; and

2 are uncorrelated with each other (i.e., Cov(Yk,Yk′ ) = 0, ?k 6= k′, for k,k′ = 1, · · · , p′).

Application to regression analysis

Given predictors X1, · · · ,X200, where the first 100 predictors X1, · · · ,X100 pertain to Industry 1 and

the remaining 100 predictors X101, · · · ,X200 to Industry 2, suppose the first 2 principal components

(PCs) are based on the coefficient vectors `1 = 1√2001200 and `2 =

1√

200

(

1>100,?1>100)>. These

suggest the following transformed predictors:

X?1 =

1

200

1>200X and X?2 =

1

100

(

1>100,?1>100

)

X,

so that X?1 may be interpreted as an average over industries and X?2 as the difference in industry

averages, so that p′ = 2 200 = p. Because PCs are uncorrelated, PCA is one remedy for the

problem of multicollinearity in regression analysis.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 2/20

Population principal components from Cov(X) =Σ

Let μ= E(X) and Σ=Cov(X) 0. The kth principal component (PC) Yk = `>k X, with ‖`k‖= 1,

is that linear combination 3 Var(Yk)≥Var(Yk′ ) and Cov(Yk,Yk′ ) = 0, ?k> k′, for k,k′ = 1, · · · , p. By

the Spectral Decomposition Theorem (SDT; see Equation (2–18) on p. 61 of JW), we have

Σ = VDλV>, where Dλ = diag(λ1, · · · ,λp), 3 λ1 ≥ ·· · ≥ λp > 0 are the ordered eigenvalues of

Σ, and V = (v1, · · · ,vp) is the orthogonal matrix with columns the corresponding orthonormal

eigenvectors v1, · · · ,vp (i.e., v j ⊥ v j′ , ? j 6= j′, and ‖v j‖= 1, ? j) of Σ.

For Yk = `>k X, we have

Var(Yk) = `>k Σ`k =

p

∑

j=1

α2k jλ j ≥ λ1,

where αk = (αk1, · · · ,αkp)> =V>`k and ‖αk‖= 1, k= 1, · · · , p. For k= 1, we see that Var(Y1) = λ1

iff α1 = (1,0, · · · ,0)>, so that we get `1 =Vα1 = v1. Hence, the first population PC Y1 is given

by

Y1 = v>1 X and Var(Y1) = λ1.

For k = 2, we have the second population PC Y2 = `>2X 3 Cov(Y1,Y2) = `>2 Σv1 = λ1α21, so that

α2 = (0,α22, · · · ,α2p)> and ‖α2‖2 = ∑pj=2α22 j. In addition, we want α2 to maximize

Var(Y2) =

p

∑

j=2

α22 jλ j ≤ λ2,

and clearly, Var(Y2) = λ2 iff α2 = (0,1,0, · · · ,0)>. Hence, we get `2 =Vα2 = v2, and the second

population PC is given by Y2 = v>2 X and Var(Y2) = λ2.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 3/20

Properties of population principal components

Population principal components Y1, · · · ,Yp

The kth population PC Yk from Σ is given by Yk = v>k X, 3 Var(Yk) = λk and Cov(Yk,Yk′ ) = 0,

?k′ 6= k, where vk is the orthonormal eigenvector corresponding to the kth largest eigenvalue λk

of Σ, for k = 1, · · · , p.

1 Information in X. Observe that

Yp×1 =

??? Y1...

Yp

??? = V>X

is an orthogonal transformation of X, so that X =VY . That is, Y contains the same

information as X does, but we hope that the first few PCs, say, p′ p, would suffice. We

can choose p′ 3 the total variance Var(1>pY ) = ∑pk=1Var(Yk) = tr(Σ) = ∑pk=1 λk ≈ ∑p

′

k=1 λk.

2 Correlation between Yk and X j. For vk = (vk1, · · · ,vkp)>, we have

Corr(X j,Yk) = 1σ j vk j

√

λk 3 ∑pj=1

(

Corr(X j,Yk)

)2

= 1,

for k, j = 1, · · · , p, where Var(X j) = σ2j is the jth diagonal element of Σ.

3 Normality of Yk. If X ～ Np(μ,Σ), then Y ～ Np(V>μ,Dλ ) (i.e., Yk ind～ N(v>k μ,λk), for

k = 1, · · · , p).

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 4/20

Population principal components from Corr(X) =P

When the variables X1, · · · ,Xp are in different units, or if they have highly disparate variances,

their linear combinations may prove to be difficult to interpret. That is, the population PCs

Y1, · · · ,Yp from Σ may not be practically meaningful! A common remedy in applications is to

work, not with X, but with its “standardized” version Zp×1 given by

Z = D?1/2σ

(

X?μ),

where Dσ = diag(σ21 , · · · ,σ2p), with σ21 , · · · ,σ2p the diagonal elements of Σ. Note that

Cov(Z) = D?1/2σ ΣD

?1/2

σ = Corr(X) = P,

where P is the correlation matrix of X 3 tr(P) = p.

Population principal components Y?1, · · · ,Y?p

The kth population PC Y?k from P is given by Y?k = v?>k Z, 3 Var(Y?k) = λ?k and Cov(Y?k,Y?k′ ) = 0,

?k′ 6= k, where v?k is the orthonormal eigenvector corresponding to the kth largest eigenvalue λ?k

of P, for k = 1, · · · , p.

Note that the population PCs are not invariant to the choice of scale matrix. That is, it is

possible that the population PCs Y1, · · · ,Yp from Σ are very different from the population PCs

Y?1, · · · ,Y?p from P. See Example 8.2 on p. 437 of JW for an illustration.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 5/20

Sample principal components

LetX1, · · · ,Xn rs～Np(μ,Σ), with Σ 0. Since S 0 wp 1, we have by SDT that S= V?D?λ V?

>, where

D?λ = diag(λ?1, · · · , λ?p), with λ?1 ≥ ·· · ≥ λ?p > 0 the ordered eigenvalues of S and V? = (v?1, · · · , v?p) the

orthogonal matrix with columns the corresponding orthonormal eigenvectors v?1, · · · , v?p.

For k = 1, let ?1 =

(

?11, · · · ,?1n

)> 3 ?1i = ?`>1Xi, for i= 1, · · · ,n, where ?`1 satisfies the following:

1 ‖ ?`1‖= 1; and

2 ?`1 = v?1 maximizes the “sample variance” ?`>1 S ?`1, 3 ?`>1 S ?`1∣∣ ?`1=v?1 = λ?1.

We thus have ?1 =Xn×pv?1, the n× 1 vector of the scores on first sample PC (i.e., score ?1i is

“sample observation” i on Y1), with X the data matrix. Alternatively, the data matrix X can

be standardized first as Zn×p = (Z1, · · · ,Zn)> =

(

X?1n?X>

)

diag

(

1

S1

, · · · , 1Sp

)

, so that we get

the sample correlation matrix R (based on X) given by R = 1n?2Z

>Z = SZ , the sample variance-

covariance matrix of Z1, · · · ,Zn (i.e., based on Z), since Z = 1n ∑ni=1Zi = 0p×1. The vector

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 6/20

Stock prices data (Table 8.4, p. 473 of JW)

Data on weekly rates of return of stocks of JP Morgan, Citibank, Wells Fargo, (Royal Dutch)

Shell, and Exxon (Mobil). The R function prcomp() was used to obtain the following results:

Table 1. The estimated coefficient vectors v?k and sample variances λ?k (i.e., kth

eigenvalue-eigenvector pair of S) of scores for the kth sample PC, for k = 1, · · · ,5, from

the sample variance-covariance matrix S of the stock prices data.

Stock Coefficients (v?k)

PC1 PC2 PC3 PC4 PC5

JP Morgan –0.469 0.368 –0.604 –0.363 0.384

Citibank –0.532 0.236 –0.136 –0.629 –0.496

Wells Fargo –0.465 0.315 0.772 0.289 0.071

Shell –0.387 –0.585 0.093 –0.381 0.595

Exxon –0.361 –0.606 –0.109 0.493 –0.498

λ?k 2.437 1.406 0.499 0.399 0.255

For i= 1, · · · ,103, we get the scores for the first and second sample PCs as

?1i = ?0.469Xi1?0.532Xi2?0.465Xi3?0.387Xi4?0.361Xi5,

?2i = 0.368Xi1+0.236Xi2+0.315Xi3?0.585Xi4?0.605Xi5.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 7/20

Scatterplot of scores ?1i and ?2i and scree plot of λ?ks

Note that ?1i

a∝ 151

>

5Xi, so that sample PC1 can be interpreted as the stock average while sample

PC2 can be viewed as a contrast between financial and oil stocks.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 8/20

Normality of scores ?1i and ?2i for PC1 and PC2

We have √n(λ??λ) D→ Np(0p×1,2D2λ ), as n→+∞, where λ= (λ1, · · · ,λp)> and λ?= (λ?1, · · · , λ?p)>.

Using the delta method, we get log

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 9/20

Police overtime data (Table 5.8, p. 240 of JW)

Data on overtime hours for the police department are classified according to “court appear-

ances” (legal), “extraordinary appearances” (extra), “holdover hours” (holdover), “COA hours”

(COA), and “meeting hours” (meeting). A PCA on the centred data yilds the following:

Table 2. The estimated coefficient vectors v?k and sample variances λ?k (i.e., kth

eigenvalue-eigenvector pair of S) of scores for the kth sample PC, for k = 1, · · · ,5, from

the sample variance-covariance matrix S of the centred police overtime data.

Overtime Coefficients (v?k)

PC1 PC2 PC3 PC4 PC5

legal 0.046 –0.048 0.629 –0.643 0.432

extra 0.039 0.985 –0.077 –0.151 –0.007

holdover –0.658 0.107 0.582 0.250 –0.392

COA 0.734 0.069 0.503 0.397 –0.213

meeting –0.155 0.107 0.081 0.586 0.784√

λ?k 1664.4 1195.5 792.5 470.2 315.9

Under multivariate normality of X, the centred PCs Yk = v>k (X?μ)～ N(0,λk), for k = 1, · · · , p,

so that ∑p

′

k=1

1

λ?k

? 2ki

a～ χ2p′ , provided n is large.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 10/20

Scatterplot of scores ?1i and ?2i and scree plot of λ?ks from centred S

Observation #11 had an unusually high “extra” value, hence, an unusually high value of ?2,11,

which is dominated by “extra”.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 11/20

T 2-control chart based on PCA on centred S

The sum ∑pk=p′+1 of squares of the last few standardized PC scores can be used to identify

out-of-control observations. Note that observations #12 and #13 fall outside or are near the

upper control limit.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 12/20

PCA on sample correlation matrix R of police overtime data

A PCA on the sample correlation matrix R of the police overtime data yields the following:

Table 3. The estimated coefficient vectors ??vk and sample variances ??λ k (i.e., kth

eigenvalue-eigenvector pair of R) of scores for the kth sample PC, for k = 1, · · · ,5, from

the sample correlation matrix R of the police overtime data.

Overtime Coefficients (vk)

PC1 PC2 PC3 PC4 PC5

legal 0.156 –0.648 0.660 –0.064 0.339

extra –0.074 0.677 0.607 0.404 0.069

holdover –0.599 –0.286 0.199 0.189 –0.695

COA 0.576 0.114 0.318 –0.466 –0.579

meeting –0.528 0.163 0.233 –0.761 0.247

λ k 2.163 1.146 1.024 0.522 0.144

The results are generally similar, albeit less clearcut. One issue here is that the asymptotic

approximations may not apply as well to the sample correlation matrix R as they do to the

sample covariance matrix S.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 13/20

Scatterplot of scores ?1i and ?2i and scree plot of λ?ks from R

While observation #11 still appears to be outlying, notice that it does not fall very far from the

boundary of the confidence ellipse.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 14/20

T 2-control chart based on PCA on R

Note that only observation #13 appears — and only barely — to be “out-of-control”.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 15/20

Factor analysis: Modelling the covariance structure

Factor analysis (FA) seeks to explain the high correlations between variables within groups of

variables in terms of a small number of latent (i.e. unobservable) random factors.

Orthogonal factor model with m common factors

Let Xp×1 be an observable random vector 3 E(X) =μp×1 and Cov(X) =Σp×p 0. The orthog-

onal factor model assumes that

X = μ+LF +,

where Fm×1 and p×1 are independent 3 m ≤ p, with E(F ) = 0m×1, E() = 0p×1, Cov(F ) = Im

(hence, orthogonal factors), and Cov() = Ψp×p = diag(ψ1, · · · ,ψp) 0. We have

Fk, k = 1, · · · ,m : kth common factor,

ε j, j = 1, · · · , p : jth specific factor,

` jk, j = 1, · · · , p, k = 1, · · · ,m : ( j,k)th factor loading,

ψ j, j = 1, · · · , p : jth “uniqueness” or specific variance.

The matrix Lp×m is the matrix of (common) factor loadings, with ( j,k)th element ` jk representing

variable X j’s loading on common factor Fk.

Note that the (orthogonal) factor model is similar to a regression model, except that the “design

matrix” L is not observable (i.e., unknown).

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 16/20

Structural model for Σ

The orthogonal factor model implies a structural model for Σ =

(

σ j j′

)

p×p given by

Σp×p = LL>+Ψ and Cov(X,F ) = L.

That is, Cov(X j,Fk) = ` jk, ? j,k, and

Var(X j) = σ j j = σ2j =

m

∑

k=1

`2jk +ψ j = h

2

j +ψ j, Cov(X j,X j′ ) = σ j j′ =

m

∑

k=1

` jk` j′k = h j j′ ,

where LL> =

(

h j j′

)

p×p, with diagonals h j j = h

2

j > 0 referred to as the communalities, which are

the variance components due to the factors. A simple example of a factorization of Σ3×3 (i.e.,

p= 3), with m= 1 common factor and L = 13, is

Σ = LL>+Ψ = J3+diag(ψ1,ψ2,ψ3).

Note that while we can always find L for m= p (with Ψ = 0), a “proper solution” (i.e., h j j′ > 0

and ψ j > 0, ? j, j′) may not exist when m p. See Example 9.2 on p. 486 of JW.

Scale invariance. Yp×1 =Cp×pX, with C a diagonal matrix (so that Cov(C) =CΨC> is

also diagonal), also satisfies the orthogonal factor model.

Factor rotation. For orthogonal matrix Γm×m, we have

Σ = LΓ(LΓ)>+Ψ = L?(L?)>+Ψ,

so that “rotating” L preserves the structural model for Σ. Since Γ is not unique, a

particular rotation is chosen based on interpretability of the common factors.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 17/20

Principal components solution

By SDT, we have Σ =VDλV> = ∑pj=1 λ jv jv>j , we have

Σ ≈

m

∑

k=1

λkvkv>k = L?L?

>

,

where

L? =

(√

λ1v1, · · · ,

√

λmvm

)

.

The matrix of communalities Ψ? = diag(ψ?1, · · · , ψ?m) is then taken as

Ψ? = diag

orthogonal matrix with columns the orthonormal eigenvectors corresponding to λ?1, · · · , λ?p. The

residual matrix êS is given by.

The above estimation method is often referred to as the principal factor method.

The sample variance-covariance matrix S may be replaced by the sample correlation matrix R.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 18/20

What should m be?

Deciding the value m is similar to deciding the number of PCs to be retained in PCA. Two common

approaches are as follows:

1 Via Frobenius norm. With the Frobenius norm

‖A‖2F = tr(A>A) = ‖vec(A)‖2,

we have

so that we can choose m 3 ∑pj=m+1 λ? 2j is small.

2 Via total contribution. Since ??h2j = ∑mk′=1 ?`?2jk′ , common factor Fk contributes ?`?2jk to sample

variance S2j of variable X j. Hence, the total contribution of common factor Fk to the total

sample variance tr(S) = ∑pj=1 S2j is

tr(L) ≈ 1 (i.e., ∑mk=1 λ?k ≈ tr(L)).

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 19/20

Maximum likelihood solution

Let X1, · · · ,Xn rs～ Np(μ,Σ) (i.e., Fi and i are MVN random vectors). With Σ =LL>+Ψ, the

likelihood function L(μ,L,Ψ) is given by

L(μ,L,Ψ) ∝ |Σ|np/2 exp

(he MLEs μ?=X, L, and Ψ? maximize L(μ,L,Ψ), subject to the constraint

L>Ψ?1L = ? = diag(δ1, · · · ,δm),

where δ1 > · · ·> δm. This yields an L, which is unique up to the multiplication of its columns by ±1.

Note that the orthogonal factor model is overparameterized, which renders it non-identifiable.

As a remedy, the above constraint is imposed on the model.

Sketch of proof of “uniqueness” of L

If Kp×m 3

LL> = KK>,

then K =LA, for some orthogonal matrix A. By the Singular Value Decomposition Theorem,

it can be shown that

A = A,

whence, A is a diagonal matrix with diagonal elements ±1.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 20/20

STAT 525/625

Multivariate Analysis

A. R. de Leon

Department of Mathematics and Statistics

University of Calgary

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 1/20

Principal components analysis: Motivation

Principal components analysis (PCA) is a statistical methodology often used for dimension

reduction. Given a random vector Xp×1 = (X1, · · · ,Xn)>, PCA seeks a small number p′ p (i.e.,

p′ much less than p) of linear combinations

Yk = `>k X =

p

∑

j=1

`k jX j

of the RVs X1, · · · ,Xp in X 3 the linear combinations Y1, · · · ,Yp′

1 explain “most” of the variation in X; and

2 are uncorrelated with each other (i.e., Cov(Yk,Yk′ ) = 0, ?k 6= k′, for k,k′ = 1, · · · , p′).

Application to regression analysis

Given predictors X1, · · · ,X200, where the first 100 predictors X1, · · · ,X100 pertain to Industry 1 and

the remaining 100 predictors X101, · · · ,X200 to Industry 2, suppose the first 2 principal components

(PCs) are based on the coefficient vectors `1 = 1√2001200 and `2 =

1√

200

(

1>100,?1>100)>. These

suggest the following transformed predictors:

X?1 =

1

200

1>200X and X?2 =

1

100

(

1>100,?1>100

)

X,

so that X?1 may be interpreted as an average over industries and X?2 as the difference in industry

averages, so that p′ = 2 200 = p. Because PCs are uncorrelated, PCA is one remedy for the

problem of multicollinearity in regression analysis.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 2/20

Population principal components from Cov(X) =Σ

Let μ= E(X) and Σ=Cov(X) 0. The kth principal component (PC) Yk = `>k X, with ‖`k‖= 1,

is that linear combination 3 Var(Yk)≥Var(Yk′ ) and Cov(Yk,Yk′ ) = 0, ?k> k′, for k,k′ = 1, · · · , p. By

the Spectral Decomposition Theorem (SDT; see Equation (2–18) on p. 61 of JW), we have

Σ = VDλV>, where Dλ = diag(λ1, · · · ,λp), 3 λ1 ≥ ·· · ≥ λp > 0 are the ordered eigenvalues of

Σ, and V = (v1, · · · ,vp) is the orthogonal matrix with columns the corresponding orthonormal

eigenvectors v1, · · · ,vp (i.e., v j ⊥ v j′ , ? j 6= j′, and ‖v j‖= 1, ? j) of Σ.

For Yk = `>k X, we have

Var(Yk) = `>k Σ`k =

p

∑

j=1

α2k jλ j ≥ λ1,

where αk = (αk1, · · · ,αkp)> =V>`k and ‖αk‖= 1, k= 1, · · · , p. For k= 1, we see that Var(Y1) = λ1

iff α1 = (1,0, · · · ,0)>, so that we get `1 =Vα1 = v1. Hence, the first population PC Y1 is given

by

Y1 = v>1 X and Var(Y1) = λ1.

For k = 2, we have the second population PC Y2 = `>2X 3 Cov(Y1,Y2) = `>2 Σv1 = λ1α21, so that

α2 = (0,α22, · · · ,α2p)> and ‖α2‖2 = ∑pj=2α22 j. In addition, we want α2 to maximize

Var(Y2) =

p

∑

j=2

α22 jλ j ≤ λ2,

and clearly, Var(Y2) = λ2 iff α2 = (0,1,0, · · · ,0)>. Hence, we get `2 =Vα2 = v2, and the second

population PC is given by Y2 = v>2 X and Var(Y2) = λ2.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 3/20

Properties of population principal components

Population principal components Y1, · · · ,Yp

The kth population PC Yk from Σ is given by Yk = v>k X, 3 Var(Yk) = λk and Cov(Yk,Yk′ ) = 0,

?k′ 6= k, where vk is the orthonormal eigenvector corresponding to the kth largest eigenvalue λk

of Σ, for k = 1, · · · , p.

1 Information in X. Observe that

Yp×1 =

??? Y1...

Yp

??? = V>X

is an orthogonal transformation of X, so that X =VY . That is, Y contains the same

information as X does, but we hope that the first few PCs, say, p′ p, would suffice. We

can choose p′ 3 the total variance Var(1>pY ) = ∑pk=1Var(Yk) = tr(Σ) = ∑pk=1 λk ≈ ∑p

′

k=1 λk.

2 Correlation between Yk and X j. For vk = (vk1, · · · ,vkp)>, we have

Corr(X j,Yk) = 1σ j vk j

√

λk 3 ∑pj=1

(

Corr(X j,Yk)

)2

= 1,

for k, j = 1, · · · , p, where Var(X j) = σ2j is the jth diagonal element of Σ.

3 Normality of Yk. If X ～ Np(μ,Σ), then Y ～ Np(V>μ,Dλ ) (i.e., Yk ind～ N(v>k μ,λk), for

k = 1, · · · , p).

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 4/20

Population principal components from Corr(X) =P

When the variables X1, · · · ,Xp are in different units, or if they have highly disparate variances,

their linear combinations may prove to be difficult to interpret. That is, the population PCs

Y1, · · · ,Yp from Σ may not be practically meaningful! A common remedy in applications is to

work, not with X, but with its “standardized” version Zp×1 given by

Z = D?1/2σ

(

X?μ),

where Dσ = diag(σ21 , · · · ,σ2p), with σ21 , · · · ,σ2p the diagonal elements of Σ. Note that

Cov(Z) = D?1/2σ ΣD

?1/2

σ = Corr(X) = P,

where P is the correlation matrix of X 3 tr(P) = p.

Population principal components Y?1, · · · ,Y?p

The kth population PC Y?k from P is given by Y?k = v?>k Z, 3 Var(Y?k) = λ?k and Cov(Y?k,Y?k′ ) = 0,

?k′ 6= k, where v?k is the orthonormal eigenvector corresponding to the kth largest eigenvalue λ?k

of P, for k = 1, · · · , p.

Note that the population PCs are not invariant to the choice of scale matrix. That is, it is

possible that the population PCs Y1, · · · ,Yp from Σ are very different from the population PCs

Y?1, · · · ,Y?p from P. See Example 8.2 on p. 437 of JW for an illustration.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 5/20

Sample principal components

LetX1, · · · ,Xn rs～Np(μ,Σ), with Σ 0. Since S 0 wp 1, we have by SDT that S= V?D?λ V?

>, where

D?λ = diag(λ?1, · · · , λ?p), with λ?1 ≥ ·· · ≥ λ?p > 0 the ordered eigenvalues of S and V? = (v?1, · · · , v?p) the

orthogonal matrix with columns the corresponding orthonormal eigenvectors v?1, · · · , v?p.

For k = 1, let ?1 =

(

?11, · · · ,?1n

)> 3 ?1i = ?`>1Xi, for i= 1, · · · ,n, where ?`1 satisfies the following:

1 ‖ ?`1‖= 1; and

2 ?`1 = v?1 maximizes the “sample variance” ?`>1 S ?`1, 3 ?`>1 S ?`1∣∣ ?`1=v?1 = λ?1.

We thus have ?1 =Xn×pv?1, the n× 1 vector of the scores on first sample PC (i.e., score ?1i is

“sample observation” i on Y1), with X the data matrix. Alternatively, the data matrix X can

be standardized first as Zn×p = (Z1, · · · ,Zn)> =

(

X?1n?X>

)

diag

(

1

S1

, · · · , 1Sp

)

, so that we get

the sample correlation matrix R (based on X) given by R = 1n?2Z

>Z = SZ , the sample variance-

covariance matrix of Z1, · · · ,Zn (i.e., based on Z), since Z = 1n ∑ni=1Zi = 0p×1. The vector

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 6/20

Stock prices data (Table 8.4, p. 473 of JW)

Data on weekly rates of return of stocks of JP Morgan, Citibank, Wells Fargo, (Royal Dutch)

Shell, and Exxon (Mobil). The R function prcomp() was used to obtain the following results:

Table 1. The estimated coefficient vectors v?k and sample variances λ?k (i.e., kth

eigenvalue-eigenvector pair of S) of scores for the kth sample PC, for k = 1, · · · ,5, from

the sample variance-covariance matrix S of the stock prices data.

Stock Coefficients (v?k)

PC1 PC2 PC3 PC4 PC5

JP Morgan –0.469 0.368 –0.604 –0.363 0.384

Citibank –0.532 0.236 –0.136 –0.629 –0.496

Wells Fargo –0.465 0.315 0.772 0.289 0.071

Shell –0.387 –0.585 0.093 –0.381 0.595

Exxon –0.361 –0.606 –0.109 0.493 –0.498

λ?k 2.437 1.406 0.499 0.399 0.255

For i= 1, · · · ,103, we get the scores for the first and second sample PCs as

?1i = ?0.469Xi1?0.532Xi2?0.465Xi3?0.387Xi4?0.361Xi5,

?2i = 0.368Xi1+0.236Xi2+0.315Xi3?0.585Xi4?0.605Xi5.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 7/20

Scatterplot of scores ?1i and ?2i and scree plot of λ?ks

Note that ?1i

a∝ 151

>

5Xi, so that sample PC1 can be interpreted as the stock average while sample

PC2 can be viewed as a contrast between financial and oil stocks.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 8/20

Normality of scores ?1i and ?2i for PC1 and PC2

We have √n(λ??λ) D→ Np(0p×1,2D2λ ), as n→+∞, where λ= (λ1, · · · ,λp)> and λ?= (λ?1, · · · , λ?p)>.

Using the delta method, we get log

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 9/20

Police overtime data (Table 5.8, p. 240 of JW)

Data on overtime hours for the police department are classified according to “court appear-

ances” (legal), “extraordinary appearances” (extra), “holdover hours” (holdover), “COA hours”

(COA), and “meeting hours” (meeting). A PCA on the centred data yilds the following:

Table 2. The estimated coefficient vectors v?k and sample variances λ?k (i.e., kth

eigenvalue-eigenvector pair of S) of scores for the kth sample PC, for k = 1, · · · ,5, from

the sample variance-covariance matrix S of the centred police overtime data.

Overtime Coefficients (v?k)

PC1 PC2 PC3 PC4 PC5

legal 0.046 –0.048 0.629 –0.643 0.432

extra 0.039 0.985 –0.077 –0.151 –0.007

holdover –0.658 0.107 0.582 0.250 –0.392

COA 0.734 0.069 0.503 0.397 –0.213

meeting –0.155 0.107 0.081 0.586 0.784√

λ?k 1664.4 1195.5 792.5 470.2 315.9

Under multivariate normality of X, the centred PCs Yk = v>k (X?μ)～ N(0,λk), for k = 1, · · · , p,

so that ∑p

′

k=1

1

λ?k

? 2ki

a～ χ2p′ , provided n is large.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 10/20

Scatterplot of scores ?1i and ?2i and scree plot of λ?ks from centred S

Observation #11 had an unusually high “extra” value, hence, an unusually high value of ?2,11,

which is dominated by “extra”.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 11/20

T 2-control chart based on PCA on centred S

The sum ∑pk=p′+1 of squares of the last few standardized PC scores can be used to identify

out-of-control observations. Note that observations #12 and #13 fall outside or are near the

upper control limit.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 12/20

PCA on sample correlation matrix R of police overtime data

A PCA on the sample correlation matrix R of the police overtime data yields the following:

Table 3. The estimated coefficient vectors ??vk and sample variances ??λ k (i.e., kth

eigenvalue-eigenvector pair of R) of scores for the kth sample PC, for k = 1, · · · ,5, from

the sample correlation matrix R of the police overtime data.

Overtime Coefficients (vk)

PC1 PC2 PC3 PC4 PC5

legal 0.156 –0.648 0.660 –0.064 0.339

extra –0.074 0.677 0.607 0.404 0.069

holdover –0.599 –0.286 0.199 0.189 –0.695

COA 0.576 0.114 0.318 –0.466 –0.579

meeting –0.528 0.163 0.233 –0.761 0.247

λ k 2.163 1.146 1.024 0.522 0.144

The results are generally similar, albeit less clearcut. One issue here is that the asymptotic

approximations may not apply as well to the sample correlation matrix R as they do to the

sample covariance matrix S.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 13/20

Scatterplot of scores ?1i and ?2i and scree plot of λ?ks from R

While observation #11 still appears to be outlying, notice that it does not fall very far from the

boundary of the confidence ellipse.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 14/20

T 2-control chart based on PCA on R

Note that only observation #13 appears — and only barely — to be “out-of-control”.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 15/20

Factor analysis: Modelling the covariance structure

Factor analysis (FA) seeks to explain the high correlations between variables within groups of

variables in terms of a small number of latent (i.e. unobservable) random factors.

Orthogonal factor model with m common factors

Let Xp×1 be an observable random vector 3 E(X) =μp×1 and Cov(X) =Σp×p 0. The orthog-

onal factor model assumes that

X = μ+LF +,

where Fm×1 and p×1 are independent 3 m ≤ p, with E(F ) = 0m×1, E() = 0p×1, Cov(F ) = Im

(hence, orthogonal factors), and Cov() = Ψp×p = diag(ψ1, · · · ,ψp) 0. We have

Fk, k = 1, · · · ,m : kth common factor,

ε j, j = 1, · · · , p : jth specific factor,

` jk, j = 1, · · · , p, k = 1, · · · ,m : ( j,k)th factor loading,

ψ j, j = 1, · · · , p : jth “uniqueness” or specific variance.

The matrix Lp×m is the matrix of (common) factor loadings, with ( j,k)th element ` jk representing

variable X j’s loading on common factor Fk.

Note that the (orthogonal) factor model is similar to a regression model, except that the “design

matrix” L is not observable (i.e., unknown).

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 16/20

Structural model for Σ

The orthogonal factor model implies a structural model for Σ =

(

σ j j′

)

p×p given by

Σp×p = LL>+Ψ and Cov(X,F ) = L.

That is, Cov(X j,Fk) = ` jk, ? j,k, and

Var(X j) = σ j j = σ2j =

m

∑

k=1

`2jk +ψ j = h

2

j +ψ j, Cov(X j,X j′ ) = σ j j′ =

m

∑

k=1

` jk` j′k = h j j′ ,

where LL> =

(

h j j′

)

p×p, with diagonals h j j = h

2

j > 0 referred to as the communalities, which are

the variance components due to the factors. A simple example of a factorization of Σ3×3 (i.e.,

p= 3), with m= 1 common factor and L = 13, is

Σ = LL>+Ψ = J3+diag(ψ1,ψ2,ψ3).

Note that while we can always find L for m= p (with Ψ = 0), a “proper solution” (i.e., h j j′ > 0

and ψ j > 0, ? j, j′) may not exist when m p. See Example 9.2 on p. 486 of JW.

Scale invariance. Yp×1 =Cp×pX, with C a diagonal matrix (so that Cov(C) =CΨC> is

also diagonal), also satisfies the orthogonal factor model.

Factor rotation. For orthogonal matrix Γm×m, we have

Σ = LΓ(LΓ)>+Ψ = L?(L?)>+Ψ,

so that “rotating” L preserves the structural model for Σ. Since Γ is not unique, a

particular rotation is chosen based on interpretability of the common factors.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 17/20

Principal components solution

By SDT, we have Σ =VDλV> = ∑pj=1 λ jv jv>j , we have

Σ ≈

m

∑

k=1

λkvkv>k = L?L?

>

,

where

L? =

(√

λ1v1, · · · ,

√

λmvm

)

.

The matrix of communalities Ψ? = diag(ψ?1, · · · , ψ?m) is then taken as

Ψ? = diag

orthogonal matrix with columns the orthonormal eigenvectors corresponding to λ?1, · · · , λ?p. The

residual matrix êS is given by.

The above estimation method is often referred to as the principal factor method.

The sample variance-covariance matrix S may be replaced by the sample correlation matrix R.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 18/20

What should m be?

Deciding the value m is similar to deciding the number of PCs to be retained in PCA. Two common

approaches are as follows:

1 Via Frobenius norm. With the Frobenius norm

‖A‖2F = tr(A>A) = ‖vec(A)‖2,

we have

so that we can choose m 3 ∑pj=m+1 λ? 2j is small.

2 Via total contribution. Since ??h2j = ∑mk′=1 ?`?2jk′ , common factor Fk contributes ?`?2jk to sample

variance S2j of variable X j. Hence, the total contribution of common factor Fk to the total

sample variance tr(S) = ∑pj=1 S2j is

tr(L) ≈ 1 (i.e., ∑mk=1 λ?k ≈ tr(L)).

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 19/20

Maximum likelihood solution

Let X1, · · · ,Xn rs～ Np(μ,Σ) (i.e., Fi and i are MVN random vectors). With Σ =LL>+Ψ, the

likelihood function L(μ,L,Ψ) is given by

L(μ,L,Ψ) ∝ |Σ|np/2 exp

(he MLEs μ?=X, L, and Ψ? maximize L(μ,L,Ψ), subject to the constraint

L>Ψ?1L = ? = diag(δ1, · · · ,δm),

where δ1 > · · ·> δm. This yields an L, which is unique up to the multiplication of its columns by ±1.

Note that the orthogonal factor model is overparameterized, which renders it non-identifiable.

As a remedy, the above constraint is imposed on the model.

Sketch of proof of “uniqueness” of L

If Kp×m 3

LL> = KK>,

then K =LA, for some orthogonal matrix A. By the Singular Value Decomposition Theorem,

it can be shown that

A = A,

whence, A is a diagonal matrix with diagonal elements ±1.

Multivariate Analysis AR de Leon Lecture 3 Fall 2022 20/20