STAT 525辅导、C/C++程序语言辅导

2022.11.26 - 首页 >> Algorithm 算法

Multivariate methods for scale matrices
STAT 525/625
Multivariate Analysis
A. R. de Leon
Department of Mathematics and Statistics
University of Calgary
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 1/20
Principal components analysis: Motivation
Principal components analysis (PCA) is a statistical methodology often used for dimension
reduction. Given a random vector Xp×1 = (X1, · · · ,Xn)>, PCA seeks a small number p′ p (i.e.,
p′ much less than p) of linear combinations
Yk = `>k X =
p
∑
j=1
`k jX j
of the RVs X1, · · · ,Xp in X 3 the linear combinations Y1, · · · ,Yp′
1 explain “most” of the variation in X; and
2 are uncorrelated with each other (i.e., Cov(Yk,Yk′ ) = 0, ?k 6= k′, for k,k′ = 1, · · · , p′).
Application to regression analysis
Given predictors X1, · · · ,X200, where the first 100 predictors X1, · · · ,X100 pertain to Industry 1 and
the remaining 100 predictors X101, · · · ,X200 to Industry 2, suppose the first 2 principal components
(PCs) are based on the coefficient vectors `1 = 1√2001200 and `2 =
1√
200
(
1>100,?1>100)>. These
suggest the following transformed predictors:
X?1 =
1
200
1>200X and X?2 =
1
100
(
1>100,?1>100
)
X,
so that X?1 may be interpreted as an average over industries and X?2 as the difference in industry
averages, so that p′ = 2 200 = p. Because PCs are uncorrelated, PCA is one remedy for the
problem of multicollinearity in regression analysis.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 2/20
Population principal components from Cov(X) =Σ
Let μ= E(X) and Σ=Cov(X) 0. The kth principal component (PC) Yk = `>k X, with ‖`k‖= 1,
is that linear combination 3 Var(Yk)≥Var(Yk′ ) and Cov(Yk,Yk′ ) = 0, ?k> k′, for k,k′ = 1, · · · , p. By
the Spectral Decomposition Theorem (SDT; see Equation (2–18) on p. 61 of JW), we have
Σ = VDλV>, where Dλ = diag(λ1, · · · ,λp), 3 λ1 ≥ ·· · ≥ λp > 0 are the ordered eigenvalues of
Σ, and V = (v1, · · · ,vp) is the orthogonal matrix with columns the corresponding orthonormal
eigenvectors v1, · · · ,vp (i.e., v j ⊥ v j′ , ? j 6= j′, and ‖v j‖= 1, ? j) of Σ.
For Yk = `>k X, we have
Var(Yk) = `>k Σ`k =
p
∑
j=1
α2k jλ j ≥ λ1,
where αk = (αk1, · · · ,αkp)> =V>`k and ‖αk‖= 1, k= 1, · · · , p. For k= 1, we see that Var(Y1) = λ1
iff α1 = (1,0, · · · ,0)>, so that we get `1 =Vα1 = v1. Hence, the first population PC Y1 is given
by
Y1 = v>1 X and Var(Y1) = λ1.
For k = 2, we have the second population PC Y2 = `>2X 3 Cov(Y1,Y2) = `>2 Σv1 = λ1α21, so that
α2 = (0,α22, · · · ,α2p)> and ‖α2‖2 = ∑pj=2α22 j. In addition, we want α2 to maximize
Var(Y2) =
p
∑
j=2
α22 jλ j ≤ λ2,
and clearly, Var(Y2) = λ2 iff α2 = (0,1,0, · · · ,0)>. Hence, we get `2 =Vα2 = v2, and the second
population PC is given by Y2 = v>2 X and Var(Y2) = λ2.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 3/20
Properties of population principal components
Population principal components Y1, · · · ,Yp
The kth population PC Yk from Σ is given by Yk = v>k X, 3 Var(Yk) = λk and Cov(Yk,Yk′ ) = 0,
?k′ 6= k, where vk is the orthonormal eigenvector corresponding to the kth largest eigenvalue λk
of Σ, for k = 1, · · · , p.
1 Information in X. Observe that
Yp×1 =
??? Y1...
Yp
??? = V>X
is an orthogonal transformation of X, so that X =VY . That is, Y contains the same
information as X does, but we hope that the first few PCs, say, p′ p, would suffice. We
can choose p′ 3 the total variance Var(1>pY ) = ∑pk=1Var(Yk) = tr(Σ) = ∑pk=1 λk ≈ ∑p
′
k=1 λk.
2 Correlation between Yk and X j. For vk = (vk1, · · · ,vkp)>, we have
Corr(X j,Yk) = 1σ j vk j
√
λk 3 ∑pj=1
(
Corr(X j,Yk)
)2
= 1,
for k, j = 1, · · · , p, where Var(X j) = σ2j is the jth diagonal element of Σ.
3 Normality of Yk. If X ～ Np(μ,Σ), then Y ～ Np(V>μ,Dλ ) (i.e., Yk ind～ N(v>k μ,λk), for
k = 1, · · · , p).
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 4/20
Population principal components from Corr(X) =P
When the variables X1, · · · ,Xp are in different units, or if they have highly disparate variances,
their linear combinations may prove to be difficult to interpret. That is, the population PCs
Y1, · · · ,Yp from Σ may not be practically meaningful! A common remedy in applications is to
work, not with X, but with its “standardized” version Zp×1 given by
Z = D?1/2σ
(
X?μ),
where Dσ = diag(σ21 , · · · ,σ2p), with σ21 , · · · ,σ2p the diagonal elements of Σ. Note that
Cov(Z) = D?1/2σ ΣD
?1/2
σ = Corr(X) = P,
where P is the correlation matrix of X 3 tr(P) = p.
Population principal components Y?1, · · · ,Y?p
The kth population PC Y?k from P is given by Y?k = v?>k Z, 3 Var(Y?k) = λ?k and Cov(Y?k,Y?k′ ) = 0,
?k′ 6= k, where v?k is the orthonormal eigenvector corresponding to the kth largest eigenvalue λ?k
of P, for k = 1, · · · , p.
Note that the population PCs are not invariant to the choice of scale matrix. That is, it is
possible that the population PCs Y1, · · · ,Yp from Σ are very different from the population PCs
Y?1, · · · ,Y?p from P. See Example 8.2 on p. 437 of JW for an illustration.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 5/20
Sample principal components
LetX1, · · · ,Xn rs～Np(μ,Σ), with Σ 0. Since S 0 wp 1, we have by SDT that S= V?D?λ V?
>, where
D?λ = diag(λ?1, · · · , λ?p), with λ?1 ≥ ·· · ≥ λ?p > 0 the ordered eigenvalues of S and V? = (v?1, · · · , v?p) the
orthogonal matrix with columns the corresponding orthonormal eigenvectors v?1, · · · , v?p.
For k = 1, let ?1 =
(
?11, · · · ,?1n
)> 3 ?1i = ?`>1Xi, for i= 1, · · · ,n, where ?`1 satisfies the following:
1 ‖ ?`1‖= 1; and
2 ?`1 = v?1 maximizes the “sample variance” ?`>1 S ?`1, 3 ?`>1 S ?`1∣∣ ?`1=v?1 = λ?1.
We thus have ?1 =Xn×pv?1, the n× 1 vector of the scores on first sample PC (i.e., score ?1i is
“sample observation” i on Y1), with X the data matrix. Alternatively, the data matrix X can
be standardized first as Zn×p = (Z1, · · · ,Zn)> =
(
X?1n?X>
)
diag
(
1
S1
, · · · , 1Sp
)
, so that we get
the sample correlation matrix R (based on X) given by R = 1n?2Z
>Z = SZ , the sample variance-
covariance matrix of Z1, · · · ,Zn (i.e., based on Z), since Z = 1n ∑ni=1Zi = 0p×1. The vector
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 6/20
Stock prices data (Table 8.4, p. 473 of JW)
Data on weekly rates of return of stocks of JP Morgan, Citibank, Wells Fargo, (Royal Dutch)
Shell, and Exxon (Mobil). The R function prcomp() was used to obtain the following results:
Table 1. The estimated coefficient vectors v?k and sample variances λ?k (i.e., kth
eigenvalue-eigenvector pair of S) of scores for the kth sample PC, for k = 1, · · · ,5, from
the sample variance-covariance matrix S of the stock prices data.
Stock Coefficients (v?k)
PC1 PC2 PC3 PC4 PC5
JP Morgan –0.469 0.368 –0.604 –0.363 0.384
Citibank –0.532 0.236 –0.136 –0.629 –0.496
Wells Fargo –0.465 0.315 0.772 0.289 0.071
Shell –0.387 –0.585 0.093 –0.381 0.595
Exxon –0.361 –0.606 –0.109 0.493 –0.498
λ?k 2.437 1.406 0.499 0.399 0.255
For i= 1, · · · ,103, we get the scores for the first and second sample PCs as
?1i = ?0.469Xi1?0.532Xi2?0.465Xi3?0.387Xi4?0.361Xi5,
?2i = 0.368Xi1+0.236Xi2+0.315Xi3?0.585Xi4?0.605Xi5.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 7/20
Scatterplot of scores ?1i and ?2i and scree plot of λ?ks
Note that ?1i
a∝ 151
>
5Xi, so that sample PC1 can be interpreted as the stock average while sample
PC2 can be viewed as a contrast between financial and oil stocks.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 8/20
Normality of scores ?1i and ?2i for PC1 and PC2
We have √n(λ??λ) D→ Np(0p×1,2D2λ ), as n→+∞, where λ= (λ1, · · · ,λp)> and λ?= (λ?1, · · · , λ?p)>.
Using the delta method, we get log
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 9/20
Police overtime data (Table 5.8, p. 240 of JW)
Data on overtime hours for the police department are classified according to “court appear-
ances” (legal), “extraordinary appearances” (extra), “holdover hours” (holdover), “COA hours”
(COA), and “meeting hours” (meeting). A PCA on the centred data yilds the following:
Table 2. The estimated coefficient vectors v?k and sample variances λ?k (i.e., kth
eigenvalue-eigenvector pair of S) of scores for the kth sample PC, for k = 1, · · · ,5, from
the sample variance-covariance matrix S of the centred police overtime data.
Overtime Coefficients (v?k)
PC1 PC2 PC3 PC4 PC5
legal 0.046 –0.048 0.629 –0.643 0.432
extra 0.039 0.985 –0.077 –0.151 –0.007
holdover –0.658 0.107 0.582 0.250 –0.392
COA 0.734 0.069 0.503 0.397 –0.213
meeting –0.155 0.107 0.081 0.586 0.784√
λ?k 1664.4 1195.5 792.5 470.2 315.9
Under multivariate normality of X, the centred PCs Yk = v>k (X?μ)～ N(0,λk), for k = 1, · · · , p,
so that ∑p
′
k=1
1
λ?k
? 2ki
a～ χ2p′ , provided n is large.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 10/20
Scatterplot of scores ?1i and ?2i and scree plot of λ?ks from centred S
Observation #11 had an unusually high “extra” value, hence, an unusually high value of ?2,11,
which is dominated by “extra”.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 11/20
T 2-control chart based on PCA on centred S
The sum ∑pk=p′+1 of squares of the last few standardized PC scores can be used to identify
out-of-control observations. Note that observations #12 and #13 fall outside or are near the
upper control limit.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 12/20
PCA on sample correlation matrix R of police overtime data
A PCA on the sample correlation matrix R of the police overtime data yields the following:
Table 3. The estimated coefficient vectors ??vk and sample variances ??λ k (i.e., kth
eigenvalue-eigenvector pair of R) of scores for the kth sample PC, for k = 1, · · · ,5, from
the sample correlation matrix R of the police overtime data.
Overtime Coefficients (vk)
PC1 PC2 PC3 PC4 PC5
legal 0.156 –0.648 0.660 –0.064 0.339
extra –0.074 0.677 0.607 0.404 0.069
holdover –0.599 –0.286 0.199 0.189 –0.695
COA 0.576 0.114 0.318 –0.466 –0.579
meeting –0.528 0.163 0.233 –0.761 0.247
λ k 2.163 1.146 1.024 0.522 0.144
The results are generally similar, albeit less clearcut. One issue here is that the asymptotic
approximations may not apply as well to the sample correlation matrix R as they do to the
sample covariance matrix S.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 13/20
Scatterplot of scores ?1i and ?2i and scree plot of λ?ks from R
While observation #11 still appears to be outlying, notice that it does not fall very far from the
boundary of the confidence ellipse.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 14/20
T 2-control chart based on PCA on R
Note that only observation #13 appears — and only barely — to be “out-of-control”.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 15/20
Factor analysis: Modelling the covariance structure
Factor analysis (FA) seeks to explain the high correlations between variables within groups of
variables in terms of a small number of latent (i.e. unobservable) random factors.
Orthogonal factor model with m common factors
Let Xp×1 be an observable random vector 3 E(X) =μp×1 and Cov(X) =Σp×p 0. The orthog-
onal factor model assumes that
X = μ+LF +,
where Fm×1 and p×1 are independent 3 m ≤ p, with E(F ) = 0m×1, E() = 0p×1, Cov(F ) = Im
(hence, orthogonal factors), and Cov() = Ψp×p = diag(ψ1, · · · ,ψp) 0. We have
Fk, k = 1, · · · ,m : kth common factor,
ε j, j = 1, · · · , p : jth specific factor,
` jk, j = 1, · · · , p, k = 1, · · · ,m : ( j,k)th factor loading,
ψ j, j = 1, · · · , p : jth “uniqueness” or specific variance.
The matrix Lp×m is the matrix of (common) factor loadings, with ( j,k)th element ` jk representing
variable X j’s loading on common factor Fk.
Note that the (orthogonal) factor model is similar to a regression model, except that the “design
matrix” L is not observable (i.e., unknown).
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 16/20
Structural model for Σ
The orthogonal factor model implies a structural model for Σ =
(
σ j j′
)
p×p given by
Σp×p = LL>+Ψ and Cov(X,F ) = L.
That is, Cov(X j,Fk) = ` jk, ? j,k, and
Var(X j) = σ j j = σ2j =
m
∑
k=1
`2jk +ψ j = h
2
j +ψ j, Cov(X j,X j′ ) = σ j j′ =
m
∑
k=1
` jk` j′k = h j j′ ,
where LL> =
(
h j j′
)
p×p, with diagonals h j j = h
2
j > 0 referred to as the communalities, which are
the variance components due to the factors. A simple example of a factorization of Σ3×3 (i.e.,
p= 3), with m= 1 common factor and L = 13, is
Σ = LL>+Ψ = J3+diag(ψ1,ψ2,ψ3).
Note that while we can always find L for m= p (with Ψ = 0), a “proper solution” (i.e., h j j′ > 0
and ψ j > 0, ? j, j′) may not exist when m p. See Example 9.2 on p. 486 of JW.
Scale invariance. Yp×1 =Cp×pX, with C a diagonal matrix (so that Cov(C) =CΨC> is
also diagonal), also satisfies the orthogonal factor model.
Factor rotation. For orthogonal matrix Γm×m, we have
Σ = LΓ(LΓ)>+Ψ = L?(L?)>+Ψ,
so that “rotating” L preserves the structural model for Σ. Since Γ is not unique, a
particular rotation is chosen based on interpretability of the common factors.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 17/20
Principal components solution
By SDT, we have Σ =VDλV> = ∑pj=1 λ jv jv>j , we have
Σ ≈
m
∑
k=1
λkvkv>k = L?L?
>
,
where
L? =
(√
λ1v1, · · · ,
√
λmvm
)
.
The matrix of communalities Ψ? = diag(ψ?1, · · · , ψ?m) is then taken as
Ψ? = diag
orthogonal matrix with columns the orthonormal eigenvectors corresponding to λ?1, · · · , λ?p. The
residual matrix êS is given by.
The above estimation method is often referred to as the principal factor method.
The sample variance-covariance matrix S may be replaced by the sample correlation matrix R.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 18/20
What should m be?
Deciding the value m is similar to deciding the number of PCs to be retained in PCA. Two common
approaches are as follows:
1 Via Frobenius norm. With the Frobenius norm
‖A‖2F = tr(A>A) = ‖vec(A)‖2,
we have
so that we can choose m 3 ∑pj=m+1 λ? 2j is small.
2 Via total contribution. Since ??h2j = ∑mk′=1 ?`?2jk′ , common factor Fk contributes ?`?2jk to sample
variance S2j of variable X j. Hence, the total contribution of common factor Fk to the total
sample variance tr(S) = ∑pj=1 S2j is
tr(L) ≈ 1 (i.e., ∑mk=1 λ?k ≈ tr(L)).
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 19/20
Maximum likelihood solution
Let X1, · · · ,Xn rs～ Np(μ,Σ) (i.e., Fi and i are MVN random vectors). With Σ =LL>+Ψ, the
likelihood function L(μ,L,Ψ) is given by
L(μ,L,Ψ) ∝ |Σ|np/2 exp
(he MLEs μ?=X, L, and Ψ? maximize L(μ,L,Ψ), subject to the constraint
L>Ψ?1L = ? = diag(δ1, · · · ,δm),
where δ1 > · · ·> δm. This yields an L, which is unique up to the multiplication of its columns by ±1.
Note that the orthogonal factor model is overparameterized, which renders it non-identifiable.
As a remedy, the above constraint is imposed on the model.
Sketch of proof of “uniqueness” of L
If Kp×m 3
LL> = KK>,
then K =LA, for some orthogonal matrix A. By the Singular Value Decomposition Theorem,
it can be shown that
A = A,
whence, A is a diagonal matrix with diagonal elements ±1.
Multivariate Analysis AR de Leon Lecture 3 Fall 2022 20/20