STA303H1: Methods of Data Analysis II
(Lecture 4)
Mohammad Kaviul Anam Khan
PhD (candidate) in Biostatistics
sta303@utoronto.ca
Mohammad Kaviul Anam Khan Data Analysis 2 Lecture 4 1 / 34
To summarize
• In categorical data analysis our outcome (response) is categorical or discrete
• So far we have assumed the covariate (independent variable) is also categorical
• We can also deal with a third variable, and investigate whether that third variable is a
confounder or an interaction variable
• Interaction means that the third variable modifies the effect of our exposure
• What about a fourth, fifth, or sixth variable?
• Real-life data almost always contain a large number of variables
• We learned how to measure association. But what about prediction?
• What about continuous independent variables?
Generalized Linear Models (GLM)
Regression of Binary Variable
• What is regression?
• Let Y be a continuous response and X a continuous covariate
• Then we can assume the relationship E(Y|X) = β0 + β1X
• The residuals Y − E(Y|X) have mean 0 and equal variance (homoscedasticity)
• Now assume Y is a binary variable. Can we still have E(Y|X) = β0 + β1X?
• Since Y = 0, 1, we need 0 < E(Y|X) < 1 ⇒ 0 < β0 + β1X < 1
• This assumption may not hold
• Recall that E(Y|X) = π is a probability for a binary variable
• What if we take the log? That is, log(E(Y|X)) = β0 + β1X
• Since 0 < E(Y|X) < 1, then −∞ < log(E(Y|X)) = β0 + β1X < 0
• Then what should the approach be?
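The problem with fitting a straight line to a binary response can be seen numerically. The following sketch (with made-up data, not from the slides) fits E(Y|X) = β0 + β1X by least squares and shows the fitted values escaping the (0, 1) interval:

```python
import numpy as np

# Toy binary data (made up for illustration): Y jumps from 0 to 1 as X crosses 5.
x = np.arange(11, dtype=float)     # X = 0, 1, ..., 10
y = (x > 5).astype(float)          # Y is 0 or 1

# Fit E(Y|X) = b0 + b1*X by ordinary least squares.
b1, b0 = np.polyfit(x, y, 1)       # polyfit returns highest degree first
fitted = b0 + b1 * x

# The fitted "probabilities" fall outside [0, 1] at the extremes of X.
print(fitted.min())   # negative
print(fitted.max())   # greater than 1
```

This is exactly the violation of 0 < β0 + β1X < 1 described above.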
Regression of Binary Variable
• Recall E (Y |X ) can be defined as a risk, i.e., E (Y |X ) = P(Y = 1|X )
• Due to the restricted range (−∞, 0), the log of the risk is difficult to model
• But what about the odds, Ω = E(Y|X) / (1 − E(Y|X))?
• The log odds then has range (−∞, ∞)
• Thus, we can model
log( E(Y|X) / (1 − E(Y|X)) ) = β0 + β1X
• This is the form of the famous logistic regression
• The log-odds link is also called the ‘logit’ link (What is a link function?)
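A minimal sketch of the logit link and its inverse (function names `logit` and `expit` follow common convention; this is illustrative code, not from the slides), showing that the logit maps probabilities in (0, 1) onto the whole real line and is inverted by the logistic function:

```python
import math

def logit(p):
    """Log-odds: maps a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

def expit(t):
    """Inverse logit (logistic function): maps any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

print(logit(0.5))           # 0.0 -- even odds
print(logit(0.9))           # positive; grows without bound as p -> 1
print(expit(logit(0.25)))   # recovers 0.25 (up to rounding)
```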
Logistic Regression
Logistic Regression
• Continuing our previous discussion, let Y = 0, 1 be a binary outcome, X = 0, 1
our exposure of interest, and Z = 0, 1 a third variable
• Consider the model
log( E(Y|X,Z) / (1 − E(Y|X,Z)) ) = β0 + β1X + β2Z
• When Z = 0, the odds ratio is θ_{Z=0} = exp(β1)
• When Z = 1, the odds ratio is θ_{Z=1} = exp(β1)
• Thus the interpretation of exp(β1) is: when Z is held fixed at a constant value,
the odds ratio between X = 1 and X = 0 is exp(β1)
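The claim that the odds ratio is exp(β1) in every stratum of Z can be checked directly. A small sketch with hypothetical coefficient values (chosen purely for illustration):

```python
import math

# Hypothetical coefficients for the no-interaction model
# logit P(Y=1|X,Z) = b0 + b1*X + b2*Z
b0, b1, b2 = -1.0, 0.8, 0.5

def odds(x, z):
    """Odds of Y = 1 at covariate values (x, z): exp of the linear predictor."""
    return math.exp(b0 + b1 * x + b2 * z)

# Odds ratio comparing X=1 vs X=0, within each stratum of Z.
or_z0 = odds(1, 0) / odds(0, 0)
or_z1 = odds(1, 1) / odds(0, 1)

print(or_z0, or_z1)    # both equal exp(b1): b0 and b2 cancel in the ratio
print(math.exp(b1))
```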
Logistic Regression
• However, when an interaction exists, the model is
log( E(Y|X,Z) / (1 − E(Y|X,Z)) ) = β0 + β1X + β2Z + β12XZ
• When Z = 0, the odds ratio is θ_{Z=0} = exp(β1)
• When Z = 1, the odds ratio is θ_{Z=1} = exp(β1 + β12)
• The odds ratios have to be interpreted separately for each level of Z
• The interpretation of exp(β12) is how much the odds ratio changes with the level
of Z
• Often referred to as the ratio of odds ratios (ratio-in-ratio parameter)
• How do we estimate the βs?
• First, start with linear models
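The ratio-of-odds-ratios interpretation of exp(β12) can also be verified numerically. A sketch with hypothetical coefficients (illustrative values, not estimates from any data):

```python
import math

# Hypothetical coefficients for the interaction model
# logit P(Y=1|X,Z) = b0 + b1*X + b2*Z + b12*X*Z
b0, b1, b2, b12 = -1.0, 0.8, 0.5, 0.4

def odds(x, z):
    """Odds of Y = 1 at covariate values (x, z)."""
    return math.exp(b0 + b1 * x + b2 * z + b12 * x * z)

or_z0 = odds(1, 0) / odds(0, 0)   # equals exp(b1)
or_z1 = odds(1, 1) / odds(0, 1)   # equals exp(b1 + b12)

# The ratio of odds ratios recovers exp(b12).
print(or_z1 / or_z0)
print(math.exp(b12))
```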
Least Squares Estimates
Linear Models
• Assume the following linear regression model,
Y = Xβ + ϵ
• Here, Y is an n × 1 vector of responses, X is an n × p matrix of covariates, and β is a p × 1
vector of regression coefficients
• To estimate β, the target is to minimize ϵ^T ϵ w.r.t. β, i.e.,
β̂ = argmin_β (Y − Xβ)^T (Y − Xβ)
This is called the ordinary least squares (OLS) criterion
• The least squares estimates are β̂ = (X^T X)^{-1} X^T Y
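The closed-form estimate β̂ = (X^T X)^{-1} X^T Y can be computed in a few lines. A sketch on simulated data (the true coefficients and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n = 50 observations, p = 3 columns (intercept + 2 covariates).
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X^T X) beta = X^T y.
# np.linalg.solve is numerically preferable to explicitly inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)   # close to beta_true
```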
Linear Models
• Recall that OLS produces the same point estimates as the MLE under the assumption
Y ∼ N(Xβ, σ²I), which is equivalent to assuming ϵ ∼ N(0, σ²I), where I is the n × n
identity matrix. The variances of the estimates are different (Gauss–Markov
assumptions)
• However, OLS cannot be used for Generalized Linear Models (GLMs),
since in most cases Y is not continuous
• Thus, estimation for GLMs is carried out by MLE
• But before studying the estimation procedures we need to understand a few
related topics, such as the link function and exponential families
• That is the goal of this lecture
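As a preview of where MLE for GLMs leads, the logistic log-likelihood can be maximized by Newton–Raphson. The sketch below (toy, made-up data; not the lecture's own example) iterates the Newton update and checks that the score X^T(y − p) vanishes at the MLE:

```python
import numpy as np

# Toy, non-separable binary data (made up for illustration).
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0., 0., 1., 0., 1., 1.])
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept

beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
    W = p * (1.0 - p)                       # Bernoulli variance weights
    grad = X.T @ (y - p)                    # score (gradient of log-likelihood)
    hess = X.T @ (W[:, None] * X)           # Fisher information matrix
    beta = beta + np.linalg.solve(hess, grad)

# Recompute probabilities at the converged beta; the score is numerically zero.
p = 1.0 / (1.0 + np.exp(-(X @ beta)))
print(beta)
print(X.T @ (y - p))
```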
Exponential Families
Exponential Families
• Let Y ∼ fY(y; θ, ϕ). If fY falls into the exponential family, then it can be written as
fY(y; θ, ϕ) = exp[ (yθ − b(θ)) / a(ϕ) + c(y, ϕ) ]
• Here,
• θ is the canonical (natural) parameter
• a(·), b(·), and c(·) are known functions
• ϕ is a dispersion parameter
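As a quick check of this definition (an example not on this slide), the Bernoulli distribution fits the form, and its canonical parameter turns out to be exactly the logit:

```latex
f_Y(y;\pi) = \pi^{y}(1-\pi)^{1-y}
           = \exp\!\left[\, y \log\frac{\pi}{1-\pi} + \log(1-\pi) \right],
\qquad y \in \{0, 1\},
```

so θ = log(π / (1 − π)), b(θ) = log(1 + e^θ), a(ϕ) = 1, and c(y, ϕ) = 0. This is why the logit is called the canonical link for binary data.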
Normal Distribution
For the normal distribution we know,
fY(y; µ, σ²) = (1/√(2πσ²)) exp( −(y − µ)² / (2σ²) )
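Expanding the square and regrouping identifies the exponential-family components of the normal density (a sketch of the standard algebra):

```latex
f_Y(y;\mu,\sigma^2)
  = \exp\!\left[ \frac{y\mu - \mu^2/2}{\sigma^2}
      - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right],
```

so θ = µ, b(θ) = θ²/2, a(ϕ) = ϕ = σ², and c(y, ϕ) = −y²/(2ϕ) − (1/2)log(2πϕ).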