MATH5885 Longitudinal Data Analysis
Term 2, 2022
Project
Due 23:59, Sunday, 31st July (end of Week 9) via Moodle.
The project should be submitted via the Assignment tool, which is accessible via a clearly
indicated link in the Assessments subfolder on Moodle. You are allowed to work in pairs if you
wish. In that case, only one of the group members should submit the PDF file on Moodle,
with the names of both students clearly indicated and signed on the first page of the
document. The submitting student should add a cover page containing a copy of their student ID
card (or passport page if an ID card is not available), write in their own handwriting:
“I declare that this assignment is my own work, except where acknowledged and I have read and
understood the University rules regarding Academic Misconduct”, and sign it.
You must upload ONE PDF file containing all your working. All the R material should be at
the back of the PDF file and be titled “Appendix”. Please include sufficient working, computer
code (adequately documented and commented) and output (adequately explained) so that I can follow
what you have done. As George Box observed, “all models are wrong but some are
useful”, so I do not expect any two submitted projects to be identical.
Please note that there are page limitations for the MAIN PART of the report:
a maximum of 12 pages, typed in minimum 12 pt font, with single line spacing, minimum 2.5 cm margins
and single-sided printing. The main part should include mathematical summaries of the models fitted,
essential R code and output only, any essential tabular and graphical output with a narrative about
how you arrived at key modelling decisions, and your summary of findings or conclusions. You should
also describe any model deficiencies and suggest possible remedies. Further details are given below.
There are no page limitations for the appendix part of the report, which should contain the complete
R code and any additional graphs and tables, properly labelled so that the main report can cross-reference
them and so that I can quickly locate the relevant R code and additional tables and graphs
should that be needed. This is NOT a de facto extension to your report. Your Part 1 Report should
stand on its own and be readable without reference to the Appendix.
If you are not skilled at producing typeset reports, then neatly handwritten reports are acceptable,
provided the specifications on font size, margins, line spacing etc. described above are reasonably
conformed to.
1 Project Background and Data
The project uses the CD4 dataset from DHLZ, introduced in Week 2. Please download the attached text
file cd4data.txt to use the data for your analysis. Any of the explanatory variables included
in the dataset may be considered for inclusion in your model, as well as functions of time. The response
variable is CD4+ cell count, but you may also wish to consider transformations of the response. Basic
background information is available in the following documents:
1. DHLZ-CD4-BasicDataAnalysis.pdf, which contains some basic data analysis from Diggle et al.
2. ZegerDiggle-1994-Biometrics.pdf, which gives a published journal article using this dataset and
explains the variables observed in the study — see in particular their Section 5 for details.
The dataset consists of longitudinally collected observations on 369 subjects, resulting in a total of
2376 observations of CD4 cell counts, denoted CD4 in the dataset. Other variables collected are:
1. Time: the time (in years) since seroconversion, where a negative time denotes time
before seroconversion.
2. Age: age at seroconversion (a baseline measurement), centred at 30 years of age, so that negative
ages denote years younger than 30.
3. Packs: the number of packets of cigarettes smoked per day at time of measurement.
4. Drugs: a binary variable taking the values 1 or 0 to denote if the respondent takes recreational
drugs or not respectively, measured at each time point.
5. Sex: number of sexual partners reported at each time point. This variable appears to have been
centred in some way and truncated at ±5.
6. Cesd: an index of depression measured at each time point, with time trends removed. Higher
scores indicate greater depressive symptoms.
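As a rough sketch of a first look at these variables, the R snippet below builds a small synthetic stand-in with the same variable names, adds the square-root response and counts the measurements per subject. The column names, the subject identifier ID and the simulated values are all assumptions made for illustration; the actual layout of cd4data.txt should be checked before adapting it. With the real file, the simulated block would be replaced by a call such as read.table("cd4data.txt", header = TRUE), assuming the file has a header row.

```r
# Synthetic stand-in for cd4data.txt; column names (including ID) are assumed.
set.seed(1)
n_subj <- 20
n_obs  <- sample(3:8, n_subj, replace = TRUE)    # unbalanced number of visits
n_tot  <- sum(n_obs)
cd4 <- data.frame(
  ID    = rep(seq_len(n_subj), times = n_obs),
  Time  = unlist(lapply(n_obs, function(k) sort(runif(k, -3, 5)))),
  Age   = rep(round(rnorm(n_subj, 0, 7), 1), times = n_obs),  # baseline covariate
  Packs = sample(0:3, n_tot, replace = TRUE),
  Drugs = rbinom(n_tot, 1, 0.7),
  Sex   = sample(-5:5, n_tot, replace = TRUE),
  Cesd  = round(rnorm(n_tot, 0, 8))
)
# Simulated counts: decline after seroconversion (Time > 0), kept positive.
cd4$CD4 <- pmax(round(900 - 120 * pmax(cd4$Time, 0) + rnorm(n_tot, 0, 150)), 10)

cd4$sqrtCD4 <- sqrt(cd4$CD4)     # square-root response, as in Zeger and Diggle
visits <- table(cd4$ID)          # number of measurements per subject
summary(as.vector(visits))       # spread of visit counts across subjects
str(cd4)                         # check variable types before modelling
```

The visit counts are worth inspecting early, since later steps of the project ask for subjects with 7 or 8 measurements.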
Zeger and Diggle (1994) suggest (Section 5):
“The first objective of this analysis is to characterize the population average time course
of CD4 decay while accounting for the following additional predictor variables: smoking
(packs per day); recreational drug use (yes or no); numbers of sexual partners; and depression
symptoms as measured by the CESD scale (larger values indicate increased depressive
symptoms). The analysis was conducted on square-root-transformed CD4 numbers whose
distribution is more nearly Gaussian”
Later they state:
“The linear regression coefficients (standard errors in parentheses) for the covariates age
at seroconversion (years), packs of cigarettes, recreational drug use (0: no, 1: yes), number
of sexual partners, and depression score are: .037 (.18), .27 (.15), .37 (.31), .10 (.038),
and -.058 (.015), respectively. Age plays little role. Smoking, recreational drug use, and
increased numbers of sexual partners are associated with higher CD4 cell numbers. This may
reflect immune response stimulation or simply selection bias whereby healthier men choose
to continue these practices. Increased depressive symptoms are significantly associated with
decreased CD4 levels. Again, a causal direction cannot be inferred from this analysis.”
These estimated regression coefficients seem to be those obtained by least squares in a model in
which (page 694): “μ(t) was approximated by a knotted cubic spline with seven equally spaced knots.”
Note that the model of Zeger and Diggle uses the square root of the CD4 cell counts as the response
variable and the other available variables as covariates. However, as they rightly point out, these other
variables cannot be inferred to cause the level of CD4 cell counts.
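A least-squares fit of this general form can be sketched in R as follows. This is only an illustration on synthetic data, not a reproduction of the published analysis: the column names, the simulated values, the use of a natural cubic spline from the splines package and the placement of seven equally spaced interior knots over the observed time range are all assumptions.

```r
library(splines)

# Synthetic stand-in data; with the real file, use the actual data frame instead.
set.seed(2)
n <- 400
dat <- data.frame(
  ID    = rep(1:50, each = 8),
  Time  = runif(n, -3, 5),
  Age   = rep(rnorm(50, 0, 7), each = 8),
  Packs = sample(0:3, n, replace = TRUE),
  Drugs = rbinom(n, 1, 0.7),
  Sex   = sample(-5:5, n, replace = TRUE),
  Cesd  = rnorm(n, 0, 8)
)
dat$sqrtCD4 <- 30 - 2.5 * pmax(dat$Time, 0) - 0.05 * dat$Cesd + rnorm(n, 0, 4)

# Seven equally spaced interior knots over the observed time range (assumed placement).
knots7 <- seq(min(dat$Time), max(dat$Time), length.out = 9)[2:8]

# Ordinary least squares: spline in time plus linear covariate effects.
fit_ols <- lm(sqrtCD4 ~ ns(Time, knots = knots7) + Age + Packs + Drugs + Sex + Cesd,
              data = dat)
round(coef(summary(fit_ols))[c("Age", "Packs", "Drugs", "Sex", "Cesd"), ], 3)

dat$res <- resid(fit_ols)   # residuals, for later correlation or variogram analysis
```

The standard errors from lm() treat all observations as independent, so they ignore the within-subject correlation that the rest of the project sets out to model.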
Available on Moodle is a document CD4InitialAnalysis.pdf, along with an accompanying R
script file called CD4InitialAnalysis.R. These provide some preliminary exploratory data analysis and
an attempt to reproduce various results reported in Zeger and Diggle. As is often the case in scientific
papers, there is insufficient detail available to allow exact reproduction of the findings. In
particular, the point estimates and standard errors reported by Zeger and Diggle cannot be reproduced
despite best efforts to do so.
As a starting point, you should work through the R script file CD4InitialAnalysis.R to ensure you
understand what each part of it does. Then you should undertake your own analysis for the project
as described in the next section.
2 Project Aims
The aim of the project is to determine a suitable model for the square root of CD4 cell counts as the
response variable with covariates time (suitably modelled), age, cigarettes, CESD score, drug use and
partners.
You should proceed as follows:
1. Using and adapting the techniques introduced in the course and in the above R script, perform
exploratory data analysis for the dataset in order to explore the mean structure, including the
impacts of the various covariates on the mean response, and to explore the covariance structure
of the random part of the model.
For example, this will include plots of individual and average profiles across time (possibly stratified
by levels of the other covariates), investigation of covariance structure, and any other analyses
you feel are relevant. Choose two or three preliminary fixed effects structures based on this analysis.
In particular, you might want to model the response to time as a combination of linear or other
functions over segments of time. The model based on natural splines is provided as a starting
point to flexibly model the temporal trend in mean response, but it may be possible to simplify
this; that is up to you!
2. Fit these preliminary models using linear regression, comment on the significance of the regression
coefficients and obtain the residuals from these models.
3. You should consider possible components in the models for the covariance structure including
compound symmetry, unequal variances, random error, exponential or Gaussian autocorrelation
decay. Use correlation and/or variogram analysis to propose possible models for the covariance
of the residuals and any random effects components you may wish to include in the regression
specification. Compare your alternative models using appropriate statistical model fit criteria and
hypothesis tests. Select the “best” covariance model based on your analysis.
4. Consider whether your preliminary fixed effects structure needs to be adjusted in light of the
chosen covariance model and refit the adjusted model. State your conclusions.
5. Obtain the estimated covariance and correlation matrices for a selected patient with 7 or 8 measurements
spanning (roughly evenly) time 0. Discuss how the variances vary with time, and how
the correlations vary with the time between measurements.
6. Select four patients with 7 or 8 measurements spanning time 0. Try to select a range of patients
responding “high”, “medium” and “low” initially and over time. Use BLUPs to estimate the
individual trajectories for these patients and plot them on the same graph, along with their
observed levels of CD4 cell counts.
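The mechanics of steps 3, 5 and 6 can be sketched with the nlme package as below. This is a minimal illustration on synthetic data, not a recommended final model: the column names, the broken-stick time term TimePos, the simulated random intercepts and slopes, and the choice of comparing a random-intercept model against a random-intercept-and-slope model are all assumptions made for the example.

```r
library(nlme)

# Synthetic stand-in data (hypothetical column names); replace with the real data frame.
set.seed(3)
n_subj <- 80; n_per <- 7
dat <- data.frame(
  ID   = factor(rep(seq_len(n_subj), each = n_per)),
  Time = rep(seq(-1.5, 1.5, length.out = n_per), n_subj) +
         runif(n_subj * n_per, -0.2, 0.2)
)
dat$TimePos <- pmax(dat$Time, 0)             # broken-stick term: decline after time 0
b0 <- rnorm(n_subj, 0, 3)                    # simulated random intercepts
b1 <- rnorm(n_subj, 0, 1.5)                  # simulated random slopes in Time
i  <- as.integer(dat$ID)
dat$sqrtCD4 <- 28 - 2 * dat$TimePos + b0[i] + b1[i] * dat$Time +
               rnorm(nrow(dat), 0, 2)

# Two candidate covariance models with the same fixed effects, both fitted by REML.
m_ri <- lme(sqrtCD4 ~ TimePos, random = ~ 1 | ID,    data = dat)  # random intercept
m_rs <- lme(sqrtCD4 ~ TimePos, random = ~ Time | ID, data = dat)  # plus random slope
anova(m_ri, m_rs)                            # AIC, BIC and likelihood ratio test

# Step 5: marginal covariance and correlation matrices for one subject.
V <- unclass(getVarCov(m_rs, individuals = "1", type = "marginal")[[1]])
R <- cov2cor(V)
round(diag(V), 2)                            # variances at that subject's visit times

# Step 6: BLUP-based trajectories for four subjects spread across the
# predicted random intercepts (lowest, two intermediate, highest).
re  <- ranef(m_rs)
ids <- rownames(re)[order(re[["(Intercept)"]])][c(1, 27, 54, 80)]
dat$blup <- as.numeric(fitted(m_rs, level = 1))  # fixed part + predicted random effects
sub <- dat[dat$ID %in% ids, ]
plot(sub$Time, sub$sqrtCD4, col = match(as.character(sub$ID), ids), pch = 16,
     xlab = "Time since seroconversion (years)", ylab = "sqrt(CD4)")
for (k in seq_along(ids)) {
  s <- sub[sub$ID == ids[k], ]
  lines(s$Time, s$blup, col = k)             # subject-specific fitted trajectory
}
```

Serial correlation can be added to such a model through the correlation argument of lme(), for example corExp(form = ~ Time | ID, nugget = TRUE), and nlme also provides Variogram() for residual diagnostics; convergence of these richer fits is not guaranteed. Likelihood ratio tests on variance components sit on the boundary of the parameter space, so the p-value reported by anova() tends to be conservative.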
3 Your report
Write up a detailed report on your analysis. You should include:
Section 1: Introduction. A very brief summary of the situation, the data and the objectives of your
analysis and report.
Section 2: Exploratory data analysis. Briefly describe the exploratory data analysis and summarize
its results, including relevant graphical output.
Section 3: Model formulation. This is the major section, summarizing the steps taken and models tried
in arriving at your final model.
Describe and justify your model selection procedure, saying why you chose to fit the models
you did.
Explain why you prefer the model for fixed effects and error structure you ended up choosing.
Formulate a model for the random errors in terms of random effects, serial dependence and
pure noise.
Write down the final fitted model for the mean response, including standard errors and a
discussion of the significance of covariates.
Discuss the effect of the explanatory variables on the response.
Discuss the main features of the covariance structure.
Discuss the properties of the residuals in the model and any impact these may have on
inferences you make about model fit and significance of model terms.
Section 4: Application to individual trajectories. Include the results of the analyses specified in items
5 and 6 of the Project Aims.
Section 5: Discussion of modelling. Discuss the difficulties you encountered with the analysis, and the
limitations of your model (if any).
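For the decomposition of the random errors requested in Section 3, one common way to write it, following the general framework of Diggle et al., is sketched below. The symbols $U_i$, $W_i$, $Z_{ij}$, $\nu^2$, $\sigma^2$, $\tau^2$, $\rho$ and $\phi$ are notation assumed here for illustration; they do not come from the project brief, and the final model need not contain all three components.

```latex
% Response for subject i at measurement time t_{ij}
Y_{ij} = x_{ij}^{\top}\beta + U_i + W_i(t_{ij}) + Z_{ij},
\qquad
U_i \sim N(0, \nu^2), \quad Z_{ij} \sim N(0, \tau^2),

% W_i(t): stationary Gaussian process with variance \sigma^2 and correlation \rho(u)
\operatorname{Cov}(Y_{ij}, Y_{ik})
  = \nu^2 + \sigma^2 \rho(|t_{ij} - t_{ik}|) + \tau^2 \, \mathbf{1}\{j = k\},

% Implied variogram at lag u > 0
\gamma(u) = \tau^2 + \sigma^2 \{ 1 - \rho(u) \},

% Example correlation functions: exponential and Gaussian
\rho(u) = \exp(-\phi u)
\quad \text{or} \quad
\rho(u) = \exp(-\phi u^2), \qquad \phi > 0.
```

Here $U_i$ is the random effect (a random intercept in this simple case), $W_i(t)$ carries the serial dependence, and $Z_{ij}$ is the pure noise term.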
The report’s quality will be assessed as if the report is for a decision maker who only wants the key
details in the main report but may want to easily access further detail in the Appendices.