STAT 337 ASSIGNMENT 2 Due: 5:00pm EDT Thursday, June 16, 2022
Notes for Submission: Upload your assignment directly to Crowdmark via the link you
receive by email. It is your responsibility to make sure your solution to each question is
submitted in the correct section, that the pages are rotated correctly, and that everything is
legible. Typed solutions are preferred.
Notes on the use of statistical software: Unless specifically told otherwise, you are free
to do your calculations using any software you like (SAS, R, Excel, etc.) but your solutions
should clearly explain the steps you used in the computation, showing intermediate
calculations when necessary, and give the formulas that you used. Any code and output created
should also be submitted.
1. [6 marks] In 2020, a group of eight articles was published in the Journal of Studies on
Alcohol and Drugs summarizing the current scientific literature and evidence related to
the research question: Does exposure to alcohol marketing have a causal influence on
youth drinking?[1] For each statement below (lightly edited from the original source),
indicate which of the seven Bradford Hill criteria discussed in class are related to the
statement. Multiple criteria may be addressed in each case.
(a) Jernigan et al. (2017) conducted a systematic review of longitudinal studies that examined
exposure to advertising and drinking among underage persons. All 12 studies found a positive
association between marketing exposure and one or more alcohol consumption outcomes. For
initiation of alcohol use the odds ratios for different marketing exposures ranged from 1.00 to
1.69, and for subsequent hazardous or binge drinking, the range was somewhat higher: 1.38 to
2.15.
(b) In recent years, psychologists have developed and tested theoretical models in which marketing
exposures are hypothesized to affect psychological mediators relating to thoughts, cognitions and
attitudes. These marketing-induced changes are hypothesized to predict whether an individual
will engage in drinking behaviour. Jackson and Bartholow (2020) provide a narrative summary
of psychological plausibility using an integrated conceptual model that depicts relevant psycho-
logical processes as they work together in a complex chain of influence.
(c) Hanewinkel et al. (2008) conducted a prospective observational study of 2110 German adoles-
cents younger than 15 years who had never smoked or drunk alcohol at baseline. The percentage
of students who tried smoking was 16.3%, 10.9% initiated binge drinking and 5.0% used both
substances during the follow-up period. There was a significant effect of parental movie
restriction on each substance use outcome measure after controlling for covariates. Compared with
adolescents whose parents never allowed them to view FSK-16 movies (movies that only those
aged 16 years and over would be allowed to see in theatres), the adjusted relative risks (RR) for
use of both substances were 1.64 for adolescents allowed to view them once in a while, 2.30 for
sometimes and 2.92 for all the time. FSK-16 restrictions were associated with substantially lower
exposure to movie depiction of tobacco and alcohol use.
[1] Sargent, J. D., Cukier, S., & Babor, T. F. (2020). Alcohol marketing and youth drinking: is there a causal
relationship, and why does it matter? Journal of Studies on Alcohol and Drugs, Supplement, (s19), 5-12.
2. [10 marks]
(a) [4 marks] HIV disease may increase susceptibility to other viral infections. A co-
hort study investigated the association of HIV with the occurrence of cytomegalovirus
(CMV) infection, a common herpes virus. Researchers screened infectious disease
clinics to identify a cohort of 400 HIV-positive patients who were seronegative for
CMV. The researchers then identified a comparison cohort of 400 people without
HIV disease from primary care clinics who were also CMV seronegative. Study
personnel conduct annual testing to assess new CMV infections, defined by the
development of antibodies to the virus. The study data are presented in Tables 1
and 2.
For each of the six characteristics listed in Table 1 determine whether or not it is a
potential confounder for the association between HIV and incident CMV infection.
Explain your reasoning.
Table 1: Baseline characteristics of the study participants
                                          HIV positive   HIV negative
Mean age (years)                              47.3           47.1
African American (%)                          37.3           18.9
Male (%)                                      54.0           52.9
Mean body mass index (kg/m²)                  23.2           27.9
Intravenous drug use (%)                      35.4            4.1
Mean CD4 lymphocyte count (cells/mm³)          187           1440
Table 2: Associations of study characteristics with incident CMV infection
                                                      Unadjusted relative risk
                                                      of CMV infection
HIV disease                                                    4.05
Age (per 10-year higher)                                       2.92
African American (compared to Caucasian)                       1.01
Male (compared to female)                                      2.05
Body mass index (per 5 kg/m² higher)                           1.03
Intravenous drug use (yes versus no)                           1.86
CD4 lymphocyte count (per 100 cells/mm³ increase)              2.70
(b) The Heart and Estrogen/Progestin study (HERS) was a randomized clinical trial of
hormone replacement therapy in post-menopausal women with existing coronary
heart disease (CHD)[2]. We will consider multiple linear regression models fit to
baseline data collected on the cohort of 2,763 women[3]. For the purposes of this
question, you can think of the data as coming from a cross-sectional study.
i. [3 marks] Consider the fitted multiple linear regression model presented in
Table 3. The response is LDL cholesterol and the primary exposure or variable
of interest is body mass index (BMI) (a continuous variable measured in
kg/m²). A set of potential confounders is also included in the model: age,
ethnicity (nonwhite), smoking, and alcohol use (drinkany). Age is a continuous
explanatory variable and the rest are binary explanatory variables. Give
a precise written interpretation of the regression parameter for the BMI term.
Is this result statistically significant?
ii. [1 mark] Using the model in Table 3 find the predicted LDL cholesterol value
for a 65-year-old woman who is white, doesn’t smoke but does occasionally
drink, and has a BMI of 24 kg/m².
iii. [2 marks] Now consider the fitted multiple linear regression model presented
in Table 4. This model includes a binary indicator of statin use (a class of
drugs used to lower cholesterol levels) and the interaction between this variable
and BMIc. Note that the BMI variable has been centred at its mean value
of 28.6 kg/m² (i.e. BMIc = BMI - 28.6). This makes the parameter estimate for
statin use more interpretable.
Using estimates from the fitted model, describe the association between BMI
(using BMIc) and LDL among statin users and non-users (2-3 sentences). Is
there evidence that statin use is an effect modifier for the association between
BMI and LDL cholesterol? Explain your reasoning.
[2] Hulley, S., Grady, D., Bush, T., Furberg, C., Herrington, D., Riggs, B. and Vittinghoff, E. (1998). Randomized
trial of estrogen plus progestin for secondary prevention of heart disease in postmenopausal women. The Heart and
Estrogen/progestin Replacement Study. Journal of the American Medical Association, 280(7), 605-613.
[3] Vittinghoff, E., Glidden, D. V., Shiboski, S. C., & McCulloch, C. E. (2011). Regression methods in biostatistics:
linear, logistic, survival, and repeated measures models. Springer Science & Business Media.
Table 3: Fitted multiple linear regression model from HERS study
MODEL LDL = BMI age nonwhite smoking drinkany
Parameter Estimates
                     Parameter   Standard
Variable       DF    Estimate    Error      t Value   Pr > |t|
Intercept       1    147.3153    9.2564      15.91     0.000
BMI             1      0.3591    0.1341
age             1     -0.1897    0.1131      -1.68     0.094
nonwhite        1      5.2194    2.3237       2.25     0.025
smoking         1      4.7507    2.2104       2.15     0.032
drinkany        1     -2.7223    1.4989      -1.82     0.069
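In SAS-style output such as Table 3, each t Value is the parameter estimate divided by its standard error, and a fitted value is the intercept plus each estimate multiplied by the corresponding covariate value. The following is a minimal Python sketch of both calculations using only the numbers printed in Table 3. The names `TABLE3`, `t_value` and `predict_ldl`, and the example covariate profile, are illustrative assumptions and not part of the assignment.

```python
# (estimate, standard error) pairs copied from Table 3.
TABLE3 = {
    "Intercept": (147.3153, 9.2564),
    "BMI": (0.3591, 0.1341),
    "age": (-0.1897, 0.1131),
    "nonwhite": (5.2194, 2.3237),
    "smoking": (4.7507, 2.2104),
    "drinkany": (-2.7223, 1.4989),
}


def t_value(name):
    """t statistic for one term: estimate divided by standard error."""
    estimate, std_error = TABLE3[name]
    return estimate / std_error


def predict_ldl(**covariates):
    """Fitted LDL: intercept plus estimate * value for each covariate given."""
    total = TABLE3["Intercept"][0]
    for name, value in covariates.items():
        total += TABLE3[name][0] * value
    return total


# Sanity check against rows whose t Value is printed in Table 3.
print(round(t_value("age"), 2))       # -1.68
print(round(t_value("nonwhite"), 2))  # 2.25

# Hypothetical profile chosen only for illustration
# (not the profile asked about in part ii).
example = predict_ldl(BMI=30, age=70, nonwhite=1, smoking=0, drinkany=0)
print(round(example, 4))
```

The same `t_value` call applies to any row of the table; the two-sided p-value would then come from a t distribution with the model's residual degrees of freedom, which Table 3 does not report.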
Table 4: Fitted multiple linear regression model with interaction from HERS study
MODEL LDL = statins BMIc statins*BMIc age nonwhite smoking drinkany
Parameter Estimates
                       Parameter   Standard
Variable         DF    Estimate    Error      t Value   Pr > |t|
Intercept         1    162.4052    7.5833      21.42     0.000
statins           1    -16.2530    1.4688     -11.07     0.000
BMIc              1      0.5821    0.1601       3.64     0.000
statins*BMIc      1     -0.7019    0.2694      -2.61     0.009
age               1     -0.1729    0.1106      -1.56     0.118
nonwhite          1      4.0728    2.2751       1.79     0.074
smoking           1      3.1098    2.1670       1.44     0.151
drinkany          1     -2.0753    1.4666      -1.42     0.157
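In a model with a product term, the slope for BMIc differs by statin group: it is the BMIc estimate alone for non-users and the BMIc estimate plus the interaction estimate for users. Because BMI is centred at 28.6, the statins estimate is the fitted user versus non-user difference at the mean BMI. A minimal Python sketch of this arithmetic follows, using only the estimates in Table 4; the function names `bmi_slope` and `statin_difference` are illustrative assumptions.

```python
# Estimates copied from Table 4.
B_STATINS = -16.2530
B_BMIC = 0.5821
B_INTERACTION = -0.7019  # statins x BMIc product term
BMI_MEAN = 28.6          # centring value stated in the question


def bmi_slope(statin_user):
    """Fitted change in LDL per 1 kg/m^2 higher BMI, other covariates held fixed."""
    return B_BMIC + (B_INTERACTION if statin_user else 0.0)


def statin_difference(bmi):
    """Fitted LDL difference (users minus non-users) at a given BMI."""
    return B_STATINS + B_INTERACTION * (bmi - BMI_MEAN)


print(round(bmi_slope(False), 4))  # 0.5821
print(round(bmi_slope(True), 4))   # -0.1198
print(round(statin_difference(BMI_MEAN), 4))
```

Whether the two slopes differ beyond chance is judged from the test on the product term itself, which Table 4 reports directly.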
3. [10 marks] This question is based on the following paper:
Bulfone, T. C., Blat, C., Chen, Y. H., Rutherford, G. W., Gutierrez-Mock, L.,
Nickerson, A., ... & Reid, M. J. (2022). Outdoor Activities Associated with
Lower Odds of SARS-CoV-2 Acquisition: A Case-Control Study. Interna-
tional Journal of Environmental Research and Public Health, 19(10), 6126.3.
You can download the paper from https://doi.org/10.3390/ijerph19106126. The
following questions will lead you through a discussion of the design and a simple unad-
justed analysis of some of the data from this study.
(a) [1 mark] In your own words state the goal/purpose of this case-control study.
(b) [2 marks] Who are the cases in this study and how were they identified/selected?
Who are the controls and how were they identified/selected?
(c) [2 marks] Give two inclusion or exclusion criteria used in the selection of the cases
and controls above.
(d) [1 mark] What is the primary exposure of interest and how was it assessed?
(e) [2 marks] Using the data given in Table 2 calculate and interpret the (unmatched,
unadjusted) Odds Ratio for the primary association of interest in this study.
(f) [2 marks] Describe at least two potential limitations of this study and/or sources
of bias or error.
4. [12 marks] In this question you will explore matching in case-control studies. Consider
the data in Table 5 giving case counts for a rare disease D and a common exposure E
in a closed population, stratified by a common binary confounder X. This represents
the full data in your study population and is normally unobservable.
Table 5: Hypothetical study population
                         X+                   X−                 Overall
                    E+       E−          E+       E−          E+        E−
Cases (D+)          80       10         100      200         180       210
Non-cases (D−)    80,000   20,000     20,000   80,000     100,000   100,000
Odds Ratio             2.0                  2.0                  0.86
Source: Pearce, N. (2016). Analysis of matched case-control studies. BMJ, 352.
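The odds ratios in the last row of Table 5 follow from the cross-product formula OR = (a × d) / (b × c), where a and b are exposed and unexposed cases and c and d are exposed and unexposed non-cases. The short Python sketch below reproduces the three printed values; the function name `odds_ratio` and the argument order are assumptions made for illustration.

```python
def odds_ratio(cases_exposed, cases_unexposed, noncases_exposed, noncases_unexposed):
    """Cross-product odds ratio for a 2 x 2 table: (a * d) / (b * c)."""
    return (cases_exposed * noncases_unexposed) / (cases_unexposed * noncases_exposed)


# Counts copied from Table 5.
or_x_pos = odds_ratio(80, 10, 80_000, 20_000)        # stratum X+
or_x_neg = odds_ratio(100, 200, 20_000, 80_000)      # stratum X-
or_overall = odds_ratio(180, 210, 100_000, 100_000)  # collapsed over X

print(or_x_pos)              # 2.0
print(or_x_neg)              # 2.0
print(round(or_overall, 2))  # 0.86
```

The stratum-specific values agree with each other but not with the collapsed value, which is the pattern expected when X confounds the association between E and D.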
(a) [2 marks] You and your colleagues decide to run an unmatched case-control study
to investigate the association between E and D. You include all 390 cases from
your population and a random sample of 390 controls. Recreate Table 5 for this
study. Use the true sample population prevalences to generate your controls. For
example, the number of controls with (E+, X+) will be 390 × P[E+, X+ | D−].
(b) [4 marks] Calculate the stratum-specific and unstratified/overall Odds Ratios for
the data from your unmatched case-control study in (a) and compare them to the
true population values in Table 5. Suppose you ignored (or were unaware of) X
and based your analysis on the unstratified case-control data. Test the significance
of the unstratified Odds Ratio using a χ² test. Be sure to clearly state the null and
alternative hypotheses, give the formula for the test statistic, calculate its value
and find the p-value. What is the conclusion of the test? Would your conclusions
from this study accurately reflect the true association between E and D?
(c) [2 marks] Now suppose you and your colleagues decide to run a matched case-
control study. Once again you include all 390 cases and you match based on X.
Generate stratified and overall matched 2 × 2 tables from this study. Assume,
given X, the exposure statuses of a matched pair are independent and based on
the true sample population prevalences. For example, for X+ there will be 90
matched pairs and the number of pairs with both the case and control exposed
will be 90 × P[E+ | D+, X+] × P[E+ | D−, X+].
(d) [4 marks] Using the matched 2 × 2 table from (c) calculate the matched pair Odds
Ratio and compare it to the true population values in Table 5. Use McNemar’s
Test to test the significance of the association between E and D. Be sure to clearly
state the null and alternative hypotheses, give the formula for the test statistic,
calculate its value and find the p-value. What is the conclusion of the test? Would
your conclusions from this study accurately reflect the true association between E
and D?
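The calculations named in parts (a) to (d) can be sketched in Python using only the standard library. This is a rough outline under stated assumptions, not a prescribed method: the function names are hypothetical, both tests are written without a continuity correction (the course may use a corrected version), and the counts passed to the two test functions are made-up numbers, not values from Table 5. The p-value uses the identity P(χ²₁ > x) = erfc(√(x/2)) for one degree of freedom.

```python
import math


def expected_controls(cell_non_cases, total_non_cases, n_controls):
    """Expected controls in a cell: n_controls * P[cell | D-]."""
    return n_controls * cell_non_cases / total_non_cases


def chi2_statistic_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the table [[a, b], [c, d]], uncorrected."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi2_p_value_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def mcnemar(case_only_exposed, control_only_exposed):
    """Matched-pair odds ratio, McNemar statistic and p-value from discordant pairs."""
    b, c = case_only_exposed, control_only_exposed
    stat = (b - c) ** 2 / (b + c)
    return b / c, stat, chi2_p_value_1df(stat)


# The worked cell from part (a): 390 * P[E+, X+ | D-], with 80,000 of 200,000 non-cases.
print(expected_controls(80_000, 200_000, 390))  # 156.0

# Hypothetical unmatched table (illustrative counts only).
stat = chi2_statistic_2x2(30, 20, 20, 30)
print(stat, round(chi2_p_value_1df(stat), 4))

# Hypothetical discordant-pair counts (illustrative counts only).
pair_or, pair_stat, pair_p = mcnemar(25, 15)
print(round(pair_or, 3), pair_stat, round(pair_p, 4))
```

Only the discordant pairs enter McNemar's statistic and the matched-pair odds ratio; concordant pairs carry no information about the association in a matched analysis.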