STAT2008/STAT4038/STAT6014/STAT6038 Assignment 2
Regression Modelling - Assignment 2
Total of 100 Marks
Due on 11/10/2019 23:59
Question 1 [40 marks]
We consider the Study on the Efficacy of Nosocomial Infection Control (SENIC Project) where data
was collected to determine whether infection surveillance and control programs have reduced the
rates of nosocomial (hospital acquired) infection in United States hospitals. This data set consists of
a random sample of 113 hospitals selected from the original 338 hospitals surveyed.
Each line of the data set has an identification number and provides information on 11 other variables
for a single hospital. The data presented here are for the 1975-76 study period. The 12 variables are:
Identification number [Id];
Average length of stay of all patients (in days) [Length];
Average age of patients (in years) [Age];
Average estimated probability of acquiring infection in hospital (in percent) [Risk];
Ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infections, times 100 [Culture];
Ratio of number of X-rays performed to number of patients without signs or symptoms of pneumonia, times 100 [Xray];
Average number of beds in hospital during study period [Beds];
Medical school affiliation (1=Yes, 2=No) [Affiliation];
Geographic region, where: 1=NE, 2=NC, 3=S, 4=W [Region];
Average number of patients in hospital per day during study period [Patients];
Average number of full-time equivalent registered and licensed practical nurses during study period (number of full time plus one half the number of part time) [Nurses];
Percent of 35 potential facilities and services that are provided by the hospital [Facilities].
(a)[6] Fit a multiple linear regression (MLR) model with Risk as the response variable and all other
covariates (excluding Id) as predictors. Is the regression model significant?
(b)[8] What are the estimated coefficients of the MLR model in part (a) and the standard errors
associated with these coefficients? Interpret the values of these estimated coefficients with
regard to model specification.
(c)[6] There is a t-test associated with each of these coefficients. Briefly explain what these tests
can and cannot be used for. In your answer, be sure to state the hypotheses that
can be assessed using these t-tests.
(d)[6] Construct an appropriate test of the hypothesis that Age and Beds are not significant contributors
to the model. That is, test β_Age = β_Beds = 0.
(e)[8] Imagine you are doing this work as a data scientist at a hospital and a (statistically untrained)
colleague suggests that a model with coefficients β_Length = 0.25, β_Culture = 0.05, and
β_Region = 0.3 may be a better model. How would you fit such a model, and what would be the
estimate of the intercept term with these coefficients? What criticisms do you have of this
suggested model?
(f)[6] The Asheville, N.C.-based Mission Hospital is making progress on a 12-story surgery tower that will
house more than 400 beds. They would like a prediction on the expected risk of infection within
this new extension if Length=11, Age=45, Culture=18, Xray=100, Beds=400, Region=2,
Patients=400, Nurses=340, Facilities=52, Affiliation=1.
What do you predict the risk of infection to be? Find a 99% interval for this prediction.
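
Several parts of Question 1 come down to comparing nested least-squares fits. The following is a minimal sketch in Python with NumPy, not a prescribed solution: it computes the overall F statistic, a partial F statistic for dropping a subset of predictors (the kind of test asked for in part (d)), and the least-squares intercept when all slopes are fixed in advance (one way to read part (e)). The data are synthetic stand-ins, and the helper names fit_ols, nested_f and intercept_given_slopes are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit. X must already include a column of ones."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)  # coefficients, residual sum of squares


def nested_f(X_full, X_reduced, y):
    """F statistic for H0: the coefficients dropped from X_full are all zero."""
    n, p_full = X_full.shape
    q = p_full - X_reduced.shape[1]  # number of coefficients set to zero
    _, rss_full = fit_ols(X_full, y)
    _, rss_reduced = fit_ols(X_reduced, y)
    df_resid = n - p_full
    f_stat = ((rss_reduced - rss_full) / q) / (rss_full / df_resid)
    return f_stat, q, df_resid


def intercept_given_slopes(Z, y, slopes):
    """Least-squares intercept when every slope is fixed in advance."""
    return float(np.mean(y - Z @ slopes))


# Synthetic stand-in data, NOT the SENIC data: 113 rows, 4 made-up predictors.
rng = np.random.default_rng(0)
n = 113
Z = rng.normal(size=(n, 4))
# Only the first two predictors affect y; the last two are pure noise.
y = 1.0 + 0.5 * Z[:, 0] + 0.3 * Z[:, 1] + rng.normal(scale=0.5, size=n)

ones = np.ones((n, 1))
X_full = np.hstack([ones, Z])
X_reduced = np.hstack([ones, Z[:, :2]])  # drops the last two predictors

f_overall, q_all, df_resid = nested_f(X_full, ones, y)
f_subset, q_sub, _ = nested_f(X_full, X_reduced, y)
a0 = intercept_given_slopes(Z, y, np.array([0.5, 0.3, 0.0, 0.0]))

print(f"overall F = {f_overall:.2f} on ({q_all}, {df_resid}) df")
print(f"subset  F = {f_subset:.2f} on ({q_sub}, {df_resid}) df")
print(f"intercept with slopes held fixed = {a0:.3f}")
```

If SciPy is available, a p-value for either statistic can be obtained with scipy.stats.f.sf(f_stat, q, df_resid). With the real data, categorical predictors such as Region and Affiliation would need to be coded before building the design matrix.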
Dale Roberts - Australian National University
Last updated: September 27, 2019
Question 2 [60 marks]
Company executives from a large packaged foods manufacturer wished to determine which factors
influenced the market share of one of its products. Data were collected from a national database
(Nielsen) for 36 consecutive months. Each line of the data set has an identification number and
provides information on 7 other variables for each month. The variables are:
Identification number [Id];
Average monthly market share for product (percent) [Share];
Average monthly price of product (dollars) [Price];
An index of the amount of advertising exposure that the product received [Exposure];
Presence or absence of discount price during period: 1 if discounted, 0 otherwise [Discounted];
Presence or absence of package promotion during period: 1 if promotion present, 0 otherwise [Promoted];
Month [Month];
Year [Year].
The data cover September 1999 (Id = 1), October 1999 (Id = 2), ..., August 2002 (Id = 36).
(a)[10] Fit a multiple linear regression (MLR) model with Share as the response variable and all
other covariates as predictors (excluding Id). Is the regression model significant? Interpret
the coefficients for the categorical variables in this model. Does the coefficient of Discounted support the
expectation that discounting the price increases market share?
(b)[6] The executives are interested to know if discounting and package promotions have an effect on
market share. Conduct a formal test of the hypothesis that
β_Discounted = β_Promoted = 0
using an appropriate ANOVA table. Evaluate the F-statistic and the corresponding p-value.
(c)[6] Assuming that the other variables remain fixed, the company executives would like to know
your predicted difference in market share for a month if they both discount the price and
promote their product. Base your answer on the model fitted in part (a).
(d)[8] One executive suggests that, in his opinion, discounting the product and promoting the product
have a similar effect on market share so the company should pursue the strategy that costs the
least. Test whether the coefficients of Discounted and Promoted are the same. Construct an
appropriate model to test this hypothesis.
(e)[20] Produce the appropriate diagnostic plots for the model fitted in part (a) and assess the model
assumptions. Produce the relevant influence diagnostics for this model. Which data points
appear to be influential in the analysis, and in what sense would you consider them influential?
Also, do any points appear to be outliers? If so, to which months do these observations
correspond?
(f)[10] Refit the model in part (a), after adding all second-order terms involving only the quantitative
predictors. Test whether or not all quadratic and interaction terms can be dropped from the
regression model. State the alternatives, decision rule, and conclusion.
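
Two of the ideas in Question 2 can be sketched briefly in Python with NumPy. This is an illustration under stated assumptions rather than a model answer. The hypothesis that two dummy variables share a coefficient (part (d)) can be tested by refitting with the two dummies replaced by their sum and comparing residual sums of squares. Leverages and Cook's distances (part (e)) follow from the diagonal of the hat matrix. The data are synthetic, the helper names rss and influence are hypothetical, and the 4/n cutoff is only a common rule of thumb.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def influence(X, y):
    """Leverages and Cook's distances for an OLS fit (X includes the ones column)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    Q, _ = np.linalg.qr(X)    # thin QR: the hat matrix equals Q @ Q.T
    h = np.sum(Q**2, axis=1)  # its diagonal, without forming the n x n matrix
    s2 = float(resid @ resid) / (n - p)
    cooks = resid**2 * h / (p * s2 * (1.0 - h) ** 2)
    return h, cooks


# Synthetic stand-in for 36 months of data, NOT the market share data.
rng = np.random.default_rng(1)
n = 36
price = rng.normal(loc=2.3, scale=0.2, size=n)
disc = rng.integers(0, 2, size=n).astype(float)
promo = rng.integers(0, 2, size=n).astype(float)
share = 3.0 - 0.3 * price + 0.4 * disc + 0.4 * promo + rng.normal(scale=0.2, size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, price, disc, promo])
# Under H0 the two dummies share one coefficient, so they enter as a sum.
X_equal = np.column_stack([ones, price, disc + promo])

df_resid = n - X_full.shape[1]
rss_full = rss(X_full, share)
f_equal = (rss(X_equal, share) - rss_full) / (rss_full / df_resid)  # 1 restriction

h, cooks = influence(X_full, share)
flagged = np.flatnonzero(cooks > 4 / n)  # rule-of-thumb cutoff, not a formal test

print(f"F for equal coefficients = {f_equal:.3f} on (1, {df_resid}) df")
print("row indices with Cook's distance above 4/n:", flagged)
```

The row indices printed are zero-based, so index i corresponds to Id = i + 1 when the rows are in Id order.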