STAT2008/STAT4038/STAT6014/STAT6038 Assignment 2 Page 1 of 2

Regression Modelling - Assignment 2

Total of 100 Marks

Due on 11/10/2019 23:59

Question 1 [40 marks]

We consider the Study on the Efficacy of Nosocomial Infection Control (SENIC Project) where data

was collected to determine whether infection surveillance and control programs have reduced the

rates of nosocomial (hospital acquired) infection in United States hospitals. This data set consists of

a random sample of 113 hospitals selected from the original 338 hospitals surveyed.

Each line of the data set has an identification number and provides information on 11 other variables

for a single hospital. The data presented here are for the 1975-76 study period. The 12 variables are:

Identification number [Id]; Average length of stay of all patients (in days) [Length]; Average age of

patients (in years) [Age]; Average estimated probability of acquiring infection in hospital (in percent)

[Risk]; Ratio of number of cultures performed to number of patients without signs or symptoms of

hospital-acquired infections, times 100 [Culture]; Ratio of number of X-rays performed to number

of patients without signs or symptoms of pneumonia, times 100 [Xray]; Average number of beds

in hospital during study period [Beds]; Medical school affiliation (1=Yes, 2=No) [Affiliation];

Geographic region, where: 1=NE, 2=NC, 3=S, 4=W [Region]; Average number of patients in

hospital per day during study period [Patients]; Average number of full-time equivalent registered

and licensed practical nurses during study period (number of full time plus one half the number of

part time) [Nurses]; Percent of 35 potential facilities and services that are provided by the hospital

[Facilities].
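Affiliation and Region are recorded as numeric codes, so a fitted model would normally treat them as categories rather than as numbers. A minimal sketch of indicator (dummy) coding in Python follows. The helper name `dummy_code`, the choice of NE as the baseline level, and the five region codes are illustrative assumptions and are not taken from the data set.

```python
# Region codes as defined in the variable list: 1=NE, 2=NC, 3=S, 4=W.
REGION_LEVELS = [1, 2, 3, 4]


def dummy_code(codes, levels, baseline):
    """Return one 0/1 indicator column per non-baseline level.

    The baseline level gets no column; its effect is absorbed by the intercept.
    """
    columns = {}
    for level in levels:
        if level == baseline:
            continue
        columns[level] = [1 if code == level else 0 for code in codes]
    return columns


# Made-up region codes for five hypothetical hospitals (NOT the SENIC data).
region = [1, 4, 2, 3, 2]
dummies = dummy_code(region, levels=REGION_LEVELS, baseline=1)

for level, column in dummies.items():
    print(f"Region == {level}: {column}")
```

With four regions this gives three indicator columns, so Region contributes three coefficients to the model rather than one.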

(a)[6] Fit a multiple linear regression (MLR) model with Risk as the response variable and all other

covariates (excluding Id) as predictors. Is the regression model significant?

(b)[8] What are the estimated coefficients of the MLR model in part (a) and the standard errors

associated with these coefficients? Interpret the values of these estimated coefficients with

regard to model specification.

(c)[6] There is a t-test associated with each of these coefficients. Briefly explain what these tests

can and cannot be used for. In your answer, be sure to mention the appropriate hypotheses that

can be assessed using these t-tests.

(d)[6] Construct an appropriate test of the hypothesis that Age and Beds are not significant contributors

to the model. That is, test β_Age = β_Beds = 0.
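A hypothesis of this kind compares a full model with a reduced model that omits the predictors in question, which is the setting of a partial (extra sum of squares) F-test. The sketch below shows the mechanics with NumPy on simulated data. The helper names `sse` and `partial_f` and the simulated variables `x1`, `x2`, `x3` are assumptions made for illustration; none of the numbers come from the SENIC data.

```python
import numpy as np


def sse(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def partial_f(X_full, X_reduced, y):
    """Partial F-statistic for dropping columns of X_full to obtain X_reduced.

    Both design matrices must contain the intercept column, and the columns
    of X_reduced must be a subset of the columns of X_full.
    """
    n, p_full = X_full.shape
    q = p_full - X_reduced.shape[1]          # number of coefficients set to 0
    sse_full = sse(X_full, y)
    sse_reduced = sse(X_reduced, y)
    f_stat = ((sse_reduced - sse_full) / q) / (sse_full / (n - p_full))
    return f_stat, q, n - p_full


# Simulated example: x2 and x3 have no true effect on y.
rng = np.random.default_rng(0)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2, x3])
X_reduced = np.column_stack([ones, x1])

f_stat, df1, df2 = partial_f(X_full, X_reduced, y)
print(f"F = {f_stat:.3f} on ({df1}, {df2}) degrees of freedom")
```

Under the null hypothesis the statistic is compared with an F distribution on (df1, df2) degrees of freedom, for example via `scipy.stats.f.sf(f_stat, df1, df2)` if SciPy is available.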

(e)[8] Imagine you are doing this work as a data scientist at a hospital and a (statistically uneducated)

colleague suggests that a model with coefficients β_Length = 0.25, β_Culture = 0.05, and

β_Region = 0.3 may be a better model. How would you fit such a model and what would be the

estimate of the intercept term with these coefficients? What criticisms do you have about this

suggested model?
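One possible reading of part (e) is that the three slopes are fixed in advance and only the intercept is estimated. Under that assumption the fixed linear part acts as an offset, and the least-squares intercept is the mean of the response minus the offset. The sketch below illustrates this in plain Python; the four rows of values are made up for illustration and are NOT the SENIC data.

```python
from statistics import mean

# Made-up values for four hypothetical hospitals (NOT the SENIC data).
risk = [4.1, 1.6, 2.7, 5.6]
length = [7.1, 8.8, 8.3, 9.0]
culture = [9.0, 3.8, 8.1, 18.9]
region = [4, 2, 3, 1]

# Linear part that the colleague fixes in advance (the offset).
# Note that Region enters here as a number, not as a category.
offset = [
    0.25 * l + 0.05 * c + 0.3 * r
    for l, c, r in zip(length, culture, region)
]

# With every slope fixed, least squares only chooses the intercept,
# which is the mean of (response - offset).
intercept = mean(y - o for y, o in zip(risk, offset))
residuals = [y - o - intercept for y, o in zip(risk, offset)]

print(f"estimated intercept: {intercept:.4f}")
print(f"sum of residuals:    {sum(residuals):.2e}")
```

The residuals sum to zero by construction, which is a quick check that the intercept has been estimated by least squares.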

(f)[6] The Asheville, N.C.-based Mission Hospital is making progress on a 12-story surgery tower that will

house more than 400 beds. They would like a prediction on the expected risk of infection within

this new extension if Length=11, Age=45, Culture=18, Xray=100, Beds=400, Region=2,

Patients=400, Nurses=340, Facilities=52, Affiliation=1.

What do you predict the risk of infection to be? Find a 99% interval for this prediction.
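For reference, the standard form of a prediction interval for a new observation in a linear model is sketched below. The symbols are assumptions of this note rather than notation from the assignment: x_0 is the vector of predictor values for the new hospital (including the leading 1 and any indicator columns), p is the number of estimated coefficients including the intercept, s^2 = SSE/(n - p), and n = 113.

```latex
\begin{align*}
\hat{y}_0 &= x_0^\top \hat{\beta}, \\
\hat{y}_0 &\pm t_{1-\alpha/2,\; n-p}\; s\,
  \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0},
  \qquad \alpha = 0.01 .
\end{align*}
```

If an interval for the mean response at x_0 is wanted instead, the leading 1 under the square root is dropped.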

Dale Roberts - Australian National University

Last updated: September 27, 2019

Question 2 [60 marks]

Company executives from a large packaged foods manufacturer wished to determine which factors

influenced the market share of one of its products. Data were collected from a national database

(Nielsen) for 36 consecutive months. Each line of the data set has an identification number and

provides information on 7 other variables for each month. The variables are: Identification number

[Id]; Average monthly market share for product (percent) [Share]; Average monthly price of

product (dollars) [Price]; An index of the amount of advertising exposure that the product received

[Exposure]; Presence or absence of discount price during period: 1 if discounted, 0 otherwise

[Discounted]; Presence or absence of package promotion during period: 1 if promotion present, 0

otherwise [Promoted]; Month [Month]; Year [Year]. The data was collected during September 1999

(Id = 1), October 1999 (Id = 2), ..., August 2002 (Id = 36).
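The mapping from Id to calendar month follows directly from the statement that Id = 1 is September 1999 and Id = 36 is August 2002. A small Python helper for that conversion is sketched below; the function name `id_to_year_month` is an assumption of this note and is not part of the data set.

```python
def id_to_year_month(obs_id):
    """Map an observation Id (1..36) to a (year, month) pair.

    Id = 1 corresponds to September 1999 and the Ids run over
    consecutive months up to Id = 36, August 2002.
    """
    if not 1 <= obs_id <= 36:
        raise ValueError("Id must be between 1 and 36")
    # Count months from January 1999 (index 0); September 1999 is index 8.
    months_since_jan_1999 = 8 + (obs_id - 1)
    year = 1999 + months_since_jan_1999 // 12
    month = months_since_jan_1999 % 12 + 1
    return year, month


for obs_id in (1, 2, 5, 36):
    year, month = id_to_year_month(obs_id)
    print(f"Id = {obs_id:2d} -> {year}-{month:02d}")
```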

(a)[10] Fit a multiple linear regression (MLR) model with Share as the response variable and all

other covariates as predictors (excluding Id). Is the regression model significant? Interpret

the coefficients for the categorical variables in this model. Does the Discounted coefficient support the

expectation that discounting the price increases market share?

(b)[6] The executives are interested to know if discounting and package promotions have an effect on

market share. Conduct a formal test of the hypothesis that

β_Discounted = β_Promoted = 0

using an appropriate ANOVA table. Evaluate the F-statistic and the corresponding p-value.

(c)[6] Assuming that the other variables remain fixed, the company executives would like to know

the difference in market share you predict for a month in which they both discount the price and

promote their product. Base your answer on the model fitted in part (a).

(d)[8] One executive suggests that, in his opinion, discounting the product and promoting the product

have a similar effect on market share so the company should pursue the strategy that costs the

least. Test whether the coefficients of Discounted and Promoted are the same. Construct an

appropriate model to test this hypothesis.
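One common way to set up such a test is to reparameterise: if the two coefficients are equal, the two indicators can be replaced by their sum, giving a reduced model with one fewer parameter. The derivation below is a sketch under that assumption. The symbols gamma, SSE_R, SSE_F, n and p are introduced here for illustration, with p the number of coefficients in the full model including the intercept.

```latex
\begin{align*}
H_0 &: \beta_{\text{Discounted}} = \beta_{\text{Promoted}} = \gamma
\quad\text{vs.}\quad
H_1 : \beta_{\text{Discounted}} \neq \beta_{\text{Promoted}}, \\
\text{Reduced model} &: \text{Share} = \beta_0 + \dots
  + \gamma\,(\text{Discounted} + \text{Promoted}) + \varepsilon, \\
F &= \frac{(SSE_R - SSE_F)/1}{SSE_F/(n - p)}
  \;\sim\; F_{1,\; n-p} \quad \text{under } H_0 .
\end{align*}
```

Here SSE_F is the residual sum of squares of the model in part (a) and SSE_R is that of the reduced model fitted with the single combined predictor.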

(e)[20] Produce the appropriate diagnostic plots for the model fitted in part (a) and assess the model

assumptions. Produce the relevant influence diagnostics for this model. Which data points

appear to be influential in the analysis, and in what sense would you consider them influential?

Also, do any points appear to be outliers? If so, to which months do these observations

correspond?
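The usual influence diagnostics (leverage, studentized residuals, Cook's distance) can all be computed from the hat matrix. The NumPy sketch below shows one way to do so on simulated data with a single planted unusual point. The function name `influence_measures` and all data values are assumptions made for illustration and are not the market share data.

```python
import numpy as np


def influence_measures(X, y):
    """Return leverages, internally studentized residuals and Cook's distances.

    X is the design matrix including the intercept column.
    """
    n, p = X.shape
    # Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages.
    H = X @ np.linalg.pinv(X)
    leverage = np.diag(H)
    resid = y - H @ y
    s2 = resid @ resid / (n - p)             # residual variance estimate
    student = resid / np.sqrt(s2 * (1.0 - leverage))
    cooks = student**2 * leverage / (p * (1.0 - leverage))
    return leverage, student, cooks


# Simulated data with one planted unusual observation at index 0.
rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
x[0] = 6.0                                   # far from the other x values
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)
y[0] += 3.0                                  # and well off the fitted line

X = np.column_stack([np.ones(n), x])
leverage, student, cooks = influence_measures(X, y)

print("highest leverage at index:", int(np.argmax(leverage)))
print("largest Cook's distance at index:", int(np.argmax(cooks)))
```

The leverages always sum to the number of coefficients p, which is why one common rule of thumb flags points whose leverage exceeds 2p/n.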

(f)[10] Refit the model in part (a), after adding all second-order terms involving only the quantitative

predictors. Test whether or not all quadratic and interaction terms can be dropped from the

regression model. State the alternatives, decision rule, and conclusion.
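The second-order terms for a set of quantitative predictors are the squares of each predictor and the pairwise products between them. A short Python sketch of building those columns is given here. The helper name `second_order_terms`, the choice of Price and Exposure as the quantitative predictors, and the two rows of values are assumptions for illustration only.

```python
def second_order_terms(columns):
    """Build squared and pairwise interaction columns.

    columns maps a predictor name to its list of values. The result maps
    "A^2" to the squares of A and "A:B" to the elementwise product of A and B.
    """
    names = list(columns)
    terms = {}
    for i, a in enumerate(names):
        terms[f"{a}^2"] = [v * v for v in columns[a]]
        for b in names[i + 1:]:
            terms[f"{a}:{b}"] = [u * v for u, v in zip(columns[a], columns[b])]
    return terms


# Made-up values for two hypothetical months (NOT the market share data).
quantitative = {
    "Price": [2.0, 2.5],
    "Exposure": [500.0, 400.0],
}
extra = second_order_terms(quantitative)

for name, values in extra.items():
    print(name, values)
```

With two quantitative predictors this adds three columns, so the test that all of them can be dropped has three numerator degrees of freedom. Centring the predictors before squaring is often used to reduce collinearity, although it is not shown here.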
