
Download the file Coursework2Data.csv from the Moodle webpage and load it into R.
The dataset comprises information on the retail prices of second-hand cars. The variables are:
• Price: the retail price of the second-hand car (in 1000 £);
• Age: the age of the car (in months);
• Mileage: the mileage, that is the distance that the car has driven in its lifetime (in 1000 miles);
• MOT: time passed since the last MOT, a vehicle safety inspection that a registered car needs to pass
every year;
• ABS: whether the car has ABS, that is an anti-lock brake system which is an enhanced safety feature;
• Sunroof: whether the car has a sun roof.
(a) [1 mark] Produce a scatterplot of Price against Mileage.
(b) [3 marks] Consider polynomials up to degree 3 to model the relationship between Mileage and Price.
Fit each model and then add to your scatterplot from (a) the corresponding fitted lines/curves, using different
colours/line types for each. Don’t forget to add a legend. Judging from your plot, which seems the most
appropriate model? Explain why.
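As an illustration of the mechanics in (a) and (b), here is a minimal sketch in R. It runs on simulated data, not on Coursework2Data.csv; the data frame `car_data`, the simulated coefficients, and the names `fit1`, `fit2`, `fit3` and `new_x` are assumptions made only for this example.

```r
# Simulated stand-in for the coursework data (assumption: NOT the real file)
set.seed(1)
n <- 172
car_data <- data.frame(Mileage = runif(n, 5, 120))
car_data$Price <- 20 - 0.25 * car_data$Mileage +
  0.001 * car_data$Mileage^2 + rnorm(n, sd = 1.5)

# Polynomial fits of degree 1 to 3; I() protects the powers inside a formula
fit1 <- lm(Price ~ Mileage, data = car_data)
fit2 <- lm(Price ~ Mileage + I(Mileage^2), data = car_data)
fit3 <- lm(Price ~ Mileage + I(Mileage^2) + I(Mileage^3), data = car_data)

# Scatterplot, then fitted curves evaluated on a fine grid of mileages
plot(Price ~ Mileage, data = car_data,
     xlab = "Mileage (1000 miles)", ylab = "Price (1000 GBP)")
new_x <- data.frame(Mileage = seq(min(car_data$Mileage),
                                  max(car_data$Mileage), length.out = 200))
lines(new_x$Mileage, predict(fit1, new_x), col = "red", lty = 1)
lines(new_x$Mileage, predict(fit2, new_x), col = "blue", lty = 2)
lines(new_x$Mileage, predict(fit3, new_x), col = "darkgreen", lty = 3)
legend("topright", legend = c("linear", "quadratic", "cubic"),
       col = c("red", "blue", "darkgreen"), lty = 1:3)
```

Predicting on a sorted grid rather than at the observed mileages keeps the curves smooth and avoids zig-zag lines when the data are not ordered.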
(c) [3 marks] Perform a sequential ANOVA on the cubic model from (b) and include the output in your
report. What conclusion can you draw from the results?
(d) [8 marks] Explain how to use the results from (c) to compute the entries for the standard ANOVA table
for the existence of regression for the quadratic model. Write out this ANOVA table.
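The bookkeeping behind (d) can be sketched as follows, again on simulated data rather than the coursework file. The sketch assumes the cubic model is fitted with its terms in the order linear, quadratic, cubic, so that the rows of `anova()` follow that order; all object names are hypothetical.

```r
# Simulated stand-in data (assumption: NOT the coursework data)
set.seed(2)
n <- 172
Mileage <- runif(n, 5, 120)
Price <- 20 - 0.25 * Mileage + 0.001 * Mileage^2 + rnorm(n, sd = 1.5)

cubic <- lm(Price ~ Mileage + I(Mileage^2) + I(Mileage^3))
seq_tab <- anova(cubic)            # sequential sums of squares
ss <- seq_tab[["Sum Sq"]]          # rows: linear, quadratic, cubic, residuals
df <- seq_tab[["Df"]]

# Quadratic model: the regression SS pools the first two sequential SS;
# the residual SS pools the cubic-term SS with the cubic-model residual SS
ss_reg <- sum(ss[1:2]); df_reg <- sum(df[1:2])
ss_res <- sum(ss[3:4]); df_res <- sum(df[3:4])

f_obs <- (ss_reg / df_reg) / (ss_res / df_res)
p_val <- pf(f_obs, df_reg, df_res, lower.tail = FALSE)

quad_table <- data.frame(
  Source = c("Regression", "Residual", "Total"),
  Df     = c(df_reg, df_res, df_reg + df_res),
  SumSq  = c(ss_reg, ss_res, ss_reg + ss_res),
  MeanSq = c(ss_reg / df_reg, ss_res / df_res, NA),
  F      = c(f_obs, NA, NA)
)
print(quad_table)
```

The pooling works because the sequential sum of squares for the cubic term equals the drop in residual sum of squares when moving from the quadratic to the cubic model.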
(e) [7 marks] Perform the test for existence of regression for the quadratic model at a 5% significance level.
In your answer state clearly
• the null and the alternative hypothesis;
• the definition of the relevant test statistic;
• the distribution of the test statistic under the null hypothesis;
• the observed value of the test statistic;
• the corresponding p-value;
• the outcome of the test and
• the conclusion you draw from the test.
(Hint: you may either use the results from (d) or use any other approach to obtain the relevant quantities,
but in the latter case explain how you obtained the relevant quantities.)
(f) [4 marks] Produce the four default diagnostic plots (Residuals vs Fitted Values plot, Normal Q-Q plot,
Scale-Location plot and Residuals versus Leverages plot) for the quadratic model. Briefly (1-2 sentences)
comment on each plot.
(g) [2 marks] Next fit the model
$$\text{Price}_j = \beta_0 + \beta_1 \text{Mileage}_j + \beta_2 \text{Mileage}_j^2 + \beta_3 \text{Age}_j + \beta_4 \text{Age}_j^2 + \epsilon_j,$$
where j = 1,..., 172. Explain how to use the information provided in the model summary output for this
model to obtain an unbiased estimate for the variance of the errors. Give the numerical value of the unbiased
estimate for the variance of the errors.
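One way to read the quantity in (g) off a fitted model is sketched below on simulated data. The simulated coefficients and the names `model_g`, `s`, `sigma2_hat` and `rss` are assumptions for illustration; the numerical value for the coursework data will differ.

```r
# Simulated stand-in data (assumption: NOT the coursework data)
set.seed(3)
n <- 172
Age <- runif(n, 6, 150)
Mileage <- pmax(0.7 * Age + rnorm(n, sd = 10), 1)
Price <- 25 - 0.1 * Mileage + 0.0003 * Mileage^2 -
  0.12 * Age + 0.0003 * Age^2 + rnorm(n, sd = 1.2)

model_g <- lm(Price ~ Mileage + I(Mileage^2) + Age + I(Age^2))
s <- summary(model_g)

# summary() reports the residual standard error; squaring it gives the
# unbiased estimate of the error variance
sigma2_hat <- s$sigma^2

# Equivalent by hand: residual sum of squares over (n - number of coefficients)
rss <- sum(resid(model_g)^2)
rss / (n - length(coef(model_g)))
```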
(h) [7 marks] Perform a hypothesis test at a 5% significance level to decide whether the quadratic term in
Age is needed in the model in (g). In your answer state clearly
• the null and the alternative hypothesis;
• the definition of the relevant test statistic;
• the distribution of the test statistic under the null hypothesis;
• the observed value of the test statistic;
• the corresponding p-value;
• the outcome of the test and
• the conclusion you draw from the test.
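The quantities needed for a test on a single coefficient, as in (h), can be extracted as in the following sketch. It uses simulated data, and the object names are hypothetical; it also shows that the t-test agrees with the partial F-test comparing the models with and without the quadratic Age term.

```r
# Simulated stand-in data (assumption: NOT the coursework data)
set.seed(4)
n <- 172
Age <- runif(n, 6, 150)
Mileage <- pmax(0.7 * Age + rnorm(n, sd = 10), 1)
Price <- 25 - 0.1 * Mileage + 0.0003 * Mileage^2 -
  0.12 * Age + 0.0003 * Age^2 + rnorm(n, sd = 1.2)

model_g <- lm(Price ~ Mileage + I(Mileage^2) + Age + I(Age^2))
coefs <- summary(model_g)$coefficients

# t statistic = estimate / standard error, with n - 5 degrees of freedom
t_obs <- coefs["I(Age^2)", "Estimate"] / coefs["I(Age^2)", "Std. Error"]
df_res <- df.residual(model_g)
p_val <- 2 * pt(abs(t_obs), df_res, lower.tail = FALSE)  # two-sided p-value

# Equivalent partial F-test: reduced model without the quadratic Age term
reduced <- lm(Price ~ Mileage + I(Mileage^2) + Age)
f_tab <- anova(reduced, model_g)
print(f_tab)
```

For a single coefficient the F statistic from the model comparison is the square of the t statistic, so both routes give the same p-value.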
(i) [2 marks] Use the function influenceIndexPlot from the car package applied to the fitted model in
(g) to produce an index plot of the leverages. Which are the observations with the six highest leverages?
(Hint: use the option id=list(n=6) in the command influenceIndexPlot to label the observations with the
six highest leverages.)
(j) [3 marks] Produce a scatterplot of Age against Mileage such that the observations with the six highest
leverages identified in (i) have a different colour from the other data points. How would you characterise
these observations in terms of their age and mileage?
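For readers without the car package, the leverages in (i) and the highlighted scatterplot in (j) can be approximated with base R, as in this sketch. It uses `hatvalues()` in place of `influenceIndexPlot`, runs on simulated data, and all object names are assumptions.

```r
# Simulated stand-in data (assumption: NOT the coursework data)
set.seed(5)
n <- 172
Age <- runif(n, 6, 150)
Mileage <- pmax(0.7 * Age + rnorm(n, sd = 10), 1)
Price <- 25 - 0.1 * Mileage + 0.0003 * Mileage^2 -
  0.12 * Age + 0.0003 * Age^2 + rnorm(n, sd = 1.2)
model_g <- lm(Price ~ Mileage + I(Mileage^2) + Age + I(Age^2))

# Leverages are the diagonal entries of the hat matrix
h <- hatvalues(model_g)
top6 <- order(h, decreasing = TRUE)[1:6]   # indices of the six largest

# Index plot of the leverages with the six largest labelled
plot(h, type = "h", xlab = "Index", ylab = "Leverage (hat value)")
text(top6, h[top6], labels = top6, pos = 3, cex = 0.7)

# Age against Mileage, high-leverage observations in a different colour
cols <- ifelse(seq_len(n) %in% top6, "red", "grey40")
plot(Mileage, Age, col = cols, pch = 19,
     xlab = "Mileage (1000 miles)", ylab = "Age (months)")
legend("topleft", legend = c("six highest leverages", "other"),
       col = c("red", "grey40"), pch = 19)
```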
(k) [2 marks] Produce an index plot of the Cook’s distances for the model in (g). Which are the observations
with the three highest Cook’s distances?
(l) [3 marks] For the model in (g) use the command influencePlot(model, id=list(n=3)) to produce a
bubble plot that flags up the datapoints with the three largest absolute studentised residuals, the datapoints
with the three highest leverages and the datapoints with the three highest Cook’s distances. Explain, in terms
of their leverage and residual, why the observations identified in (k) have the highest Cook’s distances.
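The link between Cook's distance, leverage and residual that (l) asks about can be checked numerically. The sketch below, on simulated data with hypothetical object names, rebuilds Cook's distance from the internally studentised residuals r_i (from `rstandard()`) and the leverages h_i as D_i = (r_i^2 / p) * h_i / (1 - h_i), where p is the number of coefficients. Note that this uses `rstandard()` because that is the residual entering the formula; the residuals drawn by `influencePlot` may be scaled differently.

```r
# Simulated stand-in data (assumption: NOT the coursework data)
set.seed(6)
n <- 172
Age <- runif(n, 6, 150)
Mileage <- pmax(0.7 * Age + rnorm(n, sd = 10), 1)
Price <- 25 - 0.1 * Mileage + 0.0003 * Mileage^2 -
  0.12 * Age + 0.0003 * Age^2 + rnorm(n, sd = 1.2)
model_g <- lm(Price ~ Mileage + I(Mileage^2) + Age + I(Age^2))

h <- hatvalues(model_g)        # leverages
r <- rstandard(model_g)        # internally studentised residuals
p <- length(coef(model_g))     # number of coefficients (5 here)

# Cook's distance combines residual size and leverage
cook_manual <- r^2 / p * h / (1 - h)
cook <- cooks.distance(model_g)

# The three largest Cook's distances with their leverage and residual
top3 <- order(cook, decreasing = TRUE)[1:3]
print(data.frame(obs = top3, leverage = h[top3],
                 std_resid = r[top3], cook = cook[top3]))

# Index plot of the Cook's distances
plot(cook, type = "h", xlab = "Index", ylab = "Cook's distance")
```

An observation therefore needs a sizeable residual, a sizeable leverage, or both to obtain a large Cook's distance; high leverage alone is not enough when the residual is close to zero.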
(m) [2 marks] Consider a new model produced by adding the explanatory variables MOT, ABS and Sunroof
to the model in (g). Give the R code that you would use to fit the model and to perform a hypothesis test to
decide whether the new model is a significant improvement over the model in (g).
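A general pattern for comparing two nested linear models, as (m) requires, is sketched here on simulated data. How MOT, ABS and Sunroof are coded in the real file is not stated above, so the sketch assumes MOT is numeric (in months) and ABS and Sunroof are two-level factors; these codings and all object names are assumptions.

```r
# Simulated stand-in data (assumption: NOT the coursework data)
set.seed(7)
n <- 172
car_data <- data.frame(Age = runif(n, 6, 150))
car_data$Mileage <- pmax(0.7 * car_data$Age + rnorm(n, sd = 10), 1)
car_data$MOT     <- runif(n, 0, 12)                       # assumed numeric
car_data$ABS     <- factor(sample(c("no", "yes"), n, replace = TRUE))
car_data$Sunroof <- factor(sample(c("no", "yes"), n, replace = TRUE))
car_data$Price <- 25 - 0.1 * car_data$Mileage - 0.08 * car_data$Age +
  1.0 * (car_data$ABS == "yes") + rnorm(n, sd = 1.2)

# Smaller model from (g) and the larger model with three extra variables
model_g <- lm(Price ~ Mileage + I(Mileage^2) + Age + I(Age^2),
              data = car_data)
model_m <- update(model_g, . ~ . + MOT + ABS + Sunroof)

# Partial F-test: do the added terms jointly reduce the residual SS?
cmp <- anova(model_g, model_m)
print(cmp)
```

The second row of the printed table holds the F statistic and p-value for the null hypothesis that the coefficients of all added terms are zero.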
(n) [3 marks] Perform a forward stepwise variable selection using the AIC as the model selection criterion.
Use as the minimal model
$$\text{Price}_j = \beta_0 + \beta_1 \text{Mileage}_j + \beta_3 \text{Age}_j + \epsilon_j \quad \text{for } j = 1, \ldots, n.$$
As the maximal model use the model in (m). Include the output in your report. Which model is selected as
the final model and what value does the AIC take for this model?
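Forward selection between a minimal and a maximal model can be set up with `step()` from base R, as in the following sketch. It runs on simulated data with assumed variable codings (MOT numeric, ABS and Sunroof as factors), and the object names are hypothetical. The AIC that `step()` prints is the one computed by `extractAIC()`, which differs from `AIC()` by an additive constant for linear models.

```r
# Simulated stand-in data (assumption: NOT the coursework data)
set.seed(8)
n <- 172
car_data <- data.frame(Age = runif(n, 6, 150))
car_data$Mileage <- pmax(0.7 * car_data$Age + rnorm(n, sd = 10), 1)
car_data$MOT     <- runif(n, 0, 12)                       # assumed numeric
car_data$ABS     <- factor(sample(c("no", "yes"), n, replace = TRUE))
car_data$Sunroof <- factor(sample(c("no", "yes"), n, replace = TRUE))
car_data$Price <- 25 - 0.1 * car_data$Mileage +
  0.0003 * car_data$Mileage^2 - 0.08 * car_data$Age +
  1.0 * (car_data$ABS == "yes") + rnorm(n, sd = 1.2)

# Start from the minimal model and let step() add terms one at a time
minimal <- lm(Price ~ Mileage + Age, data = car_data)
selected <- step(
  minimal,
  scope = list(
    lower = ~ Mileage + Age,
    upper = ~ Mileage + I(Mileage^2) + Age + I(Age^2) + MOT + ABS + Sunroof
  ),
  direction = "forward"
)

formula(selected)          # terms in the selected model
extractAIC(selected)[2]    # AIC value as reported by step()
```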