STAT6123辅导、R编程设计辅导

2022.11.29 - 首页 >> Database作业

STAT6123 Generalised Linear Modelling
Assignment
This assignment is worth 50% of the overall mark for STAT6123.
The deadline for submission is 16.00 on Thursday 1 December 2022.
Standard University policies and procedures will be followed for late submission,
extensions and academic integrity (see the Module Outline for details).
Submission is via Blackboard.
– You should submit a report containing your answers via TurnitinUK on Black-
board (see Module Outline for details) in a file called report-ID.pdf, where ID
is your student ID number, for example report-1234567.pdf. In the STAT6123
Assignments folder, click on View/Complete to submit your report. Please enter
this file name as the Submission title.
– You should not include R code used in your analysis in your report, but you
must submit a separate R script via Blackboard containing your code called
code-ID.R, for example code-1234567.R. This code should reproduce the re-
sults contained in your report. Please rename and use the R template code-yyy.R
provided. In the STAT6123 Assignments folder, click on Assignment code sub-
mission to submit your code.
? The page limits given below for each task are strict and is easily sucient to receive
full credit. Any pages beyond the limits will not be marked.
1

Task 1 [Total 65 marks, max. 9 pages]
How household expenditure varies with household income and other variables is of key
interest in socio-economic studies. In order to investigate this, you are provided with
data from a survey that collected high quality data on expenditure, income and other
socio-economic variables. Your task is to use these data to develop a model to explain
variation in household expenditure. A description of the available variables is presented
below. The data (1200 observations) is included in the file expenditure.txt (available on
Blackboard).
id Household identifier
expenditure Total household expenditure (GBP) [Includes food, clothing, transport,
housing, education, etc.]
income Gross weekly average household income (GBP)
house.ten Household tenure: 1 = Public rented, 2 = Private rented, 3 = Owned
sex.hh Sex of the household head: 1 = Male, 2 = Female
lab.force Employment status: 1 = Full time working, 2 = Part time working,
3 = Unemployed, 4 = Economically inactive
hh.size Household size: 1 = 1 person, 2 = 2 persons, 3 = 3 persons,
4 = 4 persons, 5 = 5 persons or more
hh.adults Number of adults in the household: 1 = 1 adult, 2 = 2 adults,
3 = 3 adults, 4 = 4 adults or more
1. Produce and briefly discuss appropriate tables or plots to assess the distribution of
expenditure and the relationship between expenditure and income, house.ten,
sex.hh, lab.force, hh.size and hh.adults.
[10 marks]
2. Regress expenditure on income and present the estimated coecients and their
standard errors. Assess the regression assumptions using appropriate plots.
[4 marks]
3. Regress expenditure on income and income squared, and present the estimated
coecients and their standard errors. Assess the regression assumptions using ap-
propriate plots.
[4 marks]
4. Regress the natural logarithm of expenditure on income, and present the estimated
coecients and their standard errors. Assess the regression assumptions using ap-
propriate plots.
[4 marks]
5. Regress the natural logarithm of expenditure on income and income squared, and
present the estimated coecients and their standard errors. Assess the regression
assumptions using appropriate plots.
[4 marks]
2
6. Which of the above four models best describe the relationship between expenditure
and income? Justify your answer and summarise the relationship between expenditure
and income based on your preferred model.
[4 marks]
7. By considering the addition of the other variables and interactions to your preferred
model from question 6, propose a suitable regression model for expenditure. Doc-
ument your model building process and use diagnostic tools to assess the fit of your
model.
[18 marks]
8. Describe the relationship between expenditure and the explanatory variables in
your model.
[12 marks]
9. Up to 5 marks will be allocated for general presentation of the results in the report.
[5 marks]
3
Task 2 [Total 35 marks, max. 2 pages]
For this task you need to (a) submit R code using the R template, which will be used
to replicate your answers, and (b) include the answers to the questions below in your
report. You are not allowed to use existing R functions that fit models. However,
you are allowed to use other R functions, for example, those required for matrix algebra
computations.
The dataset for this task includes data on the number of days ahead travellers purchase
their airline tickets (y) and the distance in kilometres they plan to travel (x). The data
file (available on Blackboard) is called airline.txt and contains 1000 records on the two
variables y and x. One way to model the number of days ahead travellers purchase their
airline tickets is by using a distribution with probability density function (p.d.f.) given in
(1), where denotes the parameter of interest:
f(y; ) = ey, y > 0, ? > 0. (1)
The p.d.f. in (1) is a member of the exponential family with the following components
(using the same notation as in the lecture notes):
Let Y1, . . . , Yn be independent random variables from (1) and assume that the mean number
of days, μi, can be modelled as a function of distance xi using the following link function
and systematic component, log μi = 0 + 1xi, with μi = E(Yi), for i = 1, . . . , n.
1. Use the expressions provided in the lecture notes and the information above to obtain
the score u() and the information I(), where = (0, 1)T , under the link function
and systematic component specified above. Present your derivations and report the
score and information.
[10 marks]
2. Using the score and the information, write R code that implements the Fisher scoring
algorithm to fit a glm to dataset airline.txt under the distribution specified in
(1) with the link function and systematic component specified above. Obtain the
maximum likelihood estimate (m.l.e.) of = (0, 1). Report the point estimates of
the model parameters.