讲解MATH4007、STATISTICS辅导、R编程设计辅导、讲解R语言

2020.04.10 - 首页 >> CS

MATH4007 COMPUTATIONAL STATISTICS
Assessed Coursework 2 — 2019/2020
The deadline for this work is 3pm, Monday 30 March 2020, to be submitted via the
“Coursework 2 Submission” link on Moodle. Unauthorised late submission will be penalised by
5% of the full mark per day. Work submitted more than one week late will receive zero marks.
All components (see below) must be submitted by the deadline for the work to be considered
on time. You are reminded to familiarise yourself with the guidelines concerning plagiarism in
assessed coursework (see the student handbook), and note that this applies equally to computer
code as it does to written work. The submission should contain:
1. A pdf file containing any computational results (plots/relevant output) and discussion. This
can be produced using e.g. R Markdown, or by copying output into a Word document.
Please convert any documents to pdf for uploading.
2. A pdf of your theoretical working. A scan of handwritten work is fine, but you could also
typeset using Latex if you prefer. If it’s more convenient, you can combine this and the
above part into one, e.g. if you wish to put everything in one Latex document, but this is
not required.
3. An R script file, i.e. with a .r extension containing your R code. This should be clearly
formatted, and include brief comments so that a reader can understand what it is doing.
The code should also be ready to run without any further modification by the user, and
should reproduce your results (approximately, for simulation-based results).
Please make sure that all required working, results, details of implementation and discussion are
contained in components 1 and 2 of the above list and not in the script file. The script file
will only be used for verification of results. The exception is for the R code itself, whereby it is
sufficient to say “refer to script file” where a question asks you to write R code.
1. Density estimation is one situation where the common bias-variance trade-off occurs. In
this question, you’ll conduct a simulation study to investigate this, when the true value of
a density is known.
Consider the mixture density
f(x; µ0, µ1, p) = (1 − p)f0(x; µ0) + pf1(x; µ1), x ∈ R,
where p ∈ (0, 1), µ0 ∈ R, µ1 ∈ R are parameters and f0 and f1 are normal densities with
variance 1 and mean µ0 and µ1 respectively. Let p = 0.3, µ0 = 0, µ1 = 3.
(a) Describe how to sample from f using the “sequential” method, and write a function
in R to implement this.
(b) Use your function to simulate 100 samples from f. Plot a normalized histogram of your
samples, and overlay lines showing the true density and a kernel desnity estimate. (You
can use the density function, with a Gaussian kernel and the default bandwidth.)
Repeat this to get plots for 4 different samples, each of size 100. Comment on your
plots.
(c) Simulation can be used to investigate the bias, variance and mean-square error of the
density estimate ˆf(x). These are defined as
Bias E[ˆf(x)] − f(x)
Variance Var[ˆf(x)]
Mean-square error E[( ˆf(x) − f(x))2]
Investigate this at the single point x = 2, by doing the following.
(i) Fix a bandwidth.
(ii) Simulate 100 samples from f.
(iii) Compute and store the kernel density estimate of f(2) using your sample and
fixed bandwidth.
(iv) Repeat steps (ii) and (iii) a large number of times, which will give a large sample
from the sampling distribution of ˆf(2) for the given bandwidth.
You can then use your samples to estimate the bias, variance and mean-square error
of ˆf(2) for the given bandwidth.
Conduct a simulation study to investigate the bias, variance and mean square error
of the kernel density estimator of f(2). The objective of the study is to explore
the behaviour for different bandwidths. Use a Gaussian kernel, and you may use the
density function to compute the kernel density estimates. Produce plots of bias
against variance and mean-square error against bandwidth. Discuss your findings.
[15]
2. In a study to monitor glucose levels of patients with diabetes, 10 patients were fitted
with continuous glucose monitoring sensors to track glucose levels. If the sensors detect
abnormal glucose levels, this triggers an automatic intervention to change the rate of insulin
infusion. The following data show the number of days the patients were monitored for (the
follow-up time) and the number of instances where abnormal glucose levels were detected
so that an automatic infusion intervention was required.
Unfortunately, patients 9 and 10 had faulty sensors which failed at some point before the
follow-up time, so patients had to manage insulin levels manually. Therefore, all that is
known is that the true number of abnormal glucose level instances (counts) at the follow
up time is at least as large as the count in the table (indicated by ∗), which is the number
of counts at the point when the sensor failed.
Patient 1 2 3 4 5 6 7 8 9 10
Follow-up time (xi) 94 16 63 126 5 1 1 2 10 31
Count (yi) 5 1 5 14 3 1 1 4 15∗ 11∗
The following hierarchical model is proposed to model these data.
yi|xi, λi ∼ Poisson(λixi), i = 1, . . . , 10λi|α, β ∼ Gamma(α, β) i = 1, . . . , 10β|a, b ∼ Gamma(a, b)
using the parameterization that the density of a random variable Z which follows a Gamma
distribution with parameters a and b is
where Γ(·) is the Gamma function. Hence, λi represents the rate of abnormal glucose level
instances per day for the i
th individual. The yi are assumed independent given λi and the λi
are assumed independent given α and β. After consultation with the medical professionals,
the parameters α = 2, a = 0.01 and b = 1 can be considered as fixed. The parameter β
therefore controls the distribution of the rates of abnormal glucose level instances over the
larger population from which these patients were taken, and hence is the main object of
inference.
Let λ = (λ1, . . . , λ10) and denote all the observed counts by yobs. Also let
y = (y1, y2, . . . , y9, y10) denote the vector of all the true counts, including the unknown y9
and y10, and let x = (x1, . . . , x10).
(a) By augmenting the observed data with the 2 missing true counts for patients 9 and
10, give full details of the Gibbs sampler to sample from the posterior distribution
π(λ, β|yobs).
HINT: The overall target distribution is
π(y9, y10,λ, β|x, yobs) ∝ π(y|λ, β, x)π(λ, β), y9 ≥ 15, y10 ≥ 11
where the proportionality means as a function of all the unknown variables y9, y10,λ, β.
(b) Implement your Gibbs sampler in R. You may use any inbuilt R functions to sample
from the full conditional distributions, but you must write the main algorithm yourself.
Describe what steps you take to be confident that the MCMC has converged and is
giving reliable estimates.
(c) Use your samples to plot a density estimate of the marginal posterior density of β,
π(β|yobs).
(d) Produce a 95% posterior probability interval for β from your samples using
(i) A normal approximation of π(β|yobs) with mean and standard deviation estimated
from your samples.
(ii) The 2.5% and 97.5% quantiles of your samples. You can use the quantile
function for this.
With reference to your plot in (c), comment on any differences between the two
intervals.
(e) A new individual is about to monitor their glucose levels for 50 days using a sensor.
Let y∗ be the count of abnormal glucose levels in 50 days for the new individual.
Describe how to use your posterior samples of β to produce samples y∗
from the
predictive distribution π(y∗|yobs) = Rπ(y∗|β)π(β|yobs)dβ.
HINT: You do not need to attempt to work out any integrals theoretically. This is
purely about simulation. Your samples of β from the Gibbs sampler are from the
marginal posterior π(β|yobs). How can you use these to sample the joint distribution
π(y∗, λ, β|yobs), using the given hierarchical model, and how does this help?
(f) Implement this to obtain samples of y∗
from π(y∗|yobs), and use your samples to
estimate the probability that the new individual will need 50 or more automatic
interventions over the 50 days of monitoring (assuming their sensor doesn’t fail).
[25]
3