PSTAT 115程序辅导、R编程语言调试、data留学生程序讲解
- 首页 >> Java编程 Homework 3
PSTAT 115, Fall 2020
Due on November 8, 2020 at 11:59 pm
Note: If you are working with a partner, please submit only one homework per group with both names
and whether you are taking the course for graduate credit or not. Submit your Rmarkdown (.Rmd) and the
compiled pdf on Gauchospace.
Problem 1. Cancer Research in Laboratory Mice
As a reminder from homework 2, a laboratory is estimating the rate of tumorigenesis (the formation of
tumors) in two strains of mice, A and B. They have tumor count data for 10 mice in strain A and 13 mice in
strain B. Type A mice have been well studied, and information from other laboratories suggests that type
A mice have tumor counts that are approximately Poisson-distributed. Tumor count rates for type B mice
are unknown, but type B mice are related to type A mice. Assuming a Poisson sampling distribution for
each group with rates θA and θB. We assume θA ∼ gamma(120, 10) and θB ∼ gamma(12, 1). We observe
yA = (12, 9, 12, 14, 13, 13, 15, 8, 15, 6) and yB = (11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7). Now we will actually
investigate evidence that Type A mice are have higher rates of tumor formation than Type B mice.
a. For n0 ∈ {1, 2, ..., 50}, obtain P r(θB < θA | yA, yB) via Monte Carlo sampling for θA ∼ gamma(120, 10)
and θB ∼ gamma(12 × n0, n0). Make a line plot of P r(θB < θA | yA, yB) vs n0. Describe how sensitive
the conclusions about the event {θB < θA} are to the prior distribution on θB.
y_A <- c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
y_B <- c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
# YOUR CODE HERE
Type your answer here, replacing this text.
b. Repeat the previous part replacing the event {θB < θA} with the event {Y˜B < Y˜A}, where Y˜A and Y˜B
are samples from the posterior predictive distribution.
y_A = c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
y_B = c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
# YOUR CODE HERE
Type your answer here, replacing this text.
c. In the context of this problem, describe the meaning of the events {θB < θA} and {Y˜B < Y˜A}. How are
they different?
Type your answer here, replacing this text.
2. Posterior Predictive Model Checking
Model checking and refinement is an essential part of Bayesian data analysis. Let’s investigate the adequacy
of the Poisson model for the tumor count data. Consider strain A mice only for now, and generate posterior
predictive datasets y
(1)
A , ..., y
(1000)
A . Each y
(s)
A is a sample of size nA = 10 from the Poisson distribution with
parameter θ
(s)
A , θ
(s)
A is itself a sample from the posterior distribution p(θA | yA) and yA is the observed data.
For each s, let t
(s) be the sample average divided by the sample variance of y
(s)
A .
1
a. If the Poisson model was a reasonable one, what would a “typical” value t
(s) be? Why?
y_A = c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
# YOUR CODE HERE
Type your answer here, replacing this text.
b. In any given experiment, the realized value of t
s will not be exactly the “typical value” due to sampling
variability. Make a histogram of t
(s) and compare to the observed value of this statistic, mean(yA)
var(yA)
.
Based on this statistic, make a comment on if the Poisson model seems reasonable for these data (at
least by this one metric).
# YOUR CODE HERE
Type your answer here, replacing this text.
c. Repeat the part b) above for strain B mice, using YB and nB = 13 to generate the samples. Assume
the prior distribution p(θB) ∼ Gamma(12, 1). Again make a comment on the Poisson model fit.
y_B = c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
# YOUR CODE HERE
3. Interval estimation with rejection sampling.
a. Use rejection sampling to sample from the following density:
p(x) = 1
4
|sin(x)| × I{x ∈ [0, 2π]}
Use a proposal density which is uniform from 0 to 2π and generate at least 1000 true samples from p(x).
Compute and report the Monte Carlo estimate of the upper and lower bound for the 50% quantile interval
using the quantile function on your samples. Compare this to the 50% HPD region calculated on the
samples. What are the bounds on the HPD region? Report the length of the quantile interval and the total
length of the HPD region. What explains the difference? Hint: to compute the HPD use the hdi function
from the HDInterval package. As the first argument pass in density(samples), where samples is the name
of your vector of true samples from the density. Set the allowSplit argument to true and use the credMass
argument to set the total probability mass in the HPD region to 50%.
b. Plot p(x) using the curve function (base plotting) or stat_function (ggplot). Add lines corresponding
to the intervals / probability regions computed in the previous part to your plot using them segments
function. To ensure that the lines don’t overlap visually, for the HPD region set y0 and y1 to 0 and for
the quantile interval set y0 and y1 to 0.01. Make the segments for HPD region and the segment for
quantile interval different colors. Report the length of the quantile interval and the total length of the
HPD region, verifying that indeed the HPD region is smaller.
# Rejection sampling and interval construction
# YOUR CODE HERE
# hd_region is the result of calling hdi function
hd_region <- NULL # YOUR CODE HERE
print(hd_region)
print(sprintf("Total region length: %.02f", sum(hd_region[, "end"] - hd_region[,"begin"])))
quantile_interval <- NULL # YOUR CODE HERE
print(quantile_interval)
print(sprintf("Total region length: %.02f", quantile_interval[2] - quantile_interval[1]))
2
## Make the plot
# YOUR CODE HERE
3
PSTAT 115, Fall 2020
Due on November 8, 2020 at 11:59 pm
Note: If you are working with a partner, please submit only one homework per group with both names
and whether you are taking the course for graduate credit or not. Submit your Rmarkdown (.Rmd) and the
compiled pdf on Gauchospace.
Problem 1. Cancer Research in Laboratory Mice
As a reminder from homework 2, a laboratory is estimating the rate of tumorigenesis (the formation of
tumors) in two strains of mice, A and B. They have tumor count data for 10 mice in strain A and 13 mice in
strain B. Type A mice have been well studied, and information from other laboratories suggests that type
A mice have tumor counts that are approximately Poisson-distributed. Tumor count rates for type B mice
are unknown, but type B mice are related to type A mice. Assuming a Poisson sampling distribution for
each group with rates θA and θB. We assume θA ∼ gamma(120, 10) and θB ∼ gamma(12, 1). We observe
yA = (12, 9, 12, 14, 13, 13, 15, 8, 15, 6) and yB = (11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7). Now we will actually
investigate evidence that Type A mice are have higher rates of tumor formation than Type B mice.
a. For n0 ∈ {1, 2, ..., 50}, obtain P r(θB < θA | yA, yB) via Monte Carlo sampling for θA ∼ gamma(120, 10)
and θB ∼ gamma(12 × n0, n0). Make a line plot of P r(θB < θA | yA, yB) vs n0. Describe how sensitive
the conclusions about the event {θB < θA} are to the prior distribution on θB.
y_A <- c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
y_B <- c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
# YOUR CODE HERE
Type your answer here, replacing this text.
b. Repeat the previous part replacing the event {θB < θA} with the event {Y˜B < Y˜A}, where Y˜A and Y˜B
are samples from the posterior predictive distribution.
y_A = c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
y_B = c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
# YOUR CODE HERE
Type your answer here, replacing this text.
c. In the context of this problem, describe the meaning of the events {θB < θA} and {Y˜B < Y˜A}. How are
they different?
Type your answer here, replacing this text.
2. Posterior Predictive Model Checking
Model checking and refinement is an essential part of Bayesian data analysis. Let’s investigate the adequacy
of the Poisson model for the tumor count data. Consider strain A mice only for now, and generate posterior
predictive datasets y
(1)
A , ..., y
(1000)
A . Each y
(s)
A is a sample of size nA = 10 from the Poisson distribution with
parameter θ
(s)
A , θ
(s)
A is itself a sample from the posterior distribution p(θA | yA) and yA is the observed data.
For each s, let t
(s) be the sample average divided by the sample variance of y
(s)
A .
1
a. If the Poisson model was a reasonable one, what would a “typical” value t
(s) be? Why?
y_A = c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
# YOUR CODE HERE
Type your answer here, replacing this text.
b. In any given experiment, the realized value of t
s will not be exactly the “typical value” due to sampling
variability. Make a histogram of t
(s) and compare to the observed value of this statistic, mean(yA)
var(yA)
.
Based on this statistic, make a comment on if the Poisson model seems reasonable for these data (at
least by this one metric).
# YOUR CODE HERE
Type your answer here, replacing this text.
c. Repeat the part b) above for strain B mice, using YB and nB = 13 to generate the samples. Assume
the prior distribution p(θB) ∼ Gamma(12, 1). Again make a comment on the Poisson model fit.
y_B = c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
# YOUR CODE HERE
3. Interval estimation with rejection sampling.
a. Use rejection sampling to sample from the following density:
p(x) = 1
4
|sin(x)| × I{x ∈ [0, 2π]}
Use a proposal density which is uniform from 0 to 2π and generate at least 1000 true samples from p(x).
Compute and report the Monte Carlo estimate of the upper and lower bound for the 50% quantile interval
using the quantile function on your samples. Compare this to the 50% HPD region calculated on the
samples. What are the bounds on the HPD region? Report the length of the quantile interval and the total
length of the HPD region. What explains the difference? Hint: to compute the HPD use the hdi function
from the HDInterval package. As the first argument pass in density(samples), where samples is the name
of your vector of true samples from the density. Set the allowSplit argument to true and use the credMass
argument to set the total probability mass in the HPD region to 50%.
b. Plot p(x) using the curve function (base plotting) or stat_function (ggplot). Add lines corresponding
to the intervals / probability regions computed in the previous part to your plot using them segments
function. To ensure that the lines don’t overlap visually, for the HPD region set y0 and y1 to 0 and for
the quantile interval set y0 and y1 to 0.01. Make the segments for HPD region and the segment for
quantile interval different colors. Report the length of the quantile interval and the total length of the
HPD region, verifying that indeed the HPD region is smaller.
# Rejection sampling and interval construction
# YOUR CODE HERE
# hd_region is the result of calling hdi function
hd_region <- NULL # YOUR CODE HERE
print(hd_region)
print(sprintf("Total region length: %.02f", sum(hd_region[, "end"] - hd_region[,"begin"])))
quantile_interval <- NULL # YOUR CODE HERE
print(quantile_interval)
print(sprintf("Total region length: %.02f", quantile_interval[2] - quantile_interval[1]))
2
## Make the plot
# YOUR CODE HERE
3