STA442程序辅导、R编程语言辅导
- 首页 >> Algorithm 算法 STA442 Methods of Applied Statistics
Due 8 Oct 2021
1 Question 1: School
The dataset described at http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#chem97
contains chemistry test scores for 31,000 individuals from 2,200 schools in the UK. The outcome of interest is
the “Average GCSE sore” variable, a student’s final high school grade coded as ‘grade’ below. The maximum
grade is 8 (in this dataset, the original grades have been scaled somehow), and it appears that the number of
grade points below the maximum (variable ‘y’ below) looks like a Gamma distribution (don’t take my word
for it, check yourself). Each row is an individual, the first column is an id variable for region and the second
column indicates the school the subject belongs to.
The question of interest is to identify the important components of variation in GCSE scores. How important
are differences between schools, differences between regions, sex differences, or how old the student is? All
the students completed high school at age 18, it is hypothesized that those born near the end of a calendar
year (and therefore starting school a few months older than those born at the start of a calendar year) have
a slight advantage and should achieve higher grades.
Consider the code below, some of which is intended to be helpful and some of which is deliberately misleading.
xFile = Pmisc::downloadIfOld("http://www.bristol.ac.uk/cmm/media/migrated/datasets.zip")
x = read.table(grep("chem97", xFile, value = TRUE), col.names = c("region",
"school", "indiv", "chem", "sexNum", "ageMonthC", "grade"))
x$sex = factor(x$sexNum, levels = c(0, 1), labels = c("M",
"F"))
x$age = (222 + x$ageMonthC)/12
x$y = pmax(0.05, 8 - x$grade)
library("INLA")
xres = inla(y ~ 0 + age + f(region), data = x, family = "gamma",
control.family = list(hyper = list(prec = list(prior = "loggamma",
param = c(1e-04, 1e-04)))))
Pmisc::priorPostSd(xres, group = "random")$summary
mean sd 0.025quant 0.5quant 0.975quant mode
SD for region 0.1284088 0.009897604 0.1097051 0.1281781 0.1485965 0.1279285
Pmisc::priorPostSd(xres, group = "family")$summary
mean sd 0.025quant
SD parameter for the Gamma observations 0.5586705 0.002150167 0.5544532
0.5quant 0.975quant mode
SD parameter for the Gamma observations 0.5586649 0.5629152 0.5586649
1. Define a statistical model suitable for answering the question of interest, justifying your choice of
model and explaining all the various terms in your model.
1
2. Give prior distributions to all unknown model parameters and justify your choice. A plausible assump-
tion is differences between regions or schools or sexes are unlikely to cause doubling of GCSE grades,
a 50% increase or decrease of grades is rare but could well happen, changes of 20% are more likely.
3. Write a one-paragraph summary of the main results of your analysis, accompanied by a nicely formatted
table of results.
2 Question 2: Smoke
Data from the 2014 American National Youth Tobacco Survey is available on http://pbrown.ca/teaching/appliedstats/data,
where there is an R version of the 2014 dataset smoke2014.RData, a pdf documentation file
NYTS2014-Codebook-p.pdf, and the code used to create the R version of the data smokingData2014.R.
You can obtain the data with:
smokeFile = "smokeDownload2014.RData"
if (!file.exists(smokeFile)) {
download.file("http://pbrown.ca/teaching/appliedstats/data/smoke2014.RData",
smokeFile)
}
(load(smokeFile))
[1] "smoke" "smokeFormats"
The smoke object is a data.frame containing the data, the smokeFormats gives some explanation of the vari-
ables. The colName and label columns of smokeFormats contain variable names in smoke and descriptions
respectively.
smokeFormats[smokeFormats[, "colName"] == "chewing_tobacco_snuff_or",
c("colName", "label")]
colName
150 chewing_tobacco_snuff_or
label
150 RECODE: Used chewing tobacco, snuff, or dip on 1 or more days in the past 30 days
Consider the following model and set of results
# get rid of 9-11 and 19 year olds and missing age and
# race
smokeSub = smoke[which(smoke$Age >= 12 & smoke$Age <= 18 &
!is.na(smoke$Race) & !is.na(smoke$chewing_tobacco_snuff_or) &
(!is.na(smoke$Sex))), ]
smokeSub$ageFac = relevel(factor(smokeSub$Age), "15")
smokeSub$y = as.numeric(smokeSub$chewing_tobacco_snuff_or)
lincombDf = do.call(expand.grid, lapply(smokeSub[, c("ageFac",
"Sex", "Race", "RuralUrban")], levels))
lincombDf$y = -99
lincombList = inla.make.lincombs(as.data.frame(model.matrix(y ~
ageFac * Sex * RuralUrban * Race, lincombDf)))
library("INLA", quietly = TRUE)
smokeModel = inla(y ~ ageFac * Sex * RuralUrban * Race +
f(state) + f(school), lincomb = lincombList, data = smokeSub,
family = "binomial")
Warning in writeChar(lines[i], fp, nchars = nchar(lines[i]), eos = NULL):
problem writing to connection
2
knitr::kable(1/sqrt(smokeModel$summary.hyper[, c(4, 5, 3)]),
digits = 3)
0.5quant 0.975quant 0.025quant
Precision for state 0.008 0.003 0.02
Precision for school 0.803 0.637 1.00
An interesting plot is Figure 1.
smokePred = smokeModel$summary.lincomb.derived[,
paste0(c(0.5, 0.025, 0.975), 'quant')]
smokePred = exp(smokePred)/(1+exp(smokePred))
smokePred$diff = smokePred$'0.975quant' - smokePred$'0.025quant'
lincombDf$Age = as.numeric(as.character(lincombDf$ageFac))
lincombDf$ageShift = lincombDf$Age + 0.06*(as.numeric(lincombDf$Race)-2) +
0.3*(lincombDf$RuralUrban == 'Urban')
Spch = c('Rural' = 15, 'Urban' = 1)
Scol = c(black = 'black', white = 'red', hispanic='blue')
toPlot = (lincombDf$Race %in% names(Scol)) & (smokePred$diff < 0.9) &
lincombDf$Sex == 'M'
lincombDfHere = lincombDf[toPlot,]
smokePredHere = smokePred[toPlot,]
plot(
lincombDfHere$ageShift,
smokePredHere$'0.5quant',
pch = Spch[as.character(lincombDfHere$RuralUrban)],
col = Scol[as.character(lincombDfHere$Race)],
# log='y',
ylim = c(0,max(smokePredHere)),
xlab='age', ylab='prob', #yaxt='n',
yaxs='i', bty='l')
#forY = 1/c(4,10,25,100,500)
#axis(2, at=forY, mapply(format, forY), las=1)
segments(lincombDfHere$ageShift, smokePredHere$'0.025quant',
lincombDfHere$ageShift, smokePredHere$'0.975quant',
col = Scol[as.character(lincombDfHere$Race)])
legend('topleft', bty='n',
ncol = 2,
pch=c(rep(NA, length(Scol)), Spch),
lty = rep(c(1,NA), c(length(Scol), length(Spch))),
col = c(Scol, rep('black', length(Spch))),
legend=c(names(Scol), names(Spch)))
Consider the following two hypotheses:
1. State-level differences in chewing tobacco usage amongst high school students are much larger than
differences between schools within a state. Tobacco chewing is believed to be strongly regional, very
popular in some states and rare in others.
2. If American TV is an accurate reflection of reality, tobacco chewing is mostly done by Cowboys.
Cowboys are Male, white and live in rural areas. Tobacco chewing is fairly common amongst White
rural American high school students of every age, and virtually unknown for other ethnic groups and
Due 8 Oct 2021
1 Question 1: School
The dataset described at http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#chem97
contains chemistry test scores for 31,000 individuals from 2,200 schools in the UK. The outcome of interest is
the “Average GCSE sore” variable, a student’s final high school grade coded as ‘grade’ below. The maximum
grade is 8 (in this dataset, the original grades have been scaled somehow), and it appears that the number of
grade points below the maximum (variable ‘y’ below) looks like a Gamma distribution (don’t take my word
for it, check yourself). Each row is an individual, the first column is an id variable for region and the second
column indicates the school the subject belongs to.
The question of interest is to identify the important components of variation in GCSE scores. How important
are differences between schools, differences between regions, sex differences, or how old the student is? All
the students completed high school at age 18, it is hypothesized that those born near the end of a calendar
year (and therefore starting school a few months older than those born at the start of a calendar year) have
a slight advantage and should achieve higher grades.
Consider the code below, some of which is intended to be helpful and some of which is deliberately misleading.
xFile = Pmisc::downloadIfOld("http://www.bristol.ac.uk/cmm/media/migrated/datasets.zip")
x = read.table(grep("chem97", xFile, value = TRUE), col.names = c("region",
"school", "indiv", "chem", "sexNum", "ageMonthC", "grade"))
x$sex = factor(x$sexNum, levels = c(0, 1), labels = c("M",
"F"))
x$age = (222 + x$ageMonthC)/12
x$y = pmax(0.05, 8 - x$grade)
library("INLA")
xres = inla(y ~ 0 + age + f(region), data = x, family = "gamma",
control.family = list(hyper = list(prec = list(prior = "loggamma",
param = c(1e-04, 1e-04)))))
Pmisc::priorPostSd(xres, group = "random")$summary
mean sd 0.025quant 0.5quant 0.975quant mode
SD for region 0.1284088 0.009897604 0.1097051 0.1281781 0.1485965 0.1279285
Pmisc::priorPostSd(xres, group = "family")$summary
mean sd 0.025quant
SD parameter for the Gamma observations 0.5586705 0.002150167 0.5544532
0.5quant 0.975quant mode
SD parameter for the Gamma observations 0.5586649 0.5629152 0.5586649
1. Define a statistical model suitable for answering the question of interest, justifying your choice of
model and explaining all the various terms in your model.
1
2. Give prior distributions to all unknown model parameters and justify your choice. A plausible assump-
tion is differences between regions or schools or sexes are unlikely to cause doubling of GCSE grades,
a 50% increase or decrease of grades is rare but could well happen, changes of 20% are more likely.
3. Write a one-paragraph summary of the main results of your analysis, accompanied by a nicely formatted
table of results.
2 Question 2: Smoke
Data from the 2014 American National Youth Tobacco Survey is available on http://pbrown.ca/teaching/appliedstats/data,
where there is an R version of the 2014 dataset smoke2014.RData, a pdf documentation file
NYTS2014-Codebook-p.pdf, and the code used to create the R version of the data smokingData2014.R.
You can obtain the data with:
smokeFile = "smokeDownload2014.RData"
if (!file.exists(smokeFile)) {
download.file("http://pbrown.ca/teaching/appliedstats/data/smoke2014.RData",
smokeFile)
}
(load(smokeFile))
[1] "smoke" "smokeFormats"
The smoke object is a data.frame containing the data, the smokeFormats gives some explanation of the vari-
ables. The colName and label columns of smokeFormats contain variable names in smoke and descriptions
respectively.
smokeFormats[smokeFormats[, "colName"] == "chewing_tobacco_snuff_or",
c("colName", "label")]
colName
150 chewing_tobacco_snuff_or
label
150 RECODE: Used chewing tobacco, snuff, or dip on 1 or more days in the past 30 days
Consider the following model and set of results
# get rid of 9-11 and 19 year olds and missing age and
# race
smokeSub = smoke[which(smoke$Age >= 12 & smoke$Age <= 18 &
!is.na(smoke$Race) & !is.na(smoke$chewing_tobacco_snuff_or) &
(!is.na(smoke$Sex))), ]
smokeSub$ageFac = relevel(factor(smokeSub$Age), "15")
smokeSub$y = as.numeric(smokeSub$chewing_tobacco_snuff_or)
lincombDf = do.call(expand.grid, lapply(smokeSub[, c("ageFac",
"Sex", "Race", "RuralUrban")], levels))
lincombDf$y = -99
lincombList = inla.make.lincombs(as.data.frame(model.matrix(y ~
ageFac * Sex * RuralUrban * Race, lincombDf)))
library("INLA", quietly = TRUE)
smokeModel = inla(y ~ ageFac * Sex * RuralUrban * Race +
f(state) + f(school), lincomb = lincombList, data = smokeSub,
family = "binomial")
Warning in writeChar(lines[i], fp, nchars = nchar(lines[i]), eos = NULL):
problem writing to connection
2
knitr::kable(1/sqrt(smokeModel$summary.hyper[, c(4, 5, 3)]),
digits = 3)
0.5quant 0.975quant 0.025quant
Precision for state 0.008 0.003 0.02
Precision for school 0.803 0.637 1.00
An interesting plot is Figure 1.
smokePred = smokeModel$summary.lincomb.derived[,
paste0(c(0.5, 0.025, 0.975), 'quant')]
smokePred = exp(smokePred)/(1+exp(smokePred))
smokePred$diff = smokePred$'0.975quant' - smokePred$'0.025quant'
lincombDf$Age = as.numeric(as.character(lincombDf$ageFac))
lincombDf$ageShift = lincombDf$Age + 0.06*(as.numeric(lincombDf$Race)-2) +
0.3*(lincombDf$RuralUrban == 'Urban')
Spch = c('Rural' = 15, 'Urban' = 1)
Scol = c(black = 'black', white = 'red', hispanic='blue')
toPlot = (lincombDf$Race %in% names(Scol)) & (smokePred$diff < 0.9) &
lincombDf$Sex == 'M'
lincombDfHere = lincombDf[toPlot,]
smokePredHere = smokePred[toPlot,]
plot(
lincombDfHere$ageShift,
smokePredHere$'0.5quant',
pch = Spch[as.character(lincombDfHere$RuralUrban)],
col = Scol[as.character(lincombDfHere$Race)],
# log='y',
ylim = c(0,max(smokePredHere)),
xlab='age', ylab='prob', #yaxt='n',
yaxs='i', bty='l')
#forY = 1/c(4,10,25,100,500)
#axis(2, at=forY, mapply(format, forY), las=1)
segments(lincombDfHere$ageShift, smokePredHere$'0.025quant',
lincombDfHere$ageShift, smokePredHere$'0.975quant',
col = Scol[as.character(lincombDfHere$Race)])
legend('topleft', bty='n',
ncol = 2,
pch=c(rep(NA, length(Scol)), Spch),
lty = rep(c(1,NA), c(length(Scol), length(Spch))),
col = c(Scol, rep('black', length(Spch))),
legend=c(names(Scol), names(Spch)))
Consider the following two hypotheses:
1. State-level differences in chewing tobacco usage amongst high school students are much larger than
differences between schools within a state. Tobacco chewing is believed to be strongly regional, very
popular in some states and rare in others.
2. If American TV is an accurate reflection of reality, tobacco chewing is mostly done by Cowboys.
Cowboys are Male, white and live in rural areas. Tobacco chewing is fairly common amongst White
rural American high school students of every age, and virtually unknown for other ethnic groups and