讲解GU4206/GR5206、辅导R程序语言
- 首页 >> CSSTAT GU4206/GR5206 Sample Midterm
Gabriel
3/8/2019
The STAT GU4206/GR5206 Midterm is open notes, open book(s), open computer and online resources are
allowed. Students are required to be physically present during the exam. The TA/instructor will be available
to answer questions during the exam. Students are not allowed to communicate with any other people
regarding the exam with the exception of the instructor (Gabriel Young) and course TAs. This includes
emailing fellow students, using WeChat and other similar forms of communication. If there is any suspicion of
one or more students cheating, further investigation will take place. If students do not follow the guidelines,
they will receive a zero on the exam and potentially face more severe consequences. The exam will be posted
on Canvas at 10:05AM. Students are required to submit both the .pdf (or .html) and .Rmd files on Canvas by
12:40AM. If students fail to knit the pdf or html file, the TA will take off a significant portion of the grade.
Students will also be significantly penalized for late exams. If for some reason you are unable to upload the
completed exam on Canvas by 12:40PM, then immediately email markdown file to the course TA.
Important: If you have a bug in your code then RMarkdown will not knit. I highly recommend that you
comment out any non-working code. That way your file will knit and you will not be penalized for only
uploading the Rmd file.
1
Part I - Character data and regular expressions
Consider the following toy dataset strings_data.csv. This dataset has 461 rows (or length 461 using
readLines) and consists of random character strings.
char_data <- readLines("strings_data.csv")
head(char_data,8)
## [1] "\"strings\""
## [2] "\"rmJgFZUGKsBlvmuUOuWnFUyziiyWEEhiRROlJJXRXxOwp\""
## [3] "\"bacUqblSKDopCEAYWdgD\""
## [4] "\"qsPuSJdkmv\""
## [5] "\"RXAnEoHlliMllHMPFTcv\""
## [6] "\"SBolTFf0.2nMoQ9.454lKlgjQZGroup_IOMLFgXj\""
## [7] "\"rtoMgy0.36bRrnA9.454goQIJGroup_IMCRp\""
## [8] "\"CqdniznveOdQRhMyctjUEULimqmQjV\""
length(char_data)
## [1] 461
Among the 461 cases, several rows contain numeric digits and a specific string of the form “Group_Letter”,
where “Letter”" is a single uppercase letter. For example, the 6th element contains the symbols
“0.2”,“9.454”,“Group_I”.
char_data[6]
## [1] "\"SBolTFf0.2nMoQ9.454lKlgjQZGroup_IOMLFgXj\""
c("0.2","9.454","Group_I")
## [1] "0.2" "9.454" "Group_I"
Problem 1
Your task is to extract the numeric digits and the group variable from this character string vector. Notes:
1. The first number x is a single digit followed by a period and at least one digit. There are a few cases
where the first number is only a single digit without a period.
2. The second number y is one or two digits followed by a period and at least one digit. Note that the
second number can be negative or positive.
3. The group value is the string "Group_" followed by a single capital letter. For example "Group_I" and
"Group_S" are both elements of the third string of interest.
Once you extract all three symbols, make sure to convert the numeric digits to a numeric mode (use
as.numeric()) and organize the scrapped information in a dataframe. Your final dataframe should have 230
rows by 3 columns. The first three rows of your dataframe should look like the following output:
data.frame(x=c(0.20,0.36,0.56),
y=c(9.454,9.454,9.454),
Group=c("Group_I","Group_I","Group_I"))
## x y Group
## 1 0.20 9.454 Group_I
## 2 0.36 9.454 Group_I
## 3 0.56 9.454 Group_I
2
Solution
## Code goes here ------
Problem 2
Use both base R and ggplot to construct a scatterplot of the variables y versus x and split the colors of
the plot by the variable Group. Also include a legend, relabel the axes and include a title. Make sure the
legend doesn’t cover up the plot in base R.
Base R plot
## Code goes here ------
ggplot plot
library(ggplot2)
## Code goes here ------
Part II - Data proccessing and exploratory analysis
The data comprise of roughly 25,000 records for males between the age of 18 and 70 who are full time workers.
A variety of variables are given for each subject: years of education and job experience, college graduate (yes,
no), working in or near a city (yes, no), US region (midwest, northeast, south, west), commuting distance,
number of employees in a company, and race (African America, Caucasian, Other). The response variable is
weekly wages (in dollars). The data are taken many decades ago so the wages are low compared to current
times.
salary_data <- read.csv("salary.txt",as.is=T,header=T)
head(salary_data)
## wage edu exp city reg race deg com emp
## 1 354.94 7 45 yes northeast white no 24.3 200
## 2 370.37 9 9 yes northeast white no 26.2 130
## 3 754.94 11 46 yes northeast white no 26.4 153
## 4 593.54 12 36 yes northeast other no 9.9 86
## 5 377.23 16 22 yes northeast white yes 7.1 181
## 6 284.90 8 51 yes northeast white no 11.4 32
Below I am defining a new variable in the salary_data dataframe which computes the natural logarithm of
wages.
salary_data$log_wage <- log(salary_data$wage)
Problem 3
Use the summary() function on the salary dataset to check if the variables make sense. Specifically, one of
the continuous variables has some “funny” values. Remove the rows of the dataframe corresponding to these
strange values. If you can’t figure this question out, then move on because you can still solve Problem 4 & 5
without Problem 3.
Solution
## Code goes here ------
3
Problem 4
Using ggplot, plot log_wages against work experience, i.e., x=exp and y=log_wages. In this graphic,
change the transparency of the points so that the scatterplot does not look so dense. Note: the alpha
parameter changes the transparency. Also label the plot appropriately.
Solution
library(ggplot2)
## Code goes here ------
Notice that your graphic constructed from Problem 4 shows a quadratic or curved relationship between
log_wages against exp. The next task is to plot three quadratic functions for each race level “black”,
“white” and “other”. To estimate the quadratic fit, you can use the following function quad_fit:
quad_fit <- function(data_sub) {
return(lm(log_wage~exp+I(exp^2),data=data_sub)$coefficients)
}
quad_fit(salary_data)
## (Intercept) exp I(exp^2)
## 5.680659297 0.061220716 -0.001103711
The above function computes the least squares quadratic fit and returns coefficients aˆ1,aˆ2 and aˆ3, where
Yˆ = ˆa1 + ˆa2x + ˆa3x
2
and Yˆ = log(wage) and x = exp.
Use ggplot to accomplish this task or use base R graphics for partial credit. Make sure to include a legend
and appropriate labels.
Solution
## Code goes here ------
Part III - The Bootstrap
Data and model description
Consider a study that assesses how a drug affects someone’s resting heart rate. The study consists of n = 60
respondents. The researcher randomly places the respondents into three groups; control group and two dosage
groups (20 each). The first drug group is given 200 mg (x1) and the second drug group is given 500 mg (x2).
She then measures each respondent’s resting heart rate 1 hour after the drug was administered (Y ). She also
measures other characteristics of each respondent; age (x3), weight (x4), height (x5), gender (x6) and initial
resting heart rate before the drug was administered (x7). The statistical linear regression model is:
Y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + ,
iid∼ N(0, σ2
).
There are three dummy variables for this model:
Based on the above variable coding, the control group is described through the intercept β0.
4
Exploratory analysis
The dataset drugstudy.csv is read in below.
drugstudy <- read.table("drugstudy.txt",header=T)
head(drugstudy)
## Final.HR Initial.HR Dose1 Dose2 Age Height Weight Gender
## 1 75.1 73.6 0 0 29 73.1 251.73 1
## 2 71.6 71.7 0 0 34 72.4 151.59 1
## 3 65.5 66.5 0 0 25 67.2 133.89 1
## 4 77.2 72.7 0 0 39 69.8 154.91 1
## 5 75.8 75.8 0 0 32 72.7 186.59 1
## 6 67.9 68.7 0 0 25 66.4 205.77 1
Problem 5
Compute the average final resting heart rate for each drug group. Also compute the average initial resting
heart rate for each drug group. Display the results in dataframe or table.
Solution
## Code goes here ------
Problem 6
Construct a comparative boxplot of the respondents final resting heart rate for each drug group. Use base R
or ggplot. Make sure to label the plot appropriately.
Solution
## Code goes here ------
Nonparametric analysis (bootstrap)
Consider a nonparametric approach to assess the drug’s impact on final resting heart rate. More specifically,
the researcher is going to perform a bootstrap procedure on the following parameters:
1. β1
2. β2
3. β1 − β2
The final bootstrap intervals incorporate the three testing procedures:
1. H0 : β1 = 0 vs. HA : β1 6= 0
2. H0 : β2 = 0 vs. HA : β2 6= 0
3. H0 : β1 − β2 = 0 vs. HA : β1 − β2 6= 0
When testing β1 = 0, we are investigating the impact of the 200mg dosage group versus the control group.
Similarly, when testing β2 = 0, we are investigating the impact of the 500mg dosage group versus the control
group. The third test β1 − β2 = 0 is describing if the low dosage group has the same impact on resting heart
rate as the high dosage group.
5
Problem 7
Perform the follwong tasks!
Run a bootstrap procedure on parameters β1, β2 and β1 − β2. (i) Construct a table or dataframe displaying
the least squares estimators of βˆ
1, βˆ
2 and βˆ
1 − βˆ
2 of the original dataset, (ii) the bootstrapped standard
errors, and (iii) the bootstrap 95% confidence intervals. Use the traditional bootstrap intervals with B = 1000
boot iterations. The table should look similar to the following output:
Parameter Estimate Boot SE 95% Boot L-Bound 95% Boot U-Bound
Beta1 # # # #
Beta2 # # # #
Beta1_Beta2 # # # #
Solution
## Code goes here ------
Problem 8
Briefly interpret your results. More specifically, check if zero falls in the bootstrap intervals and conclude if
we do or do not show statistical significance.