STATS 779: Professional Skills for Statisticians 2018


Department of Statistics

STATS 779:

Professional Skills for Statisticians

Test: May 29, 2018

2:00 pm–6:00 pm.

INSTRUCTIONS

* Total marks = 90.

* Attempt all questions.

* Note: Some questions are open-ended and it may not be clear how extensive your answer should be. Do not write long answers to these questions. You should be able to answer any question of this type in a few paragraphs at most, or within half a page.

1 The National Identity Card (NIC) number of individuals in Sri Lanka has ten unique characters. Positions 1–9 are numerical and position 10 is an alpha character. The following numbering system is used to define the first five characters:

• Positions 1–2: the year of birth. For example, 81 indicates that the birth year is 1981.

• Positions 3–5: the number of the day in the year on which the person’s birth date falls. A male would be assigned the number 1–366 and a female the number 501–866. For example, a male born on 5 January is represented by 005; a female born on the same day is represented by 505.

Example: The first five characters of the NIC for a male born on 5 January 1981 would be 81005; a female born on that same date would be 81505.
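The numbering scheme above can be sketched as a small decoding function. This is only an illustration in Python, not the Excel formulas the question asks for. The function name `decode_nic_prefix` is hypothetical, and placing every birth year in the 1900s is an assumption taken from the 81 → 1981 example.

```python
def decode_nic_prefix(nic: str) -> dict:
    """Decode positions 1-5 of an NIC number into year, day of year and gender."""
    # Assumption: two-digit years belong to the 1900s, as in the 81 -> 1981 example.
    birth_year = 1900 + int(nic[0:2])
    day_code = int(nic[2:5])
    if 1 <= day_code <= 366:
        gender, day_of_year = "MALE", day_code
    elif 501 <= day_code <= 866:
        # Females carry an offset of 500 on the day number.
        gender, day_of_year = "FEMALE", day_code - 500
    else:
        raise ValueError(f"day code {day_code} is outside 1-366 and 501-866")
    return {"birth_year": birth_year, "day_of_year": day_of_year, "gender": gender}


print(decode_nic_prefix("81005"))
print(decode_nic_prefix("81505"))
```

Both calls describe the same birth date (day 5 of 1981) and differ only in the decoded gender.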

Note: Column C shows the number of the day in the year on which the person’s birth date falls, a number between 1 and 366.

Write down the Excel worksheet formula(s) to be entered in:

a cell B2 that extracts the birth year of the individual from the given NIC number. For example, the output in cell B2 should be 1999.

b cells D2 and E2 to obtain the birth month and day, respectively.

c cell F2 to obtain the date of birth. The output in cell F2 should follow the dd/mm/yyyy format.

d cell G2 to obtain the gender (i.e., FEMALE vs MALE).

General Tips: You will need to use the following Excel functions:

The LEFT and MID functions extract one or more characters from a string, starting from the left-hand side or from a given position in the middle of the string, respectively. The syntaxes of the functions are:

LEFT(text, [num_chars])

MID(text, start_num, num_chars)

text   Required. The text string that contains the characters you want to extract from.

start_num   Required. The position of the first character you want to extract.

num_chars   Optional for the LEFT function. Specifies the number of characters you want.

The VALUE function is used to convert a text string that represents a number to a number. The syntax of the function is VALUE(text) where:

text   Required. Text enclosed in quotation marks or a reference cell containing the text you want to convert.

MONTH and DAY functions can be used to find the birth month and day of the individual. The syntaxes of the functions are MONTH(serial) and DAY(serial) where:

serial   Required. A number in the date-time code.

CONCATENATE is used to join several text strings into one text string. The syntax of the function is CONCATENATE(text1, [text2], ...) where:

text1   Required. text1, text2, ... are 1 to 255 text strings to be joined into a single text string and can be text strings, numbers, or single-cell references.

[10 marks]

2 Amanda learned in her second year about the non-technical interpretation of the 95% confidence interval of the mean.

If we compute a 95% confidence interval of the mean for each sample taken from the population, then 95% of the intervals will capture the unknown population mean.
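The coverage statement above can be checked with a small simulation. The sketch below is in Python rather than R and, to stay within the standard library, it uses a known-σ interval with a normal critical value, not a t-interval. The population parameters, sample size, seed, and the function name `coverage_rate` are all made up for illustration.

```python
import random
from statistics import NormalDist, mean


def coverage_rate(n_samples=2000, n=25, mu=10.0, sigma=2.0, level=0.95, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for a 95% level
    half_width = z * sigma / n ** 0.5              # known-sigma interval half-width
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = mean(sample)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / n_samples


print(coverage_rate())
```

The printed proportion should be close to 0.95, and it varies slightly with the seed and the number of simulated samples.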

Amanda wants to visualize this as in Figure 1. You have been asked to help her with writing appropriate R code. Partial code is shown in Figure 2.

Use the given variable names to write R commands:

a In lines 20 and 23, to compute the upper and lower confidence limits, respectively, of each sample generated.

Hint: The 95% confidence interval (assuming a Gaussian distribution) is given by

$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$

where $\bar{x}$ and $s$ are the sample mean and standard deviation, respectively, of $n$ observations, $\alpha$ is the significance level, and $t_{\alpha/2,\,n-1}$ is the critical value from the t-distribution with $n - 1$ degrees of freedom.

b In line 26 to plot a blue vertical line for the population mean.

c In line 29 to annotate the line drawn in part 2b.

Hint: The mtext function is useful.

d In lines 32–40 to draw the confidence interval for each sample. Set col = "gray" if the confidence interval captures the unknown population mean and set col = "red" otherwise.

[11 marks]

Figure 1: Non-technical explanation of the 95% confidence interval.

Figure 2: Partial R code.

3 A general system of m linear equations with n unknowns can be written in matrix notation as

Ax = b

where A is an m × n matrix of coefficients, x is an n × 1 vector of unknowns and b is an m × 1 vector of constants.

If the matrix A is square (i.e., m = n) and has full rank (i.e., determinant of the matrix A is non-zero), then the system has a unique solution given by

x = A^{-1} b.
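As a worked illustration of these conditions, the sketch below solves a 2 × 2 system in Python with the inverse written out explicitly. It is not the R function the question asks for. The function name `solve_2x2` and the example system are hypothetical; the error messages are borrowed from Figure 3.

```python
from fractions import Fraction


def solve_2x2(a, b):
    """Solve a x = b for a 2 x 2 matrix a (list of rows) and a length-2 vector b."""
    if len(a) != len(b):
        raise ValueError("Dimensions of A and b don't match")
    if any(len(row) != len(a) for row in a):
        raise ValueError("A should be a square matrix")
    if len(a) != 2:
        raise ValueError("this sketch only handles 2 x 2 systems")
    det = Fraction(a[0][0]) * a[1][1] - Fraction(a[0][1]) * a[1][0]
    if det == 0:
        raise ValueError("A is a rank deficient matrix")
    # x = A^{-1} b, using the closed-form inverse of a 2 x 2 matrix
    x1 = (a[1][1] * b[0] - a[0][1] * b[1]) / det
    x2 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x1, x2]


# Example system: 2*x1 + x2 = 3 and x1 + 3*x2 = 5
print(solve_2x2([[2, 1], [1, 3]], [3, 5]))
```

Using `Fraction` keeps the arithmetic exact, so substituting the result back into both equations reproduces the right-hand side without rounding error.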

An incomplete R function is given in Figure 3. Fill the appropriate R commands in lines 4, 8, 12, and 16.

Hint: You can use the det function to find the determinant of matrix A.

[5 marks]

4 Tom and Jerry have been tasked to count the number of times the word “as” appears in a given .txt file. Tom found that there are 31 matches, but is not willing to show his regex pattern. Jerry found 72 matches by setting pattern = "[aA]s(\\s|$)" in the gregexpr function.

The lecturer also said that Tom’s answer is correct.
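To see how the two counts can differ, the sketch below runs Jerry's pattern and a whole-word alternative on a made-up sentence. It uses Python's `re` module rather than R's gregexpr. The sentence is hypothetical, and the `\b` pattern is only one possible whole-word pattern, not necessarily the one Tom used.

```python
import re

text = "He was as tall as Thomas"  # made-up sentence, not the exam's .txt file

jerry = re.compile(r"[aA]s(\s|$)")     # Jerry's pattern from the question
whole_word = re.compile(r"\b[aA]s\b")  # assumed whole-word alternative

jerry_hits = [m.group(0) for m in jerry.finditer(text)]
word_hits = [m.group(0) for m in whole_word.finditer(text)]

print(len(jerry_hits), len(word_hits))
```

On this sentence Jerry's pattern also matches the endings of "was" and "Thomas", because nothing in it requires "as" to start at a word boundary.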

a Write R code which uses a regular expression to find the correct number of occurrences of the word “as”. Assume that contents of the .txt file have been read into a character vector called lines.

 1   # Amat: matrix of coefficients
 2   # Bmat: vector of constants
 3   leqDir <- function(Amat, Bmat) {
 4     if() {
 5       stop("Dimensions of A and b don't match")
 6     }
 7
 8     if() {
 9       stop("A should be a square matrix")
10     }
11
12     if() {
13       stop("A is a rank deficient matrix")
14     }
15
16     x <-
17     return(x)
18   }

Figure 3: An incomplete R function to solve a system of linear equations.

b Explain to Jerry why his regex pattern did not work (write a maximum of 3 lines). Suggest a few possible mismatches which could have occurred.

c Extend Jerry’s regex pattern to extract all 72 words that Jerry obtained.

[7 marks]

5 Write complete LaTeX code to produce the following slides with overlays in beamer. You will have to use \clubsuit (♣), \spadesuit (♠), \heartsuit (♥), and \diamondsuit (♦), which are mathematical symbols.

Figure 4: Beamer slides with overlays.

[7 marks]

6 a In Microsoft Word, styles can be paragraph styles or character styles (plus some other types of style).

What additional features of a paragraph are specified by a paragraph style, other than those specified by a character style? State as many as you can.

b Give at least three examples of where field codes can be used in a Word document.

c Describe 3 ways in which you can produce the symbol Ω in a Word document.

[6 marks]

7 Give four examples of features in RStudio which provide support for editing R code, and explain why they are useful.

[3 marks]

8 a Write LaTeX code to produce the following paragraph including the equation and reference. Equation 1 is an example of a commonly occurring format.

b Write LaTeX code to produce the following sentence:

During the period of the global financial crisis, 2008–2010, the change in the Dow-Jones index was around −2000 points—a change of some 20%.

c Write LaTeX commands to produce Table 1 including the caption.

Table 1: Table Example

d Write BibTeX entries to be included in a .bib file to produce the following bibliography:

References

D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67:1–48, 2015.

B. Manly. Stage-structured Populations: Sampling, Analysis and Simulation. Chapman & Hall, New York, 1990.

[16 marks]

9 Suppose you have a data frame called Titanic1 giving details of adult passengers and crew who sailed on the Titanic.

The columns in the data frame are shown in the following output:

> str(Titanic1)

'data.frame': 16 obs. of 4 variables:

$ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...

$ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...

$ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 2 ...

$ Freq    : num 118 154 387 670 4 13 89 3 57 14 ...

Write R code to produce Figure 5 using ggplot2.

[8 marks]

Figure 5: Survival on the Titanic

10 Suppose that in a .Rnw file you have created an R object using xtable called xtbl as shown below:

> class(xtbl)

[1] "xtable" "data.frame"

> str(xtbl)

Classes 'xtable' and 'data.frame': 5 obs. of 1 variable:

$ x: int 34 40 15 10 1

- attr(*, "caption")= chr "xtable example"

- attr(*, "label")= chr "tab:xtbl"

- attr(*, "align")= chr "r" "r"

- attr(*, "digits")= num 0 2

- attr(*, "display")= chr "s" "d"

What will be the effect of the following snippets of text when the .Rnw file is processed using knitr and pdfLaTeX:

a <>=

xtbl

@

b <>=

xtbl

@

NOTE: You may wish to examine the help pages for the package xtable before answering this question.

[6 marks]

11 The results of this year’s Giro d’Italia cycle tour race are in the file GiroResults.csv, which has the form shown in Figure 6.

Figure 6: Top of GiroResults.csv

Rider names are not more than 30 characters long, and team names are not more than 50 characters long. In the column headed Time, the first entry (for Chris Froome) gives the total time taken to ride the 21 stages of the tour, in hours, minutes and seconds. The other figures in that column are the additional times that the various riders took to complete the tour. So, for example, George Bennett of New Zealand took an additional 13 minutes and 17 seconds compared to Froome; that is, his total riding time was 89 hours, 15 minutes and 56 seconds.
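The time arithmetic in this example can be checked by converting each time to seconds, which is also the idea behind MySQL's TIME_TO_SEC. The sketch below is in Python. The helper name `time_to_sec` is hypothetical, and Froome's total of 89:02:39 is not quoted in the text; it is inferred here as 89:15:56 minus 00:13:17.

```python
def time_to_sec(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Assumption: Froome's total time, inferred from Bennett's total and gap.
froome = time_to_sec("89:02:39")
bennett_gap = time_to_sec("00:13:17")

total = froome + bennett_gap
print(total // 3600, (total % 3600) // 60, total % 60)  # hours, minutes, seconds
```

Dividing a difference in seconds by 60 gives the gap in minutes, the unit requested in part e.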

a Write MySQL code to create a table called giro for this data set. Do not create an automatically incremented variable as the primary key for the data. Instead specify the rider name as the primary key.

b Write MySQL code to read the data from GiroResults.csv into the table giro.

c Alter the table giro by adding a TIME variable called Difference.

d Update the column by first setting Difference equal to the Time column, and then set the first element of the Difference column (the entry for Froome) to '00:00:00'.

If this has been done correctly then the Difference column will contain all the time differences from Froome’s time.

e Write MySQL code to produce a table showing the average time difference by team, in minutes rounded to 2 decimal places, ordered from smallest to largest.

NOTE: To carry out calculations involving times, first convert times to seconds by applying the function TIME_TO_SEC.

[10 marks]




