Coursework 1 – Exploratory data analysis and correlation

MATH20811 Practical Statistics: Coursework 1

The marks awarded for this coursework constitute 30% of the total assessment for the module.

Your solution to the coursework should be a concise report (max 10 pages) and it should take, on average, about 15 hours to complete.

The submission deadline is 10am on Monday 28 October 2019.

Please note that this deadline is strict: a University-set penalty of 10% of the total marks is applied for each day late, up to a maximum of five days, after which your mark for the coursework will be zero.

Your submitted solutions should all be in one document. This must be prepared using LaTeX.

For each part of the question you should provide explanations as to how you completed what is required, show your workings and also comment on computational results, where applicable.

When you include a plot, be sure to give it a title and label the axes correctly.

When you have written or used R code to answer any of the parts, then you should list this R code after the particular written answer to which it applies. This may be the R code for a function you have written and/or code you have used to produce numerical results, plots and tables. R code should also be clearly annotated.

Avoid using screenshots of R code/output. Instead, to include R code use the verbatim environment and summarise R output in tables using the table environment, as demonstrated in the solution of Example Sheet 2.

Your file should be submitted through the module site on Blackboard to the Turnitin assessment in the Coursework folder entitled “MATH20811 CW1” by the above time and date. The work will be marked anonymously on Blackboard, so please ensure that your filename is clear but does not contain your name or student ID number. Similarly, do not include your name or ID number in the document itself.

Turnitin will generate a similarity report for your submitted document and indicate matches to other sources, including billions of internet documents (both live and archived), a subscription repository of periodicals, journals and publications, as well as submissions from other students. Please ensure that the document you upload represents your own work and is written in your own words. The Turnitin report will be available for you to see shortly after the due date.

This coursework should help to reinforce some of the methodology you have been studying, as well as the skills in R you have been developing in the module. To achieve a high mark, correct interpretation and meaningful discussion of the results (i.e. an attempt to put the results into context) are as important as correct calculation of the results.

The data in red_wine.csv and white_wine.csv (Cortez et al., 2009) contain various measurements on red and white variants of the Portuguese Vinho Verde wine. Import the data into R and save them as objects red_wine and white_wine. Each object should contain measurements on 11 continuous variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) and one discrete variable (quality).
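
The import step could be sketched as follows. This is only an illustration: load_wine is a hypothetical helper that is not part of the coursework, and the comma separator is an assumption about how the two files are stored, so adjust sep if the files use a different delimiter.

```r
# Column names expected in each file, taken from the coursework description.
wine_vars <- c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol", "quality")

# load_wine() is a hypothetical helper, not a function supplied by the module.
# The default separator is an assumption; change `sep` if the file differs.
load_wine <- function(path, sep = ",") {
  wine <- read.csv(path, sep = sep)
  # Stop early if any of the 12 expected variables is absent.
  missing_vars <- setdiff(wine_vars, names(wine))
  if (length(missing_vars) > 0) {
    stop("Missing columns: ", paste(missing_vars, collapse = ", "))
  }
  wine
}

# Usage, assuming both files are in the working directory:
# red_wine   <- load_wine("red_wine.csv")
# white_wine <- load_wine("white_wine.csv")
# str(red_wine)   # check the variable types after import
```

Checking the column names straight after import catches a wrong separator early, because a mis-parsed file typically ends up as a single column.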

1. Perform exploratory analysis of the data and report some interesting findings about the data. Some suggestions include producing summary statistics of the data, comparing the distributions of specific variables for each of the red and white variants using histograms or box-plots (as appropriate) and exploring any associations between the variables, in particular alcohol and quality. [10]
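
For part 1, a possible starting point is sketched below. The helper names compare_wines and alcohol_by_quality are hypothetical, and the code assumes red_wine and white_wine are data frames with the column names listed above and no missing values. It shows one way to produce titled, labelled box-plots together with numbers that can go into a table; it is not a complete exploratory analysis.

```r
# Hypothetical helper: side-by-side box-plots of one variable for both wines.
compare_wines <- function(red, white, var) {
  values <- list(Red = red[[var]], White = white[[var]])
  boxplot(values,
          main = paste("Distribution of", var, "by wine type"),
          xlab = "Wine type", ylab = var)
  # Five-number summary plus the mean for each wine type, one column per type.
  sapply(values, summary)
}

# Hypothetical helper: alcohol against quality, one box per quality score.
alcohol_by_quality <- function(wine, label) {
  boxplot(alcohol ~ quality, data = wine,
          main = paste("Alcohol by quality score:", label),
          xlab = "Quality score", ylab = "Alcohol")
  # Median alcohol content within each quality score.
  tapply(wine$alcohol, wine$quality, median)
}

# Usage, assuming the data have been imported:
# compare_wines(red_wine, white_wine, "residual.sugar")
# alcohol_by_quality(white_wine, "white wine")
```

Box-plots grouped by quality suit a discrete score better than a scatterplot, since many observations share the same quality value.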

2. Using the function cor, calculate both Pearson’s and Spearman’s correlation between:

• white_wine$chlorides and white_wine$alcohol

• log(white_wine$chlorides) and white_wine$alcohol

Comment on the results and give an explanation for any discrepancies between the various correlation estimates. Hint: Inspecting the scatterplots for each pair might be useful. [5]
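
For part 2, both estimates can be collected in one call, as in the sketch below. The helper cor_both is hypothetical, and the code assumes the inputs have no missing values. One fact worth keeping in mind when reading the output: Spearman's correlation depends only on the ranks of the data, so a strictly increasing transformation such as log leaves it unchanged, whereas Pearson's correlation measures linear association and generally does change.

```r
# Hypothetical helper returning both correlation estimates for one pair.
cor_both <- function(x, y) {
  c(pearson  = cor(x, y, method = "pearson"),
    spearman = cor(x, y, method = "spearman"))
}

# Usage, assuming white_wine has been imported:
# rbind(raw = cor_both(white_wine$chlorides, white_wine$alcohol),
#       log = cor_both(log(white_wine$chlorides), white_wine$alcohol))
#
# Scatterplots for the two pairs, side by side:
# par(mfrow = c(1, 2))
# plot(white_wine$chlorides, white_wine$alcohol,
#      main = "Alcohol vs chlorides", xlab = "chlorides", ylab = "alcohol")
# plot(log(white_wine$chlorides), white_wine$alcohol,
#      main = "Alcohol vs log(chlorides)", xlab = "log(chlorides)",
#      ylab = "alcohol")
```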

3. Let ρ1 be Pearson’s correlation between alcohol and density for the red wine dataset. Using the function cor.test, test the hypothesis H0 : ρ1 = 0 vs HA : ρ1 ≠ 0 and report your findings. Calculate (DIY) an approximate 95% confidence interval (CI) for ρ1 based on Fisher’s z-transform and verify your calculations agree with the CI produced by cor.test. [5]
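
For part 3, the DIY interval could be sketched as below. The function name fisher_ci is hypothetical, and the code assumes complete data (no missing values). It applies Fisher's z-transform z = atanh(r) = 0.5 log((1 + r)/(1 − r)), uses the approximate standard error 1/sqrt(n − 3) on the z scale, and transforms the end points back with tanh.

```r
# Hypothetical helper: approximate CI for a Pearson correlation via Fisher's z.
fisher_ci <- function(x, y, level = 0.95) {
  n <- length(x)
  r <- cor(x, y)                       # sample Pearson correlation
  z <- atanh(r)                        # Fisher's z-transform of r
  se <- 1 / sqrt(n - 3)                # approximate standard error of z
  crit <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  tanh(z + c(-1, 1) * crit * se)       # back-transform to the correlation scale
}

# Usage, assuming red_wine has been imported:
# fisher_ci(red_wine$alcohol, red_wine$density)
# cor.test(red_wine$alcohol, red_wine$density)$conf.int   # for comparison
```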

4. Perform (DIY) a hypothesis test for H0 : ρ1 = −0.5 vs HA : ρ1 > −0.5 at 2.5% significance level, using Fisher’s z-transform. Compute the p-value and use it to decide whether to reject the null hypothesis in favour of the alternative. [5]
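
For part 4, one way to organise the calculation is sketched below. The function name fisher_test_greater is hypothetical. Under H0 the statistic (atanh(r) − atanh(ρ0)) sqrt(n − 3) is approximately standard Normal, and because the alternative is one-sided (greater), the p-value is the upper-tail probability.

```r
# Hypothetical helper: one-sided test of H0: rho = rho0 vs HA: rho > rho0.
fisher_test_greater <- function(r, n, rho0) {
  # Standardised difference on Fisher's z scale; approximately N(0, 1) under H0.
  z_stat <- (atanh(r) - atanh(rho0)) * sqrt(n - 3)
  # Upper-tail probability, since large values of z_stat support HA.
  p_value <- pnorm(z_stat, lower.tail = FALSE)
  c(statistic = z_stat, p.value = p_value)
}

# Usage, assuming red_wine has been imported:
# r1  <- cor(red_wine$alcohol, red_wine$density)
# res <- fisher_test_greater(r1, nrow(red_wine), rho0 = -0.5)
# res["p.value"] < 0.025   # TRUE means H0 is rejected at the 2.5% level
```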

5. Write a function in R to verify via simulation that the distribution of the Fisher’s z-transform statistic is approximately Normal. Your function should output a plot comparing the sampling distribution of Fisher’s z-transform statistic and the appropriate Normal distribution the statistic has under the null hypothesis. In your simulation, you may assume the data pairs (x, y) come from independent Normal distributions and that the test statistic corresponds to a test of zero correlation. [5]
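
For part 5, a minimal simulation sketch is given below. The function name simulate_fisher_z and its defaults (sample size 30, 5000 replications) are arbitrary choices made for illustration. Each replication draws independent standard Normal x and y, so the true correlation is zero, and the scaled statistic atanh(r) sqrt(n − 3) is compared with the N(0, 1) density.

```r
# Hypothetical helper: simulate the Fisher z statistic under zero correlation.
simulate_fisher_z <- function(n = 30, n_sim = 5000) {
  z_stats <- replicate(n_sim, {
    x <- rnorm(n)                    # x and y are drawn independently,
    y <- rnorm(n)                    # so the true correlation is 0
    atanh(cor(x, y)) * sqrt(n - 3)   # scaled Fisher z statistic
  })
  # Histogram on the density scale with the N(0, 1) density overlaid.
  hist(z_stats, breaks = 40, freq = FALSE,
       main = "Simulated Fisher z statistic vs N(0, 1)",
       xlab = "z statistic", ylab = "Density")
  curve(dnorm(x), add = TRUE, lwd = 2)
  invisible(z_stats)                 # return the draws for further checks
}

# Usage:
# set.seed(1)
# z <- simulate_fisher_z(n = 30, n_sim = 5000)
# qqnorm(z); qqline(z)   # an additional visual check of Normality
```

Returning the simulated values invisibly keeps the plot as the main output while still allowing numerical checks, such as comparing the sample mean and standard deviation with 0 and 1.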

References

[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.