# MATH42715 Tutoring, R Programming Tutoring

MATH42715: Introduction to Statistics for Data Science

Assignment 2

1 Submission Information

Key information:

Submission deadline: Midday on Friday 9th December 2022.

Submission format: via Gradescope as a single electronic file in PDF format.

Submission link: https://www.gradescope.com/courses/453240/assignments/2364944

Please note that:

Help on the statistical content will only be given until 6pm on Wednesday 7th December 2022. After

this time, only help with online submission will be available. For queries about online submission after

6pm on Wednesday 7th December 2022, please email Dr Tahani Coolen-Maturi.

The report should not exceed 8 pages. You are advised to include an Appendix, which does not count

towards the page limit, detailing enough R code to allow the reader to reproduce your analysis. You

may also like to use the Appendix to include supplementary tabular and graphical output.

2 Assignment Brief

The assignment is worth 60% of the overall mark for the module. Your work should be presented as a coherent

report, giving consideration to the tasks and marking scheme detailed in Sections 2.2.1 to 2.2.4 below. You

do not need to comprehensively describe everything you have done to explore and model the data. However,

you should provide a narrative which details and justifies the salient features of your approach, in addition to

reporting and interpreting your results in the context of the scientific problem presented in Section 2.1. There

will also be marks for the academic writing, structuring and presentation of this report; see Section 2.2.5

below.

2.1 Data

In this assignment, you will analyse the BreastCancer data set which concerns characteristics of breast tissue

samples collected from 699 women in Wisconsin using fine needle aspiration cytology (FNAC). This is a

type of biopsy procedure in which a thin needle is inserted into an area of abnormal-appearing breast tissue.

Nine easily assessed cytological characteristics, such as uniformity of cell size and shape, were measured for

each tissue sample on a one to ten scale. Smaller numbers indicate cells that looked healthier in terms of

that characteristic. Further histological examination established whether each of the samples was benign or

malignant. The objective of the clinical experiment was to determine the extent to which a tissue sample

could be classified as benign or malignant using only the nine cytological characteristics.

For the purposes of this assignment, you may assume that the patients can be regarded as a random sample

from the population of women experiencing symptoms of breast cancer.

The data set is part of the mlbench package. The package can be installed by typing the following into the console:

install.packages("mlbench")

It can then be loaded into R and inspected as follows:

## Load mlbench package

library(mlbench)

## Load the data

data(BreastCancer)

## Check size

dim(BreastCancer)

## [1] 699 11

## Print first few rows

head(BreastCancer)

## Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size

## 1 1000025 5 1 1 1 2

## 2 1002945 5 4 4 5 7

## 3 1015425 3 1 1 1 2

## 4 1016277 6 8 8 1 3

## 5 1017023 4 1 1 3 2

## 6 1017122 8 10 10 8 7

## Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class

## 1 1 3 1 1 benign

## 2 10 3 2 1 benign

## 3 2 3 1 1 benign

## 4 4 3 7 1 benign

## 5 1 3 1 1 benign

## 6 10 9 7 1 malignant

More information on the variables can be found by typing ?BreastCancer in the console.

2.2 Tasks and Marking Scheme

Your ultimate goal is to build a classifier for the Class – benign or malignant – of a tissue sample based on

(at least some of) the nine cytological characteristics. It should be stressed that this is a real data set and

there is no “correct” answer. The sections below indicate the components your report should include and the

number of marks attributed to each.

2.2.1 Cleaning the Data (10 marks)

Before starting any analysis, you should clean the data:

Technically, the nine cytological characteristics are ordinal variables on a 1 – 10 scale. In the

BreastCancer data, they are encoded as factors. For the purposes of this assignment, we will treat

them as quantitative variables. You should carefully convert the factors to quantitative variables.

This data set contains some missing observations on predictors, encoded as NA. For the purposes of this

assignment, you should remove all of the rows where there are missing values before carrying out any

further analysis. To do this, you may find the is.na function helpful. For instance:

## Print 24th row of BreastCancer data and note there is an NA in the

## Bare.nuclei column:

BreastCancer[24,]

## Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size

## 24 1057013 8 4 5 1 2

## Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class

## 24 <NA> 7 3 1 malignant

## Test whether each element on the 24th row is an NA:

is.na(BreastCancer[24,])

## Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size

## 24 FALSE FALSE FALSE FALSE FALSE FALSE

## Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class

## 24 TRUE FALSE FALSE FALSE FALSE

Remember to provide a concise summary of how you cleaned the data.
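The two cleaning steps described above could be sketched as follows. This is only one possible approach, not a prescribed solution. It assumes that columns 2 to 10 hold the nine characteristics, as the head output in Section 2.1 suggests; the names bc and score_cols are introduced here purely for illustration.

```r
library(mlbench)
data(BreastCancer)

# Assumption: columns 2 to 10 are the nine cytological characteristics
# (column 1 is Id and column 11 is Class).
score_cols <- 2:10

bc <- BreastCancer

# Go through as.character() so that we recover the printed labels
# ("1", ..., "10") rather than the factors' internal integer codes.
# The two can differ whenever a level is absent from a column.
bc[score_cols] <- lapply(bc[score_cols],
                         function(x) as.numeric(as.character(x)))

# Keep only the rows that contain no missing value.
bc <- bc[complete.cases(bc), ]

dim(bc)
sapply(bc[score_cols], class)
```

The row filter can equally be written with is.na, for example bc[rowSums(is.na(bc)) == 0, ], which keeps exactly the same rows.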

2.2.2 Exploratory Data Analysis (20 marks)

Consider some exploratory data analysis. For example, how might you summarise the data graphically and

numerically? What does this tell you about the relationships between the response variable and predictor

variables and about the relationships between predictor variables? Remember to set your discussion in the

context of the scientific problem presented in Section 2.1.
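As a starting point for the questions above, the sketch below computes a few numerical summaries and one plot. It is a minimal illustration rather than a complete exploratory analysis; the object names (bc, class_counts, class_means, pred_cor) and the choice of summaries are assumptions made here, and the data are cleaned inline so that the block runs on its own.

```r
library(mlbench)
data(BreastCancer)

# Minimal cleaning so the block is self-contained:
# drop incomplete rows, then convert the nine factors to numbers.
bc <- BreastCancer[complete.cases(BreastCancer), ]
bc[2:10] <- lapply(bc[2:10], function(x) as.numeric(as.character(x)))

# How many benign and malignant samples are there?
class_counts <- table(bc$Class)

# Mean of each characteristic within each class
# (relationship between response and predictors).
class_means <- aggregate(bc[2:10], by = list(Class = bc$Class), FUN = mean)

# Pairwise correlations (relationships between predictors).
pred_cor <- cor(bc[2:10])

print(class_counts)
print(class_means, digits = 3)
print(round(pred_cor, 2))

# One graphical summary: distribution of a single characteristic by class.
boxplot(Cl.thickness ~ Class, data = bc, ylab = "Cl.thickness")
```

Because the scores only take the values 1 to 10, scatterplots tend to overplot heavily; boxplots or bar charts per class are often easier to read for data of this kind.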

2.2.3 Modelling (35 marks)

You should build classifiers using:

Logistic regression, with best subset selection;

The Bayes classifier for linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA) or

both.

Remember to provide an overview of each modelling technique and, where appropriate, a consideration of

any modelling assumptions you are making. For your selected logistic regression model, you should present

the coefficients of the fitted model, and any other useful summaries. For LDA or QDA present estimates of

the group means. In each case, discuss what your results show. For example, what do the parameters tell

you about the relationships between the response and predictor variables?
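One way the two classifiers could be fitted is sketched below. Several choices here are assumptions rather than requirements of the brief: the best subset search is done by brute force over all 511 non-empty predictor subsets, AIC is used as the selection criterion (BIC or a cross-validated error would be equally defensible), and LDA is fitted with lda from the MASS package using all nine predictors. The helper fit_subset and the other object names are hypothetical.

```r
library(mlbench)
library(MASS)   # provides lda()
data(BreastCancer)

# Minimal cleaning so the block is self-contained.
bc <- BreastCancer[complete.cases(BreastCancer), ]
bc[2:10] <- lapply(bc[2:10], function(x) as.numeric(as.character(x)))
predictors <- names(bc)[2:10]

## Logistic regression with best subset selection (criterion: AIC, an assumption)

# List every non-empty subset of the nine predictors: 2^9 - 1 = 511 subsets.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit a logistic regression of Class on the given predictors.
# With levels (benign, malignant), glm models P(Class = malignant).
fit_subset <- function(vars) {
  glm(reformulate(vars, response = "Class"), data = bc, family = binomial)
}

# Score every subset and keep the one with the smallest AIC.
aic_values <- vapply(subsets, function(v) AIC(fit_subset(v)), numeric(1))
best_vars <- subsets[[which.min(aic_values)]]
best_fit <- fit_subset(best_vars)

best_vars
summary(best_fit)$coefficients

## Linear discriminant analysis on all nine predictors

lda_fit <- lda(reformulate(predictors, response = "Class"), data = bc)

# Estimated group means, one row per class.
lda_fit$means
```

Some of the 511 fits may produce warnings about fitted probabilities very close to 0 or 1; these are worth noting in the report rather than silencing. Replacing lda with qda from the same package gives the quadratic version.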

2.2.4 Model Comparison (15 marks)

Compare the performance of your models using cross-validation based on the test error. Think about how

you might do this in a way that makes the comparison fair. Remember to provide an overview of the main

statistical ideas underpinning your model comparison.
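A fair comparison usually means evaluating every model on exactly the same folds. The sketch below illustrates this with 10-fold cross-validation of the misclassification rate. It is a simplified illustration under stated assumptions: the number of folds, the seed, the 0.5 probability threshold, and the use of all nine predictors in the logistic regression are choices made here for brevity. If a subset is selected from the data, the selection step should ideally be repeated inside each fold so that the test error is not optimistic.

```r
library(mlbench)
library(MASS)   # provides lda()
data(BreastCancer)

# Minimal cleaning so the block is self-contained.
bc <- BreastCancer[complete.cases(BreastCancer), ]
bc[2:10] <- lapply(bc[2:10], function(x) as.numeric(as.character(x)))
form <- reformulate(names(bc)[2:10], response = "Class")

set.seed(1)   # arbitrary seed, only for reproducibility
k <- 10       # number of folds (an assumption)

# Assign each row to one fold; both models share this assignment.
fold <- sample(rep(seq_len(k), length.out = nrow(bc)))

errors <- matrix(NA_real_, nrow = k, ncol = 2,
                 dimnames = list(NULL, c("logistic", "lda")))

for (i in seq_len(k)) {
  train <- bc[fold != i, ]
  test  <- bc[fold == i, ]

  # Logistic regression: classify as malignant when P(malignant) > 0.5.
  glm_fit  <- glm(form, data = train, family = binomial)
  p_hat    <- predict(glm_fit, newdata = test, type = "response")
  glm_pred <- ifelse(p_hat > 0.5, "malignant", "benign")

  # LDA: use the predicted class directly.
  lda_fit  <- lda(form, data = train)
  lda_pred <- as.character(predict(lda_fit, newdata = test)$class)

  # Proportion of misclassified test samples in this fold.
  errors[i, "logistic"] <- mean(glm_pred != test$Class)
  errors[i, "lda"]      <- mean(lda_pred != test$Class)
}

# Overall test error: fold errors weighted by fold size.
fold_sizes <- as.vector(table(fold))
cv_error <- colSums(errors * fold_sizes) / nrow(bc)
cv_error
```

Keeping the fold assignment in a single vector makes it straightforward to add further models, such as QDA, to the same loop.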

2.2.5 Report Writing and Presentation (20 marks)

You should present your work in the form of a report. This should be well structured, written and presented.

It should have sections, beginning with an introduction and ending with conclusions. Any graphs and tables

should be appropriately displayed and captioned.

As remarked in Section 1, the report should not exceed eight pages and you are advised to include an

Appendix, which does not count towards the page limit, detailing enough R code to allow the reader to

reproduce your analysis. You may also like to use the Appendix to include supplementary tabular and

graphical output which should be appropriately displayed and captioned.
