DATA3888 (2024): Assignment 1
Instructions
1. Your assignment submission needs to be an HTML document that you have compiled using R Markdown or Quarto. Name your file “SIDXXX_Assignment.html”, where XXX is your Student ID.
2. Under author, put your Student ID at the top of the Rmd file (NOT your name).
3. For your assignment, please use set.seed(3888) at the start of each chunk (where required).
4. Do not upload the code file (i.e. the Rmd or qmd file).
5. You must use code folding so that the marker can inspect your code where required.
6. Your assignment should make sense and provide all the relevant information in the text when the code is hidden. Don’t rely on the marker to understand your code.
7. Any output that you include needs to be explained in the text of the document. If your code chunk generates unnecessary output, please suppress it by specifying chunk options like message = FALSE.
8. Start each of the 3 questions in a separate section. The parts of each question should be in the same section.
9. You may be penalised for excessive or poorly formatted output.
Question 1: Reef
Between 2014 and 2017, marine scientists recorded an unprecedented global coral bleaching event. Your friend Farhan is a marine science expert who wants to study the environmental variables that may have triggered this event. To do this, we will use a public dataset, curated by Sally and colleagues. This dataset records coral bleaching events at 3351 locations in 81 countries from 1998 to 2017 with a suite of environmental and temperature metrics. The data is in the file Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv and the full description of the variables can be found in the supplementary table of the study.
Part (a)
Farhan has noticed that, on average, the North of Australia experienced higher levels of coral bleaching than the South during the global bleaching event from 2014-2017. In the paper, the authors find that the following variables are associated with the probability of coral bleaching.
• TSA_Frequency_Standard_Deviation
• Temperature_Mean
• TSA_Frequency
• Temperature_Kelvin_Standard_Deviation
• TSA_DHW_Standard_Deviation
• SSTA_Frequency_Standard_Deviation
Create one informative graphic to visualise how these six variables are different between the North and South of Australia during the 2014-2017 global coral bleaching event. Explain any data filtering or transformation that you perform. Comment on the visualisation and suggest at least one variable that appears to be different between the North and the South and thus may be associated with the higher levels of bleaching observed in the North.
Note: the midpoint of Australia is located at -23 degrees latitude. Observations higher than -23 degrees latitude are considered North Australia. Your graphic can have multiple panels.
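The filtering and North/South split described in the note can be sketched in base R as below. This is only an illustration on a toy data frame: the column names Latitude and Year are assumptions that should be checked against the real CSV header, only two of the six variables are shown, and the restriction to Australian observations is omitted.

```r
# Toy stand-in for the reef data. Latitude and Year are assumed column
# names, and the values are made up for illustration.
reef <- data.frame(
  Latitude = c(-10.5, -18.2, -27.4, -33.9, -16.0),
  Year = c(2015, 2016, 2014, 2017, 2010),
  Temperature_Mean = c(301.2, 300.4, 296.1, 294.8, 300.9),
  TSA_Frequency = c(3, 2, 0, 1, 1)
)

# Keep only observations from the 2014-2017 bleaching event.
event <- reef[reef$Year >= 2014 & reef$Year <= 2017, ]

# Latitudes higher than -23 degrees count as North Australia.
event$Region <- ifelse(event$Latitude > -23, "North", "South")

# Stack the variables into long format (one row per observation and
# variable), which suits a multi-panel plot with one panel per variable.
vars <- c("Temperature_Mean", "TSA_Frequency")
long <- data.frame(
  Region = rep(event$Region, times = length(vars)),
  Variable = rep(vars, each = nrow(event)),
  Value = unlist(event[vars], use.names = FALSE)
)
```

A long table like this can then be passed to a plotting function that draws one panel per value of Variable, with Region on the horizontal axis.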
Part (b)
Farhan is interested in exploring which reefs were the most affected by the 2014-2017 global bleaching event, across the globe. Create an interactive map visualisation to show the average proportion of coral bleaching between 2014-2017, that allows a marine scientist to identify the names of the most affected coral reefs, the region (recorded as State.Province.Island) and the values of the measurements of the associated environmental variables identified in part (a). Justify your choice of visualisation, and comment on the result. List 4 regions that were severely bleached in this time period.
Part (c)
Farhan wants to explore the impact of environmental variables on coral bleaching in the most affected regions. For the regions identified in part (b), create one informative visualisation to show how the average bleaching has changed over time (not restricted to 2014-2017), and its relationship with one of the associated environmental variables identified in part (a). Comment on the visualisation.
Note: your graphic can have multiple panels.
Question 2: Kidney
Your friend Harry is a nephrologist (kidney specialist) who is interested in building an accurate classifier to detect graft rejection in his kidney transplant patients. He is also interested in knowing which genes may be affecting graft rejection. In this problem, we will build a classification model using the public data set GSE138043. We will perform feature selection and build a classifier, estimating its accuracy on unseen data.
Part (a)
Harry wants to know the most differentially expressed genes between patients that experience graft rejection and stable patients. Use the topTable function in the limma package to output the gene symbols of the 10 most differentially expressed genes.
Hint: in the GSE138043 dataset, the outcome is found in the characteristics_ch1 column of the phenoData and the gene symbols are found in the gene_assignment column of the featureData, between the first and second // symbols.
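As a rough sketch of the string handling in the hint, the text between the first and second // can be pulled out with base R. The example strings and the helper name extract_symbol are hypothetical; real gene_assignment entries may differ in detail.

```r
# Hypothetical gene_assignment entries in an
# "accession // symbol // description // ..." layout.
gene_assignment <- c(
  "NM_001005484 // OR4F5 // olfactory receptor family 4 // 1p36.33 // 79501",
  "---",
  "NR_046018 // DDX11L1 // DEAD/H-box helicase 11 like 1 // 1p36.33 // 100287102"
)

# Return the field between the first and second "//", or NA when the
# entry has no "//" separator at all.
extract_symbol <- function(x) {
  parts <- strsplit(x, "//", fixed = TRUE)
  vapply(
    parts,
    function(p) if (length(p) >= 2) trimws(p[2]) else NA_character_,
    character(1)
  )
}

symbols <- extract_symbol(gene_assignment)
```

Entries without a symbol come back as NA, so they can be dropped or labelled before reporting the top genes.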
Part (b)
Harry wants to build a random forest classifier to predict whether a patient is stable or experiencing graft rejection and estimate its accuracy on unseen data. To do this, Harry tries to perform repeated cross-validation on the entire data set, but it takes too long to run. To speed up the model training, Harry knows he can implement feature selection in one of 3 parts of the framework in the Question 2 Part (b) appendix (OPTION A, OPTION B, or OPTION C); however, he is not sure which one.
Explain the difference (if any) between the 3 options and which option(s) would be the most appropriate for Harry’s task.
Part (c)
Harry wants to implement feature selection in the most appropriate option of Part (b), but he’s not sure how many features he should select. Use the framework from part (b) to evaluate the performance of a random forest classifier on unseen data with feature selection taking the top 10, 50 or 100 genes. Visualise your results and comment on them. How many features would you recommend Harry to use?
Hint: if implemented correctly, this code should take no more than a few minutes to run.
Part (d)
Using the optimal number of features found in part (c), build a random forest classifier on the entire training data set, that Harry could implement on future data. Harry wants to know which genes are the most important in making the classification. Output the gene symbols of the top 10 genes in terms of importance in the random forest classifier. Comment on the overlap between the top 10 important genes in the classifier and the top 10 differentially expressed genes (if your final model only uses 10 genes, comment on the concordance in ranking of the 10 genes).
Hint: in a random forest model fit, the feature importance can be obtained by fit$importance, where a higher value indicates higher importance in the classifier.
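The ranking and overlap check in this part can be sketched on a toy matrix shaped like fit$importance. The gene names, the values and the list of differentially expressed genes are all made up, and the single MeanDecreaseGini column assumes a classification forest fitted with default settings.

```r
# Toy matrix shaped like fit$importance from a classification forest.
# Gene names and values are hypothetical.
importance <- matrix(
  c(0.8, 2.5, 1.1, 0.2, 3.0),
  ncol = 1,
  dimnames = list(
    c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
    "MeanDecreaseGini"
  )
)

# Rank genes from most to least important and keep the top few.
top_n <- 3
ranked <- sort(importance[, "MeanDecreaseGini"], decreasing = TRUE)
top_rf <- names(head(ranked, top_n))

# Hypothetical list of top differentially expressed genes.
top_de <- c("GENE_E", "GENE_C", "GENE_F")

# Genes that appear in both lists.
overlap <- intersect(top_rf, top_de)
```

With top_n set to 10 and real gene symbols as row names, the same steps give the overlap asked for above.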
Question 2 Part (b) appendix
library(Biobase)       # exprs(), pData()
library(cvTools)       # cvFolds()
library(randomForest)  # randomForest()

set.seed(3888)
X = t(exprs(gse))
y = ifelse(grepl("non-AR", pData(gse)$characteristics_ch1), "Stable", "Rejection")
cvK = 5
n_sim = 50
cv_accuracy_gse1b = numeric(n_sim)
### OPTION A ###
for (i in 1:n_sim) {
  cvSets = cvFolds(nrow(X), cvK)
  cv_accuracy_folds = numeric(cvK)
  ### OPTION B ###
  for (j in 1:cvK) {
    test_id = cvSets$subsets[cvSets$which == j]
    X_train = X[-test_id,]
    X_test = X[test_id,]
    y_train = y[-test_id]
    y_test = y[test_id]
    ### OPTION C ###
    rf_fit = randomForest(x = X_train, y = as.factor(y_train))
    predictions = predict(rf_fit, X_test)
    cv_accuracy_folds[j] = mean(y_test == predictions)
  }
  cv_accuracy_gse1b[i] = mean(cv_accuracy_folds)
}
Question 3: Brain
Your friend Shila is a physicist who needs your help in building a classifier to detect left and right eye movements from brain EEG signals in real time. She has a data set stored under zoe_spiker.zip that contains brain signal series (each series is a file), which correspond to sequences of eye movements of varying lengths.
The file name corresponds to the true eye movement. For example the file LRL_z.wav corresponds to left-right-left eye movements; the file LLRLRLRL_z.wav corresponds to left-left-right-left-right-left-right-left eye movements. There are a total of 31 files.
The folder also contains two RDS files which may be used to train an event detection classifier (training_data.rds, training_labels.rds)
Part (a)
The first stage of our classifier is to identify events (eye movement). Shila has provided some training data (training_data.rds) which corresponds to waves, and labels (training_labels.rds) where TRUE represents the presence of an event and FALSE represents no event. Use the tsfeatures package to calculate some autocorrelation features and build a random forest classifier to detect events.
Report and comment on the accuracy of this model.
Hint: use tsfeatures(training_data, c("acf_features")) to compute the autocorrelation features from training_data. In a random forest model fit, the confusion matrix of out-of-bag predictions can be obtained by fit$confusion. In a random forest classifier, the out-of-bag predictions can be treated as the predictions on an independent data set.
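To show how an accuracy figure can be read off fit$confusion, here is a small sketch on a hand-made matrix with the same layout: rows are true classes, columns are predicted classes, plus a class.error column. The counts are invented for illustration.

```r
# Toy matrix laid out like fit$confusion for a TRUE/FALSE outcome.
# Rows are true classes, columns are out-of-bag predicted classes.
confusion <- matrix(
  c(40, 5,      # predicted FALSE
    10, 45,     # predicted TRUE
    0.2, 0.1),  # class.error per true class
  nrow = 2,
  dimnames = list(c("FALSE", "TRUE"), c("FALSE", "TRUE", "class.error"))
)

# Drop the class.error column before summing counts.
counts <- confusion[, c("FALSE", "TRUE")]

# Overall out-of-bag accuracy: correct predictions over all predictions.
oob_accuracy <- sum(diag(counts)) / sum(counts)

# Per-class accuracy is one minus the reported class error.
per_class_accuracy <- 1 - confusion[, "class.error"]
```

Reporting the per-class figures alongside the overall accuracy helps when events and non-events are unevenly represented.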
Part (b)
Build a classification rule for detecting {L,R} under a streaming condition, using the trained Random Forest model from part (a) in a window to identify events, and using the min-max rule to classify events into “Left” or “Right” (Lab 3 Exercise 2.3). Demonstrate your classifier on a length-3, a length-8 and a long wave file (note that the result should be reasonable, but doesn’t have to be good). You may use the code template in the Question 3 Part (b) appendix.
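The min-max rule itself is defined in Lab 3 Exercise 2.3, which is not reproduced here. The sketch below shows one plausible form only: it assumes the rule compares the positions of the maximum and minimum within the window, and that a peak before the trough means "L". Both the helper name classify_lr and that direction are assumptions to be checked against the lab.

```r
# One possible min-max rule. ASSUMPTION: a window whose maximum comes
# before its minimum is a left movement; the lab may use the opposite.
classify_lr <- function(window) {
  if (which.max(window) < which.min(window)) "L" else "R"
}

peak_first <- classify_lr(c(0, 5, 1, -4, 0))    # maximum at index 2, minimum at 4
trough_first <- classify_lr(c(0, -4, 1, 5, 0))  # minimum at index 2, maximum at 4
```

If the lab defines the opposite orientation, swapping the two labels is the only change needed.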
Part (c)
Shila thinks multiple window sizes must be evaluated to find the best Random Forest streaming classifier.
Compare the performance of the Random Forest streaming classifier for detecting {L,R} under a streaming condition, using multiple window sizes. Use the short wave files to evaluate performance. Which window size gives the best performance? Justify your answer with appropriate visualisations.
Hint: you may use the Levenshtein similarity metric to evaluate the accuracy of your predictions. This can be computed via stringdist::stringsim, with method set to "lv".
The increment of your window should always be 1/3 of the window size.
increment = window_size/3
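To make the Levenshtein similarity from the hint concrete, the sketch below computes it with base R's utils::adist instead of the stringdist package, as one minus the edit distance divided by the longer string's length. The helper name lev_similarity is hypothetical, and the results should agree with stringdist::stringsim(..., method = "lv") for single strings, though that is worth confirming on a few cases.

```r
# Levenshtein similarity for two single strings: 1 means identical,
# 0 means every character must be edited.
lev_similarity <- function(predicted, truth) {
  distance <- as.numeric(utils::adist(predicted, truth))
  longest <- max(nchar(predicted), nchar(truth))
  if (longest == 0) {
    return(1)  # two empty strings are treated as identical
  }
  1 - distance / longest
}

sim_short <- lev_similarity("LRR", "LRL")            # one substitution out of 3
sim_long <- lev_similarity("LLRLRL", "LLRLRLRL")     # two missing out of 8
```

Averaging this score over the short wave files for each window size gives one number per window size to compare.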
Part (d)
Shila’s friend Jean thinks a zero-crossing classification rule will perform just as well as the Random Forest classifier.
Build a classification rule for detecting {L,R} under a streaming condition, using the number of zero-crossings in a window to identify events (from Lab 3 Exercise 1.3), and using the min-max rule to classify events into “Left” or “Right” (Lab 3 Exercise 2.3). You may use any window size that gives reasonable performance.
Jean also thinks multiple thresholds must be evaluated to find the best zero-crossings classification rule.
Compare the performance of the zero-crossings classification rule using multiple thresholds on the short wave files. Which threshold gives the best performance? Justify your answer with appropriate visualisations.
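As a sketch of the quantity being thresholded, the function below counts sign changes between consecutive samples in a window. The helper name count_zero_crossings is hypothetical and the exact definition in Lab 3 Exercise 1.3 may differ; whether an event corresponds to a count below or above the threshold is also left to the lab's definition.

```r
# Count zero-crossings as the number of consecutive sample pairs with
# opposite signs. Samples that are exactly zero are not counted.
count_zero_crossings <- function(window) {
  sum(window[-1] * window[-length(window)] < 0)
}

noisy <- count_zero_crossings(c(1, -1, 2, -2, 3))  # sign flips at every step
smooth <- count_zero_crossings(c(1, 2, 3, 2, 1))   # never changes sign
```

Comparing this count against each candidate threshold, window by window, gives the event flags whose downstream accuracy can then be plotted per threshold.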
Part (e)
For each of the best models that you found in part (c) and part (d), evaluate its performance on sequences of varying lengths. Does the length of the sequence have an impact on the classification accuracy? Which classifier performs the best on this data set, and why might you choose one over the other? Justify your answer with appropriate visualisations.
Question 3 Part (b) appendix
ts_features_classifier = function(wave_file,
                                  window_multiplier = 1) {
  window_size = wave_file@samp.rate*window_multiplier
  increment = window_size/3
  Y = wave_file@left
  xtime = seq_len(length(Y))/wave_file@samp.rate
  predicted_labels = c()
  window_lb = 1
  max_time = length(Y)
  while(max_time > window_lb + window_size) {
    window_ub = window_lb + window_size
    window = Y[window_lb:window_ub]
    event = <event detection>
    if (event) {
      predicted = <LR prediction>
      predicted_labels = c(predicted_labels, predicted)
      window_lb = window_lb + window_size
    } else {
      window_lb = window_lb + increment
    }
  }
  return(paste(predicted_labels, collapse = ""))
}