
Lab 4

CSCI 360

You can use sklearn and Pandas or any other Python package for this lab.

1. Gene expression cancer RNA-Seq dataset

This collection of data is part of the RNA-Seq (HiSeq) PANCAN data set. It is a random extraction of 20531 gene expressions from 801 patients having different types of tumor: BRCA (Breast or Ovarian), KIRC (Kidney renal clear cell carcinoma), COAD (Colon adenocarcinoma), LUAD (Lung adenocarcinoma), and PRAD (Prostate adenocarcinoma). This is a very high-dimensional dataset that calls for a method such as Naïve Bayes, which handles high-dimensional data very well. Remember that, because of the curse of dimensionality, it is impossible to find an accurate joint distribution of 20531 features in five classes with only a few hundred observations, while finding the marginal distributions is straightforward.

The goal of this lab is to classify tumors based on their gene expressions.

(a) Load the files data.csv and labels.csv from the Dropbox folder. They contain the data set from: https://archive.ics.uci.edu/dataset/401/gene+expression+cancer+rna+seq. data.csv contains the genetic features for each tumor, and labels.csv contains the label of each tumor. (5 pts)
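A minimal sketch of part (a), assuming the two CSV files sit in the working directory and that the first column of each file is the sample identifier (as in the UCI copy of the data set):

```python
# Load the gene-expression matrix and the tumor labels with pandas.
import pandas as pd

data = pd.read_csv("data.csv", index_col=0)      # 801 rows x 20531 gene features
labels = pd.read_csv("labels.csv", index_col=0)  # 801 rows, one label column

print(data.shape, labels.shape)
print(labels.iloc[:, 0].value_counts())          # class counts for a quick sanity check
```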

(b) Exploratory data analysis:

i. Select the first 640 instances as the training set and the rest of the data as the test set. (5 pts)

ii. Encode the classes as follows: BRCA = 0, KIRC = 1, COAD = 2, LUAD = 3, and PRAD = 4. You can use an OrdinalEncoder. (A sketch combining steps i and ii follows this list.) (5 pts)
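A sketch of part (b), assuming the `data` and `labels` DataFrames from part (a). Note that sklearn's OrdinalEncoder orders categories alphabetically by default, so the mapping given above is passed explicitly:

```python
# Hold out the first 640 rows for training and encode the class labels.
from sklearn.preprocessing import OrdinalEncoder

X_train, X_test = data.iloc[:640], data.iloc[640:]

encoder = OrdinalEncoder(categories=[["BRCA", "KIRC", "COAD", "LUAD", "PRAD"]])
y_all = encoder.fit_transform(labels.values).ravel()   # BRCA=0, ..., PRAD=4
y_train, y_test = y_all[:640], y_all[640:]
```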

(c) Classification using Gaussian Naïve Bayes

Because all the features are continuous, one can fit normal/Gaussian marginal pdfs to each of them in each class.

i. Use sklearn's Gaussian Naïve Bayes method to build a classifier based on the training data. Report the training misclassification error rate (the percentage of training data that are misclassified). (20 pts)

ii. Use sklearn's Gaussian Naïve Bayes method to classify the test data, using the model you developed in 1(c)i. Report the test misclassification error rate (the percentage of test data that are misclassified). (A sketch covering both steps follows.) (20 pts)
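A sketch of part (c), assuming the split and encoded labels from part (b):

```python
# Fit GaussianNB on the training split and report misclassification error
# (1 - accuracy) on both the training and the test split.
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)

train_err = 1 - gnb.score(X_train, y_train)
test_err = 1 - gnb.score(X_test, y_test)
print(f"GaussianNB  train error: {train_err:.4f}  test error: {test_err:.4f}")
```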

(d) Classification using Bernoulli Naïve Bayes

i. Calculate the median of each of the gene features in the training set. Binarize the features in the training set: any feature value greater than or equal to the median of that feature must be converted to 1, and any feature value less than the median must be converted to 0. Binarize the features in the test set using the medians you found for the features in the training set: any feature value in the test set that is greater than or equal to the median of the corresponding feature in the training set must be converted to 1, and any feature value less than the median of the corresponding feature in the training set must be converted to 0. (5 pts)

ii. Use sklearn's Bernoulli Naïve Bayes method with Laplace smoothing to build a classifier based on the binarized training data. Report the training misclassification error rate (the percentage of training data that are misclassified). (20 pts)

iii. Use sklearn's Bernoulli Naïve Bayes method to classify the test data, using the model you developed in 1(d)ii. Report the test misclassification error rate (the percentage of test data that are misclassified). (A sketch covering parts i-iii follows this list.) (20 pts)
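A sketch of part (d), assuming the DataFrames and labels from part (b). In sklearn, `alpha=1.0` corresponds to Laplace smoothing (and is the default):

```python
# Binarize both splits with the training-set medians (>= median -> 1, < median -> 0),
# then fit BernoulliNB with Laplace smoothing.
from sklearn.naive_bayes import BernoulliNB

medians = X_train.median(axis=0)              # one median per gene feature
Xb_train = (X_train >= medians).astype(int)
Xb_test = (X_test >= medians).astype(int)     # reuse the *training* medians

bnb = BernoulliNB(alpha=1.0)
bnb.fit(Xb_train, y_train)
print("BernoulliNB train error:", 1 - bnb.score(Xb_train, y_train))
print("BernoulliNB test error: ", 1 - bnb.score(Xb_test, y_test))
```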

2. Extra Credit (Categorical Naïve Bayes) (20 pts)

(a) Create 10 equally spaced bins between the minimum and maximum of each feature in the training set and convert the features to categorical training data using those bins. Convert the test data into categorical features using the bins you calculated for the training data. (4 pts)

(b) Use sklearn's Categorical Naïve Bayes method with Laplace smoothing to build a classifier based on the categorized training data. Report the training misclassification error rate (the percentage of training data that are misclassified). (8 pts)

(c) Use sklearn's Categorical Naïve Bayes method to classify the test data, using the model you developed in 2b. Report the test misclassification error rate (the percentage of test data that are misclassified). (A sketch covering parts (a)-(c) follows.) (8 pts)
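A sketch of the extra credit, assuming the DataFrames and labels from part 1(b). The helper `to_bins` is illustrative: it maps each value to a bin index 0-9 using the training-set range, clipping test values that fall outside that range; `min_categories=10` guards against bins that never occur in the training data:

```python
# Equal-width binning fitted on the training set, then CategoricalNB with
# Laplace smoothing (alpha=1.0).
import numpy as np
from sklearn.naive_bayes import CategoricalNB

mins = X_train.min(axis=0).values
maxs = X_train.max(axis=0).values

def to_bins(X):
    # Scale each feature by the training-set range (guarding constant features),
    # then assign one of 10 equal-width bins, indexed 0..9.
    scaled = (X.values - mins) / np.where(maxs > mins, maxs - mins, 1.0)
    return np.clip(np.floor(scaled * 10), 0, 9).astype(int)

Xc_train, Xc_test = to_bins(X_train), to_bins(X_test)

cnb = CategoricalNB(alpha=1.0, min_categories=10)
cnb.fit(Xc_train, y_train)
print("CategoricalNB train error:", 1 - cnb.score(Xc_train, y_train))
print("CategoricalNB test error: ", 1 - cnb.score(Xc_test, y_test))
```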
