3/4/20, 1:03 pm MATH2319_2020_Assignment1
MATH2319 Machine Learning
Semester 1, 2020
Assignment 1
Assignment Rules: Please read carefully!
1. Assignments are to be treated as "limited open-computer" take-home exams. That is,
you must work on the assignments on your own. You must not discuss your assignment
solutions with anyone else (including your classmates, paid/unpaid tutors, friends, parents,
relatives, etc.) and the submission you make must be your own work. In addition, no
member of the teaching team will assist you with any issues that are directly related to your
assignment solutions.
2. All solutions must be provided in Python 3.6+ with results documented in Jupyter Notebook.
3. You must clearly show all your work for full credit. In particular, you need to clearly label your
solutions with appropriate headings, subheadings, lists, etc. Also keep in mind that just
providing Python code will not get you full credit even if it's correct. You need to explain all
your reasoning and document all your steps in plain English. That is, you must submit a
professional piece of work as your assignment solutions.
4. For solutions that are ambiguous, or solutions that are all over the place, you may receive
zero points (even if they are correct!) as we have no obligation to spend hours and hours of our
time to decipher your notebook.
5. Once you are done, it is your responsibility to run your notebook and then save it as an
HTML file before submission. Your solutions shall be marked exactly as they appear in your
HTML file.
6. You must submit a single file (in HTML format) that contains all your solutions to all the
questions.
7. For other assignment rules, please refer to this web page:
https://rmit.instructure.com/courses/67061/assignments/424265
8. It is your responsibility to follow any and all assignment rules stated in the above web
page.
9. Do not forget to include the Honour Code or your assignment shall not be marked.
10. If you need to make any assumptions at any point so that you can continue for any question,
please state these assumptions and clearly explain your reasoning.
11. Suspected cheating incidents shall be reported to the RMIT Student Conduct Office for possible
disciplinary action.
Question 1
(65 points)
Data preprocessing is a critical component in machine learning and its importance cannot be
overstated. If you do not prepare your data correctly, you can use the fanciest machine learning
algorithm in the world and your results will still be incorrect.
For this question, you will perform any and all data preprocessing steps on a dataset from the UCI
ML Datasets Repository so that the clean dataset you end up with can be directly fed into any
classification algorithm within the Scikit-Learn Python module without any further changes.
This dataset is the Credit Approval data at the following address:
https://archive.ics.uci.edu/ml/datasets/Credit+Approval
The UCI Repository provides four datasets, but only two of them will be relevant:
crx.names : Some basic info on the dataset together with the feature names and values
crx.data : The actual data in comma-separated format
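As a rough sketch of how a headerless comma-separated file like this might be read with Pandas: the two sample rows below are made up for illustration, the helper name read_headerless_csv is hypothetical, and treating "?" as the missing-value marker is an assumption that should be checked against crx.names.

```python
import io

import pandas as pd

# Made-up stand-in rows (not real crx.data records); "?" marks a missing value here.
SAMPLE = "b,30.83,0,+\na,?,4.46,-\n"


def read_headerless_csv(source, n_features):
    """Read a CSV with no header row, naming columns A1..An plus a final 'target'."""
    names = [f"A{i}" for i in range(1, n_features + 1)] + ["target"]
    # na_values="?" converts the assumed missing-value marker to NaN on load.
    return pd.read_csv(source, header=None, names=names, na_values="?")


df = read_headerless_csv(io.StringIO(SAMPLE), n_features=3)
```

For the real file, a local path to the downloaded crx.data could be passed in place of the io.StringIO object, with n_features set to the number of descriptive features listed in crx.names.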
Instructions:
1. If you are having issues with reading in the dataset directly (which is most likely due to UCI's
or your web browser's SSL settings), you can download the file on your computer manually
and then upload it to your Azure project, which you can then read in as a local file.
2. This is a very small dataset. So please do not perform any sampling.
3. Make sure you follow the best practices outlined in the Data Prep lecture presentation (on
Chapters 2 and 3) on Canvas and the Data Prep tutorial
(https://www.featureranking.com/tutorials/machine-learning-tutorials/data-preparation-for-machine-learning/)
on our website.
4. As a general rule, all categorical features need to be assumed to be nominal unless you have
evidence to the contrary.
5. As for potential outliers in numerical descriptive features, this is an anonymised dataset, so
please do not flag any numerical values as outliers regardless of their value for this question.
6. For this question, you are to set all unusual values (and all outliers, if there are any) to missing
values. Also, you are to impute any missing values with the mode for categorical features
and with the median for numerical features. If there are multiple modes for a categorical
feature, use the mode that comes first alphabetically.
7. For the A2 numerical descriptive feature, you are to discretize it via equal-frequency binning
with 3 bins named "low", "medium", and "high", and then use integer encoding for it.
8. For normalization, you are to use standard scaling. You are allowed to use Scikit-Learn's
preprocessing submodule for this purpose.
9. The target feature needs to be the last column in the clean data and its name needs to be
target.
10. You must perform all your preprocessing steps using Python. For any cleaning steps that you
perform via Excel or simple find-and-replace in a text editor or any other language or in any
other way, you will receive zero points.
11. It's critical that the final clean data does not need any further processing so that it will work
without any issues with any classifier within Scikit-Learn.
12. Once you are done, name your final clean dataset as df_clean (if it's not already named as
such).
13. At the end, run each one of the following three lines in three separate code cells for a
summary:
df_clean.shape
df_clean.describe(include='all').round(3)
df_clean.head(5)
14. Save your final clean dataset exactly as "df_clean.csv". Make sure your file has the correct
column names (including the target column). Next, you will upload this CSV file to Canvas
as part of your assignment solutions. That is, in addition to an HTML file (that contains your
solutions), you also need to upload your clean data in CSV format on Canvas with this
name.
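Several of the steps above (median and mode imputation, equal-frequency binning, one-hot encoding of nominal features, standard scaling, and placing the target last) can be sketched on a toy frame as follows. This is only an illustration under stated assumptions, not a solution: the toy values, the column names other than A2, the helper names, and the 1/0 encoding of the target are all hypothetical. Scaling is done by hand with the population standard deviation (ddof=0), which is the convention Scikit-Learn's StandardScaler uses.

```python
import pandas as pd


def impute(df):
    """Median for numeric columns; alphabetically first mode for the others."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() returns every tied value; sort and keep the first one.
            out[col] = out[col].fillna(out[col].mode().sort_values().iloc[0])
    return out


def bin_equal_frequency(series):
    """Three equal-frequency bins, integer-encoded as low=0, medium=1, high=2."""
    binned = pd.qcut(series, q=3, labels=["low", "medium", "high"])
    return binned.cat.codes.astype(int)


def standard_scale(df, columns):
    """Subtract the mean and divide by the population standard deviation."""
    out = df.copy()
    for col in columns:
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out


# Toy data for illustration only; these are not values from crx.data.
toy = pd.DataFrame({
    "A1": ["b", "a", None, "b", "a", "c"],      # "a" and "b" tie as modes
    "A2": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "A3": [1.0, None, 3.0, 4.0, 5.0, 100.0],
    "A16": ["+", "-", "+", "-", "+", "-"],      # assumed target column
})

# Set the target aside so it is neither imputed nor scaled (assumed 1/0 coding).
target = toy["A16"].map({"+": 1, "-": 0}).astype(int)
features = impute(toy.drop(columns="A16"))
features["A2"] = bin_equal_frequency(features["A2"])
features = standard_scale(features, ["A3"])
features = pd.get_dummies(features, columns=["A1"], dtype=int)

clean = features.copy()
clean["target"] = target  # appended last, so it ends up as the final column
```

The order of operations matters: one-hot encoding is applied before the target is appended because pd.get_dummies moves the encoded columns to the end of the frame.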
Please do not ask teaching staff any questions about this Credit Approval dataset as we do
not know anything more than what UCI already provides on their website.
If you still need any help, please remember that you are allowed to search the Internet for generic
questions, such as "how to change column order in Pandas" etc. Keep in mind that 99% of the
time, a Google search will give you a much faster answer to your question than posting it on a
discussion forum.
If you run into any errors, the best course of action would be just to Google your error message.
Good luck!
For Question 2, please follow the instructions below:
1. Textbook info can be found on Canvas at this link:
https://rmit.instructure.com/courses/67061/pages/course-resources
2. You must show all your calculations and you must perform all your calculations using Python.
You must also document all your work in Jupyter notebook format.
3. You may not use any one of the classifiers in the Scikit-Learn module. Likewise, you may not
use any one of the preprocessing methods in the Scikit-Learn module. You will need to show
and explain all your solution steps without using the Scikit-Learn module. You will not
receive any points for any work that uses Scikit-Learn for Question 2. The reason for this
restriction is so that you get to learn how some things work behind the scenes. But don't
worry, you will be using Scikit-Learn quite a bit in subsequent assessments.
Question 2
(35 points, 7 points for each part)
Solve Chapter 5, Exercise 3 (all five parts) in the textbook, but instead of the Euclidean distance,
use the Manhattan distance. All exercise parts must be solved with the Manhattan distance
metric.
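For reference, the Manhattan distance between two feature vectors is the sum of the absolute differences of their coordinates, whereas the Euclidean distance is the square root of the sum of squared differences. A minimal sketch in plain Python (no Scikit-Learn) is given below; the function names are hypothetical and the vectors are made up for illustration, not taken from the textbook exercise.

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of features")
    return sum(abs(a - b) for a, b in zip(p, q))


def rank_by_distance(query, instances):
    """Return (distance, index) pairs sorted from nearest to farthest."""
    return sorted((manhattan_distance(query, x), i) for i, x in enumerate(instances))


# Made-up feature vectors for illustration only.
instances = [(1, 2), (4, 6), (0, 0)]
query = (1, 1)
ranking = rank_by_distance(query, instances)
```

For example, the points (1, 2) and (4, 6) are 3 + 4 = 7 apart under the Manhattan metric but 5 apart under the Euclidean metric, so neighbour rankings can differ between the two.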
www.featureranking.com