ADM3308: Business Data Mining
Data Mining Project Using IBM SPSS Modeler
(Team work)
_____________________________________________________________________________________
_____________________________________________________________________________________
Weight: 25% of the final mark. This is a team work project (only one submission per team).
_____________________________________________________________________________________
Important Note: Read the following academic integrity statement, type in your full name and student ID, and include a copy in your submission. Electronic submission of this form by the team representative is considered a signature of the document by BOTH members of the team.
Personal Ethics & Academic Integrity Statement
By typing in my name and student ID on this form and submitting it electronically, I am attesting to the fact that I have reviewed not only my own work, but the work of my team member, in its entirety.
I attest to the fact that my own work in this project adheres to the fraud policies as outlined in the Academic Regulations in the University’s Undergraduate Studies Calendar. I further attest that I have knowledge of and have respected the “Beware of Plagiarism” brochure found on the Telfer School of Management’s doc-depot site. To the best of my knowledge, I also believe that each of my group colleagues has also met the aforementioned requirements and regulations. I understand that if my group assignment is submitted without a completed copy of this Personal Work Statement from each group member, it will be interpreted by the school that the missing student(s) name is confirmation of non-participation of the aforementioned student(s) in the required work.
We, by typing in our names and student IDs on this form and submitting it electronically:
- warrant that the work submitted herein is our own group members’ work and not the work of others;
- acknowledge that we have read and understood the University Regulations on Academic Misconduct;
- acknowledge that it is a breach of University Regulations to give or receive unauthorized and/or unacknowledged assistance on a graded piece of work.
IBM SPSS Modeler is a commercial data mining package from IBM that supports data mining tasks, including predictive and descriptive modeling, through a user-friendly interface. IBM Modeler is available on the computers in the lab. Tutorials on using IBM Modeler for data mining will be presented in class. Students are also required to consult online resources to learn more about IBM Modeler.
For this project, you are required to complete two parts:
Part-1 (100 points): Data mining modelling project using a dataset selected from Table-1.
Part-2 (30 points): Data pre-processing and data cleaning of the raw dataset provided to you (Unclean-Bank-Data.Xlsx), using IBM SPSS Modeler nodes.
PART-1
(A) Dataset Selection:
Each team must select one of the datasets listed in Table-1 (or one from another recommended repository, with the pre-approval of the professor) and announce it on the “Discussion Board” in the Forum named “Announcing Dataset Selection”. Post your name, your team member’s name, and the dataset selected. If a dataset has already been claimed by another team, as posted on the Forum, it cannot be selected by other teams. I therefore recommend that you select your dataset and announce it on the Discussion Board as early as possible.
NOTE: You may choose a dataset other than those listed in Table-1 with the professor’s prior approval. If you would like to analyze a dataset not listed in Table-1, please email me the details of the dataset for my review (e.g. the source of the data, the number of records, the number of attributes).
(B) Data Analysis and Model Building:
You are required to import the data and perform pre-processing tasks if needed (such as reformatting the data, normalizing it, and dealing with missing values and outliers), followed by two or more modeling tasks such as classification (decision tree, Bayesian, KNN, neural networks, etc.), clustering (K-means, agglomerative), and association rules mining.
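The project itself must be carried out in IBM SPSS Modeler, but it can help to see what two of the steps named above compute. The following is a minimal Python sketch, not part of the assignment, of min-max normalization followed by a KNN classification. The function names, the toy (age, balance) records, and the choice of k = 3 are all assumptions made for illustration.

```python
import math
from collections import Counter


def fit_min_max(rows):
    """Return the per-column minimum and maximum of a list of numeric rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(row, lows, highs):
    """Rescale one row to [0, 1] using previously fitted column ranges."""
    return [
        (value - lo) / (hi - lo) if hi > lo else 0.0
        for value, lo, hi in zip(row, lows, highs)
    ]


def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the majority label among the k nearest training rows."""
    neighbours = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in neighbours[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: (age, balance) with a yes/no outcome.
raw_rows = [[25, 500], [30, 800], [45, 5000], [50, 6000], [35, 900]]
labels = ["no", "no", "yes", "yes", "no"]

# Fit the ranges on the training rows, then scale rows and query alike,
# so that balance does not dominate the distance just because it is larger.
lows, highs = fit_min_max(raw_rows)
scaled_rows = [apply_min_max(r, lows, highs) for r in raw_rows]
scaled_query = apply_min_max([48, 5500], lows, highs)

prediction = knn_predict(scaled_rows, labels, scaled_query, k=3)
print(prediction)
```

Normalizing before computing distances matters here because the two attributes are on very different scales; without it, the nearest neighbours would be decided almost entirely by balance.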
(C) Project Report for Part-1:
Your report for this part of the project should include:
- Explaining the data you selected for your project (attributes, instances, etc.)
- Explaining your pre-processing tasks if any (cleaning, transforming, normalizing, etc.)
- Explaining the data mining modeling techniques you performed on the data (at least two techniques)
- Demonstrating the graphs/tables of the results produced by the techniques
- Interpretation of the modeling results: useful patterns, predicted values, significance of the features, what actions you might suggest based on your findings
- Concluding remarks, your recommendations, actionable discoveries, and future trends/studies you would recommend
Overall, your report for this part of the project should be 15 to 25 pages long (including graphs). Use 12pt Times New Roman font with 1.5 line spacing. Keep a margin of 1” on all sides of the page.
Rubrics for Part-1
Your report for Part-1 of the project will be evaluated as follows:
| Components of the Report (Part-1) | Points |
|---|---|
| Abstract or executive summary | 10 |
| Explanation of the dataset and the pre-processing tasks (if any) to prepare the data | 10 |
| Explanation of at least two data mining tasks you performed on the data. Also, explain why you considered the specific data mining tasks for your dataset | 20 |
| Relevant graphs showing the output results of the techniques you applied | 20 |
| Interpretation of the modeling results: useful patterns, predicted values, significance of the features, what actions you might suggest based on your findings | 10 |
| A conclusion section summarizing your findings, discussing the results, your understanding of the results, your recommendations, and any useful patterns, rules, predictions or future trends you infer from the data | 10 |
| Overall organization of the paper, its soundness and readability, and quality of the presentation | 20 |
| Total (Part-1) | 100 |
(D) List of Datasets:
Select one of the following datasets, then post a message on the Discussion Board on Brightspace to claim your dataset.
Table-1: List of datasets for Part-1 of the project
Note: These datasets are available at the UCI Machine Learning Repository. For more information, visit http://archive.ics.uci.edu/ml/datasets.html
| # | Name | Number of features | Number of samples | Comments |
|---|---|---|---|---|
| 1 | Waveform Database Generator (version 2) | 40 | 5000 | Use the dataset without noise |
| 2 | Statlog (Landsat Satellite) | 36 | 6435 | Training and testing datasets are different |
| 3 | seismic-bumps | 22 | 8124 | |
| 4 | Image Segmentation | 19 | 2310 | Use only the testing dataset |
| 5 | Bank Marketing | 17 | 45211 | |
| 6 | Pen-Based Recognition of Handwriting Digits | 16 | 10992 | Training and testing datasets are different |
| 7 | Student Performance | 33 | 649 | |
| 8 | Adult | 14 | 48842 | Training and testing datasets are different |
| 9 | Statlog (Shuttle) | 9 | 58000 | |
| 10 | Abalone | 8 | 4177 | |
| 11 | Nursery | 8 | 12960 | |
| 12 | Yeast | 8 | 1484 | |
| 13 | One-hundred plant species leaves data set | 64 | 1600 | Use just data_Mar_64.txt |
| 14 | Spambase | 57 | 4601 | |
| 15 | Cardiotocography | 23 | 2126 | |
| 16 | Statlog (German Credit Card) | 20 | 1000 | |
| 17 | Letter Recognition | 16 | 20000 | |
| 18 | EEG Eye State | 15 | 14980 | |
| 19 | Page Blocks Classification | 10 | 5473 | |
| 20 | Contraceptive Method Choice | 9 | 1473 | |
| 21 | Weight lifting exercises monitored | 10 | 39242 | Use the following features: roll_belt, pitch_belt, yaw_belt, gyros_belt_x, gyros_belt_y, gyros_belt_z, accel_belt_y, accel_belt_z, magnet_belt_x, magnet_belt_y (class as output) |
| 22 | Connect-4 | 42 | 67557 | |
| 23 | Mushroom | 22 | 8124 | |
| 24 | Default of credit card clients | 24 | 30000 | |
| 25 | Autism Screening Adult Data Set | 21 | 704 | |
| 26 | Drug consumption (quantified) Data Set | 32 | 1885 | |
| 27 | Polish companies bankruptcy data Data Set | 64 | 10503 | |
PART-2
In this part of the project, all teams will use the dataset Unclean-Bank-Data.Xlsx posted on the “Project Description” page of the course website.
This dataset includes missing values, invalid values, and outliers. You should use the IBM SPSS Modeler nodes to pre-process and clean the data.
Do not remove a record if there is only one missing value in that record. Instead, use the IBM Modeler to fill in the missing value with an algorithm of your choice.
Similarly, do not remove a record if it has only one invalid value. Instead, use the IBM Modeler to fill in the invalid value with an algorithm of your choice.
If you find a record with more than one missing value, or more than one invalid value, you may either remove the record or use the IBM Modeler to fill in the missing or invalid values.
If you detect outliers, you may then delete the entire record.
You may also want to do other pre-processing tasks such as data normalization, binning data, etc.
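The rules above can be summarized in a short Python sketch. It is only an illustration of the logic; the actual cleaning must be done with IBM SPSS Modeler nodes. The field names, the validity ranges, the use of the median as the fill-in algorithm, and the 1.5 × IQR outlier fences are all assumptions chosen for this example, and the sketch opts to remove (rather than fill) records with more than one bad value.

```python
import statistics


def is_bad(field, value):
    """Return True for a missing (None) or invalid value.

    The validity ranges below are assumptions made for this sketch.
    """
    if value is None:
        return True
    if field == "age":
        return not 18 <= value <= 100
    if field == "income":
        return value < 0
    return False


def clean(records, fields=("age", "income")):
    # Rule 1: drop a record only when it has more than one bad value.
    kept = [
        dict(r) for r in records
        if sum(is_bad(f, r[f]) for f in fields) <= 1
    ]
    # Rule 2: fill a single bad value with the median of the valid values.
    for f in fields:
        valid = [r[f] for r in kept if not is_bad(f, r[f])]
        fill = statistics.median(valid)
        for r in kept:
            if is_bad(f, r[f]):
                r[f] = fill
    # Rule 3: delete records whose income lies outside the 1.5 * IQR fences.
    q1, _, q3 = statistics.quantiles([r["income"] for r in kept], n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [r for r in kept if low <= r["income"] <= high]


# Hypothetical records standing in for rows of the unclean dataset.
records = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},   # one missing value: filled in
    {"age": -5, "income": 51000.0},     # one invalid value: filled in
    {"age": None, "income": None},      # two missing values: removed
    {"age": 41, "income": 50000.0},
    {"age": 38, "income": 900000.0},    # extreme income: removed as outlier
    {"age": 29, "income": 47000.0},
    {"age": 45, "income": None},        # one missing value: filled in
]

cleaned = clean(records)
for row in cleaned:
    print(row)
```

The median is used here because, unlike the mean, it is barely affected by the extreme income that is removed in the last step; any other documented fill-in method would satisfy the rules equally well.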
Deliverables for Part-2:
1- Include in your project report a short explanation of three different cleaning and pre-processing tasks you applied to the data using IBM SPSS Modeler.
2- Also, include the clean dataset (name it “Clean-Bank-Data.xlsx”) in your submission together with your project report (you may submit everything in one zip file).
Rubrics for Part-2
Your report for Part-2 of the project will be evaluated as follows:
| Components of the Report (Part-2) | Points |
|---|---|
| Explanation of three cleaning and pre-processing tasks applied to the data; explaining the results after you pre-processed the data; including the Clean-Bank-Data.xlsx with your submission | 3 × 10 |
| Total (Part-2) | 30 |