Advanced Business Application Development / Advanced Programming Application Development
Fall 2020
Assignment Three
Points: 40
Due: 12:00 PM November 2nd
Format of your answer: For questions that ask you to run code, provide your R code first, then copy and paste the executed result from your Console. Label your R file appropriately so that I can assess your code more easily.
For questions that ask for further insights based on your analysis, write your thoughts right after the corresponding analysis result.
You need to submit two files: (1) a Word document; (2) the raw R code for the corresponding questions.
Example:
year <- rep(2008:2010, each=4)
quarter <- rep(1:4, 3)
cpi <- c(162.2, 164.6, 166.2, 167.0, 171.0, 172.1, 166.5, 166.0, 168.6, 169.5, 173.3, 174.0)
cbind(cpi, year, quarter)
> year <- rep(2008:2010, each=4) # rep() with each=4 repeats every year four times
> quarter <- rep(1:4, 3)
> cpi <- c(162.2, 164.6, 166.2, 167.0, 171.0, 172.1, 166.5, 166.0, 168.6, 169.5, 173.3, 174.0)
> cbind(cpi, year, quarter) # cbind() combines the three variables above as columns
        cpi year quarter
 [1,] 162.2 2008       1
 [2,] 164.6 2008       2
 [3,] 166.2 2008       3
 [4,] 167.0 2008       4
 [5,] 171.0 2009       1
 [6,] 172.1 2009       2
 [7,] 166.5 2009       3
 [8,] 166.0 2009       4
 [9,] 168.6 2010       1
[10,] 169.5 2010       2
[11,] 173.3 2010       3
[12,] 174.0 2010       4
Question 1: Association Rules: Identifying Course Combinations (15 points)
The Institute for Statistics Education at Statistics.com offers online courses in statistics and analytics and is seeking information that will help in packaging and sequencing courses. Consider the data in the file Coursetopics.csv, the first few rows of which are shown below. These data are for purchases of online statistics courses at Statistics.com. Each row represents the courses attended by a single customer. The firm wishes to assess alternative sequencings and bundlings of courses.
(Each column represents one statistics course; 1: attended; 0: not attended)
1.1 Convert the data to a transaction database format and display the transactions in a readable form. (2 points)
1.2 Draw an item frequency plot and state which statistics course was the most popular. (3 points)
1.3 Build an association rule model with a minimum support of 0.01 and a minimum confidence of 0.5. Sort the resulting rules by lift and show the first ten. Interpret the rules, and discuss which ones are strong rules and why. (5 points)
1.4 Build an association rule model with a minimum support of 0.05 and a minimum confidence of 0.3. Show the resulting rules sorted by lift. Compare the rules from 1.3 with those from 1.4 and, if they differ, discuss why. (5 points)
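The general workflow behind the four parts of Question 1 can be sketched on toy data. This is a minimal illustration, assuming the arules package is installed; the data frame courses.df, its course names, and the thresholds used here are hypothetical and are not taken from Coursetopics.csv or from the assignment.

```r
library(arules)  # assumption: the arules package is installed

# Hypothetical toy data: 6 customers x 4 courses (1 = attended, 0 = not attended)
courses.df <- data.frame(
  Intro      = c(1, 1, 0, 1, 1, 0),
  Regression = c(1, 0, 0, 1, 1, 0),
  DataMining = c(0, 1, 1, 1, 0, 0),
  Forecast   = c(0, 0, 1, 0, 1, 1)
)

# Turn the 0/1 table into a logical matrix, then into a transactions object
courses.mat <- as.matrix(courses.df) == 1
trans <- as(courses.mat, "transactions")
inspect(trans)            # one readable line per customer

# Relative frequency of each course across all transactions
itemFrequencyPlot(trans)

# Mine rules; these toy thresholds are far higher than a real data set would need
rules <- apriori(trans,
                 parameter = list(support = 0.3, confidence = 0.5, minlen = 2))

# Show up to ten rules, strongest lift first
inspect(head(sort(rules, by = "lift"), 10))
```

Lift is the rule's confidence divided by the support of its consequent, so a lift well above 1 suggests the antecedent and consequent occur together more often than independence would predict. Raising the support threshold generally leaves fewer rules, because fewer itemsets qualify as frequent.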
Question 2: Course ratings (recommendation systems) (10 points)
The Institute for Statistics Education at Statistics.com asks students to rate a variety of aspects of a course as soon as the student completes it. The Institute is contemplating instituting a recommendation system that would provide students with recommendations for additional courses as soon as they submit their rating for a completed course. Consider the file courserating.csv, which contains student ratings of online statistics courses (shown in the following table), and answer the following questions.
# Pre-processing: run the following instructions on your data before starting the analysis.
# After loading courserating.csv (assumed here to be named rating.df), run the code below to use each student's name as the row name of that student's observation. Use this updated rating.df for the rest of the analysis.
row.names(rating.df) <- rating.df[,1]
rating.df <- rating.df[,-1]
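To see what these two lines do, here is a small sketch on a made-up data frame. The student names, course names, and ratings are hypothetical and only stand in for the real courserating.csv.

```r
# Hypothetical stand-in for the loaded courserating.csv
rating.df <- data.frame(
  Student = c("Ana", "Ben", "Cy"),
  SQL     = c(4, NA, 3),
  Python  = c(NA, 5, 2),
  stringsAsFactors = FALSE
)

# Use the first column (student names) as row names ...
row.names(rating.df) <- rating.df[, 1]
# ... then drop that column so only numeric ratings remain
rating.df <- rating.df[, -1]

# With the name column gone, as.matrix() yields a numeric matrix
rating.mat <- as.matrix(rating.df)
print(rating.mat)
```

Dropping the name column matters because as.matrix() on a data frame that still contains a character column would produce a character matrix rather than a numeric one.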
2.1 Build an item-based recommendation system, predict ratings, and show the recommendation results for the first 5 users. (Note: convert your data frame to a matrix first, and then to a realRatingMatrix, before building your recommendation model. as.matrix(name of your object) converts a data frame to a matrix; converting a matrix to a realRatingMatrix was covered in class.) (4 points)
2.2 Using the same item-based recommendation model, create top-2 recommendations for the first 4 users. (4 points)
2.3 Compare the two types of recommender systems, user-based and item-based, and discuss their distinct characteristics. (2 points)
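The conversion chain and the two kinds of prediction mentioned above can be sketched as follows. This is only an illustration, assuming the recommenderlab package is installed; the matrix toy.mat and all of its ratings are hypothetical, and the model uses the package's default IBCF settings rather than anything specified in the assignment.

```r
library(recommenderlab)  # assumption: the recommenderlab package is installed

# Hypothetical ratings: 5 students x 4 courses, NA = not rated
toy.mat <- matrix(
  c( 4, NA,  3,  5,
     3,  4, NA,  4,
    NA,  2,  4,  3,
     5,  3,  2, NA,
     4, NA,  4,  2),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("Student", 1:5),
                  c("SQL", "Python", "Forecast", "Regression"))
)

# matrix -> realRatingMatrix
r <- as(toy.mat, "realRatingMatrix")

# Item-based collaborative filtering with default parameters
rec <- Recommender(r, method = "IBCF")

# Predicted ratings for the first 3 students
pred <- predict(rec, r[1:3], type = "ratings")
print(as(pred, "matrix"))

# Top-2 course recommendations for the same students
top <- predict(rec, r[1:3], n = 2)
print(as(top, "list"))
```

Swapping method = "IBCF" for method = "UBCF" gives the user-based counterpart, which compares students to each other instead of comparing courses to each other.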
Question 3: Pharmaceutical Industry (cluster analysis) (15 points)
An equities analyst is studying the pharmaceutical industry and would like your help in exploring and understanding the financial data collected by her firm. Her main objective is to understand the structure of the pharmaceutical industry using some basic financial measures.
Financial data gathered on 21 firms in the pharmaceutical industry are available in the file Pharmaceuticals.csv. For each firm, the following variables are recorded:
1: Name
2: Market_Cap (market capitalization in billions of dollars)
3: PE_Ratio (price/earnings ratio)
4: ROE (return on equity)
5: ROA (return on assets)
6: Asset_Turnover (asset turnover)
7: Leverage
8: Rev_Growth (estimated revenue growth)
9: Net_Profit_Margin (net profit margin)
3.1 Build a hierarchical clustering model using Euclidean distance between records and average linkage (average distance) between clusters. Include a screenshot of your plot. (Note: remember to normalize your variables before building the model.) (4 points)
3.2 Set the cut-off distance to 1.3 and then to 2.6, and show the membership of each cluster in both cases. (4 points)
3.3 Build two k-means cluster models, using k=4 and k=3, and report the total within-cluster sum of squares of each (i.e., tot.withinss in R). Based on these two measures, state which model is better. (5 points)
3.4 Compare hierarchical and k-means clustering and discuss their distinct characteristics. (2 points)
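The steps named in Question 3 (normalize, build a dendrogram, cut it at a height, fit k-means) can be sketched with base R alone. The six firms, their values, the cut height of 1.5, and the choices k=2 and k=3 below are hypothetical and are not drawn from Pharmaceuticals.csv or from the assignment.

```r
set.seed(1)  # k-means uses random starts; fix the seed for repeatability

# Hypothetical toy data: 6 firms, 3 financial measures
firms.df <- data.frame(
  Market_Cap = c(68.4,  7.6, 16.9, 51.3,  0.4, 96.7),
  PE_Ratio   = c(24.7, 82.5, 21.5, 13.9, 28.4, 27.9),
  ROE        = c(26.4, 12.9, 14.9, 34.8,  6.8, 31.0),
  row.names  = paste0("Firm", 1:6)
)

# Normalize each variable to mean 0 and standard deviation 1
firms.norm <- scale(firms.df)

# Hierarchical clustering: Euclidean distance, average linkage
d  <- dist(firms.norm, method = "euclidean")
hc <- hclust(d, method = "average")
plot(hc, hang = -1)              # dendrogram

# Cluster membership when the tree is cut at a chosen height
memb <- cutree(hc, h = 1.5)
print(memb)

# Two k-means models; nstart reruns from several random initial centers
km2 <- kmeans(firms.norm, centers = 2, nstart = 25)
km3 <- kmeans(firms.norm, centers = 3, nstart = 25)
print(c(k2 = km2$tot.withinss, k3 = km3$tot.withinss))
```

One caveat when reading tot.withinss: the best achievable value can only stay the same or shrink as k grows, so a smaller total for the larger k does not by itself show that the extra cluster is meaningful.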