Project #5 – Data Analysis Report
Find a dataset that interests you, clean it, create some visualizations, and discuss what they tell you in a report.
Your Task
Your task is to carry out a little bit of data science on a dataset of your choosing. The finished product will be a report saved as a PDF walking through your code, explaining your dataset and what question you hope to answer with it, and the answers you have found.
To find a dataset, I recommend the website kaggle.com/datasets. From there you can browse for something that interests you and filter the results to show only CSV files. The dataset must be robust, with at least 50 unique rows and at least 4 columns. If you are unsure whether your dataset qualifies, or want approval, you may post a link privately on Piazza.
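The size requirement above can be checked with a few lines of standard-library Python before you commit to a dataset. This is only a sketch: the helper names `meets_size_requirement` and `load_rows` and the file name `my_dataset.csv` are hypothetical, and "unique rows" is assumed to mean rows that are not exact duplicates of one another.

```python
import csv

MIN_UNIQUE_ROWS = 50  # threshold taken from the assignment text
MIN_COLUMNS = 4       # threshold taken from the assignment text


def meets_size_requirement(rows):
    """Return True if rows (header first) has enough columns and unique data rows.

    `rows` is a list of lists, such as the result of list(csv.reader(f)).
    """
    if not rows:
        return False
    header, data = rows[0], rows[1:]
    # Tuples are hashable, so building a set drops exact duplicate rows.
    unique_rows = {tuple(row) for row in data}
    return len(header) >= MIN_COLUMNS and len(unique_rows) >= MIN_UNIQUE_ROWS


def load_rows(path):
    """Read every row of a CSV file into a list of lists."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```

With the CSV saved next to your program, `meets_size_requirement(load_rows("my_dataset.csv"))` reports whether the file qualifies.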
Once your dataset has been chosen, it is up to you to use your Python knowledge to come up with some questions about your data and find the answers. You may use the codesets in the next sections and the examples we have done in class as a basis for your analysis.
Once your program works to your satisfaction, you should start your report (PDF), which is what you will be submitting. The point of the PDF is to explain your data and your code, report your findings, and display the graphs you have created. See the next section to find examples to guide your work. Minimum: 1 page, Maximum: 10 pages, but your analysis must include all of the sections outlined ON THE NEXT PAGE.
There will be no autograder for this assignment, so be sure you are happy with how everything looks before the deadline. You will still be able to submit as many times as you want before the due date, but Gradescope will provide no feedback.
You must include a header at the top of your PDF containing your name, “Project #5”, and the date you turn it in.
Save your PDF as Project5.pdf. This is the only file that you need to submit to Gradescope, though you may submit as many times as needed. When you submit, make sure to assign all pages to the appropriate section as Gradescope asks. More information about this can be found in the Submission section.
Specific Requirements/Guidelines
Your code should (at least):
● Import any libraries needed.
○ You do not need to use numpy, plotly, or pandas but can use them if you desire. Alternatively, you can finish this project doing everything without any outside libraries.
● Clean the existing dataset (or show how you know no cleaning must be done).
● Perform 3 meaningful calculations with your dataset (averages, standard deviation, category counts, mode, median, etc.).
○ By meaningful we mean these calculations should answer/help answer the questions you created about your dataset.
● Represent your calculations with appropriate graph(s). (YOU MUST CREATE AT LEAST ONE GRAPH)
○ If you are having difficulty creating the graph, you may use another resource to create it (like Google Sheets or Excel), although any data or calculations used must come from your Python code.
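As a rough illustration of the cleaning and calculation steps listed above, the sketch below uses only the standard library. The column values and the helper `clean_numbers` are hypothetical; a real program would read its values from the chosen CSV file, and the right cleaning rules depend on the dataset.

```python
import statistics
from collections import Counter


def clean_numbers(raw_values):
    """Convert strings to floats, skipping blanks and non-numeric entries."""
    cleaned = []
    for value in raw_values:
        text = value.strip()
        if not text:
            continue  # missing value
        try:
            cleaned.append(float(text))
        except ValueError:
            continue  # entries such as "N/A" are dropped
    return cleaned


# Hypothetical sample columns, made up for illustration only.
raw_speeds = ["45", " 60", "", "N/A", "90", "60"]
raw_types = ["Fire", "Water", "Fire", "Grass", "Fire", "Water"]

speeds = clean_numbers(raw_speeds)

average_speed = statistics.mean(speeds)   # calculation 1: average
median_speed = statistics.median(speeds)  # calculation 2: median
type_counts = Counter(raw_types)          # calculation 3: category counts

print(average_speed, median_speed, type_counts.most_common(1))
```

In the report, it is worth stating how many values the cleaning step dropped and why, since that is part of explaining what needed to be done to clean the data.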
Your Report (PDF) should (at least):
0. Be between 1 and 10 pages in length.
a. There is some wiggle room here if you have lots of large graphs, or your program is particularly long.
1. Link your dataset.
a. If you cannot easily insert the URL, upload the file to Google Drive and insert the link to that copy of the dataset.
2. Describe any important information about your dataset:
a. What is the data?
b. Where did it come from?
c. Why were you interested in using it?
d. What needed to be done to clean it?
e. etc.
3. Describe the question you want to answer with your data.
a. What are you hoping to answer?
b. Why do you want to answer this question?
c. What will knowing the answer tell you about the world/the data?
d. etc.
4. Show and explain all the sections of code you have written
a. You do not need to explain every single line, but you should explain all the different sections, and any lines of particular interest.
b. Which lines were hard to write, and how did you overcome the challenge?
c. Which lines are performing something complex?
d. How are they doing that?
e. etc.
5. Show and explain the graphs you’ve created
a. You must have at least one graph, and it must be added to your report so we can see it!
i. You can have more if you desire
ii. Generally you should have however many it takes to answer your question
b. What is your graph telling you?
c. Why did you create this graph in this style? (histogram vs. bar chart vs. pie chart, etc.)
d. etc.
6. Summarize your findings and answer your initial question.
a. Does your analysis answer the question?
i. If it does, what is the answer?
ii. If it doesn't, what else needs to be done?
b. How did your analysis go, and how would you change your process if you could?
c. etc.
d. EXTRA CREDIT: what would you want to do to follow up this report? How does your question change? What are the next steps?
7. You can have more sections if you desire, or if it will make your report better. But these bolded/numbered sections are the minimum we are looking for.
Testing, Example Projects, Useful Links
You still need to test your program, but because of the nature of data science, your testing doesn’t need to be too robust. You merely need to verify that it runs correctly, and that it will run correctly regardless of who is running it (NO absolute file paths; they must be relative so our testers can run your program).
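The relative-path requirement can be illustrated with `pathlib`. This is a minimal sketch: the file name `my_dataset.csv` and the helper `is_portable` are hypothetical, and the CSV is assumed to sit in the folder the program is run from.

```python
from pathlib import Path

# Relative path: resolved against the folder the program is run from.
DATA_FILE = Path("my_dataset.csv")

# An absolute path such as "C:/Users/me/Desktop/my_dataset.csv" would only
# work on the machine where that folder exists.


def is_portable(path):
    """Return True if the path is relative, so it does not depend on one machine's folders."""
    return not Path(path).is_absolute()


print(is_portable(DATA_FILE))
```

A quick way to test this by hand is to copy your program and CSV into a fresh folder and run the program from there.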
Here are a couple finished projects from previous semesters:
(I haven't done this yet, hopefully I'll update soon)
Included below are several links to dataset walkthroughs that are good resources to base your analysis on. There are many more on the kaggle website (and elsewhere) so you should explore a bit. These codesets are much more robust than is needed for your project, but the decisions they make and the general structure should be illustrative:
● Do People Like Pineapple on Pizzas? | Kaggle
● Tweets Cleaning and Visualization
● There’s No Place Like Home | Kaggle
● How can we prevent traffic congestion ? | Kaggle
And here are some codesets that work as introductions to relevant Data Science Topics:
● Explore Your Data | Kaggle
● Handling Missing Values | Kaggle
● Selecting and Filtering in Pandas | Kaggle
● Plotly Tutorial for Beginners | Kaggle
Finally, here are some datasets I have found that might interest you; you can use these or find your own:
● Animal Crossing Catalog (Average costs for different types of clothes?)
● Pokemon with stats | Kaggle (What's the average speed stat of each type?)
● World Happiness Report | Kaggle (Which region has the happiest countries?)
● Spotify Dataset 1921-2020, 600k+ Tracks (Most/least popular genre over the years?)
● Chocolate Bar Ratings | Kaggle (Country with the highest-rated chocolate bar?)
● U.S. Electricity Prices | Kaggle (State rate analysis: average for South vs. North?)
● Pizza Hut Ratings and Reviews (sentiment analysis on good vs bad reviews?)
Extra Credit
You may do any, all, or none of these. If you attempt an extra credit item and it doesn’t work correctly, you will not be penalized, but you won’t receive the extra credit. I recommend at least attempting the extra credit, as some parts will be straightforward, and finishing them will lead to a more complete project. If you do attempt any of these, it would be helpful to label them clearly in your PDF, or assign the correct pages when submitting to Gradescope.
Extra Credit #1
Currently our PDF does not include any possible next steps for the project. For this extra credit, include a section in your report that outlines where you think the project could go next, and what other analysis techniques or data could be used if you wanted a more robust answer to your question.
Extra Credit #2
Currently our PDF’s graphs are very sparse/basic. For this extra credit you should spruce them up to be a bit nicer to look at: change the colors, adjust the opacity, and set a title and clear labels on your graph.
Extra Credit #3
For this extra credit you should use a function from either Pandas or Plotly in your project that we have not directly talked about in class. It is okay if this plot/calculation isn’t contributing much to your main project. In your PDF you should explain what the function is and how it is used. Lists of Plotly and Pandas functions can be found here: Plotly Express and General functions — pandas 1.5.1 documentation. (Post publicly on Piazza if you aren’t sure whether we’ve covered it in class.)
Submission
When you are done, turn in the assignment via Gradescope. You should submit your PDF, which must include a link to your dataset, all of the sections described above, and all the code you have written.
Again, there is no autograder for this project, so make sure you test before the due date passes. You should also proofread your PDF. We are not grading spelling or grammar, but if your statements are unclear to the point of incomprehensibility that could result in a loss of points.
When you submit the PDF to Gradescope, you will be prompted to assign the pages in your document that cover each of the graded sections. You must complete this, or you will receive a grading penalty. Information about how to do this can be found in "For Students: Submit Homework on the Gradescope Website" and "Submitting an Assignment - Gradescope Help Center".
Grading
In the report of your grade, you will see a score and a set of letter codes explaining what you did wrong. If you get 10 points, there will be no associated letter codes.
The grading codes A-G are defined and will be the same for all programs that you turn in. A summary of those codes is as follows (see the linked document for a full explanation):
A: -10 Student’s name is missing from the top of the program and PDF.
B: -100 Program cannot run due to syntax errors.
C: -100 Program crashes before finishing.
D: -100 Program runs to completion but does not solve the intended problem.
E: -50 Program runs to completion but does not solve all of the assigned problems.
F: -50 Program uses overly advanced methods not covered in class
G: -50 Program works correctly, but majorly changes the assignment in order to do it.
In addition, penalties for this assignment only will be incurred for the following infractions (which may supersede some of the generic codes listed above):
H: -50 PDF does not offer any meaningful explanation of process
J: -40 PDF does not include any graphs or other pictorial representations of data
K: -10 Code does not use a publicly available dataset found online, or it is not linked in your PDF
L: -10 Dataset is not robust, or does not require any calculations for analysis
M: -10 PDF is too confusing to be easily understood
N: -10 PDF is missing any key section outlined above
O: -10 Student did not select pages for each section
Each of the three extra credit items is worth +3.33 points, added to the score only if implemented correctly. Incorrect implementation will not be penalized.