Data Analytics Task - Climate Data Analysis using Python
1.1 General Overview
The assignment comprises code writing and data analysis. You may discuss ideas with peers, but your code, experiments and report must be entirely your own work.
The assignment leverages material covered in class. You will work with two meteorological datasets: you will be required to process the data, clean the datasets and present correlations. Specifically, there are three tasks to solve.
The goals of the assignment are the following:
• To further develop your programming skills
• To further develop your skills in, and understanding of, the principles of data analytics and machine learning
• To acquire experience in dealing with real-world data
1.2 Assignment description
You will find two pickle files, named weather-denmark-resampled.pkl and df_perth.pkl respectively, and you will be asked to solve three different tasks. For TASK 1 and TASK 2 (covering the main aspects of preliminary data analysis, missing data and outlier detection), you will use the first dataset. For TASK 3 (covering correlation and pattern inference), you will use the second, smaller dataset to find correlations and infer patterns.
Read the three task descriptions carefully and address them using the provided Jupyter notebook named Coursework_weather_data.ipynb.
TASK 1 – PRELIMINARY ANALYSIS
In this first task, you will explore the dataset. Follow the instructions below:
a. Import the weather-denmark-resampled.pkl dataset provided in the folder and explore the dataset by answering the following questions.
i. How many cities are there in the dataset?
ii. How many observations and features are there in this dataset?
iii. What are the names of the different features?
b. Now that you are familiar with the dataset, check whether it contains any missing values. If so, remove them using the appropriate pandas built-in function.
c. Extract the general statistical properties summarising the minimum, maximum, median, mean and standard deviation values for all the features in the dataset. Spot any anomalies in these properties and clearly explain why you classify them as anomalies.
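A minimal sketch of how Task 1 could be approached is given below. It assumes the pickle holds a single pandas DataFrame whose columns form a two-level MultiIndex of (city, feature), as the hint in Task 2 suggests; adjust the indexing if the actual structure differs.

```python
import pandas as pd

# Load the resampled Danish weather dataset (assumed to be a DataFrame
# with MultiIndex columns: level 0 = city, level 1 = feature).
df = pd.read_pickle("weather-denmark-resampled.pkl")

# (a) cities, number of observations/features, and feature names
cities = df.columns.get_level_values(0).unique()
features = df.columns.get_level_values(1).unique()
print(f"Cities ({len(cities)}): {list(cities)}")
print(f"Observations: {df.shape[0]}, total columns: {df.shape[1]}")
print(f"Features per city: {list(features)}")

# (b) count missing values per column, then drop the affected rows
print(df.isna().sum())
df_clean = df.dropna()

# (c) summary statistics (min, max, median, mean, std) for every feature
stats = df_clean.describe().loc[["min", "max", "50%", "mean", "std"]]
print(stats)
```

Dropping rows is only one possible strategy; whichever you choose, compare the summary statistics before and after cleaning so you can argue that the dataset's properties are preserved.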
TASK 2 – OUTLIERS
The second task is focused on spotting and handling outliers. Follow the instructions below:
d. Store the temperature measurements in May 2006 for the city of Odense. Then produce a simple plot of the temperature versus time.
HINT: In this dataset, the cities are vertically stacked. Therefore, we have a multi-column dataset, which basically works as a nested dictionary.
e. Find the outliers in this set of measurements (if any) and replace them using your own choice of interpolation.
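The following sketch illustrates one possible way of tackling (d) and (e). The feature name "Temp" and the z-score threshold of 3 are assumptions used for illustration; use whatever feature names Task 1 reveals and whatever outlier rule and interpolation method you can justify.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_pickle("weather-denmark-resampled.pkl")

# (d) temperature for Odense in May 2006 (MultiIndex columns behave like
# a nested dictionary: select the city first, then the feature), over time
temp = df["Odense"]["Temp"].loc["2006-05"]
temp.plot(title="Odense temperature, May 2006")
plt.ylabel("Temperature")
plt.show()

# (e) flag outliers with a simple z-score rule (one possible choice),
# blank them out and fill the gaps by time-based interpolation
z = (temp - temp.mean()) / temp.std()
outliers = z.abs() > 3
print(f"Outliers detected: {outliers.sum()}")
temp_fixed = temp.mask(outliers).interpolate(method="time")
```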
TASK 3 – CORRELATION
In this last task, you will look for correlations between features of the data. For this task, you will be working with a smaller dataset. Follow the instructions below:
f. We now take a new dataset (df_perth.pkl), which collects climate data for a city in Australia. Here we have just one year of measurements, but more features.
g. Find any significant correlations between features.
HINT: you might find it useful to look for trends and recurrent patterns within the data.
h. We now focus on the correlation between precipitation and cloud cover. We want to infer the probability of having moderate to heavy rain (> 1 mm/h) as a function of the cloud cover index.
HINT: you might find it useful to create a new column containing 0 if precipitation < 1 mm/h and 1 otherwise.
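A minimal sketch for (g) and (h) is given below. The column names "precipitation" and "cloud_cover" are assumptions used for illustration; replace them with the names actually present in df_perth.pkl.

```python
import pandas as pd

perth = pd.read_pickle("df_perth.pkl")

# (g) pairwise correlations between all numeric features
corr = perth.corr(numeric_only=True)
print(corr.round(2))

# (h) binary rain indicator as the hint suggests:
# 1 if precipitation > 1 mm/h, 0 otherwise
perth["rain"] = (perth["precipitation"] > 1).astype(int)

# empirical probability of moderate-to-heavy rain for each cloud cover index
rain_prob = perth.groupby("cloud_cover")["rain"].mean()
print(rain_prob)
```

Plotting rain_prob against the cloud cover index gives a direct visual answer to (h).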
1.3 Deliverable [Data Analysis Report]
The report should be written in the form of an academic paper using the ICML format (https://icml.cc/Conferences/2020/StyleAuthorInstructions). The report should be at most 8 pages long, excluding references and appendices. The report must include the following sections:
● Abstract. This section should be a short paragraph (4-5 sentences) that provides a brief overview of the methodology and results presented in the report.
● Preliminary Analysis. This section describes the study you carried out in Task 1 and should be organized into the following subsections:
• Data Understanding. This subsection should detail the data used for this study, clearly describing the content, size and format of the data, how many cities are described in the dataset, how many observations there are, and how many (and which) features are considered. Further information can be provided.
• Data Cleaning. This subsection should describe the missing data processing. It is important to describe the methodology you used to search for missing data and how you addressed it (for example, how you ensured that the dataset preserves the same statistics/properties). Clearly motivate your answers.
• Data Statistics. This subsection should describe the general statistical properties of the dataset with numerical or graphical visualization. Provide reflections on any anomalies (with clear motivation/supporting evidence).
● Outliers. This section should describe all the steps that were applied to the data to find and handle outliers during pre-processing. A justification for each step should also be provided. If no or very little pre-processing was done, this section should clearly justify why.
● Data Correlation. This section should describe the correlations between different features that you investigated in the current dataset. Even if you discover few patterns, it is important that you clearly explain and justify the methodologies you adopted. Clearly show results that support your statements.
● Conclusion. This last section summarises the findings, highlights any challenges or limitations that were encountered during the study and provides directions for potential improvements.
Please make sure you complement your discussion in each section with relevant equations, diagrams, or figures as you see fit. Most importantly, be sure that all your answers and solutions are well motivated.
Marking Criteria
The marking criteria and weights are listed below.
● Abstract / Conclusions (10%). The purpose of the executive summary is to outline the data analytics project, its inputs and envisioned outputs, as well as the key findings.
● Task 1 – Preliminary Analysis:
• Dataset Understanding (10%). Provide a clear description of the dataset, answering the following questions: i) How many cities are there in the dataset? ii) How many observations and features are there in this dataset? iii) What are the names of the different features?
• Data Cleaning – Missing data (10%). Provide a clear description of the results from your missing data analysis and the key outcomes.
• Data Statistics (10%). Describe the general statistical properties of the dataset with numerical or graphical visualization. Provide reflections on any anomalies (with clear motivation/supporting evidence).
● Task 2 – Outliers (25%). Show the visualization of the temperature measurements, together with comments on the behaviour depicted in the plots. Provide summaries of the outliers – the number of outliers detected as well as the techniques adopted to replace them (motivate your answers).
● Task 3 – Inference (25%). Data Correlation. Comment on the significant correlations you found between features and assess the rain probability as a function of the cloud cover index. Support the text with visualizations of the results and key insights on the adopted approach.
● Report Style (10%). The report needs a clean and clear structure and layout. The quality of images, tables, citations and references will also be taken into account.