
INFS 5135 Assignment – Analysis of a dataset

Introduction

The aim of this assignment is to introduce you to the analysis of routine data sets (“wild datasets”). You will need to explore issues such as writing data dictionaries, assessing data quality, exploring the data using visual tools, performing some data wrangling, planning and performing data analysis, and writing a comprehensive report that includes an account of your findings and summarises your recommendations.

For the assignment, you will be given a general scenario and a suggestion of a raw dataset. You will need to explore the given problem in more depth – this includes finding more data (datasets) relevant to the job.

You will be working in groups to produce both group and individual deliverables.

Project methodology

Data can be a product of a meticulously planned study, or it can be a side-product of practice (wild datasets). While planned studies typically yield well-defined, clean data, they are typically expensive in terms of both money and other resources. Such effort is not sustainable in the long term.

Data produced as part of routine activities or observation are, on the other hand, readily available with minimal cost. However, such data are typically incomplete, contain possible errors and require cleansing and transformation before they can be used beyond their primary purpose.

The framework we will be using for this assignment was developed by industry as the Cross-Industry Standard Process for Data Mining (CRISP-DM). This process has several phases:

 

Business understanding

Before you start any attempt to collect/analyse data you need to get a good idea why you are doing the exercise – understand the purpose. The main components are:

• Determine business objectives

– Initial situation/problem etc.

– Explore the context of the problem and context of the data collection (…types of organisations generating the data; processes involved in the data creation...)

• Assess situation

– Inventory of resources (personnel, data, software)

– Requirements (e.g. deadline), constraints (e.g. legal issues), risks

Understanding your business will support determining the scope of the project, the timeframe, budget etc.

NB: The direction of your analysis is determined by your business needs. An attempt to analyse a dataset without prior identification of the main directions would lead to extensive exploration. While this may be justified in some cases, in real business it is seldom required. You are NOT doing academic research aiming to create new knowledge – you are trying to get answers to drive your business decisions!

Data understanding

The next step is to look at what data is needed (and available) and write data definitions, so that we know exactly what we are talking about. This is very important when aggregating apparently identical data, because the definitions may not be the same. Blood pressure readings may look exactly alike, yet it makes a real difference whether the value was acquired in the ICU via an intra-arterial cannula, or is a casual self-monitoring measurement the patient took at home. Nailing down the date format is similarly important, especially when aggregating data from different sources: 02/03/12 can be the 2nd of March 2012 (dd/mm/yy), the 3rd of February 2012 (mm/dd/yy), or the 12th of March 2002 (yy/mm/dd). Explicitly describe any coding schemas, etc.
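The date ambiguity described above can be demonstrated in a few lines of pandas; this is a minimal sketch, and the variable names are illustrative:

```python
import pandas as pd

# The same raw string parses to three different dates depending on the
# format the source used: this is why the data dictionary must record
# the date format explicitly.
raw = "02/03/12"
day_first   = pd.to_datetime(raw, format="%d/%m/%y")  # 2 March 2012
month_first = pd.to_datetime(raw, format="%m/%d/%y")  # 3 February 2012
year_first  = pd.to_datetime(raw, format="%y/%m/%d")  # 12 March 2002

print(day_first.date(), month_first.date(), year_first.date())
```

Passing an explicit `format` (rather than relying on automatic parsing) forces you to state the assumption in code, which can then be copied into the data dictionary.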

• Collect initial data

– Acquire data listed in project resources

– Report locations of data, methods used to acquire them, ...

• Describe data

– Examine "surface" properties

– Report for example format, quantity of data, ... → Data dictionary

– NB: the data dictionary summarises your knowledge of each piece of data. This description can be considered part of the dataset: each piece of data comes with metadata describing its meaning, coding, context of collection etc. In many cases you will be given these descriptions along with the dataset

• Explore data

– Examine central tendencies, distributions, look for patterns (visualisations etc.)

– Report insights suggesting examination of particular data subsets (data selection)

• Determine data quality (consider the dimensions of data quality)

– Completeness

– Uniqueness

– Timeliness

– Validity

– Accuracy

– Consistency
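Some of these dimensions can be quantified directly. A sketch of what that might look like, where the tiny job-listings frame and its column names are invented for illustration:

```python
import pandas as pd

# Toy extract; real column names and values will differ.
df = pd.DataFrame({
    "job_id": [101, 102, 102, 104],                        # 102 twice: uniqueness issue
    "title":  ["Business Analyst", None, "Data Analyst", "BA"],
    "salary": [90000, 85000, 85000, -1],                   # -1 is implausible: validity issue
})

completeness = df.notna().mean()        # share of non-missing values per column
uniqueness = df["job_id"].is_unique     # False: a key value is duplicated
validity = (df["salary"] > 0).mean()    # share of salaries in a plausible range

print(completeness)
print("job_id unique:", uniqueness, "| salary validity:", validity)
```

Timeliness, accuracy and consistency usually need external context (collection dates, a reference source, cross-field rules) rather than a one-liner, but should be reported in the same quantitative style where possible.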

NB: this is an initial exploration – scouting the problem space. It helps you understand what data is available and align your approach with the business objectives and the available data. At the same time, this phase can help to verify whether the project is viable (feasibility) and to refine the project scope, budget, resources etc.

This phase is very different from a typical prospective research approach, where you design the study so that you always know what you are getting…

Data preparation

Typically, the data you get is not in the right format for analysis (it was collected for other purposes) and needs to be pre-processed.

• Select data

– Relevance to the data mining goals

– Quality of data

– Technical constraints, e.g. limits on data volume

• Clean data

– Raise data quality if possible

– Selection of clean subsets

– Insertion of defaults

• Construct data

– Derived attributes (e.g. age = NOW – DOB; possibly subsequent coding of age into buckets etc.) – do not forget to add these attributes to your data dictionary!

• Integrate data

– Merge data from different sources

– Merge data within source (tuple merging) 

• Format data

– Data must conform to the requirements of the initially selected mining tools (e.g. the input format for Weka differs from that for Disco).
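The "Construct data" step above, for example the age = NOW – DOB derivation followed by bucketing, might be sketched as follows. The field names, the reference date and the bucket boundaries are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical source field: date of birth.
people = pd.DataFrame({"dob": pd.to_datetime(["1990-06-15", "1975-01-02", "2001-11-30"])})

# Fix "NOW" explicitly so the derivation is reproducible (and documentable
# in the data dictionary); using the wall clock would make results drift.
now = pd.Timestamp("2024-01-01")

people["age"] = (now - people["dob"]).dt.days // 365          # derived attribute
people["age_band"] = pd.cut(people["age"],                     # subsequent coding into buckets
                            bins=[0, 25, 45, 65, 120],
                            labels=["<=25", "26-45", "46-65", "65+"])
print(people)
```

Both `age` and `age_band` (including the formula, the fixed reference date and the bucket edges) would then be added to the data dictionary, as the text above requires.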

Modelling

This phase goes hand-in-hand with the data preparation. Here you select what analytic techniques you are planning to use, in which sequence etc. Once you have the analysis design, you execute it.

• Select modelling technique

– Finalise the methods selection with respect to the characteristics of the data and purpose of the analysis

– E.g., linear regression, correlation, association detection, decision tree construction…

• Generate test design

– Define your testing plan – what needs to be done to verify the results from analysis (verify the validity of your model). E.g.:

• Separate test data from training data (in case of supervised learning)

• Define quality measures for the model

• Build model

– List parameters and chosen values

– Assess model
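A minimal sketch of the test design for supervised learning, using invented toy data: separate a test set before any fitting, and compute the quality measure (here, accuracy) on the held-out rows only.

```python
import numpy as np

# Toy data: a single feature whose sign determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100,))
y = (X > 0).astype(int)

# Test design: random 70/30 split with no overlap between the sets.
idx = rng.permutation(len(X))
train, test = idx[:70], idx[70:]

# "Train" a one-parameter model on the training rows only.
threshold = X[train][y[train] == 1].min()

# Quality measure, defined up front, evaluated on held-out rows only.
pred = (X[test] >= threshold).astype(int)
accuracy = (pred == y[test]).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

The point is the discipline, not the model: the test rows never influence the fitted parameter, so the reported accuracy is an honest estimate rather than an overfitting artefact.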

At the end of the Data preparation/Modelling phase you have a set of results coming from the analysis (you have a model).

NB: this needs to be assessed and evaluated from the technical point of view (to mitigate issues such as overfitting etc.).

Evaluation

Here you evaluate the results (model) from the business perspective (Did we learn something new? How do the results fit into knowledge we already have? Does the predictive model work? etc.).

• Evaluate results from business perspective

– Test models on test applications if possible

• Review process

– Determine if there are any important factors or tasks that have been overlooked

• Determine next steps (Recommendations)

– Depending on your analysis (results, interpretations) you need to recommend what the next step will be. In general, the next step can be:

• Deploy the solution (you reached a stage where you have a viable solution)

• Kill the project (you exhausted all meaningful options and decide that continuing the project is not viable/feasible from a business point of view)

• Go into the next iteration.

• Improve the model.

• Build an entirely new model.

NB: Do not jump to decisions without the analytic evidence to support such decisions (recommendations).

Deployment

In this phase you conclude the project.

• Plan deployment

– Determine how results (discovered knowledge) are effectively used to reach business objectives

• Plan monitoring and maintenance

– Results become part of day-to-day business and therefore need to be monitored and maintained.

• Final report

• Project review

– Assess what went right and what went wrong, debriefing

NB: Deployment can be a launch of a new project with its own problems. E.g. you have a static data extract you can use to develop a solution. Once you have a viable solution, deploying it will require connection to live data input feeds. This opens a whole new set of issues to be solved:

· Automate data extraction

· Automate semantic interoperability and data linkage

· Automate data quality monitoring

· Design, develop and deploy security context

· Etc.

Caveats

The CRISP-DM framework describes the phases in a rather linear (cyclic) fashion. In theory, it can be done that way. In reality, however, this is an exploratory process that frequently proceeds by trial and error. You will work with the data and use frequent visualisation to “see” the patterns. Then you confirm what you “see” with more formal statistics.

General scenario

A US consulting company was engaged to analyse the job market for people with business analyst qualifications. They were able to scrape data from LinkedIn on job listings posted in 2024.

Your task is to look at patterns related to jobs requiring a Business Analyst qualification (such as what jobs require this skill, which employers look for people with this skill, where the jobs are located, what other skills are listed along with the Business Analyst skill etc.).

You will have to deal with several challenges, such as the size of the datasets, decomposing skills listed in one field, matching skills to job descriptions etc.
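Decomposing a packed skills field is one such challenge. Assuming skills arrive as a single delimited string per listing (a guess at the layout, not the actual LinkedIn schema), a long format with one row per skill can be produced like this:

```python
import pandas as pd

# Hypothetical layout: one delimited skills string per job listing.
jobs = pd.DataFrame({
    "job_id": [1, 2],
    "skills": ["Business Analyst; SQL; Tableau", "SQL; Python"],
})

# Split each string into a list, then explode to one row per (job, skill).
long = jobs.assign(skill=jobs["skills"].str.split("; ")).explode("skill")
counts = long["skill"].value_counts()
print(counts)
```

The long format makes skill co-occurrence questions (what else is listed alongside Business Analyst?) a matter of simple grouping rather than string matching.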

You will need to explore the problem space (reading and mind maps) and declare a narrowed-down focus (the time and resource limitations do not allow a complete study). You will need to decide how to work with a collection of large data sources, extract relevant parts and possibly find additional data (from public sources). In this course you are expected to do the first iteration (and recommend next steps at the end of it – this typically leads to planning the 2nd iteration of the project). NB: you may not be able to reach a stage where you have a business solution, so do not jump to conclusions!

Business understanding

Explore the dataset and the source of this data. You will discuss this in your group and document the discussion by drawing mind maps (individual as preparation for the group discussion; then final group mind map representing your understanding of the problem).

Annotate (CRAAP) relevant publications (you annotate 2 publications, but you read as many as necessary). Brainstorm and summarise your findings in the group. Decide on the focus for your analysis – what factors you expect to go into the model and why.

Write a brief justification of the project – make your decisions explicit.

Analysis of data

For this part of your assignment, you will need to identify and acquire relevant datasets from public sources. Your task will be to have a look at the data (with the understanding gained from your previous reading – if you do not yet have a sufficient idea, do some more reading) to understand your data:

· Extract a data dictionary from the data source documentation and add description of any data you construct. Note any assumptions you made. NB: you add all you know about each piece of data into the data dictionary.

· Select which data you will be using for your analysis (and justify your choice)

· Consider any additional data/datasets you may need for your analysis (and document them)

· Construct data you think you need – justify why you need this data, and describe in detail (in data dictionary) how you are going to construct the data point (formulas, …)

· Explore the data (e.g. basic statistics, graphs…)

· Comment on data quality (refer to the 6 dimensions of data quality mentioned in the lecture) – BOTH at the dataset level (e.g. selection bias) and variable level

Based on your understanding of the purpose of analysis and the data you got you:

· Make your choices on analytic methods (start with basic stats and visualisations) and justify your choices.

· Format/re-format data – what changes need to be made for the methods you apply (NB: if there is no need for re-formatting, briefly state this)

· Write an analysis plan – to discuss in the class

At this point you should have a reasonably clear idea on what you plan to do with the data, as well as what transformations were needed to prepare the data. You execute your analytic plan (modelling...):

· Perform the analysis as you proposed it, considering any comments you may have received.

You may need to go back to data preparation or do some additional reading – the process is not linear! Do not forget to check the (technical) validity of your results (e.g. overfitting...)

Now you have your results, you evaluate them and write comments and recommendations. You need to discern findings (facts you found; evidence coming directly from your analysis); interpretations (what do *you* think the findings mean - use your data/business understanding here) and finally recommendations (what you suggest being next steps: such as – do more analysis, collect a specific dataset, do a study focussing on something more specific; or how to use the model if you think it is good). Make sure your recommendations are consistent with your findings and interpretations.

(Evaluation of a predictive model – generate a confusion matrix; comment on it and recommend what might be the next steps to improve the performance of the model)
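A confusion matrix can be generated directly from actual and predicted labels; the labels below are invented purely to show the mechanics:

```python
import pandas as pd

# Illustrative labels only; in the assignment these come from your model
# evaluated on held-out test data.
actual    = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="actual")
predicted = pd.Series([1, 0, 0, 1, 0, 1, 1, 0], name="predicted")

confusion = pd.crosstab(actual, predicted)   # rows: actual, columns: predicted
accuracy = (actual == predicted).mean()

print(confusion)
print(f"accuracy: {accuracy:.2f}")
```

The off-diagonal cells are where the commentary belongs: which class is being confused with which, and whether false positives or false negatives matter more for the business question drives the recommendation for the next iteration.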

Formatting

Your document is supposed to be aimed at a professional audience (a consulting company) – adjust your style accordingly. Both assignments form one project, so you recycle some of your deliverables from Assignment 1 in Assignment 2 (data dictionary, data quality comments etc.). Re-using some of these components may lead to higher Turnitin scores.

Please do not write lengthy introductions (your audience is expected to know their business!).

Images and tables are expected to have captions. Lengthy components (such as Data dictionary) are expected to be presented as appendices (and referred to in the document whenever appropriate).

Use references only if you need them (no merit in “backfilling” references) but use as many references as necessary to document any work which is not yours. Preferred format is Harvard, but you can use any other format if you use it consistently throughout the entire document.

Word count – use as many words as you need. There will be no penalty for exceeding the word count. I may consider deducting points for excessive “fluff” (unnecessary fillers).

Assignments

The work described above is split into 2 assignments. The following sections describe the expected deliverables for groups and individuals (NB: in many cases the group deliverable is derived from individual contributions).

Assignment 1

In this part you will do:

Group:

· Mind map of the problem (result of brainstorming; distillation from individual mind maps)

· Project justification and scope (explain what is going to be the main goal of your analysis)

· Data dictionary – this includes a consistent description of data – both copied/adjusted from the data source and description of data the group members constructed

· Summary of dataset exploration (datasets you were given PLUS any additional datasets you consider using in your analysis – e.g. socio-economic data...).

· Analysis plan with justification, assigning work to individuals (What do you think the data is telling you – based on your preliminary exploration – and how are you going to confirm/reject your hypotheses with statistical/analytic methods...)

Individual

· Bibliography - find, read, and evaluate (CRAAP test) 2 sources as a basis for group discussions and brainstorming. Draft an individual mind map. 

· Mind map of the problem (your individual preparation for the group discussion)

· Data dictionary – detailed description of the data you are going to use (including data construction – any derived data you may need to create)

· Result of exploring the data – each group member will submit result of their exploration of the dataset (what was done, why it was done, what was the result, what do you think about the result; this includes any visualisations you have done)

· Data quality analysis – at the dataset level and at each variable level (what are the problems, can they be fixed? How data quality will influence the validity/trustworthiness of results...). Refer to data quality dimensions to structure this deliverable. Use quantitative measures whenever applicable (e.g. you identify missing data: you need to state what is the proportion of missing data for each variable; look for patterns of missing data – randomly distributed or some specific relationships...).
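Reporting missing-data proportions per variable, and a first check for patterns of missingness, could look like the sketch below; the toy frame and its column names are assumptions:

```python
import pandas as pd

# Toy extract with invented columns; the real variables will differ.
df = pd.DataFrame({
    "title":  ["BA", "Analyst", None, "BA", "Data Analyst"],
    "salary": [90000, None, None, 85000, None],
    "state":  ["CA", "NY", "TX", None, "WA"],
})

# Proportion of missing values per variable (the quantitative measure asked for).
missing_share = df.isna().mean().round(2)
print(missing_share)

# Quick pattern check: is salary more often missing when title is also missing?
print(df["salary"].isna().groupby(df["title"].isna()).mean())
```

If missingness in one variable depends on another variable, it is not missing at random, and that relationship belongs in the data quality commentary because it affects the trustworthiness of any result built on those fields.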

NB: Check the course calendar for which task is due at what time. Feedback will be given in the practical class (external students – feedback will be either in writing by e-mail, or via phone/teleconference).

Hint: you may use a smaller extract (sample) from the large dataset for initial exploration. Then you will have to think about how to prepare the dataset you will actually be analysing (removing unnecessary parts etc.)
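Reading only an initial extract of a large CSV is straightforward in pandas; in the sketch below an in-memory "file" stands in for the real (much larger) dataset, whose name and layout are not specified here:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk: 1000 rows with invented columns.
big_csv = io.StringIO("job_id,title\n" + "\n".join(f"{i},role{i}" for i in range(1000)))

# nrows reads only the first N rows, keeping memory use small during
# initial exploration; for a random (rather than head) sample, the
# skiprows parameter can be used instead.
sample = pd.read_csv(big_csv, nrows=100)
print(len(sample))
```

Note that a head sample can be biased if the file is sorted (e.g. by posting date), which is worth stating when you report exploration results from such an extract.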

Assignment 2

In this part you will do:

Group:

· Results of analysis compilation – summary of results/findings produced by individual group members (results from your objective assessment: in your exploration you identified interesting patterns/relationships – now you need to confirm these with objective methods and report the results. E.g. exploration: 2 lines appear to correlate → hypothesis of correlation → you calculate the correlation coefficient → result confirming/rejecting the correlation)

· Description of the model you created, how you tested it and its performance (test results)

· Final report - includes interpretation of findings and Recommendations

Individual:

· Analysis – as assigned by the group in the analysis plan

· Result of analysis (what you found, what the data tells you – i.e. trends, patterns; results of testing your model etc.; NB: this is about facts, not interpretations or opinions)

· Interpretation of your findings (here is where you express your opinions, interpretations etc. of what you found. Interpretation puts your results into context with other aspects of the “business” – you may need to do some additional reading). In this section you submit all your results and interpretations, even if they do not become part of the group final report.


