
INFS5720

Business Analytics Methods

Individual Assignment

Term 1, 2025

This assignment covers Lectures 1 to 3. It accounts for 15% of the final grade for Business Analytics Methods. The deadline is 21 March 2025, 15:00:00. Do not wait until the last minute. Late submissions (even by a few seconds) will still be marked as late by Moodle. The teaching team strictly follows the flagging mechanism of Moodle. UNSW has a standard late submission penalty of:

5% of the full marks per day

capped at five days (120 hours) from the assessment deadline, after which a student cannot submit an assessment

no permitted variation

You are to submit a Word document (not PDF) to Moodle: Left menu > Assessments Hub > Individual Assignment > Individual Assignment Submission. Turnitin is turned on to check the similarity score across all submissions. To avoid a high Turnitin score, do NOT copy the assignment questions into the report. The similarity score is not generated upon submission. This is to prevent students from relying on the Turnitin score and tuning it through repeated resubmission. If the work is done independently, the similarity score should not be an issue.

Every page’s header should contain Your zID, similar to this Individual Assignment guideline file. Do NOT write your name. A cover page is optional.

Please use "Your zID" for Submission Title when you upload. The file name should also be “Your zID.docx”. Submissions that do not adhere to this will be penalized.

Details of report format:

Length: should not exceed 4 pages, including the relevant graphs, tables, references, screenshots, and appendices (if any), but excluding the cover page (a cover page is optional). This limit is deliberately set at 4 pages to ensure that AI’s lengthy answers are summarized succinctly and to the point.

Font style: Times New Roman for writing; Courier New for code (if any)

Font size: 12 for writing; 10 for code (if any)

Line spacing: 1

Margins: 1 inch or 2.5cm for the top, bottom, right and left

Include the page number on each page

Up to 25% of full marks as penalties will be imposed for inappropriate or poor paraphrasing. Serious cases will be investigated. More information on effective paraphrasing strategies can be found on

https://www.student.unsw.edu.au/paraphrasing-summarising-and-quoting.

Your writing should be succinct but not at the expense of excluding relevant details.

Use plain and simple language. Some questions may not come with absolutely right or wrong answers, and you have the liberty to express your views about the problem.

However, your points must be supported by evidence and sound reasoning. It is the quality and not the length that counts. Make sure you follow the report guidelines and style specified in this assignment.

Please follow the APA style of referencing. More details can be found at

https://www.student.unsw.edu.au/apa. Where students use ChatGPT or any Generative AI tool in their work, this must be appropriately cited according to discipline norms, e.g., right below the written paragraph that used Generative AI, or included in appendix. How to reference Generative AI within APA can be found at

https://apastyle.apa.org/blog/how-to-cite-chatgpt

Any student may be called upon to provide a viva voce (from the Latin meaning ‘living voice’) for any assignment. A viva voce is an interview-style meeting where you will be asked to explain, discuss, or use information related to any assignment or work produced for this course. These can be used to ascertain knowledge and ability, including the extent to which the student has undertaken the required reading, done preparatory work, and can demonstrate understanding of what they have written or presented. Viva voces are used in conjunction with submitted assessment work, not instead of submitted work. (Used with permission; created by Assoc Prof. Lynn Gribble, UNSW Sydney.)

The answers should be presented in order according to the sequence of the questions listed in the assignment; that is, in the order of Q1 a), Q1 b), Q2 a), etc. You can have several sub-sections within a section if you deem it appropriate. The report must be self-contained. It is essential to include all relevant tables and figures as evidence to support your answers.

Summary:

• Write in plain English clearly and succinctly

• Write appropriately to the context (AI’s answer is usually too generic)

• Provide a reference at the end of the report

• Good overall presentation of the report

Overview

“Individual Assignment.ipynb” guides students through standard operations on the data set and, in some cases, provides a model implementation that is almost complete, so that students can focus on interpreting the results. Do NOT submit the .ipynb file.

This assignment is worth a total of 60 marks.

As an Analyst in the Analytics team of a women's hospital, your role is to analyse patient data from the diabetes diagnosis process. Your goal is to uncover patterns, assess risk factors, and provide insights that can help improve early detection, patient care, and treatment strategies for diabetes within the female patient population.

The dataset is in ‘Diabetes_Diagnosis.csv’. The description of the table is in ‘Diabetes_Diagnosis_Description.xlsx’.

Before you run any code for a sub-question, please read the description and the instructions for that sub-question in the code file very carefully, to understand the purpose of the code and how to run it correctly.

Question 1

We will use K-means to study the hidden patterns in this dataset. The pre-processing step uses normalisation via MinMaxScaler, with a predetermined min and max, to rescale all columns to the range [0,1]. This is important so that all variables have equal impact on the clustering results.
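As a rough illustration of the scaling step described above, the sketch below shows one way to make scikit-learn's MinMaxScaler use fixed bounds rather than the sample's own min and max: fit it on the two bound rows, then transform the data. The column names, bounds, and values are assumptions for illustration only; the given notebook may implement this differently.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical columns and bounds; the real ones come from the assignment files.
columns = ["Glucose", "BMI"]
predetermined_min = [0.0, 0.0]
predetermined_max = [200.0, 50.0]

# Toy data, one row per patient, in the same column order as above.
X = np.array([[100.0, 25.0],
              [150.0, 40.0],
              [50.0, 10.0]])

# Fitting on the two bound rows makes the scaler store the fixed min and max
# instead of whatever the sample's min and max happen to be.
scaler = MinMaxScaler()
scaler.fit(np.array([predetermined_min, predetermined_max]))

# Each value becomes (x - min) / (max - min) for its column.
X_scaled = scaler.transform(X)
print(X_scaled)
```

Values that fall outside the predetermined bounds would land outside [0,1] under this approach, so the bounds need to cover the full plausible range of each column.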

(a) There are two options for running the K-means clustering algorithm. Option 1 is to use all columns. Option 2 is to exclude the ‘Outcome’ column. The given code produces each variable’s distribution in each cluster and specifically compares each scaled and original column's mean and median values across all clusters. Discuss which option produces more useful clustering results and why. (10 marks)
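The mechanics of the two options can be sketched as follows, on synthetic data rather than the assignment dataset. Apart from ‘Outcome’, the column names are placeholders, k=2 is an arbitrary choice, and the helper `cluster_summary` is hypothetical; the sketch only shows how the per-cluster mean and median tables might be produced, not which option is preferable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the scaled data (all columns already in [0, 1]).
# "Glucose" and "BMI" are illustrative names, not confirmed by the brief.
scaled = pd.DataFrame({
    "Glucose": rng.random(n),
    "BMI": rng.random(n),
    "Outcome": rng.integers(0, 2, n).astype(float),
})

def cluster_summary(frame, k=2):
    """Fit K-means on `frame` and return per-cluster mean and median of each column."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(frame)
    return frame.assign(cluster=labels).groupby("cluster").agg(["mean", "median"])

# Option 1: cluster on all columns, including Outcome.
option1 = cluster_summary(scaled)

# Option 2: cluster without the Outcome column.
option2 = cluster_summary(scaled.drop(columns="Outcome"))

print(option1)
print(option2)
```

Comparing the two summary tables side by side shows which columns drive the split under each option.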

(b) In the given code, we run K-means with k ranging from 2 to 15 and plot the elbow line with respect to Sum of Squared Distances. A plot regarding the Average Silhouette Score is also provided for your reference. Pick the best k in your opinion and state your reason why this k value is the best. (10 marks)
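For readers unfamiliar with how the two diagnostics are computed, here is a minimal sketch on synthetic blobs (not the assignment data). It records the Sum of Squared Distances, which scikit-learn exposes as `inertia_`, and the average silhouette score for each k from 2 to 15. The dataset and random seeds are assumptions chosen only to make the example self-contained.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four loose groups, standing in for the scaled patient data.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(2, 16)
inertias, silhouettes = [], []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)                          # Sum of Squared Distances
    silhouettes.append(silhouette_score(X, model.labels_))   # average silhouette, in [-1, 1]

for k, sse, sil in zip(ks, inertias, silhouettes):
    print(f"k={k:2d}  SSE={sse:10.2f}  silhouette={sil:.3f}")
```

The Sum of Squared Distances tends to fall as k grows, so the elbow is read from where the decrease flattens, while a higher silhouette score indicates better-separated clusters. The two criteria do not always agree.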

(c) Rerun K-means with the best k value in your opinion. Run the given code to see the data distribution of all columns in each cluster. Based on the variables that are significantly different across clusters, study the unique characteristics of each cluster and give an intuitive name to each cluster, so that you can quickly convey the cluster results to the medical team. For each cluster, suggest to the relevant medical teams how they should handle it differently in the next steps, e.g. follow-up consultations, health check-up reminders, etc. (10 marks)
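One simple way to shortlist the variables that differ most across clusters is to rank columns by the gap between the largest and smallest cluster mean. The sketch below uses a tiny hypothetical table with made-up column names and cluster labels. It is a heuristic for deciding where to look first, not a statistical significance test.

```python
import pandas as pd

# Hypothetical scaled data with cluster labels already assigned.
data = pd.DataFrame({
    "Glucose": [0.2, 0.3, 0.8, 0.9],
    "BMI":     [0.5, 0.5, 0.6, 0.4],
    "cluster": [0, 0, 1, 1],
})

# Mean of every column within each cluster.
means = data.groupby("cluster").mean()

# Gap between the highest and lowest cluster mean, largest first.
spread = (means.max() - means.min()).sort_values(ascending=False)
print(spread)
```

In this toy table the ‘Glucose’ means differ between the two clusters while the ‘BMI’ means are identical, so ‘Glucose’ would be the natural basis for naming the clusters.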

Question 2

Your next task is to predict whether the patient has diabetes, by building a Logistic Regression Model. You are predicting the ‘Outcome’ column, using all other columns as input variables.

(a) Run the given code of Logistic Regression. Discuss the P-values and coefficients generated for two variables: ‘SkinThickness’ and ‘Pregnancies’. Explain in plain English the impact of these two input variables on the target variable Outcome. (10 marks)
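When reading logistic regression output, it can help to convert a coefficient into an odds ratio: a coefficient β means the odds of the positive class are multiplied by exp(β) for each one-unit increase in that variable, holding the others fixed. The sketch below applies this to hypothetical numbers. The variable names, coefficients, p-values, and the 0.05 significance level are all assumptions and are NOT results from the assignment data.

```python
import math

# Hypothetical coefficients and p-values, NOT results from the assignment data.
fitted = {
    "variable_a": {"coef": 0.12, "p_value": 0.001},
    "variable_b": {"coef": -0.003, "p_value": 0.62},
}
ALPHA = 0.05  # conventional significance level (an assumption)

summary = {}
for name, stats in fitted.items():
    # Multiplicative change in the odds per one-unit increase in the variable.
    odds_ratio = math.exp(stats["coef"])
    summary[name] = {
        "odds_ratio": odds_ratio,
        "pct_change_in_odds": (odds_ratio - 1) * 100,
        "significant": stats["p_value"] < ALPHA,
    }
    print(name, summary[name])
```

With these made-up numbers, `variable_a` raises the odds by roughly 12.7% per unit and is statistically significant, whereas `variable_b` has a p-value above the chosen level, so its small negative coefficient cannot be distinguished from no effect.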

(b) We define target variable Outcome=1 as the positive class, i.e., the patient has diabetes. Explain in plain English what a False Negative (FN) case and a False Positive (FP) case are. Discuss which one, FN or FP, is worse and whether the predictive model of your hospital should be optimized for Precision or Recall. (10 marks)
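The definitions involved can be made concrete with a small worked example. The labels below are invented for illustration, with 1 as the positive class as defined above. Precision is TP / (TP + FP) and Recall is TP / (TP + FN).

```python
# Invented labels: 1 = has diabetes (positive class), 0 = does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged as positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positive case the model missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negative case flagged as positive
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly left unflagged

precision = tp / (tp + fp)  # of all flagged cases, the share that are truly positive
recall = tp / (tp + fn)     # of all truly positive cases, the share that were flagged

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```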

(c) The model above uses a default threshold of 0.5 for diagnosing diabetes. Run the given code to try thresholds from 0.1 to 0.9. As the threshold rises from 0.1 to 0.9, what do you observe about Precision and Recall? Based on the hospital’s goal of minimising misdiagnosed cases while ensuring timely intervention, suggest the best threshold and justify your choice. (10 marks)
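A threshold sweep of this kind can be sketched in a few lines. The true labels and predicted probabilities below are invented, and the helper `precision_recall` is hypothetical, so the printed numbers say nothing about the assignment model. The sketch only shows how each threshold turns probabilities into predictions and how the two metrics are then recomputed.

```python
# Invented true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
proba = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

def precision_recall(y_true, proba, threshold):
    """Classify as positive when probability >= threshold, then score the result."""
    y_pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

thresholds = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
results = [(t, *precision_recall(y_true, proba, t)) for t in thresholds]

for t, prec, rec in results:
    print(f"threshold={t:.1f}  precision={prec:.3f}  recall={rec:.3f}")
```

Raising the threshold can only remove cases from the predicted-positive set, so Recall can never increase as the threshold goes up. Precision usually moves in the opposite direction, but it is not guaranteed to change monotonically.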