AMLH 2024 Assignment Instruction (Full)

Deadline for submission: Wednesday, 10th July 2024, 17:00 GMT

• Length: The summative assessment refers to the outcome of the entire module and culminates in a documented paper of 2000-2500 words (excluding figures, references) and the submission of an iPython-notebook.

• Python packages: tensorflow, pytorch, keras, scikit-learn, etc.

• Each student must select one dataset from the provided list. The maximum number of students for any topic/dataset is 23, allocated on a first come, first served basis. (Last year, all students selected their favourite topics.)

Dataset options for the Assignment:

A.  CT 3D Volume Segmentation

B.   Glucose prediction for T1D

C.  Identify phenotypes from clinical notes

The assignment should follow this overall outline (the requirements of each dataset are provided in a separate section):

1. Introduction

Please describe the background, motivation and importance of the data in light of related literature. Show a sound interpretation of the medical problems presented in the data. Outline the selected dataset (including features and class labels) and provide descriptive statistics of the contained variables. Visualise the data or feature space in a plot if it is possible and explore the underlying characteristics.

[15 marks, 400 words]

2. Methodology

2.1 Preprocessing

Please describe details of preprocessing of the data, including data cleaning, imputation, normalization, augmentation, up or down sampling, feature engineering etc. of the chosen dataset.

[15 marks, 450 words]

2.2 Algorithm design and implementation

Select two AI models of the course, or two neural networks which have very different structures. Give a high-level description of both algorithms, including their rationale or model structure. Describe and demonstrate for both algorithms:

a) How to generate training and testing data in an appropriate format that suits the AI model

b) Optimization of hyper-parameters

c) Model evaluation based on the outcomes, including widely-used criteria mentioned in the course

Demonstrate your solution with an attached iPython notebook. Ensure reproducibility and transparency.

[35 marks, 650 words]

3. Results

Present optimized hyper-parameters and reasonable evaluation criteria such as a confusion matrix, precision, recall, RMSE, MARD, F1, AUC and a ROC-plot, etc. Provide an analysis for both algorithms with different parameters and give a textual description of the results. (Please choose appropriate metrics carefully.)

[25 marks, 600 words]

4. Discussion and Conclusion

Compare and discuss your findings (results of two algorithms). If it is possible, it would be good to compare with other scientific publications that used the same medical dataset. Discuss how you would improve your methodology, current limitations and future work.

[10 marks, 400 words]

5. Reference

6. Appendix

Attach a reproducible iPython Jupyter notebook.

Total: 100 marks

Plagiarism or collusion is not allowed. A student may fail the module outright if the assignment or the code notebook fails the plagiarism check.

Markers will look for the following sections in the assignment:

•    Sound understanding of the provided dataset and appropriate pre-processing to obtain a dataset that is suitable for the machine learning model

•    Appropriate selection and learning/training of two algorithms (one of them can be seen as the baseline) to address the target problems in medical imaging, time series or NLP

•    Evaluation of the performance and meaningful discussion

•    Other requirements that have been asked in the associated dataset instruction

•    The layout, presentation, references of the paper/report

If you have any questions regarding the dataset or coursework, please post them in the forum on the Moodle page, or email Kevin/Ken/Honghan directly.

Provisional mark and feedback will be released in Autumn 2024.

CT 3D Volume Segmentation

Data:

1. Introduction:

In this assignment, we will use a subset of the Pancreas data from the Medical Segmentation Decathlon. Your task will be to segment the 3D volumes and identify the regions of interest (pancreas). The dataset contains 35 3D volumes (30 train and 5 test volumes) from portal venous phase CT scans.

Images were provided by Memorial Sloan Kettering Cancer Center (New York, NY, USA).

2. Data format

The image data are greyscale 256 x 256 images. The volumes are constructed with a different number of slices per scan.

The ground truth label data has 3 classes:

•    0 if the pixel is part of the image background

•    1 if the pixel is part of the pancreas

•    2 if the pixel is part of the cancer

Reference

http://medicaldecathlon.com/

Antonelli, M., Reinke, A., Bakas, S. et al. The Medical Segmentation Decathlon. Nat Commun 13, 4128 (2022). https://doi.org/10.1038/s41467-022-30695-9

Problem formulation

Your task will be to segment the 3D volumes and identify the regions of interest (pancreas).

You will compare the performance of a baseline model and an improved model. You will need to correctly process the image data into a suitable format for your model. In this assignment, you are not expected to segment the tumour from the pancreas.
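Since the tumour does not have to be separated from the pancreas, one possible reading of the task is to merge labels 1 and 2 into a single foreground class, which also matches the single output channel of the baseline model. The sketch below illustrates that reading only; the helper name `to_binary_mask` and the toy array are hypothetical, and the brief does not prescribe this step.

```python
import numpy as np

def to_binary_mask(label_volume):
    """Merge pancreas (1) and cancer (2) into one foreground class.

    This merging is an assumed interpretation of the task, not a stated requirement.
    """
    return (np.asarray(label_volume) > 0).astype(np.float32)

# Toy 2 x 3 label slice containing all three classes.
toy_label = np.array([[0.0, 1.0, 2.0],
                      [0.0, 0.0, 1.0]])
binary = to_binary_mask(toy_label)

# Share of foreground pixels, a simple number to report under "class balance".
foreground_fraction = float(binary.mean())
```

On real scans the foreground fraction is likely to be very small, which is one way to motivate the class-imbalance strategy requested in the report outline.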

Organisation of report

1.   Introduction (Include relevant literature)

1.1. Background (e.g.

1.2. Motivation and rationale

1.2.1. Why is this an important problem in healthcare?

1.2.2. Problem definition

2.   Methodology (Note: There must be enough detail for reproducibility)

2.1. Data description

2.1.1. Data size and labels

2.1.2. Class balance

2.2. Data Pre-processing (e.g. normalisation, resizing, filtering, etc)

2.2.1. Visualise

2.3. Description of networks:

2.3.1. UNet + objective function

2.3.2. Your alternative model (e.g. finetuned model, 3D CNN) or UNet extension

2.4. Describe training protocol

2.4.1. Train/validation split

2.4.2. Strategy to counter data imbalance

2.4.3. Strategy for preventing overfitting

2.4.4. Other settings e.g. optimiser, learning rate, initialisation, etc

2.4.5. Augmentation protocol

2.4.6. Hyperparameter tuning

2.5. Performance evaluation metrics, description and justification

3.   Results

3.1. Training and validation curves and performance metrics

3.2. Test set performance

3.3. Visualise best and worst predictions in images with large target area and small/no target area

3.4. Comparison of models

4.   Discussion

4.1. Evaluate the performance of your models and compare

4.2. Discuss limitations of your analysis and model implementations

4.3. Discuss possible future directions to address the limitations identified

5.   Extra points for distinction (there are no extra points for having more than 2 models)

5.1. High accuracy and precision of segmentation

5.2. Implementing scheduler for better convergence

5.3. What is the effect of your strategies in balancing classes and image augmentation
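For the performance evaluation metrics in item 2.5 of the outline, the Dice coefficient is a common choice for segmentation. The following is a minimal NumPy sketch, assuming binary masks of equal shape; the function name `dice_score` and the convention for two empty masks are choices made here for illustration, not requirements of the brief.

```python
import numpy as np

def dice_score(pred_mask, true_mask, eps=1e-7):
    """Dice = 2|A and B| / (|A| + |B|) for two binary masks of the same shape."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    intersection = np.logical_and(pred, true).sum()
    denominator = pred.sum() + true.sum()
    if denominator == 0:
        # Convention chosen here: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / (denominator + eps))

toy_pred = np.array([[1, 1, 0], [0, 0, 0]])
toy_true = np.array([[1, 0, 0], [0, 1, 0]])
score = dice_score(toy_pred, toy_true)            # one overlapping pixel, two per mask
empty_score = dice_score(np.zeros((2, 2)), np.zeros((2, 2)))
```

The empty-mask convention matters for slices with no target area, so whichever convention is used should be stated in the report.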

Coding

Demonstrate your solution with an attached iPython notebook. Ensure reproducibility and transparency.

Starter code to load images and the baseline UNet segmentation model (using MobileNet as an encoder, pretrained on ImageNet images) is provided below.

Starter code to load images

The images provided in this assignment are in the NIfTI format, with the extension of .nii.gz

Use the nibabel package and its load() function to load the files. Then use the get_fdata() function to get the floating point data array.

Note: you may need to first install the nibabel package using

pip install nibabel

"""
Use the following code to load the data and labels from the .nii.gz files

The file structure should be as follows:
data_dir
├─ imagesTest
│  ├─ pancreas_100.nii.gz
│  ├─ ...
│  └─ pancreas_104.nii.gz
├─ imagesTrain
│  ├─ pancreas_001.nii.gz
│  ├─ ...
│  └─ pancreas_055.nii.gz
├─ labelsTest
│  ├─ pancreas_100.nii.gz
│  ├─ ...
│  └─ pancreas_104.nii.gz
└─ labelsTrain
   ├─ pancreas_001.nii.gz
   ├─ ...
   └─ pancreas_055.nii.gz

You may need to first install nibabel using the following command:
pip install nibabel
"""

import nibabel as nib
import os

# load data
data_dir = '/path/to/data'

# .nii.gz files
tr_files = [item for item in os.listdir(data_dir + '/imagesTrain') if item.endswith('.nii.gz')]
te_files = [item for item in os.listdir(data_dir + '/imagesTest') if item.endswith('.nii.gz')]

train_data = []
train_label = []
i = 1
for item_i in tr_files:
    train_image = nib.load(data_dir + '/imagesTrain/' + item_i)
    train_data.append(train_image.get_fdata())
    label_image = nib.load(data_dir + '/labelsTrain/' + item_i)
    train_label.append(label_image.get_fdata())
    print('Loaded ' + str(i) + ' of ' + str(30))
    i += 1

test_data = []
test_label = []
i = 1
for item_i in te_files:
    test_image = nib.load(data_dir + '/imagesTest/' + item_i)
    test_data.append(test_image.get_fdata())
    label_image = nib.load(data_dir + '/labelsTest/' + item_i)
    test_label.append(label_image.get_fdata())
    print('Loaded ' + str(i) + ' of ' + str(5))
    i += 1

Baseline pre-trained model (UNet with timm-mobilenetv3_small_075 as encoder)

For more information: https://github.com/qubvel/segmentation_models.pytorch

Note: you may need to first install the segmentation-models-pytorch package using

pip install segmentation-models-pytorch

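import torch
import segmentation_models_pytorch as smp

# Assumption: use a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = smp.Unet(
    encoder_name="timm-mobilenetv3_small_075",
    encoder_weights="imagenet",
    in_channels=1,
    classes=1,
)
model = model.to(device)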

Memory requirement

The original images are being used in this assignment to enable you to experience a more realistic case study of using CNN for medical image analysis. However, the memory requirement to process these images is thus quite large. It is recommended that you have access to at least 12GB of RAM (either locally or via Google Colab) but preferably >16GB of RAM available to conduct this analysis.

You will not be penalised for applying image transforms that reduce the image size or processing only a subset of the data, but your methods and justification should be reported and you should attempt to maximise the achievable performance.

Remember to use del to remove variables that are no longer required in your analysis to free up memory.
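As one way to keep memory use down, the sketch below converts a loaded volume into a stack of 2D float32 slices and then deletes the original array. It assumes the slice axis is the last axis of each volume, which should be checked against the real data; the helper name `volume_to_slices` and the toy array are hypothetical. The default output of get_fdata() is float64, so the cast halves the footprint.

```python
import numpy as np

def volume_to_slices(volume, dtype=np.float32):
    """Turn one (H, W, S) volume into an (S, 1, H, W) array of 2D slices.

    Assumes the slice axis is the last one; check volume.shape on the real data.
    """
    slices = np.moveaxis(np.asarray(volume, dtype=dtype), -1, 0)  # (S, H, W)
    return slices[:, np.newaxis, :, :]                             # add a channel axis

# Toy stand-in for one loaded scan: float64, as returned by get_fdata() by default.
toy_volume = np.zeros((256, 256, 4), dtype=np.float64)
toy_slices = volume_to_slices(toy_volume)

bytes_before = toy_volume.nbytes
bytes_after = toy_slices.nbytes
del toy_volume  # drop the float64 copy once the float32 slices exist
```

The (S, 1, H, W) layout matches the batch, channel, height, width order expected by 2D PyTorch models with in_channels=1.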

Glucose prediction using time series

Data:

1. Introduction:

For Type 1 Diabetes (T1D), since the body cannot produce sufficient insulin, individuals must carefully balance their insulin doses with their carbohydrate intake to maintain blood glucose levels within the target range (70 to 180 mg/dL). Proper management of both carbohydrate and insulin intake is essential for maintaining optimal glucose levels and preventing complications associated with T1D. Carbohydrates in meals raise blood glucose levels, while insulin lowers them by allowing glucose to enter the cells. This balance is vital to avoid both hyperglycemia (high blood sugar, >180 mg/dL) and hypoglycemia (low blood sugar, <70 mg/dL), which can lead to acute and long-term complications.

A randomized noninferiority clinical trial was conducted at 14 sites within the T1D Exchange Clinic Network. The study included participants aged 18 years and older (average age 44 ± 14 years), who had T1D for at least one year (average duration 24 ± 12 years), used an insulin pump, and had an HbA1c level of 9.0% or lower (≤75 mmol/mol) (average 7.0 ± 0.7% [53 ± 7.7 mmol/mol]). Before the study, 47% of participants were using continuous glucose monitoring (CGM), a system that automatically tracks glucose levels throughout the day and night. Participants were randomly assigned in a 2:1 ratio to either the CGM-only group (n = 149) or the CGM plus blood glucose monitoring (BGM) group (n = 77), where BGM involves traditional fingerstick tests to measure blood glucose levels.

2. Data format of different variables

Glucose level readings, which are recorded every few minutes. (Data Tables/HDeviceCGM.csv)

Capillary BG, also called “finger stick”, is measured by a glucose meter. (Data Tables/HDeviceBGM.csv)

Self-reported food intake. (Data Tables/HDeviceWizard.csv)

Bolus: dose of insulin injection. (Data Tables/HDeviceBolus.csv)

We recommend joining tables using “DeviceDtTmDaysFromEnroll”, “DeviceTm” and “PtID”.

DeviceDtTmDaysFromEnroll: It represents the number of days from the patient's enrolment date to the date when the device data was recorded. This field can have negative values, indicating data recorded before the enrolment date; positive values, indicating data recorded after the enrolment date; and zero, indicating data recorded on the day of enrolment.

For example:

DeviceDtTmDaysFromEnroll of -53 means the data was recorded 53 days before the enrolment date.

DeviceDtTmDaysFromEnroll of 31 means the data was recorded 31 days after the enrolment date.

DeviceTm: it represents the time of day when the device data was recorded. This field captures the time in the format of HH:MM:SS, where:

•     HH is the hour in a 24-hour format.

•     MM is the minute.

•     SS is the second.

For example:

•     17:45:16 indicates that the data was recorded at 5:45:16 PM.

•     09:00:26 indicates that the data was recorded at 9:00:26 AM.

PtID: it is used to distinguish and track individual patients within the dataset. Each patient has a unique PtID that allows for the association of their data across different records and tables. Note: we only use the data where PtID <= 10 (7 patients) in this coursework.
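To align rows from different tables, the day offset and the time of day can be combined into a single time value per patient. The sketch below is one possible way to do this, assuming DeviceTm counts forward from midnight of the day given by DeviceDtTmDaysFromEnroll (including for negative day offsets); the function name `minutes_from_enrolment` is hypothetical.

```python
def minutes_from_enrolment(days_from_enroll, device_tm):
    """Combine DeviceDtTmDaysFromEnroll and DeviceTm (HH:MM:SS) into minutes since enrolment.

    Assumption: the time of day is added forward from midnight of the given day,
    so day -53 at 17:45:16 is later than day -53 at 09:00:00.
    """
    hours, minutes, seconds = (int(part) for part in device_tm.strip().split(":"))
    return days_from_enroll * 24 * 60 + hours * 60 + minutes + seconds / 60.0

# The two examples from the field descriptions above.
before = minutes_from_enrolment(-53, "17:45:16")
after = minutes_from_enrolment(31, "09:00:26")
```

Sorting each patient's rows by this value gives a single time axis on which CGM, BGM, meal and bolus records can be matched.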

An example of the time series data in this coursework:

Acknowledgments

The dataset is public, so more details can be found in the paper [1].

Paper: https://diabetesjournals.org/care/article/40/4/538/3687/REPLACE-BG-A-Randomized-Trial-Comparing-Continuous

Problem formulation

Given the time series data at and before the current time, how can the glucose level 30 minutes ahead be predicted using ML models? Please refer to the practical sessions in Week 6 (RNN) for sample code.
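The formulation above can be turned into supervised examples with a sliding window. The sketch below assumes the readings are already on a regular 5-minute grid with no gaps, so that 6 steps correspond to 30 minutes; real CGM data may need resampling and gap handling first. The function name `make_windows` and the default history length are illustrative choices.

```python
import numpy as np

def make_windows(glucose, history_steps=12, horizon_steps=6):
    """Build (X, y): X holds `history_steps` past readings, y the reading `horizon_steps` ahead.

    With a 5-minute grid (an assumption), horizon_steps=6 corresponds to 30 minutes.
    """
    glucose = np.asarray(glucose, dtype=np.float32)
    n_samples = len(glucose) - history_steps - horizon_steps + 1
    if n_samples <= 0:
        empty_X = np.empty((0, history_steps), dtype=np.float32)
        return empty_X, np.empty((0,), dtype=np.float32)
    # Row i covers readings i .. i + history_steps - 1 (the last one is "now").
    X = np.stack([glucose[i:i + history_steps] for i in range(n_samples)])
    # The target for row i is the reading horizon_steps after "now".
    y = glucose[history_steps + horizon_steps - 1:]
    return X, y

toy_glucose = np.arange(100, 120)  # 20 synthetic readings
X, y = make_windows(toy_glucose)
```

Each row of X could be extended with meal and bolus features aligned to the same time steps when exploring multiple variables.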

Organization of the report

1. Introduction

•    Background (e.g., CGM and glucose prediction).

•    Motivation (e.g., why it is an important problem, concepts of hypoglycaemia and hyperglycaemia, time in target range).

•    Related literature (e.g., what has been done by other relevant work?).

•    Show a sound interpretation of the medical problem.

2. Methodology

2.1 Dataset and Preprocessing:

Outline and explain the dataset and its statistics. Provide descriptive statistics of the variables. Visualize the data or feature space and explore the underlying characteristics. Calculate the proportion of time in the target range (70-180 mg/dL), in hypoglycaemia (below 70 mg/dL) and in hyperglycaemia (above 180 mg/dL). Report statistics of the meal and insulin distributions.
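The time-in-range proportions can be approximated by counting readings, as in the sketch below. This assumes readings are roughly evenly spaced, so that the share of readings approximates the share of time; the function name `time_in_ranges` and the toy values are hypothetical.

```python
import numpy as np

def time_in_ranges(glucose_mg_dl, low=70.0, high=180.0):
    """Return the fraction of valid readings below, within, and above the target range."""
    values = np.asarray(glucose_mg_dl, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing readings
    if values.size == 0:
        nan = float("nan")
        return {"hypo": nan, "target": nan, "hyper": nan}
    return {
        "hypo": float(np.mean(values < low)),
        "target": float(np.mean((values >= low) & (values <= high))),
        "hyper": float(np.mean(values > high)),
    }

toy_readings = [65, 90, 120, 181, float("nan"), 200, 150, 70, 180, 60]
proportions = time_in_ranges(toy_readings)
```

Here the boundary values 70 and 180 are counted as in range, following the "below 70" and "above 180" wording.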

•    Please retain the last 20% of data as the testing data for each person.

•    Please describe the details of preprocessing the data. How do you join different tables and align the timestamp? How do you generate training, validation, and testing examples?

•    How to deal with missing values of glucose data?

•    How to organize and exploit multiple variables?

•    Normalizing the input and output in a suitable way.

•    Any other techniques used to improve the accuracy should be explained in detail.
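Retaining the last 20% of each person's data for testing can be sketched as a chronological split applied per patient. The helper name `chronological_split` and the toy series are hypothetical; the sketch assumes each patient's records are already sorted by time.

```python
def chronological_split(records, test_fraction=0.2):
    """Split one patient's time-ordered records into (train, test), keeping the last part for testing."""
    n_test = int(round(len(records) * test_fraction))
    split_at = len(records) - n_test
    return records[:split_at], records[split_at:]

# Hypothetical per-patient series keyed by PtID; values stand in for time-ordered readings.
toy_series = {2: list(range(10)), 3: list(range(25))}
splits = {pt_id: chronological_split(series) for pt_id, series in toy_series.items()}
```

Normalisation statistics would then be computed on the training part only, so that no information from the test period leaks into training.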

2.2 Two methods

•    To compare two algorithms, one could be a baseline (e.g., SVR or XGBoost), and the other should be an advanced ML technique (CNN, RNN, etc.). Ideally, both algorithms can be advanced ML models.

•    Choose a proper cost function. Important hyper-parameters for ML models should be mentioned.

•    The training and tuning process (hyper-parameters, training loss) should be demonstrated. Try to achieve the best performance you can.

•    How to explore different impacts of different variables?

3. Results:

•    RMSE and MARD could be used as metrics to assess the results using a sliding window.

•    Results are supposed to be demonstrated clearly, and they could be illustrated by figures (time series) and tables (performances).

•    Visualize algorithms using glucose curves; provide analysis of algorithms with different hyper-parameter settings and give a description of the results.
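The two metrics named above can be computed as in the following sketch. MARD is taken here as the mean absolute relative difference with respect to the reference readings, expressed in percent; this is the usual definition, but it should be checked against the course material. Function names and toy values are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the readings (mg/dL)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mard(y_true, y_pred):
    """Mean absolute relative difference, in percent, relative to the reference readings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs(y_pred - y_true) / y_true))

toy_true = [100.0, 200.0]
toy_pred = [110.0, 180.0]
toy_rmse = rmse(toy_true, toy_pred)
toy_mard = mard(toy_true, toy_pred)
```

Both functions expect pairs with a valid reference reading; test points whose reference value is missing would need to be excluded before calling them.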

4. Discussion:

•    Compare the algorithms in terms of their overall performances. Please show reasons to explain your observations.

•    Which variables are important for T1D prediction, and how important are they?

•    Discuss how different preprocessing and/or ML methods affect the performance.

5. Extra points for distinction:

•    Good predictive accuracy and clear execution.

•    Is the sampling frequency of CGM strictly 5 minutes, and how do you handle it?

•    How do you deal with outliers in variables?

•    How do you deal with missing values in evaluating predictive accuracy?

•    Can data for patient A be used for training a model for patient B? If yes, how to do it? If not, why?

•    Population Models vs. Personalized Models: Population models are created by combining the training data from all patients. Personalized models are developed using each patient's own data. Which approach is better? Additionally, if we personalize population models by fine-tuning them with individual patients' data, which method yields the best results?

•    What kinds of metrics are suitable for this study, besides RMSE and MARD?

Coding

Demonstrate your solution with an attached iPython notebook. Ensure reproducibility and transparency.

Reference:

[1] Aleppo, Grazia, et al. "REPLACE-BG: a randomized trial comparing continuous glucose monitoring with and without routine blood glucose monitoring in adults with well-controlled type 1 diabetes." Diabetes care 40.4 (2017): 538-545.

AMLH 2024 – NLP coursework: Identify phenotypes from clinical notes

1. Dataset:

1. Description:

This dataset is a subset (1,000 random samples) of a synthetic clinical note corpus (https://huggingface.co/datasets/starmpcc/Asclepius-Synthetic-Clinical-Notes), which was generated by using ChatGPT and used for training Asclepius (https://arxiv.org/abs/2309.00237), a medical large language model. The documents have been annotated using results from an NLP tool (SemEHR, Wu et al. 2018, doi: 10.1093/jamia/ocx160).

The dataset includes:

-     1,000 documents in plain text format.

-     A csv file of annotations on these documents.

2. Data Format

For the 1,000 plain text document files: each file is named in the format xxx.txt, where xxx is the patient ID from the original dataset.

For the annotation CSV file: the file name is amlh_coursework_2024_annotated.csv

Total number of annotations: 13,737, which are split into two categories as described below.

Class Label | Number | Description
Phenotypes | 6,561 | Diseases, syndromes, and symptoms
Therapeutic or Preventive Procedure | 7,176 | Procedures, drugs, and treatments

The data format is as follows (see the screenshot):

-     file: the file name
-     start: the annotation start offset (position by characters from the start of the file)
-     end: the annotation end offset
-     text: the annotated text (as it is in the original document, i.e., doc_str[start:end])
-     class: the type of the mention
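A quick sanity check on the annotation format is to confirm that each row's text equals doc_str[start:end]. The sketch below uses an invented document and invented CSV rows, and assumes the CSV header uses exactly the column names listed above. Mapping the class column to a 0/1 label is one possible reading of the task and is shown only as an illustration.

```python
import csv
import io

# Toy stand-ins: the document text and CSV rows below are invented for illustration.
doc_str = "Patient reports chest pain and was given aspirin."
csv_text = (
    "file,start,end,text,class\n"
    "001.txt,16,26,chest pain,Phenotypes\n"
    "001.txt,41,48,aspirin,Therapeutic or Preventive Procedure\n"
)

def check_offsets(doc_text, rows):
    """Return the rows whose `text` does not equal doc_text[start:end]."""
    return [
        row for row in rows
        if doc_text[int(row["start"]):int(row["end"])] != row["text"]
    ]

rows = list(csv.DictReader(io.StringIO(csv_text)))
mismatches = check_offsets(doc_str, rows)

# Assumed labelling: 1 for "Phenotypes", 0 for the other class.
labels = [1 if row["class"] == "Phenotypes" else 0 for row in rows]
```

With the real data, the same check would be run per file after reading each xxx.txt document into a string.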

Acknowledgments

https://huggingface.co/datasets/starmpcc/Asclepius-Synthetic-Clinical-Notes

2. Problem formulation:

The aim is to accurately identify mentions of diseases, syndromes and symptoms from clinical notes. We call them “Phenotypes” mentions. You are not starting from scratch to develop an NLP model to do this. Instead, you are given a list of annotations on these documents from a baseline NLP tool. We assume this baseline result has very high recall, i.e., it identifies almost all “Phenotypes” mentions, but it cannot accurately tell you which mentions are true “Phenotypes”. The task is to do a binary classification on all given mentions to identify true “Phenotypes” annotations. This is a simplified named entity recognition task on clinical notes.

3. Requirements on coursework report:

The report typically is composed of the following sections. Please read carefully about what is expected for each section.

Introduction

Background knowledge and a proper literature review are expected in this section. Explain why NLP (e.g., named entity recognition) from free-text clinical notes is an important problem, and what techniques (ML and non-ML) have been used to solve such a problem in recent years. Give an overview of what you plan to do and how that sits within the recent developments from the literature.

Preprocessing:

•    How to tokenize and lemmatize texts (you need to justify whether lemmatization is needed or not)

•    How to convert texts into a data format that can be used in ML models

•    Generation of training and testing sets. If you need a validation set, please generate it as well. It is also possible that no training set is needed for certain approaches, for example, if you choose to use so-called prompting methods with large language models [1].

Methodology

•    How to define the context for given mentions, e.g., word windows as the context and how many words? Word window on one side or both sides?

•    How to represent the features, e.g., which distributed representations for representing the meanings of word tokens?

•    Which classification algorithms to use? You will need to compare two methods, with at least one being a neural network-based approach. Options include but are not limited to:

o Represent each annotation as a vector or a sequence of vectors, then feed them into a classification algorithm (e.g., Random Forest or LSTM).

o Use a pretrained language model and fine-tune the model for the classification task.

o Prompt a pretrained language model (preferably generative models) for the classification task. Different prompt strategies are recommended to show the differences.

•    Choose a proper cost function, hyper-parameters, or different prompting approaches.

•    The training and tuning process should be demonstrated.
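The word-window idea in the first point of this list can be sketched with simple whitespace tokenisation, as below. The function name `context_window`, the toy sentence and the window size are hypothetical; a real pipeline would likely use a proper tokenizer and keep punctuation handling consistent with the chosen model.

```python
def context_window(doc_text, start, end, n_words=5):
    """Return (left_words, mention, right_words) using whitespace tokenisation around a mention."""
    # Guard n_words == 0 explicitly, since a [-0:] slice would return the whole list.
    left = doc_text[:start].split()[-n_words:] if n_words > 0 else []
    right = doc_text[end:].split()[:n_words]
    return left, doc_text[start:end], right

# Invented example sentence with a mention at character offsets 16..26.
doc_str = "Patient reports chest pain and was given aspirin."
left, mention, right = context_window(doc_str, 16, 26, n_words=2)
```

Setting n_words differently for each side, or dropping one side, gives the one-sided variants mentioned in the same point.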

Results analysis

•    Please select appropriate metrics to assess the results.

•    Results are supposed to be demonstrated clearly, and key results should be illustrated by figures and tables.

•    Different approaches or choices of technique alternatives should be analysed with comparisons or ablation studies.
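For a binary classification of mentions, precision, recall and F1 are natural candidates among the metrics to consider. The sketch below computes them from scratch on invented labels; the function name `binary_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion counts, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

# Invented labels: 1 means a true "Phenotypes" mention.
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
metrics = binary_metrics(toy_true, toy_pred)
```

Reporting the four confusion counts alongside the derived scores makes it easier to compare the two methods in a table.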

Discussion

•    Compare two algorithms in terms of their overall performances and your interpretations.

•    Discuss how different preprocessing methods or hyperparameter settings affect the performance.

•    What are the challenges and opportunities of using free-text data for supporting research and improving health care?

Reference

[1] Liu, Pengfei, et al. "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing." ACM Computing Surveys 55.9 (2023): 1-35.

