DTS406TC Natural Language Processing

Coursework 2 (Individual Assessment)

Due: 5:00 pm China time (UTC+8 Beijing) on April 6, 2025

Weight: 60%

Maximum score: 100 marks (100 individual marks)

Assessed learning outcomes:

C Implement and evaluate different NLP algorithms and models based on performance metrics and real-world application requirements

D Critically review advanced topics in NLP, such as prompt learning, language generation, and natural language understanding

E Demonstrate a strong capability for undertaking individual research on NLP problems

Overview

A Question Answer System takes a question as input and retrieves or generates the corresponding answer. Common platforms built around question answering include Reddit, Yahoo Answers, Quora, and Zhihu. Such a system helps users quickly obtain answers to the questions they post.

NLP technologies are also widely applied in the medical, legal, and financial domains, where cutting-edge NLP algorithms address many practical problems.

Part One

1. Literature Review on Question Answer System (20 Marks)

a) Overview of the Question Answer System and its applications. Please provide three examples of real-life applications of the Question Answer System. (6 Marks)

b) Please list three key challenges in the Question Answer System. (6 Marks)

c) Please elaborate on two BERT-style and two ChatGPT-style approaches to the Question Answer System, and discuss the advantages and disadvantages of each approach. (8 Marks)

2. Data Collection (12 Marks)

Collect two datasets of User-Generated Content (UGC) from platforms such as Reddit or Yahoo Answers, focusing on the Question Answer System in different scenarios. Each dataset should contain a minimum of 3,000 instances. Preprocess the datasets by performing tasks such as stopword removal and tokenization. Finally, conduct a statistical analysis of the two datasets (e.g., the word distribution of the corpus). Note that UGC data may be downloaded from Kaggle if API restrictions prevent direct downloads from social platforms. (6 Marks/dataset x 2 = 12 Marks)
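The preprocessing and word-distribution steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a prescribed solution: the stopword list, the tokenization pattern, the function names, and the three sample posts are all hypothetical, and a real submission would load the collected datasets and would likely use a fuller stopword list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on",
             "how", "do", "i", "it", "what"}

# Assumed tokenization rule: lowercase alphanumeric runs, optional apostrophe suffix.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def preprocess(text):
    """Tokenize the text and drop stopwords."""
    return [tok for tok in tokenize(text) if tok not in STOPWORDS]


def corpus_statistics(texts, top_k=5):
    """Compute simple corpus statistics, including the word distribution."""
    token_lists = [preprocess(t) for t in texts]
    counts = Counter(tok for toks in token_lists for tok in toks)
    total = sum(counts.values())
    return {
        "instances": len(token_lists),
        "total_tokens": total,
        "vocabulary_size": len(counts),
        "mean_tokens_per_instance": total / len(token_lists) if token_lists else 0.0,
        "top_words": counts.most_common(top_k),
    }


# Hypothetical sample posts standing in for a collected UGC dataset.
sample_posts = [
    "How do I fix a flat tire on a bike?",
    "What is the best way to learn Python?",
    "How to learn to fix a bike chain?",
]
stats = corpus_statistics(sample_posts)
print(stats)
```

Running the same function on each of the two datasets gives directly comparable figures (instance count, vocabulary size, mean length, most frequent words) for the statistical analysis.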

3. Algorithm Description and Implementation (32 Marks)

a) Choose two approaches for the answer generation in the Question Answer System on each collected UGC dataset: one using a BERT-style approach and the other employing a ChatGPT-style approach. You can directly adopt open-source LLMs such as Llama and Qwen from Hugging Face. To use closed-source LLMs such as ChatGPT, however, you will need to register and access them via the API. Note that one dataset is designed for answer selection, where the BERT-style approach will be applied. The other dataset is for answer generation, where the ChatGPT-style approach will be used. Please provide the pseudo-code. (3 Marks/algorithm x 2 = 6 Marks)

b) The BERT-style approach should incorporate Contrastive Learning (CL), and the ChatGPT-style approach should incorporate Retrieval-Augmented Generation (RAG). Please provide the pseudo-code. (3 Marks/algorithm x 2 = 6 Marks)

c) Develop a Question Answer System for each approach using Python. The implementation pipeline should include the following components: feature engineering (3 Marks), algorithm implementation (6 Marks, with CL for the BERT-style approach and RAG for the ChatGPT-style approach), and metrics computation (1 Mark). (10 Marks/algorithm x 2 = 20 Marks)
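The two augmentations named in this section can be illustrated with a small, hedged sketch. It assumes an in-batch InfoNCE-style loss as one common form of contrastive learning, and a bag-of-words cosine retriever as a stand-in for the retrieval step of RAG. All function names, the temperature value, and the prompt wording are hypothetical. In a real pipeline the vectors would come from a BERT-style encoder, the retriever would typically use dense embeddings, and the assembled prompt would be passed to whichever LLM is chosen.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce_loss(question_vecs, answer_vecs, temperature=0.1):
    """In-batch contrastive loss (assumed InfoNCE form).

    answer_vecs[i] is the positive for question_vecs[i]; every other
    answer in the batch serves as a negative.
    """
    losses = []
    for i, q in enumerate(question_vecs):
        logits = [cosine(q, a) / temperature for a in answer_vecs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def bag_of_words(text):
    """Whitespace-tokenized, lowercased term counts (no punctuation handling)."""
    return Counter(text.lower().split())


def bow_cosine(c1, c2):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c1[t] * c2[t] for t in set(c1) & set(c2))
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(passages, key=lambda p: bow_cosine(q, bag_of_words(p)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question, passages, k=2):
    """Assemble a prompt that places the retrieved passages before the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (f"Answer the question using the context.\n"
            f"Context:\n{context}\nQuestion: {question}\nAnswer:")


# Toy vectors: aligned question/answer pairs should give a lower loss than swapped pairs.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(questions, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss(questions, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)

# Hypothetical passages standing in for a retrieval corpus.
passages = [
    "python lists are mutable sequences",
    "bike tires need regular air pressure checks",
    "tuples in python are immutable",
]
print(build_rag_prompt("are python lists mutable", passages))
```

The loss is minimized when each question embedding is closer to its own answer than to the other answers in the batch, which is the property the answer selection task relies on. The retrieval function only selects context; generation itself is left to the chosen model.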

4. Results Analysis (12 Marks)

a) Provide the results for each approach applied to the two UGC datasets. Select and apply two relevant metrics (e.g., precision and recall) to assess the performance of the implemented models on the answer selection task, and two relevant metrics (e.g., BLEU and ROUGE) to assess their performance on the answer generation task, with each metric worth 2 Marks. (8 Marks)

b) Explain the reasons behind the model performance on each dataset for each approach. (2 Marks/algorithm+dataset x 2 = 4 Marks)
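The example metrics mentioned in this section can be computed along the following lines. This is a simplified sketch with hypothetical function names. Precision and recall assume binary labels. `rouge1_f1` is unigram-overlap ROUGE-1 against a single reference. `bleu1` is BLEU restricted to unigrams with a brevity penalty, whereas standard BLEU combines 1-gram to 4-gram precisions. Both text metrics use plain whitespace tokenization, so an established evaluation library would normally be preferred for reported results.

```python
import math
from collections import Counter


def precision_recall(predicted, gold):
    """Precision and recall for binary labels (1 = answer selected / correct)."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == 1 and g == 1)
    fp = sum(1 for p, g in pairs if p == 1 and g == 0)
    fn = sum(1 for p, g in pairs if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def bleu1(candidate, reference):
    """Unigram-only BLEU: clipped precision times the brevity penalty."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens:
        return 0.0
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens)
    # Penalize candidates that are shorter than the reference.
    if len(cand_tokens) >= len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return brevity_penalty * precision


# Toy inputs for illustration only.
p, r = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
rouge = rouge1_f1("the cat sat", "the cat sat on the mat")
bleu = bleu1("the cat sat", "the cat sat on the mat")
print(p, r, round(rouge, 4), round(bleu, 4))
```

In the toy generation example every candidate word appears in the reference, so unigram precision is perfect, but the candidate is half the reference length. ROUGE-1 recall and the BLEU brevity penalty both reflect that shortfall, which is the kind of behaviour worth discussing when explaining model performance.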

Part Two

1. Conduct a survey on one of the NLP domains: Medical, Legal, or Financial. (18 Marks)

a) Provide an overview of the chosen domain, together with three real-life scenarios from it. For example, if you select the medical NLP domain, you could outline the problems within this area, such as detecting depression from social data, mining medical records, and generating doctor recommendations from medical forums. Additionally, you could provide a list of commonly used datasets in medical NLP. (6 Marks)

b) Identify and list three key challenges within the domain. (6 Marks)

c) Select one challenge and propose a novel approach to address it. Provide a detailed description of the approach, which will be assessed based on its novelty and correctness. (6 Marks)

Part Three

1. Report Writing (6 Marks)

This coursework evaluates your understanding of the challenges of the problem and the correctness of the proposed algorithms. It also tests your professional skills in terminology usage, the presentation of algorithms and experimental results, and the logical structure of the proposal. (6 Marks)

Submission

You must submit a single zip file. The zip file is named "StudentID_Coursework.zip". It includes: a cover letter with your information and the final PDF report. A folder labeled "algorithms" contains all the model implementations, data preprocessing scripts, and evaluation scripts. A folder labeled "data" contains all the datasets and the experimental results in the CSV format.