
10/27/23, 6:33 AM Fall 2023 CS512 Assignment - CS 512 - Illinois Wiki

Fall 2023 CS512 Assignment

Coding Assignment (Distributed Tuesday Sep. 12, 2023, Due Thursday Oct. 26, 2023, Extended to Sunday
Oct. 29, 2023)

Before Diving In

All answers must be in PDF format.
This is an individual assignment. You can discuss this assignment on Piazza, including the performance your implementation achieves on the provided datasets, but please do not
work together or share code.
You may use reference libraries or programs found online.
You can use C/C++, Java, or Python 2/3 as your programming language. More detailed project organization guidance can be found at the end of the assignment.

Late policy:

10% off for one day late (Oct. 30th, 11:59 PM; originally Oct. 27th)
20% off for two days late (Oct. 31st, 11:59 PM; originally Oct. 28th)
40% off for three days late (Nov. 1st, 11:59 PM; originally Oct. 29th)
A section titled Frequently Asked Questions can be found at the end of the assignment, and we will keep updating it regarding questions in Piazza and office hours.
Please first read through the entire assignment description before you start.

Problem Description

As we learned in class, traditional topic modeling can suffer from non-informative topics and overlapping semantics between different topics. In response, discriminative topic mining incorporates user guidance in the form of category names and retrieves representative and discriminative phrases during embedding learning. These category-name-guided text
embeddings can in turn be used to train a high-quality weakly-supervised classifier.
Specifically, we need to finish five steps as follows.
Step 1: Download the training datasets on news and movies (see the Problem Data section), and use AutoPhrase to extract high-quality phrases.
Step 2: Write or adopt CatE on the segmented corpus to find representative terms for each category.
Step 3: Perform weakly-supervised text classification with only class label names or keywords. Test the classifier on the two datasets.
Step 4: Investigate the results and propose a way to use prompting of pre-trained language models (PLMs) to improve them. Implement your method and compare it with the one from Step 3.
Step 5: Submit your implementation, results, and a short report to Canvas.

Problem Data

You can find the problem data at this link. The download contains the following files:

| Name   | Num of documents | Category names      | Training Text    | Testing Text    | Validation Labels                    |
|--------|------------------|---------------------|------------------|-----------------|--------------------------------------|
| News   | 120,000          | news_category.txt   | news_train.txt   | news_test.txt   | first 100 of news_train_labels.txt   |
| Movies | 25,000           | movies_category.txt | movies_train.txt | movies_test.txt | first 100 of movies_train_labels.txt |

Step 1: Adopt AutoPhrase to extract high-quality phrases

In this step, you will use AutoPhrase to extract high-quality phrases from train.txt of both provided datasets. The extracted phrase list looks like this (the example here is different from the
homework test data):

Score     Phrase

0.9857636285     lung nodule
0.9850002116     presidential election
0.9834895762     wind turbines
0.9834120003     ifip wg
....
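Once AutoPhrase finishes, the scored list above can be filtered down to a high-quality phrase vocabulary. A minimal sketch, assuming the tab-separated `score<TAB>phrase` format shown above (the function name and threshold are illustrative, not fixed by the assignment):

```python
def load_quality_phrases(path, threshold=0.5):
    """Read an AutoPhrase output file ("score<TAB>phrase" per line) and
    keep phrases whose quality score meets the threshold."""
    phrases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) != 2:
                continue  # skip malformed lines
            try:
                score = float(parts[0])
            except ValueError:
                continue
            if score >= threshold:
                phrases.append((score, parts[1]))
    return phrases
```

The threshold controls the precision/recall trade-off of the extracted vocabulary; inspecting the list around your cutoff is a quick way to pick a sensible value.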

Step 2: Compute category-name-guided embeddings on the segmented corpus

Use your segmentation model to parse the same corpus; the recommended parameters for segmentation are HIGHLIGHT_MULTI=0.7 and HIGHLIGHT_SINGLE=1.0. An example segmented
corpus can be:

An overview is presented of the use of <phrase>spatial data structures</phrase> in <phrase>spatial databases</phrase>. The focus is on <phrase>hierarchical data structures</phrase>, including a number of variants of <phrase>quadtrees</phrase>, which <phrase>sort</phrase> the data with respect to the space occupied by it. Such techniques are known as <phrase>spatial indexing</phrase> methods. <phrase>Hierarchical data structures</phrase> are based on the principle of <phrase>recursive decomposition</phrase>.
Then you will write your own CatE or refer to existing implementations to compute the phrase embeddings and perform category-guided phrase mining. You will need to submit each category and its top-10
representative terms in {category_name}_terms.txt.

For example, in technology_terms.txt, the first line should be the category name embedding, and the following 10 lines should be the representative term embeddings:

technology 0.720378 -0.312077 0.811608 ... 1.096724

terms_of_usage_privacy_policy_code_ 1.439691 0.508672 -0.958150 ... -1.277346
...

Tip: You can concatenate train.txt and test.txt into a larger corpus for phrase mining and category-name-guided embeddings.
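Selecting the top-10 representative terms per category ultimately comes down to ranking candidate terms by similarity to the category embedding in the embedding space. A minimal sketch of that ranking step, assuming you already have the vectors (the function name is illustrative; CatE itself also enforces distinctiveness across categories, which this sketch omits):

```python
import numpy as np

def top_k_terms(category_vec, term_vecs, k=10):
    """Rank candidate terms by cosine similarity to a category embedding.

    category_vec: np.ndarray for the category name.
    term_vecs: dict mapping term -> np.ndarray.
    Returns the k most similar terms, best first.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(term_vecs.items(),
                    key=lambda kv: cos(category_vec, kv[1]),
                    reverse=True)
    return [term for term, _ in ranked[:k]]
```

The selected terms and their vectors can then be written out in the `{category_name}_terms.txt` format shown above, category name embedding first.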

Step 3: Document Classification with CatE embeddings  

In this step, you will need to build a weakly-supervised classifier (e.g., WeSTClass, LOTClass) on top of the term embeddings or topic keywords you obtained from the previous steps. The
only supervision is the category names provided in the datasets.
For example, for news, we have the following category names in news_category.txt:
politics
sports
business
technology
To help validate your results, we also provide labels for the first 100 documents of both datasets in news_train_labels.txt and movies_train_labels.txt. Feel free to discuss the validation
performance you get in this step on Piazza.
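Since only the first 100 labels are provided, a small helper for computing validation accuracy against them can be handy. A sketch, assuming one label ID per line in the label files (the function name is illustrative):

```python
def validation_accuracy(pred_labels, label_file, n=100):
    """Compare predicted label IDs against the first n provided labels
    (e.g., news_train_labels.txt), one label ID per line."""
    with open(label_file, encoding="utf-8") as f:
        gold = [line.strip() for line in f][:n]
    correct = sum(p == g for p, g in zip(pred_labels[:n], gold))
    return correct / len(gold)
```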

Tip: You can try label names or keywords expanded from the CatE embeddings as weak supervision. We suggest you try both ways and report the better one.
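As an illustration of the weak-supervision idea (not the required method), the crudest possible pseudo-labeler simply counts how often each category's label name or expanded keywords appear in a document; systems like WeSTClass start from such pseudo labels and then train a neural classifier with self-training:

```python
from collections import Counter

def pseudo_label(doc, category_keywords):
    """Assign the category whose keywords occur most often in the document.

    category_keywords: dict mapping category name -> list of keywords
    (label names alone, or keywords expanded from the embeddings).
    Returns None when no keyword matches, i.e., the document stays unlabeled.
    """
    counts = Counter(doc.lower().split())
    scores = {c: sum(counts[w] for w in kws)
              for c, kws in category_keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Documents left unlabeled by the matcher are exactly where the classifier trained on the pseudo-labeled subset has to generalize.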

Step 4: Improving Your Classifiers with Prompting of PLMs  

Because existing methods (e.g., WeSTClass, LOTClass) use keyword matching or static token embeddings to generate pseudo labels for classifier training, PLM prompting can
potentially improve pseudo-label quality through the contextualization power of PLMs. In this step, you will need to propose and implement your own idea for leveraging PLM prompting
to improve the classifiers you obtained in the last step. Feel free to explore different types of models, such as MLM-based PLMs (BERT, RoBERTa), discriminative PLMs (ELECTRA), fine-tuned models like RoBERTa-MNLI, or large ones like ChatGPT (sorry, we cannot provide access to the OpenAI API).
You can either propose a completely new method or improve on the one you used in Step 3; please make sure it is weakly supervised, i.e., it uses no labels. You may refer to
some recent papers to borrow ideas:
Zhao et al., Pre-trained Language Models Can be Fully Zero-Shot Learners, in ACL 2023.
Park and Lee, LIME: Weakly-Supervised Text Classification Without Seeds, in COLING 2022.
Zhang et al., PromptClass: Weakly-Supervised Text Classification with Prompting Enhanced Noise-Robust Self-Training, arXiv:2305.13723.
Sun et al., Text Classification via Large Language Models, arXiv:2305.08377.

Do not simply use their code available on GitHub. In this step, we expect you to propose your own idea and implement it by yourself. You also need to test your
implementation on the two provided datasets and write a report.
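One hypothetical ingredient of such a method (a sketch, not a prescribed design): suppose you score each candidate label word with an MLM prompt, e.g., by appending "This article is about [MASK]." to a document and reading the logits for the label words at the mask position. Keeping only documents where the top label clearly beats the runner-up is one simple way to harvest higher-quality pseudo labels for self-training:

```python
def select_confident(score_dicts, margin=2.0):
    """Filter prompt-based pseudo labels by score margin.

    score_dicts: one dict per document mapping label -> prompt score
    (e.g., MLM logits for label words at a [MASK] position).
    Returns (doc_index, label) pairs where the top label beats the
    runner-up by at least `margin`; low-margin documents are dropped.
    """
    selected = []
    for i, scores in enumerate(score_dicts):
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) < 2 or ranked[0][1] - ranked[1][1] >= margin:
            selected.append((i, ranked[0][0]))
    return selected
```

The margin (or a softmax-confidence equivalent) trades pseudo-label coverage against quality, which is exactly the lever the noise-robust self-training papers above tune.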

Things to include in your report:
Describe your proposed idea in enough detail that others can reproduce your results.
In one table, compare the prediction accuracy of your new classifier and the one from Step 3 on the validation samples for both datasets, News and Movies.
Provide analysis of your experimental results (add additional experiments or case studies if necessary) to explain why your idea can or cannot improve the performance.

Step 5: Submit your result

In this step, you will apply the methods you implemented in Steps 2-4 to the two real-world datasets.
Your submission should be a .zip including these files:
yournetid_assignment.zip/
               |----------------report.pdf
               |----------------code/
                        |---------------- category_guided_classification/ (everything you implemented in step 4 and a readme)
               |----------------data/
                        |----------------movies/
                                         |----------------First 100 documents of movies_train.txt after phrasal segmentation, as: train_phrase.txt (Please do not submit all documents)
                                         |----------------Top-10 term embeddings for each category plus the category name embedding, as: good_terms.txt, bad_terms.txt (11 lines per file in total, with the category name in the first line)
                                         |----------------Your classification results from Step 3, as: step_3_test_prediction.txt (it should have the same number of lines as the testing file, with one predicted label ID per line)
                                         |----------------Your classification results from Step 4, as: step_4_test_prediction.txt (it should have the same number of lines as the testing file, with one predicted label ID per line)
                        |----------------news/ ... similar as movies
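Before zipping, it may help to sanity-check each prediction file against the corresponding test file. A small sketch (the line-count and integer-label checks mirror the format described above; the function name is illustrative):

```python
def check_predictions(pred_path, test_path):
    """Verify a step_*_test_prediction.txt: same number of lines as the
    test file, one integer label ID per line."""
    with open(pred_path, encoding="utf-8") as f:
        preds = [line.strip() for line in f if line.strip()]
    with open(test_path, encoding="utf-8") as f:
        n_test = sum(1 for _ in f)
    assert len(preds) == n_test, f"{len(preds)} predictions vs {n_test} test docs"
    assert all(p.isdigit() for p in preds), "each line must be an integer label ID"
    return True
```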
Your submission will be evaluated on the following aspects:
1. The segmented corpus has a good amount of quality phrases. (20pts)
2. Meaningful representative phrases under each category. (20pts)
3. Good document classification results based on your embedding and classification algorithm. (20pts)
4. A clear report on your proposed method with performance comparison and experiment analysis. (30pts)
5. Comprehensive code and instructions to reproduce your results. (10pts)

Double check before you submit

Now you can submit your assignment through Canvas!
Congratulations, you just finished the programming assignment of CS512!

Frequently Asked Questions

Having problems running AutoPhrase? Here are several solutions you may try:

(1) Use the Campus Cluster: Please check the instructions provided on Piazza for how to use it. Everyone in the course has access to the campus cluster. You need to request
compute nodes using the srun/sbatch commands to run any script; otherwise it will be killed by the admin automatically. Please see the documentation
(https://campuscluster.illinois.edu/resources/docs/user-guide/) on how to use modules and request compute nodes.
(2) Use Google Colab in terminal mode (e.g., lines starting with ! are executed in the terminal, or you may already have terminal access with a Pro subscription). See Piazza for more instructions on how to run WeSTClass for Step 3 on Colab.

(3) If you want to use a Mac with an ARM chip, you may need to install gcc 11+ and solve some dependency issues. Here is a related
post: https://stackoverflow.com/questions/72758130/is-there-a-way-to-install-and-use-gcc-on-macbook-m1

(4) If you want to use Windows, use WSL Ubuntu (https://learn.microsoft.com/en-us/windows/wsl/install)

Grading for step 3 & 4:
Step 3 will be graded purely on test performance. We will use our own runs of WeSTClass on both corpora as the standard for this step. You are not restricted to WeSTClass and can
use any other classifier you want, but note that your choice of classifier will be used as the baseline in Step 4.
Step 4 will be graded on both classification accuracy and your report. You should propose a way to improve classification accuracy with PLM prompting; either improving on
Step 3 or proposing a completely new method is fine. Besides a clear description of your proposed method and the reported performance on the dev set in your report, we expect you to
achieve better test performance than your Step 3 classifier. If you cannot outperform the Step 3 classifier, we will grade based on the error analysis and model insights in
your report.

PLMs in step 4:

Because we are focusing on the weakly-supervised setting, please do not use PLMs that are trained with task-related data. For example, directly using a BERT model fine-tuned on
sentiment analysis data is not allowed. You should only use LMs that are generically pre-trained or fine-tuned only on other tasks' widely available data (e.g., RoBERTa-MNLI
is fine).
