
 

COMP5046 Natural Language Processing 

 

*This sample exam paper contains 1 or 2 questions for each weekly topic. It is intended to show the structure and style of the final exam, including which questions are worth 2 marks and which are worth 5 marks. However, you are not expected to finish it in 1 hour as you would the final exam.

 

*The final exam will be an open-book, unsupervised exam. 

 

Week 1. Count-based Word Representation 

 

Q1. Calculate the TFIDF for the terms listed below for documents 1 to 4. There are 10,000 documents in the collection. The number of times each of these terms occurs in documents 1 to 4, as well as the number of documents in the collection containing each term, is stated below. Use this information to calculate the TFIDF scores. (2 marks)

 

Number of documents containing terms: 

● machine: 3 

● university: 50 

● headphone: 10 

● perfume: 3 

 

Term frequency (the number of times each of the 4 terms occurs in documents 1 to 4):

              Doc1   Doc2   Doc3   Doc4
machine          8     10      0      0
university       3      1      2      2
headphone        0      0      8      7
perfume          2      2      2      9

 

Solution 

IDF calculation (document frequency in parentheses; log is the natural logarithm)
● machine: 3 (IDF = log(10000/3) ≈ 8.11)
● university: 50 (IDF = log(10000/50) ≈ 5.30)
● headphone: 10 (IDF = log(10000/10) ≈ 6.91)
● perfume: 3 (IDF = log(10000/3) ≈ 8.11)

 

TFIDF calculation

              Doc1               Doc2               Doc3               Doc4
machine       8.11 * 8 = 64.88   8.11 * 10 = 81.10  0                  0
university    5.30 * 3 = 15.90   5.30 * 1 = 5.30    5.30 * 2 = 10.60   5.30 * 2 = 10.60
headphone     0                  0                  6.91 * 8 = 55.28   6.91 * 7 = 48.37
perfume       8.11 * 2 = 16.22   8.11 * 2 = 16.22   8.11 * 2 = 16.22   8.11 * 9 = 72.99
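As a sanity check, the table above can be reproduced with a short script. This is a minimal sketch following the weighting used in the solution (raw term frequency times IDF = log(N/df) with the natural logarithm, rounding the IDF to two decimals as in the table); the variable names are our own.

import math

N = 10_000  # documents in the collection

# document frequency of each term
df = {"machine": 3, "university": 50, "headphone": 10, "perfume": 3}

# raw term frequencies in Doc1..Doc4
tf = {
    "machine":    [8, 10, 0, 0],
    "university": [3, 1, 2, 2],
    "headphone":  [0, 0, 8, 7],
    "perfume":    [2, 2, 2, 9],
}

for term, counts in tf.items():
    idf = round(math.log(N / df[term]), 2)      # natural log, rounded as in the table
    scores = [round(c * idf, 2) for c in counts]
    print(f"{term:10s} IDF={idf:.2f}  TFIDF={scores}")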

 

Week 2. Word Embeddings and Representation 

 

Q2. Illustrate an advantage of FastText over Word2Vec. Give application examples to 

support your argument. ​(2 marks) 

 

Solution 

The main advantage of FastText embeddings over Word2Vec is that they take the internal structure of words into account while learning word representations, which is very useful for morphologically rich languages and for words that occur rarely. For example, for a word that appears in various forms such as “teaching”, “teacher” and “teached”, the internal structure shared by these forms can be learned and represented through the character n-grams on which FastText is based.
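To make the n-gram idea concrete, the sketch below (our own illustration, not part of the exam) extracts the character n-grams, with the '<' and '>' boundary markers FastText uses, for the word forms mentioned above; the 3-to-6 character range matches FastText's default setting.

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams FastText would use for `word`."""
    token = f"<{word}>"                      # boundary markers, as in FastText
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams

for w in ["teaching", "teacher", "teached"]:
    print(w, sorted(char_ngrams(w))[:5], "...")

# The three forms share n-grams such as "<te", "tea", "teac" and "each", so their
# vectors (sums of n-gram vectors) end up close to each other, even for rare forms.
print(char_ngrams("teaching") & char_ngrams("teacher") & char_ngrams("teached"))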

 

 

 

Q3. Illustrate 2 examples of how we can evaluate word vectors. For each example, 

please indicate whether it is intrinsic or extrinsic. ​(2 marks) 

 

Solution 

● Intrinsic: word vector analogies (see the sketch after this list); word vector distances and their correlation with human judgments; word clustering and categorization.
● Extrinsic: named entity recognition, i.e. finding persons, locations, organizations and so on; various text classification tasks such as the sentiment analysis we did in Assignment 1, i.e. identifying and extracting subjective information in the source text or document.
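As an illustration of the intrinsic analogy evaluation, the following minimal sketch answers "man : king :: woman : ?" by finding the word whose vector is closest, by cosine similarity, to vec(king) - vec(man) + vec(woman); the tiny 3-dimensional vectors are made up purely for this example.

import numpy as np

# toy word vectors (made up for illustration only)
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.2, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vecs["king"] - vecs["man"] + vecs["woman"]   # man : king :: woman : ?
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)   # "queen" with these toy vectors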

 

 

 

 

Week 3 and 4. Word Classification with Machine Learning 

 

Q4. In class, we learned that the family of recurrent neural networks has many important advantages and can be used in a variety of NLP tasks. For each of the following tasks and inputs, state how you would run an RNN to do that task; in particular, state: (5 marks)
1. how many outputs ŷ(t) there are, i.e. the number of times the softmax is called from your RNN (if the number of outputs is not fixed, state it as arbitrary)
2. what each ŷ(t) is a probability distribution over
3. which inputs are fed at each time step to produce each output

 

Task A: Named-Entity Recognition: ​For each word in a sentence, classify that word as 

either a person, organization, location, or none. (Inputs: A sentence containing n words) 

 

Task B: Sentiment Analysis: ​Classify the sentiment of a sentence ranging from negative to 

positive (integer values from 0 to 4). (Inputs: A sentence containing n words.) 

 

Solution 

Task A: Named Entity Recognition 

1. Number of outputs: n outputs.
2. Each ŷ(t) is a probability distribution over the 4 NER categories.
3. Each word in the sentence is fed into the RNN and one output is produced at every time step, corresponding to the predicted tag/category for that word.

Task B: Sentiment Analysis
1. Number of outputs: 1 output (n outputs is also acceptable if the model takes the average of all outputs).
2. Each ŷ(t) is a probability distribution over the 5 sentiment values.
3. Each word in the sentence is fed into the RNN and one output is produced from the hidden states (by taking only the final state, or the max or mean across all states), corresponding to the sentiment value of the sentence.
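A minimal PyTorch sketch of the two set-ups (the layer sizes, vocabulary size and plain nn.RNN cell are assumptions for illustration): Task A applies a softmax at every time step, Task B only to the final hidden state.

import torch
import torch.nn as nn

V, EMB, HID = 10_000, 50, 64          # vocab size and layer sizes (arbitrary)
emb = nn.Embedding(V, EMB)
rnn = nn.RNN(EMB, HID, batch_first=True)

ner_head = nn.Linear(HID, 4)          # person / organization / location / none
sent_head = nn.Linear(HID, 5)         # sentiment values 0..4

x = torch.randint(0, V, (1, 7))       # a batch with one 7-word sentence
states, h_n = rnn(emb(x))             # states: (1, 7, HID), h_n: (1, 1, HID)

# Task A: one distribution per word -> n outputs
ner_probs = torch.softmax(ner_head(states), dim=-1)      # shape (1, 7, 4)

# Task B: one distribution per sentence, from the final hidden state
sent_probs = torch.softmax(sent_head(h_n[-1]), dim=-1)   # shape (1, 5)

print(ner_probs.shape, sent_probs.shape)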

 

Q5. Assume that you build a sentiment analysis system that feeds a sentence into an RNN and then computes the sentiment class between 0 (very negative) and 4 (very positive) based only on the final hidden state of the RNN. Illustrate one advantage that an RNN would have over a neural window-based model for this task. (2 marks)

 

Solution 

There are multiple acceptable answers: an RNN can process inputs of arbitrary length; it can encode temporal information (“takes ordering into consideration” is only partially correct, because theoretically window-based models also can, although it is hard); it shares weights across time steps and therefore has fewer parameters. The number of parameters of a neural window-based network grows in proportion to the input size, whereas it stays constant for an RNN since the weights are shared at every time step.

Week 5. Language Fundamental 

 

Q6. Describe the difference between lemmatization and stemming. Give application examples to support your argument. (2 marks)

 

Solution 

Stemming is a procedure that reduces all words with the same stem to a common form, whereas lemmatization removes inflectional endings and returns the base or dictionary form of a word. For example, the words “trouble”, “troubling” and “troubled” may all be stemmed to “troubl” (not a valid English word) but will be lemmatized to “trouble” for comparison. Another good example of lemmatization is mapping the words “was” and “is” to “be”.
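A minimal sketch using NLTK's Porter stemmer and WordNet lemmatizer to reproduce the examples above (it assumes the WordNet data can be downloaded; exact outputs may vary slightly across NLTK versions):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)     # data needed by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["trouble", "troubling", "troubled"]:
    print(w, "->", stemmer.stem(w))                   # all three stem to "troubl"

for w in ["was", "is"]:
    print(w, "->", lemmatizer.lemmatize(w, pos="v"))  # both lemmatize to "be"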

 

Week 6. Part of Speech Tagging 

 

Q7. Find one tagging error in each of the following sentences that are tagged with 

the Penn Treebank tagset. Briefly explain why. ​(2 marks) 

1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN 

2. Can/MD I/PRP serve/VB the/DT dinner/NNS 

 

Solution 

1. Atlanta should be NNP. Atlanta is the capital city of the U.S. state of Georgia, so it is a proper noun, singular: a name used for an individual person, place, or organization, spelled with an initial capital letter.
2. Dinner should be NN. NN represents a noun, singular, whereas NNS is a noun, plural, and the word ‘dinner’ here is a singular noun.

 

 

Q8-a. A hidden Markov model includes states, observations, transition probabilities, and observation likelihoods. Describe what each of these corresponds to when using an HMM for POS tagging. (2 marks)

 

Solution 

● States: the POS tags at specific points in the sentence.
● Observations: the words that are observed as the sentence is read in.
● Transition probabilities: the probability of POS tag N following POS tag N-1.
● Observation likelihoods: the probability of seeing a particular word given a particular POS tag (state).

 

Q8-b. Given the sentence “I promise to back the bill.”, show how you would compute the probability of “back” as a verb versus the probability of “back” as a noun, using the probabilities in Tables a and b and the Viterbi algorithm. You are given the values for the third column of the Viterbi table, which correspond to observation 3, “to”: VB: 0, TO: .00000018, NN: 0, PRP: 0. Thus, you will show two computations, both of which use these values. You do not need to do the arithmetic; just show the formula that would be computed. (5 marks)
(*assume all verb tags are VB)

Table a. Observation likelihoods P(word | tag)

        I      promise   to     back
VB      0      .0093     0      .00008
TO      0      0         .99    0
NN      0      .0085     0      .00068
PRP     .37    0         0      0

Table b. Tag transition probabilities (rows = previous tag, columns = current tag; the first row, <s>, gives the probability of each tag at the start of the sentence)

        VB      TO       NN       PRP
<s>     .019    .0043    .041     .067
VB      .0038   .035     .047     .0070
TO      .83     0        .00047   0
NN      .0040   .016     .087     .0045
PRP     .23     .00079   .0012    .00014

 

 

Solution 

● back as a verb:
.00000018 * Prob(VB | TO) * Prob(back | VB) = .00000018 * .83 * .00008
● back as a noun:
.00000018 * Prob(NN | TO) * Prob(back | NN) = .00000018 * .00047 * .00068
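A minimal sketch of this single Viterbi step (our own code; since the other Viterbi values for observation 3 are 0, the max over previous tags reduces to the TO path):

# Viterbi values for observation 3 ("to"), as given in the question
v3 = {"VB": 0.0, "TO": 0.00000018, "NN": 0.0, "PRP": 0.0}

# relevant entries of Table b (transition) and Table a (observation likelihood)
trans = {("TO", "VB"): 0.83, ("TO", "NN"): 0.00047}         # P(tag_t | tag_{t-1})
emit  = {("VB", "back"): 0.00008, ("NN", "back"): 0.00068}  # P(word | tag)

for tag in ["VB", "NN"]:
    score = v3["TO"] * trans[("TO", tag)] * emit[(tag, "back")]
    print(f"back as {tag}: {score:.3e}")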

 

Week 7. Dependency Parsing 

 

Q9. State a sequence of transitions that makes a transition-based dependency parser produce the following dependency tree. Explain how to get the sequence of transitions. (5 marks)

(The dependency tree figure is not reproduced here. Based on the solution below, its arcs are 0→1, 1→2, 2→3, 2→5 and 5→4, where X→Y means X is the head of Y.)

Solution 

Suppose SH = Shift, RA = Right Arc, LA = Left Arc. 

SH SH SH SH RA SH SH LA RA RA RA 

 

In order to get this dependency tree using the arc-standard algorithm, we perform the following steps based on the three possible transitions (SH, RA, LA); a short code sketch replaying these transitions is given at the end of this solution:
Step 1. SH ROOT (0) onto the stack, while all the other words 1 to 5 remain in the buffer, as our initial state.
Step 2. SH word 1 from the buffer onto the stack.
Step 3. SH word 2 from the buffer onto the stack.
Step 4. SH word 3 from the buffer onto the stack.
Step 5. RA from 2 to 3 and remove 3 from the stack.
Step 6. SH word 4 from the buffer onto the stack.
Step 7. SH word 5 from the buffer onto the stack.
Step 8. LA from 5 to 4 and remove 4 from the stack.
Step 9. RA from 2 to 5 and remove 5 from the stack.
Step 10. RA from 1 to 2 and remove 2 from the stack.
Step 11. RA from 0 to 1 and remove 1 from the stack.

 

Head and modifier refer to the two words in a dependency relation: the head is the governor (parent) and the modifier is the dependent (daughter). Drawing an arrow for each dependency, the arrow points from the head to the modifier. For example, in the dependency between the words ‘red hat’, ‘red’ is the modifier while ‘hat’ is the head, so the arrow points from “hat” to “red”.

 

In your answer, include a detailed explanation of how to obtain the dependency tree using the definitions of SH, LA and RA, and of what the head and modifier are.
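To check the sequence, here is a minimal sketch (our own code, not part of the exam) that replays the transitions with a stack and a buffer and prints the arcs produced; RA makes the second item on the stack the head of the top item and pops the top, while LA does the reverse, matching the steps above.

def run_transitions(n_words, transitions):
    """Replay an arc-standard transition sequence and return the arcs (head, dependent)."""
    stack, buffer, arcs = [], list(range(0, n_words + 1)), []   # 0 is ROOT, initially in the buffer
    for t in transitions:
        if t == "SH":                       # shift: move front of buffer onto the stack
            stack.append(buffer.pop(0))
        elif t == "RA":                     # right arc: second-from-top is head of top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        elif t == "LA":                     # left arc: top is head of second-from-top
            head = stack.pop()
            dep = stack.pop()
            arcs.append((head, dep))
            stack.append(head)
        else:
            raise ValueError(t)
    return arcs

print(run_transitions(5, "SH SH SH SH RA SH SH LA RA RA RA".split()))
# -> [(2, 3), (5, 4), (2, 5), (1, 2), (0, 1)], matching Steps 5, 8, 9, 10 and 11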

Week 8. Language Model and Natural Language Generation 

 

Q10. When training a neural language model, we normally apply teacher forcing. Describe what the teacher forcing technique is. Give application examples to support your argument. (2 marks)

 

Solution 

Teacher forcing is the technique where the target word is passed as the next input to the decoder. Suppose we want to train a caption generation model and the ground-truth caption for an image is “Two people reading a book”. Our model makes a mistake at the 2nd word, predicting “Two” and “birds” as the 1st and 2nd words respectively. If we use teacher forcing, we feed the ground-truth word “people” to our RNN for the 3rd prediction, after computing and recording the loss for the 2nd prediction.

(Optional: without teacher forcing, we would feed “birds” back into our RNN to predict the 3rd word. Say the 3rd prediction is “flying”. Even though it makes sense for the model to predict “flying” given the input “birds”, the prediction drifts away from the ground truth.)
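A minimal PyTorch-style sketch of one decoder training step with teacher forcing (the vocabulary size, layer sizes, token ids and the GRUCell decoder are all assumptions for illustration):

import torch
import torch.nn as nn

V, EMB, HID = 1000, 32, 64                      # assumed vocabulary and layer sizes
emb = nn.Embedding(V, EMB)
gru = nn.GRUCell(EMB, HID)
out = nn.Linear(HID, V)
loss_fn = nn.CrossEntropyLoss()

# ground-truth token ids for "Two people reading a book" (made-up ids)
target = torch.tensor([[11, 42, 97, 5, 60]])
h = torch.zeros(1, HID)                         # initial decoder state
inp = torch.tensor([0])                         # <start> token id (assumed)

loss = 0.0
for t in range(target.size(1)):
    h = gru(emb(inp), h)                        # one decoding step
    logits = out(h)                             # distribution over the vocabulary
    loss = loss + loss_fn(logits, target[:, t]) # loss against the ground-truth word
    inp = target[:, t]                          # teacher forcing: feed the target word,
                                                # not the model's own prediction
print(loss.item())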

 

Week 9. Named Entity Recognition and Coreference Resolution 

 

Q11. The IOB format categorizes tagged tokens as I, O and B. Why are three tags 

necessary? What problem would be caused if we used I and O tags exclusively? Give 

application examples to support your argument. ​(2 marks) 

 

Solution 

The IOB format (short for inside, outside, beginning) is a common tagging format for tagging tokens in a chunking task. If two chunks of the same type follow each other, with only I and O tags it would not be possible to make clear that they are two chunks instead of one chunk consisting of two words, nor where the first ends and the second begins. For example, consider NER tags using only the IO format for the sentence ‘Tom/I-PER, Amy/I-PER and/O Tony/I-PER went/O to/O Usyd/I-LOC’: the two adjacent words Tom and Amy cannot be distinguished as two separate chunks. This is solved by the IOB format: ‘Tom/B-PER, Amy/B-PER and/O Tony/B-PER went/O to/O Usyd/B-LOC’.
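A minimal sketch (our own code) that groups tags into chunks shows the difference: with IO-only tags the adjacent person names collapse into a single chunk, while the IOB tags keep them separate.

def chunks(tokens, tags):
    """Group tokens into (type, words) chunks; a B- tag always starts a new chunk."""
    out, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            current = None
            continue
        prefix, ctype = tag.split("-")
        if prefix == "B" or current is None or current[0] != ctype:
            current = (ctype, [tok])
            out.append(current)
        else:                       # I- tag continuing the current chunk
            current[1].append(tok)
    return [(t, " ".join(ws)) for t, ws in out]

tokens   = ["Tom", "Amy", "and", "Tony", "went", "to", "Usyd"]
io_tags  = ["I-PER", "I-PER", "O", "I-PER", "O", "O", "I-LOC"]
iob_tags = ["B-PER", "B-PER", "O", "B-PER", "O", "O", "B-LOC"]

print(chunks(tokens, io_tags))    # [('PER', 'Tom Amy'), ('PER', 'Tony'), ('LOC', 'Usyd')]
print(chunks(tokens, iob_tags))   # [('PER', 'Tom'), ('PER', 'Amy'), ('PER', 'Tony'), ('LOC', 'Usyd')]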

 

Week 10. Attention and Reading Comprehension 

 

Q12. Describe the main intuition behind attention in a neural network model. ​(2 marks) 

 

Solution 

Attention gives us a fixed-size summary (a weighted sum) of an arbitrary set of representations (the values), dependent on some other representation (the query). The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
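A minimal numpy sketch of this weighted-sum view, using dot-product attention; the toy query, keys and values are made up for illustration.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# toy values with their keys, and a query (all made up for illustration)
keys   = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
query  = np.array([1.0, 0.2])

weights = softmax(keys @ query)       # how much the query attends to each value
summary = weights @ values            # fixed-size weighted sum of the values
print(weights, summary)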

 

 

 

 

 

 

Week 11. Transformer and Machine Translation 

 

Q13. What is the motivation for using an attention-only model? ​(2 marks) 

 

Solution 

In an RNN, sequential and recursive computation prevents parallelization. However, an 

attention module allows us to search over any internal state of a model, so perhaps we do not 

need the RNN. 

 

 

 

Q14. Explain the motivation of a positional encoding as used in a Transformer network. 

(2 marks) 

 

Solution 

Position and order of words are essential parts of any language. They define the grammar and thus the actual semantics of a sentence. The Transformer architecture ditched the recurrence mechanism in favor of a multi-head self-attention mechanism. As each word in a sentence flows through the Transformer’s encoder/decoder stack simultaneously, the model itself has no sense of the position/order of each word. Consequently, we still need a way to incorporate the order of the words into the model.
A positional encoding is the solution: it gives the model a sense of order by adding to each word a piece of information about its position in the sentence.
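For reference, a minimal sketch of the sinusoidal positional encoding used in the original Transformer paper, where position pos and dimension index i give sin(pos / 10000^(2i/d)) for even dimensions and cos of the same angle for odd ones; the resulting vectors are added to the word embeddings.

import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                 # positions 0..max_len-1
    i = np.arange(d_model)[None, :]                   # dimension indices
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# word_embeddings + pe[:sentence_length] would be fed into the first encoder layer
print(pe.shape)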
