
 

COMP5046 Natural Language Processing 

 

*This sample exam paper contains 1 or 2 questions for each weekly topic. It is intended to show the structure and style of the final exam, including which questions are worth 2 marks and which are worth 5 marks. However, you are not expected to finish it in 1 hour as you would the final exam.

 

*The final exam will be an open-book, unsupervised exam. 

 

Week 1. Count-based Word Representation 

 

Q1. Calculate the TFIDF for the terms listed below for documents 1 to 4. There are 10,000 documents in the collection. The number of times each of these terms occurs in documents 1 to 4, as well as the number of documents in the collection containing each term, is stated below. Use this information to calculate the TFIDF scores. (2 marks)

 

Number of documents containing terms: 

● machine: 3 

● university: 50 

● headphone: 10 

● perfume: 3 

 

Term frequency (the number of times each of the 4 terms occurs in documents 1 to 4):

              Doc1   Doc2   Doc3   Doc4
machine          8     10      0      0
university       3      1      2      2
headphone        0      0      8      7
perfume          2      2      2      9

 

Solution 

IDF calculation (document frequency in parentheses; log is the natural logarithm)
● machine: 3 (IDF = log(10000/3) ≈ 8.11)
● university: 50 (IDF = log(10000/50) ≈ 5.30)
● headphone: 10 (IDF = log(10000/10) ≈ 6.91)
● perfume: 3 (IDF = log(10000/3) ≈ 8.11)

 

TFIDF calculation

              Doc1               Doc2               Doc3               Doc4
machine       8.11 * 8 = 64.88   8.11 * 10 = 81.10  0                  0
university    5.30 * 3 = 15.90   5.30 * 1 = 5.30    5.30 * 2 = 10.60   5.30 * 2 = 10.60
headphone     0                  0                  6.91 * 8 = 55.28   6.91 * 7 = 48.37
perfume       8.11 * 2 = 16.22   8.11 * 2 = 16.22   8.11 * 2 = 16.22   8.11 * 9 = 72.99
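As a sanity check, the table above can be reproduced with a short script. This is a minimal sketch following the weighting used in the solution (raw term frequency times IDF = log(N/df) with the natural logarithm, rounding the IDF to two decimals as in the table); the variable names are our own.

import math

N = 10_000  # documents in the collection

# document frequency of each term
df = {"machine": 3, "university": 50, "headphone": 10, "perfume": 3}

# raw term frequencies in Doc1..Doc4
tf = {
    "machine":    [8, 10, 0, 0],
    "university": [3, 1, 2, 2],
    "headphone":  [0, 0, 8, 7],
    "perfume":    [2, 2, 2, 9],
}

for term, counts in tf.items():
    idf = round(math.log(N / df[term]), 2)      # natural log, rounded as in the table
    scores = [round(c * idf, 2) for c in counts]
    print(f"{term:10s} IDF={idf:.2f}  TFIDF={scores}")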

 

Week 2. Word Embeddings and Representation 

 

Q2. Illustrate an advantage of FastText over Word2Vec. Give application examples to 

support your argument. ​(2 marks) 

 

Solution 

The main advantage of FastText embeddings over Word2Vec is that they take the internal structure of words into account while learning word representations, which is very useful for morphologically rich languages and for words that occur rarely. For example, for a word that appears in various forms such as “teaching”, “teacher” and “teached”, the internal structure shared by these forms can be learned and represented through the character n-grams on which FastText is based.
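To make the n-gram idea concrete, the sketch below (our own illustration, not part of the exam) extracts the character n-grams, with the '<' and '>' boundary markers FastText uses, for the word forms mentioned above; the 3-to-6 character range matches FastText's default setting.

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams FastText would use for `word`."""
    token = f"<{word}>"                      # boundary markers, as in FastText
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams

for w in ["teaching", "teacher", "teached"]:
    print(w, sorted(char_ngrams(w))[:5], "...")

# The three forms share n-grams such as "<te", "tea", "teac" and "each", so their
# vectors (sums of n-gram vectors) end up close to each other, even for rare forms.
print(char_ngrams("teaching") & char_ngrams("teacher") & char_ngrams("teached"))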

 

 

 

Q3. Illustrate 2 examples of how we can evaluate word vectors. For each example, 

please indicate whether it is intrinsic or extrinsic. ​(2 marks) 

 

Solution 

● Intrinsic: word vector analogies (see the sketch after this list); word vector distances and their correlation with human judgments; word clustering and categorization.
● Extrinsic: named entity recognition, i.e. finding persons, locations, organizations and so on; various text classification tasks such as the sentiment analysis we did in Assignment 1, i.e. identifying and extracting subjective information in the source text or document.
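As an illustration of the intrinsic analogy evaluation, the following minimal sketch answers "man : king :: woman : ?" by finding the word whose vector is closest, by cosine similarity, to vec(king) - vec(man) + vec(woman); the tiny 3-dimensional vectors are made up purely for this example.

import numpy as np

# toy word vectors (made up for illustration only)
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.2, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vecs["king"] - vecs["man"] + vecs["woman"]   # man : king :: woman : ?
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)   # "queen" with these toy vectors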

 

 

 

 

Week 3 and 4. Word Classification with Machine Learning 

 

Q4. In class, we learned that the family of recurrent neural networks has many important advantages and can be used in a variety of NLP tasks. For each of the following tasks and inputs, state how you would run an RNN to do that task; in particular, state: (5 marks)
1. how many outputs ŷ(t) there are, i.e. the number of times the softmax is called from your RNN (if the number of outputs is not fixed, state it as arbitrary)
2. what each ŷ(t) is a probability distribution over
3. which inputs are fed at each time step to produce each output

 

Task A: Named-Entity Recognition: ​For each word in a sentence, classify that word as 

either a person, organization, location, or none. (Inputs: A sentence containing n words) 

 

Task B: Sentiment Analysis: ​Classify the sentiment of a sentence ranging from negative to 

positive (integer values from 0 to 4). (Inputs: A sentence containing n words.) 

 

Solution 

Task A: Named Entity Recognition 

1. Number of outputs: n outputs.
2. Each ŷ(t) is a probability distribution over the 4 NER categories.
3. Each word in the sentence is fed into the RNN and one output is produced at every time step, corresponding to the predicted tag/category for that word.

Task B: Sentiment Analysis
1. Number of outputs: 1 output (n outputs is also acceptable if the model takes the average of all outputs).
2. Each ŷ(t) is a probability distribution over the 5 sentiment values.
3. Each word in the sentence is fed into the RNN and one output is produced from the hidden states (by taking only the final state, or the max or mean across all states), corresponding to the sentiment value of the sentence.
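A minimal PyTorch sketch of the two set-ups (the layer sizes, vocabulary size and plain nn.RNN cell are assumptions for illustration): Task A applies a softmax at every time step, Task B only to the final hidden state.

import torch
import torch.nn as nn

V, EMB, HID = 10_000, 50, 64          # vocab size and layer sizes (arbitrary)
emb = nn.Embedding(V, EMB)
rnn = nn.RNN(EMB, HID, batch_first=True)

ner_head = nn.Linear(HID, 4)          # person / organization / location / none
sent_head = nn.Linear(HID, 5)         # sentiment values 0..4

x = torch.randint(0, V, (1, 7))       # a batch with one 7-word sentence
states, h_n = rnn(emb(x))             # states: (1, 7, HID), h_n: (1, 1, HID)

# Task A: one distribution per word -> n outputs
ner_probs = torch.softmax(ner_head(states), dim=-1)      # shape (1, 7, 4)

# Task B: one distribution per sentence, from the final hidden state
sent_probs = torch.softmax(sent_head(h_n[-1]), dim=-1)   # shape (1, 5)

print(ner_probs.shape, sent_probs.shape)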

 

Q5. Assume that you build a sentiment analysis system that feeds a sentence into an RNN and then computes the sentiment class between 0 (very negative) and 4 (very positive) based only on the final hidden state of the RNN. Illustrate one advantage that an RNN would have over a neural window-based model for this task. (2 marks)

 

Solution 

There are multiple acceptable answers: an RNN can process inputs of arbitrary length; it can encode temporal information (“takes ordering into consideration” is only partially correct, because theoretically window-based models also can, although it is hard); it shares weights across time steps and therefore has fewer parameters. The number of parameters of a neural window-based network grows in proportion to the input size, whereas it stays constant for an RNN since the weights are shared at every time step.

Week 5. Language Fundamental 

 

Q6. Describe the difference between lemmatization and stemming. Give application examples to support your argument. (2 marks)

 

Solution 

Stemming is a procedure that reduces all words with the same stem to a common form, whereas lemmatization removes inflectional endings and returns the base or dictionary form of a word. For example, the words “trouble”, “troubling” and “troubled” may all be stemmed to “troubl” (not a valid English word) but will be lemmatized to “trouble” for comparison. Another good example of lemmatization is mapping the words “was” and “is” to “be”.
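A minimal sketch using NLTK's Porter stemmer and WordNet lemmatizer to reproduce the examples above (it assumes the WordNet data can be downloaded; exact outputs may vary slightly across NLTK versions):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)     # data needed by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["trouble", "troubling", "troubled"]:
    print(w, "->", stemmer.stem(w))                   # all three stem to "troubl"

for w in ["was", "is"]:
    print(w, "->", lemmatizer.lemmatize(w, pos="v"))  # both lemmatize to "be"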

 

Week 6. Part of Speech Tagging 

 

Q7. Find one tagging error in each of the following sentences that are tagged with 

the Penn Treebank tagset. Briefly explain why. ​(2 marks) 

1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN 

2. Can/MD I/PRP serve/VB the/DT dinner/NNS 

 

Solution 

1. Atlanta should be NNP. Atlanta is the capital city of the U.S. state of Georgia, so it is a proper noun, singular: a name used for an individual person, place, or organization, spelled with an initial capital letter.
2. Dinner should be NN. NN represents a noun, singular, whereas NNS is a noun, plural, and the word ‘dinner’ here is a singular noun.

 

 

Q8-a. A hidden Markov model includes states, observations, transition probabilities, and observation likelihoods. Describe what each of these corresponds to when using an HMM for POS tagging. (2 marks)

 

Solution 

● States: the POS tags at specific points in the sentence.
● Observations: the words that are observed as the sentence is read in.
● Transition probabilities: the probability of POS tag N following POS tag N-1.
● Observation likelihoods: the probability of seeing a particular word given a particular POS tag (state).

 

Q8-b. Given the sentence “I promise to back the bill.”, show how you would compute the probability of “back” as a verb versus the probability of “back” as a noun, using the probabilities in Tables a and b and the Viterbi algorithm. You are given the values for the third column of the Viterbi table, which correspond to observation 3, “to”: VB: 0, TO: .00000018, NN: 0, PRP: 0. Thus, you will show two computations, both of which use these values. You do not need to do the arithmetic; just show the formula that would be computed. (5 marks)
(*assume all verb tags are VB)

Table a. Observation likelihoods P(word | tag)

        I      promise   to     back
VB      0      .0093     0      .00008
TO      0      0         .99    0
NN      0      .0085     0      .00068
PRP     .37    0         0      0

Table b. Tag transition probabilities (rows = previous tag, columns = current tag; the first row, <s>, gives the probability of each tag at the start of the sentence)

        VB      TO       NN       PRP
<s>     .019    .0043    .041     .067
VB      .0038   .035     .047     .0070
TO      .83     0        .00047   0
NN      .0040   .016     .087     .0045
PRP     .23     .00079   .0012    .00014

 

 

Solution 

● back as a verb:
.00000018 * Prob(VB | TO) * Prob(back | VB) = .00000018 * .83 * .00008
● back as a noun:
.00000018 * Prob(NN | TO) * Prob(back | NN) = .00000018 * .00047 * .00068
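A minimal sketch of this single Viterbi step (our own code; since the other Viterbi values for observation 3 are 0, the max over previous tags reduces to the TO path):

# Viterbi values for observation 3 ("to"), as given in the question
v3 = {"VB": 0.0, "TO": 0.00000018, "NN": 0.0, "PRP": 0.0}

# relevant entries of Table b (transition) and Table a (observation likelihood)
trans = {("TO", "VB"): 0.83, ("TO", "NN"): 0.00047}         # P(tag_t | tag_{t-1})
emit  = {("VB", "back"): 0.00008, ("NN", "back"): 0.00068}  # P(word | tag)

for tag in ["VB", "NN"]:
    score = v3["TO"] * trans[("TO", tag)] * emit[(tag, "back")]
    print(f"back as {tag}: {score:.3e}")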

 

Week 7. Dependency Parsing 

 

Q9. State a sequence of transitions that makes a transition-based dependency parser produce the following dependency tree. Explain how to get the sequence of transitions. (5 marks)

(The dependency tree figure is not reproduced here. Based on the solution below, its arcs are 0→1, 1→2, 2→3, 2→5 and 5→4, where X→Y means X is the head of Y.)

Solution 

Suppose SH = Shift, RA = Right Arc, LA = Left Arc. 

SH SH SH SH RA SH SH LA RA RA RA 

 

In order to get this dependency tree using the arc-standard algorithm, we perform the following steps based on the three possible transitions (SH, RA, LA); a short code sketch replaying these transitions is given at the end of this solution:
Step 1. SH ROOT (0) onto the stack, while all the other words 1 to 5 remain in the buffer, as our initial state.
Step 2. SH word 1 from the buffer onto the stack.
Step 3. SH word 2 from the buffer onto the stack.
Step 4. SH word 3 from the buffer onto the stack.
Step 5. RA from 2 to 3 and remove 3 from the stack.
Step 6. SH word 4 from the buffer onto the stack.
Step 7. SH word 5 from the buffer onto the stack.
Step 8. LA from 5 to 4 and remove 4 from the stack.
Step 9. RA from 2 to 5 and remove 5 from the stack.
Step 10. RA from 1 to 2 and remove 2 from the stack.
Step 11. RA from 0 to 1 and remove 1 from the stack.

 

Head and modifier refer to the two words in a dependency relation: the head is the governor (parent) and the modifier is the dependent (daughter). Drawing an arrow for each dependency, the arrow points from the head to the modifier. For example, in the dependency between the words ‘red hat’, ‘red’ is the modifier while ‘hat’ is the head, so the arrow points from “hat” to “red”.

 

In your answer, include a detailed explanation of how to obtain the dependency tree using the definitions of SH, LA and RA, and of what the head and modifier are.
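To check the sequence, here is a minimal sketch (our own code, not part of the exam) that replays the transitions with a stack and a buffer and prints the arcs produced; RA makes the second item on the stack the head of the top item and pops the top, while LA does the reverse, matching the steps above.

def run_transitions(n_words, transitions):
    """Replay an arc-standard transition sequence and return the arcs (head, dependent)."""
    stack, buffer, arcs = [], list(range(0, n_words + 1)), []   # 0 is ROOT, initially in the buffer
    for t in transitions:
        if t == "SH":                       # shift: move front of buffer onto the stack
            stack.append(buffer.pop(0))
        elif t == "RA":                     # right arc: second-from-top is head of top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        elif t == "LA":                     # left arc: top is head of second-from-top
            head = stack.pop()
            dep = stack.pop()
            arcs.append((head, dep))
            stack.append(head)
        else:
            raise ValueError(t)
    return arcs

print(run_transitions(5, "SH SH SH SH RA SH SH LA RA RA RA".split()))
# -> [(2, 3), (5, 4), (2, 5), (1, 2), (0, 1)], matching Steps 5, 8, 9, 10 and 11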

Week 8. Language Model and Natural Language Generation 

 

Q10. When training a neural language model, we normally apply teacher forcing. Describe what the teacher forcing technique is. Give application examples to support your argument. (2 marks)

 

Solution 

Teacher forcing is the technique where the target word is passed as the next input to the decoder. Suppose we want to train a caption generation model and the ground-truth caption for an image is “Two people reading a book”. Our model makes a mistake at the 2nd word, predicting “Two” and “birds” as the 1st and 2nd words respectively. If we use teacher forcing, we feed the ground-truth word “people” to our RNN for the 3rd prediction, after computing and recording the loss for the 2nd prediction.

(Optional: without teacher forcing, we would feed “birds” back into our RNN to predict the 3rd word. Say the 3rd prediction is “flying”. Even though it makes sense for the model to predict “flying” given the input “birds”, the prediction drifts away from the ground truth.)
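A minimal PyTorch-style sketch of one decoder training step with teacher forcing (the vocabulary size, layer sizes, token ids and the GRUCell decoder are all assumptions for illustration):

import torch
import torch.nn as nn

V, EMB, HID = 1000, 32, 64                      # assumed vocabulary and layer sizes
emb = nn.Embedding(V, EMB)
gru = nn.GRUCell(EMB, HID)
out = nn.Linear(HID, V)
loss_fn = nn.CrossEntropyLoss()

# ground-truth token ids for "Two people reading a book" (made-up ids)
target = torch.tensor([[11, 42, 97, 5, 60]])
h = torch.zeros(1, HID)                         # initial decoder state
inp = torch.tensor([0])                         # <start> token id (assumed)

loss = 0.0
for t in range(target.size(1)):
    h = gru(emb(inp), h)                        # one decoding step
    logits = out(h)                             # distribution over the vocabulary
    loss = loss + loss_fn(logits, target[:, t]) # loss against the ground-truth word
    inp = target[:, t]                          # teacher forcing: feed the target word,
                                                # not the model's own prediction
print(loss.item())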

 

Week 9. Named Entity Recognition and Coreference Resolution 

 

Q11. The IOB format categorizes tagged tokens as I, O and B. Why are three tags 

necessary? What problem would be caused if we used I and O tags exclusively? Give 

application examples to support your argument. ​(2 marks) 

 

Solution 

The IOB format (short for inside, outside, beginning) is a common tagging format for tagging tokens in a chunking task. If two chunks of the same type follow each other, with only I and O tags it would not be possible to make clear that they are two chunks instead of one chunk consisting of two words, nor where the first ends and the second begins. For example, consider NER tags using only the IO format for the sentence ‘Tom/I-PER, Amy/I-PER and/O Tony/I-PER went/O to/O Usyd/I-LOC’: the two adjacent words Tom and Amy cannot be distinguished as two separate chunks. This is solved by the IOB format: ‘Tom/B-PER, Amy/B-PER and/O Tony/B-PER went/O to/O Usyd/B-LOC’.
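A minimal sketch (our own code) that groups tags into chunks shows the difference: with IO-only tags the adjacent person names collapse into a single chunk, while the IOB tags keep them separate.

def chunks(tokens, tags):
    """Group tokens into (type, words) chunks; a B- tag always starts a new chunk."""
    out, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            current = None
            continue
        prefix, ctype = tag.split("-")
        if prefix == "B" or current is None or current[0] != ctype:
            current = (ctype, [tok])
            out.append(current)
        else:                       # I- tag continuing the current chunk
            current[1].append(tok)
    return [(t, " ".join(ws)) for t, ws in out]

tokens   = ["Tom", "Amy", "and", "Tony", "went", "to", "Usyd"]
io_tags  = ["I-PER", "I-PER", "O", "I-PER", "O", "O", "I-LOC"]
iob_tags = ["B-PER", "B-PER", "O", "B-PER", "O", "O", "B-LOC"]

print(chunks(tokens, io_tags))    # [('PER', 'Tom Amy'), ('PER', 'Tony'), ('LOC', 'Usyd')]
print(chunks(tokens, iob_tags))   # [('PER', 'Tom'), ('PER', 'Amy'), ('PER', 'Tony'), ('LOC', 'Usyd')]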

 

Week 10. Attention and Reading Comprehension 

 

Q12. Describe the main intuition behind attention in a neural network model. ​(2 marks) 

 

Solution 

Attention gives us a fixed-size summary (a weighted sum) of an arbitrary set of representations (the values), dependent on some other representation (the query). The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
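A minimal numpy sketch of this weighted-sum view, using dot-product attention; the toy query, keys and values are made up for illustration.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# toy values with their keys, and a query (all made up for illustration)
keys   = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
query  = np.array([1.0, 0.2])

weights = softmax(keys @ query)       # how much the query attends to each value
summary = weights @ values            # fixed-size weighted sum of the values
print(weights, summary)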

 

 

 

 

 

 

Week 11. Transformer and Machine Translation 

 

Q13. What is the motivation for using an attention-only model? ​(2 marks) 

 

Solution 

In an RNN, sequential and recursive computation prevents parallelization. However, an 

attention module allows us to search over any internal state of a model, so perhaps we do not 

need the RNN. 

 

 

 

Q14. Explain the motivation of a positional encoding as used in a Transformer network. 

(2 marks) 

 

Solution 

Position and order of words are essential parts of any language. They define the grammar and thus the actual semantics of a sentence. The Transformer architecture ditched the recurrence mechanism in favor of a multi-head self-attention mechanism. As each word in a sentence flows through the Transformer’s encoder/decoder stack simultaneously, the model itself has no sense of the position/order of each word. Consequently, we still need a way to incorporate the order of the words into the model.
A positional encoding is the solution: it gives the model a sense of order by adding to each word a piece of information about its position in the sentence.
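For reference, a minimal sketch of the sinusoidal positional encoding used in the original Transformer paper, where position pos and dimension index i give sin(pos / 10000^(2i/d)) for even dimensions and cos of the same angle for odd ones; the resulting vectors are added to the word embeddings.

import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                 # positions 0..max_len-1
    i = np.arange(d_model)[None, :]                   # dimension indices
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# word_embeddings + pe[:sentence_length] would be fed into the first encoder layer
print(pe.shape)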
