NLP tasks and evaluation Flashcards

Lecture 1 (24 cards)

1
Q

What is NLP?

A

Making machines understand human language

2
Q

What is text classification? Give examples

A

Assigning a label to a piece of text, e.g.:
- Sentiment classification (is the text positive or negative?)
- SNLI (labeled by experts, in contrast to the IMDB dataset) = sentence-pair classification

3
Q

What is SNLI?

A

Stanford Natural Language Inference

We have two sentences and have to assign a label to the pair: entailment, contradiction, or neutral. SNLI is a Stanford dataset of about 570k sentence pairs, manually labeled for balanced classification.

4
Q

How to measure task subjectivity and annotation quality?

A

Using inter-annotator agreement, which takes chance agreement into account. Metrics such as Cohen’s Kappa or Scott’s Pi do this; they are rather pessimistic measures.
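
A minimal sketch of Cohen’s Kappa for two annotators (the function and the example labels below are made up for illustration):

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # observed agreement: fraction of items both annotators labeled the same
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected chance agreement, from each annotator's label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(["pos", "pos", "neg", "neg", "pos", "neg"],
                   ["pos", "neg", "neg", "neg", "pos", "pos"]))  # ~0.33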

5
Q

What is the gold standard? How is it produced?

A

This is the assumed true label, usually produced by expert annotators, which makes it very costly. It is typically obtained by majority voting among multiple annotators.
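
A tiny illustrative sketch of majority voting (the labels are made up; ties are broken arbitrarily here):

from collections import Counter

def majority_vote(annotations):
    # the most frequent label among the annotators becomes the gold label
    return Counter(annotations).most_common(1)[0][0]

print(majority_vote(["entailment", "neutral", "entailment"]))  # entailment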

6
Q

What is NER?

A

Named Entity Recognition: finding entities of predefined types within a text. For multi-token entities, a BEGINNING (B-) tag marks the start of each new entity, so that adjacent entities of the same type can be told apart.
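
A tiny made-up example of such tagging (the common BIO scheme: B- starts an entity, I- continues it, O = outside any entity):

tokens = ["Barack", "Obama", "visited", "New", "York"]
tags   = ["B-PER",  "I-PER", "O",       "B-LOC", "I-LOC"]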

7
Q

What is SuperGLUE?

A

It is a popular benchmark collection of 9 hard NLP tasks in English, used to evaluate any method/model capable of being applied to a broad range of language understanding tasks.

8
Q

What is RTE?

A

Recognizing Textual Entailment (RTE) is a binary classification task: does the text entail the hypothesis or not?

9
Q

What is the coreference resolution task?

A

Determining which noun phrase a pronoun refers to. In the Winograd Schema Challenge, each example contains a sentence with a pronoun and several noun phrases from that sentence, and the task is to pick the pronoun’s correct referent (who is “he”?). For example, in “The trophy doesn’t fit in the suitcase because it is too big”, “it” refers to the trophy. Requires everyday world knowledge to solve.

10
Q

What is the BoolQ task?

A

We have a yes/no question and a short passage and have to determine the answer to the question (yes or no). Requires difficult, entailment-like inference; questions can be complex and ambiguous.

11
Q

Explain the MultiRC task

A

Multi-Sentence Reading Comprehension:
Each example consists of a context paragraph, a question about the paragraph, and a list of candidate answers. The task is to label each answer as TRUE or FALSE (there can be multiple TRUE answers).

12
Q

Explain the SQuAD 2.0 task

A

We have a text together with unanswerable questions that have plausible but incorrect answers in the text. The model should recognize that the answer cannot be determined from the text instead of returning a wrong span.

13
Q

What are some text generation tasks in NLP?

A

Machine translation (from language 1 to language 2). Difficult because people can disagree on whether a translation is accurate or not.

Document summarization: given some text, produce a short summary.

PersonaChat: two personas, each with a short description, and the model should continue the conversation between them.

14
Q

What is classification as generation?

A

Nowadays you can ask a generative model to label some text, i.e. do classification via generation, e.g. “Is the sentiment positive or negative?”

15
Q

What is k-fold cross-validation?

A

We have a dataset of N rows and partition it into K chunks (folds). K-1 of them are used for training and 1 is used for testing (or validation). This is repeated K times, each time with a different fold used for testing, and the final metric is the average over these K trained models. Before each new split the model is reset (trained from scratch) to avoid data leakage (seeing test data during training).
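
A minimal sketch of the splitting logic (train_and_eval is a hypothetical function that trains a fresh model and returns its score on the test fold):

def k_fold_scores(rows, k, train_and_eval):
    folds = [rows[i::k] for i in range(k)]  # simple round-robin partition into K folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        # a fresh model is trained every round, so the test fold is never seen during training
        scores.append(train_and_eval(train, test))
    return sum(scores) / k  # average metric over the K runs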

16
Q

What is the confusion matrix?

A

For binary classification it is a 2x2 matrix of the counts TP, TN, FP, FN:
- TP: the model correctly predicted positive
- TN: the model correctly predicted negative
- FP: the model predicted positive, but the true label is negative
- FN: the model predicted negative, but the true label is positive
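
A minimal sketch of collecting these counts from binary predictions (the example labels are made up; 1 = positive, 0 = negative):

def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)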

17
Q

Explain the metric Accuracy. Why is it not enough to only use Accuracy?

A

It is the number of examples the model correctly classified (TP + TN) divided by the number of all examples (TP + TN + FP + FN).

It is misleading with huge class imbalance. If there are 1000 positive cases and 50 negative ones and the model always predicts positive, accuracy is about 95% (1000/1050), even though the model is trivial and has learned nothing.

18
Q

Explain Precision and Recall metrics

A

Precision is how many of the predicted positives (TP + FP) are actually positive (TP): P = TP / (TP + FP) - a measure of quality.
Recall is how many of the actual positives (TP + FN) are found (TP): R = TP / (TP + FN) - a measure of quantity (coverage).

Recall can be made very high by labeling all examples as positive (every actual positive is then found), but precision will drop.
If we only predict the super obvious cases (a small number of them), precision is high but recall is low.

That’s why we need a combined metric (the F1 score).

19
Q

What is F1 score?

A

It is a combination (the harmonic mean) of precision and recall:

F1 = 2 * (P * R) / (P + R)

It balances precision and recall and penalizes large gaps between them.

20
Q

How would you calculate Precision, Recall and F1 score?
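
A

Using the definitions from the previous cards: P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2 * P * R / (P + R). A minimal sketch (the example counts are made up):

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # quality: how many predicted positives are correct
    recall = tp / (tp + fn)      # coverage: how many actual positives were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)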

21
Q

What changes in the confusion matrix with multi-class classification?

A

The matrix becomes K x K for K classes. For per-class metrics we can reduce it to 2x2 by treating that class as positive and all other classes as negative (one-vs-rest).
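
A minimal sketch of this per-class (one-vs-rest) reduction (the labels are made up):

def one_vs_rest_counts(y_true, y_pred, cls):
    # treat `cls` as the positive class and every other class as negative
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

print(one_vs_rest_counts(["cat", "dog", "bird"], ["cat", "bird", "bird"], "bird"))  # (1, 1, 0, 1)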

23
Q

How do we evaluate MT systems?

A

We can use metrics like BLEU (Bilingual Evaluation Understudy) for Machine Translation (MT) tasks.

It computes a precision-based score from the n-gram overlap between the reference and the generated text. In particular, BLEU is the ratio of the number of overlapping n-grams to the total number of n-grams in the hypothesis.

Cons: translation is ambiguous (several different translations can be correct), and BLEU doesn’t take recall into account.
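
A heavily simplified sketch of the core computation (clipped n-gram precision; real BLEU combines several n-gram orders and adds a brevity penalty):

from collections import Counter

def ngram_precision(hypothesis, reference, n=1):
    # hypothesis / reference are token lists
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # overlapping n-grams, clipped by how often they occur in the reference
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / sum(hyp.values())

print(ngram_precision("the cat sat on the mat".split(),
                      "the cat is on the mat".split()))  # 5/6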

24
Q

Explain ROUGE evaluation metric

A

Recall-Oriented Understudy for Gisting Evaluation:
Similar to BLEU, but recall-based instead of precision-based: it also counts overlapping n-grams, but relative to the total number of n-grams in the reference.
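
A matching sketch for a ROUGE-N style recall: the same overlap count as above, but divided by the number of n-grams in the reference instead of the hypothesis:

from collections import Counter

def ngram_recall(hypothesis, reference, n=1):
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, hyp[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())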