NLP tasks and evaluation Flashcards

Lecture 1 (24 cards)

1
Q

What is NLP?

A

Making machines understand human language

2
Q

What is text classification? Give examples

A

Assigning a label to a piece of text, e.g.:
- Sentiment classification (is the text positive or negative?)
- SNLI (labeled by experts, in contrast to the IMDB dataset) = sentence-pair classification

3
Q

What is SNLI?

A

Stanford Natural Language Inference

We have two sentences and have to assign a label to the pair: entailment, contradiction, or neutral. SNLI is a Stanford dataset of about 570k sentence pairs, manually labeled for balanced classification.

4
Q

How to measure task subjectivity and annotation quality?

A

Using inter-annotator agreement, which takes chance agreement into account. Metrics such as Cohen’s Kappa or Scott’s Pi do this; they are rather pessimistic measures.
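
A minimal sketch of Cohen’s Kappa for two annotators (the function and the example labels below are made up for illustration):

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # observed agreement: fraction of items both annotators labeled the same
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected chance agreement, from each annotator's label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(["pos", "pos", "neg", "neg", "pos", "neg"],
                   ["pos", "neg", "neg", "neg", "pos", "pos"]))  # ~0.33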

5
Q

What is the gold standard? How is it produced?

A

This is the assumed true label, usually produced by expert annotators, which makes it very costly. It is typically obtained by majority voting among multiple annotators.
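
A tiny illustrative sketch of majority voting (the labels are made up; ties are broken arbitrarily here):

from collections import Counter

def majority_vote(annotations):
    # the most frequent label among the annotators becomes the gold label
    return Counter(annotations).most_common(1)[0][0]

print(majority_vote(["entailment", "neutral", "entailment"]))  # entailment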

6
Q

What is NER?

A

Named Entity Recognition: finding entities of predefined types within a text. For multi-token entities, a BEGINNING (B-) tag marks the start of each new entity, so that adjacent entities of the same type can be told apart.
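
A tiny made-up example of such tagging (the common BIO scheme: B- starts an entity, I- continues it, O = outside any entity):

tokens = ["Barack", "Obama", "visited", "New", "York"]
tags   = ["B-PER",  "I-PER", "O",       "B-LOC", "I-LOC"]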

7
Q

What is SuperGLUE?

A

It is a popular benchmark collection of 9 hard NLP tasks in English, used to evaluate any method/model capable of being applied to a broad range of language understanding tasks.

8
Q

What is RTE?

A

Recognizing Textual Entailment (RTE) is a binary classification task: does the text entail the hypothesis or not?

9
Q

What is the coreference resolution task?

A

Determining which noun phrase a pronoun refers to. In the Winograd Schema Challenge, each example contains a sentence with a pronoun and several noun phrases from that sentence, and the task is to pick the pronoun’s correct referent (who is “he”?). For example, in “The trophy doesn’t fit in the suitcase because it is too big”, “it” refers to the trophy. Requires everyday world knowledge to solve.

10
Q

What is the BoolQ task?

A

We have a yes/no question and a short passage and have to determine the answer to the question (yes or no). Requires difficult, entailment-like inference; questions can be complex and ambiguous.

11
Q

Explain the MultiRC task

A

Multi-Sentence Reading Comprehension:
Each example consists of a context paragraph, a question about the paragraph, and a list of candidate answers. The task is to label each answer as TRUE or FALSE (there can be multiple TRUE answers).

12
Q

Explain the SQuAD 2.0 task

A

We have a text together with unanswerable questions that have plausible but incorrect answers in the text. The model should recognize that the answer cannot be determined from the text instead of returning a wrong span.

13
Q

What are some text generation tasks in NLP?

A

Machine translation (from language 1 to language 2). Difficult because people can disagree on whether a translation is accurate or not.

Document summarization: given some text, produce a short summary.

PersonaChat: two personas, each with a short description, and the model should continue the conversation between them.

14
Q

What is classification as generation?

A

Nowadays you can ask a generative model to label some text, i.e. do classification via generation, e.g. “Is the sentiment positive or negative?”

15
Q

What is k-fold cross-validation?

A

We have a dataset of N rows and partition it into K chunks (folds). K-1 of them are used for training and 1 is used for testing (or validation). This is repeated K times, each time with a different fold used for testing, and the final metric is the average over these K trained models. Before each new split the model is reset (trained from scratch) to avoid data leakage (seeing test data during training).
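
A minimal sketch of the splitting logic (train_and_eval is a hypothetical function that trains a fresh model and returns its score on the test fold):

def k_fold_scores(rows, k, train_and_eval):
    folds = [rows[i::k] for i in range(k)]  # simple round-robin partition into K folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        # a fresh model is trained every round, so the test fold is never seen during training
        scores.append(train_and_eval(train, test))
    return sum(scores) / k  # average metric over the K runs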

16
Q

What is the confusion matrix?

A

For binary classification it is a 2x2 matrix of the counts TP, TN, FP, FN:
- TP: the model correctly predicted positive
- TN: the model correctly predicted negative
- FP: the model predicted positive, but the true label is negative
- FN: the model predicted negative, but the true label is positive
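
A minimal sketch of collecting these counts from binary predictions (the example labels are made up; 1 = positive, 0 = negative):

def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)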

17
Q

Explain the metric Accuracy. Why is it not enough to only use Accuracy?

A

It is the number of examples the model correctly classified (TP + TN) divided by the number of all examples (TP + TN + FP + FN).

It is misleading with huge class imbalance. If there are 1000 positive cases and 50 negative ones and the model always predicts positive, accuracy is about 95% (1000/1050), even though the model is trivial and has learned nothing.

18
Q

Explain Precision and Recall metrics

A

Precision is how many of the predicted positives (TP + FP) are actually positive (TP): P = TP / (TP + FP) - a measure of quality.
Recall is how many of the actual positives (TP + FN) are found (TP): R = TP / (TP + FN) - a measure of quantity (coverage).

Recall can be made very high by labeling all examples as positive (every actual positive is then found), but precision will drop.
If we only predict the super obvious cases (a small number of them), precision is high but recall is low.

That’s why we need a combined metric (the F1 score).

19
Q

What is F1 score?

A

It is a combination (the harmonic mean) of precision and recall:

F1 = 2 * (P * R) / (P + R)

It balances precision and recall and penalizes large gaps between them.

20
Q

How would you calculate Precision, Recall and F1 score?
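
A

Using the definitions from the previous cards: P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2 * P * R / (P + R). A minimal sketch (the example counts are made up):

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # quality: how many predicted positives are correct
    recall = tp / (tp + fn)      # coverage: how many actual positives were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)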

21
Q

What changes in the confusion matrix with multi-class classification?

A

The matrix becomes K x K for K classes. For per-class metrics we can reduce it to 2x2 by treating that class as positive and all other classes as negative (one-vs-rest).
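
A minimal sketch of this per-class (one-vs-rest) reduction (the labels are made up):

def one_vs_rest_counts(y_true, y_pred, cls):
    # treat `cls` as the positive class and every other class as negative
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

print(one_vs_rest_counts(["cat", "dog", "bird"], ["cat", "bird", "bird"], "bird"))  # (1, 1, 0, 1)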

23
Q

How do we evaluate MT systems?

A

We can use metrics like BLEU (Bilingual Evaluation Understudy) for Machine Translation (MT) tasks.

It computes a precision-based score from the n-gram overlap between the reference and the generated text. In particular, BLEU is the ratio of the number of overlapping n-grams to the total number of n-grams in the hypothesis.

Cons: translation is ambiguous (several different translations can be correct), and BLEU doesn’t take recall into account.
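
A heavily simplified sketch of the core computation (clipped n-gram precision; real BLEU combines several n-gram orders and adds a brevity penalty):

from collections import Counter

def ngram_precision(hypothesis, reference, n=1):
    # hypothesis / reference are token lists
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # overlapping n-grams, clipped by how often they occur in the reference
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / sum(hyp.values())

print(ngram_precision("the cat sat on the mat".split(),
                      "the cat is on the mat".split()))  # 5/6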

24
Q

Explain ROUGE evaluation metric

A

Recall-Oriented Understudy for Gisting Evaluation:
Similar to BLEU, but recall-based instead of precision-based: it also counts overlapping n-grams, but relative to the total number of n-grams in the reference.
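
A matching sketch for a ROUGE-N style recall: the same overlap count as above, but divided by the number of n-grams in the reference instead of the hypothesis:

from collections import Counter

def ngram_recall(hypothesis, reference, n=1):
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, hyp[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())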