Evaluation and Linguistic Resources Flashcards

1
Q

What are extrinsic evaluations?

A

They evaluate the performance of an NLP component by embedding it in an application and measuring how much the whole application improves

2
Q

What is intrinsic evaluation?

A

It measures the quality of an NLP component independent of any application

3
Q

How does classic AI differ from modern AI?

A

Classic AI is based on patterns, prescriptive grammars and symbolic rules, whereas modern AI infers statistical patterns and rules by examining large quantities of text

4
Q

Why must we use a test and training split of data?

A

To check whether the model has learned the signal rather than the noise in the training data, i.e. whether it still works on data it was not trained on

5
Q

What type of data splits can you have?

A

Training Dataset - used to fit the model

Validation Dataset - used to evaluate a model fitted on the training dataset while tuning hyperparameters

Test Dataset - used to evaluate the final model fitted on the training dataset
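
The three splits above can be sketched in a few lines. This is only an illustration: the function name `split_dataset` and the 80/10/10 proportions are assumptions, not something the card prescribes.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle items and cut them into train / validation / test lists.

    The fractions are illustrative assumptions; whatever is left after
    the train and validation cuts becomes the test set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The test list is set aside and only touched once, after hyperparameters have been chosen using the validation list.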

6
Q

What size of corpus is better?

A

The bigger the corpus, the more varied the language and the more word types it contains, so the better it is as training data

7
Q

What is cross-validation?

A

It is a method of partitioning data for training and testing.

8
Q

Explain how cross-validation works

A

You divide the data into k folds. In each round, you train on all but one of the folds and test on the held-out fold. This is repeated k times, with a different fold used for testing each time. The average error rate across the k rounds can then be calculated
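
A minimal sketch of the k-fold procedure, working on item indices only. The names `k_fold_indices`, `cross_validate` and `error_fn` are hypothetical; `error_fn` stands in for "train on these indices, test on those, return the error rate".

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(n_items, k, error_fn):
    """Average the error returned by error_fn over the k folds."""
    errors = [error_fn(train, test) for train, test in k_fold_indices(n_items, k)]
    return sum(errors) / len(errors)

for train_idx, test_idx in k_fold_indices(10, 3):
    print(test_idx)
```

Every item appears in exactly one test fold, which is what distinguishes this from repeated random splits.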

9
Q

What are some pros and cons of cross-validation?

A

Pros - less biased error measure compared to a single test set

Cons - can be time consuming and computationally expensive, because the model is retrained for every fold (worse when the number of folds or the dataset is large)

10
Q

What is the random subsampling variation on cross-validation?

A

It is similar to k-fold, but on each iteration a randomly chosen proportion of the dataset is used as the test set. The pros are that the size of the train/test split does not depend on the number of iterations, and it may be more robust to selection bias. The cons are that some data points may never be selected for testing, while others may be selected multiple times
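
A rough sketch of random subsampling, again on indices only. The function name and the default `test_frac=0.2` are assumptions made for illustration.

```python
import random

def random_subsampling_splits(n_items, n_iterations, test_frac=0.2, seed=0):
    """Yield (train_indices, test_indices) for each random split.

    The test size depends only on test_frac, not on n_iterations.
    """
    rng = random.Random(seed)
    n_test = max(1, int(n_items * test_frac))
    for _ in range(n_iterations):
        test_idx = sorted(rng.sample(range(n_items), n_test))
        test_set = set(test_idx)
        train_idx = [i for i in range(n_items) if i not in test_set]
        yield train_idx, test_idx

# Count how often each item lands in a test set; some counts may be 0 or > 1.
counts = [0] * 20
for _, test_idx in random_subsampling_splits(20, 5):
    for i in test_idx:
        counts[i] += 1
print(counts)
```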

11
Q

What is the ROUGE metric?

A

It is a text summarisation metric. It compares machine-generated summaries against a gold-standard set of summaries written by humans. It looks for word sequences (n-grams) that the two summaries have in common. It is recall-oriented
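
The recall orientation can be illustrated with a simplified ROUGE-N: the share of reference n-grams that also appear in the machine summary. This sketch assumes a single reference and whitespace tokenisation; the function names are made up for the example and it is not the official ROUGE toolkit.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Overlapping n-grams divided by the number of n-grams in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # min count per shared n-gram
    total = sum(ref.values())
    return overlap / total if total else 0.0

machine = "the cat sat on the mat"
human = "the cat lay on the mat"
print(rouge_n_recall(machine, human, n=1))
print(rouge_n_recall(machine, human, n=2))
```

Here 5 of the 6 reference unigrams and 3 of the 5 reference bigrams are recovered.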

12
Q

What is the BLEU metric?

A

It is a metric used for machine translation (MT). The idea is that a good machine translation will share word sequences with a human-generated translation. It is precision-based, as it looks only at how much of the machine output matches the reference. It also penalises translations that are much shorter than the reference translations (the brevity penalty)
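
A stripped-down sketch of the two ingredients named above: clipped n-gram precision and the brevity penalty. It assumes one reference, whitespace tokens and n-grams up to length 2, whereas the usual BLEU setup is corpus-level with n-grams up to 4 and possibly several references. The name `simple_bleu` is hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        clipped = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Penalise candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))
print(simple_bleu("the cat", "the cat sat on the mat"))
```

The second call has perfect n-gram precision, yet scores low because the candidate is only a third of the reference length.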

13
Q

What is Perplexity?

A

It is used to evaluate language models. It measures how well a model predicts a target text, based on the probability the model assigns to the words of the text appearing in that order. The aim is to minimise the perplexity score
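
As a sketch, perplexity can be computed from the probability a model assigns to each word in the text: it is the inverse of the geometric mean of those probabilities. The input list here is assumed to come from some language model; no particular model is implied.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability per token.

    token_probs: the probability the model gave each token of the text,
    in order (all values must be > 0).
    """
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

# A model that spreads its bets evenly over 4 words at every step
# has a perplexity of about 4; a more confident, correct model scores lower.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(perplexity([0.9, 0.8, 0.95, 0.85]))
```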

14
Q

What does recall measure?

A

It measures the proportion of relevant items that were selected from all the relevant items

(Items that were correctly classified compared to all the items that should have been classified)

15
Q

What does precision measure?

A

It measures the proportion of relevant items that were selected compared to all the items that were selected

(Items that were correctly classified compared to all the items that were classified)
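
Recall and precision share a numerator (the relevant items that were selected) and differ only in the denominator. A small sketch, with the function name and example sets invented for illustration:

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected items."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# 4 items selected, 3 items relevant, 2 in common.
p, r = precision_recall(selected={1, 2, 3, 4}, relevant={2, 3, 5})
print(p, r)
```

Precision here is 2 out of 4 selected items; recall is 2 out of 3 relevant items.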

16
Q

How do we balance Precision and Recall?

A

We use the F-measure which is the weighted harmonic mean of precision and recall.

F1 balances both factors equally (F1 = 2PR / (P + R))
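
A sketch of the weighted form, using the common F-beta parameterisation (an assumption, since the card only gives F1): `beta=1` reduces to 2PR / (P + R), and larger `beta` weights recall more heavily.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 1.0))          # F1
print(f_measure(0.5, 1.0, beta=2))  # recall-weighted
```

Because the mean is harmonic, a low value on either side pulls the score down more than an arithmetic average would.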

17
Q

What do we do when multiple classes exist?

A

We can calculate the metric separately for each class and then average the per-class results (macroaveraging) - treats every class equally, so it is more balanced than micro

Or we can calculate the metric once on the data pooled across all classes (microaveraging) - dominated by frequent classes
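
The difference shows up clearly with one frequent class and one rare class. The sketch below uses precision as the metric; the function name and the counts are made up for illustration.

```python
def macro_micro_precision(per_class_counts):
    """per_class_counts maps class name -> (true_positives, false_positives)."""
    # Macro: compute precision per class, then average the class scores.
    per_class = [tp / (tp + fp) for tp, fp in per_class_counts.values() if tp + fp]
    macro = sum(per_class) / len(per_class) if per_class else 0.0
    # Micro: pool the counts over all classes, then compute precision once.
    total_tp = sum(tp for tp, _ in per_class_counts.values())
    total_fp = sum(fp for _, fp in per_class_counts.values())
    micro = total_tp / (total_tp + total_fp) if total_tp + total_fp else 0.0
    return macro, micro

counts = {"common": (90, 10), "rare": (1, 9)}
macro, micro = macro_micro_precision(counts)
print(macro, micro)
```

The macro average is pulled down to 0.5 by the poorly handled rare class, while the micro average stays above 0.8 because the frequent class contributes most of the pooled counts.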