N-gram Language Models Flashcards

1
Q

Language Models

A

Models that assign probabilities to sequences of words.

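For example, by the chain rule P(w₁ … wₙ) = P(w₁) · P(w₂ | w₁) · P(w₃ | w₁ w₂) · … · P(wₙ | w₁ … wₙ₋₁); an n-gram language model approximates each of these conditionals using only a short window of preceding words.
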
2
Q

N-gram

A

A sequence of n words.

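A minimal sketch of extracting the n-grams from a tokenised sentence (illustrative code; the function name ngrams is just for this example):

def ngrams(tokens, n):
    # all contiguous windows of n words
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

ngrams("I saw a cat".split(), 2)   # [('I', 'saw'), ('saw', 'a'), ('a', 'cat')]
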
3
Q

Markov models

A

A class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

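For an n-gram model of order k this is the approximation P(wᵢ | w₁ … wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₖ₊₁ … wᵢ₋₁), i.e. only the last k − 1 words matter; a bigram model uses just P(wᵢ | wᵢ₋₁).
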
4
Q

Extrinsic Evaluation

A

An end-to-end evaluation.

E.g. embedding the model in an application and measuring how much the application improves.

5
Q

Intrinsic Evaluation Metric

A

A metric that measures the quality of a model independent of any application.

6
Q

Perplexity

A

The perplexity (PP) of a language model on a test set is the inverse probability of the test set, normalised by the number of words.

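Equivalently, PP(W) = P(w₁ … wₙ)^(−1/n) for a test set of n words. A minimal sketch of the computation from per-word log probabilities (illustrative code; the name log2_probs and the example values are assumptions):

import math

def perplexity(log2_probs):
    # log2_probs: log base-2 probability of each test word given its context
    n = len(log2_probs)
    return 2 ** (-sum(log2_probs) / n)

# e.g. three words, each predicted with probability 0.25:
perplexity([math.log2(0.25)] * 3)   # 4.0
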
7
Q

OOV rate

A

The percentage of words in the test set that are out of vocabulary (i.e. do not appear in the training vocabulary).

8
Q

Open vocabulary system

A

A system in which we model potential unknown words in the test set by adding a pseudo-word called <UNK>.

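A minimal sketch of one common recipe (illustrative code; the function names and the frequency threshold of 2 are assumptions): fix a vocabulary from the training data, replace rare training words with <UNK>, train as usual, then map any unknown test word to <UNK> as well.

from collections import Counter

def build_vocab(train_tokens, min_count=2):
    # keep only words seen at least min_count times in training
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def map_unk(tokens, vocab):
    # replace out-of-vocabulary words with the <UNK> pseudo-word
    return [w if w in vocab else "<UNK>" for w in tokens]
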
9
Q

Smoothing

A

A method of dealing with words that are in our vocabulary, but appear in the test set in an unknown context (e.g. after a word they never appeared after in training).

To keep the model from assigning zero probability, shave off a bit of probability mass from some more frequent events and give it to events never seen.

10
Q

Laplace Smoothing

A

The simplest smoothing technique.

Add one to all the n-gram counts, before normalising them into probabilities.

All the counts that used to be zero will now have a count of 1, counts of 1 will be 2, etc.

a.k.a. add-one smoothing

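For a bigram model with vocabulary size V this gives P(wᵢ | wᵢ₋₁) = (C(wᵢ₋₁ wᵢ) + 1) / (C(wᵢ₋₁) + V). A minimal sketch (illustrative code; bigram_counts and unigram_counts are assumed to be plain count dictionaries):

def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, V):
    # add one to every bigram count; add V to the context count to renormalise
    return (bigram_counts.get((w_prev, w), 0) + 1) / (unigram_counts.get(w_prev, 0) + V)
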
11
Q

add-k smoothing

A

Like Laplace smoothing, but instead of adding 1 to each count, we add a fractional count k (e.g. 0.5, 0.05, 0.01).

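I.e. for a bigram model with vocabulary size V, P(wᵢ | wᵢ₋₁) = (C(wᵢ₋₁ wᵢ) + k) / (C(wᵢ₋₁) + kV); k is typically tuned on a development set.
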
12
Q

backoff smoothing

A

We use the trigram if the evidence is sufficient, otherwise use the bigram, otherwise the unigram.

I.e. we only “back off” to a lower-order n-gram if we have zero evidence for a higher order n-gram.

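A minimal, un-discounted sketch of the backoff decision (illustrative code only; tri, bi and uni are assumed count dictionaries, and a proper backoff model such as Katz backoff would also discount and weight the lower-order estimates, as the later cards note):

def backoff_prob(w3, w2, w1, tri, bi, uni, total):
    # use the trigram estimate if we have seen the trigram at all
    if tri.get((w1, w2, w3), 0) > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    # otherwise back off to the bigram, then to the unigram
    if bi.get((w2, w3), 0) > 0:
        return bi[(w2, w3)] / uni[w2]
    return uni.get(w3, 0) / total   # total = number of training tokens
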
13
Q

interpolation smoothing

A

We mix the probability estimates from all the n-gram estimators, weighting and combining the trigram, bigram and unigram counts.

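E.g. simple linear interpolation for a trigram model: P̂(wᵢ | wᵢ₋₂ wᵢ₋₁) = λ₁ P(wᵢ | wᵢ₋₂ wᵢ₋₁) + λ₂ P(wᵢ | wᵢ₋₁) + λ₃ P(wᵢ), with the λs summing to 1 and typically set to maximise the probability of held-out data.
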
14
Q

Discount for a backoff model

A

In order for a backoff model to give a correct probability distribution, we have to discount the higher-order n-grams to save some probability mass for the lower order n-grams.

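A small worked example, assuming absolute discounting with d = 0.75: a trigram seen 4 times in a context seen 10 times gets probability (4 − 0.75) / 10 = 0.325 instead of 4 / 10 = 0.4; the mass shaved off all trigrams in that context (0.75 × the number of distinct trigram types seen there / 10) is what gets redistributed to the lower-order distribution we back off to.
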
15
Q

P_CONTINUATION

A

Instead of P(w), it tries to answer the question, “How likely is w to appear as a novel continuation?”

16
Q

Kneser-Ney intuition for P_CONTINUATION

A

Base our estimate of P_CONTINUATION on the number of different contexts the word w has appeared in.

I.e. the number of bigram types it completes.

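Concretely, P_CONTINUATION(w) = |{v : C(v w) > 0}| / |{(u, v) : C(u v) > 0}|, i.e. the number of distinct bigram types that w completes, normalised by the total number of distinct bigram types.
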
17
Q

Stupid backoff

A

Stupid backoff does not try to make the language model a true probability distribution. There is no discounting of higher-order probabilities.

If a higher-order n-gram has a zero count, we back off to a lower-order n-gram, weighted by a fixed (context-independent) weight.

S(wᵢ | wᵢ₋ₖ₊₁ … wᵢ₋₁) =

count(wᵢ₋ₖ₊₁ … wᵢ) /
count(wᵢ₋ₖ₊₁ … wᵢ₋₁)

> if count(wᵢ₋ₖ₊₁ … wᵢ) > 0

λ S(wᵢ | wᵢ₋ₖ₊₂ … wᵢ₋₁)

> otherwise

The backoff terminates in the unigram, which has probability

S(w) = count(w) / N

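A minimal sketch of the recursion (illustrative code; counts is assumed to map every n-gram tuple of every order to its training count, and λ = 0.4 is a commonly used default):

def stupid_backoff(words, counts, total, lam=0.4):
    # words: tuple (w_{i-k+1}, ..., w_i); total: number of training tokens N
    if len(words) == 1:
        return counts.get(words, 0) / total           # unigram: count(w) / N
    if counts.get(words, 0) > 0:
        return counts[words] / counts[words[:-1]]     # relative frequency of the full n-gram
    return lam * stupid_backoff(words[1:], counts, total, lam)   # back off, weighted by λ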