Language Modelling Flashcards
What is a language model?
It is a model that assigns probabilities to sequences of words
What is the most basic of language models?
The n-gram model, which assigns probabilities to sentences and sequences of words
What can n-gram models be used for?
Estimate the probability of the last word of an n-gram given the previous words, and assign probabilities to entire sequences
Where can language models be used?
Speech Recognition
Spelling correction or grammatical error correction
Machine translation
Augmentative and alternative communication systems
What does a unigram assume?
That the occurrence of each word is independent of the other words
P(w1, w2, w3, w4) = P(w1) * P(w2) * P(w3) * P(w4)
It assumes that the previous words have no influence on the next word
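A minimal sketch of this assumption in code (the toy corpus and helper names are illustrative assumptions, not part of the flashcards): the sentence probability is just the product of individual word probabilities.

```python
from collections import Counter

# Toy corpus; in practice this would be a large representative corpus.
corpus = "the cat sat on the mat the cat ate".split()

counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(word):
    # P(w) estimated by relative frequency in the corpus.
    return counts[word] / total

def sentence_prob(sentence):
    # Independence assumption: P(w1, ..., wn) = P(w1) * ... * P(wn).
    p = 1.0
    for w in sentence.split():
        p *= unigram_prob(w)
    return p

print(sentence_prob("the cat sat"))
```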
What do n-gram models tell us?
The probability of the next word in the text is dependent on the previous n-1 words in the text.
If we have a word w and some history h, we want to find out, of the times that h occurred, how many times it was followed by w.
In general, how do n-gram probabilities work?
P(w1) = P(w1)
P(w1,w2) = P(w2 | w1) * P(w1)
P(w1, …, wn) = P(wn | w1, …, wn-1) * P(w1, …, wn-1)
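For example (a worked instance of the chain rule, with a three-word sequence chosen purely for illustration):
P(w1, w2, w3) = P(w3 | w1, w2) * P(w2 | w1) * P(w1)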
How does the bigram model approximate the probability of the next word?
We can use the probability of the next word given the previous word:
P(wn | wn-1)
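Under this approximation, the probability of a whole sequence becomes a product of bigram probabilities; for example, for a three-word sequence (again chosen purely for illustration):
P(w1, w2, w3) ≈ P(w1) * P(w2 | w1) * P(w3 | w2)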
What assumption is made with the n-gram model?
The Markov Assumption - the probability of a word depends only on the previous word (for a bigram model); this can be generalised to trigrams and so on
What does MLE stand for?
Maximum Likelihood Estimation
What does MLE do?
It is the process of choosing the right set of bigram parameters to make our model correctly predict (maximise the likelihood of) the nth word in the text
How do we obtain the MLE for the parameters of an n-gram model?
We observe the n-gram counts in a representative corpus and normalise them (dividing by a total count) so they lie between 0 and 1
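A minimal sketch of the MLE estimate for bigram parameters, P(wn | wn-1) = C(wn-1 wn) / C(wn-1) (the toy corpus and helper names below are assumptions for illustration):

```python
from collections import Counter

# Toy corpus with sentence-boundary markers; a real model would use a large corpus.
corpus = "<s> I am Sam </s> <s> Sam I am </s> <s> I like ham </s>".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_mle(prev, word):
    # MLE: count of the bigram, normalised by the count of the previous word.
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("I", "am"))   # C("I am") / C("I") = 2/3
```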
What format is used when computing language model probabilities?
We use log probabilities, so we can add them instead of multiplying, which also avoids numerical underflow on long sequences
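A small sketch of why this works (the probability values are made up): summing logs gives the same result as multiplying the raw probabilities, but does not underflow for long sequences.

```python
import math

# Hypothetical per-word probabilities for a sequence.
probs = [0.1, 0.05, 0.2, 0.01]

product = 1.0
log_sum = 0.0
for p in probs:
    product *= p            # direct product: shrinks towards zero quickly
    log_sum += math.log(p)  # log-space sum: stays in a safe numeric range

print(product)               # 1e-05
print(math.exp(log_sum))     # same value, recovered from the log-space sum
```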
What are some n-gram problems?
Even if we have a large corpus, only a tiny minority of possible n-grams exist in any corpus. The probability of a word appearing given the previous word is 0 for most sequences due to the sparsity of the matrix.
What do we use to keep a language model from assigning a zero probability to unseen n-grams?
A number of smoothing/discounting methods.
What does Laplace smoothing do?
It adds 1 to all the bigram counts before we normalise them into probabilities, meaning that counts that were 0 become 1, 1s become 2s, and so on.
What is another name for Laplace smoothing?
Add-one smoothing
How do we normalise the count when we do Laplace smoothing?
We add one to the observed bigram count, and divide by the number of occurrences of the previous word plus V, where V is the number of words in the vocabulary
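A minimal sketch of Laplace-smoothed bigram probabilities, P(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V), using the same toy corpus as above (an assumption for illustration):

```python
from collections import Counter

corpus = "<s> I am Sam </s> <s> Sam I am </s> <s> I like ham </s>".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)  # vocabulary size: number of distinct word types

def bigram_laplace(prev, word):
    # Add one to every bigram count; add V to the denominator to keep a valid distribution.
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(bigram_laplace("I", "am"))    # seen bigram: (2 + 1) / (3 + V)
print(bigram_laplace("I", "ham"))   # unseen bigram: (0 + 1) / (3 + V), no longer zero
```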
What can we do instead of changing both the numerator and denominator in Laplace smoothing?
We can define an adjusted count: add one to the count, multiply by N and divide by N + V, i.e. c* = (c + 1) * N / (N + V)
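As a worked example with made-up numbers: for a word seen c = 4 times in a corpus of N = 1000 tokens with a vocabulary of V = 500 types, the adjusted count is c* = (4 + 1) * 1000 / (1000 + 500) ≈ 3.33.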
What are some problems with Laplace smoothing?
The extra V observations added to the denominator are problematic, as too much probability mass is moved to the zero-count cases
What is k-smoothing?
Also called add-k smoothing, it is a way of moving a bit less probability mass from the seen to the unseen events, by adding a fractional count k (e.g. 0.5) instead of 1.
What is the equation for k-smoothing?
P(wn | wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + k * V)
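A compact sketch of add-k smoothing on the same toy corpus (the corpus and the value k = 0.5 are assumptions for illustration):

```python
from collections import Counter

corpus = "<s> I am Sam </s> <s> Sam I am </s> <s> I like ham </s>".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)  # vocabulary size: number of distinct word types

def bigram_add_k(prev, word, k=0.5):
    # Add a fractional count k instead of 1, so less mass moves to unseen bigrams.
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

print(bigram_add_k("I", "am"))    # (2 + 0.5) / (3 + 0.5 * V)
print(bigram_add_k("I", "ham"))   # (0 + 0.5) / (3 + 0.5 * V)
```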
Does k-smoothing solve the problem of imbalanced probability counts?
No - it can still be problematic