Language models Flashcards

1
Q

Language modeling and problem formulation

A

Language modeling is the task of predicting which word comes next in a sequence of words.

More formally, given a sequence of words w1w2 … wt we want to know the probability of the next word wt+1:

P(wt+1|w1w2 … wt) = P(w1:t+1)/P(w1:t)

2
Q

Language models as generative models

A

Generative models are a type of machine learning model trained to generate new data instances similar to those in the training data.

Rather than only as predictive models, language models can also be viewed as generative models that assign a probability to a piece of text:

P(w1 … wt) = P(w1:t)

3
Q

Probability of a sequence of words as a product of conditional probabilities

A

P(w1:n) = product t=1 to n P(wt|w1:t-1)
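
A minimal Python sketch of this product, using an invented table of conditional probabilities (the cond_prob values and the <s> start symbol are assumptions for illustration):

```python
# A minimal sketch of the chain rule: the sequence probability is the product of
# the conditional probabilities of each word given its full history.
# The conditional probabilities below are invented purely for illustration.
cond_prob = {
    ("<s>", "the"): 0.4,
    ("<s> the", "cat"): 0.1,
    ("<s> the cat", "sat"): 0.3,
}

def sequence_probability(words):
    """Multiply P(wt | w1:t-1) over the sequence, reading from the toy table."""
    p = 1.0
    history = "<s>"
    for w in words:
        p *= cond_prob[(history, w)]
        history = f"{history} {w}"
    return p

print(sequence_probability(["the", "cat", "sat"]))  # 0.4 * 0.1 * 0.3 = 0.012
```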

4
Q

Types vs tokens in a corpus

A
  • Types are the elements of the vocabulary V associated with the corpus, that is, the distinct words of the corpus.
  • Tokens are the running words (occurrences); the length of a corpus is the number of tokens (see the sketch below).
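
A tiny Python sketch of the distinction (the toy sentence is invented):

```python
# Counting tokens (running words) and types (distinct words) in a toy corpus.
corpus = "the cat sat on the mat the end".split()

num_tokens = len(corpus)       # length of the corpus: 8
num_types = len(set(corpus))   # vocabulary size V: 6
print(num_tokens, num_types)
```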
5
Q

Relative frequency estimator in the next word prediction task

A

P(wt|w1:t-1) = C(w1:t)/C(w1:t-1)

This estimator is very data-hungry and suffers from high variance: depending on what data happens to be in the corpus, we can get very different probability estimates.
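
A minimal Python sketch of the estimator on an invented two-sentence corpus, counting full histories from the start of each sentence:

```python
from collections import Counter

# Relative frequency estimate of P(wt | w1:t-1) from counts of full histories.
# On a toy corpus most full histories occur 0 or 1 times, which is why the
# estimator is data-hungry and high-variance.
sentences = [
    "the cat sat on the mat".split(),
    "the cat ate the fish".split(),
]

prefix_counts = Counter()
for sent in sentences:
    for t in range(1, len(sent) + 1):
        prefix_counts[tuple(sent[:t])] += 1

def rel_freq(history, word):
    """C(w1:t) / C(w1:t-1); returns 0.0 when the history was never seen."""
    denom = prefix_counts[tuple(history)]
    return prefix_counts[tuple(history) + (word,)] / denom if denom else 0.0

print(rel_freq(["the"], "cat"))         # C(the cat) / C(the) = 2/2 = 1.0
print(rel_freq(["the", "cat"], "sat"))  # 1/2 = 0.5
```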

6
Q

N-gram model, complexity, N-gram probabilities and bias-variance tradeoff

A

A string w(t-N+1):t of N words is called an N-gram.

The N-gram model approximates the probability of a word given the entire sentence history by conditioning only on the past N-1 words.

General equation for the N-gram model:

P(wt|w1:t-1) = P(wt|w(t-N+1):t-1)

The relative frequency estimator for the N-gram model is:

P(wt|w(t-N+1):t-1) = C(w(t-N+1):t)/C(w(t-N+1):t-1)

(see the bigram sketch after the list below).

The number of possible N-grams is exponential in N: V^N for a vocabulary of size V.

N is a hyperparameter. When setting its value, we face the bias-variance tradeoff:

  • When N is too small (high bias), it fails to recover long-distance word relations
  • When N is too large, we get data sparsity (high variance)
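
A minimal bigram (N = 2) sketch of the approximation and its relative frequency estimate; the toy corpus and the <s>/</s> boundary markers are assumptions:

```python
from collections import Counter

# Bigram (N = 2) sketch: approximate P(wt | w1:t-1) by P(wt | wt-1) and
# estimate it by relative frequency, C(wt-1 wt) / C(wt-1).
corpus = "<s> the cat sat on the mat </s> <s> the dog sat </s>".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, word):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("the", "cat"))  # C(the cat) / C(the) = 1/3
print(p_bigram("the", "dog"))  # 1/3
```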
7
Q

Example of computing next word probability using an N-gram model (Slide 16 pdf 5…)

A

8
Q

Evaluation of LMs: Intrinsic evaluation and perplexity measure

A

Intrinsic evaluation of language models is based on the inverse probability of the test set, normalized by the number of words.

For a test set W = w1w2 … wn we define perplexity as:

PP(W) = P(w1:n)^(-1/n)

The multiplicative inverse probability 1/P(wj|w1:j-1) can be seen as a measure of how surprising the next word is.

The degree of the root averages over all words of the test set, providing average surprise per word.

The lower the perplexity, the better the model.

An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) improvement in the performance of a language processing task like speech recognition or machine translation.
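
A minimal Python sketch of the computation, done in log space; the per-word probabilities are invented stand-ins for a trained model's outputs:

```python
import math

# Perplexity sketch: PP(W) = P(w1:n)^(-1/n), computed in log space for stability.
# The per-word probabilities would come from a trained LM; these are invented.
word_probs = [0.2, 0.1, 0.05]   # P(wj | w1:j-1) for each word of the test set

n = len(word_probs)
log_prob = sum(math.log(p) for p in word_probs)   # log P(w1:n)
perplexity = math.exp(-log_prob / n)
print(perplexity)   # ~10: the average per-word "surprise" factor
```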

9
Q

Sparse data: zero or undefined probabilities for N-gram models

A

When using the relative frequency estimator in a LM: if there isn't enough data in the training set, counts will be zero for some grammatical sequences.

There are three scenarios we need to consider:

  • zero numerator: smoothing, discounting
  • zero denominator: backoff, interpolation
  • out-of-vocabulary words in test set: estimation of unknown words
10
Q

Smoothing techniques

A

Given the Relative frequency estimator:

P(wt|w1:t-1) = C(w1:t)/C(w1:t-1)

Smoothing techniques (also called discounting) deal with words that are in our vocabulary V but were never seen before in the given context (zero numerator).

Smoothing prevents the LM from assigning zero probability to these events.

11
Q

Laplace smoothing

A

IDEA:
Pretend that everything occurs once more than the actual count.

Consider the 1-gram (unigram) model:

PL(wt) = (C(wt)+1)/(n+V)

where n is the number of tokens and V is the vocabulary size.

Define the adjusted count C*(wt) = (C(wt)+1)·n/(n+V) and the relative discount

d(wt) = C*(wt)/C(wt)

For high-frequency words d(wt) < 1: Laplace smoothing discounts frequent words and shifts that probability mass to rare and unseen events (for rare words the ratio can exceed 1).
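
A minimal Python sketch on an invented six-token corpus, showing the smoothed probabilities and the discount:

```python
from collections import Counter

# Laplace (add-1) unigram sketch, with the adjusted count C* and the discount d.
corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
n = len(corpus)    # number of tokens: 6
V = len(counts)    # vocabulary size: 5

def p_laplace(word):
    return (counts[word] + 1) / (n + V)

def discount(word):
    """d(w) = C*(w) / C(w), with C*(w) = (C(w) + 1) * n / (n + V)."""
    c_star = (counts[word] + 1) * n / (n + V)
    return c_star / counts[word]

print(p_laplace("the"), p_laplace("dog"))  # a zero-count word still gets 1/11
print(discount("the"))  # ~0.82 < 1: frequent words are discounted
print(discount("cat"))  # ~1.09 > 1: rare words gain mass (here n > V)
```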

12
Q

Add-k smoothing

A

Add-k smoothing is a generalization of add-1 smoothing (that is, Laplace smoothing).

For some 0 < k < 1 (here shown for the 2-gram model):

PAdd-k(wt|wt-1) = (C(wt-1..wt)+k)/(C(wt-1)+kV)

Jeffreys-Perks law corresponds to the case k = 0.5, which works well in practice and benefits from some theoretical justification.
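
A minimal Python sketch for a bigram model with k = 0.5 on an invented corpus:

```python
from collections import Counter

# Add-k smoothing sketch for a bigram model; k = 0.5 is the Jeffreys-Perks case.
corpus = "<s> the cat sat on the mat </s>".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)
k = 0.5

def p_add_k(prev, word):
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

print(p_add_k("the", "cat"))  # seen bigram
print(p_add_k("the", "sat"))  # unseen bigram still gets non-zero probability
```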

13
Q

Backoff and interpolation techniques

A

These techniques deal with words that are in our vocabulary but that, in the test set, combine to form previously unseen contexts.

They prevent the LM from producing undefined probabilities for these events (zero divided by zero).

IDEA:

  • if you have trigrams, use trigrams
  • if you don’t have trigrams, use bigrams (and back off further to unigrams if needed)
14
Q

Stupid backoff

A

With very large (web-scale) text collections, a rough approximation of Katz backoff, called stupid backoff, is often sufficient.

Recursively, using a back-off weight λ:

Ps(wt|wt-N+1:t-1) = C(wt-N+1:t)/C(wt-N+1:t-1) if C(wt-N+1:t) > 0, otherwise λ · Ps(wt|wt-N+2:t-1)

The recursion terminates at the unigram Ps(wt) = C(wt)/n (n = number of tokens). These scores are not normalized probabilities; λ = 0.4 works well in practice.
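
A minimal Python sketch under these assumptions (toy corpus, λ = 0.4, recursion down to the unigram):

```python
from collections import Counter

# Stupid backoff sketch: returns scores S (not normalized probabilities),
# with lambda = 0.4. The toy corpus is invented.
corpus = "<s> the cat sat on the mat </s>".split()
LAMBDA = 0.4

ngram_counts = Counter()
for order in (1, 2, 3):
    ngram_counts.update(zip(*(corpus[i:] for i in range(order))))

def stupid_backoff(word, context):
    """S(w | context): relative frequency if the N-gram was seen, else back off."""
    context = tuple(context)
    if context:
        num = ngram_counts[context + (word,)]
        den = ngram_counts[context]
        if num > 0 and den > 0:
            return num / den
        return LAMBDA * stupid_backoff(word, context[1:])
    return ngram_counts[(word,)] / len(corpus)   # unigram base case

print(stupid_backoff("sat", ("the", "cat")))  # trigram seen: 1.0
print(stupid_backoff("sat", ("a", "dog")))    # backs off twice: 0.4 * 0.4 * 1/8
```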

15
Q

Linear interpolation

A

In simple linear interpolation, we combine different order N-grams by linearly interpolating all the models.

PL(wt|wt-2 wt-1) = λ1P(wt|wt-2 wt-1) + λ2P(wt|wt-1) + λ3P(wt)

where the λi are non-negative and sum to 1 (they are typically tuned on a held-out corpus).
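
A minimal Python sketch; the λ values and the component probabilities are invented:

```python
# Linear interpolation sketch. p_tri, p_bi, p_uni stand in for estimates from
# trained trigram/bigram/unigram models; the lambda values are invented.
LAMBDAS = (0.6, 0.3, 0.1)   # non-negative, sum to 1

def p_interpolated(p_tri, p_bi, p_uni, lambdas=LAMBDAS):
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# A trigram that was never seen (p_tri = 0) still gets probability mass:
print(p_interpolated(0.0, 0.2, 0.05))  # 0.3*0.2 + 0.1*0.05 = 0.065
```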

16
Q

Unknown/unfrequent words, how can we handle them?

A

Unknown words, also called out of vocabulary (OOV) words, are words we haven’t seen before.

Replace all words that occur fewer than d times in the training set (for some small number d) by a new word token UNK.
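
A minimal Python sketch with an invented training corpus and d = 2:

```python
from collections import Counter

# Replace words that occur fewer than d times in the training set by <UNK>.
d = 2
train = "the cat sat on the mat the cat ate".split()
counts = Counter(train)

processed = [w if counts[w] >= d else "<UNK>" for w in train]
print(processed)  # sat, on, mat, ate (count 1 < d) become <UNK>
```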

17
Q

Limitations of the N-gram models

A
  • Scaling to larger N-gram sizes is problematic
  • Smoothing techniques are intricate
  • N-gram models are unable to share statistical strength across similar words (counts for ‘red apple’ do not affect estimates for ‘green apple’)
18
Q

Neural language models advantages and basic idea

A

Main advantages of NLM:

  • can generalize better over contexts of similar words, and are more accurate at word-prediction
  • can incorporate arbitrarily distant contextual information, while remaining computationally and statistically tractable

IDEA:

  1. get a vector representation for the previous context
  2. generate a probability distribution for the next token

The most natural choice of NN architecture is the recurrent neural network (RNN), but feedforward neural networks (FNN) and convolutional neural networks (CNN) have also been used.

19
Q

Feedforward NLM

A

Like the N-gram language model, it uses the approximation P(wt|w1:t-1) ≃ P(wt|wt-N+1:t-1)

  1. We represent each input word wt as a one-hot vector xt of size V
  2. At the first layer:
    - we convert one-hot vectors for the words in the N-1 window into word embeddings of size d
    - we concatenate the N-1 embeddings of the previous words into a single vector e (for N = 4: e = [Ext-3; Ext-2; Ext-1])
  3. h = g(We+b) <- vector representation of the N-1 words
  4. z = Uh <- transform h from the hidden dimension dh to the vocabulary size V
  5. ŷ = softmax(z)

Vector dimensions: xt is a one-hot vector of size V; E is d×V, so each embedding Ext has size d; e has size (N-1)d; W is dh×(N-1)d and b, h have size dh; U is V×dh, so z and ŷ have size V.
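
A minimal PyTorch sketch of the forward pass for N = 4; the sizes V, d, dh, the tanh nonlinearity and the class name are assumptions (in practice one would return logits and train with cross-entropy, with the softmax here mirroring step 5):

```python
import torch
import torch.nn as nn

# Feedforward NLM sketch for N = 4 (3 context words); sizes are invented.
V, d, d_h, N = 10_000, 100, 256, 4

class FeedforwardLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.E = nn.Embedding(V, d)              # word embeddings     (step 2)
        self.W = nn.Linear((N - 1) * d, d_h)     # h = g(We + b)       (step 3)
        self.U = nn.Linear(d_h, V, bias=False)   # z = Uh              (step 4)

    def forward(self, context):                  # context: (batch, N-1) word ids
        e = self.E(context).flatten(1)           # concatenate the N-1 embeddings
        h = torch.tanh(self.W(e))
        z = self.U(h)
        return torch.softmax(z, dim=-1)          # y-hat               (step 5)

model = FeedforwardLM()
probs = model(torch.tensor([[11, 42, 7]]))       # ids of the 3 previous words
print(probs.shape)                               # torch.Size([1, 10000])
```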

20
Q

RNN for language modeling

A

RNN language models process the input one word at a time, predicting the next word from:

  • the current word
  • the previous hidden state
  1. et = Ext
  2. ht = g(Uht-1 + Wet)
  3. yt = softmax(Vht)

where h is the “hidden state”.

Vector dimensions: xt is a one-hot vector of size V; E is d×V, so et has size d; W is dh×d and U is dh×dh, so ht has size dh; the output matrix V is V×dh, so yt has size V. (A step-by-step sketch follows below.)

In the end:

  • the columns of E provide the learned word embeddings
  • the rows of V provide a second set of learned word embeddings, that capture relevant aspects of word meaning and function
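
A minimal PyTorch sketch of one step; the sizes and the tanh nonlinearity are assumptions, and the matrices are named after the equations above:

```python
import torch
import torch.nn as nn

# RNN LM step sketch: e_t = E x_t, h_t = g(U h_{t-1} + W e_t), y_t = softmax(V h_t).
# Sizes are invented.
vocab_size, d, d_h = 10_000, 100, 256
E = nn.Embedding(vocab_size, d)
W = nn.Linear(d, d_h, bias=False)
U = nn.Linear(d_h, d_h, bias=False)
V_out = nn.Linear(d_h, vocab_size, bias=False)

def rnn_step(x_t, h_prev):
    e_t = E(x_t)                               # 1. embedding of the current word
    h_t = torch.tanh(U(h_prev) + W(e_t))       # 2. new hidden state
    y_t = torch.softmax(V_out(h_t), dim=-1)    # 3. distribution over the next word
    return y_t, h_t

h = torch.zeros(1, d_h)
for word_id in [11, 42, 7]:                    # process the input one word at a time
    y, h = rnn_step(torch.tensor([word_id]), h)
print(y.shape)                                 # torch.Size([1, 10000])
```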
21
Q

Weight tying in RNNs

A

In the recurrent NLM model, weight tying (also known as parameter sharing) means that we impose Eᵀ = V.

Weight tying can significantly reduce model size, and has an effect similar to regularization, preventing overfitting of the NLM.
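
A minimal PyTorch sketch of the idea; the sizes are invented, and tying requires the hidden size to equal the embedding size:

```python
import torch.nn as nn

# Weight tying sketch: reuse the embedding matrix as the output projection.
vocab_size, d_h = 10_000, 256

embedding = nn.Embedding(vocab_size, d_h)            # E (one row per word)
output = nn.Linear(d_h, vocab_size, bias=False)      # V
output.weight = embedding.weight                     # tie: one matrix, used twice
```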

22
Q

RNN practical issues

A
  • vanishing gradient problem
  • the softmax over the entire vocabulary dominates the computation both at training and at test time (alternatives: hierarchical softmax and adaptive softmax)
23
Q

Method for modifying language model behavior in RNN (coherence and diversity trade-off)

A

A very popular method for modifying language model behavior is to change the softmax temperature τ.

Low τ produces a peaky distribution (high coherence); large τ produces a flat distribution (high diversity).

ŷ = softmax(z/τ)
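
A minimal Python sketch; the logits are invented:

```python
import math

# Softmax with temperature: y = softmax(z / tau).
def softmax_with_temperature(logits, tau):
    scaled = [z / tau for z in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.5))   # peaky: most mass on the top word
print(softmax_with_temperature(logits, 2.0))   # flatter: more diverse sampling
```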

24
Q

Contrastive evaluation

A

Contrastive evaluation is used to test specific linguistic constructions in NLMs: the model is given a minimal pair and should assign higher probability to the grammatical variant.

E.g. P(is|w1:t-1) vs P(are|w1:t-1)
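
A minimal sketch of such a check; the probabilities and the context sentence are invented:

```python
# Contrastive evaluation sketch: the model "passes" an item if it assigns higher
# probability to the grammatical continuation. The probabilities below are
# invented; in practice they come from the NLM given a shared context such as
# "The keys to the cabinet ...".
p_next = {"is": 0.03, "are": 0.21}

grammatical, ungrammatical = "are", "is"
print(p_next[grammatical] > p_next[ungrammatical])   # True: the item is passed
```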