Text generation 1: LMs and word embeddings Flashcards

(12 cards)

1
Q

Create a computational graph for this. Which parameters are trainable and which are not?

A
2
Q

Train this NN for one iteration using backpropagation.

A
3
Q

What are language models?

A

Models that assign a probability to a word (or sequence of words) appearing after a given sequence of words (the context).
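In symbols (standard chain-rule formulation; the notation is chosen here, with w_t denoting the t-th token):

```latex
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```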

4
Q

What is a k-th order Markov assumption?

A

The assumption that the next token depends only on the last k tokens.
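Written out (standard form; w_t denotes the t-th token):

```latex
P(w_t \mid w_1, \dots, w_{t-1}) \;\approx\; P(w_t \mid w_{t-k}, \dots, w_{t-1})
```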

5
Q

What could be a problem with language models?

A
  • An n-gram that never occurs in the training data gets probability 0, which sets the probability of the whole sequence to 0. We can fix this with add-alpha smoothing.
6
Q

What is add-alpha smoothing?

A

We add a small value α (between 0 and 1) to every count, i.e. we pretend that each token has been seen at least α times (technically less than once when α < 1), so nothing gets probability 0.
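As a formula (standard add-α estimate for n-gram counts; h is the history of preceding tokens and |V| the vocabulary size, notation chosen here):

```latex
P_{\text{add-}\alpha}(w \mid h) \;=\; \frac{\operatorname{count}(h, w) + \alpha}{\operatorname{count}(h) + \alpha \lvert V \rvert}
```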

7
Q

How is perplexity calculated and what is the intuition behind it?

A
  • Perplexity is the inverse probability of the evaluation text, normalized by its length (equivalently, the exponentiated average negative log-likelihood per token). A perplexity of 10, for example, means the model has the same uncertainty as if it rolled a fair 10-sided die with a uniform distribution.
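As a formula (standard definition; N is the number of tokens in the evaluation text):

```latex
\operatorname{PPL}(w_1, \dots, w_N) \;=\; P(w_1, \dots, w_N)^{-1/N} \;=\; \exp\!\Big(-\frac{1}{N} \sum_{t=1}^{N} \ln P(w_t \mid w_1, \dots, w_{t-1})\Big)
```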
8
Q

What are the limitations of n-gram language models?

A
  • Long-range dependencies: anything further back than the last n-1 tokens is ignored
  • No generalization across similar contexts (e.g., synonyms are treated as unrelated tokens)
9
Q

What does the embedding layer/matrix look like, and what is it?

A

It is a |V| x d matrix, where |V| is the vocabulary size and d is the token embedding size. It is basically used as a lookup table: if a one-hot row vector x with a 1 at position j is multiplied with the embedding matrix, the result is the j-th row of the matrix, which is the embedding of the j-th token.
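A minimal numpy sketch of that lookup (toy sizes and variable names chosen here for illustration):

```python
import numpy as np

V, d = 5, 3                       # toy vocabulary size and embedding size
E = np.random.randn(V, d)         # embedding matrix, shape |V| x d

j = 2                             # index of some token in the vocabulary
x = np.zeros(V)
x[j] = 1.0                        # one-hot row vector for token j

# multiplying the one-hot vector with E selects the j-th row of E
assert np.allclose(x @ E, E[j])
```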

10
Q

How do you build a NN that, for a given k-gram of tokens, outputs a probability distribution across the whole vocabulary for the (k+1)-th token? What is the training data?

A
  • Input layer: the k embedding vectors (looked up from the embedding matrix) concatenated, so size k*d, where d is the embedding vector size
  • Output layer: size |V|, where each neuron gives the probability of that token being the next one
  • In between, we can stack as many layers as we want, each followed by a non-linear activation function, but at the end we need a softmax so the outputs form a probability distribution.

The training data can be an unlabeled corpus: we slide a window of k tokens over the text and try to predict the (k+1)-th token, which we know (the ground truth) and use as the target word. Cross-entropy loss is used. (A minimal sketch of such a network is given below.)
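A minimal PyTorch sketch of this architecture (hyperparameters, class and variable names are illustrative choices, not values from the card):

```python
import torch
import torch.nn as nn

class KGramLM(nn.Module):
    def __init__(self, vocab_size, k, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # the |V| x d lookup matrix
        self.hidden = nn.Linear(k * emb_dim, hidden_dim)  # an in-between layer
        self.act = nn.Tanh()                              # non-linear activation
        self.out = nn.Linear(hidden_dim, vocab_size)      # scores over the vocabulary

    def forward(self, context):                     # context: (batch, k) token ids
        e = self.emb(context).flatten(1)             # concatenated embeddings, (batch, k*d)
        return self.out(self.act(self.hidden(e)))    # logits; the softmax is applied
                                                     # inside the cross-entropy loss

model = KGramLM(vocab_size=10_000, k=3)
loss_fn = nn.CrossEntropyLoss()                  # cross-entropy with built-in log-softmax
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# one training step on a dummy sliding-window batch
context = torch.randint(0, 10_000, (32, 3))      # the k previous tokens
target = torch.randint(0, 10_000, (32,))         # the (k+1)-th token (ground truth)
loss = loss_fn(model(context), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```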

11
Q

What are some pros and cons of neural LMs?

A
  • Pro: embeddings let the model generalize across similar contexts (e.g., synonyms), unlike raw n-gram counts
  • Con: fixed input size; for a k-gram model the input is always exactly k tokens, no longer context can be used
  • Con: the softmax over the whole vocabulary at the end is expensive
12
Q

How to sample the next token from a neural LM?

A
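A standard approach: feed the k previous tokens through the network to get the softmax distribution over the vocabulary, then draw the next token from that distribution (or take the argmax for greedy decoding). A minimal numpy sketch, assuming the probability vector has already been computed by the LM (toy values below):

```python
import numpy as np

rng = np.random.default_rng(0)

# probs: the LM's softmax output over a toy 4-token vocabulary
probs = np.array([0.1, 0.6, 0.2, 0.1])

next_token = rng.choice(len(probs), p=probs)   # sample a token id from the distribution
greedy_token = int(np.argmax(probs))           # or greedy decoding: pick the most likely token
```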