Text generation 1: LMs and word embeddings Flashcards
(12 cards)
Create a computational graph for this. Which parameters are trainable and which are not?
Train this NN for one iteration using backpropagation
What are language models?
Models that assign a probability to a word (or sequence of words) appearing after some given sequence of words.
What is a k-th order markov assumption?
An assumption that the next token only depends on the last k tokens
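In symbols (standard notation, not taken from the card itself), the assumption truncates the conditioning context:

$$P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-k}, \dots, w_{t-1})$$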
What could be a problem of language models?
- An n-gram that never appears in the training data gets probability 0, which sets the probability of the whole sequence to 0. We can fix this with add-alpha smoothing.
What is add-alpha smoothing?
We add a small value α (between 0 and 1) to the count of every token/n-gram, so the model acts as if each one had been seen at least once (technically, a fraction of a time) and no probability is exactly 0.
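A minimal sketch of add-alpha smoothing for a bigram model (the toy corpus, counts, and α below are purely illustrative):

```python
from collections import Counter

def add_alpha_prob(token, context, unigram_counts, bigram_counts, vocab_size, alpha=0.1):
    # P(token | context) with add-alpha smoothing:
    # (count(context, token) + alpha) / (count(context) + alpha * |V|)
    return (bigram_counts[(context, token)] + alpha) / (unigram_counts[context] + alpha * vocab_size)

# Toy corpus and counts
corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
vocab_size = len(set(corpus))

print(add_alpha_prob("sat", "cat", unigram_counts, bigram_counts, vocab_size))  # seen bigram
print(add_alpha_prob("the", "cat", unigram_counts, bigram_counts, vocab_size))  # unseen bigram, still > 0
```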
How is perplexity calculated and what is the intuition behind it?
- Perplexity is the exponentiated average negative log-probability of the tokens in a held-out sequence: PP = exp(-(1/N) * Σ log P(w_i | w_1, …, w_{i-1})); equivalently, the inverse probability of the sequence normalized by its length.
- Intuition: a perplexity of 10 means the model is, on average, as uncertain as if it rolled a fair 10-sided die (a uniform distribution over 10 options) for each token.
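A tiny numeric illustration of the calculation (the per-token probabilities are made up):

```python
import math

# Probabilities the model assigned to each token of a held-out sequence
probs = [0.1, 0.25, 0.05, 0.2]

# Perplexity = exp of the average negative log-probability per token
avg_nll = -sum(math.log(p) for p in probs) / len(probs)
perplexity = math.exp(avg_nll)
print(perplexity)  # ~8: on average as uncertain as rolling a fair ~8-sided die
```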
What are the limitations of n-gram language models?
- Long-range dependencies
- Generalization across contexts (synonyms)
What does the embedding layer/matrix look like and what is it?
It is a |V| x d matrix where |V| is the size of the vocabulary and d is the token embedding size. It is basically used for a lookup: if a one-hot vector x (with a 1 at position j) is multiplied with the embedding matrix, the product is the j-th row of the matrix, which corresponds to the embedding of the j-th token.
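A minimal NumPy sketch of the lookup described above (the sizes and the token index are arbitrary):

```python
import numpy as np

V, d = 5, 3                   # vocabulary size and embedding dimension
E = np.random.randn(V, d)     # embedding matrix, one row per token

j = 2                         # index of some token in the vocabulary
x = np.zeros(V)
x[j] = 1.0                    # one-hot vector for token j

# Multiplying the one-hot vector by E selects the j-th row of E,
# i.e. the embedding of token j; in practice this is done as a direct lookup E[j].
assert np.allclose(x @ E, E[j])
```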
How do you build a NN that, given a k-gram of tokens, outputs a probability distribution across the whole vocabulary predicting the (k+1)-th token? What is the training data?
- Input layer: k embedding vectors (looked up from the embedding matrix), so k*d inputs where d is the embedding vector size
- Output layer: |V| neurons, where each neuron gives the probability of the corresponding token being the next one
- In between, we can stack as many hidden layers as we want, each followed by a non-linear activation function; at the end we need a softmax.
The training data can be an unlabeled corpus: we slide a window of k tokens over the text and try to predict the (k+1)-th token, which we know from the corpus and use as the target word. Cross-entropy loss is used (see the sketch below).
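A minimal PyTorch sketch of such a k-gram neural LM and one training step (all sizes, names, and the random batch are made up for illustration):

```python
import torch
import torch.nn as nn

V, k, d, hidden = 10_000, 3, 64, 128   # hypothetical vocabulary size, context length, layer sizes

class KGramLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, d)            # the |V| x d embedding matrix (lookup)
        self.hidden = nn.Linear(k * d, hidden)   # hidden layer over the k concatenated embeddings
        self.out = nn.Linear(hidden, V)          # one logit per vocabulary token

    def forward(self, ctx):                      # ctx: (batch, k) token ids
        e = self.emb(ctx).reshape(ctx.shape[0], -1)  # concatenate k embeddings -> (batch, k*d)
        h = torch.tanh(self.hidden(e))               # non-linear activation
        return self.out(h)                           # logits; softmax is applied inside the loss

model = KGramLM()
loss_fn = nn.CrossEntropyLoss()                  # cross-entropy over the |V| classes
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on a sliding-window batch (random token ids stand in for real text)
ctx = torch.randint(0, V, (32, k))               # 32 windows of k context tokens
target = torch.randint(0, V, (32,))              # the (k+1)-th token of each window
loss = loss_fn(model(ctx), target)
opt.zero_grad(); loss.backward(); opt.step()
```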
What are some pros and cons of neural LMs?
- Pro: embeddings allow generalization across contexts (similar tokens get similar representations), which n-gram counts cannot do
- Con: fixed input size; for a k-gram model the input is always k tokens, no more
- Con: the softmax over the whole vocabulary at the end is expensive
How to sample the next token from a neural LM?
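One standard approach (a sketch, not necessarily the intended card answer): run the context through the network, turn the output logits into a softmax distribution over the vocabulary, and sample the next token id from that categorical distribution, optionally rescaling with a temperature. The function name and numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_token(logits, temperature=1.0):
    # Softmax with temperature: T < 1 sharpens the distribution, T > 1 flattens it
    z = (logits - logits.max()) / temperature    # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)       # draw a token id from the distribution

# Illustrative logits over a tiny 5-token vocabulary
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5])
print(sample_next_token(logits))
```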