Text generation 1: LMs and word embeddings Flashcards
(12 cards)
Create a computational graph for this. Which parameters are trainable and which are not?
Train this NN for one iteration using backpropagation
What are language models?
Models that assign a probability to a word (or sequence of words) appearing after some given sequence of words.
What is a k-th order markov assumption?
An assumption that the next token only depends on the last k tokens
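In symbols (standard notation, not taken from the card itself), the assumption truncates the conditioning context:

$$P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-k}, \dots, w_{t-1})$$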
What could be a problem of language models?
- An n-gram that never appears in the training data gets probability 0, which sets the probability of the whole sequence to 0. We can fix this with add-alpha smoothing.
What is add-alpha smoothing?
We add a small value α (between 0 and 1) to the count of every token/n-gram, so the model acts as if each one had been seen at least once (technically, a fraction of a time) and no probability is exactly 0.
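A minimal sketch of add-alpha smoothing for a bigram model (the toy corpus, counts, and α below are purely illustrative):

```python
from collections import Counter

def add_alpha_prob(token, context, unigram_counts, bigram_counts, vocab_size, alpha=0.1):
    # P(token | context) with add-alpha smoothing:
    # (count(context, token) + alpha) / (count(context) + alpha * |V|)
    return (bigram_counts[(context, token)] + alpha) / (unigram_counts[context] + alpha * vocab_size)

# Toy corpus and counts
corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
vocab_size = len(set(corpus))

print(add_alpha_prob("sat", "cat", unigram_counts, bigram_counts, vocab_size))  # seen bigram
print(add_alpha_prob("the", "cat", unigram_counts, bigram_counts, vocab_size))  # unseen bigram, still > 0
```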
How is perplexity calculated and what is the intuition behind it?
- Perplexity is the exponentiated average negative log-probability of the tokens in a held-out sequence: PP = exp(-(1/N) * Σ log P(w_i | w_1, …, w_{i-1})); equivalently, the inverse probability of the sequence normalized by its length.
- Intuition: a perplexity of 10 means the model is, on average, as uncertain as if it rolled a fair 10-sided die (a uniform distribution over 10 options) for each token.
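A tiny numeric illustration of the calculation (the per-token probabilities are made up):

```python
import math

# Probabilities the model assigned to each token of a held-out sequence
probs = [0.1, 0.25, 0.05, 0.2]

# Perplexity = exp of the average negative log-probability per token
avg_nll = -sum(math.log(p) for p in probs) / len(probs)
perplexity = math.exp(avg_nll)
print(perplexity)  # ~8: on average as uncertain as rolling a fair ~8-sided die
```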
What are the limitations of n-gram language models?
- Long-range dependencies
- Generalization across contexts (synonyms)
What does the embedding layer/matrix look like and what is it?
It is a |V| x d matrix where |V| is the size of the vocabulary and d is the token embedding size. It is basically used for a lookup: if a one-hot vector x (with a 1 at position j) is multiplied with the embedding matrix, the product is the j-th row of the matrix, which corresponds to the embedding of the j-th token.
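A minimal NumPy sketch of the lookup described above (the sizes and the token index are arbitrary):

```python
import numpy as np

V, d = 5, 3                   # vocabulary size and embedding dimension
E = np.random.randn(V, d)     # embedding matrix, one row per token

j = 2                         # index of some token in the vocabulary
x = np.zeros(V)
x[j] = 1.0                    # one-hot vector for token j

# Multiplying the one-hot vector by E selects the j-th row of E,
# i.e. the embedding of token j; in practice this is done as a direct lookup E[j].
assert np.allclose(x @ E, E[j])
```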
How do you build a NN that, given a k-gram of tokens, outputs a probability distribution across the whole vocabulary predicting the (k+1)-th token? What is the training data?
- Input layer: k embedding vectors (looked up from the embedding matrix), so k*d inputs where d is the embedding vector size
- Output layer: |V| neurons, where each neuron gives the probability of the corresponding token being the next one
- In between, we can stack as many hidden layers as we want, each followed by a non-linear activation function; at the end we need a softmax.
The training data can be an unlabeled corpus: we slide a window of k tokens over the text and try to predict the (k+1)-th token, which we know from the corpus and use as the target word. Cross-entropy loss is used (see the sketch below).
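A minimal PyTorch sketch of such a k-gram neural LM and one training step (all sizes, names, and the random batch are made up for illustration):

```python
import torch
import torch.nn as nn

V, k, d, hidden = 10_000, 3, 64, 128   # hypothetical vocabulary size, context length, layer sizes

class KGramLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, d)            # the |V| x d embedding matrix (lookup)
        self.hidden = nn.Linear(k * d, hidden)   # hidden layer over the k concatenated embeddings
        self.out = nn.Linear(hidden, V)          # one logit per vocabulary token

    def forward(self, ctx):                      # ctx: (batch, k) token ids
        e = self.emb(ctx).reshape(ctx.shape[0], -1)  # concatenate k embeddings -> (batch, k*d)
        h = torch.tanh(self.hidden(e))               # non-linear activation
        return self.out(h)                           # logits; softmax is applied inside the loss

model = KGramLM()
loss_fn = nn.CrossEntropyLoss()                  # cross-entropy over the |V| classes
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on a sliding-window batch (random token ids stand in for real text)
ctx = torch.randint(0, V, (32, k))               # 32 windows of k context tokens
target = torch.randint(0, V, (32,))              # the (k+1)-th token of each window
loss = loss_fn(model(ctx), target)
opt.zero_grad(); loss.backward(); opt.step()
```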
What are some pros and cons of neural LMs?
- Pro: embeddings allow generalization across contexts (similar tokens get similar representations), which n-gram counts cannot do
- Con: fixed input size; for a k-gram model the input is always k tokens, no more
- Con: the softmax over the whole vocabulary at the end is expensive
How to sample the next token from a neural LM?
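One standard approach (a sketch, not necessarily the intended card answer): run the context through the network, turn the output logits into a softmax distribution over the vocabulary, and sample the next token id from that categorical distribution, optionally rescaling with a temperature. The function name and numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_token(logits, temperature=1.0):
    # Softmax with temperature: T < 1 sharpens the distribution, T > 1 flattens it
    z = (logits - logits.max()) / temperature    # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)       # draw a token id from the distribution

# Illustrative logits over a tiny 5-token vocabulary
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5])
print(sample_next_token(logits))
```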