Text Generation 3: Transformers Flashcards
(7 cards)
Do RNNs have token positioning?
Implicitly, yes: the hidden state at time t depends on all previous tokens, so positional information is baked into the recurrence. However, there is no explicit positional representation, so it is more an implicit property than a dedicated mechanism.
What are drawbacks of recurrent networks?
- scale poorly in depth: training becomes problematic after about 4-8 layers (probably due to vanishing gradients)
- closed vocabulary: they assume one word = one vector, so out-of-vocabulary words are a problem (no subword tokenization such as BPE)
- the hidden state must carry the information from all previous words (or from both sides, in bidirectional RNNs), so it holds both the meaning of the current word and a summary of the entire sequence; that is a lot to compress into one vector
- sequential: the computation over time steps is impossible to parallelize
What is the goal of the encoder in the transformer model? How is it achieved?
To contextualize word embeddings: through multiple layers of attention, normalization, MLPs, etc., the embedding vectors change based on the surrounding context.
Each layer/block has its own query, key and value linear transformations (matrices), which transform the inputs of the current layer into queries, keys and values. Then scaled dot-product attention is computed. Because every hidden state acts as query, key and value at the same time, this is called self-attention. It is multi-head because the projections are reshaped into head-number of partitions that are computed in parallel.
After the multi-head attention, the result is summed with the input of the attention sub-layer (residual connection) and normalized.
This is fed into a feed-forward NN, and again the input of the NN is summed with its output and passed through a normalization layer.
This process is repeated across multiple layers with different parameters; the output of one layer is fed as the input to the next.
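A minimal single-head NumPy sketch of one encoder block, just to make the data flow concrete; the parameter names (Wq, Wk, Wv, W1, W2) and the toy dimensions are assumptions for the example, not the original model sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(X, Wq, Wk, Wv):
    # project the layer inputs into queries, keys and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # scaled dot-product attention
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def encoder_block(X, Wq, Wk, Wv, W1, b1, W2, b2):
    # single head for clarity; the multi-head reshaping is shown separately below
    attn = self_attention(X, Wq, Wk, Wv)
    X = layer_norm(X + attn)                       # residual connection + normalization
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2     # position-wise feed-forward NN (ReLU)
    return layer_norm(X + ffn)                     # second residual + normalization

# toy dimensions: 5 tokens, model width 16, feed-forward width 32
rng = np.random.default_rng(0)
T, d, d_ff = 5, 16, 32
X = rng.normal(size=(T, d))
params = [rng.normal(size=s) * 0.1 for s in
          [(d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]]
out = encoder_block(X, *params)
print(out.shape)  # (5, 16): same shape, but each row is now contextualized
```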
How do we achieve parallelization in transformers?
Using a reshaping trick. The query, key and value projections are split along the feature dimension into num-heads parts, each head's attention is computed in parallel, and after the processing is done the heads are put back together (concatenated).
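A sketch of that reshaping trick in NumPy: a (T, d_model) projection is reshaped into (num_heads, T, d_head), all heads are computed with one batched matrix multiplication, and the result is reshaped back. The dimensions are toy values chosen for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(X, num_heads):
    # (T, d_model) -> (num_heads, T, d_head)
    T, d_model = X.shape
    d_head = d_model // num_heads
    return X.reshape(T, num_heads, d_head).transpose(1, 0, 2)

def merge_heads(X):
    # (num_heads, T, d_head) -> (T, d_model)
    num_heads, T, d_head = X.shape
    return X.transpose(1, 0, 2).reshape(T, num_heads * d_head)

def multi_head_attention(Q, K, V, num_heads):
    Qh, Kh, Vh = (split_heads(M, num_heads) for M in (Q, K, V))
    d_head = Qh.shape[-1]
    # all heads computed at once via batched matrix multiplication
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ Vh
    return merge_heads(out)

rng = np.random.default_rng(0)
T, d_model, num_heads = 6, 16, 4
Q, K, V = (rng.normal(size=(T, d_model)) for _ in range(3))
print(multi_head_attention(Q, K, V, num_heads).shape)  # (6, 16)
```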
What are positional embeddings and why do we need them?
Positional embeddings give the model information about the position of each token in the sequence. They are computed before the encoder step and summed with the input embeddings.
We need them because self-attention processes all tokens in parallel and therefore has no information about word order (unlike an RNN, which gets it natively from its sequential nature).
The original transformer uses fixed sine and cosine waves of different frequencies for this.
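A sketch of the sinusoidal encoding; max_len and d_model below are toy values:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# the encoding is simply added to the token embeddings before the first layer
token_embeddings = np.random.default_rng(0).normal(size=(10, 16))
inputs = token_embeddings + sinusoidal_positional_encoding(10, 16)
print(inputs.shape)  # (10, 16)
```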
What are trained positional embeddings and what are their pros/cons?
We have a matrix of positional embeddings and train it together with our model. Pro: the positional representation is learned from the data. Cons: additional parameters to train and process, we must choose the size of the matrix (the maximum sequence length) in advance, and test data containing sequences longer than the training data has no embedding to look up.
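A minimal sketch of the learned alternative as a simple lookup-table matrix; max_len and the 0.02 initialization scale are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model = 512, 16          # max_len must be fixed in advance (a con)

# a trainable matrix of positional embeddings, one row per position
pos_embedding = rng.normal(size=(max_len, d_model)) * 0.02

seq_len = 10
token_embeddings = rng.normal(size=(seq_len, d_model))
inputs = token_embeddings + pos_embedding[:seq_len]   # lookup by position index

# a test sequence longer than max_len has no row to look up -> fails at inference
```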
What tokenization/embedding scheme is used in transformers?
Byte-pair encoding (BPE): it splits words into subword units, so any word, even an out-of-vocabulary one, can be represented.
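A toy illustration of why a subword vocabulary handles OOV words, using greedy longest-match segmentation over an assumed tiny vocabulary (real BPE instead learns a sequence of merge rules from corpus statistics):

```python
# Given a learned subword vocabulary, even an unseen word can be segmented
# into known pieces instead of being mapped to a single <unk> token.
vocab = {"trans", "form", "er", "s", "t", "o", "k", "e", "n"}

def segment(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        # greedy longest match against the subword vocabulary
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append("<unk>")   # rare: single characters are normally always in the vocab
            start += 1
    return pieces

print(segment("transformers", vocab))  # ['trans', 'form', 'er', 's']
print(segment("token", vocab))         # falls back to characters: ['t', 'o', 'k', 'e', 'n']
```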