Transformer Flashcards
What is a query?
Query (Q) is what you're looking for: what this token wants to focus on. In the search-engine analogy, the query is the text you type into the search bar. It is the token you want to "find more information about".
What is the primary function of a Transformer model in NLP?
To predict the next word in a sequence based on the context provided by previous words.
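A minimal sketch of this autoregressive loop, with a made-up next_token_probs function standing in for the full Transformer (it just returns a random distribution here):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probs(context):
    # Stand-in for the Transformer: returns a probability
    # distribution over the vocabulary given the context so far.
    logits = rng.normal(size=len(vocab))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

context = ["the", "cat"]
for _ in range(4):
    probs = next_token_probs(context)   # model forward pass
    next_id = int(np.argmax(probs))     # greedy: pick the most likely token
    context.append(vocab[next_id])
print(" ".join(context))
```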
What are the main components of a text-generative Transformer model?
The embedding layer, the Transformer blocks (each comprising an attention mechanism and an MLP), and the output probabilities layer.
What is the purpose of tokenization in Transformer models?
To split input text into smaller units called tokens, which can be words or subwords, facilitating numerical representation.
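A toy illustration of the idea, assuming a hypothetical six-entry subword vocabulary; real tokenizers such as GPT-2's byte-pair encoder learn tens of thousands of subword units from data:

```python
# Hypothetical toy vocabulary; ids are made up for illustration.
vocab = {"trans": 0, "former": 1, "models": 2, "are": 3, "power": 4, "ful": 5}

def tokenize(text):
    # Greedy longest-match split into known subwords (simplified).
    tokens = []
    for word in text.lower().split():
        while word:
            for piece in sorted(vocab, key=len, reverse=True):
                if word.startswith(piece):
                    tokens.append(piece)
                    word = word[len(piece):]
                    break
            else:
                break  # unknown fragment; real tokenizers fall back to bytes
    return tokens

tokens = tokenize("Transformer models are powerful")
ids = [vocab[t] for t in tokens]
print(tokens)  # ['trans', 'former', 'models', 'are', 'power', 'ful']
print(ids)     # [0, 1, 2, 3, 4, 5]
```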
How does positional encoding contribute to a Transformer’s understanding of sequences?
It provides information about the position of each token in the sequence, allowing the model to capture the order of words.
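The original Transformer used fixed sinusoidal encodings, while GPT-2 learns its positional embeddings; a short sketch of the sinusoidal variant:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16): one encoding vector per position
```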
What is the role of the attention mechanism in Transformers?
To allow the model to focus on relevant parts of the input sequence when generating each part of the output, capturing dependencies regardless of their distance in the sequence.
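A compact NumPy sketch of the scaled dot-product attention at the core of this mechanism; the causal mask used in GPT-style decoders is omitted for brevity:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq, d_k), K: (seq, d_k), V: (seq, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8): one output vector per input position
```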
What is the purpose of the embedding layer in a Transformer model?
The embedding layer converts input tokens into dense vector representations, capturing semantic meaning and positional information.
What are the core components of a Transformer block?
Each Transformer block consists of multi-head self-attention mechanisms and feedforward neural networks (MLPs), along with residual connections and layer normalization.
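A PyTorch sketch of one such block, assuming the pre-norm ordering used in GPT-2; exact sizes and details vary across models, and the causal attention mask is omitted for brevity:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of one block: pre-norm self-attention + MLP, each with a residual."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(              # position-wise feedforward network
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Residual connection around self-attention (queries, keys, values all come from x).
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Residual connection around the MLP.
        x = x + self.mlp(self.ln2(x))
        return x

block = TransformerBlock()
x = torch.randn(1, 10, 768)   # (batch, sequence, embedding dim)
print(block(x).shape)         # torch.Size([1, 10, 768])
```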
How does multi-head self-attention benefit the Transformer model?
Multi-head self-attention allows the model to focus on different parts of the input sequence simultaneously, capturing various relationships and dependencies.
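A shape-only sketch, assuming GPT-2-like sizes (12 heads over a 768-dimensional embedding), showing how the embedding is split into per-head subspaces and merged back:

```python
import numpy as np

seq_len, d_model, n_heads = 10, 768, 12
d_head = d_model // n_heads                    # 64 dimensions per head

x = np.random.randn(seq_len, d_model)
# Split the embedding dimension into n_heads independent subspaces.
heads = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)   # (12, 10, 64)
# Each head runs its own attention over its 64-dim slice in parallel;
# the head outputs are then concatenated back and linearly projected.
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(heads.shape, merged.shape)  # (12, 10, 64) (10, 768)
assert np.allclose(merged, x)     # split + merge is lossless
```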
What is the role of the output probabilities layer in a Transformer?
This layer transforms the final hidden states into a probability distribution over the vocabulary, enabling the model to predict the next token.
How is the output probability distribution typically computed in Transformers?
The final hidden state is passed through a linear transformation followed by a softmax function to produce the probability distribution over possible next tokens.
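A sketch with a random matrix standing in for the learned projection; in GPT-2 this projection shares (ties) its weights with the token embedding matrix:

```python
import numpy as np

d_model, vocab_size = 768, 50257
rng = np.random.default_rng(0)

hidden = rng.normal(size=(d_model,))                        # final hidden state of the last token
W_unembed = rng.normal(size=(d_model, vocab_size)) * 0.02   # stand-in for the learned projection

logits = hidden @ W_unembed                   # (50257,) one score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax: probabilities over the vocabulary
print(probs.shape, round(probs.sum(), 6))     # (50257,) 1.0
print(int(np.argmax(probs)))                  # id of the most probable next token
```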
What is the purpose of the embedding layer in a Transformer model?
To convert input tokens into dense vector representations that capture semantic meaning and positional information, enabling the model to process textual data numerically.
What are the four main steps in converting input text into embeddings in a Transformer model?
(1) Tokenization, (2) token embedding, (3) positional encoding, and (4) the final embedding (the sum of the token and positional embeddings).
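These steps sketched with random matrices standing in for GPT-2's learned embedding tables (vocabulary 50,257, context length 1,024, embedding size 768); the token ids are hypothetical:

```python
import numpy as np

vocab_size, max_positions, d_model = 50257, 1024, 768
rng = np.random.default_rng(0)

token_embeddings = rng.normal(size=(vocab_size, d_model)) * 0.02      # (50257, 768)
position_embeddings = rng.normal(size=(max_positions, d_model)) * 0.02

token_ids = np.array([15496, 995])        # hypothetical ids produced by the tokenizer
positions = np.arange(len(token_ids))     # [0, 1]

# Final embedding = token embedding + positional embedding, per token.
final = token_embeddings[token_ids] + position_embeddings[positions]
print(final.shape)  # (2, 768): one vector per input token
```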
Why is tokenization important in the embedding process?
Tokenization breaks down input text into smaller units called tokens (words or subwords), which are then mapped to numerical representations for processing by the model.
How does GPT-2 represent each token in its vocabulary?
GPT-2 represents each token as a 768-dimensional vector, with all token embeddings stored in a matrix of shape (50,257, 768), corresponding to its vocabulary size and embedding dimension.
How is the final embedding for each token obtained?
By summing the token embedding and its corresponding positional encoding, resulting in a vector that encapsulates both the token’s meaning and its position in the sequence.
What is the shape of the token embedding matrix in GPT-2, and what does it represent?
The token embedding matrix in GPT-2 has a shape of (50,257, 768), representing 50,257 unique tokens each mapped to a 768-dimensional vector, capturing the semantic meaning of each token.
What is the difference between top-k and top-p (nucleus) sampling in text generation?
Top-k sampling selects the next word from the top ‘k’ probable words, while top-p sampling selects from the smallest set of words whose cumulative probability exceeds a threshold ‘p’.
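A minimal NumPy sketch of both filters applied to a toy five-token distribution:

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep only the k most probable tokens, then renormalize.
    out = np.zeros_like(probs)
    top = np.argsort(probs)[-k:]
    out[top] = probs[top]
    return out / out.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability exceeds p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # number of tokens to keep
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_k_filter(probs, k=2))     # mass only on the two most likely tokens
print(top_p_filter(probs, p=0.8))   # keeps tokens until cumulative prob exceeds 0.8
rng = np.random.default_rng(0)
print(rng.choice(len(probs), p=top_p_filter(probs, p=0.8)))  # sample the next token id
```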
How does adjusting the temperature parameter affect text generation in Transformers?
A lower temperature makes the model’s output more deterministic, while a higher temperature increases randomness and creativity in the generated text.
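A small sketch showing how dividing the logits by the temperature before the softmax sharpens or flattens the distribution:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.1]
print(softmax_with_temperature(logits, 0.5))  # sharper: near-deterministic
print(softmax_with_temperature(logits, 1.0))  # unmodified softmax
print(softmax_with_temperature(logits, 2.0))  # flatter: more random/creative
```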
What is the function of the MLP (Multilayer Perceptron) within a Transformer block?
To process each token’s representation independently, refining the information after the attention mechanism has been applied.
Why are residual connections important in Transformer architectures?
They help in training deep networks by allowing gradients to flow through the network more effectively, mitigating issues like vanishing gradients.
What is layer normalization, and why is it used in Transformers?
Layer normalization stabilizes and accelerates training by normalizing the inputs across the features, ensuring consistent mean and variance.
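A minimal NumPy sketch of layer normalization over the feature dimension, with gamma and beta standing in for the learned scale and shift:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance,
    # then rescale and shift with the learned parameters gamma and beta.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(10, 768) * 3 + 5         # (sequence, features), arbitrary scale/shift
gamma, beta = np.ones(768), np.zeros(768)    # learned in practice; identity here
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1)[:3], y.std(axis=-1)[:3])  # roughly 0 and 1 per token
```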
What is a key?
Key (K) is what each token in the sequence offers as information. In the library analogy, keys are the tags/labels on the books; in the search-engine analogy, the key is the title of each web page in the results window. Keys represent the candidate tokens the query can attend to.
What is a value?
Value (V) is the actual information each token carries. In the library analogy, values are the contents of the books; in the search-engine analogy, the value is the content of each web page shown. Once the search term (query) has been matched against the relevant results (keys), we retrieve the content (values) of the most relevant pages.
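A tiny worked example with made-up 2-dimensional vectors tying the three together: one query is scored against three keys, and the softmaxed scores weight the corresponding values:

```python
import numpy as np

# One query and three key/value pairs, all made-up 2-d vectors.
q = np.array([1.0, 0.0])                  # what this token is looking for
K = np.array([[1.0, 0.0],                 # key 0: closely matches the query
              [0.0, 1.0],                 # key 1: unrelated direction
              [0.7, 0.7]])                # key 2: partial match
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])                # the information each token carries

scores = K @ q / np.sqrt(len(q))          # query-key similarity, scaled
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(3))                   # roughly [0.43, 0.21, 0.35]: the matching key gets most weight
print(weights @ V)                        # output leans toward value 0, with some of value 2
```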