Transformer Flashcards
What is a query?
Query (Q) is what you're looking for: what this token wants to focus on. In the search-engine analogy, the query is the text you type into the search bar. It is the token you want to "find more information about".
What is the primary function of a Transformer model in NLP?
To predict the next word in a sequence based on the context provided by previous words.
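A minimal sketch of this autoregressive loop, with a made-up next_token_probs function standing in for the full Transformer (it just returns a random distribution here):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probs(context):
    # Stand-in for the Transformer: returns a probability
    # distribution over the vocabulary given the context so far.
    logits = rng.normal(size=len(vocab))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

context = ["the", "cat"]
for _ in range(4):
    probs = next_token_probs(context)   # model forward pass
    next_id = int(np.argmax(probs))     # greedy: pick the most likely token
    context.append(vocab[next_id])
print(" ".join(context))
```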
What are the main components of a text-generative Transformer model?
The embedding layer, the Transformer blocks (each comprising an attention mechanism and an MLP), and the output probabilities layer.
What is the purpose of tokenization in Transformer models?
To split input text into smaller units called tokens, which can be words or subwords, facilitating numerical representation.
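A toy illustration of the idea, assuming a hypothetical six-entry subword vocabulary; real tokenizers such as GPT-2's byte-pair encoder learn tens of thousands of subword units from data:

```python
# Hypothetical toy vocabulary; ids are made up for illustration.
vocab = {"trans": 0, "former": 1, "models": 2, "are": 3, "power": 4, "ful": 5}

def tokenize(text):
    # Greedy longest-match split into known subwords (simplified).
    tokens = []
    for word in text.lower().split():
        while word:
            for piece in sorted(vocab, key=len, reverse=True):
                if word.startswith(piece):
                    tokens.append(piece)
                    word = word[len(piece):]
                    break
            else:
                break  # unknown fragment; real tokenizers fall back to bytes
    return tokens

tokens = tokenize("Transformer models are powerful")
ids = [vocab[t] for t in tokens]
print(tokens)  # ['trans', 'former', 'models', 'are', 'power', 'ful']
print(ids)     # [0, 1, 2, 3, 4, 5]
```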
How does positional encoding contribute to a Transformer’s understanding of sequences?
It provides information about the position of each token in the sequence, allowing the model to capture the order of words.
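The original Transformer used fixed sinusoidal encodings, while GPT-2 learns its positional embeddings; a short sketch of the sinusoidal variant:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16): one encoding vector per position
```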
What is the role of the attention mechanism in Transformers?
To allow the model to focus on relevant parts of the input sequence when generating each part of the output, capturing dependencies regardless of their distance in the sequence.
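A compact NumPy sketch of the scaled dot-product attention at the core of this mechanism; the causal mask used in GPT-style decoders is omitted for brevity:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq, d_k), K: (seq, d_k), V: (seq, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8): one output vector per input position
```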
What is the purpose of the embedding layer in a Transformer model?
The embedding layer converts input tokens into dense vector representations, capturing semantic meaning and positional information.
What are the core components of a Transformer block?
Each Transformer block consists of multi-head self-attention mechanisms and feedforward neural networks (MLPs), along with residual connections and layer normalization.
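A PyTorch sketch of one such block, assuming the pre-norm ordering used in GPT-2; exact sizes and details vary across models, and the causal attention mask is omitted for brevity:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of one block: pre-norm self-attention + MLP, each with a residual."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(              # position-wise feedforward network
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Residual connection around self-attention (queries, keys, values all come from x).
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Residual connection around the MLP.
        x = x + self.mlp(self.ln2(x))
        return x

block = TransformerBlock()
x = torch.randn(1, 10, 768)   # (batch, sequence, embedding dim)
print(block(x).shape)         # torch.Size([1, 10, 768])
```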
How does multi-head self-attention benefit the Transformer model?
Multi-head self-attention allows the model to focus on different parts of the input sequence simultaneously, capturing various relationships and dependencies.
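A shape-only sketch, assuming GPT-2-like sizes (12 heads over a 768-dimensional embedding), showing how the embedding is split into per-head subspaces and merged back:

```python
import numpy as np

seq_len, d_model, n_heads = 10, 768, 12
d_head = d_model // n_heads                    # 64 dimensions per head

x = np.random.randn(seq_len, d_model)
# Split the embedding dimension into n_heads independent subspaces.
heads = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)   # (12, 10, 64)
# Each head runs its own attention over its 64-dim slice in parallel;
# the head outputs are then concatenated back and linearly projected.
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(heads.shape, merged.shape)  # (12, 10, 64) (10, 768)
assert np.allclose(merged, x)     # split + merge is lossless
```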
What is the role of the output probabilities layer in a Transformer?
This layer transforms the final hidden states into a probability distribution over the vocabulary, enabling the model to predict the next token.
How is the output probability distribution typically computed in Transformers?
The final hidden state is passed through a linear transformation followed by a softmax function to produce the probability distribution over possible next tokens.
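A sketch with a random matrix standing in for the learned projection; in GPT-2 this projection shares (ties) its weights with the token embedding matrix:

```python
import numpy as np

d_model, vocab_size = 768, 50257
rng = np.random.default_rng(0)

hidden = rng.normal(size=(d_model,))                        # final hidden state of the last token
W_unembed = rng.normal(size=(d_model, vocab_size)) * 0.02   # stand-in for the learned projection

logits = hidden @ W_unembed                   # (50257,) one score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax: probabilities over the vocabulary
print(probs.shape, round(probs.sum(), 6))     # (50257,) 1.0
print(int(np.argmax(probs)))                  # id of the most probable next token
```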
What is the purpose of the embedding layer in a Transformer model?
To convert input tokens into dense vector representations that capture semantic meaning and positional information, enabling the model to process textual data numerically.
What are the four main steps in converting input text into embeddings in a Transformer model?
(1) Tokenization, (2) token embedding, (3) positional encoding, and (4) the final embedding (the sum of the token and positional embeddings).
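These steps sketched with random matrices standing in for GPT-2's learned embedding tables (vocabulary 50,257, context length 1,024, embedding size 768); the token ids are hypothetical:

```python
import numpy as np

vocab_size, max_positions, d_model = 50257, 1024, 768
rng = np.random.default_rng(0)

token_embeddings = rng.normal(size=(vocab_size, d_model)) * 0.02      # (50257, 768)
position_embeddings = rng.normal(size=(max_positions, d_model)) * 0.02

token_ids = np.array([15496, 995])        # hypothetical ids produced by the tokenizer
positions = np.arange(len(token_ids))     # [0, 1]

# Final embedding = token embedding + positional embedding, per token.
final = token_embeddings[token_ids] + position_embeddings[positions]
print(final.shape)  # (2, 768): one vector per input token
```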
Why is tokenization important in the embedding process?
Tokenization breaks down input text into smaller units called tokens (words or subwords), which are then mapped to numerical representations for processing by the model.
How does GPT-2 represent each token in its vocabulary?
GPT-2 represents each token as a 768-dimensional vector, with all token embeddings stored in a matrix of shape (50,257, 768), corresponding to its vocabulary size and embedding dimension.
How is the final embedding for each token obtained?
By summing the token embedding and its corresponding positional encoding, resulting in a vector that encapsulates both the token’s meaning and its position in the sequence.
What is the shape of the token embedding matrix in GPT-2, and what does it represent?
The token embedding matrix in GPT-2 has a shape of (50,257, 768), representing 50,257 unique tokens each mapped to a 768-dimensional vector, capturing the semantic meaning of each token.
What is the difference between top-k and top-p (nucleus) sampling in text generation?
Top-k sampling selects the next word from the top ‘k’ probable words, while top-p sampling selects from the smallest set of words whose cumulative probability exceeds a threshold ‘p’.
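A minimal NumPy sketch of both filters applied to a toy five-token distribution:

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep only the k most probable tokens, then renormalize.
    out = np.zeros_like(probs)
    top = np.argsort(probs)[-k:]
    out[top] = probs[top]
    return out / out.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability exceeds p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # number of tokens to keep
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_k_filter(probs, k=2))     # mass only on the two most likely tokens
print(top_p_filter(probs, p=0.8))   # keeps tokens until cumulative prob exceeds 0.8
rng = np.random.default_rng(0)
print(rng.choice(len(probs), p=top_p_filter(probs, p=0.8)))  # sample the next token id
```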
How does adjusting the temperature parameter affect text generation in Transformers?
A lower temperature makes the model’s output more deterministic, while a higher temperature increases randomness and creativity in the generated text.
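A small sketch showing how dividing the logits by the temperature before the softmax sharpens or flattens the distribution:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.1]
print(softmax_with_temperature(logits, 0.5))  # sharper: near-deterministic
print(softmax_with_temperature(logits, 1.0))  # unmodified softmax
print(softmax_with_temperature(logits, 2.0))  # flatter: more random/creative
```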
What is the function of the MLP (Multilayer Perceptron) within a Transformer block?
To process each token’s representation independently, refining the information after the attention mechanism has been applied.
Why are residual connections important in Transformer architectures?
They help in training deep networks by allowing gradients to flow through the network more effectively, mitigating issues like vanishing gradients.
What is layer normalization, and why is it used in Transformers?
Layer normalization stabilizes and accelerates training by normalizing the inputs across the features, ensuring consistent mean and variance.
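A minimal NumPy sketch of layer normalization over the feature dimension, with gamma and beta standing in for the learned scale and shift:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance,
    # then rescale and shift with the learned parameters gamma and beta.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(10, 768) * 3 + 5         # (sequence, features), arbitrary scale/shift
gamma, beta = np.ones(768), np.zeros(768)    # learned in practice; identity here
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1)[:3], y.std(axis=-1)[:3])  # roughly 0 and 1 per token
```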
What is a key?
Key (K) is what each token in the sequence offers as information. In the library analogy, keys are the tags/labels on the books; in the search-engine analogy, the key is the title of each web page in the results window. Keys represent the candidate tokens the query can attend to.
What is a value?
Value (V) is the actual information each token carries. In the library analogy, values are the contents of the books; in the search-engine analogy, the value is the content of each web page shown. Once the search term (query) has been matched against the relevant results (keys), we retrieve the content (values) of the most relevant pages.
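A tiny worked example with made-up 2-dimensional vectors tying the three together: one query is scored against three keys, and the softmaxed scores weight the corresponding values:

```python
import numpy as np

# One query and three key/value pairs, all made-up 2-d vectors.
q = np.array([1.0, 0.0])                  # what this token is looking for
K = np.array([[1.0, 0.0],                 # key 0: closely matches the query
              [0.0, 1.0],                 # key 1: unrelated direction
              [0.7, 0.7]])                # key 2: partial match
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])                # the information each token carries

scores = K @ q / np.sqrt(len(q))          # query-key similarity, scaled
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(3))                   # roughly [0.43, 0.21, 0.35]: the matching key gets most weight
print(weights @ V)                        # output leans toward value 0, with some of value 2
```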