Models & Architecture Flashcards

(40 cards)

1
Q

What is a vector

A

It is a mathematical representation of a word or sequence of words
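A minimal sketch (using numpy; the dimension and values are purely illustrative) of a word stored as a vector:

import numpy as np

# Hypothetical 4-dimensional embedding for the word "cat"; real embeddings are learned, not hand-set
cat_vector = np.array([0.21, -0.47, 0.93, 0.05])
print(cat_vector.shape)  # (4,)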

2
Q

Is a GenAI model a neural network?

A

True.

3
Q

What is an autoencoder

A

It is a type of neural network used for unsupervised learning, and it plays an important role in Generative AI (GenAI). Its primary purpose is to learn efficient, low-dimensional representations of input data (encoding) and then reconstruct the original data from these representations (decoding).
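A minimal PyTorch sketch of the idea (the layer sizes and the 784-dimensional input are illustrative assumptions, not from the card):

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input to a low-dimensional latent code
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # Decoder: reconstruct the original input from the latent code
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)              # dummy batch of 16 flattened 28x28 "images"
loss = nn.MSELoss()(model(x), x)     # reconstruction loss compares output to input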

4
Q

What is an Encoder?

A

Transforms the input data into a compressed, lower-dimensional latent representation.
For example, an image might be reduced from a high-dimensional array to a compact vector.

5
Q

What is a Decoder?

A

Reconstructs the original data from the compressed latent representation.
It attempts to recreate the input as accurately as possible.

6
Q

Autoencoders are foundational in Generative AI for tasks such as:

A

Image Generation: VAEs are used to generate realistic images by learning latent representations of image datasets.

Data Augmentation: Autoencoders can generate variations of existing data to improve training in machine learning models.

Anomaly Detection: In applications like fraud detection or medical imaging, autoencoders are used to flag data points that differ significantly from learned patterns.

Style Transfer: Latent representations from autoencoders can be manipulated to alter attributes like style, color, or texture in images.

7
Q

GANs are composed of two models:

A

Generator (generates images intended to fool the discriminator) and Discriminator (predicts whether an image is real or generated)
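A hedged PyTorch sketch of the two components (the sizes, activations, and flattened 784-dimensional images are illustrative assumptions):

import torch
import torch.nn as nn

# Generator: maps random noise to a fake sample it hopes will fool the discriminator
generator = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
# Discriminator: outputs the probability that a sample is real rather than generated
discriminator = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())

noise = torch.randn(16, 64)
fake_images = generator(noise)
real_prob = discriminator(fake_images)   # close to 0 means "fake", close to 1 means "real"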

8
Q

What is Classification in terms of vectors

A

It refers to the process of categorizing data points represented as vectors into predefined groups or classes. In machine learning and data science, this involves mathematical techniques to determine which class a given vector belongs to based on its features.

9
Q

What is Normalization in the context of Generative AI (GenAI)

A

It is the process of adjusting or scaling data to improve the performance, stability, and generalization capabilities of models. This concept applies across various stages of GenAI, from preprocessing input data to normalizing intermediate representations within the model itself.

10
Q

What are the two main features of representing data as vectors?

A

Each data point is represented as a vector in a high-dimensional space.

For example, a vector x = [x1, x2, …, xn]
could represent features like height, weight, and age for a classification task.
The dimensions of the vector correspond to the number of features in the data.

11
Q

Why Normalize Inputs?

A

Prevents certain features with large values from dominating the training process.
Speeds up convergence by stabilizing the optimization process.
Reduces the likelihood of numerical instability
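A small numpy sketch of two common input normalizations (the data values are made up for illustration):

import numpy as np

X = np.array([[170.0, 65.0, 30.0],    # height (cm), weight (kg), age (years)
              [180.0, 90.0, 45.0],
              [160.0, 55.0, 22.0]])

min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # scales each feature to [0, 1]
z_score = (X - X.mean(axis=0)) / X.std(axis=0)                   # zero mean, unit variance per feature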

12
Q

What are features

A

Features are the measurable properties of data that describe its underlying characteristics. In GenAI, these features are often represented as vectors to serve as input to or output from models

13
Q

What is Cosine distance

A

is a measure used in mathematics and machine learning to calculate the dissimilarity between two vectors. It is derived from the cosine similarity, which measures how similar two vectors are based on the cosine of the angle between them. Cosine distance is used when we are more interested in the “direction” of the vectors rather than their magnitude.

Cosine distance = 1 − cos(θ) = 1 − (A ⋅ B) / (∥A∥ ∥B∥)

14
Q

What is Cosine Similarity

A

The cosine similarity between two vectors A and B is defined as:

Cosine similarity = cos(θ) = (A ⋅ B) / (∥A∥ ∥B∥)

Where:
A⋅B is the dot product of A and B.
∥A∥ and ∥B∥ are the magnitudes (Euclidean norms) of the vectors.
The result ranges from:

+1: perfectly similar (pointing in the same direction).
0: completely orthogonal (no similarity).
−1: perfectly opposite directions.

15
Q

What is word embedding?

A

It is a technique used in natural language processing (NLP) to represent words or phrases as dense vectors of numbers. These vectors capture the semantic meaning of words by placing similar words closer together in the embedding space. Word embeddings are foundational for many NLP tasks, including language modeling, machine translation, and sentiment analysis.

16
Q

What is the cosine similarity for vector A = [1,0,-1] and B = [0,1,-1]

A

Dot product:
A ⋅ B = (1×0) + (0×1) + (−1×−1) = 0 + 0 + 1 = 1

Magnitudes:

∥A∥ = (1² + 0² + (−1)²)^(1/2) = √2
∥B∥ = (0² + 1² + (−1)²)^(1/2) = √2

Cosine similarity:

cos(θ) = 1 / (√2 × √2) = 0.5

Cosine distance: 1 − 0.5 = 0.5
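The same calculation checked with numpy:

import numpy as np

A = np.array([1, 0, -1])
B = np.array([0, 1, -1])
cos_sim = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)           # 0.5, so cosine distance = 1 - 0.5 = 0.5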

17
Q

Word2Vec is supported by

A

Neural networks

18
Q

Word2Vec is a popular algorithm introduced by Google, which uses two main approaches:

A

Continuous Bag of Words (CBOW): Predicts a word given its surrounding context.
Skip-Gram: Predicts the surrounding words given a specific word.
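A hedged sketch using the gensim library (assuming gensim 4.x; the toy sentences and parameter values are illustrative):

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> Skip-Gram
print(cbow.wv["fox"].shape)   # (50,)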

19
Q

What is a Context Window?

A

A context window defines the span of words around a target word that are used during training to learn its embedding. For example, in the sentence:

“The quick brown fox jumps over the lazy dog.”
If the target word is “fox” and the context window size is 2, the words “quick”, “brown”, “jumps”, and “over” are included in the context window.
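A short Python sketch of extracting that context window (the sentence and window size come from the card; the code itself is illustrative):

sentence = "The quick brown fox jumps over the lazy dog".split()
target_index = sentence.index("fox")
window = 2

# Words up to `window` positions before and after the target word
context = sentence[max(0, target_index - window):target_index] + sentence[target_index + 1:target_index + 1 + window]
print(context)   # ['quick', 'brown', 'jumps', 'over']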

20
Q

What is supervised learning?

A

It is learning in which the labels are provided with the training data.

21
Q

What are the components of a network

A

Input layer
Hidden layer
Output layer
The word (output) to be predicted

22
Q

To multiply matrices, they must have the following structure

A

(1 × 5) × (5 × 1) = (1 × 1): the inner dimensions must match, and the result takes the outer dimensions, i.e. (m × n) × (n × p) = (m × p).
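A numpy check of the shape rule (the ones-filled matrices are just placeholders):

import numpy as np

A = np.ones((1, 5))    # 1 x 5
B = np.ones((5, 1))    # 5 x 1
C = A @ B              # inner dimensions (5 and 5) match, result takes the outer dimensions
print(C.shape)         # (1, 1)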

23
Q

What is The Continuous Bag of Words (CBOW) model

A

It is a method for generating word embeddings, introduced as part of the Word2Vec framework by Google. In CBOW, the goal is to predict a target word based on its surrounding context words. It is an efficient approach for learning dense, low-dimensional vector representations of words.

24
Q

The CBOW Model Overview

A

Input: A context window of words surrounding a target word.
Output: The target word.
Objective: Minimize the error in predicting the target word given its context words.

25
Q

How does CBOW work?

A

Define the Context Window: Choose a window size c that specifies how many words before and after the target word are considered. For example, if c = 2 and the sentence is "The quick brown fox jumps", then for the target word "brown" the context is ["The", "quick", "fox", "jumps"].
One-Hot Encoding of Words: Represent each word in the vocabulary as a one-hot vector whose dimension equals the vocabulary size. For example, if the vocabulary size is 10,000, each word is a vector of length 10,000 with a single 1 indicating the word.
Input Representation: Compute the average of the one-hot encoded vectors of the context words. This is the input to the model.
Hidden Layer Transformation: Pass the input through a hidden layer with weight matrix W, producing a lower-dimensional dense vector h = Wᵀ · x̄, where x̄ is the averaged one-hot context vector. This is the embedding space.
Output Layer and Softmax: Multiply the hidden vector by a second set of weights to produce a score for each word in the vocabulary, then apply the softmax function to convert these scores into probabilities, representing the likelihood of each word being the target.
Prediction and Loss Calculation: Compare the predicted probability of the target word to the actual target word using a loss function (typically cross-entropy loss), and adjust the weights using backpropagation to reduce the loss.
Training: Repeat the process for every word in the corpus across multiple epochs; the embeddings are refined iteratively until the model converges.
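A toy numpy sketch of one CBOW forward pass (the vocabulary, dimensions, and random weights are all illustrative assumptions):

import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps"]
V, D = len(vocab), 3                    # vocabulary size, embedding dimension
W_in = np.random.rand(V, D)             # input-to-hidden weights (the embeddings being learned)
W_out = np.random.rand(D, V)            # hidden-to-output weights

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

context = ["the", "quick", "fox", "jumps"]               # context for target "brown", window = 2
x_avg = np.mean([one_hot(w) for w in context], axis=0)   # average the one-hot context vectors
h = x_avg @ W_in                                         # hidden layer: dense embedding
scores = h @ W_out
probs = np.exp(scores) / np.exp(scores).sum()            # softmax over the vocabulary
print(vocab[int(np.argmax(probs))])                      # predicted target word (untrained, so arbitrary)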
26
Q

How does word embedding work for "I love machine learning models"?

A

Context window size c = 2. Target: "machine". Context: ["I", "love", "learning", "models"].
The process: convert the context words ("I", "love", "learning", "models") to one-hot vectors, average the one-hot vectors, pass the averaged vector through the hidden layer to obtain a dense embedding, and use the output layer to predict "machine".
27
Q

What is the main purpose of neural networks in the word embedding context?

A

They learn the weights that become the word embeddings.
28
Q

What is Self-Attention?

A

Self-attention computes the importance of each word in a sequence relative to every other word in the same sequence. Unlike static word embeddings (e.g., Word2Vec, GloVe), self-attention produces contextualized embeddings, where a word's representation changes depending on the surrounding context.
29
Q

How does self-attention handle context in these two sentences: "The bank of the river was serene." / "She deposited money at the bank."

A

The word "bank" has different meanings. Using self-attention, the model considers surrounding words like "river" (sentence 1) or "money" (sentence 2) to generate contextually relevant embeddings.
30
Q

What are similarity values?

A

They quantify how closely two data points (e.g., words, sentences, or other entities) resemble each other. These values are central to many machine learning, natural language processing (NLP), and information retrieval tasks. They are calculated using similarity metrics, which measure the relationship between two vectors, sets, or other data representations.
31
Q

How can we write down the similarity values for "She wore a ring"?

A

E'ring = s1 · Eshe + s2 · Ewore + s3 · Ea + s4 · Ering
where Ew is the embedding of word w and s1…s4 are the similarity (attention) weights, so the contextualized embedding of "ring" is a weighted sum of all the word embeddings in the sentence.
32
Q

Similarities are calculated by using

A

cos(θ) (cosine similarity)
33
Q

In self-attention, each word embedding plays different roles, such as

A

Query, Key, and Value
34
Q

The combination of Query and Key forms

A

The dot product QKᵀ (the similarity scores)
35
Q

What is the purpose of softmax?

A

To normalize the similarity scores so that they sum to 1, turning them into attention weights.
36
Q

The softmax output matrix combined with the Values forms

A

The new (contextualized) embedding matrix: the softmax weights are multiplied by the Value vectors.
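A numpy sketch of scaled dot-product self-attention, tying the Query/Key/Value, softmax, and Value-weighting cards together (shapes and random values are illustrative assumptions):

import numpy as np

np.random.seed(0)
seq_len, d = 4, 8                        # 4 tokens, embedding dimension 8
X = np.random.rand(seq_len, d)           # input embeddings
W_q, W_k, W_v = (np.random.rand(d, d) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # each word takes on the Query, Key, and Value roles
scores = Q @ K.T / np.sqrt(d)            # dot product of Queries and Keys = similarity scores
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax: each row sums to 1
contextual = weights @ V                 # softmax weights times the Values = contextualized embeddings
print(contextual.shape)                  # (4, 8)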
37
Q

What is the Transformer model?

A

It is a deep learning architecture introduced by Vaswani et al. in their 2017 paper "Attention Is All You Need." It revolutionized natural language processing (NLP) and machine learning tasks by introducing the self-attention mechanism, which allows the model to weigh the importance of different input elements dynamically.
38
Q

What are the key features of the Transformer model?

A

Self-Attention Mechanism: enables the model to capture relationships between words in a sequence regardless of their distance from each other; each word (or token) attends to every other word, creating context-aware representations.
Positional Encoding: since Transformers do not process data sequentially (like RNNs), positional encodings are added to the input embeddings to retain order information.
Encoder-Decoder Structure: the encoder processes the input sequence and generates a contextual representation for each token; the decoder uses the encoder's outputs and prior predictions to generate the target sequence.
Multi-Head Attention: multiple attention mechanisms run in parallel, allowing the model to focus on different aspects of the sequence simultaneously.
Feedforward Layers: after the attention mechanism, each token's representation passes through a feedforward neural network to enhance its expressiveness.
Layer Normalization and Residual Connections: help stabilize training and enable deep architectures by preventing vanishing/exploding gradients.
Scalability and Parallelization: unlike RNNs, Transformers process input sequences in parallel, leading to significant improvements in training speed.
39
Q

What is an autoencoder?

A

It is a neural network that learns to compress and reconstruct input data.