Multimodal LLMs Explanation Flashcards
(20 cards)
What is the main limitation of context-free word embeddings like Word2Vec and GloVe?
They assign the same embedding to a word regardless of its context
Context-free embeddings ignore how word meaning changes depending on surrounding text.
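A minimal sketch (plain NumPy, made-up vectors) of why a static lookup table ignores context:

```python
# Minimal sketch: a static (context-free) embedding is just a lookup table,
# so "bank" maps to the same vector in both sentences. Vectors are made up.
import numpy as np

static_embeddings = {          # hypothetical pretrained table (Word2Vec-style)
    "river": np.array([0.1, 0.9]),
    "bank":  np.array([0.5, 0.5]),
    "money": np.array([0.8, 0.2]),
}

sentence_a = ["river", "bank"]   # "bank" = riverside
sentence_b = ["money", "bank"]   # "bank" = financial institution

vec_a = static_embeddings["bank"]  # same lookup ...
vec_b = static_embeddings["bank"]  # ... regardless of the neighbouring words
print(np.array_equal(vec_a, vec_b))  # True: context is ignored
```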
In transformer models, the purpose of positional encoding is to:
Encode the sequence order of tokens
Transformers are order-agnostic, so this injects sequence info.
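A small sketch of the classic sinusoidal scheme from the original Transformer paper; sequence length and model dimension are arbitrary:

```python
# Sinusoidal positional encoding: each position gets a unique, deterministic
# vector, and the whole table can be computed in parallel.
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                     # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                       # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)  # (50, 64); added to token embeddings before the first layer
```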
Which of the following is true about self-attention?
It compares elements within the same sequence
Self-attention computes weighted combinations of elements across the input.
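A minimal single-head, scaled dot-product self-attention sketch in NumPy (random weights, no multi-head split):

```python
# Self-attention: queries, keys and values are all projections of the SAME
# sequence, and each output token is a weighted mix of the value vectors.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project the same sequence
    scores = q @ k.T / np.sqrt(k.shape[-1])      # compare every token pair
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v                           # weighted combination of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                      # 5 tokens, d_model = 16
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)     # (5, 16)
```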
Which task is BERT primarily designed for?
Masked Language Modeling
BERT is trained to fill in masked words in a sentence.
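A rough sketch of how a masked language modeling training example could be built; the 15% masking rate follows BERT, everything else (word-level tokens, no 80/10/10 replacement rule) is simplified:

```python
# Build an MLM example: hide ~15% of tokens and keep the originals as targets.
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

masked, targets = [], []
for tok in tokens:
    if random.random() < 0.15:      # BERT masks roughly 15% of tokens
        masked.append("[MASK]")
        targets.append(tok)         # the model must recover this word
    else:
        masked.append(tok)
        targets.append(None)        # no loss on unmasked positions

print(masked)    # which tokens get masked varies per run
print(targets)
```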
What is the role of the query in a cross-attention mechanism?
To query information from the keys and values of another modality
In cross-attention, Q attends to K/V from another source.
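A NumPy sketch of cross-attention in which text tokens provide the queries and image patches provide the keys and values; all shapes and weights are made up:

```python
# Cross-attention: Q comes from one sequence (text), K/V come from another
# (encoded image patches), so the text side can pull in visual information.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
text_states  = rng.normal(size=(7, 32))     # 7 text tokens   -> queries
image_states = rng.normal(size=(49, 32))    # 49 image patches -> keys/values

w_q, w_k, w_v = (rng.normal(size=(32, 32)) for _ in range(3))
q = text_states @ w_q
k = image_states @ w_k
v = image_states @ w_v

weights = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)   # (7, 49)
fused = weights @ v                                          # (7, 32)
print(fused.shape)   # one visually-informed vector per text token
```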
In CLIP, the Image Encoder is most commonly based on which architecture?
ResNet or Vision Transformer (ViT)
CLIP can use either ResNet or ViT for encoding images.
Which architecture uses masked self-attention to prevent tokens from seeing future tokens during training?
Decoder-only Transformers
Masked self-attention blocks future information in autoregressive models.
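A small sketch of a causal mask: future positions are set to minus infinity before the softmax, so their attention weights become exactly zero:

```python
# Causal (masked) self-attention: token i may only attend to tokens 0..i.
import numpy as np

seq_len = 5
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(causal_mask, -np.inf, scores)    # hide future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))    # upper triangle is exactly 0
```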
In multimodal retrieval, cosine similarity is used to:
Measure the similarity between image and text embeddings
Cosine similarity checks alignment between vectors regardless of magnitude.
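A minimal retrieval sketch: normalise both sets of embeddings, then a single matrix product yields every image-text cosine similarity (the embeddings here are random stand-ins):

```python
# Cosine similarity for retrieval: normalise, then take dot products.
import numpy as np

rng = np.random.default_rng(0)
image_embs = rng.normal(size=(4, 512))    # 4 images in a shared embedding space
text_embs  = rng.normal(size=(3, 512))    # 3 captions, same dimensionality

image_embs /= np.linalg.norm(image_embs, axis=-1, keepdims=True)
text_embs  /= np.linalg.norm(text_embs,  axis=-1, keepdims=True)

similarity = image_embs @ text_embs.T      # (4, 3), values in [-1, 1]
best_caption_per_image = similarity.argmax(axis=1)
print(similarity.shape, best_caption_per_image)
```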
Which type of tokenization is typically used in modern LLMs to handle rare words efficiently?
Subword tokenization
Subword units like BPE help balance vocab size and coverage.
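A toy greedy segmentation over a hypothetical subword vocabulary; real BPE/WordPiece vocabularies are learned from corpus statistics, but the idea of covering rare words with frequent pieces is the same:

```python
# Toy subword tokenizer: longest-match-first over a made-up vocabulary, so a
# rare word like "unhappiness" is split into known pieces instead of becoming
# an out-of-vocabulary token.
vocab = {"un", "happi", "ness", "happy", "the", "cat"}   # hypothetical subwords

def tokenize(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):          # longest match first
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append(word[start])                   # fall back to a character
            start += 1
    return pieces

print(tokenize("unhappiness"))   # ['un', 'happi', 'ness']
```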
In Vision Transformers (ViT), the input image is first:
Split into patches and embedded
ViT breaks images into patches and embeds them like tokens.
Which of the following are advantages of using word embeddings?
A. Fixed vector length
B. Quantifiable semantic similarity
C. Character-level modeling
D. Direct text generation
Fixed vector length; Quantifiable semantic similarity
Embeddings are fixed-length and allow measuring similarity between words.
Which layers or blocks are typically involved in the Transformer encoder?
A. Multi-head self-attention
B. Feed-forward networks
C. Positional encoding
D. LSTMs
Multi-head self-attention; Feed-forward networks; Positional encoding
These are the core blocks of the encoder architecture.
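A minimal encoder block sketch built from standard PyTorch modules (post-norm layout, arbitrary sizes); positional encoding is assumed to have been added to the input already:

```python
# One Transformer encoder block: multi-head self-attention plus a feed-forward
# network, each followed by a residual connection and layer norm.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)     # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)         # residual + layer norm
        x = self.norm2(x + self.ff(x))       # feed-forward + residual + norm
        return x

tokens = torch.randn(2, 10, 64)              # (batch, seq_len, d_model)
print(EncoderBlock()(tokens).shape)          # torch.Size([2, 10, 64])
```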
In the Attention mechanism, which components must have the same dimensionality?
A. Queries
B. Keys
C. Values
D. Embeddings
Queries; Keys
For dot product attention, Q and K must have the same size.
Which tasks are mentioned as applications for multimodal transformers?
A. Image Captioning
B. Visual Question Answering (VQA)
C. Speech Recognition
D. Object Detection
Image Captioning; Visual Question Answering (VQA); Speech Recognition
These are real-world use cases for multimodal models.
Which of the following are properties required for positional encoding?
A. Unique for each time step
B. Deterministic
C. Parallelizable
D. Stochastic
Unique for each time step; Deterministic; Parallelizable
Classic encodings must be reproducible and allow efficient computation.
Which of the following statements about CLIP are true?
A. It uses contrastive learning to align images and text
B. It uses the same encoder for text and images
C. It computes cosine similarity between image and text embeddings
D. It requires fine-tuning for each task
It uses contrastive learning to align images and text; It computes cosine similarity between image and text embeddings
CLIP aligns vision and text in a joint embedding space using cosine similarity.
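A sketch of CLIP-style contrastive alignment with the two encoders stubbed out as random features; the symmetric cross-entropy over the cosine-similarity matrix is the core idea (real CLIP uses a ResNet/ViT image encoder and a text Transformer):

```python
# CLIP-style contrastive loss: normalise image and text embeddings, build a
# temperature-scaled cosine-similarity matrix, and treat the matching
# (diagonal) pairs as the correct class in both directions.
import torch
import torch.nn.functional as F

batch = 8
image_features = torch.randn(batch, 512)    # stand-in for image-encoder output
text_features  = torch.randn(batch, 512)    # stand-in for text-encoder output

image_features = F.normalize(image_features, dim=-1)
text_features  = F.normalize(text_features, dim=-1)

temperature = 0.07
logits = image_features @ text_features.t() / temperature   # (batch, batch)

targets = torch.arange(batch)               # i-th image matches i-th caption
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```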
Which techniques are used to generate embeddings?
A. CNNs
B. MLPs
C. Autoencoders
D. Tokenizers
CNNs; MLPs; Autoencoders
Various models can learn to encode input data into vectors.
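A tiny autoencoder sketch: the MLP encoder's bottleneck output serves as the embedding, and the decoder exists only to provide a reconstruction objective (all sizes are arbitrary):

```python
# Autoencoder as an embedding generator: compress the input to a low-
# dimensional bottleneck, then reconstruct it to get a training signal.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.randn(16, 784)                   # e.g. flattened 28x28 images
embedding = encoder(x)                     # (16, 32) learned representation
reconstruction = decoder(embedding)        # trained with e.g. MSE against x
loss = nn.functional.mse_loss(reconstruction, x)
print(embedding.shape, loss.item())
```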
Cross-attention in the decoder is used in which tasks?
A. Speech Recognition
B. Image Captioning
C. Language Modeling
D. Style Transfer
Speech Recognition; Image Captioning
Cross-attention lets the decoder reference the encoder output.
In a Vision Transformer (ViT), what operations are applied to image patches?
A. Flattening
B. Linear projection to embedding
C. Sinusoidal positional encoding
D. Token masking
Flattening; Linear projection to embedding; Sinusoidal positional encoding
These convert image patches to token-like embeddings.
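A NumPy sketch of the three patch operations on this card (flattening, linear projection, sinusoidal positional encoding); note that many ViT implementations learn the position embeddings instead of using the fixed sinusoidal table:

```python
# ViT-style patch embedding: split into non-overlapping patches, flatten each
# patch, project it to the model dimension, and add a positional encoding.
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

image = np.random.rand(224, 224, 3)                        # H, W, C
patch = 16
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

w_proj = np.random.randn(patch * patch * 3, 512) * 0.02    # linear projection
tokens = patches @ w_proj + sinusoidal_pe(patches.shape[0], 512)
print(tokens.shape)   # (196, 512): one token per 16x16 patch
```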
Which are characteristics of Encoder-Decoder architectures?
A. Encoder learns contextual embeddings
B. Decoder predicts next token
C. Can use cross-attention
D. Only works on text
Encoder learns contextual embeddings; Decoder predicts next token; Can use cross-attention
Encoder-decoder models are used for sequence-to-sequence tasks like translation.
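A compact sketch using PyTorch's built-in nn.Transformer, which wires an encoder and a decoder together with cross-attention; input embeddings and the final output head are omitted:

```python
# Encoder-decoder: the encoder builds contextual embeddings of the source, and
# the decoder attends to them via cross-attention while predicting the target
# autoregressively (enforced here with a causal target mask).
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(2, 12, 64)   # embedded source (text, audio or image features)
tgt = torch.randn(2, 7, 64)    # embedded target tokens generated so far

causal = model.generate_square_subsequent_mask(7)   # no peeking at future tokens
out = model(src, tgt, tgt_mask=causal)
print(out.shape)               # torch.Size([2, 7, 64]); fed to an output head
```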