Multimodal LLMs Flashcards
(20 cards)
What is the main limitation of context-free word embeddings like Word2Vec and GloVe?
They assign the same embedding to a word regardless of its context
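A minimal sketch of why this matters (the vectors and vocabulary below are made up for illustration): a context-free embedding is just a static lookup table, so a polysemous word like "bank" receives the identical vector in every sentence.

```python
# Context-free embeddings are a static lookup table: one vector per word type.
# The 3-dimensional vectors below are hypothetical, purely for illustration.
static_embeddings = {
    "bank":  [0.21, -0.43, 0.88],
    "river": [0.15, -0.40, 0.91],
    "money": [0.55,  0.10, -0.20],
}

sent1 = "she sat by the river bank".split()
sent2 = "he deposited money at the bank".split()

vec1 = static_embeddings["bank"]   # same vector ...
vec2 = static_embeddings["bank"]   # ... in both contexts
print(vec1 == vec2)                # True: the surrounding words are ignored
```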
In transformer models, the purpose of positional encoding is to:
Encode the sequence order of tokens
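A minimal NumPy sketch of the classic sinusoidal encoding from "Attention Is All You Need"; the sequence length and model dimension are arbitrary example values.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: each position gets a unique,
    deterministic vector that can be computed for all positions in parallel."""
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                    # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])               # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])               # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)   # (50, 64) -- added to the token embeddings before the encoder
```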
Which of the following is true about self-attention?
It compares elements within the same sequence
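A single-head self-attention sketch in NumPy: queries, keys, and values are all projections of the same sequence, so every token is compared with every other token of that sequence. The weight matrices and shapes are made up for illustration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention: Q, K, V all come from the same
    sequence x, so each token attends to all tokens of that sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                     # token-to-token comparison
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the same sequence
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                            # 5 tokens, 16-dim embeddings
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)           # (5, 16)
```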
Which task is BERT primarily designed for?
Masked Language Modeling
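A toy sketch of the masked-language-modeling objective; real BERT works on subword tokens, adds special tokens, and masks roughly 15% of positions.

```python
import random

# Toy masked-language-modeling sketch: hide ~15% of the tokens and train the
# model to predict them from both left and right context.
tokens = "the cat sat on the mat".split()
mask_positions = random.sample(range(len(tokens)), k=max(1, int(0.15 * len(tokens))))
masked = ["[MASK]" if i in mask_positions else t for i, t in enumerate(tokens)]
labels = {i: tokens[i] for i in mask_positions}   # targets the model must recover
print(masked, labels)
```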
What is the role of the query in a cross-attention mechanism?
To query information from the keys and values of another modality (or another sequence, such as the encoder output)
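A NumPy sketch of cross-attention between two hypothetical feature sets: queries come from text hidden states, keys and values from image features, so the text side "asks" which visual features are relevant. All shapes and weights are illustrative.

```python
import numpy as np

def cross_attention(text_hidden, image_features, w_q, w_k, w_v):
    """Cross-attention: queries from one modality (text), keys and values
    from another (image), unlike self-attention where all three share a source."""
    q = text_hidden @ w_q            # queries from the text side
    k = image_features @ w_k         # keys from the image encoder
    v = image_features @ w_v         # values from the image encoder
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v               # text tokens enriched with visual information

rng = np.random.default_rng(1)
text_hidden = rng.normal(size=(4, 32))       # 4 text tokens
image_features = rng.normal(size=(9, 32))    # 9 image patches
w_q, w_k, w_v = (rng.normal(size=(32, 32)) for _ in range(3))
print(cross_attention(text_hidden, image_features, w_q, w_k, w_v).shape)  # (4, 32)
```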
In CLIP, the Image Encoder is most commonly based on which architecture?
ResNet or Vision Transformer (ViT)
Which architecture uses masked self-attention to prevent tokens from seeing future tokens during training?
Decoder-only Transformers
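A small NumPy sketch of the causal mask used in masked self-attention: positions to the right of the current token are set to negative infinity before the softmax, so token i can only attend to tokens at positions up to i. The score matrix here is random, just to show the mask's effect.

```python
import numpy as np

# Causal (masked) self-attention: hide future positions before the softmax.
seq_len = 5
scores = np.random.default_rng(2).normal(size=(seq_len, seq_len))
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[causal_mask] = -np.inf                  # future tokens get -inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                    # upper triangle is all zeros
```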
In multimodal retrieval, cosine similarity is used to:
Measure the similarity between image and text embeddings in a shared embedding space
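A minimal NumPy sketch of retrieval by cosine similarity; the random embeddings stand in for the outputs of image and text encoders that share one embedding space.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding matrices (rows = items)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(3)
image_embeddings = rng.normal(size=(10, 512))   # 10 candidate images
text_embedding = rng.normal(size=(1, 512))      # 1 query caption
scores = cosine_similarity(text_embedding, image_embeddings)
best_match = scores.argmax()                    # retrieve the most similar image
print(scores.shape, best_match)
```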
Which type of tokenization is typically used in modern LLMs to handle rare words efficiently?
Subword tokenization
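A toy greedy longest-match tokenizer in the spirit of WordPiece; the vocabulary below is invented for illustration, whereas real LLMs learn vocabularies of tens of thousands of subwords with algorithms such as BPE or Unigram.

```python
# Toy greedy longest-match subword tokenizer (WordPiece-style sketch).
vocab = {"un", "break", "##able", "##break", "token", "##ize", "##r", "[UNK]"}

def subword_tokenize(word):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:           # take the longest matching piece
                pieces.append(piece)
                break
            end -= 1
        else:                            # no piece matched this span
            return ["[UNK]"]
        start = end
    return pieces

print(subword_tokenize("unbreakable"))   # ['un', '##break', '##able']
print(subword_tokenize("tokenizer"))     # ['token', '##ize', '##r']
```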
In Vision Transformers (ViT), the input image is first:
Split into patches and embedded
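A NumPy sketch of the ViT front end, assuming 224x224 RGB input, 16x16 patches, and a 768-dimensional embedding (the ViT-Base configuration); the projection matrix is random here but learned in practice.

```python
import numpy as np

# Split the image into non-overlapping patches, flatten each patch,
# and linearly project it to the model dimension.
rng = np.random.default_rng(4)
image = rng.normal(size=(224, 224, 3))           # H x W x C
patch_size, d_model = 16, 768

# (224/16)^2 = 196 patches, each 16*16*3 = 768 raw values
patches = image.reshape(224 // patch_size, patch_size,
                        224 // patch_size, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * 3)

projection = rng.normal(size=(patches.shape[1], d_model))   # learned in practice
patch_embeddings = patches @ projection
print(patch_embeddings.shape)                    # (196, 768), fed to the encoder
```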
Which of the following are advantages of using word embeddings?
A) Fixed vector length
B) Independent of vocabulary size
C) Direct interpretation of sentence grammar
D) Quantifiable semantic similarity
Fixed vector length; Quantifiable semantic similarity
Which layers or blocks are typically involved in the Transformer encoder?
A) Multi-head self-attention
B) Feed-forward networks
C) Positional encoding
D) Convolutional layers
Multi-head self-attention; Feed-forward networks; Positional encoding
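A minimal PyTorch sketch of one encoder block, with illustrative dimensions; positional encodings are assumed to have been added to the token embeddings before the first block.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal Transformer encoder block: multi-head self-attention followed
    by a position-wise feed-forward network, each wrapped in a residual
    connection and layer normalization."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # Positional encoding is added to the embeddings before this block.
        attn_out, _ = self.attn(x, x, x)          # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)              # residual + norm
        x = self.norm2(x + self.ffn(x))           # residual + norm
        return x

x = torch.randn(2, 10, 256)                       # (batch, tokens, d_model)
print(EncoderBlock()(x).shape)                    # torch.Size([2, 10, 256])
```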
In the Attention mechanism, which components must have the same dimensionality?
A) Queries
B) Keys
C) Values
D) Embeddings
Queries; Keys
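A short NumPy sketch of why only queries and keys must share a dimensionality: their dot product requires matching widths, while the values can live in a different dimension that the output then inherits. All sizes are illustrative.

```python
import numpy as np

# Q and K share d_k so Q @ K.T is defined; V may use a different d_v.
n_tokens, d_k, d_v = 6, 32, 64
rng = np.random.default_rng(5)
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))     # same width as Q
V = rng.normal(size=(n_tokens, d_v))     # different width is fine

scores = Q @ K.T / np.sqrt(d_k)          # (6, 6) -- only valid because d_q == d_k
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V
print(output.shape)                      # (6, 64): the output inherits d_v
```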
Which tasks are mentioned as applications for multimodal transformers?
A) Image Captioning
B) Visual Question Answering (VQA)
C) Text Summarization
D) Speech Recognition
Image Captioning; Visual Question Answering (VQA); Speech Recognition
Which of the following are properties required for positional encoding?
A) Unique for each time step
B) Dependent on input length
C) Deterministic
D) Parallelizable
Unique for each time step; Deterministic; Parallelizable
Which of the following statements about CLIP are true?
A) It uses contrastive learning to align images and text
B) It uses a shared encoder for text and image
C) It computes cosine similarity between image and text embeddings
D) It requires fine-tuning for every new task
It uses contrastive learning to align images and text; It computes cosine similarity between image and text embeddings
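A PyTorch sketch of a CLIP-style symmetric contrastive loss, using random embeddings in place of the image and text encoder outputs and a hypothetical temperature value: matching image-text pairs sit on the diagonal of the cosine-similarity matrix and are pulled together, while all other pairs are pushed apart.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: the i-th image matches the i-th caption."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature     # cosine similarities
    targets = torch.arange(len(image_emb))            # diagonal = positive pairs
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)     # text -> image direction
    return (loss_i2t + loss_t2i) / 2

image_emb = torch.randn(8, 512)   # batch of 8 image embeddings
text_emb = torch.randn(8, 512)    # the 8 matching caption embeddings
print(clip_style_contrastive_loss(image_emb, text_emb))
```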
Which techniques are used to generate embeddings?
A) CNNs
B) MLPs
C) Tokenizers
D) Autoencoders
CNNs; MLPs; Autoencoders
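A tiny PyTorch sketch of the autoencoder route (built here from MLP layers): the network is trained to reconstruct its input, and the low-dimensional bottleneck activation is then used as the embedding. The layer sizes are arbitrary example values.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Autoencoder sketch: the bottleneck activation serves as the embedding."""
    def __init__(self, d_in=784, d_emb=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                     nn.Linear(128, d_emb))
        self.decoder = nn.Sequential(nn.Linear(d_emb, 128), nn.ReLU(),
                                     nn.Linear(128, d_in))

    def forward(self, x):
        z = self.encoder(x)              # the embedding
        return self.decoder(z), z

x = torch.randn(16, 784)                 # e.g. flattened 28x28 images
recon, embedding = TinyAutoencoder()(x)
print(embedding.shape)                   # torch.Size([16, 32])
```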
Cross-attention in the decoder is used in which tasks?
A) Speech Recognition
B) Image Captioning
C) Multimodal Retrieval
D) Text Classification
Speech Recognition; Image Captioning
In a Vision Transformer (ViT), what operations are applied to image patches?
A) Flattening
B) Linear projection to embedding
C) Sinusoidal positional encoding
D) Token masking
Flattening; Linear projection to embedding; Sinusoidal positional encoding
Which are characteristics of Encoder-Decoder architectures?
A) Encoder learns contextual embeddings
B) Decoder predicts next token
C) Only works for vision tasks
D) Can use cross-attention
Encoder learns contextual embeddings; Decoder predicts next token; Can use cross-attention
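A minimal encoder-decoder sketch using PyTorch's built-in nn.Transformer, with illustrative shapes: the encoder produces contextual embeddings of the source, and each decoder layer attends to them via cross-attention while predicting the target sequence under a causal mask.

```python
import torch
import torch.nn as nn

d_model = 128
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(2, 12, d_model)   # e.g. embedded audio frames or image patches
tgt = torch.randn(2, 7, d_model)    # embedded target tokens, shifted right
tgt_mask = model.generate_square_subsequent_mask(7)   # causal mask for the decoder

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)                    # torch.Size([2, 7, 128]); a linear head would map this to token logits
```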