Multimodal LLMs Explanation Flashcards
(20 cards)
What is the main limitation of context-free word embeddings like Word2Vec and GloVe?
They assign the same embedding to a word regardless of its context
Context-free embeddings ignore how word meaning changes depending on surrounding text.
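A minimal sketch (plain NumPy, made-up vectors) of why a static lookup table ignores context:

```python
# Minimal sketch: a static (context-free) embedding is just a lookup table,
# so "bank" maps to the same vector in both sentences. Vectors are made up.
import numpy as np

static_embeddings = {          # hypothetical pretrained table (Word2Vec-style)
    "river": np.array([0.1, 0.9]),
    "bank":  np.array([0.5, 0.5]),
    "money": np.array([0.8, 0.2]),
}

sentence_a = ["river", "bank"]   # "bank" = riverside
sentence_b = ["money", "bank"]   # "bank" = financial institution

vec_a = static_embeddings["bank"]  # same lookup ...
vec_b = static_embeddings["bank"]  # ... regardless of the neighbouring words
print(np.array_equal(vec_a, vec_b))  # True: context is ignored
```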
In transformer models, the purpose of positional encoding is to:
Encode the sequence order of tokens
Transformers are order-agnostic, so this injects sequence info.
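A small sketch of the classic sinusoidal scheme from the original Transformer paper; sequence length and model dimension are arbitrary:

```python
# Sinusoidal positional encoding: each position gets a unique, deterministic
# vector, and the whole table can be computed in parallel.
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                     # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                       # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)  # (50, 64); added to token embeddings before the first layer
```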
Which of the following is true about self-attention?
It compares elements within the same sequence
Self-attention computes weighted combinations of elements across the input.
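A minimal single-head, scaled dot-product self-attention sketch in NumPy (random weights, no multi-head split):

```python
# Self-attention: queries, keys and values are all projections of the SAME
# sequence, and each output token is a weighted mix of the value vectors.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project the same sequence
    scores = q @ k.T / np.sqrt(k.shape[-1])      # compare every token pair
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v                           # weighted combination of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                      # 5 tokens, d_model = 16
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)     # (5, 16)
```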
Which task is BERT primarily designed for?
Masked Language Modeling
BERT is trained to fill in masked words in a sentence.
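A rough sketch of how a masked language modeling training example could be built; the 15% masking rate follows BERT, everything else (word-level tokens, no 80/10/10 replacement rule) is simplified:

```python
# Build an MLM example: hide ~15% of tokens and keep the originals as targets.
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

masked, targets = [], []
for tok in tokens:
    if random.random() < 0.15:      # BERT masks roughly 15% of tokens
        masked.append("[MASK]")
        targets.append(tok)         # the model must recover this word
    else:
        masked.append(tok)
        targets.append(None)        # no loss on unmasked positions

print(masked)    # which tokens get masked varies per run
print(targets)
```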
What is the role of the query in a cross-attention mechanism?
To query information from the keys and values of another modality
In cross-attention, Q attends to K/V from another source.
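A NumPy sketch of cross-attention in which text tokens provide the queries and image patches provide the keys and values; all shapes and weights are made up:

```python
# Cross-attention: Q comes from one sequence (text), K/V come from another
# (encoded image patches), so the text side can pull in visual information.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
text_states  = rng.normal(size=(7, 32))     # 7 text tokens   -> queries
image_states = rng.normal(size=(49, 32))    # 49 image patches -> keys/values

w_q, w_k, w_v = (rng.normal(size=(32, 32)) for _ in range(3))
q = text_states @ w_q
k = image_states @ w_k
v = image_states @ w_v

weights = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)   # (7, 49)
fused = weights @ v                                          # (7, 32)
print(fused.shape)   # one visually-informed vector per text token
```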
In CLIP, the Image Encoder is most commonly based on which architecture?
ResNet or Vision Transformer (ViT)
CLIP can use either ResNet or ViT for encoding images.
Which architecture uses masked self-attention to prevent tokens from seeing future tokens during training?
Decoder-only Transformers
Masked self-attention blocks future information in autoregressive models.
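A small sketch of a causal mask: future positions are set to minus infinity before the softmax, so their attention weights become exactly zero:

```python
# Causal (masked) self-attention: token i may only attend to tokens 0..i.
import numpy as np

seq_len = 5
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(causal_mask, -np.inf, scores)    # hide future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))    # upper triangle is exactly 0
```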
In multimodal retrieval, cosine similarity is used to:
Measure the similarity between image and text embeddings
Cosine similarity checks alignment between vectors regardless of magnitude.
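A minimal retrieval sketch: normalise both sets of embeddings, then a single matrix product yields every image-text cosine similarity (the embeddings here are random stand-ins):

```python
# Cosine similarity for retrieval: normalise, then take dot products.
import numpy as np

rng = np.random.default_rng(0)
image_embs = rng.normal(size=(4, 512))    # 4 images in a shared embedding space
text_embs  = rng.normal(size=(3, 512))    # 3 captions, same dimensionality

image_embs /= np.linalg.norm(image_embs, axis=-1, keepdims=True)
text_embs  /= np.linalg.norm(text_embs,  axis=-1, keepdims=True)

similarity = image_embs @ text_embs.T      # (4, 3), values in [-1, 1]
best_caption_per_image = similarity.argmax(axis=1)
print(similarity.shape, best_caption_per_image)
```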
Which type of tokenization is typically used in modern LLMs to handle rare words efficiently?
Subword tokenization
Subword units like BPE help balance vocab size and coverage.
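A toy greedy segmentation over a hypothetical subword vocabulary; real BPE/WordPiece vocabularies are learned from corpus statistics, but the idea of covering rare words with frequent pieces is the same:

```python
# Toy subword tokenizer: longest-match-first over a made-up vocabulary, so a
# rare word like "unhappiness" is split into known pieces instead of becoming
# an out-of-vocabulary token.
vocab = {"un", "happi", "ness", "happy", "the", "cat"}   # hypothetical subwords

def tokenize(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):          # longest match first
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append(word[start])                   # fall back to a character
            start += 1
    return pieces

print(tokenize("unhappiness"))   # ['un', 'happi', 'ness']
```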
In Vision Transformers (ViT), the input image is first:
Split into patches and embedded
ViT breaks images into patches and embeds them like tokens.
Which of the following are advantages of using word embeddings?
A. Fixed vector length
B. Quantifiable semantic similarity
C. Character-level modeling
D. Direct text generation
Fixed vector length; Quantifiable semantic similarity
Embeddings are fixed-length and allow measuring similarity between words.
Which layers or blocks are typically involved in the Transformer encoder?
A. Multi-head self-attention
B. Feed-forward networks
C. Positional encoding
D. LSTMs
Multi-head self-attention; Feed-forward networks; Positional encoding
These are the core blocks of the encoder architecture.
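A minimal encoder block sketch built from standard PyTorch modules (post-norm layout, arbitrary sizes); positional encoding is assumed to have been added to the input already:

```python
# One Transformer encoder block: multi-head self-attention plus a feed-forward
# network, each followed by a residual connection and layer norm.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)     # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)         # residual + layer norm
        x = self.norm2(x + self.ff(x))       # feed-forward + residual + norm
        return x

tokens = torch.randn(2, 10, 64)              # (batch, seq_len, d_model)
print(EncoderBlock()(tokens).shape)          # torch.Size([2, 10, 64])
```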
In the Attention mechanism, which components must have the same dimensionality?
A. Queries
B. Keys
C. Values
D. Embeddings
Queries; Keys
For dot product attention, Q and K must have the same size.
Which tasks are mentioned as applications for multimodal transformers?
A. Image Captioning
B. Visual Question Answering (VQA)
C. Speech Recognition
D. Object Detection
Image Captioning; Visual Question Answering (VQA); Speech Recognition
These are real-world use cases for multimodal models.
Which of the following are properties required for positional encoding?
A. Unique for each time step
B. Deterministic
C. Parallelizable
D. Stochastic
Unique for each time step; Deterministic; Parallelizable
Classic encodings must be reproducible and allow efficient computation.
Which of the following statements about CLIP are true?
A. It uses contrastive learning to align images and text
B. It uses the same encoder for text and images
C. It computes cosine similarity between image and text embeddings
D. It requires fine-tuning for each task
It uses contrastive learning to align images and text; It computes cosine similarity between image and text embeddings
CLIP aligns vision and text in a joint embedding space using cosine similarity.
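A sketch of CLIP-style contrastive alignment with the two encoders stubbed out as random features; the symmetric cross-entropy over the cosine-similarity matrix is the core idea (real CLIP uses a ResNet/ViT image encoder and a text Transformer):

```python
# CLIP-style contrastive loss: normalise image and text embeddings, build a
# temperature-scaled cosine-similarity matrix, and treat the matching
# (diagonal) pairs as the correct class in both directions.
import torch
import torch.nn.functional as F

batch = 8
image_features = torch.randn(batch, 512)    # stand-in for image-encoder output
text_features  = torch.randn(batch, 512)    # stand-in for text-encoder output

image_features = F.normalize(image_features, dim=-1)
text_features  = F.normalize(text_features, dim=-1)

temperature = 0.07
logits = image_features @ text_features.t() / temperature   # (batch, batch)

targets = torch.arange(batch)               # i-th image matches i-th caption
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```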
Which techniques are used to generate embeddings?
A. CNNs
B. MLPs
C. Autoencoders
D. Tokenizers
CNNs; MLPs; Autoencoders
Various models can learn to encode input data into vectors.
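A tiny autoencoder sketch: the MLP encoder's bottleneck output serves as the embedding, and the decoder exists only to provide a reconstruction objective (all sizes are arbitrary):

```python
# Autoencoder as an embedding generator: compress the input to a low-
# dimensional bottleneck, then reconstruct it to get a training signal.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.randn(16, 784)                   # e.g. flattened 28x28 images
embedding = encoder(x)                     # (16, 32) learned representation
reconstruction = decoder(embedding)        # trained with e.g. MSE against x
loss = nn.functional.mse_loss(reconstruction, x)
print(embedding.shape, loss.item())
```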
Cross-attention in the decoder is used in which tasks?
A. Speech Recognition
B. Image Captioning
C. Language Modeling
D. Style Transfer
Speech Recognition; Image Captioning
Cross-attention lets the decoder reference the encoder output.
In a Vision Transformer (ViT), what operations are applied to image patches?
A. Flattening
B. Linear projection to embedding
C. Sinusoidal positional encoding
D. Token masking
Flattening; Linear projection to embedding; Sinusoidal positional encoding
These convert image patches to token-like embeddings.
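A NumPy sketch of the three patch operations on this card (flattening, linear projection, sinusoidal positional encoding); note that many ViT implementations learn the position embeddings instead of using the fixed sinusoidal table:

```python
# ViT-style patch embedding: split into non-overlapping patches, flatten each
# patch, project it to the model dimension, and add a positional encoding.
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

image = np.random.rand(224, 224, 3)                        # H, W, C
patch = 16
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

w_proj = np.random.randn(patch * patch * 3, 512) * 0.02    # linear projection
tokens = patches @ w_proj + sinusoidal_pe(patches.shape[0], 512)
print(tokens.shape)   # (196, 512): one token per 16x16 patch
```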
Which are characteristics of Encoder-Decoder architectures?
A. Encoder learns contextual embeddings
B. Decoder predicts next token
C. Can use cross-attention
D. Only works on text
Encoder learns contextual embeddings; Decoder predicts next token; Can use cross-attention
Encoder-decoder models are used for sequence-to-sequence tasks like translation.
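A compact sketch using PyTorch's built-in nn.Transformer, which wires an encoder and a decoder together with cross-attention; input embeddings and the final output head are omitted:

```python
# Encoder-decoder: the encoder builds contextual embeddings of the source, and
# the decoder attends to them via cross-attention while predicting the target
# autoregressively (enforced here with a causal target mask).
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(2, 12, 64)   # embedded source (text, audio or image features)
tgt = torch.randn(2, 7, 64)    # embedded target tokens generated so far

causal = model.generate_square_subsequent_mask(7)   # no peeking at future tokens
out = model(src, tgt, tgt_mask=causal)
print(out.shape)               # torch.Size([2, 7, 64]); fed to an output head
```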