Multimodal LLMs Explanation Flashcards

(20 cards)

1
Q

What is the main limitation of context-free word embeddings like Word2Vec and GloVe?

A

They assign the same embedding to a word regardless of its context
Context-free embeddings ignore how word meaning changes depending on surrounding text.
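
A minimal sketch of this limitation (the vectors below are made up, not trained Word2Vec weights): a context-free lookup returns exactly the same vector for "bank" whether it appears next to "river" or "money".

```python
import numpy as np

# Toy context-free embedding table (illustrative values, not trained weights).
embedding_table = {
    "bank":  np.array([0.2, -0.7, 0.5]),
    "river": np.array([0.9,  0.1, 0.3]),
    "money": np.array([-0.4, 0.8, 0.6]),
}

def embed(tokens):
    # A context-free model looks each word up independently of its neighbors.
    return [embedding_table[t] for t in tokens]

v1 = embed(["river", "bank"])[1]   # "bank" next to "river"
v2 = embed(["money", "bank"])[1]   # "bank" next to "money"
print(np.array_equal(v1, v2))      # True: identical vector in both contexts
```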

2
Q

In transformer models, the purpose of positional encoding is to:

A

Encode the sequence order of tokens
Transformers are order-agnostic, so this injects sequence info.
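
A minimal NumPy sketch of the sinusoidal positional encoding from the original Transformer (sequence length and model dimension below are illustrative): each position gets a unique, deterministic vector that can be added to the token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
# token_embeddings + pe injects order information before the first attention layer
print(pe.shape)  # (6, 8)
```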

3
Q

Which of the following is true about self-attention?

A

It compares elements within the same sequence
Self-attention computes weighted combinations of elements within the same input sequence.
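
A minimal NumPy sketch of single-head scaled dot-product self-attention (sizes and weights are illustrative): queries, keys, and values are all projections of the same sequence, so every output is a weighted combination of that sequence's own values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Q, K, V all come from the SAME sequence X -- that is what makes it "self"-attention.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise comparisons within the sequence
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted combination of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                  # 5 tokens, model dim 16 (illustrative sizes)
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 16)
```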

4
Q

Which task is BERT primarily designed for?

A

Masked Language Modeling
BERT is trained to fill in masked words in a sentence.

5
Q

What is the role of the query in a cross-attention mechanism?

A

To query information from the keys and values of another modality
In cross-attention, Q attends to K/V from another source.
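
A minimal sketch of cross-attention (the text/image shapes and weight matrices are illustrative stand-ins): queries come from one sequence, for example decoder text states, while keys and values come from another source such as image features.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, image_features, Wq, Wk, Wv):
    Q = text_hidden @ Wq          # queries from the text (decoder) side
    K = image_features @ Wk       # keys from the other modality
    V = image_features @ Wv       # values from the other modality
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V            # text tokens gather information from the image

rng = np.random.default_rng(0)
text_hidden = rng.normal(size=(4, 32))        # 4 text tokens (illustrative)
image_features = rng.normal(size=(9, 32))     # 9 image patch features (illustrative)
Wq, Wk, Wv = (rng.normal(size=(32, 32)) for _ in range(3))
print(cross_attention(text_hidden, image_features, Wq, Wk, Wv).shape)  # (4, 32)
```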

6
Q

In CLIP, the Image Encoder is most commonly based on which architecture?

A

ResNet or Vision Transformer (ViT)
CLIP can use either ResNet or ViT for encoding images.

7
Q

Which architecture uses masked self-attention to prevent tokens from seeing future tokens during training?

A

Decoder-only Transformers
Masked self-attention blocks future information in autoregressive models.
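
A minimal sketch of the causal mask used in decoder-only models (random scores stand in for real attention logits): positions above the diagonal are set to negative infinity before the softmax, so each token can only attend to itself and earlier tokens.

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal = positions that must be hidden (the "future").
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))  # raw attention scores
scores = np.where(causal_mask(seq_len), -np.inf, scores)           # block future tokens

# After the softmax, each row only puts weight on current and past positions.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # upper triangle is exactly 0
```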

8
Q

In multimodal retrieval, cosine similarity is used to:

A

Measure the similarity between image and text embeddings
Cosine similarity checks alignment between vectors regardless of magnitude.
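
A minimal sketch of cosine-similarity retrieval (random vectors stand in for real encoder outputs): L2-normalize both sides, and a single matrix product then gives every image-text similarity, independent of vector magnitude.

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    # L2-normalize so the dot product depends only on direction, not magnitude.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return image_emb @ text_emb.T          # (n_images, n_texts), values in [-1, 1]

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(3, 512))      # stand-ins for image-encoder outputs
text_emb = rng.normal(size=(5, 512))       # stand-ins for text-encoder outputs
sims = cosine_similarity_matrix(image_emb, text_emb)
best_caption = sims.argmax(axis=1)         # retrieval: best text for each image
print(sims.shape, best_caption)
```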

9
Q

Which type of tokenization is typically used in modern LLMs to handle rare words efficiently?

A

Subword tokenization
Subword units like BPE help balance vocab size and coverage.
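
A toy greedy longest-match segmenter to illustrate the idea (the vocabulary below is invented; real subword vocabularies are learned, e.g. with BPE or WordPiece): a rare word is broken into known pieces instead of becoming a single out-of-vocabulary token.

```python
# Toy subword vocabulary -- illustrative only; real vocabularies are learned (e.g. BPE/WordPiece).
VOCAB = {"un", "break", "able", "token", "ization", "trans", "form", "er"}

def subword_tokenize(word):
    """Greedy longest-match segmentation; single characters as a last resort."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):          # try the longest piece first
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:                                      # unknown character: emit it alone
            pieces.append(word[i])
            i += 1
    return pieces

print(subword_tokenize("unbreakable"))       # ['un', 'break', 'able']
print(subword_tokenize("tokenization"))      # ['token', 'ization']
```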

10
Q

In Vision Transformers (ViT), the input image is first:

A

Split into patches and embedded
ViT breaks images into patches and embeds them like tokens.
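
A minimal NumPy sketch of the ViT input step (image size, patch size, and the projection matrix are illustrative): split the image into non-overlapping patches, flatten each patch, and linearly project it to the embedding dimension so patches behave like tokens.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)                # (nH, nW, p, p, C)
    return patches.reshape(-1, patch_size * patch_size * C)   # (num_patches, p*p*C)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))            # small image for illustration
patches = patchify(image, patch_size=8)    # (16, 192) flattened patches
W_embed = rng.normal(size=(192, 64))       # stands in for the learned linear projection
tokens = patches @ W_embed                 # (16, 64): patch embeddings, like word tokens
print(patches.shape, tokens.shape)
```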

11
Q

Which of the following are advantages of using word embeddings?
A. Fixed vector length
B. Quantifiable semantic similarity
C. Character-level modeling
D. Direct text generation

A

Fixed vector length; Quantifiable semantic similarity
Embeddings are fixed-length and allow measuring similarity between words.

12
Q

Which layers or blocks are typically involved in the Transformer encoder?
A. Multi-head self-attention
B. Feed-forward networks
C. Positional encoding
D. LSTMs

A

Multi-head self-attention; Feed-forward networks; Positional encoding
These are the core blocks of the encoder architecture.

13
Q

In the Attention mechanism, which components must have the same dimensionality?
A. Queries
B. Keys
C. Values
D. Embeddings

A

Queries; Keys
For dot-product attention, Q and K must share the same dimensionality so their dot products are defined; V may use a different size.
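
A quick shape check (sizes are illustrative): the score matrix Q Kᵀ only exists when queries and keys share their last dimension, while values may use a different dimension that the output then inherits.

```python
import numpy as np

rng = np.random.default_rng(0)
n_q, n_kv = 4, 6
d_qk, d_v = 32, 64                         # d_qk shared by Q and K; d_v can differ

Q = rng.normal(size=(n_q, d_qk))
K = rng.normal(size=(n_kv, d_qk))          # same last dim as Q, or Q @ K.T fails
V = rng.normal(size=(n_kv, d_v))

scores = Q @ K.T / np.sqrt(d_qk)           # (n_q, n_kv) -- requires matching d_qk
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ V                       # (n_q, d_v): output takes the value dimension
print(scores.shape, output.shape)
```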

14
Q

Which tasks are mentioned as applications for multimodal transformers?
A. Image Captioning
B. Visual Question Answering (VQA)
C. Speech Recognition
D. Object Detection

A

Image Captioning; Visual Question Answering (VQA); Speech Recognition
These are real-world use cases for multimodal models.

15
Q

Which of the following are properties required for positional encoding?
A. Unique for each time step
B. Deterministic
C. Parallelizable
D. Stochastic

A

Unique for each time step; Deterministic; Parallelizable
Classic encodings must be reproducible and allow efficient computation.

16
Q

Which of the following statements about CLIP are true?
A. It uses contrastive learning to align images and text
B. It uses the same encoder for text and images
C. It computes cosine similarity between image and text embeddings
D. It requires fine-tuning for each task

A

It uses contrastive learning to align images and text; It computes cosine similarity between image and text embeddings
CLIP aligns vision and text in a joint embedding space using cosine similarity.
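
A minimal sketch of a CLIP-style contrastive objective (random features stand in for encoder outputs; the temperature value is illustrative): normalize both embeddings, build the cosine-similarity matrix, and apply a symmetric cross-entropy loss whose targets are the matching pairs on the diagonal.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Joint embedding space: L2-normalize, then cosine similarity is a dot product.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature        # (N, N) similarity matrix

    # The i-th image matches the i-th text, so the targets are the diagonal entries.
    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(l)), np.arange(len(l))].mean()

    # Symmetric loss: image-to-text and text-to-image.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
batch_images = rng.normal(size=(8, 512))    # stand-in for image-encoder outputs
batch_texts = rng.normal(size=(8, 512))     # stand-in for text-encoder outputs
print(clip_contrastive_loss(batch_images, batch_texts))
```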

17
Q

Which techniques are used to generate embeddings?
A. CNNs
B. MLPs
C. Autoencoders
D. Tokenizers

A

CNNs; MLPs; Autoencoders
Various models can learn to encode input data into vectors.

18
Q

Cross-attention in the decoder is used in which tasks?
A. Speech Recognition
B. Image Captioning
C. Language Modeling
D. Style Transfer

A

Speech Recognition; Image Captioning
Cross-attention lets the decoder reference the encoder output.

19
Q

In a Vision Transformer (ViT), what operations are applied to image patches?
A. Flattening
B. Linear projection to embedding
C. Sinusoidal positional encoding
D. Token masking

A

Flattening; Linear projection to embedding; Sinusoidal positional encoding
These convert image patches to token-like embeddings.

20
Q

Which are characteristics of Encoder-Decoder architectures?
A. Encoder learns contextual embeddings
B. Decoder predicts next token
C. Can use cross-attention
D. Only works on text

A

Encoder learns contextual embeddings; Decoder predicts next token; Can use cross-attention
Encoder-decoder models are used for sequence-to-sequence tasks like translation.