Multimodal LLMs Flashcards

(20 cards)

1
Q

What is the main limitation of context-free word embeddings like Word2Vec and GloVe?

A

They assign the same embedding to a word regardless of its context
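
A minimal sketch (hypothetical vectors, not trained Word2Vec or GloVe weights) of why a static lookup table is context-free: "bank" receives the identical vector in both sentences.

```python
import numpy as np

# Hypothetical static embedding table; real Word2Vec/GloVe vectors are learned.
embedding_table = {
    "bank":  np.array([0.21, -0.53, 0.80]),
    "river": np.array([0.10,  0.44, -0.32]),
    "money": np.array([-0.67, 0.12, 0.05]),
}

sentence_a = ["money", "bank"]   # financial sense of "bank"
sentence_b = ["river", "bank"]   # geographic sense of "bank"

vec_a = embedding_table["bank"]  # same vector ...
vec_b = embedding_table["bank"]  # ... regardless of the surrounding words
print(np.array_equal(vec_a, vec_b))  # True: the embedding ignores context
```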

2
Q

In transformer models, the purpose of positional encoding is to:

A

Encode the sequence order of tokens
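
A small NumPy sketch of the sinusoidal positional encoding from the original Transformer paper; `max_len` and `d_model` are illustrative values.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix that encodes each position uniquely."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even embedding dimensions
    angles = positions / np.power(10000, dims / d_model)    # (max_len, d_model // 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# Added to the token embeddings so the model can distinguish token positions.
```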

3
Q

What is a defining characteristic of self-attention?

A

It compares elements within the same sequence
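
A minimal scaled dot-product self-attention sketch in NumPy: queries, keys, and values are all projected from the same sequence, so every token is compared with every other token of that sequence (random, untrained weights; illustrative shapes).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))              # one sequence of token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                  # all derived from the SAME sequence

scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len) pairwise comparisons
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # row-wise softmax
output = weights @ V                                 # each token becomes a mix of all tokens
```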

4
Q

Which task is BERT primarily designed for?

A

Masked Language Modeling
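
A hedged example of masked language modeling in practice, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint are available; the exact predictions are illustrative.

```python
from transformers import pipeline

# BERT is pretrained to reconstruct masked tokens from bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```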

5
Q

What is the role of the query in a cross-attention mechanism?

A

To retrieve information from keys and values that come from another modality or sequence
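
A sketch to contrast with self-attention: here the queries come from one sequence (e.g., text tokens in a captioning decoder) while keys and values come from another (e.g., image patch features). Shapes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 8, 8
text_len, num_patches = 5, 9

text_states  = rng.normal(size=(text_len, d_model))      # query side (decoder)
image_states = rng.normal(size=(num_patches, d_model))   # key/value side (encoder)

W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q = text_states  @ W_q            # queries from the text modality
K = image_states @ W_k            # keys from the image modality
V = image_states @ W_v            # values from the image modality

scores  = Q @ K.T / np.sqrt(d_k)                                   # (text_len, num_patches)
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
attended = weights @ V            # each text token gathers image information
```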

6
Q

In CLIP, the Image Encoder is most commonly based on which architecture?

A

ResNet or Vision Transformer (ViT)

7
Q

Which architecture uses masked self-attention to prevent tokens from seeing future tokens during training?

A

Decoder-only Transformers
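
A sketch of the causal (look-ahead) mask behind that behaviour: score entries above the diagonal are set to minus infinity before the softmax, so each token receives zero attention weight on any later token. Values are illustrative.

```python
import numpy as np

seq_len = 5
scores = np.random.default_rng(2).normal(size=(seq_len, seq_len))   # raw attention scores

# True above the diagonal, i.e. wherever a token would "see the future".
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = np.exp(masked_scores) / np.exp(masked_scores).sum(-1, keepdims=True)
# Row i now puts zero weight on every position j > i.
```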

8
Q

In multimodal retrieval, cosine similarity is used to:

A

Compute the similarity between image and text embeddings
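
A minimal retrieval sketch with made-up embeddings: images are ranked for a text query by cosine similarity (higher means more similar).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings from an image encoder and a text encoder.
image_embeddings = {
    "dog.jpg": np.array([0.9, 0.1, 0.0]),
    "cat.jpg": np.array([0.1, 0.8, 0.2]),
}
text_embedding = np.array([0.85, 0.15, 0.05])   # e.g., "a photo of a dog"

ranked = sorted(image_embeddings.items(),
                key=lambda item: cosine_similarity(text_embedding, item[1]),
                reverse=True)
print(ranked[0][0])   # best-matching image for the query
```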

9
Q

Which type of tokenization is typically used in modern LLMs to handle rare words efficiently?

A

Subword tokenization
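
A hedged example, assuming the Hugging Face `transformers` library; the exact pieces depend on the checkpoint's learned vocabulary, so the output shown is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece vocabulary

print(tokenizer.tokenize("tokenization"))
# A rare word is split into frequent subword pieces, e.g. ['token', '##ization'],
# instead of being mapped to an unknown-word token.
```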

10
Q

In Vision Transformers (ViT), the input image is first:

A

Split into patches and embedded
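
A NumPy sketch of that first step with illustrative sizes: a 32x32 RGB image is cut into 8x8 patches, and each flattened patch is linearly projected into the model dimension (random, untrained projection).

```python
import numpy as np

rng = np.random.default_rng(3)
image = rng.normal(size=(32, 32, 3))      # toy RGB image
patch, d_model = 8, 64

# Split into non-overlapping 8x8 patches and flatten each one.
patches = (image.reshape(32 // patch, patch, 32 // patch, patch, 3)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, patch * patch * 3))      # (16, 192)

W = rng.normal(size=(patch * patch * 3, d_model))
patch_embeddings = patches @ W                        # (16, 64): a sequence of patch tokens
# Positional information is added before the Transformer encoder processes them.
```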

11
Q

Which of the following are advantages of using word embeddings?
A) Fixed vector length
B) Independent of vocabulary size
C) Direct interpretation of sentence grammar
D) Quantifiable semantic similarity

A

Fixed vector length; Quantifiable semantic similarity

12
Q

Which layers or blocks are typically involved in the Transformer encoder?
A) Multi-head self-attention
B) Feed-forward networks
C) Positional encoding
D) Convolutional layers

A

Multi-head self-attention; Feed-forward networks; Positional encoding

13
Q

In the Attention mechanism, which components must have the same dimensionality?
A) Queries
B) Keys
C) Values
D) Embeddings

A

Queries; Keys
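
A quick shape check with arbitrary sizes: the dot product Q @ K.T only works when queries and keys share the same dimensionality d_k, while values may use a different d_v.

```python
import numpy as np

rng = np.random.default_rng(4)
seq_len, d_k, d_v = 6, 32, 48           # d_v deliberately differs from d_k

Q = rng.normal(size=(seq_len, d_k))     # queries: dimension d_k
K = rng.normal(size=(seq_len, d_k))     # keys: must also be d_k for Q @ K.T
V = rng.normal(size=(seq_len, d_v))     # values: d_v may differ

scores = Q @ K.T / np.sqrt(d_k)         # (seq_len, seq_len) -- requires matching d_k
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
output = weights @ V                    # (seq_len, d_v)
```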

14
Q

Which tasks are mentioned as applications for multimodal transformers?
A) Image Captioning
B) Visual Question Answering (VQA)
C) Text Summarization
D) Speech Recognition

A

Image Captioning; Visual Question Answering (VQA); Speech Recognition

15
Q

Which of the following are properties required for positional encoding?
A) Unique for each time step
B) Dependent on input length
C) Deterministic
D) Parallelizable

A

Unique for each time step; Deterministic; Parallelizable

16
Q

Which of the following statements about CLIP are true?
A) It uses contrastive learning to align images and text
B) It uses a shared encoder for text and image
C) It computes cosine similarity between image and text embeddings
D) It requires fine-tuning for every new task

A

It uses contrastive learning to align images and text; It computes cosine similarity between image and text embeddings
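
A simplified sketch of the CLIP-style training signal with random stand-in embeddings (not the actual CLIP code): both modalities are L2-normalised, all pairwise cosine similarities are computed, and a symmetric cross-entropy loss pushes matching image-text pairs to score highest.

```python
import numpy as np

rng = np.random.default_rng(5)
batch, dim, temperature = 4, 16, 0.07

image_emb = rng.normal(size=(batch, dim))   # stand-ins for image-encoder outputs
text_emb  = rng.normal(size=(batch, dim))   # stand-ins for text-encoder outputs

# L2-normalise so dot products equal cosine similarities.
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb  /= np.linalg.norm(text_emb,  axis=1, keepdims=True)

logits = image_emb @ text_emb.T / temperature          # (batch, batch) similarity matrix

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

targets = np.arange(batch)                             # pair i is the positive for row i
loss = (cross_entropy(logits, targets) + cross_entropy(logits.T, targets)) / 2
```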

17
Q

Which techniques are used to generate embeddings?
A) CNNs
B) MLPs
C) Tokenizers
D) Autoencoders

A

CNNs; MLPs; Autoencoders

18
Q

Cross-attention in the decoder is used in which tasks?
A) Speech Recognition
B) Image Captioning
C) Multimodal Retrieval
D) Text Classification

A

Speech Recognition; Image Captioning

19
Q

In a Vision Transformer (ViT), what operations are applied to image patches?
A) Flattening
B) Linear projection to embedding
C) Sinusoidal positional encoding
D) Token masking

A

Flattening; Linear projection to embedding; Sinusoidal positional encoding

20
Q

Which are characteristics of Encoder-Decoder architectures?
A) Encoder learns contextual embeddings
B) Decoder predicts next token
C) Only works for vision tasks
D) Can use cross-attention

A

Encoder learns contextual embeddings; Decoder predicts next token; Can use cross-attention