Multimodal LLMs Flashcards
(20 cards)
What is the main limitation of context-free word embeddings like Word2Vec and GloVe?
They assign the same embedding to a word regardless of its context
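A minimal sketch of why this matters (the vectors and vocabulary below are made up for illustration): a context-free embedding is just a static lookup table, so a polysemous word like "bank" receives the identical vector in every sentence.

```python
# Context-free embeddings are a static lookup table: one vector per word type.
# The 3-dimensional vectors below are hypothetical, purely for illustration.
static_embeddings = {
    "bank":  [0.21, -0.43, 0.88],
    "river": [0.15, -0.40, 0.91],
    "money": [0.55,  0.10, -0.20],
}

sent1 = "she sat by the river bank".split()
sent2 = "he deposited money at the bank".split()

vec1 = static_embeddings["bank"]   # same vector ...
vec2 = static_embeddings["bank"]   # ... in both contexts
print(vec1 == vec2)                # True: the surrounding words are ignored
```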
In transformer models, the purpose of positional encoding is to:
Encode the sequence order of tokens
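A minimal NumPy sketch of the classic sinusoidal encoding from "Attention Is All You Need"; the sequence length and model dimension are arbitrary example values.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: each position gets a unique,
    deterministic vector that can be computed for all positions in parallel."""
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                    # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])               # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])               # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)   # (50, 64) -- added to the token embeddings before the encoder
```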
Which of the following is true about self-attention?
It compares elements within the same sequence
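A single-head self-attention sketch in NumPy: queries, keys, and values are all projections of the same sequence, so every token is compared with every other token of that sequence. The weight matrices and shapes are made up for illustration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention: Q, K, V all come from the same
    sequence x, so each token attends to all tokens of that sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                     # token-to-token comparison
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the same sequence
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                            # 5 tokens, 16-dim embeddings
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)           # (5, 16)
```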
Which task is BERT primarily designed for?
Masked Language Modeling
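A toy sketch of the masked-language-modeling objective; real BERT works on subword tokens, adds special tokens, and masks roughly 15% of positions.

```python
import random

# Toy masked-language-modeling sketch: hide ~15% of the tokens and train the
# model to predict them from both left and right context.
tokens = "the cat sat on the mat".split()
mask_positions = random.sample(range(len(tokens)), k=max(1, int(0.15 * len(tokens))))
masked = ["[MASK]" if i in mask_positions else t for i, t in enumerate(tokens)]
labels = {i: tokens[i] for i in mask_positions}   # targets the model must recover
print(masked, labels)
```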
What is the role of the query in a cross-attention mechanism?
To query information from the keys and values of another modality (or another sequence, such as the encoder output)
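A NumPy sketch of cross-attention between two hypothetical feature sets: queries come from text hidden states, keys and values from image features, so the text side "asks" which visual features are relevant. All shapes and weights are illustrative.

```python
import numpy as np

def cross_attention(text_hidden, image_features, w_q, w_k, w_v):
    """Cross-attention: queries from one modality (text), keys and values
    from another (image), unlike self-attention where all three share a source."""
    q = text_hidden @ w_q            # queries from the text side
    k = image_features @ w_k         # keys from the image encoder
    v = image_features @ w_v         # values from the image encoder
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v               # text tokens enriched with visual information

rng = np.random.default_rng(1)
text_hidden = rng.normal(size=(4, 32))       # 4 text tokens
image_features = rng.normal(size=(9, 32))    # 9 image patches
w_q, w_k, w_v = (rng.normal(size=(32, 32)) for _ in range(3))
print(cross_attention(text_hidden, image_features, w_q, w_k, w_v).shape)  # (4, 32)
```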
In CLIP, the Image Encoder is most commonly based on which architecture?
ResNet or Vision Transformer (ViT)
Which architecture uses masked self-attention to prevent tokens from seeing future tokens during training?
Decoder-only Transformers
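A small NumPy sketch of the causal mask used in masked self-attention: positions to the right of the current token are set to negative infinity before the softmax, so token i can only attend to tokens at positions up to i. The score matrix here is random, just to show the mask's effect.

```python
import numpy as np

# Causal (masked) self-attention: hide future positions before the softmax.
seq_len = 5
scores = np.random.default_rng(2).normal(size=(seq_len, seq_len))
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[causal_mask] = -np.inf                  # future tokens get -inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                    # upper triangle is all zeros
```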
In multimodal retrieval, cosine similarity is used to:
Measure the similarity between image and text embeddings in a shared embedding space
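A minimal NumPy sketch of retrieval by cosine similarity; the random embeddings stand in for the outputs of image and text encoders that share one embedding space.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding matrices (rows = items)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(3)
image_embeddings = rng.normal(size=(10, 512))   # 10 candidate images
text_embedding = rng.normal(size=(1, 512))      # 1 query caption
scores = cosine_similarity(text_embedding, image_embeddings)
best_match = scores.argmax()                    # retrieve the most similar image
print(scores.shape, best_match)
```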
Which type of tokenization is typically used in modern LLMs to handle rare words efficiently?
Subword tokenization
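A toy greedy longest-match tokenizer in the spirit of WordPiece; the vocabulary below is invented for illustration, whereas real LLMs learn vocabularies of tens of thousands of subwords with algorithms such as BPE or Unigram.

```python
# Toy greedy longest-match subword tokenizer (WordPiece-style sketch).
vocab = {"un", "break", "##able", "##break", "token", "##ize", "##r", "[UNK]"}

def subword_tokenize(word):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:           # take the longest matching piece
                pieces.append(piece)
                break
            end -= 1
        else:                            # no piece matched this span
            return ["[UNK]"]
        start = end
    return pieces

print(subword_tokenize("unbreakable"))   # ['un', '##break', '##able']
print(subword_tokenize("tokenizer"))     # ['token', '##ize', '##r']
```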
In Vision Transformers (ViT), the input image is first:
Split into patches and embedded
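A NumPy sketch of the ViT front end, assuming 224x224 RGB input, 16x16 patches, and a 768-dimensional embedding (the ViT-Base configuration); the projection matrix is random here but learned in practice.

```python
import numpy as np

# Split the image into non-overlapping patches, flatten each patch,
# and linearly project it to the model dimension.
rng = np.random.default_rng(4)
image = rng.normal(size=(224, 224, 3))           # H x W x C
patch_size, d_model = 16, 768

# (224/16)^2 = 196 patches, each 16*16*3 = 768 raw values
patches = image.reshape(224 // patch_size, patch_size,
                        224 // patch_size, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * 3)

projection = rng.normal(size=(patches.shape[1], d_model))   # learned in practice
patch_embeddings = patches @ projection
print(patch_embeddings.shape)                    # (196, 768), fed to the encoder
```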
Which of the following are advantages of using word embeddings?
A) Fixed vector length
B) Independent of vocabulary size
C) Direct interpretation of sentence grammar
D) Quantifiable semantic similarity
Fixed vector length; Quantifiable semantic similarity
Which layers or blocks are typically involved in the Transformer encoder?
A) Multi-head self-attention
B) Feed-forward networks
C) Positional encoding
D) Convolutional layers
Multi-head self-attention; Feed-forward networks; Positional encoding
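A minimal PyTorch sketch of one encoder block, with illustrative dimensions; positional encodings are assumed to have been added to the token embeddings before the first block.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal Transformer encoder block: multi-head self-attention followed
    by a position-wise feed-forward network, each wrapped in a residual
    connection and layer normalization."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # Positional encoding is added to the embeddings before this block.
        attn_out, _ = self.attn(x, x, x)          # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)              # residual + norm
        x = self.norm2(x + self.ffn(x))           # residual + norm
        return x

x = torch.randn(2, 10, 256)                       # (batch, tokens, d_model)
print(EncoderBlock()(x).shape)                    # torch.Size([2, 10, 256])
```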
In the Attention mechanism, which components must have the same dimensionality?
A) Queries
B) Keys
C) Values
D) Embeddings
Queries; Keys
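A short NumPy sketch of why only queries and keys must share a dimensionality: their dot product requires matching widths, while the values can live in a different dimension that the output then inherits. All sizes are illustrative.

```python
import numpy as np

# Q and K share d_k so Q @ K.T is defined; V may use a different d_v.
n_tokens, d_k, d_v = 6, 32, 64
rng = np.random.default_rng(5)
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))     # same width as Q
V = rng.normal(size=(n_tokens, d_v))     # different width is fine

scores = Q @ K.T / np.sqrt(d_k)          # (6, 6) -- only valid because d_q == d_k
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V
print(output.shape)                      # (6, 64): the output inherits d_v
```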
Which tasks are mentioned as applications for multimodal transformers?
A) Image Captioning
B) Visual Question Answering (VQA)
C) Text Summarization
D) Speech Recognition
Image Captioning; Visual Question Answering (VQA); Speech Recognition
Which of the following are properties required for positional encoding?
A) Unique for each time step
B) Dependent on input length
C) Deterministic
D) Parallelizable
Unique for each time step; Deterministic; Parallelizable
Which of the following statements about CLIP are true?
A) It uses contrastive learning to align images and text
B) It uses a shared encoder for text and image
C) It computes cosine similarity between image and text embeddings
D) It requires fine-tuning for every new task
It uses contrastive learning to align images and text; It computes cosine similarity between image and text embeddings
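A PyTorch sketch of a CLIP-style symmetric contrastive loss, using random embeddings in place of the image and text encoder outputs and a hypothetical temperature value: matching image-text pairs sit on the diagonal of the cosine-similarity matrix and are pulled together, while all other pairs are pushed apart.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: the i-th image matches the i-th caption."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature     # cosine similarities
    targets = torch.arange(len(image_emb))            # diagonal = positive pairs
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)     # text -> image direction
    return (loss_i2t + loss_t2i) / 2

image_emb = torch.randn(8, 512)   # batch of 8 image embeddings
text_emb = torch.randn(8, 512)    # the 8 matching caption embeddings
print(clip_style_contrastive_loss(image_emb, text_emb))
```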
Which techniques are used to generate embeddings?
A) CNNs
B) MLPs
C) Tokenizers
D) Autoencoders
CNNs; MLPs; Autoencoders
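A tiny PyTorch sketch of the autoencoder route (built here from MLP layers): the network is trained to reconstruct its input, and the low-dimensional bottleneck activation is then used as the embedding. The layer sizes are arbitrary example values.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Autoencoder sketch: the bottleneck activation serves as the embedding."""
    def __init__(self, d_in=784, d_emb=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                     nn.Linear(128, d_emb))
        self.decoder = nn.Sequential(nn.Linear(d_emb, 128), nn.ReLU(),
                                     nn.Linear(128, d_in))

    def forward(self, x):
        z = self.encoder(x)              # the embedding
        return self.decoder(z), z

x = torch.randn(16, 784)                 # e.g. flattened 28x28 images
recon, embedding = TinyAutoencoder()(x)
print(embedding.shape)                   # torch.Size([16, 32])
```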
Cross-attention in the decoder is used in which tasks?
A) Speech Recognition
B) Image Captioning
C) Multimodal Retrieval
D) Text Classification
Speech Recognition; Image Captioning
In a Vision Transformer (ViT), what operations are applied to image patches?
A) Flattening
B) Linear projection to embedding
C) Sinusoidal positional encoding
D) Token masking
Flattening; Linear projection to embedding; Sinusoidal positional encoding
Which are characteristics of Encoder-Decoder architectures?
A) Encoder learns contextual embeddings
B) Decoder predicts next token
C) Only works for vision tasks
D) Can use cross-attention
Encoder learns contextual embeddings; Decoder predicts next token; Can use cross-attention
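A minimal encoder-decoder sketch using PyTorch's built-in nn.Transformer, with illustrative shapes: the encoder produces contextual embeddings of the source, and each decoder layer attends to them via cross-attention while predicting the target sequence under a causal mask.

```python
import torch
import torch.nn as nn

d_model = 128
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(2, 12, d_model)   # e.g. embedded audio frames or image patches
tgt = torch.randn(2, 7, d_model)    # embedded target tokens, shifted right
tgt_mask = model.generate_square_subsequent_mask(7)   # causal mask for the decoder

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)                    # torch.Size([2, 7, 128]); a linear head would map this to token logits
```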