Multimodal Representation Flashcards
(28 cards)
What is multimodal representation?
It is learning a representation that reflects cross-modal interactions between individual elements across different modalities.
What are the types of multimodal representations?
Fusion: # modalities > # representations
Coordination: # modalities = # representations
Fission: # modalities < # representations
What is a basic fusion and what is a raw fusion?
Basic fusion is when modality A has the same kind of representation as modality B (e.g., both are vectors of the same size) and we merge them into one representation. It is the most common approach.
Raw fusion is when the two modalities do not share the same kind of representation (e.g., they are not yet encoded into vectors) and we still try to merge them into one representation. It is a bit more exotic.
How can a fusion be done with unimodal encoders?
Each modality is encoded into a separate representation (e.g., an embedding). For example, images are encoded with a CNN used as a feature extractor, or with a ViT (patches + attention), and text is encoded with BERT.
Then, the representations are fused together.
What is an early fusion and what is a late fusion?
Early fusion concatenates the two representations (vectors) of the two modalities and then makes a single prediction.
Late fusion makes two separate unimodal predictions and then ensembles them, e.g., by averaging the predictions to get the final one.
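A minimal Python sketch of the two options (the embedding sizes and the toy random-weight classifier are made up purely for illustration):
```python
# Toy embeddings for the two modalities (sizes are arbitrary assumptions).
import numpy as np

rng = np.random.default_rng(0)
x_img = rng.normal(size=128)   # e.g. image embedding from a CNN/ViT
x_txt = rng.normal(size=128)   # e.g. text embedding from BERT

def toy_classifier(z, n_classes=3, seed=1):
    """Stand-in linear classifier with softmax; weights are random, just for illustration."""
    w = np.random.default_rng(seed).normal(size=(n_classes, z.shape[0]))
    logits = w @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Early fusion: concatenate the two representations, then make one prediction.
p_early = toy_classifier(np.concatenate([x_img, x_txt]))

# Late fusion: one prediction per modality, then ensemble (here: average the probabilities).
p_late = (toy_classifier(x_img, seed=1) + toy_classifier(x_txt, seed=2)) / 2
```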
What is an additive fusion?
Simply adding the two representations together as a weighted sum. It is not the most expressive option: it behaves like an ensemble and there is no multiplicative interaction between the modalities:
z = w1 * xA + w2 * xB
What is multiplicative fusion? What are its types?
Instead of combining the representations linearly, we combine them multiplicatively. It can be either element-wise or an outer product (every element of the vector of modality A interacts with every element of the vector of modality B). The problem with the outer product is that the projection matrix W becomes very large:
z = W (xA ⊗ xB), where xA ⊗ xB is the outer product of the two vectors
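A small numpy sketch of the variants (toy vector sizes, random W only to show the shapes):
```python
# Sketch of multiplicative fusion variants; W is random and only illustrates the sizes involved.
import numpy as np

rng = np.random.default_rng(0)
x_a = rng.normal(size=64)          # representation of modality A
x_b = rng.normal(size=64)          # representation of modality B

# Additive fusion, for comparison: a weighted sum of the two vectors.
z_add = 0.5 * x_a + 0.5 * x_b

# Element-wise multiplicative fusion: interactions only between aligned dimensions.
z_elem = x_a * x_b

# Outer-product fusion: every element of x_a interacts with every element of x_b.
outer = np.outer(x_a, x_b)                      # 64 x 64 interaction matrix
W = rng.normal(size=(32, 64 * 64))              # projection back to a 32-dim fused vector
z_outer = W @ outer.reshape(-1)                 # note how large W already is for small inputs
```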
How can multiplicative fusion be optimized?
We can assume that the weight matrix W applied to the outer product has low rank (its rows can be written as linear combinations of a few others), and use this assumption to avoid computing the full outer product, which means we do not need to learn the full matrix W.
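A hedged sketch of the low-rank idea, assuming a rank-4 factorization and toy sizes; U and V are hypothetical factor tensors standing in for the full W:
```python
# Low-rank approximation of outer-product fusion (hypothetical rank and sizes).
import numpy as np

rng = np.random.default_rng(0)
x_a = rng.normal(size=64)
x_b = rng.normal(size=64)
d_out, rank = 32, 4

# Instead of a full (d_out x 64*64) matrix W, keep two small factor tensors.
U = rng.normal(size=(d_out, rank, 64))   # factors acting on modality A
V = rng.normal(size=(d_out, rank, 64))   # factors acting on modality B

# z_k = sum_r (u_{k,r} . x_a) * (v_{k,r} . x_b), i.e. x_a^T W_k x_b with W_k low-rank.
z = np.einsum('krd,d->kr', U, x_a) * np.einsum('krd,d->kr', V, x_b)
z = z.sum(axis=1)   # shape (d_out,), without ever materializing the 64*64 outer product
```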
What is a gated fusion?
It can be seen as a form of attention: only some elements of each representation are relevant, so they get weighted. This is done with gates, similar to an LSTM. One gate per modality is created (values between 0 and 1), and each gate is computed from all modalities. Each gate is multiplied with its own modality's representation, and the sum is the fused representation:
z = gA(xA, xB) * xA + gB(xA, xB) * xB
The gates can be soft or hard. A hard gate takes values in {0, 1}, a soft gate takes values in the range [0, 1]. Hard gating is more difficult to train because the derivative is harder to compute (the step function is not differentiable).
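A small sketch of gated fusion; the gating "networks" here are just random linear maps standing in for learned ones:
```python
# Gated (attention-like) fusion with soft gates computed from both modalities.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x_a = rng.normal(size=64)
x_b = rng.normal(size=64)
both = np.concatenate([x_a, x_b])

# Soft gates in [0, 1], each computed from BOTH modalities (random maps stand in for learned ones).
g_a = sigmoid(rng.normal(size=(64, 128)) @ both)
g_b = sigmoid(rng.normal(size=(64, 128)) @ both)

# Each gate re-weights its own modality; the sum is the fused representation.
z = g_a * x_a + g_b * x_b

# A hard gate would round the values to {0, 1}, but that step is not differentiable,
# which is why hard gating is harder to train.
```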
What to do if we don’t know anything about our modalities and we want to do fusion?
We use a mixture of fusions, like an ensemble of fusion representations. We compute unimodal, additive, multiplicative, etc. representations; each is weighted (gated), and the final representation is their average (or some other combination).
How are fusion representations learned in general?
We have two modalities, and the model learns some fused representation of them. To see how good it is, we can use a multimodal autoencoder: take the fused representation, try to reconstruct the two modalities from it, and use the reconstruction error as the loss.
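A minimal PyTorch sketch of such a multimodal autoencoder (the linear encoders/decoders and layer sizes are arbitrary assumptions, not a reference architecture):
```python
# Multimodal autoencoder: encode both modalities, fuse, then reconstruct each modality.
import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    def __init__(self, dim_a=64, dim_b=64, dim_z=32):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, dim_z)        # unimodal encoder for modality A
        self.enc_b = nn.Linear(dim_b, dim_z)        # unimodal encoder for modality B
        self.fuse = nn.Linear(2 * dim_z, dim_z)     # fusion of the two encodings
        self.dec_a = nn.Linear(dim_z, dim_a)        # decoder back to modality A
        self.dec_b = nn.Linear(dim_z, dim_b)        # decoder back to modality B

    def forward(self, x_a, x_b):
        z = self.fuse(torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=-1))
        return self.dec_a(z), self.dec_b(z)

x_a, x_b = torch.randn(8, 64), torch.randn(8, 64)
model = MultimodalAutoencoder()
rec_a, rec_b = model(x_a, x_b)
# Reconstruction loss: how well the fused representation preserves both modalities.
loss = nn.functional.mse_loss(rec_a, x_a) + nn.functional.mse_loss(rec_b, x_b)
```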
What are types of representation coordination?
There is the strong coordination and partial coordination.
Strong coordination tries to make the representations as close as possible (element by element), while partial coordination only enforces a weaker, partial relationship between them (e.g., maximizing correlation, as CCA does, rather than forcing the representations to be identical).
How does the representation coordination work?
We still have fA, an encoder for modality A, and fB, an encoder for modality B. Now we have some coordination function that evaluates the distance between the two representations (how similar they are), and this function can be used as a loss.
IMPORTANT: this needs paired data, i.e., knowing which sample of modality A corresponds to which sample of modality B.
What are some strong coordination functions?
Cosine similarity or kernel similarity functions (linear, polynomial, exponential…)
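A tiny sketch of a strong-coordination loss built from cosine similarity (batch size and dimensions are arbitrary):
```python
# Strong coordination: pull paired representations together via cosine similarity.
import torch
import torch.nn.functional as F

z_a = torch.randn(8, 32)   # encoded batch from modality A (paired row-by-row with z_b)
z_b = torch.randn(8, 32)   # encoded batch from modality B

# Cosine similarity per pair is in [-1, 1]; maximizing it == minimizing (1 - similarity).
coord_loss = (1 - F.cosine_similarity(z_a, z_b, dim=-1)).mean()
```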
What are some partial coordination functions?
CCA: Canonical Correlation Analysis
What is CCA?
Canonical Correlation Analysis: the idea is to find projections of the two representations that maximize the correlation between them. It is not element-wise like strong coordination; the representations are compared more holistically.
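A hedged usage sketch with scikit-learn's CCA on synthetic paired data (the data generation is just for illustration):
```python
# CCA finds projections of two views that are maximally correlated with each other.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_a = rng.normal(size=(200, 20))                          # modality A features (200 paired samples)
X_b = X_a[:, :10] + 0.1 * rng.normal(size=(200, 10))      # modality B, partly correlated with A

cca = CCA(n_components=4)
Z_a, Z_b = cca.fit_transform(X_a, X_b)

# Each pair of canonical components is chosen to maximize correlation across the two views.
for i in range(4):
    print(np.corrcoef(Z_a[:, i], Z_b[:, i])[0, 1])
```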
What is DCCAE?
Deep Canonically Correlated Autoencoders:
We have two objectives:
- CCA: maximize the correlation between two representations/projections
- Try to reconstruct the modalities from the obtained representation
Explain gated coordination.
Similarly to gated fusion, we need gates that are calculated using both modalities, and they serve as attention. The gates are vectors of values between 0 and 1.
What is Contrastive learning?
We have paired data with positive and negative pairs. Usually we label the true pairs as positive and naively treat all other combinations as negative.
The idea is to bring positive pairs closer together and push negative pairs further apart. The loss would be:
max(0, alpha - f(zA, zB+) + f(zA, zB-))
where alpha is a margin we set and f is some similarity function (e.g., cosine similarity).
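A small sketch of this margin loss (alpha and the choice of cosine similarity as f follow the card above; sizes are arbitrary):
```python
# Margin-based contrastive loss: max(0, alpha - sim(anchor, positive) + sim(anchor, negative)).
import torch
import torch.nn.functional as F

def contrastive_margin_loss(z_a, z_b_pos, z_b_neg, alpha=0.2):
    sim_pos = F.cosine_similarity(z_a, z_b_pos, dim=-1)
    sim_neg = F.cosine_similarity(z_a, z_b_neg, dim=-1)
    return torch.clamp(alpha - sim_pos + sim_neg, min=0).mean()

z_a, z_pos, z_neg = torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 32)
loss = contrastive_margin_loss(z_a, z_pos, z_neg)
```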
What is CLIP? Explain
Contrastive Language-Image Pre-training.
The idea is to train on large amounts of data. We have image-text pairs and compute a similarity matrix over a batch, where the diagonal entries are the positive pairs and all off-diagonal entries are negative pairs.
Then we can do zero-shot classification of an image: given a set of labels, we compute the similarity between the image and the text "A photo of a {label}" for each label.
CLIP uses a contrastive loss based on cross-entropy called InfoNCE.
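A hedged sketch of a symmetric InfoNCE-style loss of the kind used for CLIP-like training (temperature, dimensions, and batch pairing are illustrative; this is not the exact CLIP implementation):
```python
# Symmetric InfoNCE: cross-entropy over the image-text similarity matrix in both directions.
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # similarity matrix of all image-text pairs
    targets = torch.arange(logits.shape[0])          # diagonal entries are the positive pairs
    # Cross-entropy in both directions: image -> text and text -> image.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_style_loss(torch.randn(16, 512), torch.randn(16, 512))
```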
What is a discriminative approach to fission?
We have two modalities and three encoders: two of them each use a single modality, and the third uses both. We get three representations, make a prediction from them, and compute a loss.
A problem is that nothing enforces a proper factorization (the modality-specific and shared information are not cleanly separated).
What is a generative-discriminative approach to fission?
In addition to the discriminative approach, we try to decode the representations back into the raw modalities (a second, reconstruction loss), and we also make sure there is no overlap between the modalities' representations by using some priors.
What is entropy?
The information of all possible outcomes weighted by their probabilities (the expected information): H(X) = -sum_x p(x) * log p(x)
What is information?
The value of a communicated message.
High information: something that is surprising to see (low probability).
Low information: something that is common (high probability).
I(x) = log(1 / p(x)) = -log p(x)
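A tiny numeric example of information and entropy for a toy distribution (base-2 logs, so the units are bits):
```python
# Self-information per outcome and entropy as the probability-weighted average information.
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # probabilities of the possible outcomes

info = -np.log2(p)                 # self-information of each outcome, in bits
entropy = np.sum(p * info)         # expected information over the distribution
print(info)      # rare outcomes carry more information
print(entropy)   # 1.75 bits for this distribution
```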