Multimodal Representation Flashcards
(28 cards)
What is multimodal representation?
It is learning a representation that reflects cross-modal interactions between individual elements across different modalities.
What are the types of multimodal representations?
Fusion: # modalities > # representations
Coordination: # modalities = # representations
Fission: # modalities < # representations
What is a basic fusion and what is a raw fusion?
Basic fusion is when modality A has the same kind of representation as modality B (e.g., both are vectors of the same size) and we merge them into one representation. It is the most common approach.
Raw fusion is when the two modalities do not share the same kind of representation (e.g., they are not yet encoded into vectors) and we still try to merge them into one representation. It is a bit more exotic.
How can a fusion be done with unimodal encoders?
Each modality is encoded into a separate representation (e.g., an embedding). For example, images are encoded with a CNN used as a feature extractor, or with a ViT (patches + attention), and text is encoded with BERT.
Then, the representations are fused together.
What is an early fusion and what is a late fusion?
Early fusion concatenates the two representations (vectors) of the two modalities and then makes a single prediction.
Late fusion makes two separate unimodal predictions and then ensembles them, e.g., by averaging the predictions to get the final one.
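A minimal Python sketch of the two options (the embedding sizes and the toy random-weight classifier are made up purely for illustration):
```python
# Toy embeddings for the two modalities (sizes are arbitrary assumptions).
import numpy as np

rng = np.random.default_rng(0)
x_img = rng.normal(size=128)   # e.g. image embedding from a CNN/ViT
x_txt = rng.normal(size=128)   # e.g. text embedding from BERT

def toy_classifier(z, n_classes=3, seed=1):
    """Stand-in linear classifier with softmax; weights are random, just for illustration."""
    w = np.random.default_rng(seed).normal(size=(n_classes, z.shape[0]))
    logits = w @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Early fusion: concatenate the two representations, then make one prediction.
p_early = toy_classifier(np.concatenate([x_img, x_txt]))

# Late fusion: one prediction per modality, then ensemble (here: average the probabilities).
p_late = (toy_classifier(x_img, seed=1) + toy_classifier(x_txt, seed=2)) / 2
```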
What is an additive fusion?
Simply adding the two representations together as a weighted sum. It is not the most expressive option: it behaves like an ensemble and there is no multiplicative interaction between the modalities:
z = w1 * xA + w2 * xB
What is multiplicative fusion? What are its types?
Instead of combining the representations linearly, we combine them multiplicatively. It can be either element-wise or an outer product (every element of the vector of modality A interacts with every element of the vector of modality B). The problem with the outer product is that the projection matrix W becomes very large:
z = W (xA ⊗ xB), where xA ⊗ xB is the outer product of the two vectors
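A small numpy sketch of the variants (toy vector sizes, random W only to show the shapes):
```python
# Sketch of multiplicative fusion variants; W is random and only illustrates the sizes involved.
import numpy as np

rng = np.random.default_rng(0)
x_a = rng.normal(size=64)          # representation of modality A
x_b = rng.normal(size=64)          # representation of modality B

# Additive fusion, for comparison: a weighted sum of the two vectors.
z_add = 0.5 * x_a + 0.5 * x_b

# Element-wise multiplicative fusion: interactions only between aligned dimensions.
z_elem = x_a * x_b

# Outer-product fusion: every element of x_a interacts with every element of x_b.
outer = np.outer(x_a, x_b)                      # 64 x 64 interaction matrix
W = rng.normal(size=(32, 64 * 64))              # projection back to a 32-dim fused vector
z_outer = W @ outer.reshape(-1)                 # note how large W already is for small inputs
```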
How can multiplicative fusion be optimized?
We can assume that the weight matrix W applied to the outer product has low rank (its rows can be written as linear combinations of a few others), and use this assumption to avoid computing the full outer product, which means we do not need to learn the full matrix W.
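A hedged sketch of the low-rank idea, assuming a rank-4 factorization and toy sizes; U and V are hypothetical factor tensors standing in for the full W:
```python
# Low-rank approximation of outer-product fusion (hypothetical rank and sizes).
import numpy as np

rng = np.random.default_rng(0)
x_a = rng.normal(size=64)
x_b = rng.normal(size=64)
d_out, rank = 32, 4

# Instead of a full (d_out x 64*64) matrix W, keep two small factor tensors.
U = rng.normal(size=(d_out, rank, 64))   # factors acting on modality A
V = rng.normal(size=(d_out, rank, 64))   # factors acting on modality B

# z_k = sum_r (u_{k,r} . x_a) * (v_{k,r} . x_b), i.e. x_a^T W_k x_b with W_k low-rank.
z = np.einsum('krd,d->kr', U, x_a) * np.einsum('krd,d->kr', V, x_b)
z = z.sum(axis=1)   # shape (d_out,), without ever materializing the 64*64 outer product
```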
What is a gated fusion?
It can be seen as a form of attention: only some elements of each representation are relevant, so they get weighted. This is done with gates, similar to an LSTM. One gate per modality is created (values between 0 and 1), and each gate is computed from all modalities. Each gate is multiplied with its own modality's representation, and the sum is the fused representation:
z = gA(xA, xB) * xA + gB(xA, xB) * xB
The gates can be soft or hard. A hard gate takes values in {0, 1}, a soft gate takes values in the range [0, 1]. Hard gating is more difficult to train because the derivative is harder to compute (the step function is not differentiable).
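A small sketch of gated fusion; the gating "networks" here are just random linear maps standing in for learned ones:
```python
# Gated (attention-like) fusion with soft gates computed from both modalities.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x_a = rng.normal(size=64)
x_b = rng.normal(size=64)
both = np.concatenate([x_a, x_b])

# Soft gates in [0, 1], each computed from BOTH modalities (random maps stand in for learned ones).
g_a = sigmoid(rng.normal(size=(64, 128)) @ both)
g_b = sigmoid(rng.normal(size=(64, 128)) @ both)

# Each gate re-weights its own modality; the sum is the fused representation.
z = g_a * x_a + g_b * x_b

# A hard gate would round the values to {0, 1}, but that step is not differentiable,
# which is why hard gating is harder to train.
```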
What to do if we don’t know anything about our modalities and we want to do fusion?
We use a mixture of fusions, like an ensemble of fusion representations. We compute unimodal, additive, multiplicative, etc. representations; each is weighted (gated), and the final representation is their average (or some other combination).
How are fusion representations learned in general?
We have two modalities, and the model learns some fused representation of them. To see how good it is, we can use a multimodal autoencoder: take the fused representation, try to reconstruct the two modalities from it, and use the reconstruction error as the loss.
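A minimal PyTorch sketch of such a multimodal autoencoder (the linear encoders/decoders and layer sizes are arbitrary assumptions, not a reference architecture):
```python
# Multimodal autoencoder: encode both modalities, fuse, then reconstruct each modality.
import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    def __init__(self, dim_a=64, dim_b=64, dim_z=32):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, dim_z)        # unimodal encoder for modality A
        self.enc_b = nn.Linear(dim_b, dim_z)        # unimodal encoder for modality B
        self.fuse = nn.Linear(2 * dim_z, dim_z)     # fusion of the two encodings
        self.dec_a = nn.Linear(dim_z, dim_a)        # decoder back to modality A
        self.dec_b = nn.Linear(dim_z, dim_b)        # decoder back to modality B

    def forward(self, x_a, x_b):
        z = self.fuse(torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=-1))
        return self.dec_a(z), self.dec_b(z)

x_a, x_b = torch.randn(8, 64), torch.randn(8, 64)
model = MultimodalAutoencoder()
rec_a, rec_b = model(x_a, x_b)
# Reconstruction loss: how well the fused representation preserves both modalities.
loss = nn.functional.mse_loss(rec_a, x_a) + nn.functional.mse_loss(rec_b, x_b)
```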
What are types of representation coordination?
There is the strong coordination and partial coordination.
Strong coordination tries to make the representations as close as possible (element by element), while partial coordination only enforces a weaker, partial relationship between them (e.g., maximizing correlation, as CCA does, rather than forcing the representations to be identical).
How does the representation coordination work?
We still have fA, an encoder for modality A, and fB, an encoder for modality B. Now we have some coordination function that evaluates the distance between the two representations (how similar they are), and this function can be used as a loss.
IMPORTANT: this needs paired data, i.e., knowing which sample of modality A corresponds to which sample of modality B.
What are some strong coordination functions?
Cosine similarity or kernel similarity functions (linear, polynomial, exponential…)
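A tiny sketch of a strong-coordination loss built from cosine similarity (batch size and dimensions are arbitrary):
```python
# Strong coordination: pull paired representations together via cosine similarity.
import torch
import torch.nn.functional as F

z_a = torch.randn(8, 32)   # encoded batch from modality A (paired row-by-row with z_b)
z_b = torch.randn(8, 32)   # encoded batch from modality B

# Cosine similarity per pair is in [-1, 1]; maximizing it == minimizing (1 - similarity).
coord_loss = (1 - F.cosine_similarity(z_a, z_b, dim=-1)).mean()
```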
What are some partial coordination functions?
CCA: Canonical Correlation Analysis
What is CCA?
Canonical Correlation Analysis: the idea is to find projections of the two representations that maximize the correlation between them. It is not element-wise like strong coordination; the representations are compared more holistically.
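A hedged usage sketch with scikit-learn's CCA on synthetic paired data (the data generation is just for illustration):
```python
# CCA finds projections of two views that are maximally correlated with each other.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_a = rng.normal(size=(200, 20))                          # modality A features (200 paired samples)
X_b = X_a[:, :10] + 0.1 * rng.normal(size=(200, 10))      # modality B, partly correlated with A

cca = CCA(n_components=4)
Z_a, Z_b = cca.fit_transform(X_a, X_b)

# Each pair of canonical components is chosen to maximize correlation across the two views.
for i in range(4):
    print(np.corrcoef(Z_a[:, i], Z_b[:, i])[0, 1])
```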
What is DCCAE?
Deep Canonically Correlated Autoencoders:
We have two objectives:
- CCA: maximize the correlation between two representations/projections
- Try to reconstruct the modalities from the obtained representation
Explain gated coordination.
Similarly to gated fusion, we need gates that are calculated using both modalities, and they serve as attention. The gates are vectors of values between 0 and 1.
What is Contrastive learning?
We have paired data with positive and negative pairs. Usually we label the true pairs as positive and naively treat all other combinations as negative.
The idea is to bring positive pairs closer together and push negative pairs further apart. The loss would be:
max(0, alpha - f(zA, zB+) + f(zA, zB-))
where alpha is a margin we set and f is some similarity function (e.g., cosine similarity).
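A small sketch of this margin loss (alpha and the choice of cosine similarity as f follow the card above; sizes are arbitrary):
```python
# Margin-based contrastive loss: max(0, alpha - sim(anchor, positive) + sim(anchor, negative)).
import torch
import torch.nn.functional as F

def contrastive_margin_loss(z_a, z_b_pos, z_b_neg, alpha=0.2):
    sim_pos = F.cosine_similarity(z_a, z_b_pos, dim=-1)
    sim_neg = F.cosine_similarity(z_a, z_b_neg, dim=-1)
    return torch.clamp(alpha - sim_pos + sim_neg, min=0).mean()

z_a, z_pos, z_neg = torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 32)
loss = contrastive_margin_loss(z_a, z_pos, z_neg)
```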
What is CLIP? Explain
Contrastive Language-Image Pre-training.
The idea is to train on large amounts of data. We have image-text pairs and compute a similarity matrix over a batch, where the diagonal entries are the positive pairs and all off-diagonal entries are negative pairs.
Then we can do zero-shot classification of an image: given a set of labels, we compute the similarity between the image and the text "A photo of a {label}" for each label.
CLIP uses a contrastive loss based on cross-entropy called InfoNCE.
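A hedged sketch of a symmetric InfoNCE-style loss of the kind used for CLIP-like training (temperature, dimensions, and batch pairing are illustrative; this is not the exact CLIP implementation):
```python
# Symmetric InfoNCE: cross-entropy over the image-text similarity matrix in both directions.
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # similarity matrix of all image-text pairs
    targets = torch.arange(logits.shape[0])          # diagonal entries are the positive pairs
    # Cross-entropy in both directions: image -> text and text -> image.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_style_loss(torch.randn(16, 512), torch.randn(16, 512))
```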
What is a discriminative approach to fission?
We have two modalities and three encoders: two of them each use a single modality, and the third uses both. We get three representations, make a prediction from them, and compute a loss.
A problem is that nothing enforces a proper factorization (the modality-specific and shared information are not cleanly separated).
What is a generative-discriminative approach to fission?
In addition to the discriminative approach, we try to decode the representations back into the raw modalities (a second, reconstruction loss), and we also make sure there is no overlap between the modalities' representations by using some priors.
What is entropy?
The information of all possible outcomes weighted by their probabilities (the expected information): H(X) = -sum_x p(x) * log p(x)
What is information?
The value of a communicated message.
High information: something that is surprising to see (low probability).
Low information: something that is common (high probability).
I(x) = log(1 / p(x)) = -log p(x)
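A tiny numeric example of information and entropy for a toy distribution (base-2 logs, so the units are bits):
```python
# Self-information per outcome and entropy as the probability-weighted average information.
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # probabilities of the possible outcomes

info = -np.log2(p)                 # self-information of each outcome, in bits
entropy = np.sum(p * info)         # expected information over the distribution
print(info)      # rare outcomes carry more information
print(entropy)   # 1.75 bits for this distribution
```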