Introduction Flashcards
Lecture 1 (11 cards)
What is a modality?
It is a way in which something is expressed or perceived (sensory modalities: vision, touch…). The closer a modality is to the sensor, the more detailed and raw it is (image pixels, speech signals); the further it is from the sensor and the more processed it is, the more abstract it becomes (object categories, sentiment, etc.).
What is multimodal?
It is the science of heterogeneous and interconnected data. We want the modalities to be connected to each other, but also different (heterogeneous), so that they give the model different signals.
What is the difference between homogeneous and heterogeneous modalities?
Homogeneous modalities have similar properties (2 images from the same camera). As modalities become more different, they become more heterogeneous (text in 2 different languages, then language and vision…).
What are the different ways 2 modalities can be related?
Association: for example, correlation. Whenever someone says ‘boom’, you see an explosion.
Dependency: for example, one modality causes another.
Correspondence: grounding; a picture of a laptop and the word ‘laptop’ correspond to each other.
Relationship: one modality is used for another.
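The association case can be made concrete with a small sketch. It measures how strongly two modality signals co-occur using a Pearson correlation. The per-clip indicator lists and the `pearson` helper are hypothetical, made up for illustration only; they are not data or code from the lecture.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-clip indicators (made-up data):
# 1 if the event occurs in the clip, 0 otherwise.
boom_heard = [1, 0, 1, 1, 0, 0, 1, 0]      # audio: someone says 'boom'
explosion_seen = [1, 0, 1, 1, 0, 0, 1, 0]  # vision: an explosion is visible
unrelated = [1, 1, 0, 1, 0, 1, 0, 0]       # vision: some unrelated event

# Strongly associated modalities correlate; unrelated ones do not.
association = pearson(boom_heard, explosion_seen)
baseline = pearson(boom_heard, unrelated)
print(association, baseline)
```

Correlation only captures association. It says nothing about dependency (which modality causes which), so a high score here should not be read as causation.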
How can modality elements interact during inference?
What is the representation challenge of multimodal AI?
How do you represent the multimodal interactions between individual elements across different modalities? There are 3 ways to do this:
- Fusion: combining elements of different modalities into a smaller number of elements (think of combining 2 vectors into 1)
- Coordination: one element influences another and vice versa
- Fission: producing more vectors/elements than you started with, trying to find hidden structure/information
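The three options above can be sketched on toy vectors. This is only an illustration under simplifying assumptions: fusion is shown as plain concatenation, coordination as a cosine similarity score between representations that stay separate, and fission as a split into one shared vector plus two residuals. The function names and the vectors are hypothetical; real systems learn these mappings rather than hard-coding them.

```python
from math import sqrt

def fuse(a, b):
    """Fusion: 2 vectors in, 1 vector out (here: simple concatenation)."""
    return a + b

def cosine(a, b):
    """Coordination: vectors stay separate; this score relates them."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fission(a, b):
    """Fission: 2 vectors in, 3 out (a shared part plus two residuals)."""
    shared = [(x + y) / 2 for x, y in zip(a, b)]
    only_a = [x - s for x, s in zip(a, shared)]
    only_b = [y - s for y, s in zip(b, shared)]
    return shared, only_a, only_b

# Hypothetical embeddings of an image and its caption.
image_vec = [1.0, 0.0, 2.0]
text_vec = [1.0, 1.0, 2.0]

fused = fuse(image_vec, text_vec)              # one 6-dimensional vector
similarity = cosine(image_vec, text_vec)       # score between two vectors
shared, only_image, only_text = fission(image_vec, text_vec)
print(len(fused), round(similarity, 3), shared)
```

The count of output elements is the point of the comparison: fusion yields fewer representations than inputs, coordination keeps the same number, and fission yields more.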
What is the alignment challenge of multimodal AI?
Identifying and modeling cross-modal connections between all elements of multiple modalities, building from the data structure.
Modalities have different internal structures (pixels/sound/tokens), and we need to work out which elements interact with which (align them). There are 3 ways of doing this:
- Discrete: grounding; the word ‘laptop’ refers to this bounding box in the image
- Continuous: a cow makes a sound, and we should determine that this sound refers to the part of the image where the cow is
- Contextualized: modality elements influence (contextualize) each other. Think transformers and attention
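As a rough sketch of how attention can relate a word to image regions, the snippet below computes scaled dot-product attention weights of one text query over a few region vectors, then takes the highest-weighted region as a discrete alignment. The 2-dimensional embeddings, the region names, and the `attend` helper are all hypothetical; in a transformer the queries and keys come from learned projections.

```python
from math import exp

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Scaled dot-product attention weights of one query over the keys."""
    scale = len(query) ** 0.5
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical embeddings: one word and three image regions.
word_laptop = [1.0, 0.0]
regions = {
    "laptop_box": [0.9, 0.1],
    "mug_box": [0.1, 0.9],
    "background": [0.0, 0.0],
}

names = list(regions)
weights = attend(word_laptop, [regions[n] for n in names])

# Soft (contextualized) alignment: the full weight distribution.
# Hard (discrete) alignment: pick the region with the largest weight.
best_region = names[weights.index(max(weights))]
print(dict(zip(names, weights)), best_region)
```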
What is the reasoning challenge of multimodal AI?
Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.
We want to explain to a human why the model arrived at its result, also incorporating some external knowledge. This reasoning can take different forms.
What is the generation challenge of multimodal AI?
Learning a generative process to produce raw modalities that reflect cross-modal interactions, structure and coherence.
- Summarization: reduce the size (summarize this text, producing fewer tokens than you are given)
- Translation: convert from one modality to another
- Creation: given some modalities, create a different one, or a combination of several modalities
What is the transference challenge of multimodal AI?
Transfer knowledge between modalities, usually to help the target modality, which may be noisy or have limited resources.
We have modality A, which might not be useful on its own: there may not be enough of it, or it may be noisy. During training we add modality B, and in the end we get an enriched modality A (knowledge transfer).
What is the transference challenge of multimodal AI?
Transfer knowledge between modalities, usually to help the target modality, which may be noisy or have limited resources.