Multimodality in AI: a little déjà-vu?

Imagine you’re in the kitchen, ready to whip up a delicious meal. To successfully prepare what’s in your mind, you’ll need to engage all your senses: sight to check the color and shape of the ingredients, smell to detect the enticing aroma of caramelizing onions (or burnt food, your choice), touch to test your dough’s texture, hearing to listen if the oil is sizzling (a sure sign your pan is hot), and, obviously, taste to adjust the seasoning.

You get the idea, right? If you limited yourself to just sight or smell, you might still manage to cook something, but let’s face it: you’d probably increase your chances of messing up. Combining all these channels (or “modalities”) provides you with a richer view of the situation, seriously boosting your odds of serving up a tasty dish—or at least something edible!

📖 Just as a chef simultaneously uses multiple senses, a multimodal AI model integrates different information sources—like text, images, sound, or even sensor signals. This "multimodality" helps it better understand the digital (or real) world around it, almost as if the algorithm itself had eyes, ears, and occasionally a virtual nose to become more insightful.

Why is this interesting? Because, just like you want to avoid ruining your dish, an AI model aims for maximum efficiency and accuracy. When considering several modalities (image + text, text + audio, etc.), the model accesses a broader array of clues for decision-making or predictions—exactly as your brain integrates sensory info to ensure your food doesn’t burn or end up like old rubber.

In this article, we’ll dive into the world of multimodality in deep learning. First, I’ll explain what’s behind this concept (spoiler: it’s not just piling up different data types). Then, we’ll see how some recent approaches manage to juggle all these information sources, and importantly, we’ll discuss the outcomes achievable through multimodality. As always, I’ll also provide some directions to dig deeper, so you’ll clearly understand why everyone’s talking about this now.

Just two ingredients?

Alignment and fusion. There you go, have a nice day 🙂

❓Okay, but what's actually behind those two words?

Let’s start with the first step, alignment.

Modality Alignment

The purpose of alignment is to establish semantic relationships across different modalities. As the name suggests, descriptors coming from the various modalities are mapped into a common space where they can be compared. An example? Aligning the subtitles, audio, and video of a movie. If you mess up these modalities, you’ll completely lose track of the plot.

How does this work practically?

In some projects, we directly check if an image matches a text or if a video matches its subtitles—this is known as explicit alignment. We compare modality A (for example, image) to modality B (text), often using similarity matrices. Imagine a large table clearly indicating “Image 1 matches well with Text 3, Image 2 doesn’t match Text 4”—the relationships become very clear.
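To make this concrete, here is a minimal sketch of such a similarity matrix, assuming the images and texts have already been embedded into the same vector space (the toy numbers below are made up purely for illustration):

```python
import numpy as np

def similarity_matrix(image_embeddings: np.ndarray, text_embeddings: np.ndarray) -> np.ndarray:
    """Cosine-similarity matrix: entry (i, j) scores how well image i matches text j."""
    # L2-normalize each embedding so the dot product becomes cosine similarity.
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return img @ txt.T

# Toy example: 2 images and 3 captions embedded in the same 4-d space (made-up numbers).
images = np.array([[0.9, 0.1, 0.0, 0.2],
                   [0.0, 0.8, 0.3, 0.1]])
texts = np.array([[0.1, 0.9, 0.2, 0.0],
                  [0.0, 0.1, 0.9, 0.3],
                  [0.8, 0.2, 0.1, 0.1]])
S = similarity_matrix(images, texts)
print(S)                 # the "large table" from the paragraph above
print(S.argmax(axis=1))  # best-matching caption for each image
```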

However, if you don’t explicitly assess this proximity but the network learns it along the way to better solve its task (e.g., caption prediction, translation), we call this implicit alignment. Here, alignment is a by-product of training: the model learns to associate text and images automatically, without explicitly measuring similarity. Although less visually obvious, implicit alignment can be equally or more effective when solving a specific task like generating image descriptions.

📖 In brief: With explicit alignment, you clearly measure modality correspondence using similarity formulas. With implicit alignment, it's a side-effect of another learning process (like translation), so the model learns to align modalities independently—though it's less straightforward to visualize.

Approaches to Alignment

Explicit Alignment

Explicit alignment techniques historically rely on statistical tools. Two iconic techniques are:

  1. Dynamic Time Warping (DTW)

DTW measures the similarity between two sequences (e.g., audio signals or image sequences) by temporally “warping” them. It finds an optimal warping path that stretches or compresses parts of one sequence so that each of its points is matched to a point of the other, even if the durations differ (Kruskal, 1983).
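As a rough illustration, here is a compact DTW in plain Python/NumPy (the quadratic dynamic-programming version; real toolkits add constraints and speed-ups):

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic-programming DTW between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i, j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i, j] = d + min(cost[i - 1, j],      # stretch sequence a
                                 cost[i, j - 1],      # stretch sequence b
                                 cost[i - 1, j - 1])  # match both points
    return cost[n, m]

# Two signals with the same shape but different timing still align closely.
slow = np.sin(np.linspace(0, 2 * np.pi, 80))
fast = np.sin(np.linspace(0, 2 * np.pi, 50))
print(dtw_distance(slow, fast))
```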

  2. Canonical Correlation Analysis (CCA)

Proposed by Hotelling (1936), CCA projects two datasets into a shared latent space to maximize correlation between them. Variants like Kernel CCA (KCCA) or Deep CCA extend this method to non-linear relationships, handling complex data through deep neural networks (Andrew et al., 2013).
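For a quick feel of CCA in practice, here is a small sketch using scikit-learn on synthetic “audio” and “video” views that share a hidden latent signal (the names and dimensions are invented for the example):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# Two "views" of the same 200 samples, e.g. audio features (5-d) and visual features (4-d),
# built so that they share a common latent signal (synthetic data for illustration).
latent = rng.normal(size=(200, 2))
audio = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))
video = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))

cca = CCA(n_components=2)
audio_c, video_c = cca.fit_transform(audio, video)

# Correlation of the first pair of canonical variates should be close to 1.
print(np.corrcoef(audio_c[:, 0], video_c[:, 0])[0, 1])
```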

While effective for direct alignment (e.g., image-text matches), explicit methods may struggle with complex or ambiguous inter-modal relationships.

Implicit Alignment

Implicit alignment methods don’t explicitly measure modality similarity; instead, they integrate alignment within the task itself (classification, translation, etc.). Two main families include:

  • Graph-based methods: These represent modality interactions as a graph and achieve implicit alignment through local node embeddings, typically learned with graph neural networks (GNNs).
  • Neural network-based methods: Especially attention mechanisms, GANs, and autoencoders. These methods automatically align distributions across modalities—see the cross-attention sketch right after this list.
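To illustrate the attention flavour, here is a minimal cross-attention sketch in PyTorch in which text tokens attend to image regions; the dimensions and module name are arbitrary choices for the example, not a specific published architecture:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text tokens attend to image regions; alignment emerges in the attention weights."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, image_regions):
        # Queries come from the text, keys/values from the image: each word "looks at"
        # the image regions that help the downstream task (captioning, VQA, ...).
        fused, weights = self.attn(text_tokens, image_regions, image_regions)
        return fused, weights  # weights: (batch, n_words, n_regions) soft alignment map

# Toy batch: 2 samples, 7 word embeddings and 10 region embeddings, all 64-d.
text = torch.randn(2, 7, 64)
image = torch.randn(2, 10, 64)
fused, weights = CrossModalAttention()(text, image)
print(fused.shape, weights.shape)  # torch.Size([2, 7, 64]) torch.Size([2, 7, 10])
```

Here, nobody ever computes a similarity matrix explicitly: the alignment map is simply a by-product of training the model on its end task.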

Modality Fusion

When working with multiple data types, alignment ensures they “point” to similar concepts, while fusion combines them intelligently into a unified model. The goal: leverage each modality’s strengths while mitigating weaknesses.

Encoder-decoder Approaches

  • Data-level fusion: Combine the raw inputs at the very start (early fusion).
  • Feature-level fusion: Process each modality separately, then fuse the extracted features (see the sketch after this list).
  • Model-level fusion: Combine the predictions of modality-specific sub-models (late fusion).
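Here is what feature-level fusion can look like in practice, as a minimal PyTorch sketch (the encoders, dimensions, and class names are invented for illustration):

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Each modality is encoded separately, then the features are concatenated and fused."""
    def __init__(self, img_dim=512, txt_dim=300, hidden=128, n_classes=10):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)  # fusion happens at this concatenation

    def forward(self, img_feat, txt_feat):
        z = torch.cat([self.img_enc(img_feat), self.txt_enc(txt_feat)], dim=-1)
        return self.head(z)

logits = FeatureLevelFusion()(torch.randn(4, 512), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 10])
```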

Kernel-based Fusion

Using the kernel trick, data are projected into higher-dimensional spaces, making it possible to capture non-linear relationships between modalities.
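A simple way to picture this: build one kernel per modality, combine them, and feed the fused kernel to a standard kernel method. Here is a hedged sketch with scikit-learn, using synthetic data and arbitrary equal weights:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic two-modality dataset: 100 samples, audio (8-d) and visual (6-d) features.
audio = rng.normal(size=(100, 8))
video = rng.normal(size=(100, 6))
labels = (audio[:, 0] + video[:, 0] > 0).astype(int)  # label depends on both modalities

# One kernel per modality, combined by a simple (here equally weighted) sum.
K = 0.5 * rbf_kernel(audio) + 0.5 * rbf_kernel(video)

clf = SVC(kernel="precomputed").fit(K, labels)
print(clf.score(K, labels))  # training accuracy with the fused kernel
```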

Graph-based Fusion

Representing modalities as graph nodes allows efficient fusion through algorithms like Graph Neural Networks (GNNs), useful for incomplete data scenarios.
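As a toy illustration (not a full GNN library), here is a single mean-aggregation message-passing step over a small graph whose nodes mix image regions and words; the adjacency is invented for the example:

```python
import torch

def gnn_fusion_step(node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """One round of mean-aggregation message passing over a modality graph."""
    # Add self-loops so a node keeps its own information, then row-normalize.
    adj = adj + torch.eye(adj.size(0))
    adj = adj / adj.sum(dim=1, keepdim=True)
    return adj @ node_feats  # each node becomes a mixture of itself and its neighbours

# 4 nodes: two image-region embeddings and two word embeddings in a shared 8-d space,
# connected whenever a region and a word refer to the same concept (toy adjacency).
feats = torch.randn(4, 8)
adj = torch.tensor([[0., 0., 1., 0.],
                    [0., 0., 0., 1.],
                    [1., 0., 0., 0.],
                    [0., 1., 0., 0.]])
print(gnn_fusion_step(feats, adj).shape)  # torch.Size([4, 8])
```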

Attention-based Fusion

Attention mechanisms, central to Transformers, help models focus on relevant modality components, enabling advanced interactions (e.g., image-text question-answering).
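One simple flavour of attention-based fusion: learn a relevance score for each modality and take a weighted sum of the modality embeddings. A minimal sketch, with dimensions and names chosen purely for illustration:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Learns how much to trust each modality for a given sample."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one relevance score per modality embedding

    def forward(self, modality_embeddings: torch.Tensor) -> torch.Tensor:
        # modality_embeddings: (batch, n_modalities, dim)
        weights = torch.softmax(self.score(modality_embeddings), dim=1)  # (batch, n_mod, 1)
        return (weights * modality_embeddings).sum(dim=1)                # (batch, dim)

# Toy batch: text, image and audio embeddings for 4 samples, all projected to 128-d.
emb = torch.randn(4, 3, 128)
print(AttentionFusion()(emb).shape)  # torch.Size([4, 128])
```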

Why are alignment and fusion effective?

Aligning and fusing modalities brings coherence, robustness, and adaptability, and improves generalization by capturing deeper inter-modal connections. Applications range from vision-language tasks to emotion recognition, which shows just how central these techniques have become.

Conclusion

In short, alignment ensures modalities coherently “communicate,” while fusion ensures their combined output exceeds individual contributions. Together, they enhance AI relevance and robustness.

References

  • Multimodal Alignment and Fusion: A Survey – S. Li, H. Tang
  • An overview of sequence comparison – J. B. Kruskal (1983)
  • Relations between two sets of variates – H. Hotelling (1936)
  • Deep canonical correlation analysis – G. Andrew et al. (2013)
  • Graph-based multimodal embedding – S. Tang et al.
  • Attention is all you need – A. Vaswani et al. (2017)