Multimodal AI Models
💬 Overview
Traditionally, AI models have focused on a single modality. Language models like ChatGPT handle text, while others process images, audio, or video separately. For AI to become more natural and capable, it needs to integrate multiple types of input, much like humans do. A human assistant can listen, read, and see to make sense of context. This post gives a high‑level look at how multimodal AI systems work, key research milestones, and some real‑world applications.
💡 Example Applications
Creative Generation (Text ⇄ Image/Video/Audio)
- Prototype a video game using placeholder assets generated from descriptions.
- Extract key frames from a video clip and turn them into a stylized comic strip.
Multimodal Assistants (Chat + Vision + Voice)
- Point your camera at a foreign‑language menu and ask, “What are popular lunch choices in Austria?”
- Enhance AI tutoring capabilities by generating visual explanations of STEM concepts.
Cross‑Modal Search
- Show a product photo and say, “Same style but blue and under $100.”
- Say some lyrics to find the corresponding song.
How Models Handle Different Types of Input
Multimodal models must convert every input (text, images, audio, or video) into numeric form, but each modality follows its own encoding path. Let’s walk through how each type is formatted.
🤖 Encoding Each Modality
Text: Tokenization → Embedding
- Words or characters are split into tokens and passed through an embedding layer that maps each token to a numeric vector (think of a learned dictionary of vectors).
- These vectors capture semantic meaning; large language models manipulate them and eventually map the results back to natural‑language output.
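Here is a minimal PyTorch sketch of the tokenize‑and‑embed step. The toy vocabulary, sentence, and embedding size are made up for illustration; real language models use learned subword tokenizers and far larger embedding tables.

```python
import torch
import torch.nn as nn

# Toy vocabulary and sentence, purely illustrative.
vocab = {"<unk>": 0, "the": 1, "dog": 2, "runs": 3, "fast": 4}
tokens = "the dog runs fast".split()
token_ids = torch.tensor([[vocab.get(t, 0) for t in tokens]])   # shape: (1, 4)

# A learned embedding table maps each token ID to a vector.
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
token_vectors = embed(token_ids)                                # shape: (1, 4, 8)
print(token_vectors.shape)
```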
Audio: Waveform → Spectrogram → Embedding
- The 1‑D audio signal is converted to a spectrogram (a time‑frequency image of pitch and energy).
- A neural encoder turns each time slice into an audio embedding that preserves tone, inflection, and timbre.
- Text transcripts and audio embeddings complement one another: transcripts clarify speech masked by noise, while tone can resolve ambiguous text.
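A minimal sketch of the waveform → spectrogram → embedding path, assuming a one‑second mono clip at 16 kHz; the single linear layer is only a stand‑in for a real neural audio encoder.

```python
import torch
import torch.nn as nn
import torchaudio

waveform = torch.randn(1, 16000)                 # placeholder 1-second mono clip at 16 kHz
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
spectrogram = to_mel(waveform)                   # (1, 80 mel bins, time frames)

encoder = nn.Linear(80, 256)                     # illustrative per-frame projection
frames = spectrogram.squeeze(0).transpose(0, 1)  # (time frames, 80)
audio_embeddings = encoder(frames)               # (time frames, 256): one vector per time slice
print(audio_embeddings.shape)
```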
Images: Patch / Region Encoding
- A common approach is to divide images into patches (e.g., a grid of 16 × 16‑pixel patches) or segment them into regions such as foreground objects versus background.
- For each patch, an encoder extracts edges, shapes, and colors, producing a visual embedding.
- Collectively, these vectors summarize the scene (e.g., “dog, green grass, blue sky”).
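A minimal sketch of ViT‑style patch encoding, assuming 16 × 16‑pixel patches on a 224 × 224 image; the strided convolution produces one embedding per patch, and the dimensions are illustrative rather than tied to any specific model.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                   # (batch, channels, height, width)
patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)    # one projection per 16x16 patch
patches = patch_embed(image)                          # (1, 768, 14, 14)
patch_vectors = patches.flatten(2).transpose(1, 2)    # (1, 196, 768): 196 patch embeddings
print(patch_vectors.shape)
```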
Video: Frames + Time Modeling
- A video is a sequence of frames (often with an audio track). Key frames are sampled and passed through the image encoder.
- Temporal models, such as 3D CNNs or Transformers that attend over time, capture motion and continuity across frames.
- Audio from the clip is processed via the audio pipeline above. The result is a time‑aligned stream of visual and audio embeddings representing the entire clip.
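A minimal sketch of the frames‑plus‑time idea: sampled frames are encoded independently, then a small Transformer encoder attends across the time axis. The frame encoder here is a placeholder rather than a real visual backbone, and the clip's audio track is omitted for brevity.

```python
import torch
import torch.nn as nn

frames = torch.randn(1, 8, 3, 224, 224)                # (batch, 8 sampled frames, C, H, W)
b, t = frames.shape[:2]

frame_encoder = nn.Sequential(                          # placeholder for a real image encoder
    nn.Conv2d(3, 256, kernel_size=16, stride=16),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
per_frame = frame_encoder(frames.view(b * t, 3, 224, 224)).view(b, t, 256)   # (1, 8, 256)

temporal = nn.TransformerEncoder(                       # attends across the 8 frames
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
video_embeddings = temporal(per_frame)                  # (1, 8, 256): time-aware frame features
print(video_embeddings.shape)
```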
🧩 Putting It Together: Unifying Modalities
No matter the modality, the key idea is that the input becomes a set of vectors (lists of numbers). These vectors encode the information in the input, whether that is the semantic content of a sentence, the objects in an image, the frequencies in audio, or the events in a video. Once we have these embeddings, a multimodal model needs to integrate them so it can reason about them together.
🥪 Techniques for Combining Information Across Modalities
Unified Sequence (Token Fusion)
- Non‑text inputs are turned into “pseudo‑tokens.” An image encoder, for instance, outputs a sequence of patch embeddings that are prepended to the text token stream of a decoder‑only Transformer (see the sketch after this list).
- Requires minimal architectural changes. GPT‑style models can ingest the longer sequence unchanged (e.g., GPT‑4V, LLaVA).
- Best for: Captioning, simple image Q&A, rapid prototyping with an existing LLM.
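A minimal sketch of token fusion, assuming the patch embeddings come from a vision encoder and the LLM's hidden size is 4096; a learned linear projector maps the patches into the LLM's token space so they can simply be concatenated with the text embeddings.

```python
import torch
import torch.nn as nn

text_embeds = torch.randn(1, 12, 4096)                  # (batch, text tokens, LLM hidden size)
patch_embeds = torch.randn(1, 196, 768)                 # (batch, image patches, vision dim)

projector = nn.Linear(768, 4096)                        # maps vision space into token space
image_tokens = projector(patch_embeds)                  # (1, 196, 4096) "pseudo-tokens"

fused = torch.cat([image_tokens, text_embeds], dim=1)   # (1, 208, 4096)
# A decoder-only Transformer then processes `fused` exactly like a longer text prompt.
print(fused.shape)
```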
Cross‑Modal Attention / Two‑Stream Fusion
- Separate encoders process text and vision (or audio); cross‑attention layers let each stream query the other, or a dedicated fusion block merges them.
- Requires custom architecture and heavier compute when fusion occurs at multiple layers.
- Best for: Tasks needing tight coupling, like visual dialog or referring‑expression grounding.
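A minimal sketch of one cross‑attention step, with text features acting as queries over visual features; real two‑stream models interleave many such blocks with self‑attention, but the core operation looks like this.

```python
import torch
import torch.nn as nn

text_feats = torch.randn(1, 12, 512)                    # (batch, text tokens, dim)
visual_feats = torch.randn(1, 196, 512)                 # (batch, image patches, dim)

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
fused, attn_weights = cross_attn(query=text_feats,      # each text token queries the image
                                 key=visual_feats,
                                 value=visual_feats)
print(fused.shape)                                      # (1, 12, 512): text enriched with vision
```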
Dual Encoders with Late Fusion (Contrastive Alignment)
- Two independent encoders map each modality into a shared embedding space, trained with a contrastive loss so matching pairs land close together (e.g., CLIP).
- Fast similarity search, ideal for scanning millions of image-text pairs.
- Best for: Retrieval, ranking, recommendation.
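A minimal sketch of the CLIP‑style contrastive objective, assuming the two encoders have already produced normalized embeddings for a small batch of matching image-text pairs; the loss pulls each matching pair together along the diagonal of the similarity matrix and pushes mismatched pairs apart.

```python
import torch
import torch.nn.functional as F

# Pretend a batch of 4 images and their 4 matching captions were already encoded.
image_embeds = F.normalize(torch.randn(4, 512), dim=-1)
text_embeds = F.normalize(torch.randn(4, 512), dim=-1)

logits = image_embeds @ text_embeds.t() / 0.07          # similarity matrix with temperature
targets = torch.arange(4)                               # pair i belongs with caption i
loss = (F.cross_entropy(logits, targets)                # image -> text direction
        + F.cross_entropy(logits.t(), targets)) / 2     # text -> image direction
print(loss.item())
```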
🧩 Putting It Together: Technique Mix
Modern systems often mix these techniques. For example, a CLIP dual encoder creates vision embeddings that are then fed to an LLM as unified tokens or via a cross‑attention adapter.
Timeline
- 28 Jun 2011: Multimodal Deep Learning
- 06 Aug 2019: ViLBERT
- 24 Feb 2021: Zero‑Shot Text‑to‑Image Generation
- 26 Feb 2021: Learning Transferable Visual Models From Natural Language Supervision (CLIP)
- 20 Dec 2021: High‑Resolution Image Synthesis with Latent Diffusion Models
- 13 Apr 2022: Hierarchical Text‑Conditional Image Generation with CLIP Latents
- 29 Apr 2022: Flamingo: A Visual Language Model for Few‑Shot Learning
- 23 May 2022: Photorealistic Text‑to‑Image Diffusion Models with Deep Language Understanding
- 26 Jan 2023: MusicLM: Generating Music From Text
- 27 Feb 2023: Language Is Not All You Need: Aligning Perception with Language Models
- 06 Mar 2023: PaLM‑E: An Embodied Multimodal Language Model
- 15 Mar 2023: GPT‑4 Technical Report
- 17 Apr 2023: Visual Instruction Tuning (LLaVA)
- 09 May 2023: ImageBind: One Embedding Space to Bind Them All
- 11 Sep 2023: NExT‑GPT: Any‑to‑Any Multimodal LLM
📦 Closing Thoughts
What used to be a “cool, but sandboxed” multimodal demo is fast becoming an everyday utility. As models learn to weave together text, images, audio, and video, they become:
- More natural: Interacting the way humans do, across sight, sound, and language.
- More capable: Creating new media and grounding language in the physical world.
- More demanding: Requiring careful data curation, alignment, and increased compute.
Though this is only a very broad overview, I hope it helps convey the core concepts behind multimodal models and their growing significance.