Multimodal AI Models
💬 Overview
Traditionally, AI models have focused on a single modality. Language models like ChatGPT handle text, while others process images, audio, or video separately. For AI to become more natural and capable, it needs to integrate multiple types of input, much like humans do. A human assistant can listen, read, and see to make sense of context. This post gives a high‑level look at how multimodal AI systems work, key research milestones, and some real‑world applications.
💡 Example Applications
Creative Generation (Text ⇄ Image/Video/Audio)
- Prototype a video game using placeholder assets generated from descriptions.
- Extract key frames from a video clip and turn them into a stylized comic strip.
Multimodal Assistants (Chat + Vision + Voice)
- Point your camera at a foreign‑language menu and ask, “What are popular lunch choices in Austria?”
- Enhance AI tutoring capabilities by generating visual explanations of STEM concepts.
Cross‑Modal Search
- Show a product photo and say, “Same style but blue and under $100.”
- Say some lyrics to find the corresponding song.
How Models Handle Different Types of Input
Multimodal models must convert every input (text, images, audio, or video) into numeric form, but each modality follows its own encoding path. Let’s walk through how each type is formatted.
🤖 Encoding Each Modality
Text: Tokenization → Embedding
- Words or characters are split into tokens and passed through an embedding layer that maps each token to a numeric vector (think of a learned dictionary of vectors).
- These vectors capture semantic meaning; large language models manipulate them and eventually map the results back to natural‑language output.
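Here is a minimal PyTorch sketch of the tokenize‑and‑embed step. The toy vocabulary, sentence, and embedding size are made up for illustration; real language models use learned subword tokenizers and far larger embedding tables.

```python
import torch
import torch.nn as nn

# Toy vocabulary and sentence, purely illustrative.
vocab = {"<unk>": 0, "the": 1, "dog": 2, "runs": 3, "fast": 4}
tokens = "the dog runs fast".split()
token_ids = torch.tensor([[vocab.get(t, 0) for t in tokens]])   # shape: (1, 4)

# A learned embedding table maps each token ID to a vector.
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
token_vectors = embed(token_ids)                                # shape: (1, 4, 8)
print(token_vectors.shape)
```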
Audio: Waveform → Spectrogram → Embedding
- The 1‑D audio signal is converted to a spectrogram (a time‑frequency image of pitch and energy).
- A neural encoder turns each time slice into an audio embedding that preserves tone, inflection, and timbre.
- Text transcripts and audio embeddings complement one another: transcripts clarify speech masked by noise, while tone can resolve ambiguous text.
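A minimal sketch of the waveform → spectrogram → embedding path, assuming a one‑second mono clip at 16 kHz; the single linear layer is only a stand‑in for a real neural audio encoder.

```python
import torch
import torch.nn as nn
import torchaudio

waveform = torch.randn(1, 16000)                 # placeholder 1-second mono clip at 16 kHz
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
spectrogram = to_mel(waveform)                   # (1, 80 mel bins, time frames)

encoder = nn.Linear(80, 256)                     # illustrative per-frame projection
frames = spectrogram.squeeze(0).transpose(0, 1)  # (time frames, 80)
audio_embeddings = encoder(frames)               # (time frames, 256): one vector per time slice
print(audio_embeddings.shape)
```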
Images: Patch / Region Encoding
- A common approach is to divide images into patches (e.g., a grid of 16 × 16‑pixel patches) or segment them into regions such as foreground objects versus background.
- For each patch, an encoder extracts edges, shapes, and colors, producing a visual embedding.
- Collectively, these vectors summarize the scene (e.g., “dog, green grass, blue sky”).
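A minimal sketch of ViT‑style patch encoding, assuming 16 × 16‑pixel patches on a 224 × 224 image; the strided convolution produces one embedding per patch, and the dimensions are illustrative rather than tied to any specific model.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                   # (batch, channels, height, width)
patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)    # one projection per 16x16 patch
patches = patch_embed(image)                          # (1, 768, 14, 14)
patch_vectors = patches.flatten(2).transpose(1, 2)    # (1, 196, 768): 196 patch embeddings
print(patch_vectors.shape)
```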
Video: Frames + Time Modeling
- A video is a sequence of frames (often with an audio track). Key frames are sampled and passed through the image encoder.
- Temporal models, such as 3D CNNs or Transformers that attend over time, capture motion and continuity across frames.
- Audio from the clip is processed via the audio pipeline above. The result is a time‑aligned stream of visual and audio embeddings representing the entire clip.
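A minimal sketch of the frames‑plus‑time idea: sampled frames are encoded independently, then a small Transformer encoder attends across the time axis. The frame encoder here is a placeholder rather than a real visual backbone, and the clip's audio track is omitted for brevity.

```python
import torch
import torch.nn as nn

frames = torch.randn(1, 8, 3, 224, 224)                # (batch, 8 sampled frames, C, H, W)
b, t = frames.shape[:2]

frame_encoder = nn.Sequential(                          # placeholder for a real image encoder
    nn.Conv2d(3, 256, kernel_size=16, stride=16),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
per_frame = frame_encoder(frames.view(b * t, 3, 224, 224)).view(b, t, 256)   # (1, 8, 256)

temporal = nn.TransformerEncoder(                       # attends across the 8 frames
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
video_embeddings = temporal(per_frame)                  # (1, 8, 256): time-aware frame features
print(video_embeddings.shape)
```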
🧩 Putting It Together: Unifying Modalities
No matter the modality, the key idea is that the input becomes a set of vectors (lists of numbers). These vectors encode the information in the input, whether that is the semantic content of a sentence, the objects in an image, the frequencies in audio, or the events in a video. Once we have these embeddings, a multimodal model needs to integrate them so it can reason about them together.
🥪 Techniques for Combining Information Across Modalities
Unified Sequence (Token Fusion)
- Non‑text inputs are turned into “pseudo‑tokens.” An image encoder, for instance, outputs a sequence of patch embeddings that are prepended to the text token stream of a decoder‑only Transformer (see the sketch after this list).
- Requires minimal architectural changes. GPT‑style models can ingest the longer sequence unchanged (e.g., GPT‑4V, LLaVA).
- Best for: Captioning, simple image Q&A, rapid prototyping with an existing LLM.
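A minimal sketch of token fusion, assuming the patch embeddings come from a vision encoder and the LLM's hidden size is 4096; a learned linear projector maps the patches into the LLM's token space so they can simply be concatenated with the text embeddings.

```python
import torch
import torch.nn as nn

text_embeds = torch.randn(1, 12, 4096)                  # (batch, text tokens, LLM hidden size)
patch_embeds = torch.randn(1, 196, 768)                 # (batch, image patches, vision dim)

projector = nn.Linear(768, 4096)                        # maps vision space into token space
image_tokens = projector(patch_embeds)                  # (1, 196, 4096) "pseudo-tokens"

fused = torch.cat([image_tokens, text_embeds], dim=1)   # (1, 208, 4096)
# A decoder-only Transformer then processes `fused` exactly like a longer text prompt.
print(fused.shape)
```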
Cross‑Modal Attention / Two‑Stream Fusion
- Separate encoders process text and vision (or audio); cross‑attention layers let each stream query the other, or a dedicated fusion block merges them.
- Requires custom architecture and heavier compute when fusion occurs at multiple layers.
- Best for: Tasks needing tight coupling, like visual dialog or referring‑expression grounding.
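A minimal sketch of one cross‑attention step, with text features acting as queries over visual features; real two‑stream models interleave many such blocks with self‑attention, but the core operation looks like this.

```python
import torch
import torch.nn as nn

text_feats = torch.randn(1, 12, 512)                    # (batch, text tokens, dim)
visual_feats = torch.randn(1, 196, 512)                 # (batch, image patches, dim)

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
fused, attn_weights = cross_attn(query=text_feats,      # each text token queries the image
                                 key=visual_feats,
                                 value=visual_feats)
print(fused.shape)                                      # (1, 12, 512): text enriched with vision
```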
Dual Encoders with Late Fusion (Contrastive Alignment)
- Two independent encoders map each modality into a shared embedding space, trained with a contrastive loss so matching pairs land close together (e.g., CLIP).
- Fast similarity search, ideal for scanning millions of image-text pairs.
- Best for: Retrieval, ranking, recommendation.
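A minimal sketch of the CLIP‑style contrastive objective, assuming the two encoders have already produced normalized embeddings for a small batch of matching image-text pairs; the loss pulls each matching pair together along the diagonal of the similarity matrix and pushes mismatched pairs apart.

```python
import torch
import torch.nn.functional as F

# Pretend a batch of 4 images and their 4 matching captions were already encoded.
image_embeds = F.normalize(torch.randn(4, 512), dim=-1)
text_embeds = F.normalize(torch.randn(4, 512), dim=-1)

logits = image_embeds @ text_embeds.t() / 0.07          # similarity matrix with temperature
targets = torch.arange(4)                               # pair i belongs with caption i
loss = (F.cross_entropy(logits, targets)                # image -> text direction
        + F.cross_entropy(logits.t(), targets)) / 2     # text -> image direction
print(loss.item())
```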
🧩 Putting It Together: Technique Mix
Modern systems often mix these techniques. For example, a CLIP dual encoder creates vision embeddings that are then fed to an LLM as unified tokens or via a cross‑attention adapter.
Timeline
- 28 Jun 2011: Multimodal Deep Learning
- 06 Aug 2019: ViLBERT
- 24 Feb 2021: Zero‑Shot Text‑to‑Image Generation
- 26 Feb 2021: Learning Transferable Visual Models From Natural Language Supervision (CLIP)
- 20 Dec 2021: High‑Resolution Image Synthesis with Latent Diffusion Models
- 13 Apr 2022: Hierarchical Text‑Conditional Image Generation with CLIP Latents
- 29 Apr 2022: Flamingo: A Visual Language Model for Few‑Shot Learning
- 23 May 2022: Photorealistic Text‑to‑Image Diffusion Models with Deep Language Understanding
- 26 Jan 2023: MusicLM: Generating Music From Text
- 27 Feb 2023: Language Is Not All You Need: Aligning Perception with Language Models
- 06 Mar 2023: PaLM‑E: An Embodied Multimodal Language Model
- 15 Mar 2023: GPT‑4 Technical Report
- 17 Apr 2023: Visual Instruction Tuning (LLaVA)
- 09 May 2023: ImageBind: One Embedding Space to Bind Them All
- 11 Sep 2023: NExT‑GPT: Any‑to‑Any Multimodal LLM
📦 Closing Thoughts
What used to be a “cool, but sandboxed” multimodal demo is fast becoming an everyday utility. As models learn to weave together text, images, audio, and video, they become:
- More natural: Interacting the way humans do, across sight, sound, and language.
- More capable: Creating new media and grounding language in the physical world.
- More demanding: Requiring careful data curation, alignment, and increased compute.
Though this is only a very broad overview, I hope it helps convey the core concepts behind multimodal models and their growing significance.