AI Basics — Part 4: Multi-modal models and shared latent spaces

When text, images, and sound map to the same space, new abilities appear.

Dec 02, 2025

Once you understand how text and image models learn separately, the next step feels natural: what happens when we train a model that links multiple types of data inside the same latent space?

Multi-modal models learn from text, images, audio, and video together. During training, the system sees these data types side by side, such as an image with its caption or a clip with its soundtrack, and learns how they align. A caption sits near the image it describes. A sound sits near the words that match it. A video frame sits near the next frame in the sequence. Over time, these links settle into one shared internal space.
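One common way to push matching pairs together is a contrastive objective. The post does not name a specific training method, so treat this as an illustrative assumption rather than a description of any particular model. The sketch below uses made-up 2-D vectors and scores only one direction (image to caption); the function names and the temperature value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Average cross-entropy where image i's correct caption is text i.

    Assumption: a one-directional, simplified contrastive loss for illustration.
    """
    total = 0.0
    for i, img in enumerate(image_vecs):
        # Similarity of this image to every caption, sharpened by the temperature.
        logits = [cosine(img, txt) / temperature for txt in text_vecs]
        # Numerically stable log-sum-exp over all captions.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
        # Negative log-probability assigned to the matching caption.
        total += log_sum - logits[i]
    return total / len(image_vecs)

# Hypothetical embeddings: two images and their two captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]   # each caption near its own image
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]  # each caption near the wrong image

loss_aligned = contrastive_loss(images, aligned_texts)
loss_shuffled = contrastive_loss(images, shuffled_texts)
print(loss_aligned < loss_shuffled)
```

The loss is low when each caption already sits near its own image and high when the pairs are mismatched, so lowering it during training is what pulls matching items together in the shared space.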

You can picture the same 3D cloud from part 1, but now mixed with more than words.

A patch of pixels floats near the phrase that describes it.

A short sound sits near its transcript.

Motion sits near verbs that capture movement.
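To make the cloud picture concrete, here is a minimal sketch of a shared space as a plain dictionary of points. The coordinates, labels, and the `nearest` helper are all hypothetical and hand-picked for illustration. Real models learn vectors with hundreds or thousands of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical shared space: every item, whatever its modality, is a 3-D point.
space = {
    ("text", "dog"): [0.9, 0.1, 0.0],
    ("text", "car"): [0.1, 0.9, 0.0],
    ("image", "photo of a dog"): [0.8, 0.2, 0.1],
    ("image", "photo of a car"): [0.2, 0.8, 0.1],
    ("audio", "bark"): [0.7, 0.1, 0.3],
    ("audio", "engine noise"): [0.1, 0.7, 0.3],
}

def nearest(query_key, modality):
    """Return the item of the given modality closest to the query item."""
    query = space[query_key]
    candidates = [k for k in space if k[0] == modality and k != query_key]
    return max(candidates, key=lambda k: cosine(query, space[k]))

print(nearest(("text", "dog"), "image"))  # word -> closest image
print(nearest(("audio", "bark"), "text"))  # sound -> closest word
print(nearest(("text", "car"), "audio"))  # word -> closest sound
```

Because every modality lives in the same coordinate system, one distance function is enough to move from a word to an image or from a sound to a word.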

A useful comparison is how a child learns. A child sees a dog, hears it bark, hears someone say “dog,” and links shape, sound, and language together through experience. Multi-modal models form these links in a loosely similar way: from exposure to many examples, without anyone programming the associations by hand.

Once this shared space forms, new abilities emerge.
🌐 Text can guide image generation.
🌐 Images can guide text descriptions.
🌐 Audio can trigger visual interpretation.
🌐 Video can produce coherent narratives.

Nobody designs these abilities directly.

They come from the density of connections inside the shared space.

When different kinds of knowledge occupy the same internal world, richer skills emerge from the overlap.

cauri’s substack
