It's Tokens all the Way Down

2025-09-10 · Source: The Computist Journal · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, medium

Summary

Generative AI models, initially developed separately for text, images, and audio, are converging on a single token-based architecture. The core component, still commonly called a "language model," predicts the next token in a sequence regardless of which modality the token came from. Historically, image generation moved from GANs, which popularized the idea of a latent space, to diffusion models, while audio and text processing each followed their own distinct approaches. A pivotal moment was OpenAI's 2021 CLIP model, which aligned text and image embeddings in a shared latent space and so enabled text-steered image generation. By the mid-2020s, the prevailing method was to tokenize every input (image patches, audio chunks, text) and feed the result into one unified model that predicts the next token, which effectively makes the distinction between "language" and "image" models obsolete.

Key takeaway

If you are an AI engineer designing multimodal systems, recognize that the underlying architecture is converging on a unified token-prediction model. Focus on effective tokenization strategies across modalities rather than on separate, bespoke systems per modality. Effort spent aligning different data types in a common latent space tends to yield more general and efficient models, and a single sequence-modeling approach can simplify both development and deployment.

Key insights

Generative AI is unifying across modalities through a token-based, sequence-prediction architecture.

Principles

Generative models learn data distributions to sample new examples.
Any sequence of distinct symbols can be modeled as a language.
Latent spaces provide compressed, continuous maps of data variations.
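
The second principle can be made concrete with a toy example. The sketch below fits a simple bigram (next-symbol) model to a sequence whose symbols are not words at all. The symbol names (`sky`, `cloud`, `grass`) are hypothetical stand-ins for discrete image-patch codes, and a count-based bigram is only an illustration of the idea, not the kind of model the article describes.

```python
from collections import Counter, defaultdict

def fit_bigram(sequence):
    """Count how often each symbol follows each other symbol."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[prev][nxt] += 1
    return counts

def next_symbol_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {sym: n / total for sym, n in counts[prev].items()}

# Hypothetical patch codes: the "language" here is made of image symbols.
patch_codes = ["sky", "sky", "cloud", "sky", "sky", "cloud", "sky", "grass"]
model = fit_bigram(patch_codes)
probs = next_symbol_probs(model, "sky")
```

Nothing in `fit_bigram` depends on what the symbols mean, which is the point of the principle: once data is a sequence of distinct symbols, the same prediction machinery applies.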

Method

Tokenize all modalities (text, image patches, audio chunks) into a single stream. Train a unified model to predict the next token in this combined sequence.
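
One way to picture the method is to give each modality its own range of IDs inside a shared vocabulary and then concatenate the segments. The following is a minimal sketch under stated assumptions: the vocabulary sizes, the offset scheme, and the helper names (`build_offsets`, `to_stream`, `next_token_pairs`) are all hypothetical and chosen for illustration. Real systems use learned tokenizers for each modality, which are not shown here.

```python
# Hypothetical per-modality vocabulary sizes (assumptions for illustration).
VOCAB_SIZES = {"text": 1000, "image": 512, "audio": 256}

def build_offsets(vocab_sizes):
    """Give each modality its own contiguous ID range in one shared vocabulary."""
    offsets, start = {}, 0
    for modality, size in vocab_sizes.items():
        offsets[modality] = start
        start += size
    return offsets, start  # `start` now equals the total vocabulary size

def to_stream(segments, offsets):
    """Flatten (modality, local_ids) segments into one sequence of global IDs."""
    stream = []
    for modality, local_ids in segments:
        stream.extend(offsets[modality] + i for i in local_ids)
    return stream

def next_token_pairs(stream):
    """Build next-token training examples as (context, target) pairs."""
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]

OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)
segments = [("text", [5, 42]), ("image", [0, 511]), ("audio", [7])]
stream = to_stream(segments, OFFSETS)
pairs = next_token_pairs(stream)
```

Because every ID in `stream` comes from the same vocabulary, a single model can be trained on `pairs` without knowing in advance which modality the next token belongs to.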

In practice

Utilize CLIP-like models for cross-modal embedding alignment.
Employ diffusion models for robust image generation from noise.
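
For the first point, the alignment objective behind CLIP-like models can be sketched as a symmetric contrastive loss: matching text and image embeddings are pulled together and mismatched ones are pushed apart. The code below is a simplified pure-Python illustration, not OpenAI's implementation. The function names are hypothetical, the embeddings are hand-written toy vectors, and the fixed `temperature=0.07` is an illustrative default.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of texts matches pair k of images."""
    text = [normalize(v) for v in text_embs]
    image = [normalize(v) for v in image_embs]
    n = len(text)
    # Similarity matrix: rows are texts, columns are images.
    sim = [[sum(a * b for a, b in zip(t, i)) / temperature for i in image]
           for t in text]
    text_to_image = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    image_to_text = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (text_to_image + image_to_text) / 2

# Toy embeddings: aligned pairs point the same way, swapped pairs do not.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is near zero when each text embedding is closest to its own image and large when the pairs are swapped, which is the signal that drives the two encoders toward a shared latent space.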

Topics

Generative AI
Multimodal AI
Tokenization
Large Language Models
Diffusion Models
CLIP Model

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Computist Journal.