Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI

· Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

Together AI's research project, "Road to 5 Million Sequence Length," addresses the memory barriers that arise when training large language models on very long contexts. The project scaled a transformer-based model (a Llama 3B architecture) to a 5-million-token context on a single node of 8 H100 GPUs. The two main bottlenecks are attention compute, which grows quadratically with sequence length, and activation memory, which grows linearly. The team stacked existing techniques: Fully Sharded Data Parallel (FSDP), DeepSpeed Ulysses context parallelism, activation checkpointing, and offloading activations to CPU. They also drew on Arctic sequence length training and introduced a new optimization called "Untitled Ulysses," which refines context parallelism by computing attention heads in chunks so that buffers can be reused, reducing activation memory without a significant throughput cost.
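The two scaling regimes named in the summary can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the hidden size and the bf16 element width are assumed values, not figures from the talk, and the helper names are hypothetical. It counts only the two attention matmuls (roughly 4·s²·h forward FLOPs per layer, ignoring causal masking) and the size of a single hidden-state tensor.

```python
# Back-of-the-envelope scaling sketch. HIDDEN and BYTES_PER_ELEMENT are
# assumed, illustrative values, not numbers reported in the talk.
HIDDEN = 4096            # assumed model width
BYTES_PER_ELEMENT = 2    # assumed bf16 activations


def attention_matmul_flops(seq_len: int, hidden: int = HIDDEN) -> int:
    """Forward FLOPs for QK^T and (scores @ V) in one layer, no causal masking."""
    # Each of the two matmuls costs 2 * seq_len^2 * hidden FLOPs across all heads.
    return 4 * seq_len * seq_len * hidden


def hidden_state_bytes(seq_len: int, hidden: int = HIDDEN,
                       bytes_per_element: int = BYTES_PER_ELEMENT) -> int:
    """Bytes needed to hold one [seq_len, hidden] activation tensor."""
    return seq_len * hidden * bytes_per_element


if __name__ == "__main__":
    for tokens in (1_000_000, 5_000_000):
        flops = attention_matmul_flops(tokens)
        gib = hidden_state_bytes(tokens) / 2**30
        print(f"{tokens:>9,} tokens: {flops:.2e} attention FLOPs/layer, "
              f"{gib:.1f} GiB per hidden-state tensor")
```

Under these assumptions, going from 1 million to 5 million tokens multiplies the attention matmul cost by 25 but the size of each stored activation tensor by only 5, which is why the memory side can be attacked with sharding, checkpointing, and offloading.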

Key takeaway

For AI architects and ML engineers designing or fine-tuning large language models for agentic or video-generation applications, memory optimization largely determines the reachable context length. Combine sharded data parallelism (FSDP), context parallelism such as DeepSpeed Ulysses, activation checkpointing, and CPU offloading of activations. To push context lengths beyond 3 million tokens, consider Together AI's "Untitled Ulysses" approach, which makes more efficient use of GPU memory at extreme sequence lengths.

Key insights

Reaching a 5-million-token context in LLM training requires stacking several memory optimization techniques.

Principles

Method

Combine FSDP, DeepSpeed Ulysses, activation checkpointing, CPU offloading of activations, Arctic sequence length training, and "Untitled Ulysses" (chunked attention head computation).
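The idea behind chunking attention head computation can be illustrated with a toy, CPU-only sketch. This is not Together AI's implementation: it does not model the all-to-all communication of Ulysses or real GPU buffer reuse, and every function name here is hypothetical. It only shows the property the technique relies on, namely that attention heads are independent, so processing them a few at a time gives the same result while bounding how many heads are live in the working set at once.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend_one_head(q, k, v):
    """Plain (non-causal) attention for one head; q, k, v are [seq][head_dim]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def attention_all_heads(q, k, v):
    """Baseline: all heads in one pass; q, k, v are [heads][seq][head_dim]."""
    return [attend_one_head(qh, kh, vh) for qh, kh, vh in zip(q, k, v)]


def attention_head_chunked(q, k, v, heads_per_chunk):
    """Process heads in groups so at most heads_per_chunk heads are live per step."""
    outputs = []
    peak_live_heads = 0
    for start in range(0, len(q), heads_per_chunk):
        stop = start + heads_per_chunk
        # In a real system, this step's buffers would be reused by the next chunk.
        chunk_out = attention_all_heads(q[start:stop], k[start:stop], v[start:stop])
        peak_live_heads = max(peak_live_heads, len(chunk_out))
        outputs.extend(chunk_out)
    return outputs, peak_live_heads


def random_heads(rng, heads, seq, dim):
    """Hypothetical helper: build a [heads][seq][dim] nested list of floats."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(dim)]
             for _ in range(seq)] for _ in range(heads)]


if __name__ == "__main__":
    rng = random.Random(0)
    q, k, v = (random_heads(rng, 4, 3, 2) for _ in range(3))
    full = attention_all_heads(q, k, v)
    chunked, peak = attention_head_chunked(q, k, v, heads_per_chunk=2)
    print("identical outputs:", chunked == full, "| peak live heads:", peak)
```

The trade-off in a real system is between smaller chunks (less live activation memory) and the overhead of launching more, smaller kernels and communication steps; the source reports that throughput is not significantly affected.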

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.