Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI

2026-06-08 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

Together AI's research project, "Road to 5 Million Sequence Length," addresses the memory barriers that arise when training large language models with extended context. The project scaled transformer-based models, such as a Llama 3B architecture, to 5 million tokens on a single node of 8 H100 GPUs. The key bottlenecks are attention compute, which grows quadratically with sequence length, and activation memory, which grows linearly. The team combined existing techniques: Fully Sharded Data Parallelism (FSDP), DeepSpeed Ulysses context parallelism, activation checkpointing, and offloading activations to CPU. They also drew on Snowflake's Arctic Long Sequence Training (ALST) and introduced a novel optimization called "Untied Ulysses," which refines context parallelism by computing attention in chunks of heads so that buffers can be reused, reducing activation memory without a significant throughput cost.
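
To make the two growth rates concrete, here is a rough back-of-the-envelope sketch. The hidden size (4096) and bf16 precision are assumptions chosen for illustration, not figures from the talk, and the FLOP count ignores any savings from causal masking.

```python
# Hypothetical dimensions for illustration; not the configuration from the talk.
HIDDEN_SIZE = 4096
BYTES_PER_ELEM = 2  # bf16


def hidden_state_bytes(seq_len: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Bytes for one [seq_len, hidden_size] activation tensor (linear in seq_len)."""
    return seq_len * hidden_size * BYTES_PER_ELEM


def attention_matmul_flops(seq_len: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Forward FLOPs for QK^T and (scores @ V) in one layer, summed over heads.

    Each of the two matmuls costs about 2 * seq_len**2 * hidden_size FLOPs,
    so the total is quadratic in seq_len.
    """
    return 4 * seq_len * seq_len * hidden_size


for seq_len in (128_000, 1_000_000, 5_000_000):
    gb = hidden_state_bytes(seq_len) / 1e9
    pflops = attention_matmul_flops(seq_len) / 1e15
    print(f"{seq_len:>9,} tokens: {gb:7.2f} GB per hidden-state tensor, "
          f"{pflops:10.2f} PFLOPs of attention matmuls per layer")
```

Under these assumptions, a single hidden-state tensor at 5 million tokens occupies about 41 GB before any other activation is counted, which is why sharding, recomputation, and offloading all have to be stacked.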

Key takeaway

For AI Architects and ML Engineers designing or fine-tuning large language models for agentic or video generation applications, understanding and implementing advanced memory optimization techniques is crucial. Combine sharded data parallelism, context parallelism such as DeepSpeed Ulysses, activation checkpointing, and CPU offloading. Consider Together AI's "Untied Ulysses" approach to push context lengths beyond 3 million tokens and make more efficient use of GPU memory at extreme sequence lengths.

Key insights

Achieving 5 million token context in LLMs requires stacking multiple memory optimization techniques.

Principles

Memory bottlenecks appear unexpectedly in long context training.
Context parallelism can be further optimized by chunking attention heads.
Recomputing activations reduces memory footprint significantly.
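
The head-chunking principle rests on the fact that attention heads are independent of each other, so they can be processed a few at a time and the staging buffers reused. The sketch below is a pure-Python toy, not Together AI's implementation: the function names and tensor sizes are hypothetical, and the "peak heads" counter is only a stand-in for buffer size.

```python
import math
import random


def attention_one_head(q, k, v):
    """softmax(q k^T / sqrt(dim)) v for one head; q, k, v are [seq][dim] lists."""
    dim = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(dim)])
    return out


def attention_all_heads(q, k, v):
    """Process every head in one pass; q, k, v are [heads][seq][dim]."""
    return [attention_one_head(qh, kh, vh) for qh, kh, vh in zip(q, k, v)]


def attention_head_chunks(q, k, v, chunk):
    """Process heads `chunk` at a time; returns (output, peak heads staged at once)."""
    out, peak_heads = [], 0
    for start in range(0, len(q), chunk):
        stop = min(start + chunk, len(q))
        # Only this slice of heads needs buffers; they can be reused next iteration.
        out.extend(attention_all_heads(q[start:stop], k[start:stop], v[start:stop]))
        peak_heads = max(peak_heads, stop - start)
    return out, peak_heads


random.seed(0)
HEADS, SEQ, DIM = 8, 6, 4  # toy sizes, chosen only for illustration
q, k, v = ([[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(SEQ)]
            for _ in range(HEADS)] for _ in range(3))

full = attention_all_heads(q, k, v)
chunked, peak = attention_head_chunks(q, k, v, chunk=2)
print("same result:", chunked == full, "| peak heads staged:", peak, "of", HEADS)
```

The output is identical because no head reads another head's data; only the number of heads resident at once changes.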

Method

Combine FSDP, DeepSpeed Ulysses, activation checkpointing, CPU offloading for activations, Arctic Long Sequence Training (ALST), and "Untied Ulysses" for chunked attention head computation.
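
The core data movement in Ulysses-style context parallelism is an all-to-all that turns a sequence-sharded layout into a head-sharded one, so each rank can run attention over the full sequence for its own subset of heads. The following is a single-process simulation with plain lists, not DeepSpeed's API; the function names and sizes are made up for illustration.

```python
def shard_by_sequence(num_tokens, num_heads, world_size):
    """Rank r holds a contiguous slice of tokens for every head.

    Each element is a (token, head) pair standing in for that head's
    slice of the token's Q/K/V activations.
    """
    per_rank = num_tokens // world_size
    return [[(t, h)
             for t in range(r * per_rank, (r + 1) * per_rank)
             for h in range(num_heads)]
            for r in range(world_size)]


def all_to_all_seq_to_head(shards, num_heads):
    """Simulated all-to-all: afterwards rank r holds all tokens for its heads."""
    world_size = len(shards)
    heads_per_rank = num_heads // world_size
    received = [[] for _ in range(world_size)]
    for source in shards:  # every rank sends one piece to every other rank
        for token, head in source:
            received[head // heads_per_rank].append((token, head))
    return received


NUM_TOKENS, NUM_HEADS, WORLD_SIZE = 16, 8, 4  # toy sizes
seq_sharded = shard_by_sequence(NUM_TOKENS, NUM_HEADS, WORLD_SIZE)
head_sharded = all_to_all_seq_to_head(seq_sharded, NUM_HEADS)

for rank, items in enumerate(head_sharded):
    tokens = sorted({t for t, _ in items})
    heads = sorted({h for _, h in items})
    print(f"rank {rank}: tokens {tokens[0]}..{tokens[-1]}, heads {heads}")
```

Each rank holds the same number of elements before and after the exchange; what changes is that the sequence dimension becomes complete on every rank, which is what attention needs.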

In practice

Use the PyTorch profiler to identify memory bottlenecks.
Implement DeepSpeed Ulysses for attention parallelism.
Offload non-critical activations to CPU.
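
As a minimal sketch of how these steps look in PyTorch, the example below profiles memory for a baseline step, then repeats it with activation checkpointing and with saved activations offloaded to CPU. It uses `torch.profiler`, `torch.utils.checkpoint.checkpoint`, and `torch.autograd.graph.save_on_cpu`; the tiny model is a hypothetical stand-in, and the example runs on CPU, so it shows the API shape rather than real GPU memory savings.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Hypothetical stand-in for a stack of transformer blocks.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(4)]
)
x = torch.randn(32, 64)


def run(mode: str) -> list[torch.Tensor]:
    """Run one forward/backward pass and return the parameter gradients."""
    blocks.zero_grad(set_to_none=True)
    h = x
    if mode == "baseline":
        for block in blocks:
            h = block(h)
    elif mode == "checkpoint":
        # Keep only each block's input; recompute its internals during backward.
        for block in blocks:
            h = checkpoint(block, h, use_reentrant=False)
    elif mode == "offload":
        # Tensors saved for backward are moved to host memory and copied back later.
        with torch.autograd.graph.save_on_cpu(pin_memory=False):
            for block in blocks:
                h = block(h)
    else:
        raise ValueError(f"unknown mode: {mode}")
    h.sum().backward()
    return [p.grad.clone() for p in blocks.parameters()]


# Profile memory for the baseline step to see which ops allocate the most.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    baseline = run("baseline")
report = prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5)
print(report)

ckpt = run("checkpoint")
offload = run("offload")
print("checkpoint grads match:",
      all(torch.allclose(a, b) for a, b in zip(baseline, ckpt)))
print("offload grads match:",
      all(torch.allclose(a, b) for a, b in zip(baseline, offload)))
```

Both variants trade something for memory: checkpointing pays extra compute for the recomputation, and offloading pays host-device transfer time, so the profiler output is the guide to where each is worth applying.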

Topics

Long Context LLMs
Context Parallelism
DeepSpeed Ulysses
Activation Checkpointing
Memory Optimization
Together AI
Transformer Models

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.