Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI
Summary
Together AI's research project, "Road to 5 Million Sequence Length," tackles the memory barriers that arise when training large language models on very long contexts. The team scaled a transformer-based model with a Llama 3B architecture to 5 million tokens on an 8x H100 GPU node. The two main bottlenecks are attention computation, which grows quadratically with sequence length, and activation memory, which grows linearly. To get past them, the team stacked existing techniques: Fully Sharded Data Parallelism (FSDP), DeepSpeed Ulysses context parallelism, activation checkpointing, CPU offloading of activations, and "Arctic sequence length training." On top of that stack they introduced a new optimization called "Untitled Ulysses," which refines context parallelism by processing attention heads in chunks, so buffers are reused and activation memory drops without a significant throughput penalty. Together, these techniques make training at a 5-million-token context length possible.
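To make the two scaling behaviours concrete, here is a rough back-of-the-envelope sketch. The model shape (hidden size 3072, 28 layers) and bf16 storage are assumptions chosen for illustration, not figures from the talk, and the helper names are hypothetical. It counts only the two attention matmuls and the one layer input per layer that activation checkpointing would keep.

```python
def attention_flops(seq_len: int, hidden: int, layers: int) -> int:
    """Forward FLOPs of the QK^T and attention-times-V matmuls.

    Each matmul costs 2 * seq_len^2 * hidden FLOPs summed over heads;
    causal-mask savings are ignored.
    """
    return 4 * seq_len * seq_len * hidden * layers


def checkpoint_bytes(seq_len: int, hidden: int, layers: int,
                     bytes_per_elem: int = 2) -> int:
    """Bytes needed to keep one [seq_len, hidden] input per layer (bf16 default)."""
    return seq_len * hidden * bytes_per_elem * layers


# Assumed model shape for illustration only; not taken from the talk.
HIDDEN, LAYERS = 3072, 28

for seq_len in (1_000_000, 5_000_000):
    flops = attention_flops(seq_len, HIDDEN, LAYERS)
    gib = checkpoint_bytes(seq_len, HIDDEN, LAYERS) / 2**30
    print(f"{seq_len:>9,} tokens: {flops:.2e} attention FLOPs, "
          f"{gib:,.0f} GiB of checkpointed layer inputs")
```

Under these assumptions, going from 1 million to 5 million tokens multiplies the attention FLOPs by 25 but the stored layer inputs by only 5. Even so, the linear term alone is far larger than the memory of a single GPU, which is why sharding and CPU offloading come into play.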
Key takeaway
AI architects and ML engineers who design or fine-tune large language models for agentic or video generation applications need to understand and apply memory optimization techniques in combination: sharded data parallelism, context parallelism such as DeepSpeed Ulysses, activation checkpointing, and CPU offloading. Consider Together AI's "Untitled Ulysses" approach to push context lengths beyond 3 million tokens and make more efficient use of GPU resources at extreme sequence lengths.
Key insights
Reaching a 5-million-token context in LLMs requires stacking multiple memory optimization techniques.
Principles
- Memory bottlenecks appear unexpectedly in long context training.
- Context parallelism can be further optimized by chunking attention heads.
- Recomputing activations reduces memory footprint significantly.
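The recomputation principle can be illustrated with a toy in plain Python. This is a minimal sketch, not how PyTorch or the talk's training stack implements activation checkpointing: the "model" is a chain of scalar `tanh(w * x)` layers, and `forward`, `backward`, and `keep_every` are hypothetical names. Only every `keep_every`-th layer input is stored; the rest are recomputed segment by segment during the backward pass.

```python
import math


def forward(x, weights, keep_every):
    """Run y = tanh(w * x) layer by layer, saving only every keep_every-th input."""
    saved = {}
    for i, w in enumerate(weights):
        if i % keep_every == 0:
            saved[i] = x  # checkpoint: the input to layer i
        x = math.tanh(w * x)
    return x, saved


def backward(weights, saved, keep_every, grad_out=1.0):
    """Return d(output)/d(input), recomputing dropped activations per segment."""
    grad = grad_out
    n = len(weights)
    for start in sorted(saved, reverse=True):
        end = min(start + keep_every, n)
        # Recompute the inputs of layers start..end-1 from the checkpoint.
        inputs = [saved[start]]
        for i in range(start, end - 1):
            inputs.append(math.tanh(weights[i] * inputs[-1]))
        # Standard chain rule through the segment, last layer first.
        for i in range(end - 1, start - 1, -1):
            y = math.tanh(weights[i] * inputs[i - start])
            grad *= (1.0 - y * y) * weights[i]
    return grad


weights = [0.9, 1.1, -0.7, 0.5, 1.3, -1.2, 0.8, 0.6]
_, full = forward(0.3, weights, keep_every=1)
_, sparse = forward(0.3, weights, keep_every=4)
print(len(full), len(sparse))  # 8 stored inputs vs 2
print(backward(weights, full, 1), backward(weights, sparse, 4))
```

Both calls produce the same gradient; the checkpointed run trades extra forward computation inside each segment for a smaller set of stored activations.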
Method
Combine FSDP, DeepSpeed Ulysses, activation checkpointing, CPU offloading for activations, Arctic sequence length training, and "Untitled Ulysses" for chunked attention head computation.
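The idea behind chunked attention head computation rests on the fact that attention heads are independent of one another, so they can be processed a few at a time through a fixed-size buffer. The toy below shows only that property in plain Python; it is not Together AI's implementation, involves no GPUs or communication, and every function name in it is hypothetical.

```python
import math
import random


def attend_one_head(q, k, v):
    """Softmax attention for one head; q, k, v are lists of [head_dim] vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def attention_all_heads(q, k, v):
    """Baseline: every head's inputs are live at once."""
    return [attend_one_head(qh, kh, vh) for qh, kh, vh in zip(q, k, v)]


def attention_head_chunks(q, k, v, chunk):
    """Process `chunk` heads at a time, overwriting one staging buffer."""
    num_heads = len(q)
    out = [None] * num_heads
    buffer = [None] * chunk  # stand-in for a reused per-chunk buffer
    for start in range(0, num_heads, chunk):
        heads = range(start, min(start + chunk, num_heads))
        for slot, h in enumerate(heads):
            buffer[slot] = (q[h], k[h], v[h])  # overwrite the previous chunk
        for slot, h in enumerate(heads):
            out[h] = attend_one_head(*buffer[slot])
    return out


random.seed(0)
HEADS, TOKENS, DIM = 4, 5, 3


def rand_tensor():
    return [[[random.random() for _ in range(DIM)] for _ in range(TOKENS)]
            for _ in range(HEADS)]


q, k, v = rand_tensor(), rand_tensor(), rand_tensor()
same = attention_head_chunks(q, k, v, chunk=2) == attention_all_heads(q, k, v)
print("chunked result matches baseline:", same)
```

The staging buffer holds `chunk` heads instead of all of them, so its size is set by the chunk size rather than the head count, while the output is unchanged.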
In practice
- Use PyTorch profiler to identify memory bottlenecks.
- Implement DeepSpeed Ulysses for attention parallelism.
- Offload non-critical activations to CPU.
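For the DeepSpeed Ulysses item, it helps to see the data layout change it relies on: each GPU starts with a slice of the sequence for all heads, and an all-to-all exchange leaves each GPU with the full sequence for a slice of the heads, so attention can run locally. The single-process simulation below sketches only that re-layout, assuming the head count divides evenly across ranks. `ulysses_all_to_all` is a hypothetical helper, not a DeepSpeed API, and no real communication takes place.

```python
def ulysses_all_to_all(shards):
    """Simulate the sequence-to-head re-layout on one process.

    shards[r][t][h] is what rank r holds for its local token t and head h.
    Returns out[r][t][j]: the full sequence for rank r's j-th local head.
    """
    world = len(shards)
    num_heads = len(shards[0][0])
    assert num_heads % world == 0, "heads must divide evenly across ranks"
    per_rank = num_heads // world
    out = []
    for r in range(world):
        lo = r * per_rank
        # Gather this rank's head slice from every rank's tokens, in sequence order.
        out.append([tok[lo:lo + per_rank] for shard in shards for tok in shard])
    return out


# 2 ranks, 4 tokens, 4 heads; each value is tagged (token, head) for readability.
WORLD, TOKENS, HEADS = 2, 4, 4
per_rank_tokens = TOKENS // WORLD
shards = [
    [[(t, h) for h in range(HEADS)]
     for t in range(r * per_rank_tokens, (r + 1) * per_rank_tokens)]
    for r in range(WORLD)
]

head_sharded = ulysses_all_to_all(shards)
for rank, seq in enumerate(head_sharded):
    print(f"rank {rank}: {len(seq)} tokens x {len(seq[0])} heads, first token {seq[0]}")
```

After the exchange, every rank sees all 4 tokens but only 2 of the 4 heads. A second all-to-all after attention would restore the sequence-sharded layout.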
Topics
- Long Context LLMs
- Context Parallelism
- DeepSpeed Ulysses
- Activation Checkpointing
- Memory Optimization
- Together AI
- Transformer Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.