Ulysses Sequence Parallelism: Training with Million-Token Contexts

2025-09-11 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

Ulysses Sequence Parallelism (SP), part of Snowflake AI Research's Arctic Long Sequence Training (ALST) protocol, addresses the memory challenges of training large language models with million-token contexts. According to the article, dated March 9, 2026, the method distributes attention computation across multiple GPUs by partitioning attention heads. Because standard attention scales quadratically with sequence length, long sequences quickly exhaust the memory of a single GPU. Ulysses SP reduces per-GPU memory usage by 3.3x, which enables sequences of up to 96,000 tokens on 4x H100 80GB GPUs, a 12x increase over baselines. It integrates with Hugging Face Accelerate, Transformers Trainer, and TRL's SFTTrainer, which handle sequence sharding and weighted loss aggregation automatically. Benchmarks show that SP=4 at 64K tokens achieves 3.7x the throughput of a single-GPU baseline, processing 13,396 tokens/second.
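
To make "weighted loss aggregation" concrete, here is a minimal sketch of why a plain average of per-GPU losses can be misleading when shards hold different numbers of trainable tokens (for example because of padding or masked labels). The function name `aggregate_sp_loss` and the sample numbers are hypothetical and purely illustrative; the integrations named above perform this step internally, and their actual implementation may differ.

```python
def aggregate_sp_loss(rank_losses, rank_valid_tokens):
    """Combine per-rank mean losses into a single token-weighted mean.

    rank_losses[i]       -- mean loss over the valid tokens held by rank i
    rank_valid_tokens[i] -- number of valid (non-masked) tokens on rank i
    """
    total_tokens = sum(rank_valid_tokens)
    if total_tokens == 0:
        raise ValueError("no valid tokens on any rank")
    weighted_sum = sum(
        loss * count for loss, count in zip(rank_losses, rank_valid_tokens)
    )
    return weighted_sum / total_tokens


# Illustrative values only: four ranks, the last shard is mostly padding.
losses = [2.0, 1.0, 3.0, 4.0]
valid_tokens = [100, 100, 100, 20]

weighted = aggregate_sp_loss(losses, valid_tokens)
naive = sum(losses) / len(losses)

print(f"token-weighted loss: {weighted}")
print(f"unweighted average:  {naive}")
```

The unweighted average gives the mostly padded shard the same influence as the full ones, whereas the weighted form matches the loss that would be computed over the whole sequence on a single device.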

Key takeaway

For MLOps Engineers and Deep Learning Engineers building or fine-tuning large language models for long-context tasks, adopting Ulysses Sequence Parallelism is critical. It allows training with sequences up to 96,000 tokens on 4x H100 80GB GPUs, which is essential for document analysis or complex reasoning. You should integrate Ulysses SP via Hugging Face Accelerate, Transformers Trainer, or TRL's SFTTrainer, ensuring your sequence length is divisible by `sp_size` and considering Flash Attention 2/3 for optimal performance.

Key insights

Ulysses SP enables training LLMs with million-token contexts by distributing attention heads across GPUs, significantly reducing memory and boosting throughput.

Principles

Attention heads are independent and can be computed separately.
Sequence length must be divisible by `sp_size` so that every GPU receives an equal shard.
Combining SP with ZeRO Stage 3 further optimizes memory for large models.
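
The divisibility principle can be illustrated with a small sketch that pads a token sequence up to the next multiple of `sp_size` and then splits it into equal contiguous shards, one per GPU. The helper names `pad_to_multiple` and `shard_sequence` and the `pad_id` value are assumptions made for illustration; they are not part of any library API.

```python
def pad_to_multiple(token_ids, sp_size, pad_id=0):
    """Right-pad token_ids so its length is a multiple of sp_size."""
    remainder = len(token_ids) % sp_size
    if remainder == 0:
        return list(token_ids)
    return list(token_ids) + [pad_id] * (sp_size - remainder)


def shard_sequence(token_ids, sp_size):
    """Split a sequence into sp_size equal, contiguous shards."""
    if len(token_ids) % sp_size != 0:
        raise ValueError("sequence length must be divisible by sp_size")
    shard_len = len(token_ids) // sp_size
    return [
        token_ids[rank * shard_len:(rank + 1) * shard_len]
        for rank in range(sp_size)
    ]


tokens = list(range(1, 11))                  # 10 tokens, not divisible by 4
padded = pad_to_multiple(tokens, sp_size=4)  # padded to 12 tokens
shards = shard_sequence(padded, sp_size=4)   # 4 shards of 3 tokens each

for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

In a real training run the padded positions would also need to be excluded from the loss, which is one reason the loss is aggregated with per-token weights.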

Method

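Ulysses SP shards each input sequence across GPUs. Each GPU computes the QKV projections for its local shard, then an all-to-all exchange redistributes the data so that every GPU holds the full sequence for a subset of attention heads. Attention is computed locally for those heads, and a second all-to-all reverses the redistribution before the output projection.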
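
The data movement in this method can be sketched without any GPU by simulating the ranks as plain Python lists. The example below tracks only the layout: each entry is a label such as `t2h1` (token 2, head 1) standing in for that head's projected vector. The function names are hypothetical, the attention math itself is omitted, and the sketch assumes the number of heads is divisible by `sp_size`.

```python
def all_to_all_seq_to_head(seq_shards, sp_size):
    """Sequence-sharded -> head-sharded layout.

    seq_shards[rank][t][h]: each rank holds some tokens with ALL heads.
    Returns out[rank][t][h]: each rank holds ALL tokens with some heads.
    """
    num_heads = len(seq_shards[0][0])
    if num_heads % sp_size != 0:
        raise ValueError("num_heads must be divisible by sp_size")
    heads_per_rank = num_heads // sp_size
    out = []
    for dst in range(sp_size):
        first_head = dst * heads_per_rank
        full_sequence = []
        # Concatenating source shards in rank order restores token order.
        for src in range(sp_size):
            for token in seq_shards[src]:
                full_sequence.append(
                    token[first_head:first_head + heads_per_rank]
                )
        out.append(full_sequence)
    return out


def all_to_all_head_to_seq(head_shards, sp_size):
    """Inverse exchange: head-sharded -> sequence-sharded layout."""
    seq_len = len(head_shards[0])
    shard_len = seq_len // sp_size
    out = []
    for dst in range(sp_size):
        local_tokens = []
        for t in range(dst * shard_len, (dst + 1) * shard_len):
            token = []
            for src in range(sp_size):
                token.extend(head_shards[src][t])  # regroup all heads
            local_tokens.append(token)
        out.append(local_tokens)
    return out


sp_size, seq_len, num_heads = 2, 4, 4
full = [[f"t{t}h{h}" for h in range(num_heads)] for t in range(seq_len)]
shard_len = seq_len // sp_size

# Step 1: shard the sequence; each rank projects QKV for its own tokens.
seq_sharded = [full[r * shard_len:(r + 1) * shard_len] for r in range(sp_size)]
# Step 2: all-to-all so each rank sees every token for its subset of heads.
head_sharded = all_to_all_seq_to_head(seq_sharded, sp_size)
# Step 3 (omitted): each rank would run attention over its heads here.
# Step 4: reverse all-to-all ahead of the output projection.
restored = all_to_all_head_to_seq(head_sharded, sp_size)

print("rank 0 after all-to-all:", head_sharded[0])
print("layout restored:", restored == seq_sharded)
```

Because each head attends over the full sequence independently of the other heads, the per-head results computed in step 3 need no further cross-GPU reduction; only the two layout exchanges involve communication.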

In practice

Use `sp_attn_implementation="flash_attention_2"` for Ampere GPUs.
Combine Ulysses with DeepSpeed ZeRO Stage 3 for very large models.
Balance `sp_size` and `dp_shard_size` for 2D parallelism.
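
As a rough aid for balancing the two sizes, the sketch below enumerates candidate layouts for a given GPU count. It rests on stated assumptions rather than on confirmed library behavior: that `sp_size * dp_shard_size` equals the number of GPUs (no other parallelism dimensions in use), and that both the number of attention heads and the sequence length must be divisible by `sp_size`. The function `valid_layouts` and the example model dimensions are hypothetical.

```python
def valid_layouts(world_size, num_heads, seq_len):
    """List (sp_size, dp_shard_size) pairs that satisfy the assumed constraints."""
    layouts = []
    for sp_size in range(1, world_size + 1):
        if world_size % sp_size != 0:
            continue  # assumed: sp_size * dp_shard_size == world_size
        if num_heads % sp_size != 0 or seq_len % sp_size != 0:
            continue  # assumed: heads and tokens split evenly across SP ranks
        layouts.append(
            {
                "sp_size": sp_size,
                "dp_shard_size": world_size // sp_size,
                "tokens_per_gpu": seq_len // sp_size,
            }
        )
    return layouts


# Illustrative setup: 8 GPUs, 32 attention heads, 65,536-token sequences.
layouts = valid_layouts(world_size=8, num_heads=32, seq_len=65536)
for layout in layouts:
    print(layout)
```

A larger `sp_size` shortens the per-GPU token shard and so lowers activation memory for long sequences, while a larger `dp_shard_size` processes more sequences in parallel; which balance works best depends on the model and the context length.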

Topics

Sequence Parallelism
Long-Context Training
Attention Mechanism
DeepSpeed
Hugging Face Ecosystem

Best for: Machine Learning Engineer, Deep Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.