Zyphra Introduces Tensor and Sequence Parallelism (TSP): A Hardware-Aware Training and Inference Strategy That Delivers 2.6x Throughput Over Matched TP+SP Baselines
Summary
Zyphra has introduced Tensor and Sequence Parallelism (TSP), a hardware-aware training and inference strategy designed to address GPU memory bottlenecks in long-context transformer models. Existing methods each shard only one thing: Tensor Parallelism (TP) shards parameters but not activations, while Sequence Parallelism (SP) shards tokens but not parameters. Combining them as TP+SP requires T·Σ GPUs per model replica (the product of the tensor-parallel and sequence-parallel degrees), which often pushes communication onto slow inter-node links. TSP folds both tensor and sequence parallelism onto a single mesh axis of D GPUs, so each GPU holds 1/D of the model weights and 1/D of the token sequence at the same time. This reduces memory usage: on MI300X GPUs at 128K context with 8 GPUs, TSP uses 38.8 GB/GPU, compared to 70.0 GB/GPU for TP and 85-140 GB/GPU for TP+SP. At 1,024 GPUs and 128K sequence length, TSP delivers 173M tokens/sec, a 2.6x throughput improvement over TP+SP's 66M tokens/sec.
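The arithmetic behind these figures can be checked with a short sketch. It reads the per-replica GPU count for TP+SP as the product of the tensor-parallel degree T and the sequence-parallel degree Σ. The function names and the example degrees (T = 8, Σ = 8) are illustrative assumptions, not values from the article. Only the throughput and memory numbers are taken from the summary.

```python
def gpus_per_replica_tp_sp(tp_degree: int, sp_degree: int) -> int:
    """TP+SP uses two mesh axes, so one replica needs T * Sigma GPUs."""
    return tp_degree * sp_degree


def gpus_per_replica_tsp(d: int) -> int:
    """TSP folds both axes onto one, so one replica needs D GPUs."""
    return d


def shard_fraction(d: int) -> float:
    """Fraction of the weights (and of the token sequence) held per GPU."""
    return 1.0 / d


# Figures quoted in the summary.
tsp_tokens_per_sec = 173e6      # TSP at 1,024 GPUs, 128K sequence length
tp_sp_tokens_per_sec = 66e6     # TP+SP under the same conditions
tsp_mem_gb = 38.8               # TSP, 8 MI300X GPUs, 128K context
tp_mem_gb = 70.0                # TP, same setup

speedup = tsp_tokens_per_sec / tp_sp_tokens_per_sec
memory_saving_vs_tp = 1.0 - tsp_mem_gb / tp_mem_gb

if __name__ == "__main__":
    # Hypothetical degrees, chosen only to illustrate the GPU-count gap.
    print("TP+SP GPUs per replica:", gpus_per_replica_tp_sp(8, 8))
    print("TSP GPUs per replica:  ", gpus_per_replica_tsp(8))
    print("Per-GPU shard fraction:", shard_fraction(8))
    print(f"Throughput ratio: {speedup:.2f}x")
    print(f"Memory saving vs TP: {memory_saving_vs_tp:.0%}")
```

The throughput ratio works out to roughly 2.6, which matches the headline claim.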
Key takeaway
For MLOps Engineers and AI Engineers optimizing large language model training and inference, Zyphra's TSP targets GPU memory constraints directly. If your workloads involve long contexts or moderate batch sizes (BS > 8h), TSP can raise throughput by up to 2.6x over a matched TP+SP baseline and cut the per-GPU memory footprint, making larger models feasible on existing hardware. Evaluate TSP for your transformer-based systems to improve efficiency and scalability.
Key insights
TSP simultaneously shards model weights and token sequences, resolving GPU memory bottlenecks for long-context transformers.
Principles
- Fold orthogonal parallelism axes onto one.
- Pipeline weight transfers behind GEMM compute.
Method
TSP assigns 1/D of model weights and 1/D of the token sequence to each GPU. Attention involves broadcasting packed weight shards and all-gathering K/V. Gated MLP uses a point-to-point ring for weight rotation and local accumulation.
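The gated MLP step can be illustrated with a single-process NumPy simulation. This is a minimal sketch under stated assumptions, not Zyphra's implementation. It assumes the weights are sharded along the MLP's intermediate dimension, that the gate activation is SiLU, and that each "device" is just a list index. The function names are hypothetical. Each device keeps its own token shard, computes a partial output with the weight shard it currently holds, accumulates locally, and then passes that weight shard to the next device in the ring.

```python
import numpy as np


def silu(x):
    """SiLU activation (assumed here; the summary only says 'gated MLP')."""
    return x / (1.0 + np.exp(-x))


def gated_mlp_reference(x, w_gate, w_up, w_down):
    """Unsharded gated MLP: (silu(x @ Wg) * (x @ Wu)) @ Wd."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def gated_mlp_ring(x, w_gate, w_up, w_down, d):
    """Simulate D devices, each holding 1/D of the tokens and 1/D of the weights."""
    # Tokens are split along the sequence axis.
    x_shards = np.array_split(x, d, axis=0)
    # Weights are split along the intermediate dimension (an assumption).
    gate_shards = np.array_split(w_gate, d, axis=1)
    up_shards = np.array_split(w_up, d, axis=1)
    down_shards = np.array_split(w_down, d, axis=0)

    # held[dev] is the index of the weight shard currently on device dev.
    held = list(range(d))
    outputs = [np.zeros((xs.shape[0], w_down.shape[1])) for xs in x_shards]

    for _ in range(d):
        for dev in range(d):
            j = held[dev]
            hidden = silu(x_shards[dev] @ gate_shards[j]) * (x_shards[dev] @ up_shards[j])
            # Local accumulation: partial sums over intermediate slices add up
            # to the full MLP output for this device's tokens.
            outputs[dev] += hidden @ down_shards[j]
        # Point-to-point rotation: device dev receives the shard from dev - 1.
        held = [held[(dev - 1) % d] for dev in range(d)]

    return np.concatenate(outputs, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 8))        # 16 tokens, hidden size 8
    w_gate = rng.standard_normal((8, 12))   # intermediate size 12
    w_up = rng.standard_normal((8, 12))
    w_down = rng.standard_normal((12, 8))
    ring = gated_mlp_ring(x, w_gate, w_up, w_down, d=4)
    ref = gated_mlp_reference(x, w_gate, w_up, w_down)
    print("max abs difference:", np.abs(ring - ref).max())
```

The result matches the unsharded computation because the activation is elementwise, so the output is a sum of independent contributions from each slice of the intermediate dimension. After D rotation steps every device has seen every weight shard exactly once.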
In practice
- Achieves 2.6x throughput over TP+SP.
- Reduces memory to 38.8 GB/GPU at 128K context.
- Wins when batch size > 8h.
Topics
- Tensor and Sequence Parallelism
- Transformer Training
- GPU Memory Optimization
- Long-Context Models
- Parallel Computing
Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.