Zyphra Introduces Tensor and Sequence Parallelism (TSP): A Hardware-Aware Training and Inference Strategy That Delivers 2.6x Throughput Over Matched TP+SP Baselines

2026-05-04 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

Zyphra has introduced Tensor and Sequence Parallelism (TSP), a novel hardware-aware training and inference strategy designed to address GPU memory bottlenecks in long-context transformer models. Traditional methods like Tensor Parallelism (TP) shard parameters but not activations, while Sequence Parallelism (SP) shards tokens but not parameters. The combined TP+SP approach requires T.Σ GPUs per model replica, often leading to slow inter-node communication. TSP folds both tensor and sequence parallelism onto a single mesh axis, allowing each GPU to handle 1/D of model weights and 1/D of the token sequence simultaneously. This approach significantly reduces memory usage, achieving 38.8 GB/GPU with TSP compared to 70.0 GB/GPU for TP and 85-140 GB/GPU for TP+SP on MI300X GPUs at 128K context with 8 GPUs. At 1,024 GPUs and 128K sequence length, TSP delivers 173M tokens/sec, a 2.6x throughput improvement over TP+SP's 66M tokens/sec.

Key takeaway

For MLOps Engineers and AI Engineers optimizing large language model training and inference, Zyphra's TSP offers a compelling solution to GPU memory constraints. If your workloads involve long contexts or moderate batch sizes (BS > 8h), adopting TSP can significantly boost throughput by up to 2.6x and reduce per-GPU memory footprint, making larger models feasible on existing hardware. Evaluate TSP for your transformer-based systems to enhance efficiency and scalability.

Key insights

TSP simultaneously shards model weights and token sequences, resolving GPU memory bottlenecks for long-context transformers.

Principles

Fold orthogonal parallelism axes onto one.
Pipeline weight transfers behind GEMM compute.

Method

TSP assigns 1/D of model weights and 1/D of the token sequence to each GPU. Attention involves broadcasting packed weight shards and all-gathering K/V. Gated MLP uses a point-to-point ring for weight rotation and local accumulation.

In practice

Achieves 2.6x throughput over TP+SP.
Reduces memory to 38.8 GB/GPU at 128K context.
Wins when batch size > 8h.

Topics

Tensor and Sequence Parallelism
Transformer Training
GPU Memory Optimization
Long-Context Models
Parallel Computing

Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.