DP Attention and TBO for DeepSeek-V4 on MI355X
Summary
ATOM improves DeepSeek-V4 inference performance on AMD Instinct™ MI355X GPUs through two core optimizations: DP Attention Scheduling and Two-Batch Overlap (TBO) for standard collectives. DP Attention Scheduling uses PrefillDelayer to coordinate prefill admission across Data Parallel (DP) ranks so that the ranks stay in the same phase. Without that coordination, phase mismatches in real-world workloads lead to up to 86% padding waste, eager decode, and dummy prefill.
TBO is improved in two ways. First, token-level even splitting for prefill balances the two micro-batches by token count rather than by request boundaries, which maximizes compute-communication overlap. Second, ATOM extends TBO beyond specialized all2all backends to standard all_gather/reduce_scatter (AG/RS) collectives by placing yield and stream-switch points at collective boundaries, so that MoE communication overlaps with compute. The result is competitive DeepSeek-V4 throughput on MI355X for the 8K/1K workload, as validated by SemiAnalysis InferenceX benchmarks as of June 18, 2026, with a simpler, more flexible deployment strategy than Expert Parallel setups.
Key takeaway
For AI engineers optimizing MoE inference on AMD Instinct GPUs, ATOM's approach offers an alternative to complex Expert Parallel setups. Consider adopting DP Attention with TBO for standard collectives: it simplifies deployment by removing the need for specialized all2all libraries and expert partitioning, and it keeps DeepSeek-V4 throughput on MI355X competitive on existing hardware with less configuration overhead.
Key insights
ATOM optimizes MoE inference by coordinating DP Attention and overlapping standard collectives with compute.
Principles
- Phase alignment across DP ranks reduces padding waste.
- Token-level splitting balances micro-batches for TBO.
- Overlap communication with compute at collective boundaries.
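To make the first principle concrete, the sketch below estimates padding waste for one lockstep step across DP ranks. It assumes, as a simplification not spelled out in the summary, that every rank is padded up to the largest per-rank token count in that step. The function name and the token counts are illustrative only and do not reproduce the 86% figure.

```python
from typing import Sequence


def padding_waste(tokens_per_rank: Sequence[int]) -> float:
    """Fraction of padded token slots that carry no real tokens.

    Assumption (hypothetical model): all DP ranks pad to the maximum
    token count seen on any rank in the same step.
    """
    if not tokens_per_rank:
        raise ValueError("need at least one rank")
    padded_slots = max(tokens_per_rank) * len(tokens_per_rank)
    if padded_slots == 0:
        return 0.0
    return 1.0 - sum(tokens_per_rank) / padded_slots


# One rank in prefill, three ranks in decode: most slots are padding.
mismatched = padding_waste([8192, 64, 64, 64])

# All ranks in the same phase with equal work: no padding.
aligned = padding_waste([2048, 2048, 2048, 2048])
```

Under this model, the mismatched step wastes roughly three quarters of its slots, while the phase-aligned step wastes none, which is the effect coordinated prefill admission aims for.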
Method
ATOM uses PrefillDelayer for coordinated DP prefill scheduling and token-level even splitting for TBO. It places yield points at all_gather/reduce_scatter boundaries to interleave communication and compute streams.
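Token-level even splitting can be sketched as follows. This is a minimal illustration, not ATOM's implementation: the function name, the span representation, and the rule that the first micro-batch takes the extra token on an odd total are all assumptions made for the example.

```python
from typing import List, Tuple

# (request index, start token, end token), end exclusive
Span = Tuple[int, int, int]


def split_tokens_evenly(request_lens: List[int]) -> Tuple[List[Span], List[Span]]:
    """Split a prefill batch into two micro-batches with near-equal token counts.

    A request that straddles the midpoint is cut in the middle, so the
    split follows token counts rather than request boundaries.
    """
    total = sum(request_lens)
    half = (total + 1) // 2  # assumption: first micro-batch gets the odd token
    first: List[Span] = []
    second: List[Span] = []
    consumed = 0  # tokens of all earlier requests
    for idx, length in enumerate(request_lens):
        take = max(0, min(length, half - consumed))
        if take > 0:
            first.append((idx, 0, take))
        if take < length:
            second.append((idx, take, length))
        consumed += length
    return first, second


def span_tokens(spans: List[Span]) -> int:
    """Total number of tokens covered by a list of spans."""
    return sum(end - start for _, start, end in spans)


mb0, mb1 = split_tokens_evenly([6000, 1000, 1192])
```

For these three requests, a split at request boundaries would give 6000 versus 2192 tokens, whereas the token-level split gives 4096 tokens to each micro-batch by cutting the first request.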
In practice
- Deploy MoE models on standard interconnects.
- Reduce configuration complexity for MoE inference.
- Use GPU stream switching (HIP streams on MI355X) for communication overlap.
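The idea of yielding at collective boundaries can be illustrated with a CPU-only scheduling analogy. In the sketch below, Python generators stand in for the two micro-batches, and each `yield` marks the point where a real system would switch streams while a collective is in flight. The step names, the position of all_gather and reduce_scatter within the layer, and the round-robin driver are assumptions for illustration; no GPU work or real communication happens here.

```python
from typing import Generator, List


def moe_layer_steps(name: str, log: List[str]) -> Generator[None, None, None]:
    """One MoE layer for one micro-batch, yielding after each collective launch."""
    log.append(f"{name}: attention compute")
    log.append(f"{name}: launch all_gather")
    yield  # collective in flight: let the other micro-batch compute
    log.append(f"{name}: expert compute")
    log.append(f"{name}: launch reduce_scatter")
    yield  # collective in flight again
    log.append(f"{name}: finish layer")


def run_two_batch_overlap(log: List[str]) -> None:
    """Alternate between two micro-batches at every yield point."""
    active = [moe_layer_steps("mb0", log), moe_layer_steps("mb1", log)]
    while active:
        for gen in list(active):
            try:
                next(gen)
            except StopIteration:
                active.remove(gen)


schedule: List[str] = []
run_two_batch_overlap(schedule)
```

In the resulting schedule, each micro-batch's compute step runs right after the other micro-batch has launched a collective, which is the interleaving that lets communication hide behind compute.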
Topics
- DeepSeek-V4
- AMD Instinct MI355X
- MoE Inference Optimization
- Data Parallel Attention
- Two-Batch Overlap
- Collective Communication
Best for: AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.