DP Attention and TBO for DeepSeek-V4 on MI355X

Source: AMD ROCm Blogs · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Advanced, long

Summary

ATOM improves DeepSeek-V4 inference performance on AMD Instinct™ MI355X GPUs through two core optimizations: DP Attention Scheduling and Two-Batch Overlap (TBO) for standard collectives.

DP Attention Scheduling uses PrefillDelayer to coordinate prefill admission across Data Parallel (DP) ranks so that the ranks stay in the same phase. Without this coordination, phase mismatches between ranks cause up to 86% padding waste, eager decode, and dummy prefill in real-world workloads.

TBO is improved in two ways. First, token-level even splitting for prefill balances the two micro-batches by token count rather than by request boundaries, which maximizes compute-communication overlap. Second, ATOM extends TBO beyond specialized all2all backends to standard all_gather/reduce_scatter (AG/RS) collectives by placing yield and stream-switch points at collective boundaries, so that MoE communication overlaps with compute.

Together, these deliver competitive DeepSeek-V4 throughput on MI355X for the 8K/1K workload, as validated by SemiAnalysis InferenceX benchmarks as of June 18, 2026, with a simpler and more flexible deployment than Expert Parallel setups.
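
To make the phase-mismatch cost concrete, here is a minimal sketch of the arithmetic, assuming every DP rank pads its batch to the largest token count in the group before a collective. The padding model, the function names `padding_waste` and `should_admit_prefill`, and the `max_wait` parameter are illustrative assumptions; they are not taken from ATOM or from PrefillDelayer's actual implementation.

```python
from typing import List


def padding_waste(tokens_per_rank: List[int]) -> float:
    """Fraction of padded slots holding no real tokens.

    Assumption: every rank pads to the largest per-rank token count.
    """
    padded = max(tokens_per_rank) * len(tokens_per_rank)
    if padded == 0:
        return 0.0
    return 1.0 - sum(tokens_per_rank) / padded


def should_admit_prefill(ranks_with_prefill: List[bool],
                         waited_steps: int,
                         max_wait: int = 4) -> bool:
    """Hypothetical admission rule: start prefill only when every rank has
    one queued, or when some rank has waited max_wait steps already."""
    if all(ranks_with_prefill):
        return True
    return any(ranks_with_prefill) and waited_steps >= max_wait


# One rank in prefill, three in decode: most padded slots are wasted.
mismatched = padding_waste([1000, 10, 10, 10])
# All ranks in prefill with equal token counts: no padding at all.
aligned = padding_waste([1000, 1000, 1000, 1000])
print(f"mismatched waste: {mismatched:.2%}, aligned waste: {aligned:.2%}")
```

The wait limit in the hypothetical rule reflects a general trade-off: delaying prefill keeps ranks aligned, but an unbounded delay would stall requests on ranks that are ready.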

Key takeaway

For AI engineers optimizing MoE inference on AMD Instinct GPUs, ATOM's approach is an alternative to more complex Expert Parallel setups. DP Attention with TBO over standard collectives removes the need for specialized all2all libraries and expert partitioning, which simplifies deployment and reduces configuration overhead while still reaching competitive DeepSeek-V4 throughput on MI355X.

Key insights

ATOM optimizes MoE inference by coordinating DP Attention and overlapping standard collectives with compute.
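
The overlap idea can be illustrated with a CPU-only sketch in which each micro-batch is a Python generator that yields right after launching a collective, so the other micro-batch can run its compute in the meantime. This is a scheduling toy under stated assumptions: the names `moe_layer` and `run_two_batch_overlap` are hypothetical, no real GPU streams or collectives are involved, and ATOM's actual yield and stream-switch mechanism may differ.

```python
from typing import Generator, List


def moe_layer(name: str, trace: List[str]) -> Generator[None, None, None]:
    """One simulated MoE layer for a single micro-batch."""
    trace.append(f"{name}: attention compute")
    trace.append(f"{name}: launch all_gather")
    yield  # yield point at the collective boundary
    trace.append(f"{name}: expert compute")
    trace.append(f"{name}: launch reduce_scatter")
    yield  # second yield point at the next collective boundary
    trace.append(f"{name}: finish layer")


def run_two_batch_overlap(trace: List[str]) -> List[str]:
    """Alternate between two micro-batches at every yield point."""
    active = [moe_layer("mb0", trace), moe_layer("mb1", trace)]
    while active:
        for gen in list(active):  # iterate over a copy so removal is safe
            try:
                next(gen)
            except StopIteration:
                active.remove(gen)
    return trace


for step in run_two_batch_overlap([]):
    print(step)
```

In the printed trace, the compute of `mb1` appears directly after the `all_gather` launch of `mb0`, which is the window where communication and compute would overlap on real hardware.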

Method

ATOM uses PrefillDelayer for coordinated DP prefill scheduling and token-level even splitting for TBO. It places yield points at all_gather/reduce_scatter boundaries to interleave communication and compute streams.
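
As a rough illustration of token-level even splitting, the sketch below cuts the flattened prefill token stream at its midpoint, so a single long request may straddle both micro-batches. The function names and the `(request index, start, end)` span representation are assumptions made for this example, not ATOM's data structures.

```python
from typing import List, Tuple

# (request index, start token, end token), with end exclusive
Span = Tuple[int, int, int]


def split_by_tokens(request_lens: List[int]) -> Tuple[List[Span], List[Span]]:
    """Split requests into two micro-batches with near-equal token counts."""
    half = (sum(request_lens) + 1) // 2
    first: List[Span] = []
    second: List[Span] = []
    consumed = 0
    for idx, n in enumerate(request_lens):
        # Tokens of this request that still fit in the first micro-batch.
        cut = min(max(half - consumed, 0), n)
        if cut > 0:
            first.append((idx, 0, cut))
        if cut < n:
            second.append((idx, cut, n))
        consumed += n
    return first, second


def split_by_requests(request_lens: List[int]) -> Tuple[int, int]:
    """Baseline for comparison: split at a request boundary."""
    mid = len(request_lens) // 2
    return sum(request_lens[:mid]), sum(request_lens[mid:])


lens = [7000, 500, 500]
first, second = split_by_tokens(lens)
print("token-level split:", first, second)
print("request-level token counts:", split_by_requests(lens))
```

With one long request and two short ones, the request-boundary split leaves one micro-batch far heavier than the other, while the token-level split gives each micro-batch the same number of tokens.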

Best for: AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.