deepseek-ai / DeepEP
Summary
DeepEP is a communication library for Mixture-of-Experts (MoE) and expert parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels for the MoE dispatch and combine operations, with support for low-precision formats including FP8. Its normal kernels are optimized for asymmetric-domain bandwidth forwarding, for example forwarding data from the NVLink domain to the RDMA domain, and target training and inference prefilling. For latency-sensitive inference decoding, DeepEP adds low-latency kernels that use pure RDMA, along with a hook-based communication-computation overlapping method that does not occupy Streaming Multiprocessor (SM) resources. In tests on H800 GPUs with CX7 InfiniBand 400 Gb/s RDMA network cards, the normal kernels reach up to 158 GB/s over intranode NVLink and up to 58 GB/s over internode RDMA, while the low-latency kernels reach dispatch latencies as low as 77 µs at an EP size of 8.
Key takeaway
For MLOps engineers training or serving large-scale Mixture-of-Experts models, DeepEP provides dedicated all-to-all kernels for MoE dispatch and combine, plus communication-computation overlap that does not occupy SMs. Consider it first for latency-sensitive inference decoding, and use InfiniBand traffic isolation (Virtual Lanes) and adaptive routing to tune network behavior on your cluster.
Key insights
DeepEP optimizes MoE communication with high-throughput, low-latency kernels and SM-free overlap for training and inference.
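The SM-free overlap can be pictured as a two-phase receive: the dispatch call only issues the transfer and hands back a hook, and the received data becomes usable once the hook is called after unrelated compute has run. The toy below models only that control flow in plain Python. `issue_dispatch`, `overlapped_step`, and the list standing in for a receive buffer are hypothetical names for illustration, not DeepEP APIs, and no RDMA or GPU work takes place.

```python
from typing import Callable, List, Tuple


def issue_dispatch(tokens: List[int], log: List[str]) -> Tuple[List[int], Callable[[], None]]:
    """Hypothetical stand-in: start a transfer and return (buffer, hook)."""
    recv_buffer: List[int] = []  # stays empty until the hook runs
    log.append("dispatch issued")

    def hook() -> None:
        # In this toy, "receiving" just copies the tokens into the buffer.
        recv_buffer.extend(tokens)
        log.append("receive completed")

    return recv_buffer, hook


def overlapped_step(tokens: List[int]) -> Tuple[List[int], int, List[str]]:
    """Issue the dispatch, do other work, then complete the receive."""
    log: List[str] = []
    recv_buffer, hook = issue_dispatch(tokens, log)
    # Unrelated compute runs while the transfer is (conceptually) in flight.
    other_result = sum(x * x for x in range(4))
    log.append("other compute done")
    hook()  # the buffer contents are valid only after this call
    return recv_buffer, other_result, log
```

The point of the pattern is ordering: the caller decides when to pay for the receive, so the work between issuing and completing the transfer is not blocked on communication. The repository documents the actual hook interface of the low-latency kernels.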
Principles
- Optimize communication for MoE architectures.
- Segregate workloads across InfiniBand Virtual Lanes.
- Enable adaptive routing under heavy network load; prefer static routing under light load.
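As a sketch of the workload segregation above, each job class can be launched with its own `NVSHMEM_IB_SL` value so that its traffic is kept apart from the others. The workload names, the service-level numbers, and the helper `env_for_workload` are assumptions made for illustration; which values are valid, and how they map to virtual lanes, depends on how your InfiniBand fabric is configured.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical assignment: the numbers are placeholders, not recommendations.
WORKLOAD_SL: Dict[str, int] = {
    "normal_kernels": 0,
    "low_latency_kernels": 1,
    "other": 2,
}


def env_for_workload(workload: str,
                     base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with NVSHMEM_IB_SL set for this workload."""
    env = dict(os.environ if base_env is None else base_env)
    env["NVSHMEM_IB_SL"] = str(WORKLOAD_SL[workload])  # KeyError on unknown names
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a job, which keeps the setting scoped to that process instead of the whole shell.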
Method
DeepEP employs specialized GPU kernels for MoE dispatch/combine, asymmetric bandwidth forwarding, pure RDMA for low latency, and a hook-based communication-computation overlapping method.
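To make the two operations concrete, here is a single-process sketch of what dispatch and combine compute, leaving out the communication entirely: dispatch groups each token under the experts it is routed to, and combine sums the expert outputs back per token, scaled by the routing weights. The names `Routing`, `dispatch`, and `combine` are illustrative and are not DeepEP's API, which performs these steps across GPUs with all-to-all kernels.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Per token: a list of (expert_id, routing_weight) choices.
Routing = List[List[Tuple[int, float]]]
ExpertBatches = Dict[int, List[Tuple[int, float]]]


def dispatch(tokens: List[float], routing: Routing) -> ExpertBatches:
    """Group (token_index, value) pairs by the expert each token is routed to."""
    per_expert: ExpertBatches = defaultdict(list)
    for idx, (value, choices) in enumerate(zip(tokens, routing)):
        for expert_id, _weight in choices:
            per_expert[expert_id].append((idx, value))
    return dict(per_expert)


def combine(expert_outputs: ExpertBatches, routing: Routing,
            num_tokens: int) -> List[float]:
    """Sum each token's expert outputs, scaled by its routing weights."""
    weights = [{e: w for e, w in choices} for choices in routing]
    out = [0.0] * num_tokens
    for expert_id, items in expert_outputs.items():
        for idx, value in items:
            out[idx] += weights[idx][expert_id] * value
    return out
```

With two tokens `[1.0, 2.0]`, where the first goes to experts 0 and 1 with weight 0.5 each and the second goes only to expert 1, `dispatch` sends one token to expert 0 and two to expert 1. The uneven load per expert is why the exchange is an all-to-all with variable sizes.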
In practice
- Use `Buffer.set_num_sms()` to control SM usage.
- Set `NVSHMEM_IB_SL` for traffic isolation.
- Auto-tune configurations for cluster-specific performance.
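The SM setting and the tuning step above can be combined into a small sweep: try several SM counts, measure each, and keep the fastest. This is a minimal sketch under stated assumptions. `Buffer.set_num_sms` is the call named in the list; `apply_num_sms`, `pick_num_sms`, and the `measure` callback are hypothetical helpers, and the measurement itself (for example a timed dispatch and combine round on your cluster) is left to the caller. The repository's own test scripts remain the reference for tuning.

```python
from typing import Callable, Iterable, Optional, Tuple

try:
    from deep_ep import Buffer  # only importable where DeepEP is installed
except ImportError:
    Buffer = None


def apply_num_sms(num_sms: int) -> bool:
    """Call Buffer.set_num_sms if DeepEP is importable; report whether it ran."""
    if Buffer is None:
        return False
    Buffer.set_num_sms(num_sms)
    return True


def pick_num_sms(candidates: Iterable[int],
                 measure: Callable[[int], float]) -> Tuple[int, float]:
    """Return (best_num_sms, best_seconds) for a caller-supplied measurement.

    `measure(n)` is assumed to apply the setting, run the workload, and
    return its average duration in seconds.
    """
    best_n: Optional[int] = None
    best_t = float("inf")
    for n in candidates:
        seconds = measure(n)
        if seconds < best_t:
            best_n, best_t = n, seconds
    if best_n is None:
        raise ValueError("no candidates given")
    return best_n, best_t
```

Keeping the measurement behind a callback means the same sweep works for other knobs, and the selection logic can be checked without a GPU.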
Topics
- Mixture-of-Experts
- Expert Parallelism
- GPU Communication Library
- NVLink & RDMA Performance
- Low-Latency Inference
Code references
- deepseek-ai/DeepSeek-V3
- deepseek-ai/DeepEP
- ROCm/mori
- uccl-project/uccl
- Infrawaves/DeepEP_ibrc_dual-ports_multiQP
Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by GitHub Trending: All languages.