deepseek-ai / DeepEP

2025-02-17 · Source: Github Trending: All languages · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, long

Summary

DeepEP is a communication library for Mixture-of-Experts (MoE) models and expert parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels for the MoE dispatch and combine operations and supports low-precision formats, including FP8. Its normal kernels are optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain, which suits training and inference prefilling. For latency-sensitive inference decoding, DeepEP provides low-latency kernels that use pure RDMA, together with a hook-based communication-computation overlapping method that occupies no Streaming Multiprocessor (SM) resources. In performance tests on H800 GPUs with CX7 InfiniBand 400 Gb/s RDMA network cards, the normal kernels reach intranode NVLink bandwidths of up to 158 GB/s and internode RDMA bandwidths of up to 58 GB/s, while the low-latency kernels reach latencies as low as 77 µs at an expert-parallel size of 8.

Key takeaway

For MLOps engineers training or serving large-scale Mixture-of-Experts models, DeepEP offers specialized all-to-all kernels and communication-computation overlap that cut MoE communication latency and raise throughput. Consider integrating DeepEP for MoE dispatch and combine operations, especially in latency-sensitive inference decoding, and use its traffic isolation and adaptive routing support to tune network performance on your cluster.

Key insights

DeepEP optimizes MoE communication with high-throughput, low-latency kernels and SM-free overlap for training and inference.

Principles

Optimize communication for MoE architectures.
Segregate workloads across InfiniBand Virtual Lanes so their traffic does not interfere.
Enable adaptive routing for heavy network loads.

Method

DeepEP employs specialized GPU kernels for MoE dispatch/combine, asymmetric bandwidth forwarding, pure RDMA for low latency, and a hook-based communication-computation overlapping method.
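
To make the dispatch and combine steps concrete, here is a minimal single-process sketch in plain Python. It only illustrates the data movement that an MoE all-to-all performs; it is not DeepEP's implementation and uses none of its API. The names `dispatch`, `combine`, and `toy_expert` are hypothetical, and the stand-in expert simply doubles its input.

```python
# Illustrative only: simulates MoE dispatch/combine semantics on one process.
# All function names here are hypothetical and are not part of DeepEP.

def dispatch(tokens, topk_idx, num_experts):
    """Send a copy of each token to every expert selected for it."""
    buckets = [[] for _ in range(num_experts)]
    for t, experts in enumerate(topk_idx):
        for slot, e in enumerate(experts):
            # Keep (token id, top-k slot) so combine can route the result back.
            buckets[e].append((t, slot, tokens[t]))
    return buckets

def toy_expert(x):
    """Stand-in for an expert network: doubles every element."""
    return [2.0 * v for v in x]

def combine(expert_outputs, topk_weights, num_tokens, hidden):
    """Return expert outputs to their source tokens as a weighted sum."""
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for items in expert_outputs:
        for t, slot, y in items:
            w = topk_weights[t][slot]
            for h in range(hidden):
                out[t][h] += w * y[h]
    return out

# Four tokens with hidden size 2, four experts, top-2 routing.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
topk_idx = [[0, 1], [1, 2], [2, 3], [3, 0]]
topk_weights = [[0.5, 0.5] for _ in tokens]

buckets = dispatch(tokens, topk_idx, num_experts=4)
expert_outputs = [[(t, s, toy_expert(x)) for t, s, x in b] for b in buckets]
result = combine(expert_outputs, topk_weights, num_tokens=4, hidden=2)
```

In a real expert-parallel deployment the buckets live on different GPUs, so both steps become all-to-all exchanges over NVLink and RDMA. That exchange is the part a library such as DeepEP accelerates.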

In practice

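Use `Buffer.set_num_sms()` to set how many SMs the communication kernels may occupy.
Set the `NVSHMEM_IB_SL` environment variable to assign traffic to an InfiniBand virtual lane for isolation.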
Auto-tune configurations for cluster-specific performance.
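
The three tips above can be sketched together as follows. This is a hedged illustration, not a recommended configuration: the service level `"1"`, the SM count `24`, the candidate list, and the `autotune` helper with its `measure` callback are all assumptions chosen for the example. The `deep_ep` import is guarded so the snippet still runs on a machine without the library.

```python
import os

# Assumption: service level 1 is reserved for this workload on your fabric.
# Set the variable before any communication buffer is created.
os.environ.setdefault("NVSHMEM_IB_SL", "1")

NUM_SMS = 24  # illustrative value; tune per cluster

try:
    from deep_ep import Buffer  # requires a DeepEP installation
except ImportError:
    Buffer = None  # lets the sketch run without the library

if Buffer is not None:
    # Set the number of SMs the communication kernels may use.
    Buffer.set_num_sms(NUM_SMS)

def autotune(candidates, measure):
    """Return the candidate with the lowest measured cost.

    `measure` is a hypothetical callback supplied by the caller, for
    example one that times a dispatch/combine round trip on the cluster.
    """
    return min(candidates, key=measure)

# Usage with a placeholder cost function standing in for a real benchmark.
best_num_sms = autotune([8, 16, 24, 32], measure=lambda n: abs(n - 24))
```

On a real cluster, `measure` would run the actual kernels with each candidate setting and return the observed latency or inverse bandwidth, so the chosen value reflects that cluster's topology.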

Topics

Mixture-of-Experts
Expert Parallelism
GPU Communication Library
NVLink & RDMA Performance
Low-Latency Inference

Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Github Trending: All languages.