deepseek-ai / DeepEP

Source: Github Trending: All languages · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, long

Summary

DeepEP is a communication library for Mixture-of-Experts (MoE) models and expert parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels for the two MoE communication steps, dispatch and combine, and supports low-precision operations, including FP8. Its normal (high-throughput) kernels are optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain, which suits training and inference prefilling. For latency-sensitive inference decoding, DeepEP adds low-latency kernels that use pure RDMA, together with a hook-based method for overlapping communication with computation that does not occupy any Streaming Multiprocessor (SM) resources. In tests on H800 GPUs with CX7 InfiniBand 400 Gb/s RDMA network cards, the normal kernels reach up to 158 GB/s of intranode NVLink bandwidth and up to 58 GB/s of internode RDMA bandwidth, while the low-latency kernels reach latencies as low as 77 µs at an expert-parallel size of 8.
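To make the dispatch and combine steps concrete, here is a minimal single-process sketch in plain Python. It is not DeepEP's API: the function names, the `experts_per_rank` layout, and the gate-weighted sum in `combine` are assumptions chosen for illustration, and the "ranks" are just dictionary keys rather than GPUs.

```python
from collections import defaultdict

def dispatch(tokens, topk_experts, experts_per_rank):
    """Route each token to every rank that hosts one of its selected experts."""
    inbox = defaultdict(list)
    for token_id, (vec, experts) in enumerate(zip(tokens, topk_experts)):
        for expert_id in experts:
            rank = expert_id // experts_per_rank  # assumed contiguous expert layout
            inbox[rank].append((token_id, expert_id, vec))
    return inbox

def run_experts(inbox, expert_fn):
    """Each rank applies its local experts to the tokens it received."""
    return {
        rank: [(tid, eid, expert_fn(eid, vec)) for tid, eid, vec in items]
        for rank, items in inbox.items()
    }

def combine(outputs, topk_experts, topk_weights, num_tokens, hidden):
    """Return expert outputs to their source tokens and reduce them with gate weights."""
    result = [[0.0] * hidden for _ in range(num_tokens)]
    for items in outputs.values():
        for tid, eid, out in items:
            weight = topk_weights[tid][topk_experts[tid].index(eid)]
            for i, value in enumerate(out):
                result[tid][i] += weight * value
    return result

# Toy setup: 2 tokens, hidden size 2, 4 experts spread over 2 ranks, top-2 routing.
tokens = [[1.0, 2.0], [3.0, 4.0]]
topk_experts = [[0, 3], [2, 3]]
topk_weights = [[0.25, 0.75], [0.5, 0.5]]

def toy_expert(expert_id, vec):
    # Hypothetical expert: scales the input by (expert_id + 1).
    return [(expert_id + 1) * x for x in vec]

inbox = dispatch(tokens, topk_experts, experts_per_rank=2)
outputs = run_experts(inbox, toy_expert)
combined = combine(outputs, topk_experts, topk_weights, num_tokens=2, hidden=2)
print(combined)  # [[3.25, 6.5], [10.5, 14.0]]
```

In a real EP deployment both `dispatch` and `combine` are all-to-all exchanges between GPUs, which is why their bandwidth and latency dominate MoE communication cost.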

Key takeaway

For MLOps engineers training or deploying large-scale Mixture-of-Experts models, DeepEP offers specialized dispatch and combine kernels and SM-free communication-computation overlap that raise throughput and cut communication latency. Consider integrating it to optimize MoE dispatch and combine, especially for latency-sensitive inference decoding, and use its traffic isolation and adaptive routing options to tune network performance on your cluster.

Key insights

DeepEP optimizes MoE communication with high-throughput, low-latency kernels and SM-free overlap for training and inference.

Method

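DeepEP combines four techniques: specialized all-to-all GPU kernels for MoE dispatch and combine, asymmetric-domain bandwidth forwarding (for example, NVLink to RDMA) for training and prefilling, pure-RDMA kernels for low-latency decoding, and a hook-based method that overlaps communication with computation without occupying SMs.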
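The control flow of a hook-based overlap can be sketched as a toy simulation. This is an illustration under stated assumptions, not DeepEP code: `start_transfer` and `overlapped_step` are hypothetical names, and a background thread stands in for the network card moving data over RDMA while the caller keeps computing.

```python
import threading
import time

def start_transfer(payload, delay_s=0.05):
    """Begin a simulated transfer and return (recv_buffer, hook).

    The worker thread plays the role of the NIC; calling hook() blocks
    until the payload has landed in recv_buffer.
    """
    recv_buffer = []

    def nic_worker():
        time.sleep(delay_s)          # simulated time on the wire
        recv_buffer.extend(payload)  # data arrives in the receive buffer

    worker = threading.Thread(target=nic_worker)
    worker.start()
    return recv_buffer, worker.join

def overlapped_step(payload, compute):
    recv_buffer, hook = start_transfer(payload)  # 1. issue the send, return at once
    local_result = compute()                     # 2. independent compute runs meanwhile
    hook()                                       # 3. wait only when the data is needed
    return local_result, recv_buffer

local, received = overlapped_step([1, 2, 3], lambda: sum(range(1000)))
print(local, received)  # 499500 [1, 2, 3]
```

The point of the pattern is that the caller decides when to pay the wait: the transfer makes progress without holding compute resources, and the hook is invoked only at the step that actually consumes the received data.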

Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Github Trending: All languages.