Intel® Xeon® 6 Processors and Intel® AMX Deliver More Concurrent Users with NVIDIA HGX B200 Systems
Summary
Intel's blog introduces a heterogeneous architecture for large language model (LLM) serving that co-runs vLLMs on both CPUs and GPUs to enhance system efficiency and user concurrency. This approach utilizes Intel Xeon 6 processors with Intel Advanced Matrix Extensions (Intel AMX) alongside NVIDIA HGX B200 systems. The architecture employs a CPU pathway running a Llama-3.1-8B model for lightweight tasks like research extraction and validation, and a GPU pathway running a Llama-3.1-405B model for compute-intensive generation. Performance evaluations on a Supermicro SYS-822GS-NBRT HGX platform, featuring 2 x Intel Xeon 6776P processors and 8 x NVIDIA HGX B200 GPUs, demonstrated up to 1.44x higher user concurrency. This improvement is achieved by continuously engaging both CPU and GPU resources through multi-agent pipelining, where tasks like research, writing, and reviewing overlap, reducing end-to-end latency compared to sequential execution.
Key takeaway
For MLOps Engineers optimizing LLM serving systems, you should consider implementing a heterogeneous CPU-GPU architecture to maximize resource utilization and user concurrency. By offloading lightweight reasoning tasks to Intel Xeon 6 processors with Intel AMX and reserving GPUs for intensive generation, you can achieve up to 1.44x higher concurrency. This approach reduces GPU idle time and stabilizes tail latency, making your production-scale vLLM deployments more efficient and scalable. Evaluate pipelined multi-agent designs to further enhance throughput.
Key insights
Heterogeneous CPU-GPU co-serving with Intel AMX significantly boosts LLM inference concurrency and efficiency.
Principles
- Distribute LLM tasks by complexity.
- Pipeline multi-agent execution.
- Engage all compute resources continuously.
Method
Implement a multi-agent system with CPU-based Researcher/Reviewer agents and a GPU-based Writer agent, enabling pipelined, overlapping execution for reduced latency and improved throughput.
In practice
- Use Llama-3.1-8B on CPU for lightweight reasoning.
- Deploy Llama-3.1-405B on GPU for complex generation.
- Leverage Intel AMX for CPU acceleration.
Topics
- LLM Inference
- Heterogeneous Architecture
- Intel Xeon 6 Processors
- Intel AMX
- NVIDIA HGX B200
- Multi-Agent LLMs
Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Engineer, AI Architect
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence (AI) articles.