Intel® Xeon® 6 Processors and Intel® AMX Deliver More Concurrent Users with NVIDIA HGX B200 Systems

2026-06-25 · Source: Artificial Intelligence (AI) articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, short

Summary

Intel's blog introduces a heterogeneous architecture for large language model (LLM) serving that runs vLLM instances on both CPUs and GPUs to raise system efficiency and user concurrency. The approach pairs Intel Xeon 6 processors featuring Intel Advanced Matrix Extensions (Intel AMX) with NVIDIA HGX B200 systems. A CPU pathway runs a Llama-3.1-8B model for lightweight tasks such as research extraction and validation, while a GPU pathway runs a Llama-3.1-405B model for compute-intensive generation. In evaluations on a Supermicro SYS-822GS-NBRT HGX platform with 2 x Intel Xeon 6776P processors and 8 x NVIDIA HGX B200 GPUs, the design delivered up to 1.44x higher user concurrency. The gain comes from keeping both CPU and GPU resources continuously busy through multi-agent pipelining: research, writing, and reviewing tasks overlap, which reduces end-to-end latency compared with sequential execution.

Key takeaway

If you are an MLOps engineer optimizing LLM serving systems, consider a heterogeneous CPU-GPU architecture to maximize resource utilization and user concurrency. By offloading lightweight reasoning tasks to Intel Xeon 6 processors with Intel AMX and reserving GPUs for intensive generation, you can achieve up to 1.44x higher concurrency. This approach reduces GPU idle time and stabilizes tail latency, making production-scale vLLM deployments more efficient and scalable. Evaluate pipelined multi-agent designs to raise throughput further.

Key insights

Heterogeneous CPU-GPU co-serving with Intel AMX raises LLM inference concurrency (up to 1.44x in the reported evaluation) and improves resource efficiency.

Principles

Distribute LLM tasks by complexity.
Pipeline multi-agent execution.
Engage all compute resources continuously.

Method

Implement a multi-agent system with CPU-based Researcher/Reviewer agents and a GPU-based Writer agent, enabling pipelined, overlapping execution for reduced latency and improved throughput.
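The pipelined execution described above can be sketched with Python's asyncio. This is a minimal illustration rather than Intel's implementation: the research, write, and review coroutines are hypothetical stand-ins that simulate latency with sleeps, where a real deployment would call CPU-hosted and GPU-hosted model endpoints. Each stage runs as its own worker connected by queues, so the Writer can process one request while the Researcher has already started on the next.

```python
import asyncio

# Hypothetical stand-ins for model calls (not from the article). A real system
# would send requests to a CPU-hosted and a GPU-hosted serving endpoint.
async def research(topic: str) -> str:
    await asyncio.sleep(0.01)  # simulated CPU-side latency
    return f"notes({topic})"

async def write(notes: str) -> str:
    await asyncio.sleep(0.02)  # simulated GPU-side latency
    return f"draft({notes})"

async def review(draft: str) -> str:
    await asyncio.sleep(0.01)  # simulated CPU-side latency
    return f"reviewed({draft})"

async def stage(fn, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Pull items from inbox, process them, and push results to outbox."""
    while True:
        item = await inbox.get()
        if item is None:            # sentinel: pass shutdown downstream
            await outbox.put(None)
            return
        await outbox.put(await fn(item))

async def run_pipeline(topics):
    """Run Researcher -> Writer -> Reviewer as overlapping stages."""
    q_in, q_notes, q_drafts, q_out = (asyncio.Queue() for _ in range(4))
    workers = [
        asyncio.create_task(stage(research, q_in, q_notes)),
        asyncio.create_task(stage(write, q_notes, q_drafts)),
        asyncio.create_task(stage(review, q_drafts, q_out)),
    ]
    for topic in topics:
        await q_in.put(topic)
    await q_in.put(None)            # no more work

    results = []
    while (item := await q_out.get()) is not None:
        results.append(item)
    await asyncio.gather(*workers)
    return results

if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["topic-a", "topic-b"])))
```

Because each stage has a single worker and the queues are first-in, first-out, results come back in submission order. A sequential design would instead finish all three steps for one request before starting the next, leaving one class of hardware idle at each step.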

In practice

Use Llama-3.1-8B on CPU for lightweight reasoning.
Deploy Llama-3.1-405B on GPU for complex generation.
Leverage Intel AMX for CPU acceleration.
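One way to express the model placement above is a small routing table that maps each agent role to a backend. The sketch below is illustrative only: the Backend type, the role names as dictionary keys, and the localhost URLs are assumptions made for the example, not details from the article. Only the model names and the CPU/GPU split come from the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Backend:
    model: str
    device: str
    base_url: str  # hypothetical address, chosen only for illustration

# Model names follow the article; the URLs are placeholders.
CPU_BACKEND = Backend("Llama-3.1-8B", "cpu", "http://localhost:8001/v1")
GPU_BACKEND = Backend("Llama-3.1-405B", "gpu", "http://localhost:8000/v1")

# Lightweight agents go to the CPU pathway, generation to the GPU pathway.
ROUTES = {
    "researcher": CPU_BACKEND,
    "reviewer": CPU_BACKEND,
    "writer": GPU_BACKEND,
}

def backend_for(role: str) -> Backend:
    """Return the backend assigned to an agent role (case-insensitive)."""
    try:
        return ROUTES[role.lower()]
    except KeyError:
        raise ValueError(f"unknown agent role: {role!r}") from None
```

Keeping the mapping in one place makes it easy to move an agent between pathways when measuring how the split affects concurrency.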

Topics

LLM Inference
Heterogeneous Architecture
Intel Xeon 6 Processors
Intel AMX
NVIDIA HGX B200
Multi-Agent LLMs

Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence (AI) articles.