Intel® Xeon® 6 Processors and Nutanix Enterprise AI Deliver 2x Higher Throughput on LLM Inferencing

2026-05-12 · Source: Artificial Intelligence (AI) articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, short

Summary

Intel® Xeon® 6 processors, combined with Nutanix Enterprise AI (NAI), achieve up to 2x higher inference throughput for Llama3.1-8B-Instruct workloads compared to 5th Gen Intel® Xeon® processors. This performance was demonstrated on conversational AI applications like Chatbots, meeting a 100ms Time per Output Token (TPOT) SLA. NAI, built on the Nutanix Kubernetes Platform (NKP), streamlines AI model orchestration and deployment by using Virtualized Large Language Models (vLLMs) and Intel® Advanced Matrix Extensions (Intel® AMX) for CPU-based inference. vLLM v0.13.0, optimized with Intel PyTorch Extensions (IPEX), provides high-throughput capabilities through dynamic batching, asynchronous execution, and memory optimization. The benchmark used NAI v2.5 and specific hardware configurations, highlighting efficient scaling and resource utilization for real-time, multi-user AI deployments.

Key takeaway

For MLOps Engineers scaling LLM inference on existing CPU infrastructure, consider Intel Xeon 6 processors with Nutanix Enterprise AI. This combination delivers up to 2x higher throughput for Llama3.1-8B-Instruct workloads, significantly improving efficiency and reducing costs. You can achieve predictable performance and streamlined operations, ensuring high concurrency and rapid response times for your real-time, multi-user AI deployments. Evaluate NAI v2.5 and vLLM v0.13.0 to optimize your CPU-based GenAI serving.

Key insights

Intel Xeon 6 processors with Nutanix Enterprise AI double LLM inference throughput on CPUs, enabling cost-effective GenAI.

Principles

CPU-based inferencing offers a cost-efficient option for GenAI deployment on existing infrastructure.
Intelligent workload scheduling ensures balanced resource utilization and continuous operation for fluctuating AI workloads.
Optimized frameworks like vLLM provide high-throughput inference while maintaining CPU cost-effectiveness.

Method

Nutanix Enterprise AI (NAI) orchestrates and deploys AI models using Virtualized Large Language Models (vLLMs) on Intel Xeon 6 processors, utilizing Intel AMX for matrix math acceleration and IPEX for PyTorch optimization.

In practice

Deploy Llama3.1-8B-Instruct on Intel Xeon 6 for 2x throughput gains.
Utilize vLLM with Intel AMX and IPEX for CPU-optimized LLM serving.
Implement NAI for streamlined AI model deployment and inference management.

Topics

Intel Xeon 6 Processors
Nutanix Enterprise AI
LLM Inference
vLLM Framework
CPU Optimization
Llama 3.1-8B Instruct

Best for: NLP Engineer, CTO, VP of Engineering/Data, MLOps Engineer, AI Engineer, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence (AI) articles.