Intel® Xeon® 6 Processors and Nutanix Enterprise AI Deliver 2x Higher Throughput on LLM Inferencing
Summary
Intel® Xeon® 6 processors, combined with Nutanix Enterprise AI (NAI), achieve up to 2x higher inference throughput for Llama3.1-8B-Instruct workloads compared to 5th Gen Intel® Xeon® processors. This performance was demonstrated on conversational AI applications like Chatbots, meeting a 100ms Time per Output Token (TPOT) SLA. NAI, built on the Nutanix Kubernetes Platform (NKP), streamlines AI model orchestration and deployment by using Virtualized Large Language Models (vLLMs) and Intel® Advanced Matrix Extensions (Intel® AMX) for CPU-based inference. vLLM v0.13.0, optimized with Intel PyTorch Extensions (IPEX), provides high-throughput capabilities through dynamic batching, asynchronous execution, and memory optimization. The benchmark used NAI v2.5 and specific hardware configurations, highlighting efficient scaling and resource utilization for real-time, multi-user AI deployments.
Key takeaway
For MLOps Engineers scaling LLM inference on existing CPU infrastructure, consider Intel Xeon 6 processors with Nutanix Enterprise AI. This combination delivers up to 2x higher throughput for Llama3.1-8B-Instruct workloads, significantly improving efficiency and reducing costs. You can achieve predictable performance and streamlined operations, ensuring high concurrency and rapid response times for your real-time, multi-user AI deployments. Evaluate NAI v2.5 and vLLM v0.13.0 to optimize your CPU-based GenAI serving.
Key insights
Intel Xeon 6 processors with Nutanix Enterprise AI double LLM inference throughput on CPUs, enabling cost-effective GenAI.
Principles
- CPU-based inferencing offers a cost-efficient option for GenAI deployment on existing infrastructure.
- Intelligent workload scheduling ensures balanced resource utilization and continuous operation for fluctuating AI workloads.
- Optimized frameworks like vLLM provide high-throughput inference while maintaining CPU cost-effectiveness.
Method
Nutanix Enterprise AI (NAI) orchestrates and deploys AI models using Virtualized Large Language Models (vLLMs) on Intel Xeon 6 processors, utilizing Intel AMX for matrix math acceleration and IPEX for PyTorch optimization.
In practice
- Deploy Llama3.1-8B-Instruct on Intel Xeon 6 for 2x throughput gains.
- Utilize vLLM with Intel AMX and IPEX for CPU-optimized LLM serving.
- Implement NAI for streamlined AI model deployment and inference management.
Topics
- Intel Xeon 6 Processors
- Nutanix Enterprise AI
- LLM Inference
- vLLM Framework
- CPU Optimization
- Llama 3.1-8B Instruct
Best for: NLP Engineer, CTO, VP of Engineering/Data, MLOps Engineer, AI Engineer, Machine Learning Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence (AI) articles.