Intel® Xeon® 6 Processors and Nutanix Enterprise AI Deliver 2x Higher Throughput on LLM Inferencing

· Source: Artificial Intelligence (AI) articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, short

Summary

Intel® Xeon® 6 processors, combined with Nutanix Enterprise AI (NAI), achieve up to 2x higher inference throughput for Llama3.1-8B-Instruct workloads compared to 5th Gen Intel® Xeon® processors. The result was demonstrated on a conversational AI (chatbot) workload while meeting a 100 ms Time per Output Token (TPOT) SLA. NAI, built on the Nutanix Kubernetes Platform (NKP), streamlines AI model orchestration and deployment. For CPU-based inference it uses vLLM, an open-source LLM inference and serving engine, together with Intel® Advanced Matrix Extensions (Intel® AMX). vLLM v0.13.0, optimized with Intel® Extension for PyTorch (IPEX), provides high throughput through dynamic batching, asynchronous execution, and memory optimization. The benchmark used NAI v2.5 and specific hardware configurations, and it highlights efficient scaling and resource utilization for real-time, multi-user AI deployments.
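To make the 100 ms TPOT SLA concrete, the sketch below shows one common way to compute TPOT and aggregate throughput from per-request timings. It is not the benchmark's own harness: the `RequestTiming` type, the function names, and the numbers in the usage note are illustrative assumptions, and TPOT is taken here as the mean time per generated token after the first one.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestTiming:
    """Timing of one streamed request (hypothetical record, times in seconds)."""
    ttft_s: float           # time to first token
    total_latency_s: float  # end-to-end latency of the request
    output_tokens: int      # number of generated tokens


def tpot_ms(req: RequestTiming) -> float:
    """Mean time per output token after the first one, in milliseconds."""
    if req.output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    decode_s = req.total_latency_s - req.ttft_s
    return 1000.0 * decode_s / (req.output_tokens - 1)


def meets_sla(requests: Iterable[RequestTiming], sla_ms: float = 100.0) -> bool:
    """True if every request stays within the TPOT target."""
    return all(tpot_ms(r) <= sla_ms for r in requests)


def throughput_tokens_per_s(requests: Iterable[RequestTiming], wall_clock_s: float) -> float:
    """Aggregate output tokens per second over the whole measurement window."""
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    return sum(r.output_tokens for r in requests) / wall_clock_s
```

For example, a request with `ttft_s=0.5`, `total_latency_s=8.5`, and `output_tokens=101` spends 8.0 s producing 100 further tokens, which gives a TPOT of 80 ms and sits inside a 100 ms target. Throughput comparisons like the one summarized here are only meaningful between runs that both satisfy the same SLA.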

Key takeaway

For MLOps engineers scaling LLM inference on existing CPU infrastructure, consider Intel Xeon 6 processors with Nutanix Enterprise AI. In the benchmark, this combination delivered up to 2x higher throughput than 5th Gen Intel Xeon processors for Llama3.1-8B-Instruct workloads while staying within a 100 ms TPOT SLA, so each server can serve more output at the same latency target. That supports high concurrency and fast response times for real-time, multi-user AI deployments. Evaluate NAI v2.5 and vLLM v0.13.0 to optimize your CPU-based GenAI serving.

Key insights

Intel Xeon 6 processors with Nutanix Enterprise AI deliver up to 2x the CPU-based LLM inference throughput of 5th Gen Intel Xeon processors, supporting cost-effective GenAI serving.

Principles

Method

Nutanix Enterprise AI (NAI) orchestrates and deploys AI models served by vLLM on Intel Xeon 6 processors, using Intel AMX to accelerate matrix math and IPEX to optimize PyTorch execution.
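Because this method depends on Intel AMX being available to the inference process, it can be useful to confirm that a Linux host or VM actually exposes the AMX instruction set before deploying. The sketch below is a minimal check, assuming a Linux system where `/proc/cpuinfo` lists the kernel's `amx_tile`, `amx_bf16`, and `amx_int8` feature flags. The helper names are hypothetical and are not part of NAI or vLLM.

```python
# CPU feature flags that the Linux kernel reports for Intel AMX.
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def amx_flags_present(cpuinfo_text: str) -> dict:
    """Map each AMX flag name to whether it appears in any 'flags' line."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return {name: name in flags for name in AMX_FLAGS}


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the kernel's CPU description (Linux only)."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


if __name__ == "__main__":
    try:
        report = amx_flags_present(read_cpuinfo())
    except OSError as exc:
        print(f"could not read /proc/cpuinfo: {exc}")
    else:
        for name, present in report.items():
            print(f"{name}: {'yes' if present else 'no'}")
```

In a virtualized deployment the hypervisor must also pass these features through to the guest, so a check like this is best run inside the VM or container that will host the inference endpoint.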

In practice

Topics

Best for: NLP Engineer, CTO, VP of Engineering/Data, MLOps Engineer, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence (AI) articles.