PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

PALS is a power-aware runtime for large language model (LLM) serving that addresses the significant GPU energy consumption of data centers by treating the GPU power cap as a controllable resource. Prior serving systems optimize throughput and latency while treating GPU power as static; PALS instead optimizes the power cap jointly with software parameters such as batch size. It combines lightweight offline power-performance models with a feedback-driven controller to meet throughput targets while maximizing energy efficiency. PALS is implemented within the vLLM framework and requires no model retraining or API changes. Across multi-GPU systems and both dense and Mixture-of-Experts (MoE) models, it improves energy efficiency by up to 26.3% and reduces Quality of Service (QoS) violations by 4x to 7x under power constraints, and it can also track dynamic power budgets. These capabilities are a step toward more energy-proportional and grid-interactive AI systems.

Key takeaway

For MLOps engineers managing LLM inference infrastructure, PALS shows that the GPU power cap can be tuned alongside serving parameters rather than left fixed. If you are dealing with high GPU energy costs or QoS violations under power constraints, consider a power-aware runtime such as PALS. The approach manages GPU power caps and batch sizes dynamically, with reported gains of up to 26.3% in energy efficiency and 4x to 7x fewer QoS violations, without model retraining. As a first step, check whether your existing LLM serving framework supports power-aware extensions.

Key insights

PALS improves the energy efficiency of LLM serving by dynamically controlling GPU power caps and batch sizes, guided by offline power-performance models and a feedback-driven controller.

Principles

Method

PALS combines lightweight offline power-performance models with a feedback-driven controller. It jointly optimizes GPU power caps and software parameters like batch size to meet throughput targets and maximize energy efficiency.
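
The combination described above can be illustrated with a minimal Python sketch. It is not the PALS implementation: the `Config` type, the `PROFILE` table and its numbers, the `pick_config` and `update_correction` helpers, and the simple multiplicative correction are all hypothetical choices made for illustration. The sketch assumes an offline profile that maps each (power cap, batch size) pair to a predicted throughput and power draw, picks the most energy-efficient pair predicted to meet a throughput target within a power budget, and uses measured throughput as feedback to correct the offline predictions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    power_cap_w: int  # GPU power cap in watts
    batch_size: int   # maximum batch size used by the server


# Hypothetical offline profile (made-up numbers, not measurements):
# config -> (predicted throughput in tokens/s, predicted power draw in W)
PROFILE = {
    Config(200, 16): (900.0, 190.0),
    Config(200, 32): (1300.0, 198.0),
    Config(300, 16): (1100.0, 260.0),
    Config(300, 32): (1800.0, 290.0),
    Config(400, 32): (2000.0, 380.0),
}


def pick_config(target_tps, power_budget_w, correction=1.0):
    """Pick the config with the best tokens-per-joule that meets the target.

    `correction` scales the offline throughput predictions; it is updated
    from runtime measurements by `update_correction`.
    """
    within_budget = [
        (cfg, tps * correction, watts)
        for cfg, (tps, watts) in PROFILE.items()
        if cfg.power_cap_w <= power_budget_w
    ]
    if not within_budget:
        raise ValueError("no profiled config fits the power budget")

    feasible = [item for item in within_budget if item[1] >= target_tps]
    if feasible:
        # tokens/s divided by watts is tokens per joule
        return max(feasible, key=lambda item: item[1] / item[2])[0]

    # Target unreachable within the budget: fall back to the fastest config.
    return max(within_budget, key=lambda item: item[1])[0]


def update_correction(correction, cfg, measured_tps, gain=0.5):
    """Blend the measured/predicted throughput ratio into the correction."""
    predicted_tps = PROFILE[cfg][0]
    ratio = measured_tps / predicted_tps
    return (1.0 - gain) * correction + gain * ratio


if __name__ == "__main__":
    correction = 1.0
    cfg = pick_config(target_tps=1200, power_budget_w=400, correction=correction)
    print("initial choice:", cfg)

    # Suppose the server delivers less than the offline model predicted.
    correction = update_correction(correction, cfg, measured_tps=1040.0)
    cfg = pick_config(target_tps=1200, power_budget_w=400, correction=correction)
    print("after feedback:", cfg)
```

With these made-up numbers, the first call selects the lowest power cap that still meets the target, and the feedback step moves to a higher cap once measured throughput falls short of the prediction. Applying the chosen cap to real hardware is left out of the sketch; on NVIDIA GPUs that step would typically go through a vendor interface such as NVML or `nvidia-smi -pl`.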

Best for: Machine Learning Engineer, NLP Engineer, CTO, MLOps Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.