PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
Summary
PALS is a power-aware runtime for large language model (LLM) serving that addresses the high GPU energy consumption of data centers by treating power caps as a controllable resource. Unlike prior systems, which optimize throughput and latency while treating GPU power as static, PALS jointly optimizes power caps with software parameters such as batch size. It combines lightweight offline power-performance models with a feedback-driven controller to meet throughput targets while maximizing energy efficiency. Implemented within the vLLM framework, PALS requires no model retraining or API changes. Across multi-GPU systems and both dense and Mixture-of-Experts (MoE) models, PALS improves energy efficiency by up to 26.3% and reduces Quality of Service (QoS) violations by 4x to 7x under power constraints. It also tracks dynamic power budgets. Together, these capabilities enable more energy-proportional and grid-interactive AI systems.
Key takeaway
For MLOps engineers managing LLM inference infrastructure, PALS treats GPU power as a tunable parameter rather than a fixed constraint. If you are struggling with high GPU energy costs or QoS violations under power constraints, consider integrating a power-aware runtime like PALS. This approach manages GPU power caps and batch sizes dynamically, potentially improving energy efficiency by up to 26.3% and reducing QoS violations by 4x to 7x without model retraining. Check whether your existing LLM serving framework offers power-aware extensions.
Key insights
PALS improves LLM serving energy efficiency by dynamically controlling GPU power caps and batch sizes, guided by offline power-performance models and a feedback-driven controller.
Principles
- GPU power caps are a dynamic control knob.
- Jointly optimize power with software parameters.
- Feedback control improves energy efficiency.
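To make the joint-optimization principle concrete, here is a minimal Python sketch, not the PALS implementation. It assumes an offline profile that records throughput and power draw for each (power cap, batch size) pair, then picks the pair that meets a throughput target with the most tokens per joule. The `Config` class, the `pick_config` helper, and every number in `PROFILE` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    power_cap_w: int        # GPU power cap in watts
    batch_size: int         # serving batch size
    throughput_tps: float   # profiled tokens per second (illustrative value)
    power_w: float          # profiled average power draw in watts (illustrative value)

    @property
    def tokens_per_joule(self) -> float:
        # tokens/s divided by joules/s gives tokens per joule
        return self.throughput_tps / self.power_w


def pick_config(profile, target_tps):
    """Return the most energy-efficient config that meets the throughput target.

    If no profiled config meets the target, fall back to the highest-throughput one.
    """
    feasible = [c for c in profile if c.throughput_tps >= target_tps]
    if not feasible:
        return max(profile, key=lambda c: c.throughput_tps)
    return max(feasible, key=lambda c: c.tokens_per_joule)


# Hypothetical offline profile; these are NOT measurements from the paper.
PROFILE = [
    Config(200, 16, 900.0, 190.0),
    Config(200, 32, 1400.0, 198.0),
    Config(300, 16, 1100.0, 260.0),
    Config(300, 32, 1900.0, 290.0),
    Config(400, 32, 2100.0, 380.0),
]

if __name__ == "__main__":
    best = pick_config(PROFILE, target_tps=1500.0)
    print(best.power_cap_w, best.batch_size)
```

In this toy profile, the highest power cap is not the most efficient choice for a 1500 tokens/s target, which is the intuition behind treating the cap as a knob rather than a constant.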
Method
PALS combines lightweight offline power-performance models with a feedback-driven controller. It jointly optimizes GPU power caps and software parameters like batch size to meet throughput targets and maximize energy efficiency.
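The feedback half of this method can be illustrated with a simple step controller. The sketch below is an assumption about how such a loop might look, not the controller described in the paper: it raises the power cap when measured throughput falls short of the target, lowers it when throughput exceeds the target by more than a slack margin, and clamps the result to the allowed range. The function names, the step size, the slack, the cap limits, and the linear `toy_throughput` model are all hypothetical.

```python
def next_power_cap(cap_w, measured_tps, target_tps, *,
                   step_w=10, slack=0.05, min_w=150, max_w=400):
    """One feedback step: nudge the power cap toward the throughput target."""
    if measured_tps < target_tps:
        cap_w += step_w          # missing the target: allow more power
    elif measured_tps > target_tps * (1 + slack):
        cap_w -= step_w          # comfortable headroom: save energy
    # Keep the cap inside the range the hardware is assumed to allow.
    return max(min_w, min(max_w, cap_w))


def toy_throughput(cap_w):
    """Hypothetical stand-in for the real system: throughput grows linearly with the cap."""
    return 5.0 * cap_w


def simulate(start_cap_w, target_tps, steps):
    """Run the controller against the toy model and return the final cap."""
    cap = start_cap_w
    for _ in range(steps):
        cap = next_power_cap(cap, toy_throughput(cap), target_tps)
    return cap


if __name__ == "__main__":
    print(simulate(start_cap_w=200, target_tps=1500.0, steps=20))
```

In a real deployment the measured throughput would come from the serving runtime, and the chosen cap would be applied through the GPU's power management interface (on NVIDIA GPUs, for example, `nvidia-smi -pl <watts>`). An offline model could supply the starting cap so that the loop only has to correct small errors.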
In practice
- Integrate power control into LLM runtimes.
- Reduce QoS violations under power limits.
- Enable grid-interactive AI systems.
Topics
- LLM Serving
- Mixture-of-Experts
- GPU Power Management
- Energy Efficiency
- vLLM
- Data Center Optimization
Best for: Machine Learning Engineer, NLP Engineer, CTO, MLOps Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.