Cost-Aware Model Orchestration for LLM-based Systems

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

GUIDE is a new framework for Large Language Model (LLM)-orchestrated AI systems. Such systems currently select models from qualitative descriptions alone, which leads to suboptimal model selection, reduced accuracy, and increased energy costs. GUIDE instead feeds quantitative model performance characteristics, such as accuracy and energy consumption, into the selection decision. An empirical analysis of JARVIS, a representative LLM-orchestrated framework, showed that existing methods suffer from task misclassification and popularity-based selection bias, and often choose models that are less accurate or less energy-efficient than the available alternatives. GUIDE, by contrast, increases accuracy by 0.90%–11.92% across various tasks, improves energy efficiency (Accuracy-per-Joule) by up to 54%, and cuts orchestrator model selection latency from 4.51 seconds to 7.2 milliseconds. The framework uses an energy budget tracker and a Pareto-optimization-based model selector to balance performance against energy use.

Key takeaway

For AI Engineers designing or optimizing LLM-orchestrated systems, relying solely on qualitative model descriptions or the LLM's internal knowledge for model selection is inefficient and costly. Integrate quantitative performance and energy metrics, like those used in GUIDE, to enable data-driven, Pareto-optimized model choices. In the reported results, this approach raises accuracy by 0.90%–11.92%, improves energy efficiency (Accuracy-per-Joule) by up to 54%, and cuts model selection latency from 4.51 seconds to 7.2 milliseconds, making AI systems both more performant and more sustainable.

Key insights

Integrating quantitative performance and energy metrics into LLM orchestration significantly improves accuracy and efficiency.

Principles

Method

GUIDE employs an energy budget tracker for real-time GPU energy monitoring and a model selector that uses Pareto optimization on accuracy-energy trade-offs to choose the most accurate model within a user-defined energy budget.
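The selection step described above can be illustrated with a minimal Python sketch. It is not GUIDE's actual implementation: the names `ModelProfile`, `EnergyBudgetTracker`, `pareto_front`, and `select_model` are hypothetical, the accuracy and energy figures are assumed to come from offline profiling, and the tracker is fed energy readings by the caller rather than reading the GPU directly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model profile; values are assumed to be measured offline."""
    name: str
    accuracy: float        # task accuracy in [0, 1]
    energy_joules: float   # energy per inference, in joules


class EnergyBudgetTracker:
    """Assumed stand-in for a real-time tracker: the caller reports energy used."""

    def __init__(self, budget_joules: float) -> None:
        self.budget_joules = budget_joules
        self.used_joules = 0.0

    @property
    def remaining(self) -> float:
        return max(0.0, self.budget_joules - self.used_joules)

    def record(self, joules: float) -> None:
        self.used_joules += joules


def pareto_front(profiles: List[ModelProfile]) -> List[ModelProfile]:
    """Keep models that no other model beats on both accuracy and energy."""
    front = []
    for p in profiles:
        dominated = any(
            q.accuracy >= p.accuracy
            and q.energy_joules <= p.energy_joules
            and (q.accuracy > p.accuracy or q.energy_joules < p.energy_joules)
            for q in profiles
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda m: m.energy_joules)


def select_model(
    profiles: List[ModelProfile], tracker: EnergyBudgetTracker
) -> Optional[ModelProfile]:
    """Pick the most accurate Pareto-optimal model that fits the remaining budget."""
    affordable = [
        m for m in pareto_front(profiles) if m.energy_joules <= tracker.remaining
    ]
    if not affordable:
        return None
    return max(affordable, key=lambda m: m.accuracy)
```

Because the selection is a filter and a maximum over precomputed numbers rather than an LLM call, a design of this shape is at least consistent with the millisecond-scale selection latency reported for GUIDE.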

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.