Cost-Aware Model Orchestration for LLM-based Systems

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

GUIDE is a new framework for Large Language Model (LLM)-orchestrated AI systems. Such systems currently select models from qualitative descriptions, which leads to suboptimal model selection, reduced accuracy, and increased energy costs. GUIDE instead incorporates quantitative model characteristics, such as accuracy and energy consumption, into the orchestrator's decision-making. An empirical analysis of JARVIS, a representative LLM-orchestrated framework, showed that existing methods suffer from task misclassification and popularity-based selection bias, and often choose models that are less accurate or less energy-efficient than the available alternatives. GUIDE, by contrast, increases accuracy by 0.90%–11.92% across various tasks, improves energy efficiency (measured as Accuracy-per-Joule) by up to 54%, and reduces orchestrator model selection latency from 4.51 seconds to 7.2 milliseconds. The framework combines an energy budget tracker with a Pareto-optimization-based model selector to balance performance against energy use.

Key takeaway

For AI engineers designing or optimizing LLM-orchestrated systems, relying solely on qualitative model descriptions or the LLM's internal knowledge for model selection is inefficient and costly. Integrate quantitative accuracy and energy metrics, like those used in GUIDE, to enable data-driven, Pareto-optimized model choices. In the reported results, this approach raised accuracy by 0.90%–11.92%, improved energy efficiency (Accuracy-per-Joule) by up to 54%, and cut model selection latency from 4.51 seconds to 7.2 milliseconds.

Key insights

Integrating quantitative performance and energy metrics into LLM orchestration significantly improves accuracy and efficiency.

Principles

Qualitative model descriptions lead to suboptimal selections.
Quantitative metrics enable Pareto-efficient model choices.
Real-time energy monitoring keeps model selection within the user-defined energy budget.

Method

GUIDE employs an energy budget tracker for real-time GPU energy monitoring and a model selector that uses Pareto optimization on accuracy-energy trade-offs to choose the most accurate model within a user-defined energy budget.
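
The selection step described above can be illustrated with a minimal sketch. It is not GUIDE's actual implementation: the `ModelProfile` type, the function names, and the profiled numbers are all hypothetical, and it assumes each candidate model has a single profiled accuracy and a single per-inference energy cost for the task at hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical profile record; field names are assumptions."""
    name: str
    accuracy: float       # profiled task accuracy in [0, 1]
    energy_joules: float  # profiled energy per inference, in joules


def pareto_front(profiles: List[ModelProfile]) -> List[ModelProfile]:
    """Return the models that no other model dominates.

    Model q dominates p if q is at least as accurate and at most as
    costly as p, and strictly better on one of the two criteria.
    """
    front = []
    for p in profiles:
        dominated = any(
            q.accuracy >= p.accuracy
            and q.energy_joules <= p.energy_joules
            and (q.accuracy > p.accuracy or q.energy_joules < p.energy_joules)
            for q in profiles
        )
        if not dominated:
            front.append(p)
    # Sort by energy cost so the front reads cheapest to most expensive.
    return sorted(front, key=lambda m: m.energy_joules)


def select_model(
    profiles: List[ModelProfile], remaining_budget_joules: float
) -> Optional[ModelProfile]:
    """Pick the most accurate Pareto-optimal model that fits the budget."""
    affordable = [
        m for m in pareto_front(profiles)
        if m.energy_joules <= remaining_budget_joules
    ]
    if not affordable:
        return None  # no candidate fits; the caller decides how to fall back
    return max(affordable, key=lambda m: m.accuracy)


# Illustrative, made-up profiles for one task.
candidates = [
    ModelProfile("model-a", accuracy=0.80, energy_joules=50.0),
    ModelProfile("model-b", accuracy=0.85, energy_joules=120.0),
    ModelProfile("model-c", accuracy=0.78, energy_joules=90.0),  # dominated by model-a
    ModelProfile("model-d", accuracy=0.91, energy_joules=300.0),
]
choice = select_model(candidates, remaining_budget_joules=150.0)
```

Because this kind of selection is a lookup over precomputed profiles rather than an LLM call, it is plausible that it runs in milliseconds, which is consistent with the latency reduction reported in the summary.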

In practice

Profile models for accuracy and energy consumption.
Implement real-time GPU energy monitoring.
Apply Pareto optimization for model selection.
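
The monitoring step above could look roughly like the following sketch. The `EnergyBudgetTracker` class and `accuracy_per_joule` helper are hypothetical names, not part of GUIDE. The tracker assumes a caller-supplied function that returns a cumulative energy reading in joules; on NVIDIA GPUs such a function might wrap NVML's total-energy counter (which reports millijoules), but that wiring is left out here so the example runs without a GPU.

```python
from typing import Callable


class EnergyBudgetTracker:
    """Tracks energy consumed since creation against a fixed budget.

    `read_energy_joules` is an assumed hook: any callable that returns
    a monotonically increasing cumulative energy reading in joules.
    """

    def __init__(
        self, budget_joules: float, read_energy_joules: Callable[[], float]
    ) -> None:
        if budget_joules < 0:
            raise ValueError("budget_joules must be non-negative")
        self._budget = budget_joules
        self._read = read_energy_joules
        self._start = read_energy_joules()  # baseline counter value

    def consumed(self) -> float:
        """Joules used since the tracker was created."""
        return max(0.0, self._read() - self._start)

    def remaining(self) -> float:
        """Joules left in the budget, never below zero."""
        return max(0.0, self._budget - self.consumed())

    def can_afford(self, estimated_joules: float) -> bool:
        """True if a model's profiled cost fits in the remaining budget."""
        return estimated_joules <= self.remaining()


def accuracy_per_joule(accuracy: float, energy_joules: float) -> float:
    """Efficiency metric: profiled accuracy divided by energy used."""
    if energy_joules <= 0:
        raise ValueError("energy_joules must be positive")
    return accuracy / energy_joules


# Usage with a fake counter standing in for a real GPU energy reading.
fake_counter = [1000.0]
tracker = EnergyBudgetTracker(100.0, lambda: fake_counter[0])
fake_counter[0] = 1040.0  # pretend 40 J were consumed
```

Injecting the reader keeps the tracker testable and independent of any one vendor's monitoring library.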

Topics

LLM Orchestration
Model Selection
Energy Efficiency
Performance-Energy Trade-offs
GUIDE Framework

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.