Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Researchers at New York University developed a three-tier inference-time scaffolding pipeline that improves the performance of small language model agents on complex tool-use tasks without additional training. Evaluated on the AppWorld benchmark on a single 24 GB GPU, the raw Qwen3-8B model achieved only 5.4% (FP16) and 3.0% (AWQ) task goal completion. The scaffolding deploys the same frozen model in three distinct roles (a summarizer, a main agent, and an isolated corrector) and raised those scores to 8.9% (FP16) and 5.9% (AWQ), relative gains of roughly 65% and 97%. The scaffolded FP16 model surpassed DeepSeek-Coder 33B Instruct's 7.1% score on the same benchmark, suggesting that structured inference-time interventions can make smaller models competitive with systems roughly four times their size by mitigating mechanical failures such as authentication issues and API schema mismatches.

Key takeaway

AI engineers deploying LLM agents on resource-constrained hardware should consider inference-time scaffolding to improve small-model performance. The approach raised task completion for an 8B model by roughly 65% (FP16) to 97% (AWQ) in relative terms, and it addresses common mechanical failures such as authentication and API compliance errors without costly retraining. Focus on a modular design in which the same model is assigned to context summarization, primary agency, and isolated correction. Removing the mechanical failures also exposes the underlying reasoning limitations, which become the target for future optimization.

Key insights

Inference-time scaffolding can significantly boost small LLM agent performance on complex tasks without retraining.

Principles

Role specialization improves frozen model efficacy.
Mechanical failures mask core reasoning limitations.
Context isolation can regularize correction policies.

Method

A three-tier pipeline uses the same frozen model as a summarizer for context compression, a main agent for action generation, and an isolated corrector for code review and revision based on execution feedback and API documentation.
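The pipeline described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Model` and `Executor` callables, the prompt strings, the `Pipeline` class, and the character-based summarization threshold are all hypothetical names and choices introduced here. The one property it tries to show is that a single frozen model serves all three roles, and that the corrector is isolated: it receives only the failing code, the execution feedback, and the API documentation, never the conversation history.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a model maps (system_prompt, user_prompt) -> completion,
# an executor maps code -> (succeeded, feedback).
Model = Callable[[str, str], str]
Executor = Callable[[str], Tuple[bool, str]]

# Placeholder role prompts; the real prompts are not given in the summary.
SUMMARIZER_PROMPT = "Compress the history; keep anything needed later."
AGENT_PROMPT = "Write code that makes progress on the task."
CORRECTOR_PROMPT = "Fix the code using the error and the API docs."


@dataclass
class Pipeline:
    model: Model              # the same frozen model plays every role
    execute: Executor
    api_docs: str
    max_history_chars: int = 2000   # assumed trigger for summarization
    history: List[str] = field(default_factory=list)

    def _context(self) -> str:
        """Role 1 (summarizer): compress history once it grows too long."""
        text = "\n".join(self.history)
        if len(text) > self.max_history_chars:
            text = self.model(SUMMARIZER_PROMPT, text)
            self.history = [text]
        return text

    def step(self, task: str) -> Tuple[bool, str]:
        # Role 2 (main agent): sees the task plus the (possibly summarized) context.
        context = self._context()
        code = self.model(AGENT_PROMPT, f"Task: {task}\nContext:\n{context}")
        ok, feedback = self.execute(code)

        if not ok:
            # Role 3 (isolated corrector): no history, only code, error, docs.
            corrector_input = (
                f"Code:\n{code}\nError:\n{feedback}\nAPI docs:\n{self.api_docs}"
            )
            code = self.model(CORRECTOR_PROMPT, corrector_input)
            ok, feedback = self.execute(code)

        self.history.append(f"code: {code}\nresult: {feedback}")
        return ok, feedback
```

In this sketch the corrector gets a single retry per step; how many correction rounds the original system allows is not stated in the summary.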

In practice

Implement a summarization module to preserve critical artifacts.
Use an isolated correction model for robust code revision.
Prioritize numerical precision over extended context for some tasks.
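One way to approach the first recommendation, preserving critical artifacts through summarization, is to extract them before compression and re-attach them verbatim afterwards. The sketch below is an assumption-laden illustration: the summary does not say what counts as a critical artifact, so treating `*_token` and `*_id` key/value pairs as artifacts, the regular expression, and the function names are all hypothetical. The `summarize` argument stands in for a call to the summarizer role.

```python
import re
from typing import Callable, Dict

# Assumed artifact shape: keys ending in _token or _id followed by : or =
# and a simple value (e.g. "access_token": "abc" or playlist_id = 42).
ARTIFACT_RE = re.compile(
    r"""\b(\w+(?:_token|_id))["']?\s*[:=]\s*["']?([\w.\-]+)"""
)


def extract_artifacts(history: str) -> Dict[str, str]:
    """Collect key/value artifacts; a later value for a key replaces an earlier one."""
    found: Dict[str, str] = {}
    for key, value in ARTIFACT_RE.findall(history):
        found[key] = value
    return found


def summarize_with_artifacts(history: str, summarize: Callable[[str], str]) -> str:
    """Summarize the history, then append extracted artifacts verbatim."""
    artifacts = extract_artifacts(history)
    summary = summarize(history)
    if not artifacts:
        return summary
    lines = [f"{key} = {value}" for key, value in artifacts.items()]
    return summary + "\n\nPreserved artifacts (verbatim):\n" + "\n".join(lines)
```

The design choice here is that values such as access tokens never pass through the model's paraphrasing, so a lossy summary cannot corrupt them. Whether the original system does this by extraction or by prompting alone is not stated.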

Topics

Inference Scaffolding
LLM Agents
AppWorld Benchmark
Qwen3-8B Performance
Context Management

Code references

Aimpoint-Digital/appworld-agent

Best for: AI Engineer, Research Scientist, AI Architect, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.