Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
Summary
Researchers at New York University developed a three-tier inference scaffolding pipeline that improves the performance of small LLM agents on complex tool-use tasks without additional training. Evaluated on the AppWorld benchmark using a single 24 GB GPU, raw Qwen3-8B achieved only 5.4% (FP16) and 3.0% (AWQ) task goal completion. The scaffolding deploys the same frozen model in three distinct roles (a summarizer, a main agent, and an isolated corrector) and raised those scores to 8.9% (FP16) and 5.9% (AWQ), roughly 1.6 times and nearly double the respective baselines. In FP16, the scaffolded 8B model surpassed DeepSeek-Coder 33B Instruct's 7.1% score on the same benchmark, demonstrating that structured inference-time interventions can make smaller models competitive with systems four times their size by mitigating mechanical failures such as authentication issues and API schema mismatches.
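The relative gains behind these figures can be checked with a few lines of arithmetic. The sketch below uses only the percentages quoted in this summary; the 33-to-8 size ratio uses the nominal parameter counts in the model names, which is an assumption about how "four times their size" was derived.

```python
# Task goal completion rates (%) as quoted in the summary.
scores = {
    "FP16": {"raw": 5.4, "scaffolded": 8.9},
    "AWQ": {"raw": 3.0, "scaffolded": 5.9},
}
DEEPSEEK_CODER_33B = 7.1  # reference score quoted in the summary


def relative_gain(raw: float, scaffolded: float) -> float:
    """Return the scaffolded score as a multiple of the raw score."""
    return scaffolded / raw


# Improvement factor per precision setting, rounded to two decimals.
gains = {name: round(relative_gain(**s), 2) for name, s in scores.items()}

# Which scaffolded configurations beat the 33B reference score.
beats_33b = {name: s["scaffolded"] > DEEPSEEK_CODER_33B for name, s in scores.items()}

# Nominal size ratio (assumption: parameter counts taken from the model names).
size_ratio = round(33 / 8, 1)

print(gains, beats_33b, size_ratio)
```

Only the FP16 configuration clears the 7.1% reference; the AWQ configuration roughly doubles its own baseline but stays below it.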
Key takeaway
AI engineers deploying LLM agents on resource-constrained hardware should consider inference-time scaffolding to improve small-model performance. In the reported results, the approach raised task goal completion for an 8B model by roughly 1.6 to 2 times, and it addresses common mechanical failures such as authentication and API compliance errors without costly retraining. Focus on a modular design in which the same model is specialized for context summarization, primary agency, and isolated correction. Removing the mechanical failures also exposes the underlying reasoning limitations, which become the target for future optimization.
Key insights
Inference-time scaffolding can significantly boost small LLM agent performance on complex tasks without retraining.
Principles
- Role specialization improves frozen model efficacy.
- Mechanical failures mask core reasoning limitations.
- Context isolation can regularize correction policies.
Method
A three-tier pipeline uses the same frozen model as a summarizer for context compression, a main agent for action generation, and an isolated corrector for code review and revision based on execution feedback and API documentation.
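The control flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ScaffoldedAgent` class, the `generate` and `execute` callables, the prompt wording, the character budget, and the single-retry policy are all hypothetical choices made for the example. The one property taken from the description is that the corrector sees only the code, the execution feedback, and the API documentation, never the conversation history.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces: one frozen model behind `generate`,
# and a sandbox behind `execute` returning (success, feedback).
Generate = Callable[[str], str]
Execute = Callable[[str], Tuple[bool, str]]


@dataclass
class ScaffoldedAgent:
    generate: Generate
    execute: Execute
    api_docs: str
    max_history_chars: int = 2000  # illustrative budget, not from the paper
    history: List[str] = field(default_factory=list)

    def _summarize(self) -> str:
        """Role 1 (summarizer): compress history once it exceeds the budget."""
        context = "\n".join(self.history)
        if len(context) <= self.max_history_chars:
            return context
        summary = self.generate(
            "ROLE: summarizer\nCompress this history; keep tokens, IDs and "
            "API names verbatim:\n" + context
        )
        self.history = [summary]
        return summary

    def _act(self, task: str, context: str) -> str:
        """Role 2 (main agent): propose the next code action."""
        return self.generate(
            f"ROLE: agent\nTask: {task}\nContext:\n{context}\n"
            "Write the next code action."
        )

    def _correct(self, code: str, feedback: str) -> str:
        """Role 3 (isolated corrector): no access to self.history."""
        return self.generate(
            f"ROLE: corrector\nAPI docs:\n{self.api_docs}\nCode:\n{code}\n"
            f"Execution feedback:\n{feedback}\nReturn revised code."
        )

    def step(self, task: str) -> Tuple[str, bool]:
        context = self._summarize()
        code = self._act(task, context)
        ok, feedback = self.execute(code)
        if not ok:  # assumption: one correction attempt per failed action
            code = self._correct(code, feedback)
            ok, feedback = self.execute(code)
        self.history.append(f"code: {code}\nresult: {feedback}")
        return code, ok
```

Because all three roles call the same `generate` function, no extra weights are loaded; only the prompts and the visible context differ between roles.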
In practice
- Implement a summarization module to preserve critical artifacts.
- Use an isolated correction model for robust code revision.
- Prioritize numerical precision over extended context for some tasks.
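One way to approach the first point is to check a model-written summary against the raw history and re-attach any critical artifact it dropped. The sketch below is an assumption-laden illustration: the regular expressions, the `summarize` callable, and the "Preserved artifacts" footer are hypothetical, and real artifact patterns would depend on the target environment.

```python
import re
from typing import Callable, List

# Hypothetical patterns for values that must survive compression verbatim.
ARTIFACT_PATTERNS = [
    re.compile(r"access_token\s*=\s*['\"]?[\w\-.]+['\"]?"),  # auth tokens
    re.compile(r"\b\w+_id\s*=\s*\d+"),                        # numeric IDs
]


def extract_artifacts(text: str) -> List[str]:
    """Collect unique artifact strings in order of first appearance."""
    found: List[str] = []
    for pattern in ARTIFACT_PATTERNS:
        for match in pattern.findall(text):
            if match not in found:
                found.append(match)
    return found


def summarize_preserving_artifacts(
    history: str, summarize: Callable[[str], str]
) -> str:
    """Summarize, then append any artifact the summary failed to keep."""
    summary = summarize(history)
    missing = [a for a in extract_artifacts(history) if a not in summary]
    if missing:
        summary += "\nPreserved artifacts:\n" + "\n".join(missing)
    return summary
```

A post-hoc check like this keeps credentials and identifiers available to the main agent even when the summarizer paraphrases them away, which targets the authentication failures mentioned in the summary.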
Topics
- Inference Scaffolding
- LLM Agents
- AppWorld Benchmark
- Qwen3-8B Performance
- Context Management
Best for: AI Engineer, Research Scientist, AI Architect, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.