Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

2026-06-10 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Evoflux is an inference-time evolutionary search method that improves tool use in compact language models (LMs) by repairing executable tool workflows. Small LMs often fail at complex MCP-style (Model Context Protocol) tool use, particularly in tool resolution, parameter validation, and dependency tracking, and small-corpus distillation does not effectively address these failures. Evoflux instead evolves typed workflow graphs through structured edits, execution feedback, and diversity pruning. On held-out MCP-Bench tasks involving live MCP servers and 250 tools, it raised execution feasibility for small planners from approximately 3% to 17-24%. SFT and SFT+DPO performed worse, and ReAct showed higher variance and token cost, which suggests that Evoflux is the more reliable option under limited teacher-trace budgets.
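To make the three failure modes concrete, here is a minimal Python sketch of a static check over a workflow. It is an illustration only, not Evoflux's validator: the `ToolSpec`, `Step`, and `check_workflow` names, the `$step_id` reference convention, and the example catalog are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical catalog entry: a tool name and its required parameters."""
    name: str
    required: frozenset


@dataclass
class Step:
    """Hypothetical workflow step. An argument value of the form "$<step_id>"
    is treated as a reference to the output of an earlier step."""
    id: str
    tool: str
    args: dict


def check_workflow(steps, catalog):
    """Return a list of (step_id, failure_kind, detail) tuples."""
    errors = []
    seen = set()  # ids of steps that appear earlier in the workflow
    for step in steps:
        spec = catalog.get(step.tool)
        if spec is None:
            # Tool resolution: the planner named a tool that is not in the catalog.
            errors.append((step.id, "tool_resolution", step.tool))
        else:
            # Parameter validation: a required parameter was not supplied.
            for param in sorted(spec.required - set(step.args)):
                errors.append((step.id, "parameter_validation", param))
        for value in step.args.values():
            # Dependency tracking: a reference points at a step that has not run yet.
            if isinstance(value, str) and value.startswith("$") and value[1:] not in seen:
                errors.append((step.id, "dependency_tracking", value[1:]))
        seen.add(step.id)
    return errors


catalog = {
    "search": ToolSpec("search", frozenset({"query"})),
    "summarize": ToolSpec("summarize", frozenset({"text"})),
}
plan = [
    Step("s1", "search", {}),                 # missing the required "query"
    Step("s2", "sumarize", {"text": "$s3"}),  # misspelled tool, unknown reference
]
errors = check_workflow(plan, catalog)
```

A check like this only catches errors visible before execution. The summary describes Evoflux as relying on execution feedback from live servers, which a static pass cannot replace.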

Key takeaway

Machine learning engineers deploying compact language models for complex tool orchestration should consider inference-time evolutionary search methods such as Evoflux. In the reported benchmark, the approach raised execution feasibility from approximately 3% to 17-24% and outperformed fine-tuning baselines (SFT, SFT+DPO) when teacher-trace data was limited. Execution-grounded search is worth exploring where plan repair and adaptation to changing tool catalogs are critical.

Key insights

Evoflux uses inference-time evolutionary search to repair executable tool workflows, significantly improving compact LM tool use.

Principles

Small LMs struggle with complex tool orchestration.
Small-corpus distillation fails for recovery behavior.
Execution-grounded search enhances reliability.

Method

Evoflux evolves typed workflow graphs via structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning to repair failed plans.
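The loop described above can be sketched in Python as a toy. This is not the authors' implementation: the catalog, the mock `execute` function, the scoring rule, and every parameter value are assumptions made for illustration, workflows are simplified to linear chains rather than graphs, and meta-guided redesign is omitted.

```python
import random

# Hypothetical toy catalog: tool name -> (input type, output type).
CATALOG = {
    "search": ("query", "docs"),
    "rank": ("docs", "docs"),
    "summarize": ("docs", "text"),
    "format": ("text", "report"),
}


def execute(workflow, start="query", goal="report"):
    """Mock executor. Returns (score, index of the first failing step or None)."""
    current = start
    for i, tool in enumerate(workflow):
        spec = CATALOG.get(tool)
        if spec is None or spec[0] != current:
            return i / (len(workflow) + 1), i  # type mismatch or unknown tool
        current = spec[1]
    if current != goal:
        return len(workflow) / (len(workflow) + 1), len(workflow)
    return 1.0, None


def mutate(workflow, fail_at, n_edits, rng):
    """Structured edits (replace, insert, delete) aimed at the failure site."""
    steps = list(workflow)
    for _ in range(n_edits):
        pos = min(fail_at if fail_at is not None else 0, len(steps))
        op = rng.choice(["replace", "insert", "delete"])
        tool = rng.choice(sorted(CATALOG))
        if op == "replace" and pos < len(steps):
            steps[pos] = tool
        elif op == "delete" and pos < len(steps):
            del steps[pos]
        else:  # insert, or an edit that fell past the end of the workflow
            steps.insert(pos, tool)
    return tuple(steps)


def evolve(seed, generations=30, pop_size=8, rng=None):
    rng = rng or random.Random(0)
    population = [tuple(seed)]
    best_score, stale = execute(seed)[0], 0
    for _ in range(generations):
        intensity = 1 + min(stale, 2)  # adaptive intensity: more edits when stuck
        children = []
        for parent in population:
            _, fail_at = execute(parent)  # execution feedback locates the repair
            children += [mutate(parent, fail_at, intensity, rng) for _ in range(3)]
        ranked = sorted(set(population) | set(children),
                        key=lambda w: (-execute(w)[0], len(w), w))
        # Crude diversity pruning: keep one workflow per (failure index, length).
        population, seen = [], set()
        for w in ranked:
            sig = (execute(w)[1], len(w))
            if sig not in seen and len(population) < pop_size:
                seen.add(sig)
                population.append(w)
        top = execute(population[0])[0]
        best_score, stale = (top, 0) if top > best_score else (best_score, stale + 1)
        if top == 1.0:
            break
    return population[0], best_score


best, score = evolve(("format", "search"))
```

Because parents compete with their children for survival, the best score never decreases between generations. In a real setting, `execute` would call live tool servers, which is where the token and latency cost of inference-time search comes from.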

In practice

Improve compact LM tool use reliability.
Enhance workflow feasibility from ~3% to 17-24%.
Outperform SFT/DPO in scarce teacher-trace settings.

Topics

Evoflux
Compact Language Models
Tool Use Agents
Evolutionary Search
Workflow Orchestration
Inference Optimization

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.