EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
Summary
EvoTest is an evolutionary test-time learning framework that lets AI agents learn complex skills on the fly, without fine-tuning or gradients. Current agents struggle to adapt in novel environments. To measure this, the work introduces the Jericho Test-Time Learning (J-TTL) benchmark, in which an agent plays the same game for consecutive episodes and is expected to improve its performance across them. EvoTest uses a two-agent system: an Actor Agent plays the game, and an Evolver Agent analyzes episode transcripts and proposes revised configurations. A revision can rewrite prompts, update structured memory with effective state-action choices, tune hyperparameters, and refine tool-use routines. On J-TTL, EvoTest consistently outperforms existing adaptation methods such as reflection, memory, and online fine-tuning. It achieves a 38% improvement over the strongest prompt-evolution baseline and a 57% improvement over online RL, and it wins two games (Detective and Library) where the baselines failed.
Key takeaway
For machine learning engineers building autonomous agents, EvoTest offers a framework for rapid, in-session self-improvement. Consider its gradient-free, whole-system evolution approach, particularly in sparse-reward environments: on J-TTL it outperforms fine-tuning and reflection methods by using rich narrative feedback from episode transcripts, which makes learning more efficient and stable. The result can be agents that adapt to novel tasks without extensive retraining.
Key insights
EvoTest enables AI agents to self-improve at test time by evolving their entire configuration using narrative feedback.
Principles
- Holistic system evolution improves agent performance.
- Narrative analysis is more data-efficient than scalar rewards.
- Upper Confidence Bound (UCB) selection stabilizes learning and guards against performance drops.
Method
EvoTest uses an Actor Agent to play the game and an Evolver Agent to analyze episode transcripts. The Evolver generates new candidate configurations by mutating prompts, updating memory, tuning hyperparameters, and refining tool-use routines. UCB is then used to select among the candidates.
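The selection step can be illustrated with the standard UCB1 rule. This is a minimal sketch under stated assumptions, not the paper's implementation: the summary does not say which UCB variant or exploration constant EvoTest uses, and the names `ConfigStats`, `ucb_select`, `record`, and the constant `c` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ConfigStats:
    """Running statistics for one candidate configuration (hypothetical structure)."""
    name: str
    total_score: float = 0.0
    plays: int = 0

    @property
    def mean(self) -> float:
        # Average episode score so far; 0.0 if the configuration was never played.
        return self.total_score / self.plays if self.plays else 0.0


def ucb_select(candidates: list[ConfigStats], c: float = 1.0) -> ConfigStats:
    """Pick a configuration with the standard UCB1 rule.

    Untried configurations are selected first. Otherwise the winner is the
    one with the highest mean score plus exploration bonus.
    """
    for cand in candidates:
        if cand.plays == 0:
            return cand
    total_plays = sum(cand.plays for cand in candidates)
    return max(
        candidates,
        key=lambda s: s.mean + c * math.sqrt(math.log(total_plays) / s.plays),
    )


def record(stats: ConfigStats, score: float) -> None:
    """Update a configuration's statistics after one episode."""
    stats.total_score += score
    stats.plays += 1
```

The exploration bonus shrinks as a configuration is played more often, so a candidate with one unlucky episode is still revisited before it is abandoned. Because game scores are not normalized, `c` would likely need to be scaled to the score range of the game at hand.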
In practice
- Use the J-TTL benchmark to evaluate on-the-fly learning.
- Implement a two-agent system, with one agent acting and the other evolving the configuration.
- Employ UCB to keep configuration selection stable.
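The points above can be sketched as a single session loop. This is an illustrative skeleton, not EvoTest's code: `run_session`, `toy_actor`, `toy_evolver`, `greedy_select`, and the `skill` field are hypothetical stand-ins. A real Actor would call a language model inside a game environment, and a real Evolver would read the transcript to rewrite the configuration.

```python
from typing import Callable

Config = dict  # hypothetical: would hold prompt, memory, hyperparameters, tool routines


def run_session(
    actor: Callable[[Config], tuple[str, float]],
    evolver: Callable[[Config, str], list[Config]],
    select: Callable[[list[Config], list[list[float]]], int],
    initial: Config,
    episodes: int,
) -> list[float]:
    """Play the same game for `episodes` consecutive episodes, evolving between them."""
    pool = [initial]
    history: list[list[float]] = [[]]  # episode scores recorded per pool entry
    scores = []
    for _ in range(episodes):
        idx = select(pool, history)               # choose a configuration to play
        transcript, score = actor(pool[idx])      # Actor plays one episode
        history[idx].append(score)
        scores.append(score)
        for child in evolver(pool[idx], transcript):  # Evolver proposes revisions
            pool.append(child)
            history.append([])
    return scores


def toy_actor(config: Config) -> tuple[str, float]:
    # Stand-in for a real episode: the score is just the config's "skill".
    return f"played with skill {config['skill']}", float(config["skill"])


def toy_evolver(config: Config, transcript: str) -> list[Config]:
    # Stand-in for transcript analysis: propose one slightly better child.
    return [{"skill": config["skill"] + 1}]


def greedy_select(pool: list[Config], history: list[list[float]]) -> int:
    # Try untested configurations first, then fall back to the best mean score.
    for i, h in enumerate(history):
        if not h:
            return i
    return max(range(len(pool)), key=lambda i: sum(history[i]) / len(history[i]))


if __name__ == "__main__":
    print(run_session(toy_actor, toy_evolver, greedy_select, {"skill": 0}, 5))
```

The `select` argument is kept pluggable so that the greedy rule used here for brevity can be swapped for a UCB rule, which is what the summary recommends. The returned per-episode scores form the learning curve that a J-TTL-style evaluation would inspect.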
Topics
- EvoTest
- Test-Time Learning
- Jericho Test-Time Learning (J-TTL)
- Agentic Systems
- Evolutionary Algorithms
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.