EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

2025-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

EvoTest is an evolutionary test-time learning framework that lets AI agents learn complex skills on the fly without fine-tuning or gradients. Current agents struggle to adapt in novel environments, and the paper measures this gap with the Jericho Test-Time Learning (J-TTL) benchmark, in which an agent plays the same game for several consecutive episodes and tries to improve its score from one episode to the next. EvoTest uses two agents: an Actor Agent that plays the game and an Evolver Agent that analyzes the episode transcript and proposes a revised configuration for the next run. A revision can rewrite the prompt, update structured memory with effective state-action choices, tune hyperparameters, and refine tool-use routines. On J-TTL, EvoTest consistently outperforms existing adaptation methods such as reflection, memory, and online fine-tuning, with a 38% improvement over the strongest prompt-evolution baseline and a 57% improvement over online RL. It also wins two games (Detective and Library) that the baselines fail to win.

Key takeaway

For Machine Learning Engineers developing autonomous agents, EvoTest offers a framework for rapid, in-session self-improvement. Consider adopting its gradient-free, whole-system evolution approach, particularly in sparse-reward environments: on J-TTL it outperforms online fine-tuning and reflection methods by learning from rich narrative feedback rather than scalar rewards alone. The result can be agents that adapt to novel tasks without extensive retraining.

Key insights

EvoTest enables AI agents to self-improve at test time by evolving their entire configuration using narrative feedback.

Principles

Holistic system evolution improves agent performance.
Narrative analysis is more data-efficient than scalar rewards.
Upper Confidence Bound (UCB) selection stabilizes learning and guards against performance drops.
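
To make the UCB principle concrete, here is the standard UCB score from the bandit literature, written for agent configurations. This is an illustrative form only: the symbols and the exploration constant are assumptions of this note, and the exact rule used in EvoTest may differ.

```latex
% Assumed notation (not taken from the paper):
%   \bar{r}_c : mean episode score observed so far for configuration c
%   n_c       : number of episodes played with configuration c
%   N         : total number of episodes played so far
%   \alpha    : exploration constant (\alpha = \sqrt{2} gives classic UCB1)
\[
  \mathrm{UCB}(c) \;=\; \bar{r}_c \;+\; \alpha \sqrt{\frac{\ln N}{n_c}},
  \qquad
  c_{\text{next}} \;=\; \arg\max_{c} \, \mathrm{UCB}(c)
\]
```

The first term favors configurations that have already scored well, while the second term shrinks as a configuration is tried more often. A newly evolved configuration therefore gets a chance to run, but one unlucky mutation does not displace a parent with a strong track record.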

Method

EvoTest uses an Actor Agent to play each episode and an Evolver Agent to analyze the resulting transcript. The Evolver generates new configurations by mutating prompts, updating memory, tuning hyperparameters, and refining tool-use routines; a UCB rule then selects which configuration to run in the next episode.
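
The loop described above can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the authors' implementation: the game, the `Config` fields, and the functions `act`, `play_episode`, `evolve`, `select_ucb`, and `run` are all hypothetical, and the Actor and Evolver are plain rule-based functions standing in for LLM calls.

```python
import math
from dataclasses import dataclass, field

# Toy stand-in for a text game (assumption): one correct action per state,
# and a wrong action ends the episode.
ACTIONS = ["south", "north", "take lamp", "open door"]
SOLUTION = ["north", "take lamp", "open door"]


@dataclass
class Config:
    """Everything the Evolver may rewrite (hypothetical fields)."""
    prompt: str = "Explore carefully."
    good: dict = field(default_factory=dict)  # state -> action that scored
    bad: dict = field(default_factory=dict)   # state -> actions that failed


@dataclass
class Candidate:
    """A configuration plus the statistics the UCB rule needs."""
    config: Config
    episodes: int = 0
    total: float = 0.0


def act(config, state):
    """Actor: reuse a remembered action, else try the first untried one."""
    if state in config.good:
        return config.good[state]
    failed = config.bad.get(state, [])
    return next(a for a in ACTIONS if a not in failed)


def play_episode(config):
    """Run one episode and return (transcript, score)."""
    transcript, state, score = [], 0, 0
    while state < len(SOLUTION):
        action = act(config, state)
        reward = 1 if action == SOLUTION[state] else 0
        transcript.append((state, action, reward))
        if reward == 0:
            break
        score += 1
        state += 1
    return transcript, score


def evolve(parent, transcript):
    """Evolver: derive a child config from the episode transcript."""
    good = dict(parent.good)
    bad = {s: list(acts) for s, acts in parent.bad.items()}
    for state, action, reward in transcript:
        if reward > 0:
            good[state] = action
        else:
            bad.setdefault(state, []).append(action)
    prompt = f"Explore carefully. Known good moves: {len(good)}."
    return Config(prompt=prompt, good=good, bad=bad)


def select_ucb(pool, t, c=1.0):
    """Pick the candidate with the highest UCB score; untried ones go first."""
    def ucb(cand):
        if cand.episodes == 0:
            return math.inf
        mean = cand.total / cand.episodes
        return mean + c * math.sqrt(math.log(t) / cand.episodes)
    return max(pool, key=ucb)


def run(num_episodes):
    """Play consecutive episodes of the same game, evolving after each one."""
    pool = [Candidate(Config())]
    scores = []
    for t in range(1, num_episodes + 1):
        cand = select_ucb(pool, t)
        transcript, score = play_episode(cand.config)
        cand.episodes += 1
        cand.total += score
        scores.append(score)
        pool.append(Candidate(evolve(cand.config, transcript)))
    return scores, pool
```

Calling `run(8)` returns the per-episode score curve and the pool of configurations; in this toy the score climbs from 0 to the maximum of 3 by the seventh episode. Because the toy Evolver never produces a worse child, the UCB rule here always ends up running the newest configuration. With a noisy LLM-based Evolver, the same rule would let the loop fall back to an earlier configuration that scored better.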

In practice

Use the J-TTL benchmark for on-the-fly learning evaluation.
Implement a two-agent system for acting and evolving.
Employ UCB for stable configuration selection.

Topics

EvoTest
Test-Time Learning
Jericho Test-Time Learning (J-TTL)
Agentic Systems
Evolutionary Algorithms

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.