AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
Summary
AMARIS, a Memory-Augmented Rubric Improvement System, enhances rubric-based reward shaping for fine-tuning Large Language Models (LLMs) with Reinforcement Learning (RL). Unlike prior methods, which discard evaluation diagnostics after immediate use, AMARIS grounds rubric modifications in long-term training history. It analyzes individual rollouts, aggregates the findings into step-level summaries, and retrieves relevant historical context from a persistent evaluation memory using both static retrieval (recent steps) and dynamic retrieval (semantically matched steps). The system then updates rubrics based on the accumulated analyses, running asynchronously alongside the normal RL loop with approximately 5% time overhead. Experiments in both closed-ended and open-ended domains show that AMARIS consistently outperforms baseline methods, and ablation studies confirm that static and dynamic memory retrieval each contribute to performance.
Key takeaway
For AI engineers fine-tuning LLMs with RL, a memory-augmented rubric system like AMARIS can improve model performance over baseline rubric methods at a cost of roughly 5% additional training time. Consider integrating a persistent evaluation memory so that evaluation principles are not re-derived at every step and rubrics can follow a curriculum-like progression as training advances.
Key insights
In the reported experiments, persistent evaluation memory improves rubric-based reward shaping for RL fine-tuning of LLMs, and ablations attribute part of the gain to each of static and dynamic retrieval.
Principles
- Accumulate evaluation knowledge over time.
- Combine static and dynamic memory retrieval.
- Execute rubric updates asynchronously to minimize overhead.
Method
AMARIS analyzes rollouts, summarizes findings, retrieves historical context from persistent memory, and updates rubrics asynchronously to improve RL training.
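The retrieval step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the AMARIS implementation: the names `StepSummary` and `EvaluationMemory` are hypothetical, and a toy bag-of-words cosine similarity stands in for whatever semantic matching the original system uses.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StepSummary:
    step: int
    text: str  # step-level summary aggregated from per-rollout analyses


def _bow(text):
    # Toy bag-of-words vector (assumption); a real system would likely use embeddings.
    return Counter(text.lower().split())


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class EvaluationMemory:
    """Hypothetical persistent store of step-level evaluation summaries."""
    entries: list = field(default_factory=list)

    def add(self, summary):
        self.entries.append(summary)

    def retrieve_static(self, k):
        # Static retrieval: the k most recent steps, oldest first.
        return sorted(self.entries, key=lambda e: e.step)[-k:] if k > 0 else []

    def retrieve_dynamic(self, query, k, exclude=()):
        # Dynamic retrieval: top-k entries by similarity to the query,
        # skipping steps already returned by static retrieval.
        q = _bow(query)
        candidates = [e for e in self.entries if e.step not in exclude]
        ranked = sorted(candidates, key=lambda e: _cosine(q, _bow(e.text)), reverse=True)
        return ranked[:k]

    def retrieve(self, query, k_recent=2, k_similar=2):
        recent = self.retrieve_static(k_recent)
        similar = self.retrieve_dynamic(query, k_similar, exclude={e.step for e in recent})
        return recent + similar
```

In this sketch, the combined context passed to a rubric updater would be the recent summaries followed by older summaries that resemble the current step's findings, so a recurring failure mode can resurface even after it has left the recent window.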
In practice
- Implement persistent evaluation memory.
- Use both recent and semantically matched history.
- Run evaluation improvements asynchronously.
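One way to picture the asynchronous pattern in the list above is a training loop that hands rubric improvement to a background worker and only swaps in a new rubric once that work has finished. This is a rough sketch under stated assumptions: `improve_rubric` and `train` are hypothetical names, the rubric is modeled as a plain list of strings, and the rollout, scoring, and policy-update steps are left as placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def improve_rubric(rubric, step_summary):
    # Hypothetical stand-in for the rubric update: append one criterion
    # derived from the step summary.
    return rubric + [f"address: {step_summary}"]


def train(num_steps, rubric):
    history = []   # rubric size in effect at each step, for inspection
    pending = None  # at most one background rubric update in flight
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(num_steps):
            # Swap in an improved rubric only if the background job is done;
            # the training loop never blocks waiting for it.
            if pending is not None and pending.done():
                rubric = pending.result()
                pending = None
            history.append(len(rubric))
            # Placeholder for rollouts, scoring against `rubric`, and the policy update.
            summary = f"step {step} findings"
            if pending is None:
                pending = pool.submit(improve_rubric, list(rubric), summary)
        if pending is not None:
            rubric = pending.result()  # collect the last update after training ends
    return rubric, history
```

Because the loop checks `pending.done()` instead of waiting, the step at which a new rubric takes effect depends on how long the background work takes; steps that occur while an update is still running simply keep using the previous rubric.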
Topics
- AMARIS
- Rubric-Based Reinforcement Learning
- Reward Shaping
- Large Language Models
- Evaluation Memory
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.