Escaping the Verifier: Learning to Reason via Demonstrations

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

RARO (Relativistic Adversarial Reasoning Optimization) is an algorithm for training Large Language Models (LLMs) to reason using only expert demonstrations, with no task-specific verifier or human preference data. The method applies Inverse Reinforcement Learning: a policy (generator) and a relativistic critic (discriminator) interact adversarially, with the critic learning to distinguish policy answers from expert answers. RARO significantly outperforms strong verifier-free baselines across diverse reasoning tasks, including Countdown, DeepMath, and Poetry Writing. On Countdown, it reaches 54.4% accuracy, 13.7 percentage points above supervised fine-tuning (SFT, 40.7%) and close to RL with verifiable rewards (RLVR, 57.7%). It also scales robustly with model size (1.5B to 7B) and reasoning budget, showing trends similar to RLVR on math problems and eliciting explicit planning behaviors in creative tasks.

Key takeaway

For AI Scientists and Machine Learning Engineers developing LLMs for complex, non-verifiable tasks, RARO offers a way to cultivate strong reasoning capabilities using only expert demonstrations. This bypasses the need for expensive verifiers or human preference data, which are often unavailable in real-world scenarios. Consider its adversarial Inverse RL framework for domains like analytical writing or financial analysis, where the reported performance gains and stable scaling suggest it may help.

Key insights

RARO enables LLMs to learn robust reasoning from expert demonstrations alone, even without task-specific verifiers or human preferences.

Principles

Adversarial policy-critic interaction drives reasoning capabilities.
A relativistic critic with a "tie" option stabilizes learning dynamics.
Sharing one LLM between critic and policy reduces memory use and promotes generalization.
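
To make the "tie" principle concrete, here is a minimal sketch of how a three-way critic verdict could be turned into rewards for both players. The function name pairwise_rewards, the verdict labels, and the specific reward values (1.0, 0.0, and 0.5 for a tie) are illustrative assumptions, not values taken from the paper.

```python
from typing import Tuple

# Hypothetical verdict labels: which of the two shown answers the critic
# believes is the expert's, or "tie" if it cannot tell them apart.
VERDICTS = ("first", "second", "tie")


def pairwise_rewards(
    verdict: str, expert_slot: str, tie_reward: float = 0.5
) -> Tuple[float, float]:
    """Return (policy_reward, critic_reward) for one critic verdict.

    Assumed scheme, for illustration only:
    - critic picks the expert's slot -> critic 1.0, policy 0.0
    - critic picks the policy's slot -> critic 0.0, policy 1.0
    - critic declares a tie          -> both receive tie_reward
    """
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    if expert_slot not in ("first", "second"):
        raise ValueError(f"unknown expert slot: {expert_slot!r}")
    if verdict == "tie":
        return tie_reward, tie_reward
    if verdict == expert_slot:
        return 0.0, 1.0
    return 1.0, 0.0
```

Under these assumptions, a tie gives neither player a full win, so the critic is not forced into an arbitrary choice when the policy's answer is as good as the expert's.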

Method

RARO casts learning from expert demonstrations as Inverse RL. It sets up an adversarial game between a policy (generator) and a relativistic critic (discriminator), trains both jointly, and incorporates stabilization techniques such as the critic's "tie" option.
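
The game described above can be sketched as one round in which a single model plays both roles. This is a toy outline under stated assumptions: the model is any callable from prompt to text, and the prompt wording, role tags, outcome labels, and the function name shared_model_round are all hypothetical. The RL update that would consume these samples is omitted.

```python
import random
from typing import Callable, Dict, List


def shared_model_round(
    model: Callable[[str], str],
    question: str,
    expert_answer: str,
    rng: random.Random,
) -> List[Dict[str, str]]:
    """Collect one policy rollout and one critic rollout from the same model."""
    # Policy role: the model answers the question.
    policy_prompt = f"[ROLE: policy]\nQuestion: {question}\nAnswer:"
    policy_answer = model(policy_prompt)

    # Randomize the order so position does not reveal which answer is the expert's.
    expert_first = rng.random() < 0.5
    if expert_first:
        first, second = expert_answer, policy_answer
    else:
        first, second = policy_answer, expert_answer

    # Critic role: the same model compares the two answers.
    critic_prompt = (
        "[ROLE: critic]\n"
        f"Question: {question}\n"
        f"Answer 1: {first}\nAnswer 2: {second}\n"
        "Which answer is the expert's? Reply 1, 2, or tie."
    )
    verdict = model(critic_prompt).strip().lower()

    expert_slot = "1" if expert_first else "2"
    if verdict == "tie":
        outcome = "tie"
    elif verdict == expert_slot:
        outcome = "critic_found_expert"
    elif verdict in ("1", "2"):
        outcome = "policy_passed_as_expert"
    else:
        outcome = "invalid_verdict"

    # Both rollouts become training samples for the one shared model.
    return [
        {"role": "policy", "prompt": policy_prompt,
         "completion": policy_answer, "outcome": outcome},
        {"role": "critic", "prompt": critic_prompt,
         "completion": verdict, "outcome": outcome},
    ]
```

Each round yields two samples, one per role, so a single set of weights receives gradient signal from both sides of the game.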

In practice

Train reasoning LLMs using only expert demonstrations.
Apply to non-verifiable tasks like creative writing or financial analysis.
Leverage test-time scaling with the learned critic for further gains.
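
One plausible way to use the learned critic at test time is to sample several candidate answers and let the critic pick among them pairwise. The sketch below uses a single-elimination tournament; the bracket structure, the tie-breaking rule (keep the first answer), and the names critic_tournament and compare are assumptions for illustration, not the procedure from the paper.

```python
from typing import Callable, List


def critic_tournament(
    candidates: List[str], compare: Callable[[str, str], str]
) -> str:
    """Pick one answer by pairwise elimination.

    compare(a, b) stands in for the learned critic and must return
    "first", "second", or "tie". On a tie the first answer advances
    (an arbitrary choice made for this sketch).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Compare neighbours in pairs; the preferred answer advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(b if compare(a, b) == "second" else a)
        # With an odd pool size, the last answer advances without a match.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```

With N candidates this needs N - 1 critic calls, so a larger sampling budget translates directly into more comparisons.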

Topics

Large Language Models
Inverse Reinforcement Learning
Adversarial Training
LLM Reasoning
Non-verifiable Tasks
Relativistic Critic

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.