RL Beyond the Verifiable

2024-09-30 · Source: Tanay’s Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, medium

Summary

The article examines the challenge of applying reinforcement learning (RL) to tasks whose outcomes are hard to verify, in contrast to the success of RL with verifiable rewards (RLVR) in domains such as math and coding. Dario Amodei, CEO of Anthropic, expressed 90% certainty of achieving a "country of geniuses in a data center" within ten years and tied the remaining 10% of uncertainty to unverifiable tasks such as planning Mars missions or fundamental scientific discovery. RLVR has driven significant progress in math and code: OpenAI and Google DeepMind models reached gold-medal level (35 out of 42 points) at the International Mathematical Olympiad in 2025. The "verifier's law" holds that the ease of training AI on a task correlates with how verifiable that task is. Techniques being explored to extend RL beyond verifiable tasks include rubrics as rewards (Scale AI reported a 31% relative gain on HealthBench in mid-2025), generative reward models, and process reward models. Companies are approaching the problem by selling verifiers and data (Mercor, Taste Labs), formalizing domains (Pramaana Labs), or owning the full experimental loop (Periodic Labs, Isomorphic Labs, Lila Sciences).

Key takeaway

AI/ML directors evaluating advanced model capabilities should recognize that progress in subjective, unverifiable domains lags behind progress on verifiable tasks such as coding. Prioritize integrating rubric-based LLM evaluation or exploring domain formalization to expand where RL can be applied. Consider partnering with companies that own full experimental loops for real-world validation in materials science or drug discovery, which mitigates the risk of ungrounded AI outputs.

Key insights

The primary challenge for advanced AI lies in applying RL to tasks lacking clear, objective verifiability.

Principles

AI training ease scales with task verifiability.
Human preferences or AI principles can guide reward models.
Decomposing complex verification into smaller checks improves reward signals.
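
The third principle can be illustrated with a small sketch. It assumes a solution has already been split into steps and that each step can be scored on its own; the names `process_reward` and `arithmetic_check` are hypothetical, and the toy arithmetic checker stands in for what would be a learned process reward model in practice. Taking the minimum over step scores is one common aggregation choice, not the only one.

```python
from typing import Callable, List, Sequence, Tuple

# A step checker maps one reasoning step to a score in [0, 1].
StepCheck = Callable[[str], float]


def process_reward(
    steps: Sequence[str], check: StepCheck, aggregate: str = "min"
) -> Tuple[float, List[float]]:
    """Score each step separately, then combine the per-step scores.

    Returns (aggregate_score, per_step_scores). An empty solution scores 0.0.
    """
    scores = [check(step) for step in steps]
    if not scores:
        return 0.0, scores
    if aggregate == "min":
        return min(scores), scores          # one bad step sinks the whole chain
    if aggregate == "mean":
        return sum(scores) / len(scores), scores
    raise ValueError(f"unknown aggregate: {aggregate!r}")


def arithmetic_check(step: str) -> float:
    """Toy checker (assumption): a step like '2 + 3 = 5' scores 1.0 if the sum holds."""
    left, right = step.split("=")
    terms = [int(term) for term in left.split("+")]
    return 1.0 if sum(terms) == int(right) else 0.0


steps = ["2 + 3 = 5", "5 + 4 = 10", "10 + 1 = 11"]
min_score, per_step = process_reward(steps, arithmetic_check, aggregate="min")
mean_score, _ = process_reward(steps, arithmetic_check, aggregate="mean")
print(per_step, min_score, mean_score)
```

The per-step scores point at the second step as the faulty one, which a single pass/fail check on the final answer would not reveal.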

Method

When no programmatic checker exists, approximate one: write instance-specific rubrics, often anchored to human expert judgment, and have an LLM score each response against these detailed checklists.
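
A minimal sketch of this method, under stated assumptions: the rubric is a weighted checklist, the judge returns a yes/no verdict per item, and the reward is the weighted fraction of items satisfied. The names `RubricItem`, `rubric_reward`, and `keyword_judge` are hypothetical, the rubric text is invented for illustration, and the keyword matcher is only a stand-in for a real LLM judge call, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    criterion: str  # one checklist entry written for this specific prompt
    weight: float   # relative importance of the entry


# A judge decides whether a response satisfies one criterion.
# In practice this would wrap an LLM call; here it is any callable.
Judge = Callable[[str, str], bool]


def rubric_reward(response: str, rubric: List[RubricItem], judge: Judge) -> float:
    """Weighted fraction of rubric items the judge marks as satisfied, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    earned = sum(item.weight for item in rubric if judge(response, item.criterion))
    return earned / total


def keyword_judge(response: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge: passes if the key term before ':' appears."""
    key = criterion.split(":")[0].strip().lower()
    return key in response.lower()


# Hypothetical rubric for a project status update.
rubric = [
    RubricItem("deadline: states the deadline", 2.0),
    RubricItem("owner: names an owner", 1.0),
    RubricItem("risk: lists at least one risk", 1.0),
]
response = "The deadline is Friday and the owner is the data team."
score = rubric_reward(response, rubric, keyword_judge)
print(score)
```

Keeping the judge behind a plain callable means the scoring logic can be tested with a deterministic stub before any model is wired in.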

In practice

Use LLM judges with rubrics for subjective task evaluation.
Formalize fuzzy domains for machine-checkable solutions.
Integrate physical labs for real-world verification loops.

Topics

Reinforcement Learning
Verifiability Constraint
Reward Models
LLM Evaluation
AI Alignment
Formal Verification
Autonomous Labs

Best for: Research Scientist, AI Scientist, Director of AI/ML, Investor

Editorial summary, takeaway, and curation by AIssential. Original article published by Tanay’s Newsletter.