Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

2026-05-28 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Hista and Numca are two techniques for improving state value estimation in reinforcement learning (RL) post-training of large language models (LLMs). Accurate state value estimates are crucial for stable RL, but existing estimators, notably the learned critics used in standard approaches such as PPO, often collapse to a coarse group-average baseline when applied to LLMs. To assess this, the researchers introduce the State Value Estimation Benchmark (SVEB). Numca uses numerical spans as gradable milestones for state value estimation. Hista is a framework that uses the LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Experiments indicate that both Hista and Numca yield more accurate state value estimates and improve training performance across various RL algorithms and model sizes, without significant computational overhead.

Key takeaway

If you are a Machine Learning Engineer fine-tuning LLMs with RL and seeing unstable training or suboptimal performance, poor state value estimation may be the cause: your PPO critic might be collapsing to a coarse group-average baseline. Consider integrating Hista or Numca. Both are reported to give more accurate state value estimates and to improve training performance across various RL algorithms without significant computational overhead.

Key insights

Hista and Numca enhance LLM reinforcement learning stability and performance by providing more accurate state value estimates.

Principles

Accurate state value estimation stabilizes RL.
Standard RL critics often collapse in LLMs.
LLM hidden states improve value estimation.
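
To make the "collapse" concrete, here is a minimal sketch of a group-average baseline, the coarse estimate the summary says critics tend to fall back to. The function name and the toy returns are illustrative assumptions, not taken from the original article.

```python
def group_average_advantages(group_returns):
    """Subtract one shared baseline (the group mean) from every rollout's return.

    Every state in every rollout of the group gets the same baseline,
    so the estimate carries no per-state information.
    """
    baseline = sum(group_returns) / len(group_returns)
    return [r - baseline for r in group_returns]


# Four rollouts sampled for the same prompt, with 0/1 outcome rewards.
advantages = group_average_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is a single number per prompt, it cannot tell a promising intermediate state from a doomed one, which is the gap a per-state value estimate is meant to fill.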

Method

Numca estimates state value by using numerical spans as gradable milestones. Hista uses LLM hidden states as representations to compute a weighted average over disjoint rollouts and their returns.
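
The idea of weighting rollout returns by hidden-state representations might be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the summary does not say how the weights are computed, so the cosine similarity, the softmax, the `temperature` parameter, and all names here are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def similarity_weighted_value(query_state, rollout_states, rollout_returns,
                              temperature=0.1):
    """Estimate a state's value as a weighted average of rollout returns.

    ASSUMPTION: weights are a softmax over cosine similarity between the
    query hidden state and one hidden state per rollout. The original
    article's weighting scheme may differ.
    """
    sims = [cosine(query_state, s) for s in rollout_states]
    logits = [s / temperature for s in sims]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return sum(e / total * r for e, r in zip(exps, rollout_returns))


# Toy 2-d "hidden states" for two rollouts with returns 1.0 and 0.0.
states = [[1.0, 0.0], [0.0, 1.0]]
returns = [1.0, 0.0]
sharp = similarity_weighted_value([1.0, 0.0], states, returns, temperature=0.05)
flat = similarity_weighted_value([1.0, 0.0], states, returns, temperature=1e6)
```

In this sketch a small `temperature` concentrates the weight on the most similar rollouts, while a very large one spreads it evenly and reduces the estimate to the plain group average of the returns.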

In practice

Apply Hista/Numca for LLM RL.
Benchmark state estimation with SVEB.
Utilize hidden states for value representation.

Topics

Reinforcement Learning
Large Language Models
State Value Estimation
Hista
Numca
PPO

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.