From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The paper introduces PreRL (Pre-train Space Reinforcement Learning), which applies reward-driven online updates directly to the marginal distribution P(y) in the pre-train space, rather than optimizing only the conditional distribution P(y|x) as in standard reinforcement learning with verifiable rewards (RLVR). The method aims to overcome the limitations of the base model by encoding reasoning ability in P(y) while preserving broad exploration capacity; conventional pre-training on static corpora, by contrast, is hindered by distribution shift. The authors report strong gradient alignment between log P(y) and log P(y|x), which supports using PreRL as a surrogate for standard RL. A key finding is that Negative Sample Reinforcement (NSR) within PreRL significantly enhances reasoning: it prunes incorrect reasoning spaces and stimulates reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Building on this, the authors propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes the model with NSR-PreRL and then switches to standard RL for fine-grained optimization; it is reported to consistently outperform strong baselines.
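To make the "pruning" effect of Negative Sample Reinforcement concrete, here is a minimal toy sketch. It is not the paper's implementation: it assumes a softmax policy over four hypothetical candidate responses and takes one gradient-descent step on the log-probability of a response judged incorrect. The names `softmax`, `nsr_step`, `wrong_index`, and `lr` are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsr_step(logits, wrong_index, lr=0.5):
    """One descent step on log p[wrong_index] (toy NSR update, an assumption)."""
    probs = softmax(logits)
    # For a softmax policy: d log p_k / d z_i = 1[i == k] - p_i
    grad = [(1.0 if i == wrong_index else 0.0) - p for i, p in enumerate(probs)]
    # Descending on log p_k lowers the wrong response's logit and
    # raises every other logit in proportion to its current probability.
    return [z - lr * g for z, g in zip(logits, grad)]

logits = [2.0, 1.0, 0.5, 0.0]   # hypothetical scores; response 0 is "incorrect"
before = softmax(logits)
after = softmax(nsr_step(logits, wrong_index=0))

print("before:", [round(p, 3) for p in before])
print("after: ", [round(p, 3) for p in after])
```

In this toy setting the incorrect response loses probability mass and the remaining responses gain it in proportion to their current probability, which is one way to read the claim that NSR prunes incorrect reasoning while keeping exploration broad.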

Key takeaway

For AI engineers and research scientists building LLMs, PreRL and Dual Space RL (DSRL) are worth evaluating in the training pipeline. According to the paper, initializing a model with NSR-PreRL expands its reasoning horizon and prunes incorrect reasoning spaces, producing a more robust and reflective model before fine-tuning with standard RL. The approach offers a route past the inherent limitations of the base model toward stronger reasoning performance.

Key insights

Optimizing the marginal distribution P(y) in the pre-train space enhances LLM reasoning and exploration beyond what optimizing the conditional P(y|x) alone achieves.
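The contrast between the two spaces can be written out as follows. This is a sketch under stated assumptions, not the paper's exact objective: it assumes a plain REINFORCE-style policy gradient, a verifiable reward r(x, y), a prompt dataset D, and that log P_θ(y) denotes the model's log-likelihood of the response y scored without the prompt x in context.

```latex
% Standard RLVR: gradient flows through the conditional distribution
\nabla_\theta J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim P_\theta(\cdot \mid x)}
    \big[\, r(x, y)\, \nabla_\theta \log P_\theta(y \mid x) \,\big]

% PreRL (as assumed here): same samples and rewards,
% but the gradient flows through the marginal distribution
\nabla_\theta J_{\mathrm{PreRL}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim P_\theta(\cdot \mid x)}
    \big[\, r(x, y)\, \nabla_\theta \log P_\theta(y) \,\big]
```

Under this reading, the reported gradient alignment between log P(y) and log P(y|x) means the two update directions largely agree, which is what would let the second expression stand in for the first.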

Method

PreRL applies reward-driven online updates to P(y). DSRL uses NSR-PreRL for initial reasoning expansion, then transitions to standard RL for fine-grained optimization.
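The two-stage schedule can be sketched as a loss selector. This is an illustration under assumptions, not the authors' code: `dsrl_phase`, `dsrl_loss`, and `switch_step` are hypothetical names, rewards are assumed binary (1 for a verified-correct response, 0 otherwise), and the second stage uses a plain mean-baseline REINFORCE surrogate in place of whatever RL algorithm the paper actually uses.

```python
def dsrl_phase(step, switch_step):
    """Pick the training phase: NSR-PreRL first, then standard RL."""
    return "nsr_prerl" if step < switch_step else "standard_rl"

def dsrl_loss(step, switch_step, logp_marginal, logp_conditional, rewards):
    """Scalar surrogate loss to minimize for one batch of sampled responses.

    logp_marginal[i]    : log P(y_i), response scored without the prompt (assumed)
    logp_conditional[i] : log P(y_i | x_i)
    rewards[i]          : verifiable reward, assumed 1.0 (correct) or 0.0 (incorrect)
    """
    if dsrl_phase(step, switch_step) == "nsr_prerl":
        # Negative samples only: minimizing their mean log P(y)
        # pushes down the likelihood of incorrect responses.
        neg = [lp for lp, r in zip(logp_marginal, rewards) if r <= 0]
        return sum(neg) / len(neg) if neg else 0.0
    # Standard RL stand-in: REINFORCE with a mean-reward baseline on log P(y|x).
    baseline = sum(rewards) / len(rewards)
    weighted = sum((r - baseline) * lp for r, lp in zip(rewards, logp_conditional))
    return -weighted / len(rewards)

# Hypothetical batch of three responses; only the first is correct.
logp_marginal = [-12.0, -9.0, -15.0]
logp_conditional = [-4.0, -6.0, -10.0]
rewards = [1.0, 0.0, 0.0]

warmup_loss = dsrl_loss(0, 100, logp_marginal, logp_conditional, rewards)
rl_loss = dsrl_loss(100, 100, logp_marginal, logp_conditional, rewards)
print(dsrl_phase(0, 100), warmup_loss)
print(dsrl_phase(100, 100), rl_loss)
```

In a real pipeline the log-probabilities would come from the model and the loss would be differentiated by an autodiff framework; the point here is only the hand-off from the marginal-space objective to the conditional-space one at `switch_step`.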

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.