Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, long

Summary

An empirical study investigated the scaling behaviors of Large Language Models (LLMs) during reinforcement learning (RL) post-training, specifically focusing on mathematical reasoning. Based on 54 experiments using the Qwen2.5 model family (0.5B to 14B parameters) and 50k mathematics problems, the research characterized how model scale, data volume, and computational budget interact to influence performance. Key findings include that larger models, even with fewer training steps, consistently outperform smaller models under fixed computational budgets. Larger models also exhibit superior sample efficiency with fixed data. In data-constrained scenarios, repeated reuse of high-quality data is highly effective, as performance is primarily driven by total optimization steps rather than unique samples. These scaling behaviors are robust across both base and instruction-tuned models, showing similar learning dynamics despite differences in absolute accuracy. The study also explored in-domain and out-of-domain generalization, noting positive transfer for other math tasks but limited or negative transfer for tasks like code generation or logical reasoning.

Key takeaway

For AI engineers optimizing LLMs for mathematical reasoning: prioritize larger models (e.g., 14B parameters) even if that means fewer training steps, as they are more compute- and data-efficient. When unique training data is limited, reuse high-quality data aggressively, since performance depends more on total optimization steps than on sample uniqueness. Gains in mathematical reasoning may not carry over to other domains such as code generation or general logical reasoning, where transfer was limited or negative.
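The data-reuse point can be made concrete with simple arithmetic. The sketch below is illustrative only: the function name `reuse_factor`, the step budget, and the batch size are hypothetical and are not taken from the study. It shows how one fixed budget of optimization steps translates into more passes over the data as the pool of unique problems shrinks.

```python
def reuse_factor(total_steps: int, batch_size: int, unique_problems: int) -> float:
    """Average number of times each unique problem is seen under a fixed step budget."""
    return total_steps * batch_size / unique_problems


# Hypothetical budget: 1,000 optimization steps with 50 problems per step.
TOTAL_STEPS = 1_000
BATCH_SIZE = 50

for pool_size in (50_000, 10_000, 5_000):
    passes = reuse_factor(TOTAL_STEPS, BATCH_SIZE, pool_size)
    print(f"{pool_size:>6} unique problems -> each seen about {passes:.1f} times")
```

Under the summary's claim, these three settings consume the same number of optimization steps, so they would be expected to land at broadly similar performance, provided the reused data is of high quality.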

Key insights

Larger LLMs are more compute- and sample-efficient in RL post-training for mathematical reasoning: under a fixed computational budget or a fixed dataset, they outperform smaller models.

Principles

Method

The study ran 54 RL post-training experiments on the Qwen2.5 model family (0.5B to 14B parameters), using Group Relative Policy Optimization (GRPO) on 50k math problems and systematically varying model size, data volume, and computational budget to measure their effect on performance.
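For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same problem, rather than against a learned value function. The following is a minimal sketch of that group-relative advantage step only, not the study's training code. The function name, the binary correctness reward, the `eps` constant, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled solutions to one math problem,
# with reward 1.0 when the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct responses receive positive advantages and incorrect ones negative, while a group in which every response earns the same reward yields all-zero advantages and therefore contributes no learning signal for that problem.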

In practice

Topics

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.