SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Summary
SUPERNOVA is a data curation framework designed to improve the general reasoning of large language models (LLMs) using Reinforcement Learning with Verifiable Rewards (RLVR). RLVR has succeeded in formal domains like mathematics, but LLMs still struggle with general reasoning tasks such as causal inference and temporal understanding because high-quality, verifiable training data is scarce. SUPERNOVA addresses this by adapting the rich reasoning patterns in expert-annotated instruction-tuning datasets for RLVR. Through over 100 controlled RL experiments, the framework investigates source task selection, task mixing strategies, and synthetic interventions for data quality. The research indicates that source task selection significantly affects reasoning performance, and that selecting source tasks by their effect on each individual target task outperforms selecting them by average performance. Models trained with SUPERNOVA achieve up to 52.8% relative improvement on the BBEH benchmark across various model sizes, surpassing baselines like Qwen3.5 on benchmarks including BBEH, Zebralogic, and MMLU-Pro.
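A relative improvement is measured against the baseline score, not in absolute percentage points. The sketch below shows the arithmetic; the two scores are hypothetical values chosen for illustration and are not taken from the article.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical benchmark scores (NOT reported numbers): 10.0 before, 15.28 after.
baseline_score = 10.0
improved_score = 15.28

gain = relative_improvement(baseline_score, improved_score)
print(f"absolute gain: {improved_score - baseline_score:.2f} points")
print(f"relative gain: {gain:.1f}%")
```

With these illustrative numbers, a gain of about 5.3 absolute points corresponds to a relative gain of about 52.8%, which is why relative figures can look large when the baseline is low.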
Key takeaway
For AI engineers and research scientists focused on improving LLM general reasoning, SUPERNOVA offers a principled data curation approach. Consider adapting existing expert-annotated instruction-tuning datasets for RLVR, paying close attention to source task selection. Prioritizing source tasks by their effect on each individual target task, rather than by overall averages, can yield significant improvements on complex reasoning benchmarks: the reported relative gain on BBEH is up to 52.8%.
Key insights
SUPERNOVA enhances LLM general reasoning via RLVR by curating high-quality data from instruction-tuning datasets.
Principles
- Source task selection is critical for RLVR performance.
- Selecting source tasks per target task outperforms selecting them by average performance.
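The difference between the two selection strategies can be sketched as follows. This is a minimal illustration, not the article's procedure: the score table, the task names, and both helper functions are hypothetical, and it assumes you already have a per-target score for a model trained on each source task.

```python
from statistics import mean

# Hypothetical table: scores[source][target] is the target-task accuracy
# observed after RL training on that source task alone. Values are made up.
scores = {
    "causal_qa":   {"causal": 0.40, "temporal": 0.10, "logic": 0.15},
    "event_order": {"causal": 0.12, "temporal": 0.42, "logic": 0.14},
    "puzzle":      {"causal": 0.20, "temporal": 0.22, "logic": 0.30},
    "generic_qa":  {"causal": 0.26, "temporal": 0.25, "logic": 0.22},
}


def select_by_average(scores, k):
    """Pick the k source tasks with the best mean score across all targets."""
    ranked = sorted(scores, key=lambda s: mean(scores[s].values()), reverse=True)
    return ranked[:k]


def select_per_target(scores, target, k):
    """Pick the k source tasks with the best score on one specific target."""
    ranked = sorted(scores, key=lambda s: scores[s][target], reverse=True)
    return ranked[:k]


print("average-based:", select_by_average(scores, k=1))
for target in ("causal", "temporal", "logic"):
    print(f"best for {target}:", select_per_target(scores, target, k=1))
```

In this toy table the average-based choice is a generalist source task that is never the best option for any single target, while per-target selection picks a different specialist each time.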
Method
SUPERNOVA systematically adapts expert-annotated instruction-tuning datasets for RLVR, analyzing source task selection, mixing strategies, and synthetic interventions to improve data quality for general reasoning tasks.
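One way to picture adapting an instruction-tuning example for RLVR is to pair its prompt with the annotated answer and score model responses with an automatic check. The sketch below is an assumption-laden illustration, not the article's pipeline: the `RLVRExample` type, the "Answer:" output convention, and the exact-match reward are all hypothetical choices.

```python
import re
from dataclasses import dataclass


@dataclass
class RLVRExample:
    prompt: str      # what the policy model sees
    reference: str   # gold answer used only by the reward function


def to_rlvr_example(instruction: str, input_text: str, gold_output: str) -> RLVRExample:
    """Wrap an instruction-tuning triple so its answer can be checked automatically."""
    prompt = (
        f"{instruction}\n\n{input_text}\n\n"
        "End your response with a line of the form 'Answer: <answer>'."
    )
    return RLVRExample(prompt=prompt, reference=gold_output)


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing period before comparing."""
    return re.sub(r"\s+", " ", text.strip().lower()).rstrip(".")


def extract_answer(response: str):
    """Return the text after the last 'Answer:' marker, or None if it is missing."""
    matches = re.findall(r"answer:\s*(.+)", response, flags=re.IGNORECASE)
    return matches[-1] if matches else None


def verifiable_reward(response: str, example: RLVRExample) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if normalize(answer) == normalize(example.reference) else 0.0


example = to_rlvr_example(
    "Decide whether the first event caused the second.",
    "Event 1: It rained all night. Event 2: The street was wet in the morning.",
    "Yes",
)
print(verifiable_reward("Rain wets streets.\nAnswer: yes.", example))
print(verifiable_reward("I am not sure.", example))
```

Exact match after normalization only suits tasks with short, unambiguous answers; tasks with free-form outputs would need a different verifier.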
In practice
- Use expert-annotated instruction-tuning datasets as a source of RLVR training data.
- Select source tasks per target task rather than by average performance.
Topics
- Reinforcement Learning with Verifiable Rewards
- Large Language Models
- General Reasoning
- Data Curation Framework
- Instruction Tuning Datasets
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.