SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

2026-04-09 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

SUPERNOVA is a data curation framework for improving the general reasoning of large language models (LLMs) through Reinforcement Learning with Verifiable Rewards (RLVR). RLVR has worked well in formal domains such as mathematics, but LLMs still struggle with general reasoning tasks such as causal inference and temporal understanding, largely because high-quality, verifiable training data is scarce. SUPERNOVA addresses this by adapting the rich reasoning patterns found in expert-annotated instruction-tuning datasets for RLVR. Across more than 100 controlled RL experiments, the authors study three design choices: source task selection, task mixing strategies, and synthetic interventions on data quality. They find that source task selection significantly affects reasoning performance, and that selecting source tasks by their effect on individual target tasks outperforms selecting by overall average performance. Models trained with SUPERNOVA achieve up to a 52.8% relative improvement on the BBEH benchmark across model sizes, and surpass baselines such as Qwen3.5 on benchmarks including BBEH, Zebralogic, and MMLU-Pro.

Key takeaway

For AI engineers and research scientists working on LLM general reasoning, SUPERNOVA offers a principled data curation approach: adapt existing expert-annotated instruction-tuning datasets for RLVR, and pay close attention to source task selection. Selecting source tasks by their performance on individual target tasks, rather than by overall averages, can yield significant improvements on complex reasoning benchmarks; the reported relative gain on BBEH is up to 52.8%.
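
A relative gain is measured against the baseline score, not in absolute benchmark points. The sketch below shows the arithmetic only; the function name and the example scores are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores, NOT taken from the paper: moving from 10.0 to 15.0
# benchmark points is a gain of 5 absolute points but 50% in relative terms.
print(relative_improvement(10.0, 15.0))  # 50.0
```

Because the denominator is the baseline, the same absolute gain looks larger in relative terms when the baseline is low, which is worth keeping in mind when comparing gains across model sizes.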

Key insights

SUPERNOVA enhances LLM general reasoning via RLVR by curating high-quality data from instruction-tuning datasets.

Principles

Source task selection is critical for RLVR performance.
Selecting source tasks per target task outperforms selecting them by average performance across tasks.
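
The difference between the two selection strategies can be illustrated with a small sketch. It assumes a table of per-source, per-target gains is already available (for example, from one RL run per source task); the table layout, function names, task names, and numbers are all hypothetical and do not come from the SUPERNOVA code or paper.

```python
from typing import Dict, List

# Hypothetical layout: gains[source_task][target_task] is the gain observed on
# the target task after RL training on that source task alone.
Scores = Dict[str, Dict[str, float]]


def select_by_average(gains: Scores, k: int) -> List[str]:
    """Rank source tasks by mean gain over all targets and keep the top k."""
    mean = {s: sum(row.values()) / len(row) for s, row in gains.items()}
    return sorted(mean, key=lambda s: (-mean[s], s))[:k]


def select_per_target(gains: Scores, k: int) -> Dict[str, List[str]]:
    """For each target task, keep the k source tasks with the highest gain on it."""
    targets = sorted({t for row in gains.values() for t in row})
    return {
        t: sorted(gains, key=lambda s: (-gains[s].get(t, float("-inf")), s))[:k]
        for t in targets
    }


# Made-up numbers: one generalist source and two specialists.
gains = {
    "generalist": {"causal": 2.0, "temporal": 2.0},
    "causal_specialist": {"causal": 5.0, "temporal": -2.0},
    "temporal_specialist": {"causal": -2.0, "temporal": 5.0},
}

print(select_by_average(gains, k=1))
# ['generalist']
print(select_per_target(gains, k=1))
# {'causal': ['causal_specialist'], 'temporal': ['temporal_specialist']}
```

In this toy table the average-based ranking picks the generalist source, while the per-target ranking picks the specialist that helps each target most. The example shows only how the two strategies can disagree; it does not reproduce the paper's selection procedure.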

Method

SUPERNOVA systematically adapts expert-annotated instruction-tuning datasets for RLVR, analyzing source task selection, mixing strategies, and synthetic interventions to improve data quality for general reasoning tasks.
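
Adapting an instruction-tuning example for RLVR requires a reward that can be checked automatically against the annotated answer. A minimal sketch follows, assuming each example carries a short reference answer and that the model is prompted to end its response with an "Answer:" line. Both the answer marker and the exact-match rule are illustrative assumptions, not the verifier used by SUPERNOVA.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def extract_answer(completion: str) -> str:
    """Return the text after the last 'Answer:' marker (a hypothetical convention)."""
    parts = re.split(r"answer\s*:", completion, flags=re.IGNORECASE)
    return parts[-1] if len(parts) > 1 else ""


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    predicted = normalize(extract_answer(completion))
    return 1.0 if predicted and predicted == normalize(reference) else 0.0


print(verifiable_reward("Heavy rain preceded the flood. Answer: Yes.", "yes"))  # 1.0
print(verifiable_reward("I think the answer is no", "yes"))                     # 0.0
```

Exact match after normalization is the simplest verifiable signal and suits tasks with short, closed-form answers. Tasks with free-form answers would need a different checker.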

In practice

Use expert-annotated instruction-tuning datasets as a source of RLVR training data.
Select source tasks per target task rather than by average performance.

Topics

Reinforcement Learning with Verifiable Rewards
Large Language Models
General Reasoning
Data Curation Framework
Instruction Tuning Datasets

Code references

asuvarna31/supernova

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.