NebulaExp-8B: An Empirical Post-Training Pipeline via Full-Scale Ablation Research
Summary
NebulaExp-8B introduces a fully transparent, ablation-driven post-training pipeline built on Qwen3-8B-base and designed to improve both reasoning and alignment with human preferences. The pipeline curates a raw corpus of 3.84M multi-source SFT samples and a 200K-sample pool of verifiable RL candidates, processed through an end-to-end data stack. For the Instruct branch, a three-stage supervised fine-tuning recipe, NebulaExp-Ins-SFT, improved Qwen3-8B-nothink's average benchmark score from 55.01 to 60.99, and GRPO reinforcement learning raised it further to 61.85. In the Reasoning branch, medium-difficulty GRPO RL raised the average reasoning score from 73.88 to 75.17. The work also explores single-teacher OPD and multi-teacher OPD (MOPD): OPD with 4K samples outperforms RL by 3.26 points on IFEval and achieves an average overall gain of 4.43 points, while MOPD with 10K samples lifts performance by 4.18 points. Together, these results provide a reproducible recipe for 8B-scale LLMs and dissect the capability trade-offs between stages.
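To make the relative size of each stage's contribution easier to see, the reported averages can be turned into per-stage gains. The short Python sketch below uses only the four scores quoted in the summary; the variable and function names are illustrative and do not come from the paper.

```python
# Average scores as quoted in the summary (no other figures are assumed).
INSTRUCT_BASELINE = 55.01   # Qwen3-8B-nothink
INSTRUCT_AFTER_SFT = 60.99  # after three-stage SFT (NebulaExp-Ins-SFT)
INSTRUCT_AFTER_RL = 61.85   # after GRPO RL
REASONING_BEFORE_RL = 73.88
REASONING_AFTER_RL = 75.17


def gain(before: float, after: float) -> float:
    """Absolute gain in points, rounded to two decimals like the reported scores."""
    return round(after - before, 2)


sft_gain = gain(INSTRUCT_BASELINE, INSTRUCT_AFTER_SFT)           # 5.98
instruct_rl_gain = gain(INSTRUCT_AFTER_SFT, INSTRUCT_AFTER_RL)   # 0.86
instruct_total = gain(INSTRUCT_BASELINE, INSTRUCT_AFTER_RL)      # 6.84
reasoning_rl_gain = gain(REASONING_BEFORE_RL, REASONING_AFTER_RL)  # 1.29

print(f"Instruct SFT gain:  +{sft_gain}")
print(f"Instruct RL gain:   +{instruct_rl_gain}")
print(f"Instruct total:     +{instruct_total}")
print(f"Reasoning RL gain:  +{reasoning_rl_gain}")
```

On these figures, most of the Instruct improvement (5.98 of 6.84 points) comes from the SFT stages, with GRPO adding a smaller increment on top.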
Key takeaway
For machine learning engineers optimizing 8B-scale LLMs, this research offers a transparent, reproducible post-training blueprint. You should consider implementing a multi-stage supervised fine-tuning approach and explore GRPO reinforcement learning to boost instruction adherence. Furthermore, if you face challenges with RL verifier dependency, investigate single-teacher or multi-teacher OPD as a viable alternative, potentially using as few as 4K instruction-following samples for significant gains.
Key insights
An ablation-driven pipeline for 8B LLMs improves instruction following and reasoning via structured SFT, RL, and OPD/MOPD.
Principles
- Transparent post-training enhances reproducibility.
- Multi-stage SFT improves base model performance.
- OPD and MOPD offer an alternative to RL, with the reported gains measured on IFEval.
Method
Curate multi-source SFT and RL data, applying distillation and multi-dimensional filtering. Implement three-stage SFT and GRPO RL for instruct models. For reasoning, use GRPO RL or investigate single/multi-teacher OPD.
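For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, instead of training a separate value model. The Python sketch below shows only that group-relative advantage step in its commonly described form (reward minus group mean, divided by group standard deviation). It is not the paper's training code; the function name, the `eps` constant, and the example rewards are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    `rewards` holds one scalar reward per sampled response (for example,
    1.0 if a verifier accepts the answer and 0.0 otherwise).
    `eps` is an illustrative constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct responses get positive advantages, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform outcomes: every advantage is zero, so the prompt gives no gradient signal.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_correct)
```

The second case suggests one plausible reason for favoring medium-difficulty data: prompts the model always solves or always fails yield zero advantages. This is a property of the formula, not a claim taken from the paper.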
In practice
- Start from a large multi-source SFT corpus (the paper's raw pool is 3.84M samples) for 8B LLM alignment.
- Apply multi-dimensional filtering to improve data quality.
- Consider OPD with 4K samples as an alternative to RL.
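The summary does not list which dimensions the paper filters on, so the following Python sketch is only a hypothetical illustration of what multi-dimensional filtering can look like. The `Sample` fields, the three checks (non-empty text, response length, a quality score), the thresholds, and the prompt-level deduplication are all assumptions, not the paper's actual criteria.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    response: str
    quality: float  # hypothetical score in [0, 1], e.g. from a judge model


def passes_filters(sample, min_quality=0.7, max_chars=8000):
    """Return True only if the sample passes every (hypothetical) dimension."""
    checks = {
        "non_empty": bool(sample.prompt.strip()) and bool(sample.response.strip()),
        "length": len(sample.response) <= max_chars,
        "quality": sample.quality >= min_quality,
    }
    return all(checks.values())


def filter_corpus(samples, **thresholds):
    """Keep samples that pass all checks, dropping repeated prompts."""
    seen_prompts = set()
    kept = []
    for sample in samples:
        # Normalize case and whitespace so trivially different prompts match.
        key = " ".join(sample.prompt.lower().split())
        if key in seen_prompts:
            continue
        if passes_filters(sample, **thresholds):
            seen_prompts.add(key)
            kept.append(sample)
    return kept


corpus = [
    Sample("What is 2+2?", "4", 0.9),
    Sample("what is  2+2?", "Four", 0.95),  # duplicate prompt after normalization
    Sample("Explain GRPO", "", 0.9),        # empty response
    Sample("Summarize this", "ok", 0.2),    # below the quality threshold
]
kept = filter_corpus(corpus)
print(len(kept), "of", len(corpus), "samples kept")
```

Keeping each dimension as a named check makes it straightforward to ablate one filter at a time, which fits the ablation-driven approach the summary describes.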
Topics
- LLM Post-training
- Supervised Fine-tuning
- Reinforcement Learning
- On-Policy Distillation (OPD)
- Qwen3-8B
- Model Alignment
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.