NebulaExp-8B: An Empirical Post-Training Pipeline via Full-Scale Ablation Research

2026-06-25 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

NebulaExp-8B introduces a fully transparent, ablation-driven post-training pipeline built on Qwen3-8B-base, designed to improve a large language model's reasoning and human-preference alignment. The pipeline starts from a raw corpus of 3.84M multi-source SFT samples and a pool of 200K verifiable RL candidates, processed through an end-to-end data processing stack. For the Instruct branch, a three-stage supervised fine-tuning approach, NebulaExp-Ins-SFT, improved Qwen3-8B-nothink's average benchmark score from 55.01 to 60.99, and GRPO reinforcement learning raised it further to 61.85. For the Reasoning branch, medium-difficulty GRPO RL lifted the average reasoning score from 73.88 to 75.17. The work also explores single-teacher OPD and multi-teacher OPD (MOPD), showing that OPD with 4K samples outperforms RL by 3.26 points on IFEval and achieves an average overall gain of +4.43, while MOPD with 10K samples lifts performance by 4.18 points. The research provides a reproducible recipe for 8B-scale LLMs along with an analysis of capability trade-offs.
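The size of each reported gain can be checked with simple arithmetic. The sketch below uses only the scores quoted in the summary; the variable names are illustrative.

```python
# Average benchmark scores quoted in the summary (Instruct branch).
ins_before_sft = 55.01
ins_after_sft = 60.99
ins_after_grpo = 61.85

# Average reasoning scores quoted in the summary (Reasoning branch).
reasoning_before_rl = 73.88
reasoning_after_rl = 75.17

# Round to two decimals to match the precision of the reported scores.
sft_gain = round(ins_after_sft - ins_before_sft, 2)
grpo_gain = round(ins_after_grpo - ins_after_sft, 2)
total_ins_gain = round(ins_after_grpo - ins_before_sft, 2)
reasoning_gain = round(reasoning_after_rl - reasoning_before_rl, 2)

print(f"Instruct, SFT stage:   +{sft_gain:.2f}")
print(f"Instruct, GRPO stage:  +{grpo_gain:.2f}")
print(f"Instruct, total:       +{total_ins_gain:.2f}")
print(f"Reasoning, GRPO stage: +{reasoning_gain:.2f}")
```

On these figures, most of the Instruct-branch improvement comes from the SFT stage (+5.98), with GRPO adding a further +0.86.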

Key takeaway

For machine learning engineers optimizing 8B-scale LLMs, this research offers a transparent, reproducible post-training blueprint. Consider a multi-stage supervised fine-tuning approach followed by GRPO reinforcement learning to improve instruction adherence. If RL's dependence on verifiers is a bottleneck, investigate single-teacher or multi-teacher OPD as an alternative; as few as 4K instruction-following samples produced significant gains in this work.

Key insights

An ablation-driven pipeline for 8B LLMs improves instruction following and reasoning via structured SFT, RL, and OPD/MOPD.

Principles

Transparent post-training enhances reproducibility.
Multi-stage SFT improves base model performance.
OPD/MOPD offers an alternative to RL for instruction following.

Method

Curate multi-source SFT and RL data, applying distillation and multi-dimensional filtering. For the Instruct model, run three-stage SFT followed by GRPO RL. For the Reasoning model, apply GRPO RL on medium-difficulty prompts. Investigate single-teacher or multi-teacher OPD as an alternative to RL.
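GRPO (Group Relative Policy Optimization) scores each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative advantage step, assuming binary verifier rewards and a population standard deviation. The function name and the `eps` constant are illustrative and are not taken from the NebulaExp-8B pipeline.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt; 1.0 means the verifier accepted it.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response received the same reward.
uniform_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed_group)
print(uniform_group)
```

When every response in a group receives the same reward, all advantages are zero and the prompt contributes no learning signal. That is one plausible motivation for training on medium-difficulty prompts, where a group is likely to contain both accepted and rejected responses.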

In practice

Start from a large multi-source SFT corpus (3.84M raw samples in this work) for 8B LLM alignment.
Apply multi-dimensional filtering to improve data quality.
Consider OPD with 4K samples as an alternative to RL.
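Multi-dimensional filtering can be pictured as a chain of independent checks that each sample must pass. The sketch below is purely illustrative: the source does not list its filtering dimensions, so the three used here (non-empty and length-bounded responses, a quality score, and prompt deduplication), the `Sample` fields, and all thresholds are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    response: str
    quality: float  # hypothetical judge score in [0, 1]


def passes_filters(sample, min_quality=0.7, max_chars=8000):
    """Apply per-sample checks; thresholds are assumed, not from the source."""
    if not sample.response.strip():
        return False  # drop empty responses
    if len(sample.response) > max_chars:
        return False  # drop overly long responses
    return sample.quality >= min_quality


def filter_corpus(samples):
    """Keep samples that pass every check and have a not-yet-seen prompt."""
    seen_prompts = set()
    kept = []
    for s in samples:
        # Normalize case and whitespace so near-identical prompts collide.
        key = " ".join(s.prompt.lower().split())
        if key in seen_prompts or not passes_filters(s):
            continue
        seen_prompts.add(key)
        kept.append(s)
    return kept


samples = [
    Sample("What is 2+2?", "4", 0.9),
    Sample("what is  2+2?", "Four", 0.95),  # duplicate prompt after normalization
    Sample("Explain GRPO", "", 0.9),        # empty response
    Sample("Define SFT", "Supervised fine-tuning.", 0.3),  # low quality score
]
kept = filter_corpus(samples)
print(len(kept), kept[0].response)
```

A prompt is marked as seen only when its sample is kept, so a rejected sample does not block a later, better sample with the same prompt.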

Topics

LLM Post-training
Supervised Fine-tuning
Reinforcement Learning
On-Policy Distillation
Qwen3-8B
Model Alignment

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.