LuminaSFT: Generating Synthetic Fine-Tuning Data for Small Language Models

2026-02-24 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

LuminaSFT is a supervised fine-tuning (SFT) dataset designed to improve small language models (SLMs), presented on February 24, 2026. The research explores two primary methods: regenerating existing SFT data with a stronger teacher model such as DeepSeek-V3, and generating new task-specific data. In the reported experiments, data regeneration improves average performance by up to roughly 4% across tasks such as MMLU and GSM8k, with a maximum improvement of roughly 7% on GSM8k for Instella-3B-base. Task-specific data generation yields larger gains, boosting performance by up to roughly 41% on reading comprehension tasks such as DROP for Llama-1B. The study also shows that useful data can be generated from scratch for educational QA tasks using detailed prompts and multi-step pipelines, for an average gain of up to roughly 2.4%.

Key takeaway

For NLP engineers developing or deploying small language models, LuminaSFT's methods offer a practical way to improve performance. Regenerating existing SFT data with a stronger teacher model gave moderate improvements in the reported experiments, while generating task-specific datasets gave substantially larger gains, particularly for specialized tasks such as reading comprehension. Even without seed data, detailed prompts and multi-step pipelines can bootstrap useful training data for a targeted task.

Key insights

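Synthetic data regeneration and generation both improve small language model performance. The largest reported gain (up to roughly 41% on DROP for Llama-1B) comes from task-specific data, compared with up to roughly 4% on average from regeneration.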

Principles

Teacher model strength impacts data regeneration efficacy.
Task-specific data yields greater performance gains.
Detailed prompts enable data generation without seed data.

Method

The method involves regenerating SFT data using a stronger teacher model (e.g., DeepSeek-V3) or generating task-specific data from seed datasets or detailed prompts, followed by fine-tuning SLMs.
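The regeneration step can be sketched as follows. This is a minimal illustration, not the pipeline from the original article: it assumes that "regenerating" means keeping each prompt and replacing its response with one written by the teacher. The record layout (`prompt`/`response` keys), the function `regenerate_sft_data`, and `stub_teacher` are all hypothetical names. The stub stands in for a real call to a teacher model such as DeepSeek-V3, so the example runs without any model or network access.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]  # assumed record layout: {"prompt": ..., "response": ...}


def regenerate_sft_data(
    examples: List[Example],
    teacher: Callable[[str], str],
) -> List[Example]:
    """Keep each prompt and replace its response with the teacher's output.

    If the teacher returns an empty string, the original example is kept
    unchanged so that no prompt is silently dropped.
    """
    regenerated: List[Example] = []
    for example in examples:
        new_response = teacher(example["prompt"]).strip()
        if new_response:
            regenerated.append(
                {"prompt": example["prompt"], "response": new_response}
            )
        else:
            regenerated.append(dict(example))
    return regenerated


def stub_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a call to a stronger teacher model."""
    return f"[teacher answer to: {prompt}]"


seed_data = [
    {"prompt": "What is 12 * 7?", "response": "84"},
    {"prompt": "Name the capital of France.", "response": "paris"},
]
regenerated_data = regenerate_sft_data(seed_data, stub_teacher)
```

In a real setup, `stub_teacher` would be replaced by a function that queries the teacher model, and the regenerated records would then be used to fine-tune the SLM.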

In practice

Use a stronger teacher model such as DeepSeek-V3 for SFT data regeneration.
Prioritize task-specific data generation for large gains.
Employ multi-step generation for educational QA tasks.
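As an illustration of multi-step generation without seed data, the sketch below chains three prompts: list subtopics, write a question per subtopic, then answer each question. The summary does not specify the actual steps or prompts used for LuminaSFT, so the stage breakdown, the prompt wording, and the names `build_qa_dataset` and `stub_generate` are assumptions made for this example. The stub generator returns fixed text so the code runs without a model.

```python
from typing import Callable, Dict, List


def build_qa_dataset(
    subject: str,
    generate: Callable[[str], str],
    n_topics: int = 3,
) -> List[Dict[str, str]]:
    """Three-stage pipeline: subtopics -> questions -> answers."""
    # Stage 1: ask for subtopics, one per line, and keep at most n_topics.
    topics_text = generate(
        f"List {n_topics} distinct subtopics of {subject}, one per line."
    )
    topics = [t.strip() for t in topics_text.splitlines() if t.strip()]
    topics = topics[:n_topics]

    dataset: List[Dict[str, str]] = []
    for topic in topics:
        # Stage 2: one exam-style question per subtopic.
        question = generate(
            f"Write one exam-style question about {topic} in {subject}."
        ).strip()
        # Stage 3: a worked answer for that question.
        answer = generate(
            f"Answer the following question step by step:\n{question}"
        ).strip()
        dataset.append({"topic": topic, "prompt": question, "response": answer})
    return dataset


def stub_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns fixed placeholder text."""
    if prompt.startswith("List"):
        return "photosynthesis\ncell division\ngenetics\n"
    if prompt.startswith("Write"):
        return "Placeholder question?"
    return "Placeholder answer."


qa_data = build_qa_dataset("high-school biology", stub_generate)
```

Splitting generation into stages keeps each prompt narrow and makes it possible to filter or deduplicate between stages, for example dropping repeated subtopics before any questions are written.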

Topics

Small Language Models
Supervised Fine-Tuning
Synthetic Data Generation
Teacher-Student Learning
AMD Instinct GPUs

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.