SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
Summary
A study investigates how data overlap between the Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) stages affects the performance of Qwen3-8B on Lean 4 autoformalization. The researchers evaluated six training configurations, including SFT-only, GRPO-only, and SFT+GRPO with 0%, 30%, and 100% overlap between the SFT and GRPO prompt corpora. Lower overlap consistently improved performance: the 0% overlap configuration yielded a 10.4 percentage point semantic gain over SFT alone on Gaokao-Formal, while 100% overlap rendered the GRPO stage redundant, showing no improvement. The study also reports compile-semantic gaps exceeding 30 percentage points for the top-compiling models, a disparity that compile-only benchmarks would miss. It is described as the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter.
Key takeaway
Research scientists optimizing post-training recipes for autoformalization models should prioritize minimizing data overlap between the SFT and GRPO stages. A 0% overlap configuration can yield substantial semantic accuracy gains, as shown by the 10.4 percentage point improvement over SFT alone on Gaokao-Formal. Evaluation should also report both compile pass and semantic pass metrics, since compile-only benchmarks can conceal large gaps between code that compiles and code that is semantically correct.
Key insights
Minimizing SFT-GRPO data overlap significantly improves autoformalization model performance and semantic accuracy.
Principles
- Disjoint SFT and GRPO data outperforms full overlap.
- Lower data overlap correlates with higher accuracy.
- Compile-only benchmarks can obscure semantic gaps.
Method
The study conducted a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B on Lean 4 autoformalization under six conditions, differing solely in training recipe and data sharing between stages.
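One way to set up such an ablation is to draw the SFT and GRPO prompt sets from a single pool while controlling how many GRPO prompts were already seen during SFT. The sketch below is illustrative only: the function name `split_with_overlap`, its parameters, and the definition of overlap as the fraction of GRPO prompts that also appear in the SFT set are assumptions, not details taken from the study.

```python
import random


def split_with_overlap(prompts, sft_size, grpo_size, overlap, seed=0):
    """Return (sft, grpo) prompt lists with a controlled overlap.

    `overlap` is the fraction of GRPO prompts that also appear in the SFT
    set (an assumed definition; the study may define it differently).
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")

    rng = random.Random(seed)
    pool = list(dict.fromkeys(prompts))  # de-duplicate, keep order
    rng.shuffle(pool)

    n_shared = round(overlap * grpo_size)  # GRPO prompts reused from SFT
    n_fresh = grpo_size - n_shared         # GRPO prompts unseen in SFT
    if n_shared > sft_size or sft_size + n_fresh > len(pool):
        raise ValueError("not enough prompts for the requested sizes")

    sft = pool[:sft_size]
    fresh = pool[sft_size:sft_size + n_fresh]
    shared = rng.sample(sft, n_shared)

    grpo = shared + fresh
    rng.shuffle(grpo)
    return sft, grpo


if __name__ == "__main__":
    prompts = [f"problem-{i}" for i in range(1000)]  # placeholder prompts
    for overlap in (0.0, 0.3, 1.0):
        sft, grpo = split_with_overlap(prompts, 400, 200, overlap)
        print(overlap, len(set(sft) & set(grpo)))
```

Fixing the seed and the set sizes across conditions keeps overlap as the only variable that changes between the SFT+GRPO runs.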
In practice
- Keep SFT and GRPO data disjoint for better results.
- Use dual-metric evaluation to reveal semantic gaps.
- Consider data overlap a critical hyperparameter.
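The dual-metric recommendation can be sketched as a small report that computes compile pass, semantic pass, and the gap between them in percentage points. The `Result` record and its field names are hypothetical, and the sketch assumes a formalization counts as a semantic pass only if it also compiles; how semantic correctness is judged is outside its scope.

```python
from dataclasses import dataclass


@dataclass
class Result:
    compiles: bool               # Lean 4 accepted the formal statement
    semantically_correct: bool   # statement matches the informal problem


def dual_metric_report(results):
    """Return compile pass, semantic pass, and their gap, all in percent."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to evaluate")

    compile_pass = sum(r.compiles for r in results)
    # Assumption: a semantic pass requires that the output also compiles.
    semantic_pass = sum(r.compiles and r.semantically_correct for r in results)

    compile_rate = 100 * compile_pass / n
    semantic_rate = 100 * semantic_pass / n
    return {
        "compile_pass": compile_rate,
        "semantic_pass": semantic_rate,
        "gap_pp": compile_rate - semantic_rate,
    }


if __name__ == "__main__":
    # Toy data, not results from the study.
    toy = (
        [Result(True, True)] * 5
        + [Result(True, False)] * 3
        + [Result(False, False)] * 2
    )
    print(dual_metric_report(toy))
```

Reporting `gap_pp` alongside the two rates makes it visible when a model that compiles well is still producing statements that do not match the intended problem.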
Topics
- Supervised Fine-tuning
- Group Relative Policy Optimization
- Autoformalization
- Lean 4
- Data Overlap
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.