SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
Summary
A controlled study examined how data overlap between the Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) stages affects autoformalization models. Using Qwen3-8B, the researchers evaluated six training configurations for Lean 4 autoformalization, setting the share of GRPO prompts drawn from the SFT corpus to 0%, 30%, or 100%. Lower overlap consistently produced higher compilation and semantic accuracy on Gaokao-Formal and PutnamBench. On Gaokao, 0% overlap yielded a 10.4 percentage-point semantic gain over SFT alone, while 100% overlap rendered the GRPO stage redundant. The study also found compile-semantic gaps exceeding 30 percentage points for high-compiling models, which compile-only evaluation cannot detect.
Key takeaway
For AI Engineers developing autoformalization systems, explicitly managing data overlap between SFT and GRPO stages is critical. You should prioritize constructing disjoint data pools for GRPO training when possible, as this strategy consistently improves both compilation and semantic accuracy without additional computational cost. Furthermore, always employ dual-metric evaluation (compile pass@k and semantic pass@k) to avoid overlooking significant semantic inaccuracies in high-compiling models.
Key insights
Disjoint SFT and GRPO data pools significantly enhance autoformalization model performance without extra compute.
Principles
- Across the tested overlap levels (0%, 30%, 100%), accuracy improves monotonically as overlap decreases.
- Full data overlap makes GRPO redundant.
- Compilation alone is insufficient for quality assessment.
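The last principle can be illustrated with a small sketch that reports compile pass@k, semantic pass@k, and the gap between them. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k); the summary does not say which estimator the study used, and the function names and per-problem counts here are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def dual_metric(results, k):
    """Average compile and semantic pass@k over problems.

    results: list of (n_samples, n_compiled, n_semantically_correct).
    A sample only counts as semantically correct if it also compiles,
    so n_semantically_correct must not exceed n_compiled (assumption).
    """
    for n, compiled, semantic in results:
        if semantic > compiled:
            raise ValueError("semantic count cannot exceed compiled count")
    compile_rate = sum(pass_at_k(n, c, k) for n, c, _ in results) / len(results)
    semantic_rate = sum(pass_at_k(n, s, k) for n, _, s in results) / len(results)
    return {
        "compile": compile_rate,
        "semantic": semantic_rate,
        "gap": compile_rate - semantic_rate,
    }
```

Reporting the gap alongside both rates makes it visible when a model compiles often but formalizes the wrong statement.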
Method
The study used a dual-stage reward function for GRPO, combining a compiler gate with a continuous semantic judge (Gemini Flash 3) to provide indirect semantic supervision for Lean 4 autoformalization.
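A minimal sketch of such a gated reward follows. The summary does not specify how the two stages are combined, so this assumes a non-compiling candidate gets zero reward and a compiling one gets the judge's score clamped to [0, 1]. The `compiles` and `judge_score` callables are hypothetical stand-ins for a Lean 4 compiler check and an LLM judge call. The second function shows the group-relative normalization that gives GRPO its name, applied to the rewards of one prompt's sampled completions.

```python
from statistics import mean, pstdev
from typing import Callable, List


def dual_stage_reward(
    candidate: str,
    compiles: Callable[[str], bool],      # hypothetical: Lean 4 compile check
    judge_score: Callable[[str], float],  # hypothetical: semantic judge score
) -> float:
    """Compiler gate followed by a continuous semantic score."""
    if not compiles(candidate):
        # Stage 1: candidates that fail to compile never reach the judge.
        return 0.0
    # Stage 2: clamp the judge's score to [0, 1] (assumed score range).
    return min(1.0, max(0.0, judge_score(candidate)))


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one prompt's group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this assumed scheme, a group in which every sample fails to compile yields all-zero advantages and contributes no learning signal, which is one reason a continuous semantic score can be useful on top of the binary gate.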
In practice
- Partition SFT and GRPO data pools for better results.
- Use dual-metric evaluation (compile + semantic pass@k).
- Implement "answer injection" for RL training on SFT datasets.
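The first recommendation can be sketched as a helper that builds a GRPO prompt pool with a chosen overlap against the SFT corpus. This is an illustration under stated assumptions, not the study's procedure: overlap is measured by exact string match on prompts (near-duplicates are not detected), and the function names are hypothetical.

```python
import random


def build_grpo_prompts(sft_prompts, fresh_prompts, overlap, size, seed=0):
    """Sample `size` GRPO prompts, a fraction `overlap` of them from the SFT corpus."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    sft_set = set(sft_prompts)
    sft_unique = sorted(sft_set)
    # Drop any "fresh" prompt that already appears in the SFT corpus.
    fresh_unique = sorted(set(fresh_prompts) - sft_set)
    n_overlap = round(overlap * size)
    n_fresh = size - n_overlap
    if n_overlap > len(sft_unique) or n_fresh > len(fresh_unique):
        raise ValueError("not enough prompts for the requested size and overlap")
    rng = random.Random(seed)  # fixed seed for a reproducible split
    return rng.sample(sft_unique, n_overlap) + rng.sample(fresh_unique, n_fresh)


def measured_overlap(grpo_prompts, sft_prompts):
    """Fraction of GRPO prompts that also appear in the SFT corpus."""
    sft_set = set(sft_prompts)
    return sum(p in sft_set for p in grpo_prompts) / len(grpo_prompts)
```

Calling `build_grpo_prompts(sft, fresh, overlap=0.0, size=n)` gives a fully disjoint pool, and `measured_overlap` can serve as a sanity check to log before GRPO training starts.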
Topics
- SFT-GRPO Data Overlap
- Autoformalization
- Lean 4 Formalization
- Group Relative Policy Optimization
- Semantic Pass@k
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.