SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

A controlled study investigated the impact of data overlap between Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) stages on autoformalization models. Using a Qwen3-8B model, researchers evaluated six training configurations for Lean 4 autoformalization, varying the overlap of GRPO prompts with the SFT corpus at 0%, 30%, and 100%. The study found that lower overlap consistently led to higher compilation and semantic accuracy on Gaokao-Formal and PutnamBench. Specifically, 0% overlap yielded a 10.4 percentage-point semantic gain over SFT alone on Gaokao, while 100% overlap rendered the GRPO stage redundant. The research also highlighted significant compile-semantic gaps, exceeding 30 percentage points for high-compiling models, which are undetectable with compile-only evaluation.

Key takeaway

For AI Engineers developing autoformalization systems, explicitly managing data overlap between SFT and GRPO stages is critical. You should prioritize constructing disjoint data pools for GRPO training when possible, as this strategy consistently improves both compilation and semantic accuracy without additional computational cost. Furthermore, always employ dual-metric evaluation (compile pass@k and semantic pass@k) to avoid overlooking significant semantic inaccuracies in high-compiling models.
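One way to treat overlap as an explicit hyperparameter is to sample the GRPO prompt pool from two sources: prompts already in the SFT corpus and prompts held out from it. The sketch below is illustrative only. The function names, the exact-string notion of "same prompt", and the fixed pool size are assumptions, not details taken from the study.

```python
import random


def build_grpo_pool(sft_prompts, fresh_prompts, pool_size, overlap, seed=0):
    """Sample a GRPO prompt pool where a fraction `overlap` also occurs in the SFT corpus.

    Assumption: two prompts are "the same" if their strings match exactly.
    A real pipeline would likely need near-duplicate detection as well.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    sft_set = set(sft_prompts)
    # Deduplicate and sort so sampling is reproducible for a given seed.
    sft_unique = sorted(sft_set)
    # Drop any "fresh" prompt that already appears in the SFT corpus.
    fresh_unique = sorted(set(fresh_prompts) - sft_set)

    n_shared = round(pool_size * overlap)
    n_fresh = pool_size - n_shared
    if n_shared > len(sft_unique) or n_fresh > len(fresh_unique):
        raise ValueError("not enough prompts to build the requested pool")

    rng = random.Random(seed)
    pool = rng.sample(sft_unique, n_shared) + rng.sample(fresh_unique, n_fresh)
    rng.shuffle(pool)
    return pool


def measured_overlap(pool, sft_prompts):
    """Fraction of pool prompts that also appear in the SFT corpus."""
    sft_set = set(sft_prompts)
    return sum(p in sft_set for p in pool) / len(pool)
```

Calling `build_grpo_pool(sft, fresh, pool_size=50, overlap=0.0)` yields a fully disjoint pool, while `overlap=0.3` and `overlap=1.0` correspond to the other two settings the summary mentions. Checking the result with `measured_overlap` guards against accidental leakage between the pools.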

Key insights

Disjoint SFT and GRPO data pools significantly enhance autoformalization model performance without extra compute.

Principles

Accuracy improves monotonically as SFT-GRPO overlap decreases (100% to 30% to 0%).
Full data overlap makes GRPO redundant.
Compilation alone is insufficient for quality assessment.

Method

The study used a dual-stage reward function for GRPO, combining a compiler gate with a continuous semantic judge (Gemini Flash 3) to provide indirect semantic supervision for Lean 4 autoformalization.
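A gated reward of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: `compiles` and `judge_score` are hypothetical callables standing in for a Lean 4 compiler check and an LLM judge, and the zero reward for non-compiling output and the [0, 1] score range are assumed values.

```python
from typing import Callable


def gated_reward(
    problem: str,
    candidate: str,
    compiles: Callable[[str], bool],
    judge_score: Callable[[str, str], float],
) -> float:
    """Two-stage reward: a compiler gate, then a continuous semantic score.

    `compiles` and `judge_score` are hypothetical stand-ins supplied by the
    caller; the gate value (0.0) and the [0, 1] range are assumptions.
    """
    # Stage 1: a candidate that does not compile earns no reward,
    # and the (more expensive) judge is never called for it.
    if not compiles(candidate):
        return 0.0
    # Stage 2: the judge rates how faithfully the formal statement
    # captures the informal problem; clamp to guard against stray values.
    score = judge_score(problem, candidate)
    return min(1.0, max(0.0, score))
```

Because the judge is only consulted for candidates that pass the gate, the semantic signal reaches the policy indirectly through the reward rather than through labeled formalizations.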

In practice

Partition SFT and GRPO data pools for better results.
Use dual-metric evaluation (compile + semantic pass@k).
Implement "answer injection" for RL training on SFT datasets.
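Dual-metric evaluation can be sketched as below. The summary does not say which pass@k estimator the study used, so this sketch assumes the unbiased estimator that is common in code-generation evaluation, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. The data layout and function names are illustrative.

```python
from math import comb
from statistics import fmean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def dual_metric_report(results, k):
    """Compute compile pass@k, semantic pass@k, and the gap between them.

    `results` holds one list per problem; each entry is a pair
    (compiled, semantically_correct) for one sampled formalization.
    """
    compile_scores, semantic_scores = [], []
    for samples in results:
        n = len(samples)
        n_compiled = sum(1 for compiled, _ in samples if compiled)
        # A sample counts as semantically correct only if it also compiles.
        n_semantic = sum(1 for compiled, ok in samples if compiled and ok)
        compile_scores.append(pass_at_k(n, n_compiled, k))
        semantic_scores.append(pass_at_k(n, n_semantic, k))
    compile_k = fmean(compile_scores)
    semantic_k = fmean(semantic_scores)
    return {
        "compile_pass@k": compile_k,
        "semantic_pass@k": semantic_k,
        "gap": compile_k - semantic_k,
    }
```

Reporting the gap alongside both metrics makes visible the case the summary warns about: a model whose outputs almost always compile but often formalize the wrong statement.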

Topics

SFT-GRPO Data Overlap
Autoformalization
Lean 4 Formalization
Group Relative Policy Optimization
Semantic Pass@k

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.