SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

2026-04-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

A study investigates how data overlap between the Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) stages affects the performance of the Qwen3-8B model on Lean 4 autoformalization. Researchers evaluated six training configurations, including SFT-only, GRPO-only, and SFT+GRPO with 0%, 30%, and 100% data overlap between the SFT and GRPO prompt corpora. The findings indicate that lower overlap consistently improves performance, with 0% overlap yielding a 10.4 percentage point semantic gain over SFT alone on Gaokao-Formal. Conversely, 100% overlap rendered the GRPO stage redundant, showing no improvement. The study also reports compile-semantic gaps exceeding 30 percentage points for the top-compiling models, gaps that compile-only benchmarks would miss. This is the first controlled investigation into SFT-GRPO data overlap as a post-training hyperparameter.

Key takeaway

Research scientists optimizing post-training recipes for autoformalization models should prioritize minimizing data overlap between the SFT and GRPO stages. A 0% overlap configuration can yield substantial semantic accuracy gains, as shown by the 10.4 percentage point improvement over SFT alone on Gaokao-Formal. Evaluations should also report both compile pass and semantic pass metrics, since compile-only benchmarks can conceal large performance disparities.

Key insights

Minimizing SFT-GRPO data overlap significantly improves autoformalization model performance and semantic accuracy.

Principles

Disjoint SFT and GRPO data outperforms full overlap.
Lower data overlap correlates with higher accuracy.
Compile-only benchmarks can obscure semantic gaps.

Method

The study conducted a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B on Lean 4 autoformalization under six conditions, differing solely in training recipe and data sharing between stages.
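As a rough illustration of how such an ablation could be set up, the sketch below builds a GRPO prompt set with a chosen overlap against the SFT prompt corpus. It is not the study's code. The function names, the seed handling, and the definition of overlap (the fraction of GRPO prompts that also appear in the SFT corpus) are assumptions made for this example; the summary does not say how the authors measured overlap.

```python
import random


def build_grpo_prompts(sft_prompts, fresh_prompts, overlap, size, seed=0):
    """Sample `size` GRPO prompts so that a fraction `overlap` comes from the SFT corpus.

    Assumption: overlap = (GRPO prompts also present in SFT) / (all GRPO prompts).
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie in [0, 1]")
    rng = random.Random(seed)

    n_shared = round(overlap * size)
    n_fresh = size - n_shared

    # Deduplicate while keeping order, and keep the fresh pool strictly disjoint from SFT.
    sft_unique = list(dict.fromkeys(sft_prompts))
    sft_set = set(sft_unique)
    unseen = [p for p in dict.fromkeys(fresh_prompts) if p not in sft_set]

    if n_shared > len(sft_unique) or n_fresh > len(unseen):
        raise ValueError("not enough prompts to reach the requested size and overlap")

    mixed = rng.sample(sft_unique, n_shared) + rng.sample(unseen, n_fresh)
    rng.shuffle(mixed)
    return mixed


def measured_overlap(sft_prompts, grpo_prompts):
    """Fraction of GRPO prompts that also occur in the SFT corpus."""
    sft_set = set(sft_prompts)
    return sum(p in sft_set for p in grpo_prompts) / len(grpo_prompts)


if __name__ == "__main__":
    # Placeholder prompt IDs, not real data.
    sft = [f"sft-{i}" for i in range(100)]
    fresh = [f"new-{i}" for i in range(100)]
    for target in (0.0, 0.3, 1.0):
        grpo = build_grpo_prompts(sft, fresh, overlap=target, size=50)
        print(target, measured_overlap(sft, grpo))
```

Holding the GRPO set size fixed across overlap levels, as this sketch does, keeps the comparison about which prompts are reused and not about how many are used. Whether the study controlled set size this way is not stated in the summary.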

In practice

Keep SFT and GRPO data disjoint for better results.
Use dual-metric evaluation to reveal semantic gaps.
Consider data overlap a critical hyperparameter.
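The dual-metric evaluation recommended above can be sketched as follows. This is a minimal example under stated assumptions, not the study's evaluation harness: the `Judged` record, its field names, and the rule that an output counts as a semantic pass only if it also compiles are all hypothetical choices made for illustration, and the numbers in the usage example are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judged:
    """One model output after checking (hypothetical record layout)."""
    compiles: bool                # did the Lean 4 statement type-check?
    semantically_faithful: bool   # did a judge accept it as matching the problem?


def dual_metrics(results):
    """Return compile pass, semantic pass, and their gap, all in percentage points."""
    n = len(results)
    if n == 0:
        raise ValueError("results must not be empty")
    compile_pass = 100 * sum(r.compiles for r in results) / n
    # Assumption: a semantic pass requires the output to compile as well.
    semantic_pass = 100 * sum(r.compiles and r.semantically_faithful for r in results) / n
    return {
        "compile_pass": compile_pass,
        "semantic_pass": semantic_pass,
        "gap_pp": compile_pass - semantic_pass,
    }


if __name__ == "__main__":
    # Toy data: 8 of 10 outputs compile, and 5 of those are judged faithful.
    toy = (
        [Judged(True, True)] * 5
        + [Judged(True, False)] * 3
        + [Judged(False, False)] * 2
    )
    print(dual_metrics(toy))
```

Reporting the gap alongside both rates makes it visible when a recipe raises compile pass without a matching rise in semantic pass, which a compile-only benchmark would not show.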

Topics

Supervised Fine-tuning
Group Relative Policy Optimization
Autoformalization
Lean 4
Data Overlap

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.