Weak-to-Strong Elicitation via Mismatched Wrong Drafts
Summary
The study investigates whether injecting mathematically "wrong" drafts from a smaller, domain-trained model into a stronger learner's GRPO context can elicit capabilities that standard on-policy RL fine-tuning does not reach. The experiments use Mathstral-7B as the learner and Qwen2.5-Math-1.5B as the draft model, and train on 8.8K Level 3–5 MATH problems. The "mismatched-wrong" variant consistently outperformed standard on-policy GRPO on held-out MATH-500 and on out-of-distribution AIME 2025/2026. Shuffling wrong drafts to mismatched problems yielded a +1.62 percentage point (pp) gain on MATH-500 (greedy pass@1) over the matched-wrong variant. On AIME 2025/2026, it was the only variant that lifted pass@k above both Mathstral-7B and Qwen2.5-Math-1.5B at every sample budget from k=1 to k=1024, gaining +14.2pp on 2025 and +9.0pp on 2026 at pass@1024 over Mathstral-7B. The recipe, trained on a single GPU without SFT, reward models, or synthesized data, reached 71.98% on MATH-500, above the 70.9% that the WizardMath pipeline reports on full MATH.
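The pass@k figures above depend on how pass@k is estimated. This summary does not say which estimator the study used; the sketch below shows the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), as an assumption. The function name and arguments are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Assumption: this is the standard unbiased estimator, not necessarily
    the one used in the study.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no
    # correct solution is C(n - c, k) / C(n, k); comb returns 0 when
    # k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark score is then the mean of this value over all problems. A gain reported in percentage points is the difference between two such means, multiplied by 100.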
Key takeaway
Research scientists exploring advanced LLM fine-tuning should consider integrating "mismatched-wrong" draft injection into their on-policy RL workflows. In this study the technique expanded reasoning coverage and reached higher benchmark scores with a simpler setup, which challenges the notion that on-policy RL only sharpens existing capabilities. The approach aims to elicit latent knowledge while closing off common optimization shortcuts such as copying or anchoring.
Key insights
Injecting mismatched, mathematically "wrong" drafts into a strong learner's context expands its reasoning capabilities beyond standard on-policy RL.
Principles
- Mismatched wrong drafts act as off-policy explorers.
- Closing optimization shortcuts forces intrinsic reasoning.
- On-policy RL can expand, not just sharpen, capabilities.
Method
Train a strong LLM (Mathstral-7B) with Dr. GRPO on prompts augmented with mathematically wrong drafts. The drafts come from a weaker, domain-trained model (Qwen2.5-Math-1.5B) and are randomly shuffled so that each one is attached to a different problem from the one it was written for.
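The shuffling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the prompt template, the function names, and the choice to forbid any draft from staying with its own problem (enforced here with Sattolo's algorithm) are all hypothetical.

```python
import random

# Hypothetical prompt layout; the study's actual template is not given here.
PROMPT_TEMPLATE = (
    "Problem:\n{problem}\n\n"
    "Draft from another attempt:\n{draft}\n\n"
    "Solution:"
)


def mismatch_permutation(n, seed=0):
    """Return a permutation perm of range(n) with perm[i] != i for all i."""
    if n < 2:
        raise ValueError("need at least two problems to mismatch drafts")
    rng = random.Random(seed)
    perm = list(range(n))
    # Sattolo's algorithm produces a single n-cycle, so no index maps to
    # itself. It samples cyclic permutations, not all derangements.
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def build_mismatched_prompts(problems, wrong_drafts, seed=0):
    """Pair each problem with a wrong draft written for a different problem."""
    if len(problems) != len(wrong_drafts):
        raise ValueError("problems and wrong_drafts must have equal length")
    perm = mismatch_permutation(len(problems), seed)
    return [
        PROMPT_TEMPLATE.format(problem=problem, draft=wrong_drafts[perm[i]])
        for i, problem in enumerate(problems)
    ]
```

A plain `random.shuffle` would also mismatch most drafts, but it can leave a few attached to their original problems. Whether the study excludes those cases is not stated in this summary.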
In practice
- Use a weaker model's "wrong" outputs as contextual probes.
- Randomly permute drafts to different problems for training.
- Focus on outcome-only reward for initial training efficiency.
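The outcome-only reward and the group-relative advantage can be sketched as below. This assumes the usual description of Dr. GRPO, in which each reward is centered on its group mean without dividing by the group standard deviation. The exact-match answer check and both function names are simplifications introduced for illustration; a real math grader would normally test answer equivalence rather than string equality.

```python
def outcome_reward(predicted_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Assumption: exact string match after trimming whitespace. Intermediate
    reasoning steps are not scored at all (outcome-only).
    """
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def dr_grpo_advantages(rewards):
    """Center each reward on the mean of its sampling group.

    Assumption: following the usual description of Dr. GRPO, there is no
    division by the group's standard deviation.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

With four samples for one prompt, two of them correct, the rewards `[1.0, 0.0, 0.0, 1.0]` give advantages `[0.5, -0.5, -0.5, 0.5]`. A group whose samples are all correct or all wrong yields zero advantages and so contributes no gradient signal.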
Topics
- Weak-to-Strong Elicitation
- Mismatched Wrong Drafts
- GRPO
- Mathstral-7B
- Mathematical Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.