Weak-to-Strong Elicitation via Mismatched Wrong Drafts

2026-05-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

The study asks whether injecting mathematically "wrong" drafts from a smaller, domain-trained model into a stronger learner's GRPO context can elicit capabilities that standard on-policy RL fine-tuning does not reach. With Mathstral-7B as the learner and Qwen2.5-Math-1.5B as the draft model, and training on 8.8K Level 3–5 MATH problems, the "mismatched-wrong" variant consistently outperformed standard on-policy GRPO on held-out MATH-500 and on out-of-distribution AIME 2025/2026. Shuffling wrong drafts onto mismatched problems yielded a +1.62 percentage point (pp) gain on MATH-500 (greedy pass@1) over matched-wrong variants. On AIME 2025/2026, it was the only method that lifted pass@k above both Mathstral-7B and Qwen2.5-Math-1.5B at every sample budget from k=1 to k=1024, reaching +14.2pp on 2025 and +9.0pp on 2026 at pass@1024 over Mathstral-7B. The recipe, trained on a single GPU without SFT, reward models, or synthesized data, reached 71.98% on MATH-500, above the 70.9% that the WizardMath pipeline reports on full MATH.
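The pass@k figures above depend on how pass@k is estimated from a finite number of samples. The summary does not say which estimator the authors used; the sketch below shows the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), as an assumption. The function name `pass_at_k` and its parameters are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Assumption: this is the standard unbiased estimator, not necessarily
    the exact procedure used in the study.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct answer out of four samples.
    print(pass_at_k(n=4, c=1, k=1))
    print(pass_at_k(n=4, c=1, k=2))
```

With one correct sample out of four, the estimate rises from 0.25 at k=1 to 0.5 at k=2, which is why coverage at large k can grow even when pass@1 barely moves.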

Key takeaway

Research scientists exploring advanced LLM fine-tuning should consider integrating "mismatched-wrong" draft injection into their on-policy RL workflows. In this study the technique expanded reasoning coverage and reached higher benchmark scores with a simpler setup, which challenges the notion that on-policy RL only sharpens existing capabilities. Experiment with it to elicit latent knowledge while closing off common optimization shortcuts such as copying or anchoring.

Key insights

Injecting mismatched, mathematically "wrong" drafts into a strong learner's context expands its reasoning capabilities beyond standard on-policy RL.

Principles

Mismatched wrong drafts act as off-policy explorers.
Closing optimization shortcuts forces intrinsic reasoning.
On-policy RL can expand, not just sharpen, capabilities.

Method

Train a strong LLM (Mathstral-7B) with Dr. GRPO, augmenting its prompts with mathematically wrong drafts that come from a weaker, domain-trained model (Qwen2.5-Math-1.5B) and are randomly shuffled onto different problems.
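The shuffling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes one wrong draft per problem, it guarantees that no draft stays on its own problem (the summary only says drafts are "randomly shuffled to different problems"), and the prompt template in `build_prompt` is hypothetical.

```python
import random


def mismatch_drafts(drafts: list[str], seed: int = 0) -> list[str]:
    """Return drafts reassigned so that position i never keeps drafts[i].

    Assumption: a random cyclic reassignment, which rules out fixed points.
    """
    n = len(drafts)
    if n < 2:
        raise ValueError("need at least two drafts to mismatch them")
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    shuffled = [""] * n
    for pos, idx in enumerate(order):
        # Problem `idx` receives the draft of the next problem in the random cycle.
        donor = order[(pos + 1) % n]
        shuffled[idx] = drafts[donor]
    return shuffled


def build_prompt(problem: str, draft: str) -> str:
    """Hypothetical template placing a foreign wrong draft before the problem."""
    return f"Draft (may be wrong or unrelated):\n{draft}\n\nProblem:\n{problem}\n\nSolution:"


if __name__ == "__main__":
    problems = ["p0", "p1", "p2", "p3"]
    wrong_drafts = ["d0", "d1", "d2", "d3"]  # d_i is a wrong draft written for p_i
    mismatched = mismatch_drafts(wrong_drafts, seed=1)
    prompts = [build_prompt(p, d) for p, d in zip(problems, mismatched)]
    print(prompts[0])
```

Using a single random cycle keeps every draft in play exactly once while ensuring that the learner cannot benefit from copying a draft written for the problem it is solving.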

In practice

Use a weaker model's "wrong" outputs as contextual probes.
Randomly permute drafts to different problems for training.
Focus on outcome-only reward for initial training efficiency.
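To make the outcome-only reward concrete, the sketch below scores each sampled completion by its final answer alone and then computes group-relative advantages. Two assumptions are made: answers have already been extracted as strings and are compared by exact match, and the advantage is the reward minus the group mean with no standard-deviation scaling, which is our understanding of the Dr. GRPO variant. The function names are illustrative.

```python
def outcome_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Assumption: exact string match after trimming whitespace; a real
    pipeline would likely use a math-aware equivalence check.
    """
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_advantages(rewards: list[float]) -> list[float]:
    """Center rewards on the group mean for one prompt's sampled completions.

    Assumption: no division by the group standard deviation.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    reference = "42"
    sampled_answers = ["42", "41", "7", " 42 "]  # final answers from one group
    rewards = [outcome_reward(a, reference) for a in sampled_answers]
    print(rewards)
    print(group_advantages(rewards))
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, so the reward only helps on problems where the group is split between right and wrong answers.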

Topics

Weak-to-Strong Elicitation
Mismatched Wrong Drafts
GRPO
Mathstral-7B
Mathematical Reasoning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.