HF ML Club India EP1 | Lewis Tunstall | Teaching Tiny Models to Prove Hard Theorems
Summary
Hugging Face ML Club India hosted a talk by Lewis Tunstall, research lead on post-training LLMs at Hugging Face and co-developer of models like Zephyr and SmolLM3, introducing QED Nano. This model, developed with collaborators from CMU, ETH, and Numina, is post-trained specifically to prove mathematical theorems. The work was motivated by recent results from OpenAI and DeepMind, whose systems reached gold-medal level at the International Mathematical Olympiad using agentic harnesses. QED Nano's recipe has three parts: curating high-quality theorem-proving prompts, distilling proofs from powerful models like DeepSeekMath v2, and applying reinforcement learning with "rubrics" as rewards, using a mix of models from Gemini 3 to GPT-4. A key innovation is the "reasoning cache," a decoding algorithm developed by CMU collaborators that enables coherent reasoning over very long token chains (50,000-100,000 tokens). The project also explored how the final model interacts with various agentic scaffolds for mathematical reasoning.
Key takeaway
AI scientists and research scientists developing advanced reasoning models should consider combining rubric-based reinforcement learning with a reasoning cache. As demonstrated by QED Nano, this approach lets smaller models tackle complex, long-horizon problems like theorem proving: rubrics provide fine-grained feedback, and the reasoning cache maintains coherence over extensive reasoning chains. Teams should prioritize high-quality data curation, consider distilling from larger models, and explore asynchronous RL architectures to scale training efficiently on such tasks.
Key insights
QED Nano teaches tiny models to prove complex theorems using high-quality data, rubric-based RL, and a reasoning cache.
Principles
- High-quality data is paramount for effective post-training.
- Rubrics provide fine-grained feedback for complex, non-binary tasks.
- Asynchronous RL scales training for long-horizon problems.
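The asynchronous principle can be illustrated with a toy producer-consumer loop. This is only a sketch under stated assumptions, not the QED Nano training stack: the names `rollout_worker`, `trainer`, and `simulate` are hypothetical, generation is simulated with a sleep proportional to output length, and the "gradient step" is just collecting a batch.

```python
import asyncio
from typing import Dict, List


async def rollout_worker(worker_id: int, lengths: List[int], queue: asyncio.Queue) -> None:
    """Simulate generating rollouts whose cost grows with output length."""
    for n_tokens in lengths:
        await asyncio.sleep(n_tokens * 1e-4)  # stand-in for model generation time
        await queue.put({"worker": worker_id, "tokens": n_tokens})


async def trainer(queue: asyncio.Queue, total: int, batch_size: int) -> List[List[Dict]]:
    """Consume rollouts in completion order instead of waiting for a full wave."""
    batches: List[List[Dict]] = []
    batch: List[Dict] = []
    for _ in range(total):
        batch.append(await queue.get())
        if len(batch) == batch_size:
            batches.append(batch)  # a real trainer would take a gradient step here
            batch = []
    if batch:
        batches.append(batch)  # flush a final partial batch
    return batches


async def _run(workloads: List[List[int]], batch_size: int) -> List[List[Dict]]:
    queue: asyncio.Queue = asyncio.Queue()
    total = sum(len(w) for w in workloads)
    workers = [
        asyncio.create_task(rollout_worker(i, w, queue))
        for i, w in enumerate(workloads)
    ]
    batches = await trainer(queue, total, batch_size)
    await asyncio.gather(*workers)
    return batches


def simulate(workloads: List[List[int]], batch_size: int = 2) -> List[List[Dict]]:
    """Run the toy simulation; each inner list is one worker's output lengths."""
    return asyncio.run(_run(workloads, batch_size))
```

With workloads such as `[[10, 20], [1000], [30]]`, the short rollouts fill the first batch while the 1000-token rollout is still being generated, which is the property that makes this pattern attractive when output lengths vary widely.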
Method
QED Nano uses a multi-stage training approach: data curation from sources like AoPS and Olympiad PDFs, SFT on proofs distilled from DeepSeekMath v2, and RL in which LLM-generated rubrics (Gemini 3) supply the reward, enhanced by a reasoning cache for long-horizon coherence.
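To make the rubric-as-reward idea concrete, here is a minimal sketch of turning a rubric into a scalar reward. Everything in it is an assumption for illustration: `RubricItem`, `rubric_reward`, and `keyword_grader` are hypothetical names, the normalization to [0, 1] is one possible choice, and the keyword grader is a toy stand-in for the LLM judge that would score each criterion in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    description: str   # what the proof must contain to earn the points
    max_points: float  # weight of this criterion


def rubric_reward(
    proof: str,
    rubric: List[RubricItem],
    grade: Callable[[str, RubricItem], float],
) -> float:
    """Return a reward in [0, 1]: points earned divided by points available."""
    total = sum(item.max_points for item in rubric)
    if total <= 0:
        raise ValueError("rubric must have positive total points")
    earned = 0.0
    for item in rubric:
        points = grade(proof, item)
        # Clamp so a misbehaving grader cannot push the reward outside [0, 1].
        earned += min(max(points, 0.0), item.max_points)
    return earned / total


def keyword_grader(proof: str, item: RubricItem) -> float:
    """Toy grader (assumption): full marks if the criterion text appears in the proof."""
    return item.max_points if item.description.lower() in proof.lower() else 0.0
```

Unlike a binary correct/incorrect verifier, this gives partial credit: a proof that only establishes the base case of an induction scored against a rubric of "base case" (2 points) and "inductive step" (5 points) earns 2/7 rather than 0.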
In practice
- Use rubrics for RL in domains lacking simple verifiers.
- Implement a reasoning cache for long-chain coherent generation.
- Adopt asynchronous RL for efficiency with variable-length outputs.
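The summary does not describe how the reasoning cache works internally, so the following is only a guess at the general shape of such a decoder: generate a bounded chunk of reasoning, compress it into running notes, and condition the next chunk on those notes rather than on the full history. The function `decode_with_cache`, the prompt format, and the toy `toy_generate` and `toy_summarize` callables are all hypothetical; a real system would call a language model for both steps.

```python
from typing import Callable, List, Tuple


def decode_with_cache(
    problem: str,
    generate: Callable[[str], str],
    summarize: Callable[[str, str], str],
    max_rounds: int = 8,
    is_done: Callable[[str], bool] = lambda chunk: "QED" in chunk,
) -> Tuple[List[str], str]:
    """Run up to max_rounds of generate-then-summarize; return (chunks, final cache)."""
    cache = ""                  # compact notes carried between rounds
    transcript: List[str] = []  # every chunk produced, kept for inspection
    for _ in range(max_rounds):
        # Each round sees only the problem and the notes, so the prompt stays short
        # even when the total amount of generated reasoning grows large.
        prompt = f"Problem: {problem}\nNotes so far: {cache}\nContinue:"
        chunk = generate(prompt)
        transcript.append(chunk)
        if is_done(chunk):
            break
        cache = summarize(cache, chunk)
    return transcript, cache


def toy_generate(prompt: str) -> str:
    """Toy model (assumption): emits three numbered steps, then finishes."""
    done = prompt.count("step")
    return "QED" if done >= 3 else f"step {done + 1}"


def toy_summarize(cache: str, chunk: str) -> str:
    """Toy summarizer (assumption): append the chunk and keep the last 200 characters."""
    return (cache + " " + chunk).strip()[-200:]
```

Calling `decode_with_cache("prove 1+1=2", toy_generate, toy_summarize)` walks through three steps and then stops on "QED". The design point is that per-round context is bounded by the size of the notes, not by the total reasoning length.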
Topics
- Theorem Proving
- Reinforcement Learning
- Reasoning Cache
- LLM Distillation
- Agentic Scaffolds
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by HuggingFace.