Knowledge Distillation from Large Reasoning Models to Compact Student Models: A Case Study on the John O'Bryan Mathematics Competition
Summary
This study investigates knowledge distillation from the large reasoning model DeepSeek-R1 to the compact Qwen2.5-7B student model, with a focus on mathematical problem-solving. The researchers built a Chain-of-Thought (CoT) training corpus from historical John O'Bryan Mathematics Competition problems (2011-2025) using a dual-agent framework, then fine-tuned the student with Low-Rank Adaptation (LoRA) on Apple Silicon hardware using the MLX framework. On the competition dataset, the base Qwen2.5-7B scored 64.67% and the DeepSeek-R1 teacher 91.40%; the fine-tuned student reached a mean accuracy of 69.43% (std dev 0.17%), a gain of 4.76 percentage points over the base model. The student also generalized to the MATH-500 benchmark, scoring 73.1% (std dev 0.18%). Accuracy fell sharply as responses got shorter, from 69.43% at the response-length setting labeled R1 (mean 220 words) to 41.9% at R6 (mean 31.2 words), which marks response length as a critical factor.
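The reported accuracies can be put in context with a little arithmetic. The sketch below uses only the three competition-set figures quoted above; the "fraction of teacher gap closed" metric and the function name are illustrative choices for this summary, not measures reported by the study.

```python
# Accuracy figures (percent) quoted in the summary for the competition dataset.
BASE_ACC = 64.67      # base Qwen2.5-7B
STUDENT_ACC = 69.43   # fine-tuned student (mean)
TEACHER_ACC = 91.40   # DeepSeek-R1 teacher


def distillation_gain(base, student, teacher):
    """Return (absolute gain in points, relative gain in %, fraction of gap closed).

    "Gap closed" is a hypothetical convenience metric: the share of the
    base-to-teacher accuracy gap that the fine-tuned student recovers.
    """
    abs_gain = student - base
    rel_gain = 100.0 * abs_gain / base
    gap_closed = abs_gain / (teacher - base)
    return abs_gain, rel_gain, gap_closed


abs_gain, rel_gain, gap_closed = distillation_gain(BASE_ACC, STUDENT_ACC, TEACHER_ACC)
print(f"absolute gain: {abs_gain:.2f} percentage points")
print(f"relative gain over base: {rel_gain:.1f}%")
print(f"share of teacher gap closed: {gap_closed:.1%}")
```

Read this way, the 4.76-point gain corresponds to roughly 18% of the distance between the base model and the teacher, so most of the teacher's advantage remains after distillation.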
Key takeaway
If you are a Machine Learning Engineer building compact reasoning models, consider Chain-of-Thought distillation to boost performance. LoRA fine-tuning is feasible on hardware like Apple Silicon, but monitor validation loss closely and limit training to around 200 iterations to prevent overfitting. Also make sure your model is allowed to generate sufficiently long responses for complex mathematical reasoning, because accuracy drops significantly with shorter outputs.
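One way to act on the "monitor validation loss" advice is a simple patience-based stopping rule. The following is a minimal sketch, not the study's procedure: the `patience` parameter, the function name, and the loss values are all made up for illustration.

```python
def best_checkpoint(val_losses, patience=2):
    """Pick a stopping point from (iteration, validation_loss) pairs.

    Scans the evaluations in order, stops once the loss has failed to
    improve for `patience` consecutive evaluations, and returns the
    (iteration, loss) pair with the lowest loss seen up to that point.
    """
    best_iter, best_loss = None, float("inf")
    stale = 0  # evaluations since the last improvement
    for iteration, loss in val_losses:
        if loss < best_loss:
            best_iter, best_loss = iteration, loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_iter, best_loss


# Made-up validation curve for illustration only; NOT data from the study.
history = [(50, 1.10), (100, 0.92), (150, 0.85), (200, 0.81), (250, 0.84), (300, 0.90)]
print(best_checkpoint(history))
```

With this invented curve the rule selects iteration 200, where the loss bottoms out before rising again. On a real run, the stopping point should come from your own validation curve rather than a fixed iteration count.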
Key insights
Chain-of-Thought distillation raised the compact model's accuracy on competition mathematics by 4.76 percentage points, and response length proved to be a critical factor in reasoning accuracy.
Principles
- CoT distillation enhances compact model performance.
- Response length is critical for reasoning quality.
- Overfitting can occur early in distillation training.
Method
A dual-agent framework generates a Chain-of-Thought corpus from competition problems. The corpus is then used to fine-tune the student model with LoRA on Apple Silicon using MLX, with the number of training iterations limited to prevent overfitting.
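The summary does not say how the corpus is stored or what the two agents do, so the following is only a sketch of the data-preparation step under stated assumptions: each teacher-solved problem is flattened into a prompt/completion pair and written as JSON Lines, a common layout for supervised fine-tuning data. The field names, the "Final answer:" suffix, and both helper functions are hypothetical.

```python
import json


def to_training_record(problem, reasoning, answer):
    """Turn one teacher-solved problem into a prompt/completion pair (assumed layout)."""
    completion = f"{reasoning.strip()}\n\nFinal answer: {answer.strip()}"
    return {"prompt": problem.strip(), "completion": completion}


def dumps_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(rec, ensure_ascii=False) + "\n" for rec in records)


# Toy example written for this sketch; not a problem from the competition corpus.
example = to_training_record(
    problem="What is the sum of the first 10 positive integers?",
    reasoning="Use n(n+1)/2 with n = 10: 10 * 11 / 2 = 55.",
    answer="55",
)
jsonl_text = dumps_jsonl([example])
print(jsonl_text, end="")
```

Keeping the full reasoning trace in the completion, rather than only the final answer, is what makes this Chain-of-Thought distillation: the student is trained to reproduce the teacher's intermediate steps.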
In practice
- Fine-tune with LoRA on Apple Silicon.
- Limit training iterations to ~200 to avoid overfitting.
- Monitor response length on reasoning tasks, since accuracy drops as outputs get shorter.
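Response-length monitoring from the last item above can be as simple as tracking mean word count alongside accuracy for each evaluation run. This is a minimal sketch, assuming responses are plain strings already graded as correct or incorrect and that words are split on whitespace; the function name and sample data are invented for illustration.

```python
from statistics import mean


def length_report(samples):
    """Summarize (response_text, is_correct) pairs.

    Returns (mean word count, accuracy in percent). Words are counted by
    whitespace splitting, which is an assumption of this sketch.
    """
    word_counts = [len(text.split()) for text, _ in samples]
    accuracy = 100.0 * sum(1 for _, ok in samples if ok) / len(samples)
    return mean(word_counts), accuracy


# Invented graded responses for illustration only; not outputs from the study.
samples = [
    ("First factor the quadratic, then take the positive root: 3", True),
    ("3", True),
    ("5", False),
]
mean_words, accuracy = length_report(samples)
print(f"mean words: {mean_words}, accuracy: {accuracy:.1f}%")
```

Logging these two numbers together for each generation setting makes it easy to spot when a tighter length budget starts to cost accuracy.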
Topics
- Knowledge Distillation
- Chain-of-Thought
- Mathematical Reasoning
- LoRA Fine-tuning
- Compact Language Models
- Apple Silicon MLX
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.