Knowledge Distillation from Large Reasoning Models to Compact Student Models: A Case Study on the John O Bryan Mathematics Competition

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

A study investigates knowledge distillation from the large reasoning model DeepSeek-R1 to the compact Qwen2.5-7B student model, focusing on mathematical problem-solving. Researchers created a Chain-of-Thought (CoT) training corpus using historical problems from the John O'Bryan Mathematics Competition (2011-2025) via a dual-agent framework. The student model was fine-tuned with Low-Rank Adaptation (LoRA) on Apple Silicon hardware using the MLX framework. While the base Qwen2.5-7B achieved 64.67% accuracy and the DeepSeek-R1 teacher 91.40%, the fine-tuned student model reached a mean accuracy of 69.43% (std dev 0.17%) on the competition dataset, a 4.76 percentage-point improvement. It also generalized to 73.1% (std dev 0.18%) on the MATH-500 benchmark. The research also found that accuracy declines significantly with reduced response length, from 69.43% at R1 (mean 220 words) to 41.9% at R6 (mean 31.2 words), highlighting response length as a critical factor.

Key takeaway

For Machine Learning Engineers developing compact reasoning models, consider Chain-of-Thought distillation to boost performance. You should implement LoRA fine-tuning on hardware like Apple Silicon, but carefully monitor validation loss and limit training iterations to around 200 to prevent overfitting. Additionally, ensure your models generate sufficiently long responses for complex mathematical reasoning, as accuracy significantly drops with shorter outputs.

Key insights

Chain-of-Thought distillation significantly improves compact models' mathematical reasoning, with response length being a critical factor.

Principles

CoT distillation enhances compact model performance.
Response length is critical for reasoning quality.
Overfitting can occur early in distillation training.

Method

A dual-agent framework generates a Chain-of-Thought corpus from competition problems, used to LoRA fine-tune a student model on Apple Silicon with MLX, limiting training iterations to prevent overfitting.

In practice

Fine-tune with LoRA on Apple Silicon.
Limit training iterations to ~200 to avoid overfitting.
Monitor response length for reasoning tasks.

Topics

Knowledge Distillation
Chain-of-Thought
Mathematical Reasoning
LoRA Fine-tuning
Compact Language Models
Apple Silicon MLX

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.