Dynamic Sampling that Adapts: Self-Aware Iterative Direct Preference Optimization for Mathematical Reasoning
Summary
Researchers from Harbin Institute of Technology and Huawei Noah's Ark Lab introduce SAI-DPO (Self-Aware Iterative Direct Preference Optimization), an algorithm designed to enhance mathematical reasoning in large language models (LLMs) by dynamically selecting training data. Unlike static data selection methods, SAI-DPO continuously assesses a model's evolving reasoning abilities and adapts data selection based on real-time performance feedback. The algorithm uses two key metrics: "knowledge points similarity," which groups problems by domain, and "self-aware difficulty," which gauges the model's current competence on a problem and is measured by P@K, overall reasoning steps, and average output length. Extensive experiments on three LLMs (Qwen2.5-7B-Math-Base, Qwen2.5-7B-Distill, Llama3.1-8B-Instruct) and eight mathematical benchmarks, including AIME24 and AMC23, demonstrate that SAI-DPO achieves an average performance boost of up to 21.3 percentage points, with notable improvements of 10 and 15 points on AIME24 and AMC23, respectively, while also improving data utilization efficiency.
Key takeaway
Research scientists developing or fine-tuning LLMs for mathematical reasoning should consider implementing dynamic data selection strategies like SAI-DPO. By continuously assessing your model's "self-aware difficulty" and "knowledge points similarity," you can adapt training data to target specific weaknesses, leading to substantial performance gains (an average boost of up to 21.3 percentage points in the reported experiments) and improved data efficiency, especially on complex, competition-level problems. This approach can accelerate convergence and make better use of training resources than static data selection.
Key insights
Dynamically adapting training data to a model's evolving capabilities significantly boosts mathematical reasoning performance.
Principles
- Data selection should align with model competence.
- Dynamic data adaptation outperforms static strategies.
- Appropriate difficulty metrics are crucial for training.
Method
SAI-DPO dynamically samples training data using knowledge point similarity and self-aware difficulty metrics (P@K, steps, length), then iteratively refines the model via DPO, prioritizing challenging, relevant problems.
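The difficulty signal described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Rollout` record, the reading of P@K as the fraction of K sampled responses that are correct, the normalisation caps, and the weights used to combine the three signals are all assumptions made for the example, since the summary does not say how the signals are combined.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Rollout:
    """One sampled model response to a problem (hypothetical record type)."""
    correct: bool    # final answer matched the reference
    num_steps: int   # reasoning steps in the response
    num_tokens: int  # output length in tokens


def pass_rate_at_k(rollouts: List[Rollout]) -> float:
    """Assumed reading of P@K: fraction of the K sampled responses that are correct."""
    return sum(r.correct for r in rollouts) / len(rollouts)


def difficulty(
    rollouts: List[Rollout],
    max_steps: int,
    max_tokens: int,
    weights: Tuple[float, float, float] = (0.6, 0.2, 0.2),  # assumed, not from the paper
) -> float:
    """Illustrative score in [0, 1]; higher means harder for the current model."""
    k = len(rollouts)
    fail_rate = 1.0 - pass_rate_at_k(rollouts)
    avg_steps = sum(r.num_steps for r in rollouts) / k
    avg_tokens = sum(r.num_tokens for r in rollouts) / k
    w_p, w_s, w_l = weights
    return (
        w_p * fail_rate
        + w_s * min(avg_steps / max_steps, 1.0)
        + w_l * min(avg_tokens / max_tokens, 1.0)
    )


def sample_problems(scores: Dict[str, float], n: int, rng: random.Random) -> List[str]:
    """Draw n problem ids (with replacement), weighted by difficulty score."""
    ids = list(scores)
    return rng.choices(ids, weights=[scores[i] for i in ids], k=n)
```

Recomputing the scores after each DPO round, from fresh rollouts of the updated model, is what would make the sampling "self-aware": a problem the model has learned to solve drops in weight on the next iteration.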
In practice
- Use K-Means to cluster knowledge points for data categorization.
- Filter training data to exclude samples that are too easy or too difficult for the current model.
- Prioritize data from error-prone knowledge domains.
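The three practices above can be sketched with the standard library alone. This is a hedged illustration under stated assumptions: each problem is assumed to already have a numeric embedding of its knowledge points (how that embedding is produced is not specified here), the pass-rate thresholds are placeholders, and the helper names are hypothetical. The `kmeans` function is a plain Lloyd's-algorithm loop; in a real pipeline a library implementation would be the more usual choice.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def kmeans(points: Sequence[Sequence[float]], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(list(points), k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def filter_by_pass_rate(pass_rates: Dict[str, float], low: float = 0.0, high: float = 1.0) -> List[str]:
    """Keep problems the model solves sometimes but not always (thresholds are assumptions)."""
    return [pid for pid, rate in pass_rates.items() if low < rate < high]


def cluster_error_rates(labels: Dict[str, int], pass_rates: Dict[str, float]) -> Dict[int, float]:
    """Mean failure rate per knowledge-point cluster; higher means more error-prone."""
    failures = defaultdict(list)
    for pid, cluster in labels.items():
        failures[cluster].append(1.0 - pass_rates[pid])
    return {cluster: sum(v) / len(v) for cluster, v in failures.items()}
```

One way to use the result is to rank clusters by their error rate and draw a larger share of the next training batch from the highest-ranked clusters, restricted to the problems that survive `filter_by_pass_rate`.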
Topics
- SAI-DPO
- Dynamic Data Sampling
- Mathematical Reasoning
- Direct Preference Optimization
- Self-Aware Difficulty
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.