Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Researchers from Harbin Institute of Technology and Huawei Noah's Ark Lab introduce SAI-DPO (Self-Aware Iterative Direct Preference Optimization), an algorithm that improves mathematical reasoning in large language models (LLMs) by selecting training data dynamically. Unlike static data selection methods, SAI-DPO repeatedly assesses the model's evolving reasoning ability and adapts the data it samples to real-time performance feedback. It relies on two metrics: "knowledge points similarity," which groups problems by domain, and "self-aware difficulty," which gauges how hard a problem is for the current model and is defined by P@K, the number of reasoning steps, and the average output length. Experiments on three LLMs (Qwen2.5-7B-Math-Base, Qwen2.5-7B-Distill, Llama3.1-8B-Instruct) and eight mathematical benchmarks, including AIME24 and AMC23, show an average performance gain of up to 21.3 percentage points, with improvements of 10 and 15 points on AIME24 and AMC23, respectively, along with better data utilization efficiency.
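As a rough illustration of how a "self-aware difficulty" score might combine the three signals named above, here is a minimal Python sketch. The summary does not give the paper's formula, so everything specific here is an assumption: the `Rollout` container, the weights, the normalization caps (`max_steps`, `max_tokens`), and the choice to approximate P@K as the fraction of K sampled solutions that are correct.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    """One sampled solution to a problem (hypothetical container)."""
    correct: bool    # final answer matched the reference
    num_steps: int   # reasoning steps in the solution
    num_tokens: int  # output length in tokens


def pass_rate(rollouts: List[Rollout]) -> float:
    """Fraction of the K samples that are correct (used here as a stand-in for P@K)."""
    return sum(r.correct for r in rollouts) / len(rollouts)


def self_aware_difficulty(
    rollouts: List[Rollout],
    max_steps: int = 20,          # assumed cap used to normalize step counts
    max_tokens: int = 2048,       # assumed cap used to normalize output length
    weights: Tuple[float, float, float] = (0.6, 0.2, 0.2),  # assumed mix
) -> float:
    """Score in [0, 1]; higher means harder for the current model."""
    if not rollouts:
        raise ValueError("need at least one rollout per problem")
    k = len(rollouts)
    fail_term = 1.0 - pass_rate(rollouts)
    avg_steps = sum(r.num_steps for r in rollouts) / k
    avg_tokens = sum(r.num_tokens for r in rollouts) / k
    # Clip the normalized terms so unusually long outputs cannot exceed 1.
    steps_term = min(avg_steps / max_steps, 1.0)
    length_term = min(avg_tokens / max_tokens, 1.0)
    w_fail, w_steps, w_len = weights
    return w_fail * fail_term + w_steps * steps_term + w_len * length_term
```

Because the score is computed from the model's own rollouts, it changes as the model improves: a problem that was hard in one training round can score as easy in the next.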

Key takeaway

Research scientists developing or fine-tuning LLMs for mathematical reasoning should consider dynamic data selection strategies like SAI-DPO. By continuously measuring your model's "self-aware difficulty" and "knowledge points similarity," you can steer training data toward specific weaknesses. The reported results are substantial performance gains (up to 21.3 percentage points on average) and better data efficiency, especially on complex, competition-level problems. Compared with static data methods, this approach can accelerate convergence and make better use of training resources.

Key insights

Dynamically adapting training data to a model's evolving capabilities significantly boosts mathematical reasoning performance.

Method

SAI-DPO dynamically samples training data using the knowledge points similarity and self-aware difficulty metrics (P@K, steps, length), then iteratively refines the model via DPO, prioritizing challenging, relevant problems.
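The sampling step described above can be sketched as weighted selection from a problem pool. This is an illustrative sketch, not the authors' implementation. It assumes each problem is a dict with hypothetical `tags` (knowledge points) and `difficulty` fields, measures similarity as Jaccard overlap with the set of knowledge points the model currently gets wrong, and multiplies similarity by difficulty to obtain a sampling weight. The paper's actual similarity measure and weighting may differ.

```python
import random
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets of knowledge-point tags (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sampling_weights(pool: List[Dict], weak_points: Set[str],
                     floor: float = 1e-3) -> List[float]:
    """Weight = similarity to current weak points x current difficulty.

    The small floor keeps every problem sampleable, so unrelated topics
    are rarely chosen but never excluded outright.
    """
    return [max(jaccard(p["tags"], weak_points) * p["difficulty"], floor)
            for p in pool]


def sample_batch(pool: List[Dict], weak_points: Set[str],
                 batch_size: int, seed: int = 0) -> List[Dict]:
    """Weighted sampling without replacement (Efraimidis-Spirakis keys)."""
    rng = random.Random(seed)
    weights = sampling_weights(pool, weak_points)
    # Each item draws key u ** (1 / w); the largest keys form the sample.
    keyed = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keyed.sort(reverse=True)
    return [pool[i] for _, i in keyed[:batch_size]]


pool = [
    {"id": "g1", "tags": {"geometry"}, "difficulty": 0.9},
    {"id": "g2", "tags": {"geometry", "algebra"}, "difficulty": 0.4},
    {"id": "a1", "tags": {"algebra"}, "difficulty": 0.9},
]
batch = sample_batch(pool, weak_points={"geometry"}, batch_size=2)
```

In an iterative setup, each round would re-estimate the difficulty scores and weak knowledge points from fresh model outputs, sample a batch this way, build preference pairs from the model's correct and incorrect solutions, and run a DPO update before repeating.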

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.