Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Researchers from Harbin Institute of Technology and Huawei Noah's Ark Lab introduce SAI-DPO (Self-Aware Iterative Direct Preference Optimization), an algorithm that improves mathematical reasoning in large language models (LLMs) by dynamically selecting training data. Unlike static data selection methods, SAI-DPO continuously assesses the model's evolving reasoning ability and adapts data selection to real-time performance feedback. The algorithm relies on two metrics: "knowledge points similarity", which groups problems by domain, and "self-aware difficulty", which gauges how hard a problem is for the model at its current level of competence and is defined by P@K, overall steps, and average output length. Experiments on three LLMs (Qwen2.5-7B-Math-Base, Qwen2.5-7B-Distill, Llama3.1-8B-Instruct) and eight mathematical benchmarks, including AIME24 and AMC23, show that SAI-DPO achieves an average performance boost of up to 21.3 percentage points, with improvements of 10 and 15 points on AIME24 and AMC23, respectively, while also improving data utilization efficiency.

Key takeaway

Research scientists developing or fine-tuning LLMs for mathematical reasoning should consider dynamic data selection strategies such as SAI-DPO. Continuously tracking a model's "self-aware difficulty" and "knowledge points similarity" lets the training data target the model's current weaknesses, which the authors report yields substantial gains (an average boost of up to 21.3 percentage points) and better data efficiency, especially on complex, competition-level problems. Compared with static data selection, this approach can accelerate convergence and make better use of training resources.

Key insights

Dynamically adapting training data to a model's evolving capabilities significantly boosts mathematical reasoning performance.

Principles

Data selection should align with model competence.
Dynamic data adaptation outperforms static strategies.
Appropriate difficulty metrics are crucial for training.

Method

SAI-DPO dynamically samples training data using knowledge point similarity and self-aware difficulty metrics (P@K, steps, length), then iteratively refines the model via DPO, prioritizing challenging, relevant problems.
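
As a rough illustration of how the three difficulty signals named above could be folded into one score, the sketch below combines a failure rate with normalized step count and output length. The summary does not give the paper's actual formula, so everything here is an assumption: P@K is taken to be the fraction of K sampled solutions that are correct, and the Rollout record, the weights, and the normalization caps (max_steps, max_tokens) are hypothetical choices made only for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rollout:
    """One sampled solution to a problem (hypothetical record type)."""
    correct: bool    # did the final answer match the reference?
    num_steps: int   # number of reasoning steps in the solution
    num_tokens: int  # output length in tokens


def pass_rate(rollouts):
    """Fraction of the K sampled solutions that are correct (assumed P@K)."""
    return sum(r.correct for r in rollouts) / len(rollouts)


def self_aware_difficulty(rollouts, max_steps=20, max_tokens=2048,
                          weights=(0.6, 0.2, 0.2)):
    """Score in [0, 1]; higher means harder for the current model.

    The weights and caps are illustrative assumptions, not values
    taken from the paper.
    """
    w_pass, w_steps, w_len = weights
    failure = 1.0 - pass_rate(rollouts)
    # Clip the normalized averages so one very long answer cannot dominate.
    steps = min(mean(r.num_steps for r in rollouts) / max_steps, 1.0)
    length = min(mean(r.num_tokens for r in rollouts) / max_tokens, 1.0)
    return w_pass * failure + w_steps * steps + w_len * length


if __name__ == "__main__":
    easy = [Rollout(True, 3, 150)] * 8
    hard = [Rollout(False, 18, 1900)] * 8
    print(self_aware_difficulty(easy), self_aware_difficulty(hard))
```

Because the score is computed from the model's own rollouts, it changes as the model improves, which is what makes the difficulty "self-aware" rather than a fixed property of the problem.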

In practice

Use K-Means to cluster knowledge points for data categorization.
Filter training data to exclude samples that are too easy or too difficult for the current model.
Prioritize data from error-prone knowledge domains.
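
The last two practices above can be sketched as a single selection step. This is a minimal illustration and not the authors' implementation: it assumes each problem already carries a cluster id (for example from K-Means over knowledge-point embeddings) and a pass rate measured on the current model. The field names, the 0.1 and 0.9 filter thresholds, and the weighted-sampling scheme are all assumptions made for the example.

```python
import random
from collections import defaultdict


def cluster_error_rates(problems):
    """Mean error rate (1 - pass rate) of the current model per cluster."""
    errors = defaultdict(list)
    for p in problems:
        errors[p["cluster"]].append(1.0 - p["pass_rate"])
    return {c: sum(e) / len(e) for c, e in errors.items()}


def select_batch(problems, batch_size, low=0.1, high=0.9, seed=0):
    """Drop too-easy and too-hard problems, then sample without
    replacement, favoring clusters where the model errs most.

    `low` and `high` are hypothetical pass-rate thresholds.
    """
    rates = cluster_error_rates(problems)
    candidates = [p for p in problems if low <= p["pass_rate"] <= high]
    rng = random.Random(seed)

    def sampling_key(p):
        # Weighted sampling without replacement (Efraimidis-Spirakis):
        # draw u ~ U(0, 1) and rank by u ** (1 / weight).
        weight = max(rates[p["cluster"]], 1e-6)
        return rng.random() ** (1.0 / weight)

    ranked = sorted(candidates, key=sampling_key, reverse=True)
    return ranked[:batch_size]


if __name__ == "__main__":
    pool = [
        {"id": "a", "cluster": 0, "pass_rate": 1.0},  # too easy, filtered out
        {"id": "b", "cluster": 0, "pass_rate": 0.8},
        {"id": "c", "cluster": 1, "pass_rate": 0.2},
        {"id": "d", "cluster": 1, "pass_rate": 0.0},  # too hard, filtered out
        {"id": "e", "cluster": 1, "pass_rate": 0.5},
    ]
    print([p["id"] for p in select_batch(pool, batch_size=2)])
```

In an iterative setup, the pass rates and cluster error rates would be recomputed after each training round, so the selected batch shifts as the model's weak areas change.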

Topics

SAI-DPO
Dynamic Data Sampling
Mathematical Reasoning
Direct Preference Optimization
Self-Aware Difficulty

Code references

tatsu-lab/stanford_alpaca

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.