Efficient Data Selection for Multimodal Models via Incremental Optimization Utility
Summary
A new framework called One-Step-Train (OST) addresses the quality-quantity trade-off in synthetic data for Large Multimodal Models (LMMs) by reframing data selection as a ranking problem over incremental optimization utility. Unlike prior methods such as LLM-as-a-Judge, OST does not rely on semantic heuristics: it estimates the marginal utility of each data sample through a simulated single-step update on a lightweight proxy model. In experiments with the Qwen series on multimodal mathematical reasoning benchmarks, selecting the top-50 subset reduces training costs by 43% and total time by 17% while outperforming the LLM-as-a-Judge baseline by 1.8 points. With a fixed compute budget, the top-20 subset achieves a 5.6-point gain over LLM-as-a-Judge and an 8.8-point gain over the Full-SFT baseline, while also mitigating performance degradation from noisy data.
Key takeaway
For AI Engineers and Research Scientists developing Large Multimodal Models, OST offers a computationally efficient method for synthetic data selection. Consider integrating OST to cut training costs by up to 43% and to gain between 1.8 and 5.6 points over LLM-as-a-Judge selection, as reported in the experiments, especially when dealing with noisy datasets that can cause negative transfer. In those experiments, OST outperformed both LLM-as-a-Judge scoring and training on the full dataset.
Key insights
One-Step-Train (OST) optimizes LMM data selection by estimating marginal utility via a lightweight proxy, improving efficiency and performance.
Principles
- Reformulate data selection as incremental optimization.
- Estimate marginal utility via simulated single-step updates.
- Use optimization-grounded selection to identify toxic samples, i.e. noisy data that degrades performance.
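The summary does not give OST's exact scoring formula. One plausible way to write down the principles above, as a sketch under stated assumptions, is to define the utility of a sample as the drop in a held-out loss after a single simulated gradient step. The symbols here are illustrative choices, not taken from the source: proxy parameters \theta, step size \eta, per-sample loss \ell, and a held-out loss \mathcal{L}_{\mathrm{val}}.

```latex
% Assumed notation (not from the source): \theta = proxy parameters,
% \eta = step size, \ell = per-sample loss, \mathcal{L}_{val} = held-out loss.
U(z_i) \;=\; \mathcal{L}_{\mathrm{val}}(\theta)
        \;-\; \mathcal{L}_{\mathrm{val}}\bigl(\theta - \eta\,\nabla_\theta \ell(z_i;\theta)\bigr)
\;\approx\; \eta\,\nabla_\theta \mathcal{L}_{\mathrm{val}}(\theta)^{\top}\,\nabla_\theta \ell(z_i;\theta)
```

The approximation is a first-order Taylor expansion in \eta. Under this reading, samples are ranked by U(z_i), and a negative value flags a sample whose update would raise the held-out loss, which is one way to interpret a "toxic" sample.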
Method
OST estimates a sample's marginal utility by simulating a single-step update on a lightweight proxy model. This utility is then used to rank and select data, replacing semantic heuristics for LMM training.
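The estimate described above can be illustrated with a toy sketch. This is not the authors' implementation: the proxy here is a plain logistic-regression model rather than a lightweight multimodal model, utility is assumed to be the drop in loss on a small held-out set, and every name (`one_step_utilities`, `select_top_fraction`, the learning rate, the synthetic data) is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(w, X, y):
    """Mean binary cross-entropy of the proxy with weights w."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def sample_gradient(w, x, y):
    """Gradient of the log loss on a single example (x, y)."""
    return (sigmoid(x @ w) - y) * x

def one_step_utilities(w, X_pool, y_pool, X_val, y_val, lr=0.1):
    """Assumed utility: held-out loss before minus held-out loss after
    one simulated gradient step on each candidate sample."""
    base = mean_log_loss(w, X_val, y_val)
    utilities = np.empty(len(X_pool))
    for i in range(len(X_pool)):
        # The step is simulated only; the proxy weights w are never changed.
        w_trial = w - lr * sample_gradient(w, X_pool[i], y_pool[i])
        utilities[i] = base - mean_log_loss(w_trial, X_val, y_val)
    return utilities

def select_top_fraction(utilities, fraction):
    """Indices of the highest-utility samples, best first."""
    k = max(1, int(round(fraction * len(utilities))))
    return np.argsort(-utilities)[:k]

# Synthetic pool: 200 samples, the first 60 with flipped (noisy) labels.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X_pool = rng.normal(size=(200, 5))
y_pool = (X_pool @ w_true > 0).astype(float)
is_noisy = np.zeros(200, dtype=bool)
is_noisy[:60] = True
y_pool[is_noisy] = 1.0 - y_pool[is_noisy]

# Small clean held-out set used only for scoring.
X_val = rng.normal(size=(500, 5))
y_val = (X_val @ w_true > 0).astype(float)

w0 = np.zeros(5)  # untrained proxy
utilities = one_step_utilities(w0, X_pool, y_pool, X_val, y_val)
selected = select_top_fraction(utilities, 0.5)

print("noisy share in pool:     ", is_noisy.mean())
print("noisy share in selection:", is_noisy[selected].mean())
```

In this toy setting, samples with flipped labels tend to push the proxy in the wrong direction and so receive low or negative utility, which is why ranking by utility filters many of them out. A real pipeline would replace the logistic-regression proxy with a small multimodal model and would likely batch the simulated updates for speed.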
In practice
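- Reduce LMM training costs by 43% and total time by 17% with the top-50 subset.
- Improve model performance by 5.6 points over LLM-as-a-Judge and 8.8 points over Full-SFT with the top-20 subset under a fixed compute budget.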
- Mitigate negative transfer from noisy data.
Topics
- Data Selection
- Multimodal Models
- Incremental Optimization
- One-Step-Train
- Large Multimodal Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.