Efficient Data Selection for Multimodal Models via Incremental Optimization Utility

2026-05-08 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A new framework called One-Step-Train (OST) addresses the quality-quantity trade-off in synthetic data for Large Multimodal Models (LMMs) by reframing data selection as a ranking problem over incremental optimization utility. Unlike prior methods such as LLM-as-a-Judge, OST estimates the marginal utility of each data sample through a simulated single-step update on a lightweight proxy, avoiding semantic heuristics. In experiments with the Qwen series on multimodal mathematical reasoning benchmarks, selecting the top-50 subset reduces training costs by 43% and total time by 17% while outperforming the LLM-as-a-Judge baseline by 1.8 points. Under a fixed compute budget, the top-20 subset achieves a 5.6-point gain over LLM-as-a-Judge and an 8.8-point gain over the Full-SFT baseline, while also mitigating performance degradation from noisy data.

Key takeaway

For AI Engineers and Research Scientists developing Large Multimodal Models, OST offers a computationally efficient method for synthetic data selection. Consider integrating it: in the reported experiments, the top-50 subset cut training costs by 43%, and under a fixed compute budget the top-20 subset gained 5.6 points over LLM-as-a-Judge. It is especially relevant for noisy datasets that can cause negative transfer, where it outperformed both LLM-as-a-Judge selection and full-dataset fine-tuning.

Key insights

One-Step-Train (OST) optimizes LMM data selection by estimating marginal utility via a lightweight proxy, improving efficiency and performance.

Principles

Reformulate data selection as incremental optimization.
Estimate marginal utility via simulated single-step updates.
Optimization-grounded selection identifies toxic samples.

Method

OST estimates a sample's marginal utility by simulating a single-step update on a lightweight proxy model. This utility is then used to rank and select data, replacing semantic heuristics for LMM training.
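
The idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify the utility definition, so the sketch assumes utility is the drop in held-out validation loss after one simulated SGD step on a single sample. The linear proxy model, the squared loss, the learning rate, and all function names (one_step_utility, rank_by_utility) are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (features, target)


def predict(w: List[float], x: List[float]) -> float:
    """Linear proxy model (an assumption; stands in for the lightweight proxy)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mean_loss(w: List[float], data: List[Sample]) -> float:
    """Mean squared error of the proxy on a dataset."""
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


def one_step_utility(w: List[float], sample: Sample,
                     val: List[Sample], lr: float = 0.1) -> float:
    """Assumed utility: validation-loss decrease after one simulated SGD step.

    The proxy weights are copied, never modified in place, so every
    candidate sample is scored from the same starting point.
    """
    x, y = sample
    err = predict(w, x) - y
    # Gradient of (w.x - y)^2 with respect to w is 2 * err * x.
    w_new = [wi - lr * 2.0 * err * xi for wi, xi in zip(w, x)]
    return mean_loss(w, val) - mean_loss(w_new, val)


def rank_by_utility(w: List[float], pool: List[Sample],
                    val: List[Sample], lr: float = 0.1):
    """Return (utility, index) pairs sorted from most to least useful."""
    scores = [(one_step_utility(w, s, val, lr), i) for i, s in enumerate(pool)]
    return sorted(scores, reverse=True)


# Toy data following y = 2 * x, with one deliberately mislabeled sample.
val = [([1.0], 2.0), ([2.0], 4.0)]
pool = [
    ([1.0], 2.0),   # clean
    ([1.0], -2.0),  # noisy label (wrong sign)
    ([2.0], 4.0),   # clean
]
w0 = [0.0]

ranking = rank_by_utility(w0, pool, val)
top_k = [i for _, i in ranking[:2]]          # keep the two highest-utility samples
harmful = [i for u, i in ranking if u < 0]   # a step on these raises validation loss

print("selected:", top_k)
print("negative utility:", harmful)
```

On this toy data the mislabeled sample receives a negative utility, because a step toward it increases validation loss. That mirrors, in miniature, how an optimization-grounded score can flag samples that would cause negative transfer without inspecting their content.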

In practice

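Reduce LMM training costs by 43% by training on the top-50 subset selected by OST.
Gain 5.6 points over LLM-as-a-Judge and 8.8 points over Full-SFT with the top-20 subset under a fixed compute budget.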
Mitigate negative transfer from noisy data.

Topics

Data Selection
Multimodal Models
Incremental Optimization
One-Step-Train
Large Multimodal Models

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.