Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation
Summary
A new study identifies data curation as an overlooked lever for improving the inference efficiency of large Vision-Language Models (VLMs): it shortens model outputs. Traditional methods such as distillation and quantization shrink the model but do nothing about growing output token counts. The researchers applied a VLM curation pipeline to the MAmmoTH-VL single-image subset, trained models on the resulting concise data, and compared them against models trained on standard MAmmoTH-VL and against external frontier VLMs. On a controlled suite of 20 evaluations covering 14 VLMs (1B-4B activated parameters), the curated models achieved a roughly 35x Cost-of-Pass advantage over the most verbose 4B comparator, Qwen3.5-4B: 0.41 TFLOPs per correct answer versus 14.58 TFLOPs, at comparable mean accuracy (0.691 vs 0.704). Curation also yielded a +17.55-percentage-point matched-length accuracy gain over the uncurated baseline, and the gain grew with model scale, from +16.7 pp at 1B to +21.2 pp at 4B. The study concludes that brevity does not compromise quality, even on reasoning tasks.
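The headline ratio can be checked directly from the two reported figures. The snippet below is a minimal arithmetic sketch; the variable names are illustrative and not taken from the study.

```python
# TFLOPs per correct answer, as reported in the summary above.
curated_tflops_per_correct = 0.41
qwen_tflops_per_correct = 14.58

# Ratio of the verbose comparator's cost to the curated model's cost.
advantage = qwen_tflops_per_correct / curated_tflops_per_correct
print(f"Cost-of-Pass advantage: {advantage:.1f}x")  # prints "Cost-of-Pass advantage: 35.6x"
```

The result, about 35.6, is consistent with the rounded 35x figure quoted in the summary.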
Key takeaway
MLOps engineers optimizing VLM deployment costs should prioritize data curation that induces concise model outputs. In this study the approach cut inference FLOPs per correct answer by up to 35x at comparable accuracy. Consider adding brevity checks to the data quality stage of your pretraining pipelines, since generic verbosity provided no accuracy advantage. Treat "tokens-per-correct" as a primary optimization target.
Key insights
Brevity in VLM outputs, achieved through data curation, significantly boosts inference efficiency without sacrificing accuracy.
Principles
- Output length is a key, often ignored, inference efficiency lever.
- Concise pretraining data induces shorter, accurate model responses.
- Generic verbosity offers no accuracy benefits.
Method
A VLM curation pipeline was applied to MAmmoTH-VL data, and models were trained on the resulting concise, correct data. Output length was then held fixed via regression so that quality gains could be measured separately from brevity.
In practice
- Curate VLM pretraining data to reduce output token count.
- Focus on "tokens-per-correct" as an efficiency metric.
- Evaluate brevity's impact separately from quality.
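A "tokens-per-correct" metric can be tracked alongside accuracy in an evaluation harness. The sketch below assumes the metric means total generated tokens divided by the number of correct answers; the summary does not give a formal definition, and `EvalRecord` and `tokens_per_correct` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    output_tokens: int  # tokens generated for this example
    is_correct: bool    # whether the final answer was graded correct

def tokens_per_correct(records):
    """Total generated tokens divided by the number of correct answers.

    Tokens spent on wrong answers still count toward the total, so a verbose
    model pays for its failures too. Returns infinity if nothing is correct.
    """
    total_tokens = sum(r.output_tokens for r in records)
    num_correct = sum(1 for r in records if r.is_correct)
    return total_tokens / num_correct if num_correct else float("inf")

# Hypothetical records for illustration.
records = [EvalRecord(10, True), EvalRecord(30, False), EvalRecord(20, True)]
print(tokens_per_correct(records))  # prints 30.0
```

Reporting this number next to accuracy makes it visible when two models score similarly but one spends far more tokens to get there.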
Topics
- VLM Inference Efficiency
- Data Curation
- Model Brevity
- Token Count Optimization
- MAmmoTH-VL
- Qwen3.5-4B
Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.