DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

DRIFT is a framework for adapting pretrained Vision-Language Models (VLMs) to tasks that require precise continuous outputs, a weak point of VLMs that decode discrete tokens autoregressively. Text-based outputs support scalable pretraining and zero-shot generalization, but they are a poor fit for problems such as localizing temporal event boundaries or generating robotic control actions. DRIFT pairs a base predictor, which supplies a coarse initial estimate, with a generative refinement module that uses flow matching to improve the prediction iteratively. Because the refinement module models only a localized residual distribution around a strong prior, rather than a complex global output distribution, optimization is simpler. On perception and planning tasks, including visual grounding and robotic control, DRIFT is reported to consistently outperform strong regression-based and generative baselines across architectures such as MLLMs, VLAs, and WAMs.

Key takeaway

For Machine Learning Engineers building Vision-Language Models for tasks that need precise continuous outputs: if your current VLM struggles with applications such as robotic control or temporal boundary localization because it decodes discrete tokens, DRIFT's residual flow adapter is worth evaluating. The residual formulation simplifies optimization, and in the reported evaluations it consistently outperforms regression-based and generative alternatives.

Key insights

DRIFT adapts VLMs for precise continuous outputs by combining a coarse base prediction with flow matching for iterative residual refinement.

Principles

Residual formulation simplifies generative modeling.
Iterative refinement improves coarse initial estimates.
Flow matching effectively models localized distributions.
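
To make the three principles concrete, here is one common way a residual flow matching objective can be written. This is a sketch under assumptions, not the paper's stated formulation: the linear noise-to-data path, the symbols $\hat{y}_0$ (base prediction), $r$ (residual), $c$ (VLM conditioning features), and $v_\theta$ (learned velocity field) are all illustrative choices that the summary does not specify.

```latex
% Residual target: what the base predictor still gets wrong
r = y - \hat{y}_0

% Linear interpolation between Gaussian noise and the residual
x_t = (1 - t)\,\epsilon + t\,r,
\qquad \epsilon \sim \mathcal{N}(0, I),\quad t \sim \mathcal{U}[0, 1]

% Flow matching loss: regress the velocity of that path
\mathcal{L}(\theta) =
  \mathbb{E}_{t,\,\epsilon,\,r}
  \left\| v_\theta(x_t, t, c) - (r - \epsilon) \right\|^2

% Inference: integrate dx/dt = v_\theta(x, t, c) from t = 0 to t = 1,
% then add the result back onto the base prediction
\hat{y} = \hat{y}_0 + x_1
```

Under this reading, the generative model only has to cover the spread of $r$ around zero, which is typically far narrower than the spread of $y$ itself.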

Method

DRIFT combines a base predictor for coarse estimates with a flow matching-based generative refinement module. This module iteratively improves predictions by modeling a localized residual distribution around the base prior.
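
The refinement loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `refine` and `make_oracle_velocity`, the Euler integrator, the step count, and the bounding-box numbers are all hypothetical. In place of a trained network, the toy velocity field is the closed-form field for a residual distribution concentrated on a single point under a linear noise-to-data path, which lets the loop run without any training.

```python
import random


def refine(base_pred, velocity_fn, num_steps=8, seed=0):
    """Euler-integrate a residual flow from noise (t=0) to a residual (t=1),
    then add that residual to the coarse base prediction."""
    rng = random.Random(seed)
    # Start the residual from standard Gaussian noise, one value per output dim.
    residual = [rng.gauss(0.0, 1.0) for _ in base_pred]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0 inside the loop
        velocity = velocity_fn(residual, t, base_pred)
        residual = [r + dt * v for r, v in zip(residual, velocity)]
    # Final output = coarse estimate + refined residual.
    return [b + r for b, r in zip(base_pred, residual)]


def make_oracle_velocity(true_residual):
    """Toy stand-in for a learned velocity network (an assumption for this
    sketch): the exact field when the residual distribution is one point."""
    def velocity_fn(x, t, base_pred):
        # base_pred is unused here; a real network would condition on it
        # (and on VLM features) to predict the velocity.
        return [(r_star - x_i) / (1.0 - t) for r_star, x_i in zip(true_residual, x)]
    return velocity_fn


# Hypothetical example: a normalized bounding box (x1, y1, x2, y2).
target = [0.30, 0.42, 0.71, 0.88]
base_pred = [0.25, 0.40, 0.75, 0.90]  # coarse estimate from the base predictor
true_residual = [y - b for y, b in zip(target, base_pred)]

velocity_fn = make_oracle_velocity(true_residual)
refined = refine(base_pred, velocity_fn, num_steps=8)

base_err = max(abs(b - y) for b, y in zip(base_pred, target))
refined_err = max(abs(p - y) for p, y in zip(refined, target))
print(f"base error: {base_err:.3f}, refined error: {refined_err:.2e}")
```

With a learned velocity field the refined output would be a sample from the modeled residual distribution, not an exact recovery of the target. The oracle field is used here only so the example is deterministic enough to check.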

In practice

Adapt VLMs for robotic control actions.
Improve visual grounding precision.
Enhance temporal event boundary localization.

Topics

Vision-Language Models
Continuous Output Decoding
Flow Matching
Robotic Control
Visual Grounding
Generative Models

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.