DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

DRIFT is a general framework for adapting pretrained Vision-Language Models (VLMs) to tasks that require precise continuous outputs. Conventional VLMs decode discrete tokens autoregressively, which struggles in applications such as localizing temporal event boundaries or generating robotic control actions. DRIFT instead pairs a base predictor, which supplies an initial coarse estimate of the target output, with a generative refinement module that uses flow matching to improve that estimate iteratively. The core innovation is the residual formulation: rather than modeling the full output distribution, the generative module models a localized residual distribution around a strong prior, which the authors say significantly simplifies optimization. DRIFT is evaluated on both perception and planning tasks, including visual grounding and robotic control, and is reported to consistently outperform existing regression-based and generative solutions across architectures such as MLLMs, VLAs, and WAMs.

Key takeaway

For machine learning engineers building VLMs for real-world control or precise localization, DRIFT is worth evaluating as a way to handle continuous outputs. Its residual flow adapter avoids the limits of discrete token decoding: the generative module only has to model a small correction around a coarse estimate, which simplifies optimization. The source reports consistent gains over regression and generative baselines in applications such as robotics and visual grounding.

Key insights

DRIFT adapts VLMs for precise continuous outputs by refining coarse predictions with a flow matching-based residual generative module.

Principles

Residual formulation simplifies generative modeling.
Coarse-to-fine refinement improves continuous output precision.
Discrete token decoding limits precision on continuous output tasks.
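
The summary states the residual principle without equations. One common way to write it, assuming the standard linear-path flow matching objective, is sketched below. All symbols are illustrative assumptions rather than notation from the source: c is the VLM conditioning, \hat{y}_{\text{base}} the coarse estimate, r the residual, and v_\theta the learned velocity field.

```latex
% Final prediction = coarse estimate + generated residual
y = \hat{y}_{\text{base}} + r, \qquad r \sim p_\theta(r \mid c)

% Straight-line path from noise (t = 0) to the residual (t = 1)
r_t = (1 - t)\,\epsilon + t\,r, \qquad \epsilon \sim \mathcal{N}(0, I),\quad t \sim \mathcal{U}[0, 1]

% Regress the velocity field onto the path's constant velocity dr_t/dt = r - \epsilon
\mathcal{L}(\theta) = \mathbb{E}_{t,\,\epsilon,\,r}\,\bigl\| v_\theta(r_t, t, c) - (r - \epsilon) \bigr\|^2
```

Under this reading, a good base predictor leaves r concentrated near zero, so the flow only has to transport noise onto a narrow target distribution instead of the full output space. That is one plausible interpretation of why the residual formulation is said to simplify optimization.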

Method

DRIFT combines a base predictor for coarse estimates with a flow matching-based generative refinement module. This module iteratively improves predictions by modeling a localized residual distribution.
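
The coarse-then-refine loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a linear-path flow matching objective and Euler integration, and the names flow_matching_pair, refine, and oracle_velocity are hypothetical. A closed-form "oracle" velocity field stands in for the trained network so the example runs end to end.

```python
import numpy as np

def flow_matching_pair(residual, rng):
    """Build one training example for a linear-path flow matching loss (assumed objective)."""
    noise = rng.standard_normal(residual.shape)  # sample from the source distribution
    t = rng.uniform()                            # time drawn from [0, 1)
    x_t = (1.0 - t) * noise + t * residual       # point on the straight path noise -> residual
    target_velocity = residual - noise           # constant velocity along that path
    return x_t, t, target_velocity

def refine(coarse, velocity_fn, num_steps=8, rng=None):
    """Integrate the residual flow with Euler steps, then add it to the coarse estimate."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(coarse.shape)        # start the residual from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        # `coarse` stands in for whatever conditioning the real model would use
        x = x + dt * velocity_fn(x, t, coarse)
    return coarse + x                            # final prediction = prior + residual

# Toy data: normalized start/end of a temporal event (made-up numbers).
target = np.array([0.42, 0.87])
coarse = np.array([0.40, 0.90])                  # hypothetical base-predictor output
true_residual = target - coarse

def oracle_velocity(x, t, coarse):
    """Stand-in for a trained network: exact velocity toward a known residual."""
    return (true_residual - x) / (1.0 - t)

refined = refine(coarse, oracle_velocity, num_steps=8)
print(np.allclose(refined, target))
```

In a real system the oracle would be replaced by a network trained on pairs from flow_matching_pair, with the residual computed as the ground-truth output minus the base predictor's estimate.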

In practice

Localizing temporal event boundaries.
Generating robotic control actions.
Visual grounding tasks.

Topics

Vision-Language Models
Continuous Output Decoding
Flow Matching
Robotic Control
Visual Grounding
Generative Refinement

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.