MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

MotionEnhancer improves the motion understanding of Vision-Language Models (VLMs) by incorporating motion priors distilled from Video Diffusion Models (VDMs). Current VLMs often miss fine-grained motion details and focus instead on high-level static semantics. MotionEnhancer addresses this with two parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), which extract and refine motion-related attention maps from a VDM such as CogVideoX-1.5-5B. These priors then serve as auxiliary supervision that guides VLM attention alignment during supervised fine-tuning. Experiments show consistent improvements over baseline VLMs, including Qwen2.5-VL (3B, 7B) and InternVL3 (2B, 8B), on motion-level benchmarks such as MotionBench and FAVOR-Bench, with gains of up to 11.7% on specific metrics. This scalable approach adds no trainable parameters and requires no architectural modifications.

Key takeaway

For AI scientists and machine learning engineers developing video understanding models, MotionEnhancer offers a data-efficient way to improve fine-grained motion perception. By integrating motion priors from Video Diffusion Models, existing VLMs can reach performance competitive with substantially larger architectures, even with less training data. Consider applying this parameter-free attention alignment strategy to strengthen temporal reasoning without complex architectural changes or extensive data re-collection.

Key insights

Video Diffusion Model attention provides motion-calibrated priors to enhance Vision-Language Model motion understanding via attention alignment.

Principles

VLMs model the discriminative distribution $p(t|V)$, which is mismatched with the $p(V|t)$ that motion understanding needs.
VDMs' generative attention naturally approximates motion-seeking $p(V|t)$.
Attention alignment transfers generative motion priors to discriminative VLMs.
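
For reference, the two conditionals in these principles are linked by Bayes' rule. This is a standard identity rather than a derivation taken from the paper, and it assumes that $t$ denotes the text and $V$ the video, which this summary does not state explicitly:

$$p(V \mid t) = \frac{p(t \mid V)\, p(V)}{p(t)}$$

Read this way, a model trained only on $p(t|V)$ has no explicit handle on the video prior $p(V)$, which is one plausible reading of why a generative model's attention could carry complementary motion information.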

Method

MotionEnhancer extracts VDM attention via 5-step DDIM inversion/denoising, refines it with Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), then aligns it with VLM attention using an L2-norm MSE loss during supervised fine-tuning.
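
The alignment term can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: it assumes "L2-norm MSE loss" means that each attention map is scaled to unit L2 norm before a mean squared error is taken, and that both maps have already been flattened to the same length. The function names `l2_normalize` and `alignment_loss` are hypothetical, and a real training loop would use a tensor library rather than Python lists.

```python
import math


def l2_normalize(values, eps=1e-8):
    """Scale a flat attention map so that its L2 norm is (approximately) 1."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / (norm + eps) for v in values]


def alignment_loss(vlm_attn, vdm_attn):
    """Mean squared error between L2-normalized VLM and VDM attention maps.

    Assumption: both inputs are flat lists of equal, non-zero length that
    describe attention over the same positions.
    """
    if not vlm_attn or len(vlm_attn) != len(vdm_attn):
        raise ValueError("attention maps must be non-empty and equally long")
    a = l2_normalize(vlm_attn)
    b = l2_normalize(vdm_attn)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Maps that differ only by scale are treated as aligned.
same_shape = alignment_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Maps that attend to disjoint positions are penalized.
disjoint = alignment_loss([1.0, 0.0], [0.0, 1.0])
```

Under this reading, normalizing first makes the loss insensitive to the overall scale of either attention map, so only the spatial or temporal distribution of attention is compared.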

In practice

Use 5-step DDIM inversion for VDM attention extraction.
Select top 50% motion-relevant VDM attention heads and text tokens.
Apply L2-norm MSE loss for VLM-VDM attention alignment.
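
The top-50% selection step can be illustrated with a small ranking helper. This sketch assumes each attention head (or text token) already has a scalar motion-relevance score; how MHS and MTTI compute that score is not described in this summary, so the scores below are made-up placeholders, and `select_top_fraction` is a hypothetical name.

```python
import math


def select_top_fraction(scores, fraction=0.5):
    """Return the indices of the highest-scoring items, keeping `fraction` of them.

    Indices are returned in ascending order so they can be used directly to
    index the original list of heads or tokens.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not scores:
        return []
    keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Placeholder motion-relevance scores for four attention heads.
head_scores = [0.1, 0.9, 0.4, 0.7]
selected_heads = select_top_fraction(head_scores, fraction=0.5)
```

The same helper applies to text tokens by passing per-token scores instead of per-head scores. Rounding up with `math.ceil` is a design choice made here so that an odd number of candidates still keeps at least half of them.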

Topics

Vision-Language Models
Video Diffusion Models
Motion Understanding
Attention Alignment
Supervised Fine-tuning
Video Question Answering
CogVideoX

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.