MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

Source: cs.CV updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

MotionEnhancer improves the motion understanding of Vision-Language Models (VLMs) by incorporating motion priors distilled from Video Diffusion Models (VDMs). Current VLMs often struggle with fine-grained motion details and focus instead on high-level static semantics. MotionEnhancer addresses this with two parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), which extract and refine motion-related attention maps from a VDM such as CogVideoX-1.5-5B. These priors then serve as auxiliary supervision that guides VLM attention alignment during supervised fine-tuning. Experiments show consistent improvements over baseline VLMs, including Qwen2.5-VL (3B, 7B) and InternVL3 (2B, 8B), on motion-level benchmarks such as MotionBench and FAVOR-Bench, with gains of up to 11.7% on specific metrics. The approach requires no additional trainable parameters or architectural modifications.

Key takeaway

For AI scientists and machine learning engineers developing video understanding models, MotionEnhancer offers a data-efficient way to improve fine-grained motion perception. By integrating motion priors from Video Diffusion Models, existing VLMs can perform competitively with substantially larger models, even with less training data. Consider applying this parameter-free attention alignment strategy to improve temporal reasoning without architectural changes or extensive data re-collection.

Key insights

Video Diffusion Model attention provides motion-calibrated priors to enhance Vision-Language Model motion understanding via attention alignment.

Method

MotionEnhancer extracts VDM attention maps through 5-step DDIM inversion and denoising, refines them with Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), and then aligns the VLM's attention with the refined maps using an L2-norm MSE loss during supervised fine-tuning.
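
The alignment step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "L2-norm MSE" means both attention maps are scaled to unit L2 norm before a mean-squared error is taken, and that the alignment term is added to the usual fine-tuning loss with a fixed weight. The function names, the flattened-list representation of attention maps, and the weight value are all hypothetical.

```python
import math


def l2_normalize(attn):
    """Scale a flattened attention map to unit L2 norm (all-zero maps are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in attn))
    if norm == 0.0:
        return list(attn)
    return [a / norm for a in attn]


def alignment_loss(vlm_attn, vdm_prior):
    """Mean-squared error between L2-normalized VLM attention and the VDM motion prior.

    Assumption: both inputs are flattened attention maps of equal, non-zero length.
    """
    if len(vlm_attn) != len(vdm_prior) or len(vlm_attn) == 0:
        raise ValueError("attention maps must be non-empty and of equal length")
    p = l2_normalize(vlm_attn)
    q = l2_normalize(vdm_prior)
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def total_loss(sft_loss, vlm_attn, vdm_prior, weight=0.1):
    """Supervised fine-tuning loss plus a weighted alignment term (weight is a hypothetical value)."""
    return sft_loss + weight * alignment_loss(vlm_attn, vdm_prior)
```

Because the alignment term only compares attention maps that the VLM already computes, a loss of this shape adds no trainable parameters, which is consistent with the parameter-free claim in the summary.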

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.