Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

2024-05-10 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Multimodal AI & Video Understanding · Depth: Expert, extended

Summary

Fre-Res is a dual-track video-token compression framework that addresses the tension between spatial fidelity and temporal coverage in Video Multimodal Large Language Models (MLLMs). It keeps sparse, high-fidelity spatial anchors and represents dense temporal evolution with compact residual-frequency tokens. The framework applies a temporal 1D Discrete Cosine Transform (1D-DCT) to inter-frame residual trajectories in vision-latent space, exploiting the observation that their energy concentrates in low frequencies. A Spatial-Guided Absorber then injects this temporal residual information into spatially corresponding anchor tokens, aligning frequency-domain dynamics with native visual embeddings. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res matches or approaches full-token performance while substantially reducing visual-token length. For example, for a 1-minute, 30 FPS video it reduces context length from over one million tokens to approximately 46,102, cutting peak VRAM from 41.68 GB to 22.35 GB and Time-To-First-Token (TTFT) from 7.84 seconds to 1.68 seconds on an H100 GPU.
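The ratios implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the summary, plus one labeled assumption: a per-frame token count of 576 (a 24x24 patch grid), which the summary does not state and which is chosen here only because it is consistent with "over one million" tokens for 1,800 frames.

```python
# Figures quoted in the summary (1-minute, 30 FPS clip on an H100 GPU).
FPS = 30
DURATION_S = 60
COMPRESSED_TOKENS = 46_102
VRAM_FULL_GB, VRAM_COMPRESSED_GB = 41.68, 22.35
TTFT_FULL_S, TTFT_COMPRESSED_S = 7.84, 1.68

# ASSUMPTION (not stated in the summary): 576 visual tokens per frame,
# i.e. a 24x24 patch grid. Replace with the real value for your encoder.
TOKENS_PER_FRAME = 576

frames = FPS * DURATION_S
full_tokens = frames * TOKENS_PER_FRAME
token_ratio = full_tokens / COMPRESSED_TOKENS
vram_ratio = VRAM_FULL_GB / VRAM_COMPRESSED_GB
ttft_ratio = TTFT_FULL_S / TTFT_COMPRESSED_S

print(f"frames:              {frames}")
print(f"full tokens (assumed per-frame count): {full_tokens}")
print(f"token compression:   {token_ratio:.2f}x")
print(f"peak VRAM reduction: {vram_ratio:.2f}x")
print(f"TTFT speed-up:       {ttft_ratio:.2f}x")
```

Under that assumption the token ratio lands near the upper end of the compression range quoted in the takeaway, while the VRAM and TTFT ratios follow directly from the reported measurements.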

Key takeaway

For AI engineers developing or deploying video MLLMs, Fre-Res offers a way to contain the quadratic prefill attention cost and linear KV-cache memory growth that come with long video sequences. Its dual-track compression reduces visual-token length and hardware overhead (reported figures: 2.1x to 22.5x compression, and a 2.8x VRAM reduction relative to Fourier compression) while preserving spatial and temporal reasoning capabilities. Consider it when you need practical, scalable long-video understanding with accuracy that matches or approaches full-token performance.
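To make the scaling argument concrete, here is a minimal back-of-the-envelope cost model. The model shape (layer count, head counts, head dimension) is hypothetical and is not taken from the paper, and the full-sequence length of 1,000,000 is only the lower bound implied by "over one million". The ratios it prints depend only on the token counts, not on the assumed shape.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size: one key and one value vector per token, KV head, and layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


def attention_flops(n_tokens, n_layers, n_heads, head_dim):
    """Prefill attention FLOPs: QK^T and the weighted sum over V each cost
    about 2 * n^2 * d per head (causal masking ignored)."""
    return 4 * n_layers * n_heads * head_dim * n_tokens ** 2


# HYPOTHETICAL model shape, for illustration only.
N_LAYERS, N_HEADS, N_KV_HEADS, HEAD_DIM = 32, 32, 8, 128

FULL_TOKENS = 1_000_000      # lower bound from "over one million"
COMPRESSED_TOKENS = 46_102   # figure quoted in the summary

kv_ratio = (
    kv_cache_bytes(FULL_TOKENS, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    / kv_cache_bytes(COMPRESSED_TOKENS, N_LAYERS, N_KV_HEADS, HEAD_DIM)
)
flops_ratio = (
    attention_flops(FULL_TOKENS, N_LAYERS, N_HEADS, HEAD_DIM)
    / attention_flops(COMPRESSED_TOKENS, N_LAYERS, N_HEADS, HEAD_DIM)
)

print(f"KV-cache reduction (linear in tokens):        {kv_ratio:.1f}x")
print(f"Attention FLOP reduction (quadratic in tokens): {flops_ratio:.1f}x")
```

End-to-end measurements such as peak VRAM and TTFT also include costs that do not shrink with the visual-token count (model weights, for example), so they need not follow these idealized ratios.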

Key insights

Fre-Res efficiently compresses video for MLLMs by separating spatial anchors from compact temporal-frequency residuals.

Principles

Video evidence has distinct spatial and temporal roles.
Inter-frame residuals in latent space exhibit strong low-frequency concentration.
Frequency-domain dynamics must be aligned with the native visual embeddings, which Fre-Res does by injecting them into spatially corresponding anchor tokens.
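The low-frequency concentration principle can be illustrated on synthetic data. The sketch below is not a measurement on real vision latents: it builds a slowly varying stand-in trajectory, takes frame-to-frame differences, and compares how much temporal DCT energy falls in the first few coefficients against white noise. The function names and the orthonormal DCT-II helper are illustrative choices.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th cosine frequency."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)
    return basis


def low_freq_energy_fraction(residuals, keep):
    """Share of total energy in the first `keep` temporal DCT coefficients.

    residuals: array of shape (n_steps, n_channels), one trajectory per channel.
    """
    coeffs = dct_matrix(residuals.shape[0]) @ residuals
    energy = coeffs ** 2
    return float(energy[:keep].sum() / energy.sum())


rng = np.random.default_rng(0)
n_frames, n_channels = 32, 8

# SYNTHETIC stand-in for latents: slow drift per channel plus small noise.
t = np.arange(n_frames)[:, None] / n_frames
phases = rng.uniform(0.0, 2.0 * np.pi, size=(1, n_channels))
latents = np.sin(np.pi * t + phases) + 0.005 * rng.normal(size=(n_frames, n_channels))

smooth_residuals = np.diff(latents, axis=0)                  # inter-frame residuals
noise_residuals = rng.normal(size=smooth_residuals.shape)    # no temporal structure

frac_smooth = low_freq_energy_fraction(smooth_residuals, keep=4)
frac_noise = low_freq_energy_fraction(noise_residuals, keep=4)
print(f"energy in first 4 coefficients, smooth residuals: {frac_smooth:.3f}")
print(f"energy in first 4 coefficients, white noise:      {frac_noise:.3f}")
```

Running the same measurement on residuals from a real vision encoder would be the way to check whether the concentration reported for Fre-Res holds for a given model and dataset.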

Method

Fre-Res uses a dual-track approach: sparse raw anchors for spatial fidelity and temporal 1D-DCT on latent inter-frame residuals for temporal dynamics, fused via a Spatial-Guided Absorber.
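A minimal sketch of the two tracks, under stated assumptions: one segment of per-frame latent tokens, the first frame kept as the raw anchor, residuals defined as consecutive-frame differences, and only the first `keep` temporal DCT coefficients retained. The function names, the segment layout, and the reconstruction step are illustrative and not taken from the paper. Reconstruction is included only as a sanity check on the transform, since the model is described as consuming the tokens rather than decoding frames.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th cosine frequency."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)
    return basis


def compress_segment(latents, keep):
    """Split a segment into a raw anchor and truncated temporal DCT coefficients.

    latents: (n_frames, n_tokens, dim) latent tokens for one segment.
    Returns anchor (n_tokens, dim) and coeffs (keep, n_tokens, dim).
    """
    anchor = latents[0]                                    # spatial track
    residuals = np.diff(latents, axis=0)                   # (n_frames - 1, n_tokens, dim)
    basis = dct_matrix(residuals.shape[0])
    coeffs = np.tensordot(basis, residuals, axes=(1, 0))   # DCT along time
    return anchor, coeffs[:keep]                           # keep low frequencies only


def reconstruct_segment(anchor, coeffs, n_frames):
    """Approximate inverse: zero-pad the dropped coefficients, invert the DCT,
    and accumulate the residuals on top of the anchor."""
    n_res = n_frames - 1
    padded = np.zeros((n_res,) + coeffs.shape[1:])
    padded[: coeffs.shape[0]] = coeffs
    residuals = np.tensordot(dct_matrix(n_res).T, padded, axes=(1, 0))
    later_frames = anchor[None] + np.cumsum(residuals, axis=0)
    return np.concatenate([anchor[None], later_frames], axis=0)


rng = np.random.default_rng(0)
segment = rng.normal(size=(16, 4, 8))   # synthetic: 16 frames, 4 tokens, dim 8

anchor, coeffs = compress_segment(segment, keep=4)
approx = reconstruct_segment(anchor, coeffs, n_frames=segment.shape[0])
print("anchor:", anchor.shape, "frequency coefficients:", coeffs.shape)
print("reconstruction:", approx.shape)
```

Keeping all `n_frames - 1` coefficients makes the round trip exact, so any loss comes only from the choice of `keep`.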

In practice

Use 1D-DCT on latent residuals for compact motion representation.
Employ sparse spatial anchors for object and layout reasoning.
Integrate temporal-frequency data into spatial tokens via cross-attention.
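The cross-attention step in the last item can be sketched as follows. This is a single-head illustration, not the Spatial-Guided Absorber as published: it assumes each anchor token attends only to the frequency tokens at its own spatial position and adds the result back as a residual update, so the output has the same number of tokens as the anchors. The projection matrices `w_q`, `w_k`, and `w_v` stand in for learned weights and are random here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def absorb(anchor_tokens, freq_tokens, w_q, w_k, w_v):
    """Inject per-position frequency tokens into anchor tokens via cross-attention.

    anchor_tokens: (n_tokens, dim) spatial anchors (queries).
    freq_tokens:   (n_tokens, n_freq, dim) frequency tokens at each position
                   (keys and values).
    Returns an array of shape (n_tokens, dim).
    """
    q = anchor_tokens @ w_q                                  # (n_tokens, dim)
    k = freq_tokens @ w_k                                    # (n_tokens, n_freq, dim)
    v = freq_tokens @ w_v                                    # (n_tokens, n_freq, dim)
    scores = np.einsum("nd,nfd->nf", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)                       # over frequencies
    update = np.einsum("nf,nfd->nd", weights, v)
    return anchor_tokens + update                            # residual connection


rng = np.random.default_rng(0)
n_tokens, n_freq, dim = 6, 4, 8

anchors = rng.normal(size=(n_tokens, dim))
freqs = rng.normal(size=(n_tokens, n_freq, dim))
# Random stand-ins for learned projection weights.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

fused = absorb(anchors, freqs, w_q, w_k, w_v)
print("anchors:", anchors.shape, "-> fused:", fused.shape)
```

Because the temporal information is folded into existing anchor tokens rather than appended as extra tokens, this design leaves the sequence length seen by the language model unchanged by the fusion step.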

Topics

Fre-Res
Video Token Compression
Video MLLMs
Temporal 1D-DCT
Spatial-Guided Absorber

Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.