Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs
Summary
Fre-Res is a dual-track video-token compression framework that addresses the tension between spatial fidelity and temporal coverage in video Multimodal Large Language Models (MLLMs). It keeps sparse, high-fidelity spatial anchors and represents dense temporal evolution with compact residual-frequency tokens. The framework applies a temporal 1D Discrete Cosine Transform (1D-DCT) to inter-frame residual trajectories in vision-latent space, exploiting the observation that their energy concentrates in low frequencies. A Spatial-Guided Absorber then injects this temporal residual information into spatially corresponding anchor tokens, aligning frequency-domain dynamics with native visual embeddings. Fre-Res achieves a favorable accuracy–efficiency trade-off across fine-grained short-video and long-video reasoning benchmarks, matching or approaching full-token performance while substantially reducing visual-token length. For instance, on a 1-minute, 30 FPS video, it reduces context length from over one million tokens to approximately 46,102 tokens, cutting peak VRAM from 41.68 GB to 22.35 GB and Time-To-First-Token (TTFT) from 7.84 seconds to 1.68 seconds on an H100 GPU.
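To put the reported figures in perspective, the short calculation below derives the implied ratios. It is only arithmetic on the numbers quoted above; the one assumption is that "over one million" is treated as exactly 1,000,000 tokens, so the token ratio is a lower bound.

```python
# Figures quoted in the summary (1-minute, 30 FPS video on an H100 GPU).
frames = 60 * 30                    # frames in a 1-minute, 30 FPS video

# ASSUMPTION: "over one million" is taken as exactly 1,000,000,
# so the resulting ratio is a lower bound, not the reported value.
baseline_tokens_floor = 1_000_000
compressed_tokens = 46_102

min_token_ratio = baseline_tokens_floor / compressed_tokens
vram_ratio = 41.68 / 22.35          # peak VRAM, full tokens vs. Fre-Res
ttft_speedup = 7.84 / 1.68          # Time-To-First-Token, full vs. Fre-Res

print(f"frames: {frames}")
print(f"token reduction: at least {min_token_ratio:.1f}x")
print(f"peak VRAM reduction: {vram_ratio:.2f}x")
print(f"TTFT speedup: {ttft_speedup:.2f}x")
```

In other words, the quoted example corresponds to at least about 21.7x fewer visual tokens, roughly 1.9x lower peak VRAM, and roughly 4.7x faster time to first token.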
Key takeaway
For AI engineers developing or deploying video MLLMs, Fre-Res offers a way to curb the quadratic prefill attention cost and linear KV-cache memory growth that come with long video sequences. Its dual-track compression significantly reduces visual-token length and hardware overhead (e.g., 2.1x to 22.5x compression, 2.8x VRAM reduction over Fourier compression) while preserving critical spatial and temporal reasoning capabilities. Consider Fre-Res when you need practical, scalable long-video understanding with little or no loss in accuracy.
Key insights
Fre-Res efficiently compresses video for MLLMs by separating spatial anchors from compact temporal-frequency residuals.
Principles
- Video evidence has distinct spatial and temporal roles.
- Inter-frame residuals in latent space exhibit strong low-frequency concentration.
- Frequency-domain dynamics must be aligned with the model's native visual embeddings.
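The low-frequency concentration principle above can be checked numerically. The sketch below is illustrative only: it uses a synthetic, smoothly varying trajectory as a stand-in for real latent residuals, and the helper names (`dct_matrix`, `low_freq_energy_fraction`) are hypothetical, not taken from the paper.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix (n, n); row k is the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    mat[0] /= np.sqrt(2.0)  # rescale the DC row so all rows have unit norm
    return mat

def low_freq_energy_fraction(residuals: np.ndarray, keep: int) -> float:
    """Share of energy held by the first `keep` temporal DCT coefficients.

    residuals: (T, D) array, one residual vector per time step.
    """
    coeffs = dct_matrix(residuals.shape[0]) @ residuals  # DCT along time
    return float((coeffs[:keep] ** 2).sum() / (coeffs ** 2).sum())

# ASSUMPTION: synthetic smooth trajectories plus small noise stand in for
# real inter-frame residuals in vision-latent space.
rng = np.random.default_rng(0)
T, D = 32, 16
time = np.linspace(0.0, 1.0, T)[:, None]
smooth = np.sin(2 * np.pi * time * rng.uniform(0.2, 1.0, size=(1, D)))
noisy = smooth + 0.05 * rng.standard_normal((T, D))

fraction = low_freq_energy_fraction(noisy, keep=8)
print(f"energy in the lowest 8 of {T} coefficients: {fraction:.3f}")
```

If most of the energy sits in the first few coefficients, the remaining ones can be dropped with little loss, which is the property the compression relies on.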
Method
Fre-Res uses a dual-track approach: sparse raw anchors for spatial fidelity and temporal 1D-DCT on latent inter-frame residuals for temporal dynamics, fused via a Spatial-Guided Absorber.
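A minimal sketch of the dual-track idea follows, under several assumptions that are not confirmed by the source: the first frame of a segment serves as the anchor, residuals are taken relative to that anchor, and only the first `keep` DCT coefficients per spatial position are retained. Function names and shapes are hypothetical.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix (n, n)."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    mat[0] /= np.sqrt(2.0)
    return mat

def compress_segment(latents: np.ndarray, keep: int):
    """latents: (T, N, D) = frames x tokens per frame x channels.

    ASSUMPTION: frame 0 is the spatial anchor; later frames are stored as
    truncated temporal DCT coefficients of their residuals to the anchor.
    """
    anchor = latents[0]                          # (N, D), kept as-is
    residuals = latents[1:] - anchor             # (T-1, N, D)
    basis = dct_matrix(residuals.shape[0])
    coeffs = np.einsum("kt,tnd->knd", basis, residuals)  # DCT along time
    return anchor, coeffs[:keep]                 # low frequencies only

def reconstruct_segment(anchor, coeffs, num_frames: int) -> np.ndarray:
    """Approximate inverse: missing high-frequency coefficients are zero."""
    basis = dct_matrix(num_frames - 1)[: coeffs.shape[0]]
    residuals = np.einsum("kt,knd->tnd", basis, coeffs)
    return np.concatenate([anchor[None], anchor[None] + residuals], axis=0)

# Synthetic example: tokens drifting slowly and linearly over 17 frames.
rng = np.random.default_rng(0)
T, N, D = 17, 4, 8
base = rng.standard_normal((N, D))
drift = 0.1 * rng.standard_normal((N, D))
latents = base[None] + np.arange(T)[:, None, None] * drift[None]

anchor, coeffs = compress_segment(latents, keep=4)
recon = reconstruct_segment(anchor, coeffs, num_frames=T)
rel_err = np.linalg.norm(recon - latents) / np.linalg.norm(latents)

# ASSUMPTION: each retained (coefficient, position) pair counts as one token.
tokens_before = T * N
tokens_after = (1 + coeffs.shape[0]) * N
print(tokens_before, tokens_after, f"{rel_err:.4f}")
```

The reconstruction step is included only to show how much information the truncated coefficients retain; the summary describes the frequency tokens as being fused into anchor tokens by the Spatial-Guided Absorber rather than decoded back into frames.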
In practice
- Use 1D-DCT on latent residuals for compact motion representation.
- Employ sparse spatial anchors for object and layout reasoning.
- Integrate temporal-frequency data into spatial tokens via cross-attention.
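The last point, injecting temporal-frequency information into spatial tokens via cross-attention, might look roughly like the sketch below. This is not the paper's Spatial-Guided Absorber: the single-head design, the restriction of each anchor token to frequency tokens at its own spatial position, the residual connection, and the random projection matrices (which would be learned in practice) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def absorb(anchor, freq_tokens, w_q, w_k, w_v):
    """Single-head cross-attention from anchor tokens to frequency tokens.

    anchor:      (N, D) spatial anchor tokens (queries)
    freq_tokens: (K, N, D) K frequency tokens per spatial position
    w_q/w_k/w_v: (D, D) projections (hypothetical; learned in a real model)

    ASSUMPTION: anchor token n attends only to the K frequency tokens at
    the same spatial position n.
    """
    q = anchor @ w_q                          # (N, D)
    k = freq_tokens @ w_k                     # (K, N, D)
    v = freq_tokens @ w_v                     # (K, N, D)
    scores = np.einsum("nd,knd->nk", q, k) / np.sqrt(anchor.shape[-1])
    weights = softmax(scores, axis=-1)        # (N, K), rows sum to 1
    update = np.einsum("nk,knd->nd", weights, v)
    return anchor + update, weights           # residual connection

rng = np.random.default_rng(0)
N, K, D = 4, 3, 8
anchor_tokens = rng.standard_normal((N, D))
freq = rng.standard_normal((K, N, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

fused, attn = absorb(anchor_tokens, freq, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Note that the output keeps the anchor's shape, so in this sketch the temporal information is absorbed without adding tokens to the sequence passed to the language model.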
Topics
- Fre-Res
- Video Token Compression
- Video MLLMs
- Temporal 1D-DCT
- Spatial-Guided Absorber
Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.