AdaCodec: A Predictive Visual Code for Video MLLMs

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

AdaCodec is a predictive visual code for video multimodal large language models (MLLMs) that targets the inefficiency of encoding every video frame independently. Conventional video MLLMs treat adjacent, often near-identical frames as separate RGB images, which produces redundant visual tokens. AdaCodec transmits a full reference frame only when the scene cannot be predicted accurately from prior context, a condition it detects as a high conditional predictive cost. Otherwise it encodes only the inter-frame changes, such as motion and prediction residuals, as compact P-tokens. At a matched visual-token budget, AdaCodec outperforms the Qwen3-VL-8B per-frame RGB baseline on all eleven benchmarks. With 32k tokens it surpasses the 224k-token baseline on every long-video benchmark at 1/7 of the budget, and on five general-video benchmarks it raises the average score while cutting time-to-first-token from 9.26 s to 1.62 s.
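The headline ratios follow directly from the figures quoted in the summary. The short calculation below is only a check on that arithmetic; it reads "32k" and "224k" as 32,000 and 224,000 tokens, which is an assumption about rounding, and the variable names are illustrative.

```python
# Figures quoted in the summary; "k" is read as 1,000 (an assumption about rounding).
adacodec_tokens = 32_000
baseline_tokens = 224_000
ttft_before_s = 9.26
ttft_after_s = 1.62

budget_ratio = adacodec_tokens / baseline_tokens          # fraction of the baseline budget
speedup = ttft_before_s / ttft_after_s                    # time-to-first-token speedup factor
reduction = (ttft_before_s - ttft_after_s) / ttft_before_s  # relative latency reduction

print(f"token budget ratio: {budget_ratio:.4f} (1/7 = {1 / 7:.4f})")
print(f"TTFT speedup: {speedup:.2f}x, reduction: {reduction:.1%}")
```

Under these readings the token budget is exactly 1/7 of the baseline, and the time-to-first-token drops by a factor of roughly 5.7.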

Key takeaway

For machine learning engineers optimizing video MLLM performance or managing compute budgets, AdaCodec is worth a close look. Encoding frames according to predictive cost removes much of the visual-token redundancy of per-frame RGB encoding. In the reported results, this lets a 32k-token configuration beat the 224k-token baseline on the long-video benchmarks and, on five general-video benchmarks, cuts time-to-first-token from 9.26 s to 1.62 s. Consider evaluating predictive visual codes if latency and token budget are constraints in your video applications.

Key insights

AdaCodec uses a predictive visual code to reduce redundancy in video MLLMs, improving efficiency and performance.

Principles

Method

AdaCodec encodes full reference frames only when conditional predictive cost is high; otherwise, it transmits compact P-tokens for inter-frame motion and residuals.
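The decision rule described above can be illustrated with a toy sketch. This is not AdaCodec's implementation: the summary does not say how the conditional predictive cost is computed or how P-tokens are built, so the code assumes a hypothetical cost (mean absolute error when the previous frame is used as the prediction), a hypothetical threshold, and raw residuals standing in for P-tokens. Motion is not modeled. All names (`Packet`, `predictive_cost`, `encode`) are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

Frame = Sequence[float]  # flattened grayscale pixels; a toy stand-in for an RGB frame


@dataclass
class Packet:
    kind: str             # "REF" = full reference frame, "P" = predictive update
    payload: List[float]  # full pixels for REF, residual vs. previous frame for P


def predictive_cost(prev: Frame, cur: Frame) -> float:
    """Hypothetical cost: mean absolute error of predicting `cur` as a copy of `prev`."""
    return sum(abs(c - p) for p, c in zip(prev, cur)) / len(cur)


def encode(frames: Sequence[Frame], threshold: float) -> List[Packet]:
    """Emit a reference frame when the cost is high, otherwise a residual update."""
    packets: List[Packet] = []
    prev: Optional[Frame] = None
    for cur in frames:
        if prev is None or predictive_cost(prev, cur) > threshold:
            # Nothing to predict from, or the prediction is poor: send the full frame.
            packets.append(Packet("REF", list(cur)))
        else:
            # The scene is predictable: send only the change from the previous frame.
            packets.append(Packet("P", [c - p for p, c in zip(prev, cur)]))
        prev = cur
    return packets


# Two near-static shots separated by a scene cut.
frames = [
    [10, 10, 10, 10],
    [10, 11, 10, 10],      # small change from frame 0
    [200, 200, 200, 200],  # scene cut
    [200, 200, 201, 200],  # small change from frame 2
]
packets = encode(frames, threshold=5.0)
print([p.kind for p in packets])  # ['REF', 'P', 'REF', 'P']
```

In this toy run only the first frame and the scene cut are sent in full; the two near-duplicate frames become sparse residuals, which is the redundancy that the compact P-tokens are meant to exploit.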

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.