AdaCodec: A Predictive Visual Code for Video MLLMs
Summary
AdaCodec is a predictive visual code for video multimodal large language models (MLLMs) that addresses the inefficiency of encoding each video frame independently. Conventional video MLLMs process adjacent, often near-identical frames as separate RGB images, which produces redundant visual tokens. AdaCodec transmits a full reference frame only when the scene cannot be predicted accurately from prior context, as indicated by a high conditional predictive cost. Otherwise, it encodes inter-frame changes, such as motion and prediction residuals, as compact P-tokens. At a matched visual-token budget, AdaCodec outperforms the Qwen3-VL-8B per-frame RGB baseline on all eleven benchmarks. With 32k tokens it surpasses the 224k-token baseline on all long-video benchmarks at 1/7 the budget, and on five general-video benchmarks it raises the average score while reducing time-to-first-token from 9.26s to 1.62s.
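The headline ratios can be checked with a few lines of arithmetic. This sketch only restates figures quoted in the summary; it assumes "k" means 1,000 tokens (the ratio does not depend on that), and the summary does not say which token budgets the two time-to-first-token values were measured at.

```python
# Figures quoted in the summary above; variable names are illustrative.
baseline_tokens = 224_000   # per-frame RGB baseline budget ("224k")
adacodec_tokens = 32_000    # AdaCodec budget ("32k")
ttft_baseline_s = 9.26      # reported time-to-first-token before
ttft_adacodec_s = 1.62      # reported time-to-first-token after

# How many times smaller the AdaCodec token budget is.
budget_ratio = baseline_tokens / adacodec_tokens

# Speedup factor and fractional reduction in time-to-first-token.
ttft_speedup = ttft_baseline_s / ttft_adacodec_s
ttft_reduction = 1 - ttft_adacodec_s / ttft_baseline_s

print(f"token budget ratio: {budget_ratio:.0f}x smaller")
print(f"TTFT speedup: {ttft_speedup:.2f}x")
print(f"TTFT reduction: {ttft_reduction:.1%}")
```

The budget ratio is exactly 7, matching the "1/7 the budget" claim, and the latency figures correspond to a speedup of roughly 5.7x.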
Key takeaway
For machine learning engineers optimizing video MLLM performance or managing compute budgets, AdaCodec is a compelling approach. Encoding frames according to predictive cost removes much of the visual-token redundancy of per-frame RGB input. In the reported results, this improves long-video benchmark scores at 1/7 the token budget and cuts time-to-first-token from 9.26s to 1.62s on general-video benchmarks. Predictive visual codes are worth exploring if you need gains in both speed and resource utilization for video-based AI applications.
Key insights
AdaCodec uses a predictive visual code to reduce redundancy in video MLLMs, improving efficiency and performance.
Principles
- Video MLLMs benefit from exploiting inter-frame redundancy.
- Conditional predictive cost guides efficient frame encoding.
- Compact tokens can represent inter-frame changes effectively.
Method
AdaCodec encodes full reference frames only when conditional predictive cost is high; otherwise, it transmits compact P-tokens for inter-frame motion and residuals.
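The decision rule described here can be sketched in a few lines. This is a minimal illustration, not AdaCodec's implementation: the summary does not say how the conditional predictive cost is computed, so `predictive_cost` below is a hypothetical proxy (mean absolute difference from the previous frame), and the function names, the `"REF"`/`"P"` labels, and the threshold value are all assumptions.

```python
from typing import List, Optional, Sequence

Frame = Sequence[float]  # flattened grayscale pixels; a stand-in for real frames


def predictive_cost(prev: Frame, cur: Frame) -> float:
    """Hypothetical proxy: mean absolute error of predicting cur as a copy of prev."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)


def plan_frame_types(frames: List[Frame], threshold: float) -> List[str]:
    """Label each frame "REF" (full reference frame) or "P" (compact change tokens)."""
    plan: List[str] = []
    prev: Optional[Frame] = None
    for frame in frames:
        # The first frame has no context to predict from, so it is always a reference.
        if prev is None or predictive_cost(prev, frame) > threshold:
            plan.append("REF")
        else:
            plan.append("P")
        prev = frame
    return plan


frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],  # small change: predictable from the previous frame
    [1.0, 1.0, 1.0, 1.0],  # scene cut: prediction fails
    [1.0, 1.0, 0.9, 1.0],  # small change again
]
print(plan_frame_types(frames, threshold=0.2))  # ['REF', 'P', 'REF', 'P']
```

A real system would presumably condition on more context than the single previous frame and would learn or tune the threshold; the sketch only shows the control flow of sending a full frame when prediction is costly and a compact update otherwise.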
In practice
- Reduce visual token budget for video MLLMs.
- Improve long-video benchmark performance.
- Accelerate time-to-first-token in video processing.
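To see why mixing reference frames with compact updates lowers the token budget, consider a toy count. The per-frame token costs below (256 for a reference frame, 32 for a P-frame) and the frame plan are invented for illustration; the summary does not report AdaCodec's actual token counts per frame.

```python
from typing import List


def visual_tokens(plan: List[str], tokens_per_ref: int, tokens_per_p: int) -> int:
    """Total visual tokens for a plan of "REF" and "P" frames."""
    return sum(tokens_per_ref if kind == "REF" else tokens_per_p for kind in plan)


# Hypothetical costs and plan, chosen only to make the arithmetic easy to follow.
TOKENS_PER_REF = 256
TOKENS_PER_P = 32
plan = ["REF", "P", "P", "P", "REF", "P", "P", "P"]

# Per-frame RGB encoding treats every frame as a full reference frame.
per_frame_rgb = visual_tokens(["REF"] * len(plan), TOKENS_PER_REF, TOKENS_PER_P)
predictive = visual_tokens(plan, TOKENS_PER_REF, TOKENS_PER_P)

print(per_frame_rgb, predictive)  # 2048 704
```

Under these assumed costs, the same eight frames need about a third of the tokens, which means a fixed budget can cover a longer span of video.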
Topics
- Video MLLMs
- Predictive Visual Code
- AdaCodec
- Token Efficiency
- Inter-frame Compression
- Low Latency AI
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.