EarlyTom: Early Token Compression Completes Fast Video Understanding

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

EarlyTom is a training-free token compression framework that makes Video Large Language Models (Video-LLMs) more efficient by cutting the cost of processing very large numbers of visual tokens. Whereas previous methods compress tokens late in prefilling, EarlyTom compresses visual tokens early, directly inside the vision encoder. This targets vision encoding itself, which contributes significantly to time-to-first-token (TTFT). EarlyTom also introduces a decoupled spatial token selection strategy to improve compression effectiveness. For LLaVA-OneVision-7B on a single NVIDIA A100 GPU, it reduces TTFT by up to 2.65x and FLOPs by up to 61% while keeping accuracy comparable to full-token baselines, which makes real-world deployment more practical.
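
The two headline figures are quoted on different scales (a factor for TTFT, a percentage for FLOPs). The short sketch below puts them on a common scale, reading "2.65x" as a speedup factor; the helper names are illustrative and not taken from the paper.

```python
def speedup_to_reduction(speedup: float) -> float:
    """Fraction of the baseline removed by an s-times speedup: 1 - 1/s."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def reduction_to_speedup(reduction: float) -> float:
    """Equivalent speedup factor for a fractional reduction r: 1 / (1 - r)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# Headline figures quoted in the summary (both are "up to" values).
ttft_reduction = speedup_to_reduction(2.65)  # share of baseline TTFT removed
flops_speedup = reduction_to_speedup(0.61)   # equivalent FLOPs factor
print(f"TTFT reduction: {ttft_reduction:.1%}")
print(f"FLOPs factor:   {flops_speedup:.2f}x")
```

Under that reading, a 2.65x TTFT speedup removes roughly 62% of baseline latency, and a 61% FLOPs cut corresponds to roughly 2.56x less compute, so the two best-case figures are of similar size.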

Key takeaway

If you are deploying Video-LLMs in production and facing efficiency bottlenecks, consider integrating an early-stage token compression framework such as EarlyTom. For LLaVA-OneVision-7B it reduces time-to-first-token by up to 2.65x and FLOPs by up to 61% while keeping accuracy comparable to full-token baselines, which can substantially improve your system's throughput and responsiveness. For practical Video-LLM applications, optimize inside the vision encoder rather than only after it.

Key insights

Compressing tokens early, inside the vision encoder, improves Video-LLM efficiency because it cuts the cost of vision encoding itself, not only the cost of LLM prefilling.

Principles

Vision encoding contributes significantly to time-to-first-token (TTFT).
Compressing tokens inside the encoder offers substantial efficiency gains.
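
The two principles above can be illustrated with a toy latency model in which TTFT is the sum of vision-encoding time and LLM prefill time. The timings and scale factors below are hypothetical values chosen for illustration, not measurements from the paper.

```python
def ttft(encode_ms: float, prefill_ms: float,
         encode_scale: float = 1.0, prefill_scale: float = 1.0) -> float:
    """Toy model: TTFT = scaled vision-encoding time + scaled LLM prefill time."""
    return encode_ms * encode_scale + prefill_ms * prefill_scale


# Hypothetical baseline split, chosen only for illustration.
ENCODE_MS, PREFILL_MS = 400.0, 600.0
baseline = ttft(ENCODE_MS, PREFILL_MS)

# Late compression: only the prefill term shrinks (here to 25%).
late = ttft(ENCODE_MS, PREFILL_MS, prefill_scale=0.25)

# Early compression: the encoder term shrinks as well (here to 50%).
early = ttft(ENCODE_MS, PREFILL_MS, encode_scale=0.5, prefill_scale=0.25)

print(f"late-only speedup:  {baseline / late:.2f}x")
print(f"early+late speedup: {baseline / early:.2f}x")
# Even if prefill cost dropped to zero, the encoder term caps late-only speedup.
print(f"late-only ceiling:  {baseline / ENCODE_MS:.2f}x")
```

In this model, late-only compression can never beat the ratio of baseline TTFT to encoding time, which is why a large encoder share makes in-encoder compression worthwhile.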

Method

EarlyTom is a training-free framework that compresses visual tokens early, within the vision encoder, and uses a decoupled spatial token selection strategy to make that compression more effective.
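
This summary does not spell out how the decoupled spatial selection works, so the sketch below does not reproduce it. It shows only a generic training-free selection step as a stand-in: given one importance score per spatial token in a frame, keep the top fraction and preserve spatial order. The function names, the scores, and the keep ratio are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def select_top_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in their original order.

    `scores` holds one importance value per spatial token of a single frame.
    How the scores are computed is left open (an assumption of this sketch).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])  # restore spatial order for later layers


def compress_frame(tokens: Sequence[Sequence[float]],
                   scores: Sequence[float],
                   keep_ratio: float) -> List[Sequence[float]]:
    """Drop low-scoring tokens from one frame; no learned parameters involved."""
    if len(tokens) != len(scores):
        raise ValueError("one score per token is required")
    return [tokens[i] for i in select_top_tokens(scores, keep_ratio)]


# Toy frame: six 2-d token features with made-up importance scores.
frame = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.1], [0.7, 0.6], [0.0, 0.0], [0.5, 0.5]]
scores = [0.05, 0.90, 0.10, 0.70, 0.01, 0.40]
kept = compress_frame(frame, scores, keep_ratio=0.5)
print(kept)  # tokens at indices 1, 3 and 5 survive
```

Because the step only ranks and gathers existing features, it needs no fine-tuning, which is what "training-free" means in this context.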

In practice

Apply early-stage token compression within vision encoders.
Implement decoupled spatial token selection for better compression.
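
When applying this advice, the depth at which tokens are dropped matters. The sketch below counts token-layer units as a rough proxy for encoder compute under a single pruning step; the layer count, token count, and keep ratio are hypothetical and do not describe EarlyTom's actual configuration.

```python
def encoder_token_work(n_tokens: int, n_layers: int,
                       prune_after: int, keep_ratio: float) -> int:
    """Count token-layer units when pruning once after `prune_after` layers.

    This is a linear proxy for compute; it ignores the quadratic attention
    term, which would make early pruning look even more favorable.
    """
    if not 0 <= prune_after <= n_layers:
        raise ValueError("prune_after must be between 0 and n_layers")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    kept = max(1, round(n_tokens * keep_ratio))
    return n_tokens * prune_after + kept * (n_layers - prune_after)


# Hypothetical encoder: 24 layers, 576 tokens per frame, keep 25% of tokens.
N_TOKENS, N_LAYERS, KEEP = 576, 24, 0.25
full = encoder_token_work(N_TOKENS, N_LAYERS, prune_after=N_LAYERS, keep_ratio=KEEP)
for k in (4, 12, 24):
    work = encoder_token_work(N_TOKENS, N_LAYERS, prune_after=k, keep_ratio=KEEP)
    print(f"prune after layer {k:2d}: {work / full:.3f} of full encoder work")
```

Pruning after the last layer (k = 24) corresponds to compressing only once the encoder has finished, so the encoder cost is unchanged. The earlier the pruning point, the more encoder work is saved, though pruning very early may leave less information for scoring tokens.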

Topics

Video Large Language Models
Token Compression
Vision Encoders
Time-to-First-Token
Model Efficiency
LLaVA-OneVision-7B

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.