ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

ViCoStream is a stage-wise coordinated streaming framework for real-time inference with Streaming VideoLLMs. It targets two requirements at once: high video-ingestion throughput and low query latency. To meet them, it treats inference as a single coordinated pipeline spanning visual preprocessing, visual encoding, token dropping, and LLM prefilling/decoding. Rather than accelerating individual modules in isolation, ViCoStream combines chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval to keep per-chunk computation and memory costs under control. A systematic study within ViCoStream examines how chunk size, token retention, attention locality, and retrieval scope affect the throughput-accuracy trade-off. In experiments with Qwen2.5-VL-3B/7B-Instruct on streaming benchmarks, ViCoStream reaches 134 FPS video throughput and a time to first token (TTFT) below 50 ms on a single A100 GPU, while maintaining accuracy comparable to full-history baselines.
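To make the throughput figure concrete, the following is a minimal back-of-the-envelope sketch, not the paper's measurement method. It converts a target FPS into a per-frame time budget and compares a chunk processed stage by stage with an idealized overlapped pipeline, where steady-state time per chunk is set by the slowest stage. The stage timings, the chunk size of 8 frames, and the function names are all hypothetical and chosen only for illustration.

```python
from typing import Mapping


def frame_budget_ms(fps: float) -> float:
    """Average wall-clock time available per frame at a given throughput."""
    return 1000.0 / fps


def throughput_fps(stage_ms: Mapping[str, float], chunk_frames: int,
                   overlapped: bool) -> float:
    """Estimate frames per second for chunk-wise processing.

    Sequential: stages run back to back, so time per chunk is the sum.
    Overlapped (idealized): stages work on different chunks concurrently,
    so steady-state time per chunk is the slowest stage.
    """
    if overlapped:
        per_chunk_ms = max(stage_ms.values())
    else:
        per_chunk_ms = sum(stage_ms.values())
    return chunk_frames * 1000.0 / per_chunk_ms


# Hypothetical per-chunk stage times in milliseconds (NOT from the paper).
stage_ms = {"preprocess": 20.0, "encode": 50.0, "token_drop": 5.0, "prefill": 25.0}

budget = frame_budget_ms(134)                              # about 7.46 ms per frame
sequential = throughput_fps(stage_ms, 8, overlapped=False)  # 80.0 FPS
pipelined = throughput_fps(stage_ms, 8, overlapped=True)    # 160.0 FPS
```

The overlapped estimate is an upper bound: stages sharing one GPU compete for the same compute and memory bandwidth, so the real gain from CUDA-stream overlap is typically smaller than the ideal ratio suggests.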

Key takeaway

For machine learning engineers deploying VideoLLMs in real-time streaming applications, ViCoStream suggests that the inference pipeline should be coordinated as a whole rather than optimized one module at a time. Consider combining stage-wise coordination, chunk-wise execution, and CUDA-stream overlap; with this combination the authors report 134 FPS throughput and sub-50 ms TTFT on a single A100 GPU. The same framework bounds per-chunk compute and memory costs while keeping accuracy comparable to full-history baselines.

Key insights

Streaming VideoLLM inference benefits from a coordinated pipeline approach to achieve high throughput and low latency.

Principles

Method

ViCoStream coordinates visual preprocessing, encoding, token dropping, and LLM prefilling/decoding. It uses chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval.
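The interplay of chunk-wise execution, visual token control, and bounded visual attention can be sketched in plain Python. This is an illustrative toy under stated assumptions, not ViCoStream's implementation: tokens are modeled as `(score, payload)` pairs, the importance scores are assumed to come from some upstream module, and `chunked`, `drop_tokens`, and `BoundedVisualCache` are hypothetical names. Chunks that leave the attention window are archived, standing in for history that only query-side retrieval would reach.

```python
from collections import deque
from typing import Deque, Iterator, List, Sequence, Tuple

Token = Tuple[float, str]  # (importance score, payload); both hypothetical


def chunked(frames: Sequence, chunk_size: int) -> Iterator[Sequence]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(frames), chunk_size):
        yield frames[start:start + chunk_size]


def drop_tokens(tokens: List[Token], keep_ratio: float) -> List[Token]:
    """Keep the highest-scoring fraction of tokens, preserving their order."""
    keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: tokens[i][0], reverse=True)
    return [tokens[i] for i in sorted(ranked[:keep])]


class BoundedVisualCache:
    """Holds the most recent chunks for attention; older chunks are archived."""

    def __init__(self, max_chunks: int) -> None:
        self.recent: Deque[List[Token]] = deque(maxlen=max_chunks)
        self.archive: List[List[Token]] = []  # reachable only via retrieval

    def append(self, chunk_tokens: List[Token]) -> None:
        # Save the chunk about to be evicted before the deque discards it.
        if len(self.recent) == self.recent.maxlen:
            self.archive.append(self.recent[0])
        self.recent.append(chunk_tokens)

    def attention_context(self) -> List[Token]:
        """Flattened tokens the model may attend to for the current chunk."""
        return [tok for chunk in self.recent for tok in chunk]


# Usage: 10 frames, chunks of 4, keep half the tokens, attend to 2 chunks.
cache = BoundedVisualCache(max_chunks=2)
for chunk in chunked(list(range(10)), 4):
    # Stand-in for visual encoding: one token per frame with a made-up score.
    tokens = [((frame % 3) / 2.0, f"frame{frame}") for frame in chunk]
    cache.append(drop_tokens(tokens, keep_ratio=0.5))
```

Because both the number of tokens kept per chunk and the number of chunks in the window are capped, the cost of processing each new chunk stays bounded no matter how long the stream runs.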

Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.