ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference
Summary
ViCoStream is a stage-wise coordinated streaming framework designed to improve the real-time performance of Streaming VideoLLMs. It targets high video-ingestion throughput and low query latency by formulating inference as a coordinated pipeline spanning visual preprocessing, visual encoding, token dropping, and LLM prefilling/decoding. Unlike methods that accelerate individual modules, ViCoStream combines chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval to manage per-chunk computation and memory costs. A systematic study within ViCoStream explores how chunk size, token retention, attention locality, and retrieval scope influence the throughput-accuracy trade-off. Experiments with Qwen2.5-VL-3B/7B-Instruct on streaming benchmarks show that ViCoStream achieves 134 FPS video throughput and a time to first token (TTFT) below 50 ms on a single A100 GPU, while maintaining accuracy comparable to full-history baselines.
Key takeaway
For machine learning engineers deploying VideoLLMs in real-time streaming applications, ViCoStream offers a blueprint for coordinated inference. Combining stage-wise coordination, chunk-wise execution, and CUDA-stream overlap is reported to reach 134 FPS video throughput and sub-50 ms TTFT on a single A100 GPU. The framework bounds per-chunk computation and memory while keeping accuracy comparable to full-history baselines.
Key insights
Streaming VideoLLM inference benefits from a coordinated pipeline approach to achieve high throughput and low latency.
Principles
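- Coordinating the whole inference pipeline, rather than accelerating modules in isolation, drives streaming throughput.
- Chunk-wise execution bounds per-chunk computation and memory.
- The bottleneck migrates as settings such as chunk size change, which shapes the throughput-accuracy trade-off.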
Method
ViCoStream coordinates visual preprocessing, encoding, token dropping, and LLM prefilling/decoding. It uses chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval.
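To make the idea of bounded per-chunk cost concrete, here is a minimal, standard-library Python sketch of chunk-wise execution with token dropping and a bounded context window. It is an illustration under stated assumptions, not ViCoStream's implementation: `encode` is a stand-in that fabricates deterministic scores in place of a real visual encoder, the top-k rule in `drop_tokens` is one possible token-control policy, and the names `chunked`, `stream`, `keep_ratio`, and `window_tokens` are hypothetical.

```python
from collections import deque
from itertools import islice
from typing import Deque, Iterable, Iterator, List, Tuple

# A token is (frame_index, importance_score). Both fields are illustrative.
Token = Tuple[int, float]


def chunked(frames: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most chunk_size frames."""
    it = iter(frames)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def encode(chunk: List[int], tokens_per_frame: int = 4) -> List[Token]:
    """Stand-in for a visual encoder: emits fake, deterministic scores."""
    return [
        (frame, ((frame * 31 + t * 17) % 100) / 100.0)
        for frame in chunk
        for t in range(tokens_per_frame)
    ]


def drop_tokens(tokens: List[Token], keep_ratio: float) -> List[Token]:
    """Keep the highest-scoring fraction of tokens, then restore frame order."""
    keep = max(1, int(len(tokens) * keep_ratio))
    top = sorted(tokens, key=lambda tok: tok[1], reverse=True)[:keep]
    return sorted(top, key=lambda tok: tok[0])


def stream(
    frames: Iterable[int],
    chunk_size: int = 8,
    keep_ratio: float = 0.5,
    window_tokens: int = 32,
) -> List[Token]:
    """Process frames chunk by chunk; the context never exceeds window_tokens."""
    context: Deque[Token] = deque(maxlen=window_tokens)
    for chunk in chunked(frames, chunk_size):
        kept = drop_tokens(encode(chunk), keep_ratio)
        context.extend(kept)  # oldest tokens fall out once the window is full
    return list(context)


if __name__ == "__main__":
    final_context = stream(range(100))
    print(len(final_context), final_context[0][0], final_context[-1][0])
```

However long the stream runs, the work per chunk depends only on `chunk_size` and `keep_ratio`, and memory is capped by `window_tokens`. A real system would replace the window with bounded visual attention over cached keys and values, which this sketch does not model.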
In practice
- Implement chunk-wise processing for VideoLLM streams.
- Explore CUDA-stream overlap for pipeline acceleration.
- Analyze bottleneck migration with varying chunk sizes.
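One way to start on the items above is a back-of-the-envelope cost model before profiling on real hardware. The sketch below assumes each stage costs a fixed per-chunk overhead plus a per-frame term, that running stages one after another takes the sum of stage times per chunk, and that ideal overlap is limited only by the slowest stage. The numbers in `STAGE_COSTS_MS` are placeholders invented for illustration, not measurements from ViCoStream, and `analyze` and `ChunkReport` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# HYPOTHETICAL stage costs in milliseconds: (fixed overhead per chunk, cost per frame).
# Replace these with profiled values from your own pipeline.
STAGE_COSTS_MS: Dict[str, Tuple[float, float]] = {
    "preprocess": (1.0, 0.5),
    "encode": (4.0, 0.3),
    "prefill": (2.0, 0.2),
}


@dataclass
class ChunkReport:
    chunk_size: int
    bottleneck: str
    sequential_fps: float  # stages run back to back
    overlapped_fps: float  # stages fully overlapped (idealized upper bound)


def analyze(
    chunk_size: int,
    costs: Dict[str, Tuple[float, float]] = STAGE_COSTS_MS,
) -> ChunkReport:
    """Estimate throughput and the limiting stage for one chunk size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    stage_ms = {
        name: fixed + per_frame * chunk_size
        for name, (fixed, per_frame) in costs.items()
    }
    bottleneck = max(stage_ms, key=lambda name: stage_ms[name])
    sequential_fps = 1000.0 * chunk_size / sum(stage_ms.values())
    overlapped_fps = 1000.0 * chunk_size / stage_ms[bottleneck]
    return ChunkReport(chunk_size, bottleneck, sequential_fps, overlapped_fps)


if __name__ == "__main__":
    for size in (1, 4, 16, 32):
        report = analyze(size)
        print(
            f"chunk={report.chunk_size:>2}  bottleneck={report.bottleneck:<10}  "
            f"sequential={report.sequential_fps:7.1f} FPS  "
            f"overlapped={report.overlapped_fps:7.1f} FPS"
        )
```

With these placeholder costs the limiting stage is `encode` for small chunks and shifts to `preprocess` for large ones, which is the kind of migration worth checking for in a real profile. The model ignores accuracy entirely, so it can only narrow down chunk sizes to evaluate, not pick one.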
Topics
- Streaming VideoLLMs
- Coordinated Inference
- Real-time Performance
- Qwen2.5-VL
- GPU Acceleration
- Visual Token Control
Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.