What Should a Streaming Video Model Remember?
Summary
SelectStream is a selective latent-memory framework for streaming video understanding. It addresses memory allocation by keeping the current observation directly visible to a frozen VLM and exposing history only through a compact, query-conditioned evidence budget. Three coordinated mechanisms manage that budget: surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning over a fixed-capacity latent memory graph. SelectStream reaches 82.67% on StreamingBench and 67.03% on OVO-Bench in the online streaming setting, and 74.4% average accuracy on offline video benchmarks, outperforming strong recent-window baselines and prior streaming memory methods. The model was published on 2026-06-15.
Key takeaway
Computer vision engineers designing streaming video understanding models should prioritize selective memory allocation over indiscriminate history injection. SelectStream shows that a compact, query-conditioned evidence budget, managed by adaptive windowing and graph reasoning, lets a model keep strong current-scene perception while still using historical context, with reported gains over recent-window baselines and prior streaming memory methods on StreamingBench and OVO-Bench.
Key insights
SelectStream selectively allocates latent memory for streaming video understanding, balancing current perception with historical context.
Principles
- Indiscriminate history dilutes current perception.
- Memory allocation must be selective and budgeted.
- Query-conditioned evidence injection is key.
Method
SelectStream uses surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning over a fixed-capacity latent memory graph to inject calibrated evidence as latent tokens.
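The first two mechanisms can be pictured with a minimal sketch. It is not the authors' implementation: the class `LatentMemory`, the cosine-based surprise score, the threshold value, and the merge rule are all assumptions chosen for illustration. A window closes when a new frame feature differs enough from the running window mean, the closed window becomes one memory node, and when the node count exceeds the fixed capacity the lowest-priority node is merged into its most similar peer so that the higher priority survives.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

class LatentMemory:
    """Hypothetical fixed-capacity memory (capacity >= 1); illustrative only."""

    def __init__(self, capacity=4, surprise_threshold=0.3):
        self.capacity = capacity
        self.surprise_threshold = surprise_threshold
        self.nodes = []   # each node: {"vec": [...], "priority": float}
        self.window = []  # frame features of the currently open window

    def observe(self, feat):
        """Add one frame feature; close the open window when surprise is high."""
        if self.window:
            # Assumed surprise score: dissimilarity to the window's mean feature.
            surprise = 1.0 - cosine(feat, mean(self.window))
            if surprise > self.surprise_threshold:
                self._close_window(surprise)
        self.window.append(feat)

    def _close_window(self, surprise):
        """Summarize the open window as one node, then enforce the capacity."""
        # Assumed priority: the surprise at the boundary that closed the window.
        self.nodes.append({"vec": mean(self.window), "priority": surprise})
        self.window = []
        while len(self.nodes) > self.capacity:
            self._consolidate()

    def _consolidate(self):
        """Merge the lowest-priority node into its most similar remaining node."""
        i = min(range(len(self.nodes)), key=lambda k: self.nodes[k]["priority"])
        low = self.nodes.pop(i)
        j = max(range(len(self.nodes)),
                key=lambda k: cosine(low["vec"], self.nodes[k]["vec"]))
        target = self.nodes[j]
        target["vec"] = mean([target["vec"], low["vec"]])
        # Keep the higher priority so salient evidence is not demoted by merging.
        target["priority"] = max(target["priority"], low["priority"])

# Usage: a toy stream that alternates between two "scenes".
memory = LatentMemory(capacity=2)
for frame in [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]:
    memory.observe(frame)
```

The design point the sketch tries to capture is that memory size is bounded by `capacity` regardless of stream length, while window length adapts to how quickly the scene changes.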
In practice
- Integrate latent tokens for answer generation.
- Avoid replaying frames or growing context.
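The two practices above can be sketched as follows. This is an illustrative stand-in, not SelectStream's graph reasoning: the functions `select_evidence` and `build_context`, the one-hop score propagation, and the `hop_weight` parameter are assumptions. The idea shown is that history contributes at most `budget` latent tokens chosen by relevance to the query, while the current tokens are passed through untouched, so the context length stays bounded.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_evidence(query, nodes, edges, budget=2, hop_weight=0.5):
    """Return indices of at most `budget` memory nodes, most relevant first.

    `nodes` is a list of latent vectors and `edges` a list of undirected
    (i, j) index pairs. A node's score is its own similarity to the query,
    or a discounted share of a neighbor's similarity if that is higher
    (an assumed one-hop simplification of graph reasoning).
    """
    direct = [cosine(query, vec) for vec in nodes]
    scores = list(direct)
    for i, j in edges:
        scores[i] = max(scores[i], hop_weight * direct[j])
        scores[j] = max(scores[j], hop_weight * direct[i])
    ranked = sorted(range(len(nodes)), key=lambda k: scores[k], reverse=True)
    return ranked[:budget]

def build_context(current_tokens, query, nodes, edges, budget=2):
    """Prepend the selected memory tokens; current tokens stay fully visible."""
    picked = select_evidence(query, nodes, edges, budget)
    return [nodes[k] for k in picked] + list(current_tokens)

# Usage: three memory nodes, one edge linking node 0 and node 2.
nodes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
edges = [(0, 2)]
query = [1.0, 0.0, 0.0]
current_tokens = [[0.5, 0.5, 0.0]]
context = build_context(current_tokens, query, nodes, edges, budget=2)
```

In this toy case node 2 is unrelated to the query on its own but is linked to the directly relevant node 0, so it outranks the unlinked node 1. No past frames are replayed, and the context grows by at most `budget` tokens however long the stream runs.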
Topics
- Streaming Video
- Video Understanding
- Latent Memory
- Selective Attention
- Graph Reasoning
- Online Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.