Stream3D: Sequential Multi-View 3D Generation via Evidential Memory
Summary
Stream3D is a training-free streaming mechanism that converts frozen view-conditioned 3D generators, such as SAM 3D, into streaming generators that can process long monocular video streams. Applying a single-view generator naively to each frame in turn produces temporally inconsistent results. Stream3D addresses this by maintaining a compact "Adaptive Evidential Memory" that caches only the most informative historical frames, ranked by a proposed evidence score. The memory is updated dynamically but retains a fixed number of frames, so its footprint does not grow with sequence length (e.g., approximately 65 KB for SAM 3D with Q=4096, D=5). The underlying 3D generator needs no retraining or architectural modification. On the GSO and NAVI benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, on both photometric and geometric metrics, with improved long-range consistency.
Key takeaway
For computer vision engineers building real-time 3D reconstruction or generation systems on continuous video feeds, Stream3D targets two problems: temporal inconsistency and memory that grows with stream length. Consider wrapping your existing view-conditioned 3D generator with this training-free mechanism to improve long-range consistency while keeping memory usage bounded. Because it selects evidence rather than transporting latent state, it requires no retraining or modification of your core models.
Key insights
Stream3D enables consistent 3D generation from long video streams by selectively caching evidential views, avoiding latent state transport.
Principles
- Prioritize evidence selection over latent state transport.
- Maintain constant memory footprint with stream length.
- Token-level evidence improves long-range consistency.
Method
Stream3D computes token-wise evidence scores from cross-attention maps, updates an Adaptive Evidential Memory (fixed capacity 2 × Q × D scalars), and uses token-ownership counts to select top-K conditioning views for Evidence-Based Multi-Generation.
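The memory update and view selection described above can be sketched with NumPy. This is an illustrative reading, not the authors' implementation: it assumes the memory is two Q × D arrays (the top-D evidence scores per token and the frame index each score came from), that a frame "owns" a slot whenever its index is stored there, and that the per-token evidence vector has already been extracted from the cross-attention maps. The names `init_memory`, `update_memory`, and `select_views` are hypothetical.

```python
import numpy as np

def init_memory(num_tokens: int, depth: int):
    """Allocate two (Q, D) arrays: evidence scores and source frame ids."""
    scores = np.full((num_tokens, depth), -np.inf, dtype=np.float32)
    frame_ids = np.full((num_tokens, depth), -1, dtype=np.int32)  # -1 marks an empty slot
    return scores, frame_ids

def update_memory(scores, frame_ids, evidence, t):
    """Merge frame t's per-token evidence (shape (Q,)) and keep the top D per token.

    Assumption: higher evidence is better, and eviction is per token slot.
    """
    depth = scores.shape[1]
    new_scores = np.asarray(evidence, dtype=scores.dtype)[:, None]
    new_ids = np.full((scores.shape[0], 1), t, dtype=frame_ids.dtype)
    cand_scores = np.concatenate([scores, new_scores], axis=1)   # (Q, D + 1)
    cand_ids = np.concatenate([frame_ids, new_ids], axis=1)      # (Q, D + 1)
    # Sort each row by descending score and drop the weakest candidate.
    order = np.argsort(-cand_scores, axis=1, kind="stable")[:, :depth]
    return (np.take_along_axis(cand_scores, order, axis=1),
            np.take_along_axis(cand_ids, order, axis=1))

def select_views(frame_ids, k):
    """Rank frames by how many memory slots they own; return the top-k frame ids."""
    owned = frame_ids[frame_ids >= 0]
    ids, counts = np.unique(owned, return_counts=True)
    order = np.argsort(-counts, kind="stable")[:k]
    return ids[order].tolist()

# Toy stream: Q = 3 tokens, D = 2 slots per token, three frames.
scores, frame_ids = init_memory(num_tokens=3, depth=2)
stream = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.4], [0.3, 0.7, 0.6]]
for t, evidence in enumerate(stream):
    scores, frame_ids = update_memory(scores, frame_ids, evidence, t)

top_views = select_views(frame_ids, k=2)
```

Under these assumptions the two arrays keep shape (Q, D) however many frames are processed, which is where the fixed 2 × Q × D scalar count comes from (40,960 scalars for Q=4096, D=5).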
In practice
- Integrate Stream3D as a wrapper for existing 3D generators.
- Use cross-attention maps for view informativeness.
- Dynamically update memory with high-evidence frames.
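A rough sketch of the wrapper pattern follows. It is deliberately simplified and makes several assumptions that are not in the source: the frozen generator is modeled as a callable that takes the current frame plus a list of conditioning frames and returns an output together with a single scalar evidence value for the current frame (standing in for a statistic derived from cross-attention maps), and eviction is frame-level rather than token-level. `StreamingWrapper` and its interface are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical interface: (current frame, conditioning frames) -> (output, evidence).
Generator = Callable[[Any, List[Any]], Tuple[Any, float]]

class StreamingWrapper:
    """Wraps a frozen generator with a bounded, evidence-ranked frame cache."""

    def __init__(self, generator: Generator, capacity: int):
        self.generator = generator   # frozen: never retrained or modified
        self.capacity = capacity     # maximum number of cached frames
        self.cache: Dict[int, Tuple[float, Any]] = {}  # frame index -> (evidence, frame)
        self.t = 0

    def step(self, frame: Any) -> Any:
        # Condition on every cached frame, in temporal order.
        conditioning = [self.cache[i][1] for i in sorted(self.cache)]
        output, evidence = self.generator(frame, conditioning)
        # Insert the new frame, then evict the lowest-evidence frame if over capacity.
        self.cache[self.t] = (evidence, frame)
        if len(self.cache) > self.capacity:
            weakest = min(self.cache, key=lambda i: self.cache[i][0])
            del self.cache[weakest]
        self.t += 1
        return output

# Toy generator: evidence is the frame value, output is the conditioning count.
def fake_generator(frame, conditioning):
    return len(conditioning), float(frame)

wrapper = StreamingWrapper(fake_generator, capacity=2)
outputs = [wrapper.step(f) for f in [0.5, 0.9, 0.1, 0.7]]
```

The cache never holds more than `capacity` frames, so memory stays bounded as the stream grows. A token-level memory, as the summary describes, would replace the single scalar with per-token scores.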
Topics
- Streaming 3D Generation
- Evidential Memory
- View-Conditioned Generators
- Temporal Consistency
- Cross-Attention Mechanisms
- SAM 3D
Best for: AI Scientist, Computer Vision Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.