Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision · Depth: Expert, extended

Summary

Stream3D is a training-free mechanism that turns frozen view-conditioned 3D generators, such as SAM 3D, into streaming generators that can process long monocular video streams. It targets the temporal inconsistency that arises when a single-view generator is applied naively to each frame in sequence. Stream3D maintains a compact "Adaptive Evidential Memory" that selectively caches the most informative historical frames according to a proposed evidence score. The memory is updated dynamically but retains a fixed number of frames, so its footprint does not grow with sequence length (e.g., approximately 65 KB for SAM 3D with Q=4096, D=5). No retraining or architectural modification of the underlying 3D generator is required. On the GSO and NAVI benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, on photometric and geometric metrics, with improved long-range consistency.

Key takeaway

For computer vision engineers building real-time 3D reconstruction or generation systems on continuous video feeds, Stream3D addresses both temporal inconsistency and memory scaling. Consider wrapping your existing view-conditioned 3D generator with this training-free mechanism to gain long-range consistency and bounded memory usage. Because it selects evidence rather than transporting latent state, it avoids the pitfalls of latent-transport approaches and requires no retraining or changes to your core models.

Key insights

Stream3D enables consistent 3D generation from long video streams by selectively caching evidential views, avoiding latent state transport.

Principles

Prioritize evidence selection over latent state transport.
Maintain constant memory footprint with stream length.
Token-level evidence improves long-range consistency.

Method

Stream3D computes token-wise evidence scores from cross-attention maps, updates an Adaptive Evidential Memory (fixed capacity 2 × Q × D scalars), and uses token-ownership counts to select top-K conditioning views for Evidence-Based Multi-Generation.
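
The bookkeeping described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the 2 × Q × D scalars are an (evidence score, frame index) pair for each of D slots per token, that a new frame replaces a token's weakest slot when its score is higher, and that a frame's ownership count is the number of slots it currently holds. The class and method names (EvidentialMemory, update, top_k_views) are hypothetical.

```python
from collections import Counter


class EvidentialMemory:
    """Fixed-capacity memory: D (score, frame index) slots for each of Q tokens."""

    def __init__(self, num_tokens, depth):
        self.num_tokens = num_tokens  # Q
        self.depth = depth            # D
        # Empty slots hold -inf so that any real score replaces them.
        self.scores = [[float("-inf")] * depth for _ in range(num_tokens)]
        self.frames = [[-1] * depth for _ in range(num_tokens)]

    def num_scalars(self):
        # Capacity depends only on Q and D, never on the stream length.
        return 2 * self.num_tokens * self.depth

    def update(self, frame_id, evidence):
        """Offer one frame; `evidence` holds one score per token."""
        if len(evidence) != self.num_tokens:
            raise ValueError("expected one evidence score per token")
        for q, score in enumerate(evidence):
            row = self.scores[q]
            weakest = min(range(self.depth), key=row.__getitem__)
            # Assumed rule: evict the weakest slot if the new score beats it.
            if score > row[weakest]:
                row[weakest] = score
                self.frames[q][weakest] = frame_id

    def top_k_views(self, k):
        """Rank frames by how many token slots they own; ties go to older frames."""
        counts = Counter(f for row in self.frames for f in row if f >= 0)
        ranked = sorted(counts, key=lambda f: (-counts[f], f))
        return ranked[:k]


# Toy stream: 4 frames, Q=3 tokens, D=2 slots per token.
memory = EvidentialMemory(num_tokens=3, depth=2)
stream = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.4], [0.3, 0.7, 0.6], [0.1, 0.05, 0.1]]
for frame_id, evidence in enumerate(stream):
    memory.update(frame_id, evidence)

print(memory.top_k_views(2))  # frames that would condition the next generation
print(memory.num_scalars())
```

With Q=4096 and D=5, num_scalars() gives 2 × 4096 × 5 = 40,960 scalars regardless of how many frames have been offered. The size in bytes then depends on the scalar precision, which the summary does not state.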

In practice

Integrate Stream3D as a wrapper for existing 3D generators.
Use cross-attention maps for view informativeness.
Dynamically update memory with high-evidence frames.
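
To make the second point concrete, here is one way a per-token informativeness score might be derived from a cross-attention map. The summary does not give Stream3D's actual evidence formula, so this is a stand-in: it assumes each query token's attention row is a probability distribution over image tokens and scores it by how concentrated it is (one minus normalized entropy). The function name token_evidence is hypothetical.

```python
import math


def token_evidence(attn):
    """Return one score in [0, 1] per query token.

    attn: list of Q rows, each a probability distribution over image tokens.
    A sharply peaked row scores near 1; a uniform row scores near 0.
    This concentration proxy is an assumption, not the paper's definition.
    """
    scores = []
    for row in attn:
        n = len(row)
        if n < 2:
            # A single image token leaves no uncertainty to measure.
            scores.append(1.0)
            continue
        entropy = -sum(p * math.log(p) for p in row if p > 0.0)
        scores.append(1.0 - entropy / math.log(n))
    return scores


# Two query tokens attending over four image tokens.
attn = [
    [1.0, 0.0, 0.0, 0.0],      # fully concentrated attention
    [0.25, 0.25, 0.25, 0.25],  # uniform attention
]
print(token_evidence(attn))
```

Scores of this kind, computed once per incoming frame, are the sort of per-token input a memory update would consume. Any other scalar summary of the attention map could be substituted without changing the surrounding logic.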

Topics

Streaming 3D Generation
Evidential Memory
View-Conditioned Generators
Temporal Consistency
Cross-Attention Mechanisms
SAM 3D

Best for: AI Scientist, Computer Vision Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.