LongStream: Long-Sequence Streaming Autoregressive Visual Geometry
Summary
LongStream is a gauge-decoupled streaming visual geometry model designed for stable, metric-scale 3D scene reconstruction over thousands of frames. Existing autoregressive models struggle with long sequences because they anchor every pose to the first frame, which causes attention decay, scale drift, and extrapolation errors. LongStream instead predicts keyframe-relative poses, turning long-range extrapolation into a local task of constant difficulty. It also uses orthogonal scale learning to disentangle geometry from scale estimation, which suppresses drift. Finally, the model addresses two Transformer cache problems, attention-sink reliance and KV-cache contamination, through cache-consistent training and periodic cache refresh, which prevent attention degradation and reduce the training-inference gap. LongStream achieves state-of-the-art performance, reconstructing kilometer-scale sequences at 18 FPS.
Key takeaway
For research scientists developing streaming 3D reconstruction systems, LongStream's architectural innovations offer a robust solution to long-sequence instability. You should consider adopting keyframe-relative pose prediction and orthogonal scale learning to mitigate drift, alongside cache-consistent training and periodic cache refresh for Transformer-based models to achieve stable, metric-scale performance over extended sequences.
Key insights
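LongStream enables stable, metric-scale 3D reconstruction over ultra-long sequences by decoupling the gauge (the choice of reference frame and scale) from geometry prediction and by stabilizing the Transformer KV cache through cache-consistent training and periodic refresh.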
Principles
- Anchor poses to keyframes, not the first frame.
- Disentangle geometry from scale estimation.
- Refresh Transformer KV-caches periodically.
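The third principle can be illustrated with a toy schedule. The sketch below is not LongStream's implementation: `RefreshingCache`, `refresh_every`, and `max_len` are hypothetical names, the cache stores frame ids rather than key/value tensors, and the refresh policy (clear everything, restart from the current frame) is an assumption, since the summary does not say how the cache is rebuilt.

```python
from collections import deque


class RefreshingCache:
    """Toy stand-in for a KV cache that is cleared on a fixed schedule.

    Hypothetical illustration only: real caches hold key/value tensors,
    and the actual refresh policy may differ.
    """

    def __init__(self, refresh_every, max_len):
        self.refresh_every = refresh_every
        self.entries = deque(maxlen=max_len)  # oldest entries fall off when full
        self.frames_since_refresh = 0

    def step(self, frame_id):
        """Add one frame and return the entries visible to attention."""
        if self.frames_since_refresh >= self.refresh_every:
            # Periodic refresh: drop accumulated entries so stale state
            # cannot keep contaminating later frames.
            self.entries.clear()
            self.frames_since_refresh = 0
        self.entries.append(frame_id)
        self.frames_since_refresh += 1
        return list(self.entries)


cache = RefreshingCache(refresh_every=4, max_len=8)
history = [cache.step(i) for i in range(10)]
longest = max(len(h) for h in history)  # never exceeds refresh_every
```

With this schedule the context a frame attends to never spans more than `refresh_every` frames, so its length stays bounded however long the stream runs.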
Method
LongStream predicts keyframe-relative poses, employs orthogonal scale learning, and uses cache-consistent training with periodic cache refresh to manage Transformer attention over long sequences.
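To see why keyframe-relative poses make the task roughly constant in difficulty, consider the size of the regression target. The following is a minimal NumPy sketch, not the paper's code: the straight-line trajectory, the fixed keyframe interval `keyframe_every`, and the helper names are all assumptions chosen for illustration.

```python
import numpy as np


def make_pose(yaw_rad, translation):
    """Build a 4x4 camera-to-world pose from a yaw angle and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T


def relative_to(T_world_ref, T_world_frame):
    """Pose of a frame expressed in the coordinates of a reference frame."""
    return np.linalg.inv(T_world_ref) @ T_world_frame


def compose(T_world_ref, T_ref_frame):
    """Recover the world pose from a reference pose and a relative pose."""
    return T_world_ref @ T_ref_frame


# Assumed toy trajectory: 1 unit forward per frame, keyframe every 10 frames.
poses = [make_pose(0.0, [float(i), 0.0, 0.0]) for i in range(1000)]
keyframe_every = 10

first_frame_dist, keyframe_dist = [], []
for i, T in enumerate(poses):
    key = poses[(i // keyframe_every) * keyframe_every]
    first_frame_dist.append(np.linalg.norm(relative_to(poses[0], T)[:3, 3]))
    keyframe_dist.append(np.linalg.norm(relative_to(key, T)[:3, 3]))

max_first = max(first_frame_dist)  # grows with sequence length
max_key = max(keyframe_dist)       # bounded by the keyframe interval
```

Under first-frame anchoring the target translation grows with the length of the sequence, while the keyframe-relative target stays within one keyframe interval. World poses remain recoverable by chaining `compose` through the keyframes.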
In practice
- Use keyframe-relative pose prediction for long sequences.
- Implement orthogonal scale learning to reduce drift.
- Apply cache-consistent training for Transformer stability.
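The summary does not spell out how orthogonal scale learning is formulated, so the sketch below only illustrates the general idea of separating scale from geometry. It assumes a simple decomposition of a point set into a unit-scale shape and a single scalar; `split_scale` is a hypothetical helper, not part of LongStream.

```python
import numpy as np


def split_scale(points):
    """Split an (N, 3) point set into a unit-scale shape and a scalar scale.

    Illustrative assumption: scale is the RMS distance to the centroid.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    scale = float(np.sqrt((centered ** 2).sum(axis=1).mean()))
    shape = centered / scale
    return shape, scale, centroid


rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))

shape_a, scale_a, _ = split_scale(points)
shape_b, scale_b, _ = split_scale(3.0 * points)  # same scene, 3x larger
```

Here `shape_a` and `shape_b` agree while only the scalar changes, so an error in the scale estimate cannot distort the normalized geometry. That is the kind of separation the second bullet refers to.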
Topics
- 3D Reconstruction
- Streaming Visual Geometry
- Autoregressive Models
- Transformer Architectures
- LongStream
Best for: Research Scientist, AI Researcher, AI Scientist, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.