LongStream: Long-Sequence Streaming Autoregressive Visual Geometry

2026-02-13 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

LongStream is a gauge-decoupled streaming visual geometry model for stable, metric-scale 3D scene reconstruction over thousands of frames. Existing autoregressive models anchor every pose to the first frame, which causes attention decay, scale drift, and extrapolation errors on long sequences. LongStream instead predicts keyframe-relative poses, turning long-range extrapolation into a local task of constant difficulty. It also uses orthogonal scale learning to disentangle geometry from scale estimation, which suppresses drift. Finally, it addresses two Transformer cache problems, attention-sink reliance and KV-cache contamination, through cache-consistent training and periodic cache refresh; together these prevent attention degradation and narrow the training-inference gap. LongStream achieves state-of-the-art performance, reconstructing kilometer-scale sequences at 18 FPS.

Key takeaway

For research scientists building streaming 3D reconstruction systems, LongStream offers concrete remedies for long-sequence instability. Consider adopting keyframe-relative pose prediction and orthogonal scale learning to mitigate drift, and, for Transformer-based models, cache-consistent training with periodic cache refresh to sustain stable, metric-scale performance over extended sequences.

Key insights

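LongStream enables stable, metric-scale 3D reconstruction of ultra-long sequences by decoupling the gauge (the choice of pose reference and scale) from geometry prediction and by keeping the Transformer KV-cache consistent between training and inference.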

Principles

Anchor poses to keyframes, not the first frame.
Disentangle geometry from scale estimation.
Refresh Transformer KV-caches periodically.
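
The first principle can be illustrated with a minimal sketch of how keyframe-relative poses might be chained back into a single world frame. This is not the paper's implementation: the function names (`pose`, `chain_poses`), the 4x4 homogeneous-matrix convention, and the externally supplied keyframe flags are all assumptions made for illustration.

```python
import math

def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def pose(yaw, tx, ty, tz):
    """Hypothetical helper: rigid transform with a rotation about z and a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def chain_poses(relative_poses, keyframe_flags):
    """Turn keyframe-relative poses into world poses.

    relative_poses[i] is frame i expressed in the current keyframe's coordinates.
    keyframe_flags[i] is True if frame i becomes the keyframe for later frames.
    The first keyframe is assumed to coincide with the world origin.
    """
    world_from_keyframe = pose(0.0, 0.0, 0.0, 0.0)
    world_poses = []
    for rel, is_key in zip(relative_poses, keyframe_flags):
        world_from_frame = matmul4(world_from_keyframe, rel)
        world_poses.append(world_from_frame)
        if is_key:
            # Later predictions are expressed relative to this frame.
            world_from_keyframe = world_from_frame
    return world_poses
```

In this sketch each predicted transform only spans the distance to the most recent keyframe, so its magnitude stays bounded no matter how far the camera has travelled from the first frame; the global trajectory is recovered by composition rather than by direct prediction.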

Method

LongStream predicts keyframe-relative poses, employs orthogonal scale learning, and uses cache-consistent training with periodic cache refresh to manage Transformer attention over long sequences.
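
As a generic illustration of what separating geometry from scale can mean, the sketch below splits a point set into a unit-scale shape and a single scalar. The summary does not describe how LongStream's orthogonal scale learning is formulated, so the RMS-distance normalisation and the name `split_scale` are assumptions, not the authors' method.

```python
import math

def split_scale(points):
    """Split 3D points into (unit-scale shape, scalar scale).

    Scale is taken to be the RMS distance of the points from the origin,
    which is an illustrative choice rather than anything from the paper.
    """
    if not points:
        raise ValueError("need at least one point")
    mean_sq = sum(x * x + y * y + z * z for x, y, z in points) / len(points)
    scale = math.sqrt(mean_sq)
    if scale == 0.0:
        raise ValueError("all points are at the origin; scale is undefined")
    shape = [(x / scale, y / scale, z / scale) for x, y, z in points]
    return shape, scale

# Rescaling the scene changes only the scalar, not the normalised shape.
points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
shape_a, scale_a = split_scale(points)
shape_b, scale_b = split_scale([(5 * x, 5 * y, 5 * z) for x, y, z in points])
```

With a factorisation of this kind, an error in the scalar cannot distort the shape term and vice versa, which is the general intuition behind treating the two as separate prediction targets.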

In practice

Use keyframe-relative pose prediction for long sequences.
Implement orthogonal scale learning to reduce drift.
Apply cache-consistent training for Transformer stability.
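
Periodic cache refresh can be pictured with a toy cache that is trimmed to a short recent window at a fixed interval. The summary does not say what LongStream retains or recomputes on refresh, so the class name `RefreshingKVCache`, the parameters `refresh_every` and `keep_recent`, and the keep-the-most-recent-frames policy are all hypothetical.

```python
class RefreshingKVCache:
    """Toy stand-in for a Transformer KV-cache with periodic refresh.

    Entries are frame ids rather than key/value tensors. Every
    `refresh_every` appended frames, the cache is cut back to the
    `keep_recent` most recent entries (an assumed policy).
    """

    def __init__(self, refresh_every, keep_recent):
        if not 1 <= keep_recent <= refresh_every:
            raise ValueError("require 1 <= keep_recent <= refresh_every")
        self.refresh_every = refresh_every
        self.keep_recent = keep_recent
        self.entries = []
        self.frames_since_refresh = 0

    def append(self, frame_id):
        self.entries.append(frame_id)
        self.frames_since_refresh += 1
        if self.frames_since_refresh >= self.refresh_every:
            self.refresh()

    def refresh(self):
        # Drop older entries. A real model would presumably also re-encode
        # the kept frames so their cached keys and values stay consistent.
        self.entries = self.entries[-self.keep_recent:]
        self.frames_since_refresh = 0

cache = RefreshingKVCache(refresh_every=4, keep_recent=2)
sizes = []
for frame_id in range(10):
    cache.append(frame_id)
    sizes.append(len(cache.entries))
```

Under this policy the cache size after each step never exceeds `keep_recent + refresh_every - 1`, so memory and attention span stay bounded regardless of sequence length. Cache-consistent training would then mean exposing the model to the same refresh schedule during training that it sees at inference.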

Topics

3D Reconstruction
Streaming Visual Geometry
Autoregressive Models
Transformer Architectures
LongStream

Best for: Research Scientist, AI Researcher, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.