S23DR 2026 Winning Solution
Summary
The winning solution for the S23DR 2026 challenge addresses structured 3D wireframe reconstruction from sparse Structure-from-Motion (SfM), fitted depth, and semantic segmentations. This method treats vertices as a conditional set, denoising 64 vertex tokens using a flow-matching DiT conditioned on Perceiver-style scene tokens. The system operates in two stages: a global pass predicts a coarse structure, followed by a hull-cropped second pass that refines it. A multi-sample consensus step further stabilizes stochastic predictions. This approach achieved first place on the private leaderboard with a Hybrid Structure Score (HSS) of 0.654, surpassing the second-place entry (0.648) and significantly outperforming learned (0.474) and handcrafted (0.391) baselines. It also recorded the highest vertex F1 score of 0.791. The model was trained on the HoHo dataset using 4 H200 GPUs.
Key takeaway
Computer vision engineers building 3D reconstruction systems from sparse data should consider a multi-stage, conditional set generation approach. The method that won the S23DR 2026 challenge refines a coarse global prediction with a localized, hull-cropped second pass, reaching an HSS of 0.654 on the private leaderboard. Multi-sample (ensemble) inference stabilizes the stochastic predictions, which matters most when inputs such as sparse SfM and depth maps are noisy.
Key insights
A two-stage flow-matching DiT with scene conditioning effectively reconstructs 3D wireframes from noisy, sparse inputs.
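For orientation, one common form of the flow-matching objective is written out here, using a straight-line path between noise and data. The summary does not say which variant the authors train with, so this is an assumption. The symbols are notation introduced for illustration: x_1 is a ground-truth set of 64 vertex tokens, x_0 is Gaussian noise, c stands for the scene tokens, and v_θ is the DiT denoiser.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, x_1,\, t}\,
  \bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|^2
```

Under this formulation, sampling integrates dx/dt = v_θ(x, t, c) from t = 0 (noise) to t = 1 (a vertex set).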
Principles
- Conditional set generation outperforms 2D-to-3D lifting.
- Multi-stage refinement improves metric accuracy.
- Ensemble consensus reduces stochastic variance.
Method
The method uses a Perceiver-style scene encoder and a DiT denoiser in two stages: global coarse prediction, then hull-cropped refinement. Ensemble inference aggregates 16 stochastic trajectories.
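The sampling and ensemble steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_velocity` is a hypothetical stand-in for the conditioned DiT, Euler integration is one possible solver, and the radius-based clustering in `consensus_vertices` is an assumed aggregation rule, since the summary does not describe how the 16 trajectories are combined.

```python
import numpy as np

NUM_TOKENS = 64   # the summary states 64 vertex tokens
NUM_SAMPLES = 16  # the summary states 16 stochastic trajectories

def euler_sample(velocity_fn, x0, num_steps=20):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

def toy_velocity(target):
    """Hypothetical stand-in for the DiT denoiser (assumption).

    On the straight path x_t = (1 - t) * x0 + t * target, the velocity that
    carries x_t to target is (target - x_t) / (1 - t).
    """
    def v(x, t):
        return (target - x) / (1.0 - t)
    return v

def consensus_vertices(samples, radius=1.0, min_support=0.5):
    """Pool vertices from all samples and keep clusters most samples agree on.

    Assumed rule: greedily seed a cluster at the densest unassigned point,
    gather everything within `radius`, and keep the cluster mean only if at
    least `min_support` of the samples contributed a vertex to it.
    """
    pts = np.concatenate(samples, axis=0)
    sample_id = np.concatenate(
        [np.full(len(s), k) for k, s in enumerate(samples)]
    )
    unassigned = np.ones(len(pts), dtype=bool)
    kept = []
    while unassigned.any():
        idx = np.flatnonzero(unassigned)
        dist = np.linalg.norm(pts[idx, None, :] - pts[None, idx, :], axis=-1)
        near = dist <= radius
        seed = near.sum(axis=1).argmax()
        members = idx[near[seed]]
        support = len(np.unique(sample_id[members])) / len(samples)
        if support >= min_support:
            kept.append(pts[members].mean(axis=0))
        unassigned[members] = False
    return np.array(kept).reshape(-1, 3)

# Toy usage: 64 target vertices on a 4 x 4 x 4 grid with 5-unit spacing.
rng = np.random.default_rng(0)
grid = np.arange(4) * 5.0
target = np.stack(
    np.meshgrid(grid, grid, grid, indexing="ij"), axis=-1
).reshape(-1, 3)

samples = []
for _ in range(NUM_SAMPLES):
    # Jitter mimics the sample-to-sample variation of a real stochastic model.
    jittered = target + rng.normal(scale=0.05, size=target.shape)
    x0 = rng.normal(size=(NUM_TOKENS, 3))  # each trajectory starts from noise
    samples.append(euler_sample(toy_velocity(jittered), x0))

vertices = consensus_vertices(samples, radius=1.0)
print(vertices.shape)
```

Averaging within clusters is what reduces the variance here. The support threshold additionally discards vertices that only a few trajectories produce.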
In practice
- Enrich sparse point clouds with semantic features.
- Use convex hull cropping for localized refinement.
- Apply multi-sample consensus for robust inference.
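The first two practices above might look like the sketch below. Several details are assumptions made for illustration, because the summary does not specify them: semantic classes are appended to each point as a one-hot vector, per-point class ids are taken as already available, and the crop uses the convex hull of the coarse vertices projected onto the XY plane, widened by a margin. The function names `enrich_points`, `convex_hull_2d`, and `crop_to_hull` are hypothetical.

```python
import numpy as np

def enrich_points(xyz, class_ids, num_classes):
    """Append a one-hot semantic class vector to each 3D point."""
    one_hot = np.eye(num_classes)[class_ids]
    return np.concatenate([xyz, one_hot], axis=1)

def convex_hull_2d(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(map(tuple, points)))

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return np.array(half_hull(pts) + half_hull(reversed(pts)), dtype=float)

def crop_to_hull(points, coarse_vertices, margin=1.0):
    """Keep points whose XY position lies within `margin` of the hull.

    `points` may carry extra feature columns; only the first two are used
    for the test. Corners are mitered, so the kept region is slightly
    larger than a true offset of the hull.
    """
    hull = convex_hull_2d(coarse_vertices[:, :2])
    if len(hull) < 3:
        raise ValueError("need at least three non-collinear coarse vertices")
    start = hull
    edge = np.roll(hull, -1, axis=0) - start
    # For a counter-clockwise polygon the interior is to the left of each edge.
    normal = np.stack([-edge[:, 1], edge[:, 0]], axis=1)
    normal /= np.linalg.norm(normal, axis=1, keepdims=True)
    rel = points[:, None, :2] - start[None, :, :]
    signed_dist = (rel * normal[None]).sum(axis=-1)  # positive inside
    mask = (signed_dist >= -margin).all(axis=1)
    return points[mask], mask

# Toy usage: a square coarse footprint and four scene points.
coarse = np.array(
    [[0, 0, 3], [4, 0, 3], [4, 4, 3], [0, 4, 3], [2, 2, 5]], dtype=float
)
cloud = np.array(
    [[2, 2, 1], [4.5, 2, 0], [6, 2, 0], [-2, -2, 0]], dtype=float
)
labels = np.array([0, 1, 1, 2])  # hypothetical per-point class ids

features = enrich_points(cloud, labels, num_classes=3)
cropped, mask = crop_to_hull(features, coarse, margin=1.0)
print(mask)
```

The margin keeps points just outside the coarse hull, so the second pass can still recover structure the first pass underestimated.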
Topics
- 3D Wireframe Reconstruction
- S23DR Challenge
- Flow Matching
- Diffusion Transformers
- Perceiver Architecture
- Structure-from-Motion
- Semantic Segmentation
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.