Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects
Summary
Every9D-21M is a new large-scale dataset for 9D pose estimation of everyday objects from single real-world images, a task previously held back by insufficient supervision. The dataset comprises 21.8 million real-world images drawn from 109,000 object-centric videos and covers 700 distinct everyday object categories. That is an increase of two orders of magnitude over previous real-world 9D pose benchmarks in both image count and category count. To build it, the authors reconstruct object-level point clouds using multi-view geometry and align similar instances into a shared canonical coordinate frame. Canonical poses are manually annotated for less than 0.01% of images, then propagated to the remaining instances and verified. The dataset also includes cross-category orientation rules for symmetry-aware evaluation. Models trained on Every9D-21M perform significantly better on ImageNet3D and PASCAL3D+, and generalize better to HANDAL than models trained on ImageNet3D.
Key takeaway
If you are an AI Scientist developing 9D pose estimation models, consider Every9D-21M as a foundational training resource. Its scale of 21.8 million real-world images across 700 categories can offer better generalization than smaller, synthetic datasets. Training on it can significantly improve your model's performance on real-world benchmarks such as ImageNet3D and PASCAL3D+, and improve generalization to novel environments such as HANDAL. Explore the provided data and code to accelerate your research.
Key insights
Large-scale, real-world 9D pose datasets can be built efficiently by propagating sparse manual annotations across object-centric videos.
Principles
- Multi-view geometry enables object-level point cloud reconstruction.
- Cross-instance alignment propagates canonical poses effectively.
- Cross-category orientation rules enable symmetry-aware evaluation.
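The cross-instance alignment principle can be illustrated with a small sketch. The summary does not say which alignment algorithm the authors use, so the code below swaps in a standard technique, Umeyama similarity alignment, and assumes point correspondences between two instances are already known. The function name `umeyama_alignment` and its inputs are hypothetical, chosen only for illustration.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    such that dst ~= s * R @ src + t.

    Assumption: src and dst are (N, 3) arrays of corresponding points.
    This is a generic stand-in, not the dataset authors' alignment method.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / src.shape[0]
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0

    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

In a real pipeline the correspondences would have to come from a separate matching step, and a robust estimator would likely be needed to handle outliers; both are outside the scope of this sketch.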
Method
Reconstruct point clouds from object-centric videos, align instances to a shared canonical frame, manually annotate a small reference set (less than 0.01% of images), propagate the canonical poses to the remaining instances, and verify the results.
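The propagation step can be pictured as composing transforms. The sketch below assumes a 9D pose is parameterized as rotation (3), translation (3), and per-axis scale (3), which the summary does not spell out, and that each alignment is available as a 4x4 similarity matrix. All function and variable names are hypothetical.

```python
import numpy as np

def make_similarity(scale, R, t):
    """Build a 4x4 similarity transform x -> scale * R @ x + t."""
    T = np.eye(4)
    T[:3, :3] = scale * np.asarray(R, dtype=float)
    T[:3, 3] = np.asarray(t, dtype=float)
    return T

def propagate_canonical_pose(T_canon_to_ref, A_inst_to_ref):
    """Carry an annotated canonical pose from a reference instance to another one.

    T_canon_to_ref: annotated transform, canonical frame -> reference reconstruction.
    A_inst_to_ref:  alignment, new instance reconstruction -> reference reconstruction.
    Returns the transform canonical frame -> new instance reconstruction.
    """
    return np.linalg.inv(A_inst_to_ref) @ T_canon_to_ref

def pose_in_camera(T_canon_to_inst, E_world_to_cam):
    """Express the canonical pose in one video frame's camera coordinates,
    given that frame's world-to-camera extrinsic from the reconstruction."""
    return E_world_to_cam @ T_canon_to_inst

def decompose_pose(T):
    """Split a 4x4 transform into rotation, translation, and per-axis scale.
    Assumes the 3x3 block has the form R @ diag(scale) with positive scales."""
    M = T[:3, :3]
    scale = np.linalg.norm(M, axis=0)  # column norms give the per-axis scale
    R = M / scale                      # divide each column by its scale
    return R, T[:3, 3].copy(), scale
```

Under these assumptions, one annotation per reference instance yields a pose for every frame of every aligned video, which is consistent with the very small annotated fraction reported above. The verification step is not modeled here.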
In practice
- Use object-centric videos for scalable data generation.
- Implement cross-instance alignment for pose propagation.
- Apply cross-category rules for symmetry-aware evaluation.
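One common way to make rotation evaluation symmetry-aware is to take the smallest geodesic error over all rotations that leave the object unchanged. The sketch below illustrates that idea only; the symmetry sets shown are hypothetical examples and are not the dataset's actual cross-category orientation rules.

```python
import numpy as np

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def geodesic_angle(Ra, Rb):
    """Angle in radians of the relative rotation between Ra and Rb."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def symmetry_aware_error(R_pred, R_gt, symmetries):
    """Minimum geodesic error over ground-truth rotations that are equivalent
    under the object's symmetries, expressed in the canonical frame.

    Assumption: rotations map canonical coordinates to camera coordinates, so
    an equivalent ground truth is R_gt @ g for each symmetry rotation g.
    """
    return min(geodesic_angle(R_pred, R_gt @ g) for g in symmetries)

# Hypothetical symmetry sets: none, and a 2-fold symmetry about the z axis.
NO_SYMMETRY = [np.eye(3)]
TWO_FOLD_Z = [np.eye(3), rot_z(np.pi)]
```

With `TWO_FOLD_Z`, a prediction that is rotated half a turn about the z axis relative to the ground truth is scored as essentially correct, whereas with `NO_SYMMETRY` it receives the maximum error.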
Topics
- 9D Pose Estimation
- Object Canonicalization
- Large-Scale Datasets
- Multi-View Geometry
- Real-World Objects
- Foundation Models
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.