Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

2026-05-27 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, medium

Summary

Every9D-21M is a large-scale dataset for 9D pose estimation of everyday objects from single real-world images, a task previously held back by insufficient supervision. It comprises 21.8 million real-world images derived from 109,000 object-centric videos, covering 700 everyday object categories. This is roughly two orders of magnitude more images and categories than previous real-world 9D pose benchmarks. To build it, the authors reconstruct object-level point clouds using multi-view geometry and align similar instances into a shared canonical coordinate frame. Canonical poses are manually annotated for less than 0.01% of images, then propagated to the remaining instances and verified. The dataset also includes cross-category orientation rules for symmetry-aware evaluation. Training on Every9D-21M significantly improves performance on ImageNet3D and PASCAL3D+, and generalizes better to HANDAL than training on ImageNet3D.
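As a quick sanity check on the scale figures quoted above, the arithmetic can be worked through directly. This is only a back-of-the-envelope sketch: the averages assume images and videos are spread evenly, which the summary does not state, and the variable names are illustrative.

```python
# Figures quoted in the summary; everything derived below is an average,
# not a per-video or per-category guarantee.
NUM_IMAGES = 21_800_000
NUM_VIDEOS = 109_000
NUM_CATEGORIES = 700

# Average number of images contributed by each object-centric video.
images_per_video = NUM_IMAGES / NUM_VIDEOS

# Average number of videos available per object category.
videos_per_category = NUM_VIDEOS / NUM_CATEGORIES

# "Less than 0.01%" of images manually annotated: 0.01% is 1 in 10,000,
# so this is an upper bound on the number of hand-labeled images.
max_manual_images = NUM_IMAGES // 10_000

print(f"images per video (avg):    {images_per_video:.0f}")
print(f"videos per category (avg): {videos_per_category:.1f}")
print(f"manually annotated images: fewer than {max_manual_images}")
```

Under these figures, fewer than about 2,200 hand-labeled images anchor the canonical poses for the entire dataset.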

Key takeaway

AI scientists developing 9D pose estimation models should consider Every9D-21M as a foundational training resource. Its 21.8 million real-world images across 700 categories put it roughly two orders of magnitude beyond previous real-world 9D pose benchmarks. According to the reported results, training on it improves performance on ImageNet3D and PASCAL3D+ and generalizes better to HANDAL than training on ImageNet3D. The data and code are referenced under GenIntel/Every9D.

Key insights

Large-scale, real-world 9D pose datasets can be built efficiently by propagating sparse manual annotations across object-centric videos.

Principles

Multi-view geometry enables object-level point cloud reconstruction.
Cross-instance alignment propagates canonical poses effectively.
Cross-category orientation rules enable symmetry-aware evaluation.
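The principles above all operate on a 9D pose. The summary does not spell out the parameterization, so the sketch below assumes the decomposition commonly used in this line of work: 3 degrees of freedom each for rotation, translation, and per-axis scale. The `Pose9D` class and its method names are hypothetical, not taken from the Every9D code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Pose9D:
    """Assumed 9D pose: rotation (3 DoF) + translation (3 DoF) + scale (3 DoF)."""

    rotation: np.ndarray     # (3, 3) rotation matrix, canonical -> camera
    translation: np.ndarray  # (3,) object position in the camera frame
    scale: np.ndarray        # (3,) per-axis object size

    def matrix(self) -> np.ndarray:
        """Homogeneous 4x4 transform: scale first, then rotate, then translate."""
        T = np.eye(4)
        T[:3, :3] = self.rotation @ np.diag(self.scale)
        T[:3, 3] = self.translation
        return T

    def apply(self, canonical_points: np.ndarray) -> np.ndarray:
        """Map (N, 3) points from the canonical frame into the camera frame."""
        return (canonical_points * self.scale) @ self.rotation.T + self.translation


# Toy example: 90-degree rotation about z, stretched 2x along canonical x.
rot_z_90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
pose = Pose9D(rot_z_90, np.array([0.0, 0.0, 1.0]), np.array([2.0, 1.0, 1.0]))
mapped = pose.apply(np.array([[1.0, 0.0, 0.0]]))
print(mapped)
```

A shared canonical frame means that every instance of a category gets a pose of this form relative to the same reference axes, which is what makes the poses comparable across instances.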

Method

Reconstruct point clouds from object-centric videos, align instances to a shared canonical frame, manually annotate a small reference set, propagate poses, and verify.
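The align-and-propagate step can be illustrated with a minimal sketch. It is not the authors' pipeline: it swaps in the classic Umeyama closed-form similarity alignment and assumes point-to-point correspondences between the two clouds are already known, which a real system would have to estimate. All function names and the toy data are hypothetical.

```python
import numpy as np


def umeyama(src, dst):
    """Least-squares similarity transform (s, R, t) such that dst ~ s * R @ src + t.

    Requires row-wise corresponding (N, 3) point sets.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # avoid returning a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((xs ** 2).sum() / len(src))
    t = mu_d - s * R @ mu_s
    return s, R, t


def to_matrix(s, R, t):
    """Pack a similarity transform into a homogeneous 4x4 matrix."""
    T = np.eye(4)
    T[:3, :3] = s * R
    T[:3, 3] = t
    return T


def propagate_canonical(T_ref_to_canon, inst_pts, ref_pts):
    """Carry the reference annotation over to a new instance by composition."""
    T_inst_to_ref = to_matrix(*umeyama(inst_pts, ref_pts))
    return T_ref_to_canon @ T_inst_to_ref


# Toy data: the "instance" is the reference cloud, rescaled, rotated, and shifted.
rng = np.random.default_rng(0)
ref_pts = rng.normal(size=(50, 3))
c, s_ = np.cos(np.pi / 3), np.sin(np.pi / 3)
R_true = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
inst_pts = 0.5 * ref_pts @ R_true.T + np.array([1.0, -2.0, 0.3])

# Hypothetical manual annotation on the reference instance only.
T_ref_to_canon = to_matrix(2.0, np.eye(3), np.zeros(3))

T_inst_to_canon = propagate_canonical(T_ref_to_canon, inst_pts, ref_pts)
inst_h = np.c_[inst_pts, np.ones(len(inst_pts))]
canon_from_inst = (T_inst_to_canon @ inst_h.T).T[:, :3]
print(np.allclose(canon_from_inst, 2.0 * ref_pts))
```

The point of the composition is that only the reference instance needs a human-provided canonical transform; every aligned instance inherits it, and a verification pass can then reject alignments that went wrong.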

In practice

Use object-centric videos for scalable data generation.
Implement cross-instance alignment for pose propagation.
Apply cross-category rules for symmetry-aware evaluation.
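Symmetry-aware evaluation can be sketched as taking the smallest rotation error over all poses that are indistinguishable under an object's symmetry. The summary does not describe the format of the dataset's orientation rules, so the discrete symmetry group below (4-fold about the canonical z axis) and the function names are assumptions for illustration.

```python
import numpy as np


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def geodesic_deg(Ra, Rb):
    """Angle in degrees of the relative rotation between Ra and Rb."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def symmetry_aware_error_deg(R_pred, R_gt, symmetries):
    """Minimum error over ground-truth poses equivalent under the symmetry group.

    Symmetries are expressed in the canonical (object) frame, so each
    equivalent ground truth is R_gt @ S.
    """
    return min(geodesic_deg(R_pred, R_gt @ S) for S in symmetries)


# Hypothetical rule: a category with 4-fold symmetry about canonical z.
four_fold = [rot_z(k * np.pi / 2) for k in range(4)]

R_gt = rot_z(0.2)
R_pred = rot_z(0.2 + np.pi / 2)  # off by exactly one symmetry step

plain_error = geodesic_deg(R_pred, R_gt)
sym_error = symmetry_aware_error_deg(R_pred, R_gt, four_fold)
print(f"plain: {plain_error:.1f} deg, symmetry-aware: {sym_error:.1f} deg")
```

Without the symmetry set, this prediction is penalized by a quarter turn even though the object looks identical in both orientations; with it, the error collapses to approximately zero.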

Topics

9D Pose Estimation
Object Canonicalization
Large-Scale Datasets
Multi-View Geometry
Real-World Objects
Foundation Models

Code references

GenIntel/Every9D

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.