Are Object-Centric Representations Better At Compositional Generalization?

2026-02-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

A new Visual Question Answering benchmark, spanning the CLEVRTex, Super-CLEVR, and MOVi-C visual worlds, evaluates how well vision encoders generalize compositionally, that is, to novel combinations of concepts. The study compares DINOv2 and SigLIP2 with their object-centric (OC) counterparts while controlling for training data diversity, sample size, representation size, downstream model capacity, and compute. OC approaches come out ahead in the harder compositional generalization settings. The original dense representations outperform OC models only in the easier settings, and they need significantly more downstream compute to do so. OC models are also more sample efficient: they generalize robustly from fewer images, whereas dense encoders match or exceed OC performance only when given ample data and diversity. Overall, OC representations provide stronger compositional generalization whenever dataset size, training data diversity, or downstream compute is constrained.

Key takeaway

Computer vision engineers building models for visually rich environments should prioritize object-centric representations, especially when data size, data diversity, or compute is constrained. These models offer stronger compositional generalization in challenging scenarios and are more sample efficient, reaching robust performance with fewer images. The result is a model that adapts better to novel combinations of concepts.

Key insights

Object-centric representations enhance compositional generalization, especially under data or compute constraints.

Principles

OC models generalize better in harder settings.
OC models are more sample efficient.
Dense representations need more data and compute.

Method

A Visual Question Answering benchmark across CLEVRTex, Super-CLEVR, and MOVi-C measures compositional generalization, comparing DINOv2/SigLIP2 with their object-centric counterparts under controlled conditions.
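
To make "compositional generalization" concrete, here is a minimal sketch of a held-out-combination split. It is an illustration only, not the benchmark's actual protocol: the function name `compositional_split`, the attribute lists, and the hold-out fraction are all hypothetical. The idea is that every individual attribute value appears in training, but some combinations are reserved for testing.

```python
import itertools
import random


def compositional_split(shapes, materials, holdout_fraction=0.25, seed=0):
    """Split (shape, material) pairs into train and test combinations.

    Test pairs never occur in training, yet every individual shape and
    every individual material still occurs in at least one training pair.
    Hypothetical helper for illustration; not taken from the study.
    """
    rng = random.Random(seed)
    combos = list(itertools.product(shapes, materials))
    rng.shuffle(combos)
    n_test = int(len(combos) * holdout_fraction)

    train, test = list(combos), []
    for combo in combos:
        if len(test) >= n_test:
            break
        remaining = [c for c in train if c != combo]
        # Only hold out a pair if each attribute value stays covered in train.
        shapes_covered = {s for s, _ in remaining} == set(shapes)
        materials_covered = {m for _, m in remaining} == set(materials)
        if shapes_covered and materials_covered:
            train = remaining
            test.append(combo)
    return train, test


# Example attribute values (assumed for illustration only).
train_pairs, test_pairs = compositional_split(
    shapes=["cube", "sphere", "cylinder"],
    materials=["metal", "rubber", "wood", "brick"],
)
```

A model trained only on images containing `train_pairs` and evaluated on questions about `test_pairs` has to recombine familiar properties rather than recall combinations it has already seen, which is the ability the benchmark is designed to measure.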

In practice

Prioritize OC models for limited datasets.
Consider OC for compute-constrained environments.

Topics

Compositional Generalization
Object-Centric Representations
Visual Question Answering
Vision Encoders
DINOv2

Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.