Image Generators are Generalist Vision Learners
Summary
Vision Banana, a new generalist vision model, demonstrates that image generation training serves as powerful pretraining for visual understanding. It is built by instruction-tuning the Nano Banana Pro image generator on a mix of its original training data and a small amount of vision task data, and it reframes perception tasks as image generation: every task output is an RGB image. This approach achieves state-of-the-art results across a range of 2D and 3D vision tasks, surpassing or rivaling zero-shot domain specialists such as Segment Anything Model 3 on segmentation and Depth Anything 3 on metric depth estimation. Crucially, it largely retains the base model's image generation capabilities, with win rates of 53.5% on GenAI-Bench and 47.8% on ImgEdit. The work suggests a paradigm shift in which generative vision pretraining becomes central to building Foundational Vision Models for both generation and understanding.
Key takeaway
For AI scientists and machine learning engineers developing next-generation vision systems, this research indicates that generative pretraining followed by instruction-tuning offers a powerful, unified approach. Consider treating image generators like Nano Banana Pro as foundational models and building on their inherent understanding capabilities for tasks ranging from segmentation to 3D depth estimation. This approach can deliver state-of-the-art performance with a simpler architecture, but be mindful of the computational overhead, which currently limits widespread deployment.
Key insights
Instruction-tuning image generators unlocks powerful generalist visual understanding capabilities, mirroring LLM emergent behaviors.
Principles
- Generative pretraining establishes robust visual representations for diverse tasks.
- Image generation provides a universal interface for vision tasks via RGB output encoding.
- Lightweight instruction-tuning can activate latent understanding without degrading generation.
Method
Instruction-tune a base image generator on its original training data mixed with a small proportion of vision task data, and format each task output as an RGB image that can be decoded back into task values for quantitative evaluation.
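As a rough illustration of what a low-ratio data mix could look like, the sketch below draws training examples from two pools and labels where each one came from. The function name `sample_mixture`, the 5% `vision_ratio`, and the placeholder prompts are all assumptions made for illustration; the summary does not state the actual ratio or sampling scheme.

```python
import random

def sample_mixture(generation_data, vision_task_data, vision_ratio=0.05, n=1000, seed=0):
    """Draw n training examples, taking roughly `vision_ratio` of them from the
    vision task pool and the rest from the original generation pool.

    vision_ratio=0.05 is a hypothetical value, not one reported in the source.
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    mixture = []
    for _ in range(n):
        if rng.random() < vision_ratio:
            mixture.append(("vision_task", rng.choice(vision_task_data)))
        else:
            mixture.append(("generation", rng.choice(generation_data)))
    return mixture

# Placeholder prompts standing in for real (instruction, image) training pairs.
generation_data = ["a watercolor fox", "a city at dusk"]
vision_task_data = ["segment the road in this image", "estimate the depth of this image"]

mixture = sample_mixture(generation_data, vision_task_data)
vision_count = sum(1 for source, _ in mixture if source == "vision_task")
print(f"{vision_count} of {len(mixture)} examples are vision task data")
```

Sampling per example rather than concatenating the two datasets keeps the ratio stable no matter how large either pool is, which is one simple way to keep the original generation data dominant.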
In practice
- Represent semantic segmentation masks as multi-colored images for class decoding.
- Encode metric depth values into false-color RGB images using a power transform.
- Map surface normal vectors directly to RGB channels for visualization.
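The three encodings above can be sketched per pixel as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the palette, the depth range `D_MAX`, the exponent `GAMMA`, the jet-like colour path, and the `(n + 1) / 2` normal convention are all hypothetical choices. A real pipeline would vectorize these operations over whole images.

```python
import math

# Hypothetical palette (class id -> RGB); the source does not list one.
PALETTE = {0: (0, 0, 0), 1: (220, 20, 60), 2: (0, 128, 255), 3: (60, 180, 75)}

def decode_class(rgb):
    """Return the class whose palette colour is nearest to a generated pixel."""
    return min(PALETTE, key=lambda c: sum((a - b) ** 2 for a, b in zip(PALETTE[c], rgb)))

D_MAX = 80.0  # assumed maximum depth in metres
GAMMA = 0.5   # assumed power-transform exponent (spends more colours on near depths)

# Assumed false-colour path through the RGB cube: blue, cyan, green, yellow, red.
ANCHORS = [(0, 0, 255), (0, 255, 255), (0, 255, 0), (255, 255, 0), (255, 0, 0)]

def _colormap(u):
    """Map u in [0, 1] to a point on the piecewise-linear colour path."""
    x = u * (len(ANCHORS) - 1)
    i = min(int(x), len(ANCHORS) - 2)
    f = x - i
    return tuple(round(p + f * (q - p)) for p, q in zip(ANCHORS[i], ANCHORS[i + 1]))

def encode_depth(d):
    """Clip depth to [0, D_MAX], apply the power transform, then colourize."""
    u = (min(max(d, 0.0), D_MAX) / D_MAX) ** GAMMA
    return _colormap(u)

# Lookup table used to invert the colour map by nearest-colour search.
LUT = [_colormap(k / 1023) for k in range(1024)]

def decode_depth(rgb):
    """Invert encode_depth: find the nearest path colour, then undo the power transform."""
    k = min(range(1024), key=lambda j: sum((a - b) ** 2 for a, b in zip(LUT[j], rgb)))
    return D_MAX * (k / 1023) ** (1 / GAMMA)

def encode_normal(n):
    """Map unit-normal components in [-1, 1] to 8-bit RGB channels."""
    return tuple(round((c + 1.0) * 127.5) for c in n)

def decode_normal(rgb):
    """Map RGB back to [-1, 1] and renormalize to unit length."""
    v = [c / 127.5 - 1.0 for c in rgb]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return tuple(c / norm for c in v)
```

Decoding by nearest colour, rather than exact match, matters because a generated image will rarely reproduce palette or colour-map values exactly; the nearest-neighbour step tolerates small per-pixel deviations.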
Topics
- Image Generators
- Vision Models
- Instruction Tuning
- Generative Pretraining
- Semantic Segmentation
- Depth Estimation
- Foundational Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.