Image Generators are Generalist Vision Learners

2026-03-18 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Vision Banana, a new generalist vision model, demonstrates that image generation training serves as powerful pretraining for visual understanding. It was developed by instruction-tuning the Nano Banana Pro image generator on a mix of its original data and a small amount of vision task data, and it reframes perception tasks as image generation: every task output is an RGB image. This approach achieves state-of-the-art results across a range of 2D and 3D vision tasks, surpassing or rivaling zero-shot domain specialists such as Segment Anything Model 3 on segmentation and Depth Anything 3 on metric depth estimation. Crucially, it retains the base model's image generation capabilities, with win rates of 53.5% on GenAI-Bench and 47.8% on ImgEdit. The work suggests a paradigm shift in which generative vision pretraining becomes central to building Foundational Vision Models for both generation and understanding.

Key takeaway

For AI Scientists and Machine Learning Engineers developing next-generation vision systems, this research indicates that generative pretraining followed by instruction-tuning offers a single, unified route to both generation and understanding. Consider treating image generators like Nano Banana Pro as foundation models and using the visual understanding they acquire during pretraining for tasks ranging from segmentation to metric depth estimation. This approach can reach state-of-the-art performance with one architecture instead of many task-specific ones, but be mindful of the current computational overhead for widespread deployment.

Key insights

Instruction-tuning image generators unlocks powerful generalist visual understanding capabilities, mirroring LLM emergent behaviors.

Principles

Generative pretraining establishes robust visual representations for diverse tasks.
Image generation provides a universal interface for vision tasks via RGB output encoding.
Lightweight instruction-tuning can activate latent understanding without degrading generation.

Method

Instruction-tune a base image generator on its original training data mixed with a small proportion of vision task data, formatting each task output as an RGB image that can be decoded back into the underlying prediction for quantitative evaluation.
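As one illustration of what a decodable RGB output could look like, the sketch below round-trips a class mask through a color image. The palette, the function names, and the nearest-color decoding rule are assumptions made for this example and are not taken from the paper; nearest-color matching is used here because generated pixels are unlikely to hit a palette color exactly.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]

# Hypothetical palette (class index -> color); the paper's colors are not given here.
PALETTE: List[RGB] = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255)]


def encode_mask(mask: Sequence[Sequence[int]]) -> List[List[RGB]]:
    """Turn a 2D grid of class indices into a 2D grid of RGB pixels."""
    return [[PALETTE[c] for c in row] for row in mask]


def nearest_class(pixel: RGB) -> int:
    """Return the class whose palette color is closest in squared RGB distance."""
    def dist2(color: RGB) -> int:
        return sum((a - b) ** 2 for a, b in zip(pixel, color))

    return min(range(len(PALETTE)), key=lambda i: dist2(PALETTE[i]))


def decode_mask(image: Sequence[Sequence[RGB]]) -> List[List[int]]:
    """Invert encode_mask, tolerating small per-pixel color errors."""
    return [[nearest_class(p) for p in row] for row in image]
```

With a decoder of this kind, a generated image can be scored with standard segmentation metrics because it maps back to one class index per pixel.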

In practice

Represent semantic segmentation masks as multi-colored images for class decoding.
Encode metric depth values into false-color RGB images using a power transform.
Map surface normal vectors directly to RGB channels for visualization.
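The depth and surface-normal encodings above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the maximum depth `D_MAX`, the exponent `GAMMA`, the color ramp `RAMP`, and all function names are hypothetical choices made for the example.

```python
import math
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]

# Assumed values for illustration only; the paper's settings are not given here.
D_MAX = 80.0  # largest depth represented, in meters
GAMMA = 0.5   # exponent below 1 spends more of the color range on near depths

# Hypothetical piecewise-linear false-color ramp: blue, cyan, green, yellow, red.
RAMP = [(0, 0, 255), (0, 255, 255), (0, 255, 0), (255, 255, 0), (255, 0, 0)]


def depth_to_rgb(depth_m: float) -> RGB:
    """Power-transform a metric depth to [0, 1], then look it up on the ramp."""
    u = (min(max(depth_m, 0.0), D_MAX) / D_MAX) ** GAMMA
    t = u * (len(RAMP) - 1)
    seg = min(int(t), len(RAMP) - 2)
    f = t - seg
    a, b = RAMP[seg], RAMP[seg + 1]
    return tuple(round(x + (y - x) * f) for x, y in zip(a, b))


def rgb_to_depth(pixel: RGB) -> float:
    """Project a pixel onto the nearest ramp segment and undo the power transform."""
    best_t, best_d2 = 0.0, math.inf
    for seg in range(len(RAMP) - 1):
        a, b = RAMP[seg], RAMP[seg + 1]
        ab = [y - x for x, y in zip(a, b)]
        ap = [p - x for x, p in zip(a, pixel)]
        f = sum(i * j for i, j in zip(ab, ap)) / sum(i * i for i in ab)
        f = min(max(f, 0.0), 1.0)
        d2 = sum((x + i * f - p) ** 2 for x, i, p in zip(a, ab, pixel))
        if d2 < best_d2:
            best_t, best_d2 = seg + f, d2
    u = best_t / (len(RAMP) - 1)
    return D_MAX * u ** (1.0 / GAMMA)


def normal_to_rgb(n: Sequence[float]) -> RGB:
    """Map each unit-normal component from [-1, 1] to an 8-bit channel."""
    return tuple(round((c + 1.0) * 0.5 * 255) for c in n)


def rgb_to_normal(pixel: RGB) -> Tuple[float, ...]:
    """Map channels back to [-1, 1] and renormalize to unit length."""
    v = [c / 255.0 * 2.0 - 1.0 for c in pixel]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return tuple(c / norm for c in v)
```

Both decoders are only approximately exact because of 8-bit quantization. With the assumed settings the depth round trip stays within about 0.1 m, and projecting onto the nearest ramp segment gives some tolerance to pixels that drift off the ramp.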

Topics

Image Generators
Vision Models
Instruction Tuning
Generative Pretraining
Semantic Segmentation
Depth Estimation
Foundational Models

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.