Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards
Summary
The "Ask, Solve, Generate" (ASG) framework introduces a self-evolving training approach for unified large multimodal models (LMMs), enabling autonomous improvement in both visual understanding and image generation using only unlabeled images. The framework operates with three internal roles: a Proposer that generates visual questions, a Solver that answers and evaluates them, and a Generator that synthesizes images. Training relies solely on self-derived consistency signals, eliminating the need for human annotations or external reward models. To stabilize learning, ASG incorporates Solver Token Entropy (STE), a continuous difficulty signal based on token-level prediction uncertainty. For image generation, a multi-scale internal evaluation scheme combines question-answer fidelity scoring with cycle-consistent captioning. Because the Solver performs this evaluation, the two capabilities are coupled: improved understanding yields better assessment of generated images. The framework is compatible with architectures such as BLIP3o, BAGEL, and VARGPT-v1.1. It reports consistent improvements across eight understanding metrics, including a +3.5% absolute gain on MMMU, and raises GenEval image generation performance from 82% to 85% on BAGEL. Code and models are publicly released.
Key takeaway
For Machine Learning Engineers developing large multimodal models, this self-evolving framework offers a path to training without costly human annotations or external reward models. Consider integrating self-derived consistency signals and Solver Token Entropy (STE) into your training pipelines. The approach can autonomously improve both visual understanding and image generation, as reflected in the reported gains on MMMU and GenEval, which may shorten model development cycles.
Key insights
Unified LMMs can autonomously improve multimodal understanding and generation through self-derived consistency signals.
Principles
- Self-consistency signals enable LMM training without external supervision.
- Coupling understanding and generation via internal evaluation strengthens both.
- Token-level prediction uncertainty stabilizes self-supervised learning.
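To make the first principle concrete, here is a minimal sketch of one common way to derive a consistency signal without labels: sample several answers to the same question and reward agreement with the majority. This summary does not say how ASG computes its consistency signal, so majority voting is an assumption used purely for illustration, and `majority_vote_reward` is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_reward(answers: List[str]) -> Tuple[str, float, List[float]]:
    """Derive a pseudo-label and per-sample rewards from sampled answers.

    Assumption: agreement with the most frequent answer stands in for
    correctness. This is a generic self-consistency heuristic, not
    necessarily the signal used by ASG.
    """
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        raise ValueError("need at least one sampled answer")

    # The most frequent normalized answer acts as the pseudo-label.
    pseudo_label, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)

    # Each sample earns 1.0 if it matches the pseudo-label, else 0.0.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in normalized]
    return pseudo_label, agreement, rewards


label, agreement, rewards = majority_vote_reward(["Cat", "cat ", "dog", "cat"])
```

The agreement ratio doubles as a rough confidence estimate: questions where samples split evenly produce a weak, noisy signal, which is one reason a separate difficulty measure can help.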
Method
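ASG uses three internal roles: a Proposer that generates visual questions, a Solver that answers and evaluates them, and a Generator that synthesizes images. All training signals come from self-derived consistency. Solver Token Entropy (STE) supplies a continuous difficulty signal that stabilizes learning, and a multi-scale internal evaluation (QA fidelity scoring plus cycle-consistent captioning) scores the generated images.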
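The summary describes STE only as a continuous difficulty signal based on token-level prediction uncertainty. A minimal sketch of that idea, assuming STE is the mean Shannon entropy of the Solver's next-token distributions over its answer tokens (the exact formula is not given here, and the function names are hypothetical):

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def solver_token_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over the Solver's answer tokens.

    Assumption: averaging over tokens is one plausible aggregation;
    the source does not specify how token entropies are combined.
    """
    if not token_dists:
        raise ValueError("need at least one token distribution")
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def normalized_ste(token_dists: Sequence[Sequence[float]], vocab_size: int) -> float:
    """Scale STE into [0, 1] by dividing by the maximum entropy log(vocab_size)."""
    return solver_token_entropy(token_dists) / math.log(vocab_size)


# A confident Solver (one-hot predictions) versus a maximally unsure one.
confident = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
uncertain = [[0.25, 0.25, 0.25, 0.25]]
easy = normalized_ste(confident, vocab_size=4)
hard = normalized_ste(uncertain, vocab_size=4)
```

Unlike a binary right-or-wrong signal, a value like this varies smoothly between easy and hard questions, which fits the summary's description of STE as a continuous signal.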
In practice
- Implement Solver Token Entropy (STE) as a continuous difficulty signal to stabilize self-supervised LMM training.
- Design an internal evaluation that combines QA fidelity scoring with cycle-consistent captioning.
- Adapt the framework to diffusion or autoregressive LMM backbones.
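The internal evaluation in the second item can be sketched as a weighted sum of two scores. Everything below is an illustrative assumption rather than the paper's implementation: `solve` and `caption` are hypothetical callables standing in for the Solver, exact-match scoring stands in for QA fidelity, word-level Jaccard overlap stands in for caption similarity, and the equal weighting is arbitrary.

```python
from typing import Any, Callable, Sequence, Tuple


def qa_fidelity(image: Any, qa_pairs: Sequence[Tuple[str, str]],
                solve: Callable[[Any, str], str]) -> float:
    """Fraction of questions the Solver answers as expected on the image."""
    if not qa_pairs:
        return 0.0
    hits = sum(
        solve(image, question).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )
    return hits / len(qa_pairs)


def cycle_consistency(prompt: str, caption_text: str) -> float:
    """Word-level Jaccard overlap between the prompt and the re-caption.

    Assumption: a real system would likely use a learned similarity;
    Jaccard overlap keeps this sketch dependency-free.
    """
    a, b = set(prompt.lower().split()), set(caption_text.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def generation_reward(image: Any, prompt: str,
                      qa_pairs: Sequence[Tuple[str, str]],
                      solve: Callable[[Any, str], str],
                      caption: Callable[[Any], str],
                      weight_qa: float = 0.5) -> float:
    """Combine QA fidelity and cycle consistency into one scalar reward."""
    qa = qa_fidelity(image, qa_pairs, solve)
    cyc = cycle_consistency(prompt, caption(image))
    return weight_qa * qa + (1.0 - weight_qa) * cyc


# Stub Solver for demonstration only; a real one would query the LMM.
def stub_solve(image, question):
    return "red" if "color" in question else "two"


def stub_caption(image):
    return "a red cube on grass"


pairs = [("what color is the cube?", "red"), ("how many cubes?", "three")]
reward = generation_reward("img", "a red cube", pairs, stub_solve, stub_caption)
```

Because both scores depend on the Solver, a stronger Solver produces a more reliable reward for the Generator, which is the coupling the summary describes.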
Topics
- Large Multimodal Models
- Self-supervised Learning
- Image Generation
- Visual Understanding
- Self-consistency Rewards
- Solver Token Entropy
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.