Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards

2026-06-25 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

The "Ask, Solve, Generate" (ASG) framework introduces a self-evolving training approach for unified large multimodal models (LMMs), enabling autonomous improvement in both visual understanding and image generation using only unlabeled images. This framework operates with three internal roles: a Proposer for generating visual questions, a Solver for answering and evaluating them, and a Generator for synthesizing images. Training relies solely on self-derived consistency signals, eliminating the need for human annotations or external reward models. To stabilize learning, ASG incorporates Solver Token Entropy (STE), a continuous difficulty signal based on token-level prediction uncertainty. For image generation, a multi-scale internal evaluation scheme combines question-answer fidelity scoring with cycle-consistent captioning, creating a solver-mediated coupling where improved understanding enhances generation assessment. The framework is compatible with architectures like BLIP3o, BAGEL, and VARGPT-v1.1. It achieves consistent improvements across eight understanding metrics, including a +3.5% absolute gain on MMMU and an increase in GenEval image generation performance from 82% to 85% on BAGEL. Code and models are publicly released.

Key takeaway

For machine learning engineers building unified large multimodal models, ASG shows that training on unlabeled images alone, with self-derived consistency signals as the reward, can improve both visual understanding and image generation without human annotations or external reward models. Consider testing self-consistency rewards and Solver Token Entropy (STE) in your own training pipelines. The reported gains include +3.5% absolute on MMMU and a rise from 82% to 85% on GenEval with BAGEL.

Key insights

Unified LMMs can autonomously improve multimodal understanding and generation through self-derived consistency signals.

Principles

Self-consistency signals enable LMM training without external supervision.
Coupling understanding and generation via internal evaluation strengthens both.
Token-level prediction uncertainty, used as a continuous difficulty signal, stabilizes self-supervised learning.

Method

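ASG gives a unified LMM three internal roles: a Proposer that generates visual questions, a Solver that answers and evaluates them, and a Generator that synthesizes images. All training signals are self-derived consistency rewards computed from unlabeled images. Solver Token Entropy (STE) supplies a continuous difficulty signal from the Solver's token-level prediction uncertainty, and a multi-scale internal evaluation scores generated images by combining question-answer fidelity with cycle-consistent captioning.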
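
To make the Solver Token Entropy idea concrete, here is a minimal sketch. It assumes STE is the mean Shannon entropy of the Solver's next-token distributions over its answer tokens; this summary does not give the paper's exact formula, so that definition, the function names, and the optional normalization by log(vocabulary size) are illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    # Zero-probability entries contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def solver_token_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over the Solver's answer tokens.

    ASSUMPTION: averaging over tokens is one plausible aggregation;
    the source only says STE is based on token-level uncertainty.
    """
    if not step_probs:
        raise ValueError("need at least one token distribution")
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)


def normalized_ste(step_probs: Sequence[Sequence[float]], vocab_size: int) -> float:
    """Scale STE to [0, 1] by dividing by the maximum entropy log(vocab_size).

    ASSUMPTION: normalization is added here only for readability.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    return solver_token_entropy(step_probs) / math.log(vocab_size)
```

Under this sketch, a question the Solver answers with near one-hot distributions scores close to 0, while one it answers with near-uniform distributions scores close to 1, which gives a continuous rather than binary notion of difficulty.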

In practice

Implement Solver Token Entropy (STE) as a continuous difficulty signal to stabilize self-supervised LMM training.
Design an internal evaluation for generated images that combines question-answer fidelity with cycle-consistent captioning.
Adapt the framework to diffusion or autoregressive LMM backbones.
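
The internal evaluation in the list above can be sketched as a weighted sum of two scores. This is a hedged illustration, not the paper's scoring rule: exact-match answer comparison, word-level Jaccard overlap as the caption similarity, and the `qa_weight` parameter are all assumptions, and the Solver's answers and caption are passed in as plain strings rather than produced by a real model.

```python
from typing import Sequence


def qa_fidelity(expected: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of questions where the Solver's answer on the generated
    image matches the answer implied by the prompt (case-insensitive)."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        return 0.0
    hits = sum(e.strip().lower() == p.strip().lower()
               for e, p in zip(expected, predicted))
    return hits / len(expected)


def caption_consistency(prompt: str, caption: str) -> float:
    """Cycle-consistency proxy: word-level Jaccard overlap between the
    original prompt and the Solver's caption of the generated image.

    ASSUMPTION: a real system would likely use a learned similarity.
    """
    a, b = set(prompt.lower().split()), set(caption.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def generation_reward(expected: Sequence[str], predicted: Sequence[str],
                      prompt: str, caption: str,
                      qa_weight: float = 0.5) -> float:
    """Combine both scores; qa_weight is a hypothetical mixing weight."""
    if not 0.0 <= qa_weight <= 1.0:
        raise ValueError("qa_weight must be in [0, 1]")
    return (qa_weight * qa_fidelity(expected, predicted)
            + (1.0 - qa_weight) * caption_consistency(prompt, caption))
```

Because both scores come from the Solver, a Solver that understands images better also yields a more reliable reward for the Generator, which is the coupling the summary describes.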

Topics

Large Multimodal Models
Self-supervised Learning
Image Generation
Visual Understanding
Self-consistency Rewards
Solver Token Entropy

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.