Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

The "Ask, Solve, Generate" (ASG) framework introduces a self-evolving training approach for unified large multimodal models (LMMs), enabling autonomous improvement in both visual understanding and image generation using only unlabeled images. This framework operates with three internal roles: a Proposer for generating visual questions, a Solver for answering and evaluating them, and a Generator for synthesizing images. Training relies solely on self-derived consistency signals, eliminating the need for human annotations or external reward models. To stabilize learning, ASG incorporates Solver Token Entropy (STE), a continuous difficulty signal based on token-level prediction uncertainty. For image generation, a multi-scale internal evaluation scheme combines question-answer fidelity scoring with cycle-consistent captioning, creating a solver-mediated coupling where improved understanding enhances generation assessment. The framework is compatible with architectures like BLIP3o, BAGEL, and VARGPT-v1.1. It achieves consistent improvements across eight understanding metrics, including a +3.5% absolute gain on MMMU and an increase in GenEval image generation performance from 82% to 85% on BAGEL. Code and models are publicly released.
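To make the idea of a token-level uncertainty signal concrete, the following is a minimal sketch of how a quantity like Solver Token Entropy could be computed from the Solver's per-token logits. The summary only states that STE is a continuous difficulty signal based on token-level prediction uncertainty. The mean aggregation over tokens, the normalization by the maximum possible entropy, and the function names `softmax`, `token_entropy`, and `solver_token_entropy` are assumptions made for illustration, not the paper's definition.

```python
import math


def softmax(logits):
    """Convert one step's logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def solver_token_entropy(step_logits):
    """Mean per-token entropy of an answer, scaled to [0, 1].

    step_logits: one list of vocabulary logits per generated answer token.
    Assumption: averaging over tokens and dividing by log(vocab_size)
    are illustrative choices, not taken from the source.
    """
    if not step_logits:
        raise ValueError("need at least one decoding step")
    vocab_size = len(step_logits[0])
    if vocab_size < 2:
        raise ValueError("vocabulary must contain at least two tokens")
    max_entropy = math.log(vocab_size)
    entropies = [token_entropy(softmax(logits)) for logits in step_logits]
    return sum(entropies) / len(entropies) / max_entropy


# A confident answer scores near 0, a maximally uncertain one near 1.
confident = solver_token_entropy([[100.0, 0.0, 0.0, 0.0]])
uncertain = solver_token_entropy([[0.0, 0.0, 0.0, 0.0]])
```

Under these assumptions, a low score suggests the Solver finds the question easy and a high score suggests it is close to guessing, which is what makes the value usable as a continuous difficulty signal rather than a binary right-or-wrong label.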

Key takeaway

For Machine Learning Engineers developing large multimodal models, this self-evolving framework offers a path to reduce reliance on costly human annotations, since training uses only unlabeled images and self-derived rewards. Consider integrating self-derived consistency signals and Solver Token Entropy (STE) into your training pipelines. The approach improves both visual understanding and image generation without an external reward model, with reported gains on MMMU and GenEval.

Key insights

Unified LMMs can autonomously improve multimodal understanding and generation through self-derived consistency signals.

Method

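A single unified model plays three roles: a Proposer that generates visual questions, a Solver that answers and evaluates them, and a Generator that synthesizes images. All training signals come from self-derived consistency, with no human annotations or external reward models. Solver Token Entropy (STE) provides a continuous difficulty signal that stabilizes learning, and a multi-scale internal evaluation (QA fidelity scoring plus cycle-consistent captioning) scores generated images.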
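The generation-side evaluation can be illustrated with a small sketch that combines a question-answer fidelity score with a cycle-consistency score between the original prompt and a caption of the generated image. The summary does not specify how the two signals are computed or combined, so everything here is an assumption for illustration: exact-match answer comparison, word-level Jaccard overlap as the caption similarity, the linear mix with `qa_weight`, and the function names themselves. In a real pipeline the answers and caption would come from the Solver; here they are passed in as plain strings.

```python
def qa_fidelity(expected_answers, solver_answers):
    """Fraction of questions the Solver answers as expected on the generated image."""
    if len(expected_answers) != len(solver_answers):
        raise ValueError("answer lists must have the same length")
    if not expected_answers:
        return 0.0
    hits = sum(
        expected.strip().lower() == got.strip().lower()
        for expected, got in zip(expected_answers, solver_answers)
    )
    return hits / len(expected_answers)


def caption_overlap(prompt, caption):
    """Word-level Jaccard similarity between the prompt and the re-caption.

    Assumption: a stand-in for whatever cycle-consistency measure is used.
    """
    prompt_words = set(prompt.lower().split())
    caption_words = set(caption.lower().split())
    union = prompt_words | caption_words
    if not union:
        return 0.0
    return len(prompt_words & caption_words) / len(union)


def generation_score(qa_score, cycle_score, qa_weight=0.5):
    """Linear mix of the two signals (hypothetical weighting)."""
    if not 0.0 <= qa_weight <= 1.0:
        raise ValueError("qa_weight must lie in [0, 1]")
    return qa_weight * qa_score + (1.0 - qa_weight) * cycle_score


# Example: one of two questions answered as expected, caption partly matches.
qa = qa_fidelity(["red", "two"], ["Red", "three"])
cycle = caption_overlap("a red cube", "a blue cube")
score = generation_score(qa, cycle)
```

Because both inputs to the score are produced by the Solver, a Solver that answers and captions more reliably yields a more trustworthy score for the Generator, which is the solver-mediated coupling the summary describes.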

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.