Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Meta AI and Emory University researchers have developed BLPO, a bi-level prompt optimization framework designed to improve the alignment of multimodal large language model (MLLM) judges with human evaluations, particularly for AI-generated images. Existing automatic prompt optimization (APO) methods struggle in multimodal settings because an MLLM's limited visual context window restricts how many visual examples can be used for trial-and-error prompt refinement. BLPO addresses this by converting images into textual representations that preserve evaluation-relevant visual cues, and by jointly refining the judge prompt and an image-to-text (I2T) prompt. Experiments across four datasets (AGIN, SeeTRUE, ImageReward, UnsafeBench) and three MLLM judges (Llama-4-Scout-17B-16E-instruct, Llama-4-Maverick-17B-128E-instruct, Qwen2.5-VL-32B-instruct) show that BLPO performs better and more stably than baselines such as OPRO, APO-image, and TextGrad, scoring 8% higher on average on UnsafeBench.

Key takeaway

For AI Scientists and Research Scientists evaluating AI-generated content with multimodal LLMs, the BLPO framework can bring automated judgments closer to human judgments. Consider this bi-level optimization approach when the visual context window limits how many examples a prompt optimizer can see: verbalizing the evaluation-relevant visual cues lets more examples fit, and the reported experiments show more accurate and more stable judge performance as a result. The method is also a cost-effective alternative to supervised fine-tuning for improving MLLM judge reliability.

Key insights

Bi-level prompt optimization improves multimodal LLM judge alignment by converting images to text and co-optimizing judge and I2T prompts.

Principles

Method

BLPO uses a bi-level optimization framework to refine both the judge prompt and an image-to-text (I2T) prompt. Images are converted to textual representations so that more examples fit within the MLLM's limited context window, and the I2T prompt is refined so that those representations keep the visual cues the evaluation depends on.
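
The two-level structure described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary does not say which prompt sits at which level, so the choice here (judge prompt refined in the inner loop, I2T prompt revised in the outer loop), the greedy accept-if-better rule, and every name (`Example`, `evaluate`, `refine_judge`, `bilevel_optimize`, and the `toy_*` stand-ins that replace real MLLM calls) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Example:
    image: str        # stand-in for an image handle (path, ID, ...)
    human_label: str  # human judgment the judge should reproduce


# Hypothetical model interfaces; real ones would wrap MLLM calls.
Describe = Callable[[str, str], str]       # (i2t_prompt, image) -> text description
Judge = Callable[[str, str], str]          # (judge_prompt, description) -> label
Propose = Callable[[str, List[str]], str]  # (prompt, failure notes) -> revised prompt


def evaluate(judge_prompt: str, i2t_prompt: str, data: Sequence[Example],
             describe: Describe, judge: Judge) -> Tuple[float, List[str]]:
    """Return (accuracy against human labels, notes on misjudged examples)."""
    failures = []
    for ex in data:
        description = describe(i2t_prompt, ex.image)
        predicted = judge(judge_prompt, description)
        if predicted != ex.human_label:
            failures.append(f"{description!r}: predicted {predicted}, human said {ex.human_label}")
    return 1 - len(failures) / len(data), failures


def refine_judge(judge_prompt: str, i2t_prompt: str, data: Sequence[Example],
                 describe: Describe, judge: Judge, propose_judge: Propose,
                 steps: int) -> Tuple[str, float, List[str]]:
    """Inner level: improve the judge prompt while the I2T prompt stays fixed."""
    best, failures = evaluate(judge_prompt, i2t_prompt, data, describe, judge)
    for _ in range(steps):
        if not failures:
            break
        candidate = propose_judge(judge_prompt, failures)
        score, cand_failures = evaluate(candidate, i2t_prompt, data, describe, judge)
        if score > best:  # greedy: keep a candidate only if it scores strictly higher
            judge_prompt, best, failures = candidate, score, cand_failures
    return judge_prompt, best, failures


def bilevel_optimize(judge_prompt: str, i2t_prompt: str, data: Sequence[Example],
                     describe: Describe, judge: Judge,
                     propose_judge: Propose, propose_i2t: Propose,
                     outer_steps: int = 3, inner_steps: int = 2) -> Tuple[str, str, float]:
    """Outer level: revise the I2T prompt, scoring each candidate after re-tuning the judge."""
    judge_prompt, best, failures = refine_judge(
        judge_prompt, i2t_prompt, data, describe, judge, propose_judge, inner_steps)
    for _ in range(outer_steps):
        if not failures:
            break
        candidate_i2t = propose_i2t(i2t_prompt, failures)
        cand_judge, score, cand_failures = refine_judge(
            judge_prompt, candidate_i2t, data, describe, judge, propose_judge, inner_steps)
        if score > best:
            i2t_prompt, judge_prompt, best, failures = candidate_i2t, cand_judge, score, cand_failures
    return judge_prompt, i2t_prompt, best


# Toy stand-ins so the sketch runs without any model.
def toy_describe(i2t_prompt: str, image: str) -> str:
    text = "a photo of a cat"
    if "artifacts" in i2t_prompt and image == "generated.png":
        text += " with visible artifacts around the whiskers"
    return text


def toy_judge(judge_prompt: str, description: str) -> str:
    flagged = "artifacts" in judge_prompt and "artifacts" in description
    return "fake" if flagged else "real"


def toy_propose_judge(prompt: str, failures: List[str]) -> str:
    return prompt + " Treat reported artifacts as evidence the image is fake."


def toy_propose_i2t(prompt: str, failures: List[str]) -> str:
    return prompt + " Mention any visual artifacts."


data = [Example("generated.png", "fake"), Example("camera.png", "real")]
judge_prompt, i2t_prompt, score = bilevel_optimize(
    "Is the described image real or fake?", "Describe the image.",
    data, toy_describe, toy_judge, toy_propose_judge, toy_propose_i2t)
print(score)
```

In this toy setup neither prompt helps when changed alone: a judge prompt that looks for artifacts gains nothing if the description never mentions them, and the reverse also holds. Scoring each I2T candidate only after the judge prompt has been re-tuned against it is what lets the pair improve together, which is the intuition behind refining the two prompts jointly.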

In practice

Topics

Best for: AI Scientist, Research Scientist, AI Researcher, AI Engineer, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.