Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

2026-02-26 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

A study introduces Multimodal Evaluator Preference Collapse (EPC) and cross-modal contagion in AI agents that use large language models (LLMs) for self-evaluation. Using GPT-4o to evaluate DeepSeek-chat on text and visual tasks, researchers found that EPC is dramatically amplified in multimodal contexts, with a single strategy ("step_by_step") absorbing 48.4% of all weight, a 3.2x increase over text-only self-evaluation. Visual-domain strategies received only 9.1% combined weight. The research also identified cross-modal contagion, where evaluator preferences acquired on one modality corrupt strategy selection on another, leading to strategy inversion. Statistical validation across GPT-4o, Qwen-plus, DashScope, and DeepSeek-chat evaluators revealed that cross-model evaluation produces strong contagion, while self-evaluation offers near-complete immunity (97% zero contagion). The study formalizes these dynamics with the contagion matrix Γ^{(ℳ)} and releases the MM-EPC experimental framework.

Key takeaway

MLOps engineers deploying multimodal AI agents with LLM evaluators must account for evaluator-conditional preference drift. Such systems risk silently converging on strategies optimized for the evaluator rather than the task, especially under cross-model evaluation. To mitigate this, consider self-evaluation or multi-evaluator ensembles, and isolate modality-specific training phases to prevent cross-modal bias transfer. Monitor strategy weights across training rounds to catch single-strategy collapse early.
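One way to watch for single-strategy collapse is to track how concentrated the strategy weights become. The sketch below is an illustration, not the paper's metric: the function name, the 0.4 threshold, and every strategy name other than "step_by_step" are assumptions. Only the 48.4% and 9.1% shares come from the summary above; the remaining weight is lumped into a placeholder entry.

```python
import math


def collapse_report(weights, share_threshold=0.4):
    """Summarize how concentrated a strategy-weight distribution is.

    `share_threshold` is an arbitrary illustrative cutoff, not a value
    taken from the study.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    shares = {name: w / total for name, w in weights.items()}
    top_name = max(shares, key=shares.get)
    # Shannon entropy normalized to [0, 1]; 1.0 means perfectly uniform.
    entropy = -sum(p * math.log(p) for p in shares.values() if p > 0)
    max_entropy = math.log(len(shares)) if len(shares) > 1 else 1.0
    return {
        "top_strategy": top_name,
        "top_share": shares[top_name],
        "normalized_entropy": entropy / max_entropy,
        "collapsed": shares[top_name] >= share_threshold,
    }


# Illustrative weights: 0.484 for step_by_step and 0.091 combined for the
# two visual strategies mirror the summary; the split and the lumped
# "all_other_text" entry are placeholders.
example_weights = {
    "step_by_step": 0.484,
    "visual_a": 0.045,
    "visual_b": 0.046,
    "all_other_text": 0.425,
}
report = collapse_report(example_weights)
print(report)
```

Logging a report like this after every training round makes a drift toward one strategy visible before it dominates.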

Key insights

LLM evaluation in multimodal settings amplifies preference collapse and transfers biases across modalities, while self-evaluation offers strong immunity.

Principles

Evaluator identity dictates contagion dynamics.
Cross-model evaluation amplifies bias.
Excessive training rounds can collapse strategy diversity.

Method

The Test-Time Reinforcement Learning (TTRL) framework, a stochastic bandit process, updates strategy weights based on pairwise LLM evaluator judgments. An isolation training paradigm measures cross-modal contagion using a coefficient γ₊→₋.
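The summary does not give the exact update rule, so the following is only a minimal sketch of a bandit-style loop driven by pairwise judgments. It assumes a multiplicative (exponentiated) update with a fixed learning rate, and it replaces the LLM evaluator with a hypothetical stub that always prefers "step_by_step". All other strategy names, the function names, and the parameters are assumptions made for illustration.

```python
import math
import random


def pick_pair(weights, rng):
    """Sample two distinct strategies with probability proportional to weight."""
    names = list(weights)
    first = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    rest = [n for n in names if n != first]
    second = rng.choices(rest, weights=[weights[n] for n in rest], k=1)[0]
    return first, second


def update_weights(weights, winner, loser, lr=0.1):
    """Boost the winner, shrink the loser, then renormalize to sum to 1."""
    updated = dict(weights)
    updated[winner] *= math.exp(lr)
    updated[loser] *= math.exp(-lr)
    total = sum(updated.values())
    return {name: w / total for name, w in updated.items()}


def make_biased_judge(favorite, seed=1):
    """Hypothetical stand-in for an LLM evaluator with a fixed preference."""
    rng = random.Random(seed)

    def judge(a, b):
        if favorite in (a, b):
            return favorite
        return rng.choice((a, b))  # no preference among the rest

    return judge


def run_rounds(strategies, judge, rounds=200, lr=0.1, seed=0):
    """Run the pairwise-comparison loop from uniform initial weights."""
    rng = random.Random(seed)
    weights = {s: 1.0 / len(strategies) for s in strategies}
    for _ in range(rounds):
        a, b = pick_pair(weights, rng)
        winner = judge(a, b)
        loser = b if winner == a else a
        weights = update_weights(weights, winner, loser, lr)
    return weights


strategies = ["step_by_step", "visual_grounding", "direct_answer", "decompose"]
final = run_rounds(strategies, make_biased_judge("step_by_step"))
for name, weight in sorted(final.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.3f}")
```

Even in this toy setting, a consistent evaluator preference concentrates weight on one strategy regardless of task quality, which is the qualitative behavior the study calls preference collapse.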

In practice

Report PCI and Γ for dynamic evaluation systems.
Use multiple evaluators from different model families.
Isolate modality-specific training phases.
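The multi-evaluator recommendation above can be sketched as a simple majority vote over independent judges. This is an assumed aggregation scheme, not one confirmed by the source. The evaluator callables are hypothetical stubs standing in for models from different families, and "visual_grounding" is a placeholder strategy name.

```python
from collections import Counter


def ensemble_judge(evaluators, a, b):
    """Return the majority-preferred strategy, or None if the vote is tied."""
    votes = Counter()
    for evaluate in evaluators:
        choice = evaluate(a, b)
        if choice not in (a, b):
            raise ValueError(f"evaluator returned unknown strategy: {choice!r}")
        votes[choice] += 1
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: caller can skip the weight update for this round
    return ranked[0][0]


# Hypothetical stubs; in practice each would wrap a different LLM evaluator.
def prefers_step_by_step(a, b):
    return "step_by_step" if "step_by_step" in (a, b) else a


def prefers_first(a, b):
    return a


def prefers_second(a, b):
    return b


majority = ensemble_judge(
    [prefers_step_by_step, prefers_first, prefers_second],
    "visual_grounding",
    "step_by_step",
)
tied = ensemble_judge([prefers_first, prefers_second], "visual_grounding", "step_by_step")
print(majority, tied)
```

Returning None on a tie lets the caller skip the update instead of forcing a winner, so a split panel does not push the weights in either direction.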

Topics

Multimodal AI
LLM Evaluation
Evaluator Bias
Cross-Modal Contagion
Self-Evaluation
Agent Systems
Test-Time Reinforcement Learning

Code references

aidless/mm-epc

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.