Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents
Summary
A study introduces Multimodal Evaluator Preference Collapse (EPC) and cross-modal contagion in AI agents that use large language models (LLMs) for self-evaluation. Using GPT-4o to evaluate DeepSeek-chat on text and visual tasks, researchers found that EPC is dramatically amplified in multimodal contexts, with a single strategy ("step_by_step") absorbing 48.4% of all weight, a 3.2x increase over text-only self-evaluation. Visual-domain strategies received only 9.1% combined weight. The research also identified cross-modal contagion, where evaluator preferences acquired on one modality corrupt strategy selection on another, leading to strategy inversion. Statistical validation across GPT-4o, Qwen-plus, DashScope, and DeepSeek-chat evaluators revealed that cross-model evaluation produces strong contagion, while self-evaluation offers near-complete immunity (97% zero contagion). The study formalizes these dynamics with the contagion matrix Γ^{(ℳ)} and releases the MM-EPC experimental framework.
Key takeaway
If you are an MLOps engineer deploying multimodal AI agents with LLM evaluators, account for evaluator-conditional preference drift. Your systems risk silently converging on strategies optimized for the evaluator rather than the task, especially under cross-model evaluation. To mitigate this, consider self-evaluation or multi-evaluator ensembles, and isolate modality-specific training phases to prevent cross-modal bias transfer. Track strategy weights across training rounds so that single-strategy collapse is caught early.
Key insights
Multimodal LLM evaluation amplifies preference collapse, and cross-model evaluation transfers evaluator biases between modalities, whereas self-evaluation offers strong immunity.
Principles
- Evaluator identity dictates contagion dynamics.
- Cross-model evaluation amplifies bias.
- Excessive training rounds can collapse strategy diversity.
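
One way to watch for the loss of strategy diversity described above is to track how concentrated the strategy weights have become after each round. The sketch below is illustrative only: `max_share`, `normalized_entropy`, `is_collapsing`, and the 0.4 threshold are hypothetical names and values, not the study's PCI metric, whose definition is not given in this summary.

```python
import math

def max_share(weights):
    """Fraction of total weight held by the single heaviest strategy."""
    total = sum(weights.values())
    return max(weights.values()) / total

def normalized_entropy(weights):
    """Shannon entropy of the weight distribution, scaled to [0, 1].

    1.0 means weight is spread evenly; values near 0 mean one strategy dominates.
    """
    if len(weights) < 2:
        return 0.0
    total = sum(weights.values())
    probs = [w / total for w in weights.values() if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(weights))

def is_collapsing(weights, share_limit=0.4):
    """Flag a distribution whose top strategy exceeds a chosen (hypothetical) limit."""
    return max_share(weights) > share_limit

# Illustrative distribution built from the two figures quoted in the summary;
# the grouping of the remaining weight is an assumption made for this example.
reported = {
    "step_by_step": 0.484,
    "visual_strategies_combined": 0.091,
    "all_other_strategies": 0.425,
}
uniform = {name: 1.0 for name in ("a", "b", "c", "d")}
```

Calling `is_collapsing(reported)` flags the quoted 48.4% share under the 0.4 limit, while `uniform` passes with a normalized entropy of about 1.0. Logging both numbers per training round gives a simple early-warning signal.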
Method
The study uses a Test-Time Reinforcement Learning (TTRL) framework, modeled as a stochastic bandit process, in which strategy weights are updated from pairwise LLM evaluator judgments. An isolation training paradigm then measures cross-modal contagion using a coefficient γ₊→₋.
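
To make the mechanism concrete, here is a minimal sketch of how pairwise judgments can drive bandit-style weight updates, and how a biased evaluator pushes the weights toward one strategy. It is not the study's TTRL implementation: the exponential-weights rule, the `biased_judge` stand-in for an LLM evaluator, the learning rate, the bias level, and every strategy name except "step_by_step" are assumptions chosen for illustration.

```python
import math
import random

def update_weights(weights, winner, loser, lr=0.1):
    """Exponential-weights step: boost the winner, shrink the loser, renormalize."""
    new = dict(weights)
    new[winner] *= math.exp(lr)
    new[loser] *= math.exp(-lr)
    total = sum(new.values())
    return {name: w / total for name, w in new.items()}

def biased_judge(a, b, rng, favored="step_by_step", bias=0.8):
    """Hypothetical evaluator: picks `favored` with probability `bias` when it
    is in the pair, otherwise chooses between the two at random."""
    if favored in (a, b) and rng.random() < bias:
        winner = favored
    else:
        winner = rng.choice([a, b])
    loser = b if winner == a else a
    return winner, loser

def run(rounds=200, seed=0):
    """Simulate repeated pairwise comparisons and return the final weights."""
    rng = random.Random(seed)
    # Only "step_by_step" appears in the summary; the rest are placeholders.
    strategies = ["step_by_step", "visual_grounding", "direct_answer", "region_focus"]
    weights = {s: 1.0 / len(strategies) for s in strategies}
    for _ in range(rounds):
        a, b = rng.sample(strategies, 2)  # compare two distinct strategies
        winner, loser = biased_judge(a, b, rng)
        weights = update_weights(weights, winner, loser)
    return weights

final_weights = run()
```

In this toy setup the favored strategy ends up holding most of the weight even though no task reward is ever consulted, which is the failure mode the summary describes: the system optimizes for the evaluator rather than the task.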
In practice
- Report PCI and Γ for dynamic evaluation systems.
- Use multiple evaluators from different model families.
- Isolate modality-specific training phases.
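
The multi-evaluator recommendation above could be sketched as a simple majority vote over independent judgments. This is an assumption about how an ensemble might be wired up, not a procedure taken from the study; `ensemble_judgment` and the evaluator labels are hypothetical.

```python
from collections import Counter

def ensemble_judgment(votes):
    """Return the strategy preferred by most evaluators, or None on a tie.

    `votes` maps an evaluator label to the strategy that evaluator preferred
    in one pairwise comparison.
    """
    if not votes:
        return None
    ranked = Counter(votes.values()).most_common()
    # A tie at the top means the ensemble has no clear preference.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

# Hypothetical votes from three evaluators drawn from different model families.
votes = {
    "evaluator_family_a": "step_by_step",
    "evaluator_family_b": "visual_grounding",
    "evaluator_family_c": "visual_grounding",
}
decision = ensemble_judgment(votes)
```

Returning `None` on a tie lets the caller skip the weight update for that round instead of letting a single evaluator's preference decide it.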
Topics
- Multimodal AI
- LLM Evaluation
- Evaluator Bias
- Cross-Modal Contagion
- Self-Evaluation
- Agent Systems
- Test-Time Reinforcement Learning
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.