On the Adversarial Robustness of Multimodal LLM Judges
Summary
The paper introduces RobustMLLMJudge, which it presents as the first general framework for evaluating the adversarial robustness of Multimodal Large Language Models (MLLMs) used as automated judges for tasks such as image quality and safety assessment. The framework reveals that a range of MLLM judges are highly susceptible to score-inflating adversarial attacks. Mounting such attacks is difficult because MLLM judges impose their own evaluation protocol constraints. To work around those constraints, the paper proposes MGSIA, the Manifold-Guided Semantic Induction Attack. MGSIA combines affirmative semantic induction with high-score manifold alignment: it maximizes affirmative responses to binary semantic queries while regularizing adversarial representations toward high-score centers. The paper reports that the resulting score-inflating perturbations are transferable and generalize better than prior attacks in deceiving advanced MLLM judges across different evaluation scenarios.
Key takeaway
If you are an AI security engineer or ML engineer deploying multimodal LLM judges, prioritize adversarial robustness. These judges remain vulnerable to score-inflating attacks even under their evaluation protocol constraints, so automated judging systems can be manipulated. Integrate frameworks like RobustMLLMJudge into your evaluation pipelines and develop defenses against advanced methods such as MGSIA to keep your MLLM-based assessments fair and reliable.
Key insights
MLLM judges are vulnerable to adversarial attacks, which calls for dedicated robustness evaluation frameworks and for stress-testing with strong attacks such as MGSIA.
Principles
- MLLM judges are highly vulnerable to score-inflating attacks.
- Attacks on MLLM judges must work within the judges' evaluation protocol constraints.
- Combining semantic induction with manifold alignment yields transferable attacks.
Method
MGSIA combines affirmative semantic induction with high-score manifold alignment to maximize affirmative responses and regularize adversarial representations toward high-score centers.
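As a rough illustration of how the two terms could fit together, the Python sketch below scores a candidate perturbation with a toy objective. Everything in it is an assumption made for illustration and not the paper's implementation: the function names, the (yes, no) logit pairs standing in for the judge's answers to binary semantic queries, the squared-distance penalty toward a high-score center, and the weight `lam`.

```python
import math


def affirmative_loss(yes_logit, no_logit):
    """Negative log-probability of 'yes' under a two-way softmax (assumed form)."""
    margin = yes_logit - no_logit
    # Numerically stable softplus(-margin) = -log(sigmoid(margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def manifold_loss(feature, center):
    """Squared Euclidean distance from a feature vector to a high-score center."""
    return sum((f - c) ** 2 for f, c in zip(feature, center))


def mgsia_style_objective(query_logits, feature, center, lam=0.1):
    """Toy combined objective: lower means more 'yes' answers and a feature
    closer to the high-score center. `lam` is a hypothetical trade-off weight."""
    induction = sum(affirmative_loss(y, n) for y, n in query_logits) / len(query_logits)
    return induction + lam * manifold_loss(feature, center)
```

An attacker would minimize this value over the perturbation. The summary does not state the optimizer, the distance measure, or the weighting, so those choices are left open here.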
In practice
- Implement RobustMLLMJudge for MLLM adversarial testing.
- Develop defenses against MGSIA-style score inflation.
- Assess MLLM judge reliability in safety evaluations.
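One simple way to start on the items above is to compare a judge's scores on clean inputs with its scores on the adversarially perturbed versions. The sketch below is a minimal, hypothetical helper and not part of RobustMLLMJudge: the function name, the metric names, and the `threshold` parameter are all assumptions, and the scores are expected to come from whatever judge you already run.

```python
def score_inflation_report(clean_scores, adv_scores, threshold=0.0):
    """Summarize how much a judge's scores rise under attack.

    clean_scores and adv_scores are paired per input; `threshold` is the
    minimum score gain counted as inflation (a hypothetical parameter).
    """
    if not clean_scores or len(clean_scores) != len(adv_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    gains = [adv - clean for clean, adv in zip(clean_scores, adv_scores)]
    return {
        "mean_gain": sum(gains) / len(gains),                            # average score increase
        "inflated_rate": sum(g > threshold for g in gains) / len(gains), # share of inputs inflated
        "max_gain": max(gains),                                          # worst-case increase
    }


# Example with made-up scores on a 1-5 scale.
report = score_inflation_report([2.0, 3.0, 4.0], [4.0, 3.0, 5.0])
```

Tracking these numbers per judge and per evaluation scenario gives a baseline against which any defense can later be checked.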
Topics
- Multimodal LLMs
- Adversarial Robustness
- MLLM Judges
- RobustMLLMJudge
- MGSIA
- Image Quality Assessment
- Safety Assessment
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.