Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades
Summary
The Forced Deferral Attack (FDA) is a newly identified vulnerability targeting Multimodal Large Language Model (MLLM) cascades. These cascades are designed to reduce computational costs by initially querying a weaker, cheaper model and only deferring to a stronger, more expensive model when the weak model expresses low confidence. FDA exploits this mechanism by using an adversarial image attack to manipulate the weak model's confidence, forcing queries to be routed to the computationally intensive strong model. The attack learns a universal border trigger and optimizes a temperature-flattened objective to push the weak model's token distribution towards less concentrated targets. FDA consistently increases strong-model routing across various datasets, model families, and deferral metrics, outperforming image-perturbation and prompt-injection baselines. This demonstrates that MLLM cascades are susceptible to attacks that manipulate compute allocation, leading to unintended strong-model usage without directly targeting answer correctness.
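To make the routing mechanism described above concrete, here is a minimal sketch of a two-model cascade decision. It assumes confidence is the weak model's maximum softmax probability and uses a fixed threshold of 0.7. Both are illustrative assumptions, since the summary only says that deferral metrics vary, and the names `max_prob_confidence` and `route` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_prob_confidence(logits: List[float]) -> float:
    """Assumed deferral metric: probability of the most likely token."""
    return max(softmax(logits))


def route(weak_logits: List[float], threshold: float = 0.7) -> str:
    """Return which model answers: 'weak' if confident, else defer to 'strong'."""
    confidence = max_prob_confidence(weak_logits)
    return "weak" if confidence >= threshold else "strong"


# A peaked distribution stays on the weak model; a flat one is deferred.
peaked_decision = route([5.0, 0.0, 0.0])
flat_decision = route([1.0, 0.9, 0.8])
```

An attack that flattens the weak model's output distribution moves inputs from the first case to the second, which is exactly the routing change FDA aims for.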
Key takeaway
For AI security engineers and MLLM system architects deploying cascades, this research highlights a concrete vulnerability: the Forced Deferral Attack. Adversarial images can lower the weak model's confidence and force queries onto the expensive strong model, raising operational cost without directly targeting answer correctness. Prioritize robust confidence estimation and defenses against universal adversarial triggers to protect compute allocation and cost efficiency.
Key insights
Adversarial images can lower the weak model's confidence in an MLLM cascade, forcing queries onto the expensive strong model and increasing computational cost.
Principles
- Weak model confidence directly controls compute allocation in MLLM cascades.
- A single universal adversarial trigger can lower model confidence across many inputs.
- Attacks can target system efficiency, not just correctness.
Method
FDA learns a universal border trigger by optimizing a temperature-flattened objective, which pushes the weak model's token distribution on triggered inputs toward less concentrated targets derived from its clean responses.
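The two ingredients named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the flattened target is taken to be the clean logits passed through a softmax with temperature greater than 1, the objective is taken to be cross-entropy between that target and the triggered distribution, and the border trigger simply overwrites the outer pixels of a 2-D image. The helper names and the temperature value of 4.0 are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means less concentrated."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cross_entropy(target, predicted):
    """Assumed objective: cross-entropy of predicted against a soft target."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def apply_border_trigger(image, trigger, width):
    """Copy trigger values into the outer `width` pixels of a 2-D image."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(h):
        for j in range(w):
            if i < width or i >= h - width or j < width or j >= w - width:
                out[i][j] = trigger[i][j]
    return out


# Hypothetical clean logits from the weak model for one token position.
clean_logits = [4.0, 1.0, 0.5, 0.0]
sharp = softmax(clean_logits)                         # clean distribution
flat_target = softmax(clean_logits, temperature=4.0)  # less concentrated target

# The loss is higher at the clean distribution than at the target, so
# minimizing it over the trigger pushes the output toward the flat target.
loss_before = cross_entropy(flat_target, sharp)
loss_at_target = cross_entropy(flat_target, flat_target)

# The trigger touches only the border and leaves the image center intact.
triggered = apply_border_trigger([[0] * 4 for _ in range(4)],
                                 [[1] * 4 for _ in range(4)], width=1)
```

In a real attack the trigger pixels would be optimized by gradient descent through the weak model over many images; that loop is omitted here because it depends on the model and framework.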
In practice
- Implement robust confidence estimation in MLLM cascades.
- Monitor strong model usage for unexpected spikes.
- Develop defenses against universal adversarial triggers.
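The monitoring recommendation above could be sketched as a rolling check on the deferral rate. This is a minimal illustration, assuming a known baseline rate from clean traffic; the class name `DeferralRateMonitor`, the window size, and the tolerance are hypothetical values to be tuned per deployment.

```python
from collections import deque


class DeferralRateMonitor:
    """Flags when the recent strong-model routing rate exceeds a baseline."""

    def __init__(self, baseline_rate, window=100, tolerance=0.15):
        self.baseline_rate = baseline_rate  # expected deferral rate on clean traffic
        self.tolerance = tolerance          # allowed excess before alerting
        self.window = deque(maxlen=window)  # most recent routing decisions

    def record(self, deferred):
        """Record one routing decision; return True if a spike is detected."""
        self.window.append(1 if deferred else 0)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        rate = sum(self.window) / len(self.window)
        return rate > self.baseline_rate + self.tolerance


monitor = DeferralRateMonitor(baseline_rate=0.2, window=10, tolerance=0.15)

# Normal traffic: 2 of 10 queries deferred, matching the baseline.
normal_alerts = [monitor.record(i < 2) for i in range(10)]

# Suspicious traffic: every query is deferred.
attack_alerts = [monitor.record(True) for _ in range(10)]
```

A rate-based alarm like this detects the cost symptom rather than the trigger itself, so it complements rather than replaces input-level defenses.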
Topics
- Multimodal LLMs
- MLLM Cascades
- Forced Deferral Attack
- Adversarial Attacks
- Computational Cost
- Model Security
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.