Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

The Forced Deferral Attack (FDA) is a newly identified vulnerability in Multimodal Large Language Model (MLLM) cascades. A cascade reduces computational cost by querying a weaker, cheaper model first and deferring to a stronger, more expensive model only when the weak model expresses low confidence. FDA exploits this mechanism: it uses adversarial images to lower the weak model's confidence, so that queries are routed to the computationally intensive strong model. The attack learns a universal border trigger and optimizes a temperature-flattened objective that pushes the weak model's token distribution toward less concentrated targets. FDA consistently increases strong-model routing across datasets, model families, and deferral metrics, and it outperforms image-perturbation and prompt-injection baselines. The result shows that MLLM cascades are susceptible to attacks on compute allocation, which cause unintended strong-model usage without directly targeting answer correctness.
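
The routing rule described above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the confidence score (mean top-token probability), the 0.7 threshold, and the stub models `confident_weak`, `flattened_weak`, and `strong_model` are all assumptions made for the example. The summary only says that deferral depends on the weak model's confidence.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical weak-model interface: returns an answer plus one probability
# distribution per generated token. Real MLLM APIs will differ.
WeakModel = Callable[[str], Tuple[str, List[List[float]]]]
StrongModel = Callable[[str], str]


def mean_max_prob(token_dists: Sequence[Sequence[float]]) -> float:
    """Assumed confidence score: average top-token probability."""
    return sum(max(dist) for dist in token_dists) / len(token_dists)


def cascade(query: str, weak: WeakModel, strong: StrongModel,
            threshold: float = 0.7) -> Tuple[str, str, float]:
    """Answer with the weak model unless its confidence is below threshold."""
    answer, token_dists = weak(query)
    confidence = mean_max_prob(token_dists)
    if confidence >= threshold:
        return answer, "weak", confidence
    # Low confidence: defer to the expensive model.
    return strong(query), "strong", confidence


# Stub models for illustration only.
def confident_weak(query: str) -> Tuple[str, List[List[float]]]:
    return "cat", [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]


def flattened_weak(query: str) -> Tuple[str, List[List[float]]]:
    # Same top token at every step, but a much flatter distribution.
    return "cat", [[0.4, 0.3, 0.3], [0.5, 0.25, 0.25]]


def strong_model(query: str) -> str:
    return "cat"


print(cascade("What animal is this?", confident_weak, strong_model)[1])  # weak
print(cascade("What animal is this?", flattened_weak, strong_model)[1])  # strong
```

In this toy setup the flattened weak model still picks the same top token, yet the query is deferred. That mirrors the summary's point that the attack changes compute allocation without needing to change the answer.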

Key takeaway

For AI security engineers and MLLM system architects deploying cascades, this research highlights a critical vulnerability: the Forced Deferral Attack. Adversarial images can lower the weak model's confidence and force queries onto the expensive strong model, raising operational cost without directly targeting answer correctness. Prioritize robust confidence estimation and defenses against universal adversarial triggers to protect compute allocation and keep the cascade cost-efficient.

Key insights

Adversarial attacks can lower the weak model's confidence in an MLLM cascade, forcing use of the expensive strong model and increasing computational cost.

Principles

Method

FDA learns a universal border trigger by optimizing a temperature-flattened objective. The objective pushes the weak model's token distribution on triggered inputs toward less concentrated targets derived from its clean responses.
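
The summary does not give the exact objective, so the following is one plausible reading rather than the authors' formulation. It assumes the target is the clean token distribution re-softmaxed at a temperature above 1, that the loss is a KL divergence from that target to the triggered distribution, and that a border trigger overwrites a fixed-width frame of pixels with a pattern shared across images. The names `apply_border_trigger`, `softmax`, `entropy`, and `kl_divergence`, and the example logits, are hypothetical. The gradient-based optimization of the trigger is omitted.

```python
import math
from typing import List, Sequence


def apply_border_trigger(image: List[List[float]], trigger: List[List[float]],
                         width: int) -> List[List[float]]:
    """Overwrite a frame of `width` pixels with the shared trigger pattern."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(h):
        for c in range(w):
            if r < width or r >= h - width or c < width or c >= w - width:
                out[r][c] = trigger[r][c]
    return out


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; temperature > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p: Sequence[float]) -> float:
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q), with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token logits from the weak model on a clean input.
clean_logits = [4.0, 1.0, 0.5, 0.0]
sharp = softmax(clean_logits, temperature=1.0)        # clean distribution
flat_target = softmax(clean_logits, temperature=4.0)  # assumed attack target

# Assumed loss: an attacker would minimize this over the trigger pixels,
# with the triggered distribution in place of `sharp`.
loss = kl_divergence(flat_target, sharp)

print(max(sharp) > max(flat_target))          # True: target is less peaked
print(entropy(flat_target) > entropy(sharp))  # True: target has higher entropy
```

Temperature scaling keeps the ranking of tokens but shrinks the gap between them, which is why a target built this way lowers confidence-style scores without necessarily changing the top token.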

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.