Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Summary
Multimodal web agents, which use both visual screenshots and accessibility trees to interact with web pages, are vulnerable to cross-modal adversarial attacks, in which malicious content injected into the webpage DOM corrupts both observation channels at once. A vulnerability analysis on MiniWob++ showed that attacks with a visual component significantly outperform text-only injections, exposing a gap in current text-centric safety training for Vision-Language Models (VLMs). To address this, the authors propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST). DMAST formalizes the agent-attacker interaction as a two-player zero-sum Markov game and uses a three-stage co-training pipeline: imitation learning, oracle-guided supervised fine-tuning with a zero-acknowledgment strategy, and adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. According to the summary, DMAST substantially reduces adversarial risk, doubles task completion efficiency on out-of-distribution tasks, and outperforms existing defenses.
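As a rough illustration of what a two-player zero-sum Markov game objective generally looks like, the standard minimax form is sketched below. All symbols are assumptions introduced here for illustration and are not taken from the paper: \pi_A is the agent policy, \pi_B the attacker policy, s_t the state, a_t and b_t the agent and attacker actions, r_A the agent reward, and \gamma a discount factor.

```latex
% Generic zero-sum Markov game objective (illustrative notation, not the paper's)
\max_{\pi_A} \; \min_{\pi_B} \;
  \mathbb{E}_{\pi_A,\,\pi_B}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r_A(s_t, a_t, b_t) \right],
\qquad
r_B(s_t, a_t, b_t) = -\, r_A(s_t, a_t, b_t)
```

The zero-sum condition is the second expression: whatever the agent gains, the attacker loses, so improving one player's policy puts direct pressure on the other.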
Key takeaway
If you develop multimodal web agents, treat cross-modal attacks as a heightened risk. Prioritize safety training that accounts for simultaneous corruption of the visual and accessibility-tree inputs, rather than relying on text-only VLM defenses. A multi-stage adversarial training framework such as DMAST can significantly improve agent robustness and efficiency against sophisticated, unseen threats.
Key insights
Cross-modal attacks on multimodal web agents exploit dual observation channels, necessitating specialized safety training.
Principles
- Visual attack components amplify adversarial effectiveness.
- Co-evolutionary training enhances agent robustness.
- Zero-acknowledgment strategy improves task reasoning.
Method
DMAST co-trains agents and attackers through imitation learning, oracle-guided supervised fine-tuning using a zero-acknowledgment strategy, and adversarial reinforcement learning via GRPO self-play.
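The GRPO step can be made more concrete with a minimal sketch of the group-relative advantage that gives GRPO its name: each rollout's reward is normalized against the other rollouts sampled for the same task. This is not the paper's implementation. The function names, the use of population standard deviation, the `eps` constant, and the toy rewards are all assumptions for illustration; the attacker reward is taken as the negated agent reward, following the zero-sum framing.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    `rewards` holds one scalar per rollout sampled for the same task.
    `eps` (an assumed constant) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def zero_sum_rewards(agent_rewards):
    """Attacker reward under a zero-sum game: the negated agent reward."""
    return [-r for r in agent_rewards]


# Toy group of four rollouts: 1.0 = agent completed the task safely.
agent_rewards = [1.0, 0.0, 0.0, 1.0]
agent_adv = group_relative_advantages(agent_rewards)
attacker_adv = group_relative_advantages(zero_sum_rewards(agent_rewards))
```

Because the advantages are centered within each group, no separate value network is needed to provide a baseline, which is one reason GRPO is convenient for self-play setups like the one described.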
In practice
- Integrate visual attack vectors into agent safety testing.
- Apply DMAST's three-stage pipeline for robust agent training.
- Consider GRPO for adversarial reinforcement learning.
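One way to act on the first item is to report attack success rate separately for each attack type, so that text-only and cross-modal injections can be compared directly. The sketch below assumes a hypothetical episode log format (the `attack` and `attack_succeeded` fields are invented for illustration), and the sample records are placeholders, not results from the paper.

```python
from collections import defaultdict


def attack_success_rate(episodes):
    """Return the fraction of successful attacks per attack type.

    Each episode is a dict with a hypothetical schema:
      "attack": label of the attack type used in the episode
      "attack_succeeded": True if the agent followed the injection
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ep in episodes:
        totals[ep["attack"]] += 1
        hits[ep["attack"]] += int(ep["attack_succeeded"])
    return {name: hits[name] / totals[name] for name in totals}


# Placeholder records for illustration only.
episodes = [
    {"attack": "text_only", "attack_succeeded": False},
    {"attack": "text_only", "attack_succeeded": True},
    {"attack": "cross_modal", "attack_succeeded": True},
    {"attack": "cross_modal", "attack_succeeded": True},
]
rates = attack_success_rate(episodes)
```

Keeping the breakdown per attack type makes it visible when a defense that holds up against text-only injections still fails once a visual component is added.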
Topics
- Multimodal Web Agents
- Adversarial Safety Training
- Cross-Modal Attacks
- Reinforcement Learning
- Group Relative Policy Optimization
Best for: AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.