CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents
Summary
CAPED (Context-Aware Privacy Exposure Defense) is a phone-side pre-upload protection layer for mobile GUI agents that mitigates "incidental visual privacy exposure." The system extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and exposes only the content essential to the current task, masking incidental private data before screenshots are sent to a remote multimodal agent. Evaluation used AndroidWorld for broad task utility and a controlled 28-task seeded privacy evaluation. Full CAPED reduced success-conditioned weighted seeded leakage (WSLR) from 0.766 under raw screenshots to 0.268 while maintaining a task utility of 0.929. A broader AndroidWorld run showed a prototype-level utility cost: CAPED completed 64 of 116 tasks (55.2%), compared with 77 tasks (66.4%) for the unprotected baseline. The results support treating screenshot upload as an explicit device-cloud boundary decision governed by task-driven selective exposure.
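The reported figures can be cross-checked with simple arithmetic. The sketch below only restates numbers from the summary; the relative reduction and the percentage-point gap are derived here for illustration and are not values quoted from the source.

```python
# Figures quoted in the summary.
wslr_raw, wslr_caped = 0.766, 0.268
caped_done, baseline_done, total_tasks = 64, 77, 116

# Derived quantities (computed here, not reported by the source).
relative_reduction = (wslr_raw - wslr_caped) / wslr_raw
caped_rate = caped_done / total_tasks
baseline_rate = baseline_done / total_tasks
utility_gap_points = (baseline_rate - caped_rate) * 100

print(f"WSLR relative reduction: {relative_reduction:.1%}")
print(f"CAPED success rate:      {caped_rate:.1%}")
print(f"Baseline success rate:   {baseline_rate:.1%}")
print(f"Utility gap:             {utility_gap_points:.1f} percentage points")
```

Under these numbers, leakage drops by roughly 65% in relative terms, while the broader AndroidWorld run gives up about 11 percentage points of task success.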
Key takeaway
AI engineers developing mobile GUI agents should integrate phone-side pre-upload privacy controls like CAPED to limit incidental visual data leakage. This approach exposes task-relevant content while masking sensitive, task-irrelevant information, which reduces privacy risk at some cost in task success. Consider implementing local task interpretation and context-aware selective exposure to balance utility and user privacy, and treat each screenshot upload as a device-cloud boundary decision.
Key insights
Mobile GUI agents require task-driven selective exposure to prevent incidental visual privacy leakage at the device-cloud boundary.
Principles
- Protect before upload on the phone side.
- Preserve task utility through element granularity.
- Use screen context as a privacy prior.
Method
CAPED extracts task requirements locally, classifies the screen context, parses UI elements, resolves an exposure decision for each element based on task relevance and element modality, and then redacts the screenshot before upload.
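The stages of the method can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: `UIElement`, `APP_CONTEXT`, `SENSITIVE_CONTEXTS`, the keyword-based task extraction, and the rule that non-text elements are never treated as task-relevant are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    text: str
    modality: str                      # e.g. "text" or "image"
    bbox: tuple[int, int, int, int]    # (left, top, right, bottom)


# Hypothetical app-to-context mapping; not taken from the paper.
APP_CONTEXT = {"sms": "messaging", "bank": "finance", "clock": "utility"}
SENSITIVE_CONTEXTS = {"messaging", "finance"}


def extract_task_terms(task: str) -> set[str]:
    """Toy stand-in for local task requirement extraction."""
    words = (w.strip(".,:;!?\"'").lower() for w in task.split())
    return {w for w in words if len(w) > 4}


def classify_context(app: str) -> str:
    """Toy stand-in for screen-context classification."""
    return APP_CONTEXT.get(app, "unknown")


def should_expose(el: UIElement, task_terms: set[str], context: str) -> bool:
    """Resolve one exposure decision from task relevance, modality, context."""
    if el.modality == "text":
        relevant = any(term in el.text.lower() for term in task_terms)
    else:
        relevant = False  # assumption: non-text elements need separate checks
    if relevant:
        return True
    # Context is the prior: mask by default on sensitive or unknown screens.
    return context not in SENSITIVE_CONTEXTS and context != "unknown"


def plan_redaction(task: str, app: str, elements: list[UIElement]):
    """Return the bounding boxes to mask before the screenshot is uploaded."""
    terms = extract_task_terms(task)
    context = classify_context(app)
    return [el.bbox for el in elements if not should_expose(el, terms, context)]


elements = [
    UIElement("Reply to Alice", "text", (0, 0, 200, 40)),
    UIElement("Your code is 482913", "text", (0, 50, 200, 90)),
    UIElement("profile photo", "image", (0, 100, 64, 164)),
]
masked = plan_redaction("Reply to Alice with hello", "sms", elements)
print(masked)
```

In this example the reply field stays visible because it matches the task, while the one-time code and the profile photo are marked for masking because the messaging screen defaults to a protective posture.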
In practice
- Implement local task requirement extraction.
- Apply context-aware default privacy postures.
- Use modality-specific verification for elements.
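One way to combine the last two practices is a default-posture table keyed by screen context, with a per-modality verifier that can override it. The sketch below is an assumption-laden illustration: the posture values, the verifier functions, and the fail-closed handling of unknown contexts are hypothetical choices, not details confirmed by the source.

```python
from typing import Callable

# Hypothetical default postures per screen context (not from the paper).
DEFAULT_POSTURE = {"finance": "mask", "messaging": "mask", "settings": "expose"}


def verify_text(content: str, task_terms: set[str]) -> bool:
    """Text passes verification when it mentions a task term."""
    return any(term in content.lower() for term in task_terms)


def verify_image(label: str, task_terms: set[str]) -> bool:
    """Images are checked via their label only; unlabeled images never pass."""
    return bool(label) and verify_text(label, task_terms)


# Dispatch table: one verifier per element modality.
VERIFIERS: dict[str, Callable[[str, set[str]], bool]] = {
    "text": verify_text,
    "image": verify_image,
}


def decide(context: str, modality: str, content: str, task_terms: set[str]) -> str:
    """Expose verified task-relevant elements; otherwise use the context default."""
    verifier = VERIFIERS.get(modality)
    if verifier is not None and verifier(content, task_terms):
        return "expose"
    # Unknown contexts fail closed (assumption).
    return DEFAULT_POSTURE.get(context, "mask")


print(decide("finance", "text", "Transfer to savings", {"transfer"}))
print(decide("finance", "text", "Balance: $1,204", {"transfer"}))
```

Keeping the posture table separate from the verifiers lets the default for a context change without touching the per-modality checks.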
Topics
- Mobile GUI Agents
- Visual Privacy Exposure
- Context-Aware Redaction
- Device-Cloud Security
- AndroidWorld Benchmark
- Multimodal Models
Best for: Research Scientist, AI Scientist, AI Security Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.