CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

CAPED (Context-Aware Privacy Exposure Defense) is a phone-side pre-upload protection layer for mobile GUI agents that mitigates "incidental visual privacy exposure." The system extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and exposes only the content essential to the current task, masking incidental private data before screenshots are sent to a remote multimodal agent. It was evaluated on AndroidWorld for broad task utility and on a controlled 28-task seeded privacy evaluation. In the seeded evaluation, Full CAPED reduced success-conditioned weighted seeded leakage (WSLR) from 0.766 under raw screenshots to 0.268 while maintaining high task utility (0.929). The broader AndroidWorld run showed a prototype-level utility cost: CAPED completed 64 of 116 tasks (55.2%), compared with 77 of 116 (66.4%) for the unprotected baseline. The results emphasize treating screenshot upload as an explicit device–cloud boundary decision, governed by task-driven selective exposure.
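The figures above can be put on a common footing with a few lines of arithmetic. The sketch below only re-derives quantities from the numbers quoted in this summary; the relative reduction and the percentage-point gap are computed here for convenience and are not values reported by the source. Variable names are illustrative.

```python
# Numbers quoted in the summary above.
raw_wslr, caped_wslr = 0.766, 0.268      # seeded 28-task evaluation
total_tasks = 116                        # broader AndroidWorld run
caped_done, baseline_done = 64, 77

# Derived quantities (computed here, not reported by the source).
relative_reduction = (raw_wslr - caped_wslr) / raw_wslr
caped_rate = caped_done / total_tasks
baseline_rate = baseline_done / total_tasks
gap_points = (baseline_rate - caped_rate) * 100

print(f"Relative WSLR reduction: {relative_reduction:.1%}")
print(f"Completion: CAPED {caped_rate:.1%}, baseline {baseline_rate:.1%}")
print(f"Utility gap: {gap_points:.1f} percentage points")
```

Note that the leakage figures and the completion figures come from two different evaluations, so they describe the privacy gain and the utility cost separately rather than a single trade-off curve.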

Key takeaway

AI engineers building mobile GUI agents should integrate phone-side pre-upload privacy controls such as CAPED to prevent incidental visual data leakage. The approach keeps task-relevant content visible while masking sensitive, task-irrelevant information; in the seeded evaluation it cut WSLR from 0.766 to 0.268. Consider implementing local task interpretation and context-aware selective exposure to balance utility against user privacy, and treat each screenshot upload as an explicit device-cloud boundary decision.

Key insights

Mobile GUI agents require task-driven selective exposure to prevent incidental visual privacy leakage at the device-cloud boundary.

Method

CAPED extracts task requirements locally, classifies screen context, parses UI elements, and resolves exposure decisions based on task relevance and element modality, then redacts screenshots.
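The steps above can be sketched as a small pipeline. This is a minimal illustration under strong assumptions, not the authors' implementation: every name (`UIElement`, `extract_task_requirements`, `classify_screen_context`, `resolve_exposure`, `redact`), the keyword lists, and the regex are hypothetical. UI parsing is assumed to have already produced a list of elements, and masking is modeled as replacing an element's text rather than editing pixels.

```python
import re
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class UIElement:
    text: str
    modality: str   # assumed categories: "text", "image", "icon"
    bounds: tuple   # (left, top, right, bottom) in pixels

# Hypothetical cues; a real system would likely use a local model instead.
SENSITIVE_CONTEXTS = {
    "messages": ("chat", "sms", "inbox"),
    "banking": ("balance", "account"),
}
# Crude stand-in for a private-data detector: long digit runs or "@".
PRIVATE_PATTERN = re.compile(r"\d{4,}|@")

def extract_task_requirements(task: str) -> set:
    """Derive the terms the task needs to see (naive tokenization)."""
    return {w for w in re.findall(r"[a-z0-9]+", task.lower()) if len(w) > 2}

def classify_screen_context(elements) -> str:
    """Label the screen; the label acts as a privacy prior."""
    blob = " ".join(e.text.lower() for e in elements)
    for label, cues in SENSITIVE_CONTEXTS.items():
        if any(cue in blob for cue in cues):
            return label
    return "generic"

def resolve_exposure(el: UIElement, requirements: set, context: str) -> bool:
    """Return True if the element may be uploaded unmasked."""
    words = set(re.findall(r"[a-z0-9]+", el.text.lower()))
    if words & requirements:
        return True                     # task-relevant content stays visible
    if context != "generic":
        return el.modality == "icon"    # sensitive screen: keep only icons
    return PRIVATE_PATTERN.search(el.text) is None

def redact(elements, task: str):
    """Mask every element that the exposure decision rejects."""
    requirements = extract_task_requirements(task)
    context = classify_screen_context(elements)
    return [
        e if resolve_exposure(e, requirements, context)
        else replace(e, text="[MASKED]")
        for e in elements
    ]

screen = [
    UIElement("Inbox", "text", (0, 0, 200, 40)),
    UIElement("Send", "icon", (0, 50, 40, 90)),
    UIElement("Alice: my PIN is 4821", "text", (0, 100, 400, 140)),
    UIElement("Reply to Bob", "text", (0, 150, 400, 190)),
]
for element in redact(screen, "Reply to Bob with hello"):
    print(element.text)
```

In this toy run the screen is classified as a messaging context, so only the navigation icon and the element that overlaps the task wording stay visible. The design point it illustrates is that the exposure decision combines three inputs (task relevance, screen context, element modality) before anything leaves the device.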

Best for: Research Scientist, AI Scientist, AI Security Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.