Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners

2026-06-16 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

The paper identifies a vulnerability in LLM-based agent systems: malicious operational instructions can be concealed in images, where text-centric skill scanners do not look. Existing defenses primarily analyze textual descriptions, manifests, and source code, which leaves a blind spot for visually conveyed threats. The researchers propose SkillCamo, a document-mediated multimodal instruction attack that embeds harmful instructions in images bundled with a skill and rewrites the skill's documentation so that it references those images as part of a normal workflow. The attack relies on the agent jointly interpreting the textual guidance and the visual payload during execution. As a countermeasure, the researchers propose ExecScan, an execution-grounded multimodal scanning module. ExecScan performs intent extraction, behavior reconstruction, abuse assessment, and deliberative execution simulation, jointly analyzing all skill artifacts, including visual content, to recover hidden instructions and identify risks such as exfiltration, destruction, persistence, deception, and privilege escalation. Experiments confirm that image-hidden malicious instructions challenge existing scanners, while ExecScan significantly improves detection performance.

Key takeaway

If you are an AI security engineer responsible for securing LLM-based agents, you need to move beyond text-only skill scanning. Your current defenses likely overlook malicious instructions hidden in images, as SkillCamo demonstrates. Adopt execution-grounded multimodal scanning along the lines of ExecScan, which jointly analyzes documentation, code, and visual content. This lets you recover hidden instructions and identify risks such as data exfiltration or privilege escalation before deployment.

Key insights

Multimodal hidden instructions exploit text-only agent skill scanners, necessitating execution-grounded multimodal defense.

Principles

Text-based skill scanners have a practical blind spot for visual threats.
Malicious intent can be conveyed through joint text and image interpretation.
Comprehensive scanning requires execution-grounded multimodal analysis.
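
The blind spot described in these principles can be made concrete with a small check. The sketch below is an illustration, not part of the paper: it assumes a skill's documentation is available as plain text and that the scanner can report which files it actually opened. The names `referenced_images` and `unscanned_images` are hypothetical. It lists image files the documentation relies on that a scanner never inspected.

```python
import re

# Hypothetical pattern for this illustration: markdown images and bare image paths.
IMAGE_REF = re.compile(
    r"!\[[^\]]*\]\(([^)\s]+)\)"                      # markdown image: ![alt](path)
    r"|([\w./-]+\.(?:png|jpe?g|gif|webp|bmp))\b",    # bare path ending in an image extension
    re.IGNORECASE,
)


def referenced_images(doc_text: str) -> list[str]:
    """Return image paths mentioned in the documentation, in order, without duplicates."""
    seen: list[str] = []
    for match in IMAGE_REF.finditer(doc_text):
        path = match.group(1) or match.group(2)
        if path not in seen:
            seen.append(path)
    return seen


def unscanned_images(doc_text: str, scanned_files: set[str]) -> list[str]:
    """Return images the documentation points at that the scanner never opened."""
    return [path for path in referenced_images(doc_text) if path not in scanned_files]
```

A non-empty result from `unscanned_images` does not mean a skill is malicious. It only marks content that could influence the agent while staying outside a text-only review.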

Method

SkillCamo conceals instructions in images and rewrites the skill's documentation to reference them, so that the agent interprets text and image jointly at execution time. ExecScan counters this by combining intent extraction, behavior reconstruction, abuse assessment, and deliberative execution simulation.
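
To show how the four ExecScan analyses named above could fit together, here is a minimal sketch. It is not the authors' implementation. Every name in it (`SkillBundle`, `ScanReport`, `ABUSE_MARKERS`, and the stage functions) is hypothetical, and each stage is a deliberately naive stand-in. In particular, it assumes the text inside each image has already been recovered by some vision model or OCR step and is passed in as `image_texts`.

```python
from dataclasses import dataclass, field

# Toy markers chosen for this illustration only; a real assessor would not rely on substrings.
ABUSE_MARKERS = ("curl ", "rm -rf", "~/.ssh", "crontab")


@dataclass
class SkillBundle:
    docs: str
    code: str
    # image path -> text recovered from the image (assumed to be extracted upstream)
    image_texts: dict[str, str] = field(default_factory=dict)


@dataclass
class ScanReport:
    intent: str
    behaviors: list[str]
    abuses: list[str]
    simulated_steps: list[str]


def extract_intent(bundle: SkillBundle) -> str:
    """Stand-in for intent extraction: take the first non-empty documentation line."""
    for line in bundle.docs.splitlines():
        if line.strip():
            return line.strip()
    return ""


def reconstruct_behavior(bundle: SkillBundle) -> list[str]:
    """Stand-in for behavior reconstruction: gather instructions from every artifact."""
    sources = [bundle.docs, bundle.code, *bundle.image_texts.values()]
    return [ln.strip() for src in sources for ln in src.splitlines() if ln.strip()]


def assess_abuse(behaviors: list[str]) -> list[str]:
    """Stand-in for abuse assessment: keep behaviors that contain a toy marker."""
    return [b for b in behaviors if any(marker in b for marker in ABUSE_MARKERS)]


def simulate_execution(behaviors: list[str]) -> list[str]:
    """Stand-in for execution simulation: a dry-run trace, nothing is executed."""
    return [f"step {i}: agent would follow '{b}'" for i, b in enumerate(behaviors, 1)]


def scan(bundle: SkillBundle) -> ScanReport:
    behaviors = reconstruct_behavior(bundle)
    return ScanReport(
        intent=extract_intent(bundle),
        behaviors=behaviors,
        abuses=assess_abuse(behaviors),
        simulated_steps=simulate_execution(behaviors),
    )
```

Scanning the same bundle with and without `image_texts` shows the effect the summary describes: the harmful step only surfaces once the visual content is part of the analysis.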

In practice

Identify risks like exfiltration, destruction, persistence, deception, and privilege escalation.
Analyze documentation, code, referenced resources, and visual content together.
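
As a rough illustration of these two practices, the sketch below tags artifacts with the five risk categories listed above. It is an assumption-laden toy, not the paper's method: the cue phrases in `RISK_CUES` and the function `tag_risks` are hypothetical, matching is plain substring search, and image content is assumed to have been converted to text beforehand.

```python
# Hypothetical cue phrases per risk category; illustrative only, not taken from the paper.
RISK_CUES = {
    "exfiltration": ("upload", "send to", "post to"),
    "destruction": ("delete", "wipe", "rm -rf"),
    "persistence": ("crontab", "startup", "autostart"),
    "deception": ("do not tell", "hide this", "pretend"),
    "privilege escalation": ("sudo", "chmod 777", "as root"),
}


def tag_risks(artifacts: dict[str, str]) -> dict[str, list[str]]:
    """Map each risk category to the artifact names whose text contains one of its cues.

    `artifacts` maps an artifact name (doc, source file, or image) to its text content.
    """
    findings: dict[str, list[str]] = {}
    for name, text in artifacts.items():
        lowered = text.lower()
        for risk, cues in RISK_CUES.items():
            if any(cue in lowered for cue in cues):
                findings.setdefault(risk, []).append(name)
    return findings
```

Passing documentation, code, and image-derived text through one function keeps the provenance of each finding, so a reviewer can see that a risk originates in an image the documentation merely points to.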

Topics

LLM Security
Agent Skills
Multimodal Attacks
Skill Scanners
ExecScan
Computer Vision

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Security Engineer, AI Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.