A Theoretical Game of Attacks via Compositional Skills

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

A new theoretical framework formalizes the adversarial game between an attacker who composes skills to hide malicious intent and a resource-constrained defender who filters the prompts and responses of a large language model (LLM). The framework introduces a best-response attack strategy and proves that it outperforms existing adversarial prompting methods such as fixed-skill and optimization-based attacks. The analysis characterizes the game's equilibria and reveals inherent advantages for the attacker, particularly as the skill composition space expands. From this analysis the authors derive a provably optimal defense strategy that actively misleads the attacker by distorting its perception of the system's weak points. Empirical evaluations use an LLM-based rater (GPT-4.1) and judge (LLaMA-3-70B) on datasets such as JBB-Behaviors and MaliciousInstructions. They show that a practical instantiation of the best-response attack outperforms existing methods across several LLMs (e.g., GPT-3.5-Turbo-1106, Llama-2-7B-chat-hf) and that the proposed defense significantly reduces attack performance.

Key takeaway

For security architects and red-teaming specialists evaluating LLM vulnerabilities, this research indicates that traditional defenses focused solely on scaling capacity are insufficient against sophisticated, skill-compositional attacks. You should prioritize implementing defense mechanisms that actively mislead attackers by distorting perceived system weaknesses, rather than just increasing filtering robustness. Consider integrating multi-stage probing and response aggregation in your red-teaming efforts to accurately assess the true risk of intent-hiding attacks, as single-prompt evaluations may underestimate actual exploitability.
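One way to see why a single-prompt evaluation can understate risk is a short arithmetic sketch. It assumes, purely for illustration, that repeated attempts succeed independently with the same per-attempt probability; that independence assumption and the function name `at_least_one_success` are ours, not the paper's.

```python
def at_least_one_success(p_single: float, k: int) -> float:
    """Probability that at least one of k attempts succeeds.

    Assumes independent attempts with identical per-attempt success
    probability p_single (an illustrative simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_single) ** k


if __name__ == "__main__":
    # Hypothetical numbers: a 5% single-prompt success rate, retried 20 times.
    print(round(at_least_one_success(0.05, 20), 3))
```

Under these assumptions, a filter that looks 95% effective against one prompt lets roughly 64% of 20-attempt campaigns through at least once, which is the gap an aggregated, multi-stage evaluation is meant to expose.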

Key insights

Game-theoretic analysis shows that attackers gain an advantage by composing skills to hide intent, but a defense that misleads them can mitigate it.

Method

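The best-response attack runs in two stages: first probe the LLM with skill-intent combinations, then concentrate attacks on the weak points the probes identify. The defense misleads the attacker by exposing an incorrect performance distribution, so the weak points it perceives are not the real ones.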
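The probe-then-concentrate loop and the misleading defense can be sketched as an abstract toy simulation for red-team evaluation planning. This is not the paper's algorithm: the cell names, the success rates, the `distort` hook, and the assumption that the defender distorts only probe-stage feedback are all hypothetical choices made for illustration.

```python
import random


def probe(true_rates, n_probes, rng, distort=None):
    """Stage 1: estimate each cell's success rate from n_probes trials.

    true_rates maps (skill, intent) cells to a true success probability.
    distort, if given, maps (cell, true_rate) to the rate the defender
    lets the attacker observe during probing (hypothetical model).
    """
    estimates = {}
    for cell, p in true_rates.items():
        shown = distort(cell, p) if distort else p
        hits = sum(rng.random() < shown for _ in range(n_probes))
        estimates[cell] = hits / n_probes
    return estimates


def best_response(estimates):
    """Stage 2: concentrate on the cell with the highest estimated rate."""
    return max(estimates, key=estimates.get)


def simulate(true_rates, n_probes=200, seed=0, distort=None):
    """Return the chosen cell and its TRUE success rate."""
    rng = random.Random(seed)
    target = best_response(probe(true_rates, n_probes, rng, distort))
    return target, true_rates[target]


if __name__ == "__main__":
    # Hypothetical cells and rates; real values would come from measurement.
    rates = {
        ("skill_a", "intent_x"): 0.05,
        ("skill_b", "intent_x"): 0.60,  # the genuine weak point
        ("skill_c", "intent_x"): 0.10,
    }
    print("honest feedback:", simulate(rates))
    # Misleading defender: make strong cells look weak and vice versa.
    print("distorted feedback:", simulate(rates, distort=lambda c, p: 0.9 - p))
```

With honest feedback the simulated attacker settles on the genuine weak point; with distorted feedback it commits to a cell that is actually well defended, which is the effect the misleading defense aims for.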

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.