A Theoretical Game of Attacks via Compositional Skills

2025-04-14 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

A new theoretical framework formalizes the adversarial game between an attacker who composes skills to hide malicious intent and a resource-constrained defender who filters prompts and responses in large language models (LLMs). The framework introduces a best-response attack strategy and proves that it outperforms existing adversarial prompting methods such as fixed-skill and optimization-based attacks. The research characterizes the game's equilibria and shows an inherent advantage for the attacker that grows as the skill composition space expands. From this analysis the authors derive a provably optimal defense strategy, which actively misleads the attacker by distorting the weak points the attacker perceives. Empirical evaluations use an LLM-based rater (GPT-4.1) and judge (LLaMA-3-70B) on datasets such as JBB-Behaviors and MaliciousInstructions. They show that the practical instantiation of the best-response attack outperforms existing methods across several LLMs (e.g., GPT-3.5-Turbo-1106, Llama-2-7B-chat-hf) and that the proposed defense significantly reduces attack performance.

Key takeaway

For security architects and red-teaming specialists evaluating LLM vulnerabilities, this research indicates that defenses relying solely on scaling filtering capacity are insufficient against skill-compositional attacks. Prioritize defense mechanisms that actively mislead attackers by distorting perceived system weaknesses, rather than only increasing filtering robustness. In red-teaming, consider multi-stage probing and response aggregation to assess the true risk of intent-hiding attacks, since single-prompt evaluations may underestimate actual exploitability.

Key insights

Game theory reveals attackers gain advantage by composing skills to hide intent, but a misleading defense can mitigate this.

Principles

Attacker utility increases with skill composition space size.
Defender capacity scaling alone is insufficient against complex attacks.
Optimal defense misleads attackers to concentrate resources on fake weak points.
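
The first and third principles can be illustrated with a toy model. This is a simplification assumed here for illustration, not the paper's formal game. Let p_s be the true success probability of skill composition s, let S be the set of compositions available to the attacker, and let \hat{p}_s be the success probability the attacker perceives after probing.

```latex
% Best-response utility is a maximum, so it cannot shrink as the composition space grows:
U(S) = \max_{s \in S} p_s,
\qquad
S \subseteq S' \;\Longrightarrow\; U(S) \le U(S').

% An attacker acting on distorted estimates \hat{p}_s picks \hat{s} and earns the true rate p_{\hat{s}}:
\hat{s} = \arg\max_{s \in S} \hat{p}_s,
\qquad
p_{\hat{s}} \le \max_{s \in S} p_s = U(S).
```

In this toy reading, a misleading defense works by widening the gap between p_{\hat{s}} and U(S). It does not try to lower every p_s at once.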

Method

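The best-response attack has two stages: first probe the LLM with skill-intent combinations, then concentrate the attack on the weak points identified. The defense misleads the attacker by exposing an incorrect performance distribution, so that the perceived weak points differ from the real ones.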
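
The probe-then-concentrate dynamic can be sketched as an abstract numerical simulation. This is a minimal sketch under stated assumptions, not the paper's implementation. Each skill composition is reduced to a hypothetical success probability, the function names `probe` and `concentrate` are illustrative, and the defender is assumed to be able to show a different distribution (`reported_rates`) during probing than the one that governs real attack outcomes.

```python
import random

def probe(true_rates, n_probes, rng, reported_rates=None):
    """Stage 1: estimate a success rate for each skill composition.

    If reported_rates is given, probing observes that (distorted)
    distribution instead of the true one. This models a misleading defense.
    """
    shown = true_rates if reported_rates is None else reported_rates
    return [
        sum(rng.random() < p for _ in range(n_probes)) / n_probes
        for p in shown
    ]

def concentrate(estimates, true_rates, n_attacks, rng):
    """Stage 2: spend the whole budget on the apparent weakest point.

    Outcomes follow the true rates, whatever the estimates suggested.
    """
    target = max(range(len(estimates)), key=estimates.__getitem__)
    wins = sum(rng.random() < true_rates[target] for _ in range(n_attacks))
    return target, wins / n_attacks

# Hypothetical numbers: composition 2 is the real weak point.
true_rates = [0.05, 0.10, 0.60, 0.15]
# Assumed distortion: the defender makes composition 0 look weakest.
distorted = [0.60, 0.10, 0.05, 0.15]

rng = random.Random(0)
honest_target, honest_rate = concentrate(
    probe(true_rates, 200, rng), true_rates, 500, rng
)
misled_target, misled_rate = concentrate(
    probe(true_rates, 200, rng, reported_rates=distorted), true_rates, 500, rng
)
print("undistorted probing:", honest_target, honest_rate)
print("distorted probing:  ", misled_target, misled_rate)
```

With these assumed numbers, the attacker who probes the true distribution concentrates on composition 2. The attacker who probes the distorted one concentrates on composition 0, where the true success rate is far lower.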

In practice

Use GPT-4.1 as an LLM-based rater for helpfulness evaluation.
Generate multiple prompts per intent to aggregate complementary information.
Prioritize defense resources on most probable intents and their perceived weak points.
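
Single-prompt evaluation can understate risk, and a simple calculation shows why. The sketch below is an illustration, not the paper's aggregation procedure. It assumes that prompts for the same intent succeed independently, and it computes the chance that at least one of them succeeds. The function name `aggregated_success` is hypothetical.

```python
from math import prod

def aggregated_success(per_prompt_rates):
    """Probability that at least one prompt succeeds, assuming independence."""
    return 1.0 - prod(1.0 - p for p in per_prompt_rates)

# Three prompts that each succeed 20% of the time.
single = 0.2
combined = aggregated_success([single] * 3)
print(f"single prompt: {single:.3f}, three prompts: {combined:.3f}")
```

Under the independence assumption, three prompts at 0.2 each give a combined rate of 1 - 0.8^3 = 0.488. Real prompts for one intent are likely correlated, so treat the number as illustrative.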

Topics

Game Theory
Adversarial Prompting
Large Language Models
Compositional Skills
LLM Defense Mechanisms

Code references

meta-llama/llama3

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.