STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
Summary
STAR-Teaming is a black-box framework for automated red teaming of Large Language Models (LLMs) that targets jailbreak vulnerabilities. It combines a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and uses network-driven optimization to generate attack prompts. The multiplex network is built from past attack logs and models the statistical relationships between attack strategies and LLM responses. This makes LLM vulnerabilities easier to interpret and organizes the search space into semantic communities, which narrows the search for effective strategies and avoids redundant exploration. The framework reports an average attack success rate (ASR) of 74.5% on HarmBench, outperforming AutoDAN-Turbo by 13.5%, with higher efficiency and lower computational cost. It also supports dynamic network expansion, which raises ASR by 6.3 percentage points and reduces attack trials by 14.2% on Llama-2-7b-chat.
Key takeaway
For AI safety researchers and red teams working on LLM robustness, STAR-Teaming offers a more effective and efficient way to find vulnerabilities than the existing methods it was compared against. Its network-driven approach achieves higher attack success rates and also gives insight into why certain strategies work. Consider integrating the multiplex network framework into automated red-teaming pipelines, especially for hard-to-attack models such as Claude-3.5-Sonnet, and explore its dynamic expansion to keep pace with evolving defense behaviors.
Key insights
STAR-Teaming uses a multiplex network to efficiently discover LLM jailbreaks with high success rates and interpretability.
Principles
- Network-based strategy sampling enhances efficiency and interpretability.
- Dynamic network expansion improves adaptability to emerging attack patterns.
- Optimizing strategy selection as an Inverse Ising Problem is computationally efficient.
Method
STAR-Teaming constructs a strategy-response multiplex network from attack logs, identifies communities of strategies and responses, and optimizes an interaction matrix Z to probabilistically sample effective attack strategies for a Multi-Agent System.
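The sampling step described above can be illustrated with a minimal sketch. It is not the paper's optimization: here the interaction matrix Z is simply estimated as an empirical success rate per (response community, strategy community) pair, whereas the summary describes Z as being optimized via an Inverse Ising formulation. The log format, community labels, and function names are all hypothetical.

```python
import math
import random
from collections import defaultdict

# Hypothetical attack log: (strategy_community, response_community, success 0/1).
# The labels are placeholders, not categories taken from the paper.
ATTACK_LOG = [
    ("role_play", "refusal", 1),
    ("role_play", "refusal", 1),
    ("role_play", "partial_compliance", 0),
    ("encoding", "refusal", 0),
    ("encoding", "partial_compliance", 1),
    ("hypothetical_framing", "refusal", 0),
    ("hypothetical_framing", "refusal", 1),
]


def build_interaction_matrix(log):
    """Estimate Z[response][strategy] as an empirical success rate.

    Assumption: a stand-in for the optimized interaction matrix;
    unseen (response, strategy) pairs default to 0.0.
    """
    totals = defaultdict(lambda: [0, 0])  # (response, strategy) -> [successes, trials]
    for strategy, response, success in log:
        cell = totals[(response, strategy)]
        cell[0] += success
        cell[1] += 1
    strategies = sorted({s for s, _, _ in log})
    responses = sorted({r for _, r, _ in log})
    matrix = {}
    for r in responses:
        row = {}
        for s in strategies:
            successes, trials = totals[(r, s)]
            row[s] = successes / trials if trials else 0.0
        matrix[r] = row
    return matrix


def strategy_distribution(z_row, beta):
    """Boltzmann (softmax) distribution over strategies for one response community."""
    peak = max(z_row.values())  # subtract the max for numerical stability
    weights = {s: math.exp(beta * (z - peak)) for s, z in z_row.items()}
    norm = sum(weights.values())
    return {s: w / norm for s, w in weights.items()}


def sample_strategy(z_row, beta, rng):
    """Draw one strategy community according to the Boltzmann distribution."""
    probs = strategy_distribution(z_row, beta)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


Z = build_interaction_matrix(ATTACK_LOG)
probs = strategy_distribution(Z["refusal"], beta=2.0)
choice = sample_strategy(Z["refusal"], beta=2.0, rng=random.Random(0))
```

Sampling from a distribution rather than always taking the top-scoring strategy keeps lower-scoring communities reachable, which is the role the inverse-temperature parameter β plays in the summary.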
In practice
- Use gpt-4o-mini as a scorer and strategy extractor.
- Employ the Leiden algorithm for community detection in the networks.
- Dynamically adjust the inverse-temperature parameter β for exploration/exploitation.
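The effect of adjusting β can be seen in a short sketch. The summary does not say how β is adjusted, so the linear ramp below is purely an assumption, as are the function names, the bounds `beta_min` and `beta_max`, and the example scores. The point it illustrates is general: a low β spreads probability across strategies (exploration), and a high β concentrates it on the best-scoring ones (exploitation).

```python
import math


def softmax(scores, beta):
    """Boltzmann distribution over scores at inverse temperature beta."""
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(beta * (s - peak)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats; higher means more exploratory sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def beta_schedule(trial, max_trials, beta_min=0.5, beta_max=8.0):
    """Hypothetical linear ramp: explore early (low beta), exploit late (high beta)."""
    frac = min(max(trial / max_trials, 0.0), 1.0)
    return beta_min + frac * (beta_max - beta_min)


# Placeholder strategy scores, e.g. one row of an interaction matrix.
scores = [1.0, 0.5, 0.0]
for trial in (0, 5, 10):
    beta = beta_schedule(trial, max_trials=10)
    probs = softmax(scores, beta)
    print(f"trial={trial} beta={beta:.2f} entropy={entropy(probs):.3f}")
```

Other schedules, such as lowering β again after a run of failed trials, would fit the same interface; which one suits a given target model is an empirical question.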
Topics
- LLM Red Teaming
- Strategy-Response Multiplex Network
- Multi-Agent Systems
- Jailbreak Prompt Generation
- Network-driven Optimization
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.