STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

STAR-Teaming is a black-box framework for automated red teaming of Large Language Models (LLMs) that targets jailbreak vulnerabilities. It combines a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and uses network-driven optimization to generate attack prompts. The multiplex network is built from past attack logs and models the statistical relationships between attack strategies and LLM responses. Organizing the search space into semantic communities makes LLM vulnerabilities easier to interpret and avoids redundant exploration during the search for effective strategies. On HarmBench, the framework reports an average attack success rate (ASR) of 74.5%, outperforming AutoDAN-Turbo by 13.5%, with higher efficiency and lower computational cost. It also supports dynamic network expansion, which on Llama-2-7b-chat increases ASR by 6.3 percentage points and reduces attack trials by 14.2%.

Key takeaway

For AI safety researchers and red teams working on LLM robustness, STAR-Teaming offers a more effective way to find vulnerabilities than existing automated methods. Its network-driven approach achieves higher attack success rates and efficiency, and it also helps explain why certain strategies work. Consider integrating the multiplex network framework into your automated red-teaming pipeline, especially for challenging models like Claude-3.5-Sonnet, and explore its dynamic expansion capability to adapt to evolving defense behaviors.

Key insights

STAR-Teaming uses a multiplex network to efficiently discover LLM jailbreaks with high success rates and interpretability.

Method

STAR-Teaming constructs a strategy-response multiplex network from attack logs, identifies communities of strategies and responses, and optimizes an interaction matrix Z to probabilistically sample effective attack strategies for a Multi-Agent System.
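
The sampling step described above can be illustrated with a small sketch. It is not the paper's implementation: the summary does not say how Z is optimized, so this example assumes a simple smoothed success-rate estimate and a softmax sampler. The log format (strategy community id, response community id, success flag), the function names, and the `prior` and `temperature` parameters are all hypothetical. Community detection is assumed to have already happened, so the ids are plain integers.

```python
import math
import random


def build_interaction_matrix(logs, n_strategy, n_response, prior=1.0):
    """Estimate Z[s][r] from attack logs (illustrative assumption, not the paper's optimizer).

    Z[s][r] is a smoothed success rate for strategy community s when the
    target's reply fell in response community r.
    """
    wins = [[0.0] * n_response for _ in range(n_strategy)]
    tries = [[0.0] * n_response for _ in range(n_strategy)]
    for s, r, success in logs:
        tries[s][r] += 1.0
        if success:
            wins[s][r] += 1.0
    # Laplace-style smoothing keeps unseen pairs at 0.5 instead of 0.
    return [
        [(wins[s][r] + prior) / (tries[s][r] + 2.0 * prior) for r in range(n_response)]
        for s in range(n_strategy)
    ]


def sampling_probs(Z, r, temperature=0.5):
    """Softmax over column r of Z: one probability per strategy community."""
    scores = [row[r] / temperature for row in Z]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_strategy(Z, r, rng, temperature=0.5):
    """Draw a strategy community id given the observed response community r."""
    probs = sampling_probs(Z, r, temperature)
    return rng.choices(range(len(Z)), weights=probs, k=1)[0]


# Toy logs: (strategy community, response community, attack succeeded?)
logs = [(0, 0, True), (0, 0, True), (1, 0, False), (1, 1, True)]
Z = build_interaction_matrix(logs, n_strategy=2, n_response=2)
choice = sample_strategy(Z, r=0, rng=random.Random(0))
```

Sampling from a distribution, instead of always taking the highest-scoring community, keeps some exploration of less-tested strategies. A lower `temperature` makes the choice greedier.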

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.