Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models
Summary
A new compute-aware evaluation framework assesses adversarial robustness in large language models (LLMs) by considering the computational expense of attacks, measured in cumulative floating-point operations (FLOPs). This framework, which introduces risk-compute curves and two new metrics, addresses the limitation of traditional evaluations that fix query budgets and overlook varying attack costs. Across ten models from three families, evaluated with gradient-based, iterative refinement, and template-based strategies on two jailbreak benchmarks, researchers found that alignment training has non-monotonic effects on robustness. Scaling model size reduced gradient-based attack effectiveness but had limited impact on cheaper template-based attacks. Notably, gradient-based attacks optimized on a surrogate model could transfer to a target model, reducing attacker costs. Furthermore, compute cost varied by up to ≈5× across harm categories, and safety-aligned RL increased aggregate cost while leaving some categories disproportionately accessible. The framework is released to enable more accurate risk assessment.
Key takeaway
AI security engineers evaluating large language model robustness should move beyond fixed query budgets and incorporate computational pressure (cumulative FLOPs) into their assessments. The framework shows that alignment training affects robustness non-monotonically and that scaling model size does not deter all attack types equally. Evaluate compute cost per harm category, since it can vary by up to ≈5×, and account for attacks transferred from a surrogate model when gauging a model's real exposure.
Key insights
Adversarial robustness evaluations of LLMs should account for computational cost (FLOPs) to accurately reflect attacker effort and risk.
Principles
- Alignment training effects on robustness are non-monotonic.
- Scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks.
- Surrogate model attacks can transfer, lowering attacker costs.
Method
The framework uses computational pressure (cumulative FLOPs) as a proxy for adversarial effort. It introduces risk-compute curves and derives two metrics that summarize the average pressure required for an attack to succeed.
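To make the idea of a risk-compute curve concrete, here is a minimal sketch in Python. This summary does not give the paper's exact definitions, so everything here is an assumption made for illustration: the function names, the input format (cumulative FLOPs at first success per prompt, or `None` if the attack never succeeded), the `2 * parameters * tokens` forward-pass estimate (a common rule of thumb, not necessarily the paper's accounting), and the mean-cost summary, which is a stand-in and not one of the paper's two metrics.

```python
from bisect import bisect_right


def estimate_forward_flops(n_params, n_tokens):
    # Rule-of-thumb estimate (assumption): ~2 FLOPs per parameter per token
    # for one forward pass. Backward passes and attacker-side models would
    # need their own terms in a real accounting.
    return 2 * n_params * n_tokens


def risk_compute_curve(flops_to_success, budgets):
    # flops_to_success: one entry per prompt, the cumulative FLOPs spent when
    # the attack first succeeded, or None if it never succeeded.
    # Returns, for each budget, the fraction of ALL prompts broken at or
    # below that budget.
    n = len(flops_to_success)
    costs = sorted(c for c in flops_to_success if c is not None)
    return [bisect_right(costs, b) / n for b in budgets]


def mean_flops_to_success(flops_to_success):
    # Hypothetical summary: average cost over successful attacks only.
    costs = [c for c in flops_to_success if c is not None]
    return sum(costs) / len(costs) if costs else float("inf")


# Made-up numbers, for illustration only.
example_costs = [1e12, 5e12, None, 2e13]
example_budgets = [1e12, 1e13, 1e14]
print(risk_compute_curve(example_costs, example_budgets))  # [0.25, 0.5, 0.75]
```

Plotting the returned fractions against the budgets on a log-scaled x-axis gives a curve that can be compared across models or attack strategies at equal compute, which a fixed query budget cannot do.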
In practice
- Evaluate LLM robustness using compute-aware metrics.
- Assess attack costs across different harm categories.
- Explore surrogate models for cost-effective attack transfer.
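One way to act on the per-category advice above is to group attack costs by harm category and compare them. The sketch below is illustrative only: the record format, the function names, and the choice of the median as the per-category statistic are assumptions, not the paper's procedure.

```python
from collections import defaultdict
from statistics import median


def median_cost_by_category(records):
    # records: iterable of (category, flops) pairs, where flops is the
    # cumulative FLOPs at first success, or None if the attack failed.
    # Categories with no successful attack are omitted from the result.
    by_category = defaultdict(list)
    for category, flops in records:
        if flops is not None:
            by_category[category].append(flops)
    return {cat: median(costs) for cat, costs in by_category.items()}


def cost_spread(medians):
    # Ratio between the most and least expensive categories.
    return max(medians.values()) / min(medians.values())


# Made-up records with hypothetical category labels.
example_records = [
    ("category_a", 1e12),
    ("category_a", 3e12),
    ("category_b", 1e13),
    ("category_b", None),
]
example_medians = median_cost_by_category(example_records)
print(example_medians, cost_spread(example_medians))
```

A large spread flags categories that remain cheap to attack even when the aggregate cost looks high, which is the pattern the summary reports for safety-aligned RL.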
Topics
- LLM Adversarial Robustness
- Computational Pressure
- Risk-Compute Curves
- Alignment Training
- Gradient-Based Attacks
- Jailbreak Benchmarks
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.