Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework
Summary
The "Rapid Poison" attack framework demonstrates practical poisoning vulnerabilities against the Rapid Response (RR) framework, a system deployed in production, including Anthropic's ASL-3 safeguards, designed to continuously improve jailbreak-detection classifiers. RR generates synthetic variants from new jailbreaks to enhance model generalization and adaptation. Rapid Poison exploits prompt injection to deliver poisoned samples into the classifier's training set, enabling two objectives. First, targeted poisoning creates false positives on harmless samples, categorizing them as jailbreaks based on specific features like formatting or keywords. Second, concept-based backdoor attacks induce false negatives on actual jailbreak inputs, even those from strategies the defender explicitly trained against, when a backdoor trigger is present. This second objective is particularly challenging given the constraint that adversaries modify only jailbreak samples. The "Omission Attack" addresses this by exploiting a phenomenon where training on concept-absent unsafe samples misassociates the concept's presence with a safe label. Both attacks achieve substantial label flipping, with a 1% poisoning rate leading to up to 100% false positive rates and up to 96% false negative rates.
Key takeaway
For AI Security Engineers managing continuous learning systems like jailbreak detectors, you must recognize the severe risk of prompt injection-based data poisoning. Even with adversaries limited to modifying only jailbreak samples, a mere 1% poisoning rate can achieve up to 100% false positive rates on benign inputs or 96% false negatives on actual jailbreaks. Implement stringent input validation and anomaly detection within synthetic data generation pipelines to safeguard against targeted attacks and concept-based backdoors.
Key insights
Prompt injection can poison Rapid Response training data, causing targeted false positives and concept-based backdoor false negatives with high efficacy.
Principles
- Continuous learning systems face data poisoning risks.
- Synthetic data generation pipelines are exploitable.
- Minimal poisoning rates achieve high impact.
Method
The Omission Attack exploits training on concept-absent unsafe samples, causing the classifier to misassociate that concept's presence with a safe label, enabling concept-based backdoor attacks.
In practice
- Sanitize inputs feeding into training pipelines.
- Monitor synthetic data generation for anomalies.
- Evaluate classifier robustness to prompt injection.
Topics
- Rapid Response Framework
- Data Poisoning
- Prompt Injection
- Jailbreak Detection
- Backdoor Attacks
- AI Safety
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Machine Learning Engineer, AI Security Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.