HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
Summary
HarDBench is a new benchmark designed to evaluate the robustness of large language models (LLMs) against "draft-based co-authoring jailbreak attacks." These attacks exploit LLMs' collaborative writing capabilities: a user supplies an incomplete draft and prompts the model to complete it with harmful content. The benchmark covers high-risk domains such as Explosives, Drugs, Weapons, and Cyberattacks, using prompts with realistic structures and domain-specific cues to assess susceptibility. Initial experimental results indicate that current LLMs are highly vulnerable in co-authoring contexts. To address this, the work introduces a safety-utility balanced alignment approach based on preference optimization, which significantly reduces harmful outputs while maintaining performance on benign co-authoring tasks. This work establishes a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing.
Key takeaway
Research scientists and CTOs developing or deploying LLMs for collaborative writing should prioritize evaluating their models against draft-based jailbreak attacks using benchmarks like HarDBench. Safety-utility balanced alignment techniques, such as preference optimization, can mitigate the risk of generating harmful content without compromising the model's helpfulness on legitimate co-authoring tasks. This proactive approach is critical for ensuring responsible LLM deployment.
Key insights
LLMs are vulnerable to draft-based jailbreaks in co-authoring, necessitating new safety benchmarks and alignment methods.
Principles
- Draft-based inputs enable novel jailbreak vectors.
- Safety-utility balance is crucial for LLM alignment.
Method
HarDBench evaluates LLM robustness using domain-specific, incomplete drafts across high-risk categories. A preference-optimization-based alignment method then trains models to refuse harmful completions while remaining helpful on benign ones.
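The evaluation side of this method can be sketched as a per-domain tally of how often a model completes a draft instead of refusing. This is a minimal illustration, not HarDBench's actual harness or metric: `DraftPrompt`, `is_refusal`, `compliance_rate_by_domain`, and the keyword-based refusal check are all hypothetical stand-ins, and the drafts are placeholders. A real evaluation would likely use a stronger judge than keyword matching.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# Assumption: a crude keyword heuristic stands in for a real harmfulness judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


@dataclass
class DraftPrompt:
    domain: str  # e.g. "Explosives", "Drugs", "Weapons", "Cyberattacks"
    draft: str   # incomplete draft text (placeholder strings in this sketch)


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def compliance_rate_by_domain(
    prompts: Iterable[DraftPrompt],
    model: Callable[[str], str],
) -> Dict[str, float]:
    """Fraction of drafts per domain that the model completed (did not refuse)."""
    totals: Dict[str, int] = defaultdict(int)
    complied: Dict[str, int] = defaultdict(int)
    for prompt in prompts:
        totals[prompt.domain] += 1
        if not is_refusal(model(prompt.draft)):
            complied[prompt.domain] += 1
    return {domain: complied[domain] / totals[domain] for domain in totals}


def stub_model(draft: str) -> str:
    """Hypothetical model: refuses only drafts tagged 'A'."""
    if "A" in draft:
        return "I can't help with that."
    return "Continuing the draft..."


prompts = [
    DraftPrompt("Drugs", "[placeholder A]"),
    DraftPrompt("Drugs", "[placeholder B]"),
    DraftPrompt("Weapons", "[placeholder C]"),
]
rates = compliance_rate_by_domain(prompts, stub_model)
```

A lower rate on harmful drafts indicates a more robust model; the same function run on benign drafts gives a rough view of whether helpfulness is preserved.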
In practice
- Use HarDBench to test LLM safety in co-authoring.
- Implement preference optimization for safer LLM alignment.
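One way to picture the safety-utility balance in preference data is sketched below. The source says only that the alignment is based on preference optimization; this sketch assumes a DPO-style objective as one well-known instance, and `PreferencePair`, `build_pair`, `dpo_loss`, and the refusal string are hypothetical names chosen for illustration. Harmful drafts prefer a refusal over a completion, while benign drafts prefer the completion over a refusal, so the model is not pushed toward refusing everything.

```python
import math
from dataclasses import dataclass

# Assumption: a fixed refusal string; real data would use varied refusals.
REFUSAL = "I can't help with completing this draft."


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the model should prefer
    rejected: str  # response the model should disprefer


def build_pair(draft: str, completion: str, harmful: bool) -> PreferencePair:
    """Harmful drafts prefer refusal; benign drafts prefer the completion."""
    if harmful:
        return PreferencePair(draft, chosen=REFUSAL, rejected=completion)
    return PreferencePair(draft, chosen=completion, rejected=REFUSAL)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO-style loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


harmful_pair = build_pair("[harmful draft placeholder]", "[completion]", harmful=True)
benign_pair = build_pair("[benign draft placeholder]", "[completion]", harmful=False)
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does, which is the signal that drives both refusal on harmful drafts and helpfulness on benign ones.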
Topics
- Large Language Models
- Jailbreak Attacks
- Collaborative Writing
- HarDBench Benchmark
- Preference Optimization
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.