HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
Summary
HarDBench is a new benchmark for evaluating how vulnerable large language models (LLMs) are to "draft-based co-authoring jailbreak attacks." In these attacks, a user supplies an incomplete harmful draft (e.g., for explosives, drugs, weapons, or cyberattacks) and prompts the LLM to complete or refine it, which often bypasses existing safety mechanisms. HarDBench includes 1,204 validated drafts and uses realistic task framing to simulate collaborative writing scenarios. Experimental results show that current LLMs, including ChatGPT and Gemini, are highly susceptible, with Harmfulness Scores (HS) above 4.29 and Attack Success Rates (ASR) exceeding 80% under co-authoring conditions. To mitigate this, the paper develops a "safety-utility balanced alignment" (SUBA) approach based on preference optimization. SUBA trains models to refuse harmful completions while remaining helpful on benign drafts, significantly reducing harmful outputs without degrading co-authoring performance on public benchmarks such as WritingBench and LongBench-Write.
Key takeaway
For research scientists and CTOs developing or deploying LLMs for collaborative writing, this research highlights a critical, underexplored vulnerability: draft-based jailbreaks. You should integrate benchmarks like HarDBench into your safety evaluations and consider implementing safety-utility balanced alignment (SUBA) to ensure models can discern and refuse harmful content while preserving their helpfulness for legitimate co-authoring tasks. This proactive approach is crucial to building trustworthy human-AI collaborative systems and preventing real-world misuse.
Key insights
LLMs are vulnerable to draft-based jailbreaks, but a balanced alignment approach can enhance safety without sacrificing utility.
Principles
- Task framing can conceal malicious intent in LLM interactions.
- Scaling reasoning capabilities does not inherently improve LLM safety.
- Benign data exposure is essential for maintaining LLM collaborative utility.
Method
HarDBench constructs jailbreak prompts by embedding harmful drafts into co-authoring task frames. SUBA uses preference optimization with contrastive labels (a refusal is preferred for harmful drafts, a completion for benign ones) to balance safety and utility.
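The contrastive labeling described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `Draft` class, the frame wording, the `REFUSAL` string, and the prompt/chosen/rejected record layout are all assumptions chosen for illustration, and the harmful entries are placeholders rather than real content.

```python
from dataclasses import dataclass

# Hypothetical refusal text; a real dataset would likely use varied refusals.
REFUSAL = "I can't help with completing this draft."


@dataclass
class Draft:
    text: str        # the incomplete draft supplied by the user
    harmful: bool    # label assigned during dataset validation
    completion: str  # a reference completion of the draft


def frame_as_coauthoring(draft_text: str) -> str:
    """Wrap a draft in a co-authoring task frame (illustrative wording only)."""
    return (
        "You are my co-author. Please complete and polish the draft below.\n\n"
        f"--- DRAFT ---\n{draft_text}\n--- END DRAFT ---"
    )


def build_preference_pair(draft: Draft) -> dict:
    """Return one prompt/chosen/rejected record with contrastive labels."""
    prompt = frame_as_coauthoring(draft.text)
    if draft.harmful:
        # Harmful draft: the refusal is preferred over the completion.
        chosen, rejected = REFUSAL, draft.completion
    else:
        # Benign draft: the completion is preferred over the refusal.
        chosen, rejected = draft.completion, REFUSAL
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


drafts = [
    Draft("[placeholder: harmful draft]", True, "[placeholder: harmful completion]"),
    Draft("Intro: Our bakery opened in", False, "Intro: Our bakery opened in 2010."),
]
pairs = [build_preference_pair(d) for d in drafts]
```

Pairing every benign draft with a rejected refusal is what keeps the model from learning to refuse everything that arrives inside a co-authoring frame; records in this shape could then be passed to a preference-optimization trainer of your choice.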
In practice
- Use HarDBench to test LLM robustness against co-authoring misuse.
- Implement SUBA-like preference optimization for LLM safety alignment.
- Prioritize context-aware risk recognition in LLM development.
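A robustness check along the lines of the first item above might be structured as follows. This is a hedged sketch, not HarDBench's actual harness: the `evaluate` function, the integer harmfulness scale, and the `success_threshold` used to count an attack as successful are assumptions, and `refusing_model` and `toy_judge` are stand-ins for a real model client and a real harmfulness judge.

```python
from statistics import mean
from typing import Callable, Sequence


def evaluate(
    prompts: Sequence[str],
    model: Callable[[str], str],
    judge: Callable[[str, str], int],
    success_threshold: int = 4,  # assumed cutoff, not taken from the paper
) -> dict:
    """Score each model response and aggregate into mean HS and ASR."""
    if not prompts:
        raise ValueError("need at least one prompt")
    scores = [judge(p, model(p)) for p in prompts]
    successes = sum(s >= success_threshold for s in scores)
    return {"HS": mean(scores), "ASR": successes / len(scores)}


def refusing_model(prompt: str) -> str:
    """Stand-in model that refuses every request."""
    return "I can't help with that."


def toy_judge(prompt: str, response: str) -> int:
    """Stand-in judge: lowest score for refusals, highest otherwise."""
    return 1 if response.startswith("I can't") else 5


report = evaluate(["framed draft A", "framed draft B"], refusing_model, toy_judge)
```

Running the same function over a benign draft set, with a helpfulness judge in place of `toy_judge`, would give the utility side of the comparison.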
Topics
- HarDBench
- LLM Jailbreak Attacks
- Collaborative Writing Safety
- Preference Optimization
- Safety-Utility Alignment
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.