HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
Summary
Large language models (LLMs) face a significant safety risk in collaborative writing scenarios, where users provide incomplete drafts for LLMs to complete or refine. Malicious users can exploit this by injecting dangerous content into drafts, forcing LLMs to generate harmful outputs, a vulnerability termed "draft-based co-authoring jailbreak attacks." Researchers Euntae Kim, Soomin Han, and Buru Chang introduce HarDBench, a new benchmark to systematically evaluate LLM robustness against these attacks. HarDBench includes prompts across high-risk domains like Explosives, Drugs, Weapons, and Cyberattacks, designed with realistic structures and domain-specific cues. The team also proposes a safety-utility balanced alignment approach using preference optimization to train models to refuse harmful completions while maintaining helpfulness on benign drafts. Their experiments indicate existing LLMs are highly vulnerable, but their alignment method substantially reduces harmful outputs without performance degradation.
Key takeaway
For research scientists and engineers developing LLMs for collaborative writing, you must account for draft-based co-authoring jailbreak attacks. Your current safety evaluations may not cover this specific vulnerability, potentially exposing your models to malicious content generation. Integrate benchmarks like HarDBench into your testing protocols and explore preference optimization techniques to align models for both safety and utility, ensuring your LLMs can refuse harmful completions without sacrificing helpfulness on legitimate tasks.
Key insights
LLMs are vulnerable to draft-based jailbreaks in co-authoring, necessitating specialized benchmarks and alignment for safety.
Principles
- Draft-based co-authoring introduces unique jailbreak vectors.
- Safety alignment must balance refusal of harm with utility.
Method
HarDBench evaluates LLM robustness against draft-based jailbreaks using high-risk domain prompts. A preference optimization-based alignment trains models to refuse harmful completions while remaining helpful.
In practice
- Use HarDBench to test LLM safety in co-authoring.
- Implement preference optimization for safety-utility balance.
Topics
- LLM Jailbreak Attacks
- Co-Authoring Safety
- HarDBench Benchmark
- Preference Optimization
- Human-LLM Collaboration
Code references
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, NLP Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.