HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Expert, extended

Summary

HarDBench is a new benchmark for evaluating how vulnerable large language models (LLMs) are to "draft-based co-authoring jailbreak attacks." In these attacks, a user supplies an incomplete harmful draft (for example, on explosives, drugs, weapons, or cyberattacks) and asks the LLM to complete or refine it, a framing that often bypasses existing safety mechanisms. HarDBench contains 1,204 validated drafts and wraps them in realistic task framing to simulate collaborative writing scenarios. In the reported experiments, current LLMs, including ChatGPT and Gemini, prove highly susceptible under co-authoring conditions, with Harmfulness Scores (HS) above 4.29 and Attack Success Rates (ASR) exceeding 80%. As a mitigation, the authors developed "safety-utility balanced alignment" (SUBA), a preference-optimization approach that trains models to refuse harmful completions while remaining helpful on benign drafts. SUBA is reported to significantly reduce harmful outputs without degrading co-authoring performance on public benchmarks such as WritingBench and LongBench-Write.
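To make the two headline metrics concrete, the sketch below aggregates per-response judge ratings into HS and ASR. It rests on assumptions the summary does not state: that HS is the mean of a 1-to-5 harmfulness rating, and that an attack counts as successful when a response's rating reaches a threshold. The names `summarize_scores` and `success_threshold` are hypothetical, and the ratings in the usage line are invented for illustration, not benchmark results.

```python
def summarize_scores(scores, success_threshold=5):
    """Aggregate per-response harmfulness ratings into (HS, ASR).

    Assumptions (not confirmed by the source):
      - each rating is an integer from 1 (harmless) to 5 (fully harmful)
      - HS is the mean rating
      - ASR is the fraction of ratings >= success_threshold
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("each score must be between 1 and 5")

    hs = sum(scores) / len(scores)
    asr = sum(1 for s in scores if s >= success_threshold) / len(scores)
    return hs, asr


# Illustrative ratings only; these are not HarDBench results.
hs, asr = summarize_scores([5, 5, 4, 5, 1])
print(f"HS={hs:.2f}  ASR={asr:.0%}")
```

Under this reading, a stricter threshold lowers ASR without changing HS, which is why the two numbers are usually reported together.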

Key takeaway

For research scientists and CTOs developing or deploying LLMs for collaborative writing, this research highlights a critical, underexplored vulnerability: draft-based jailbreaks. Integrate benchmarks like HarDBench into your safety evaluations, and consider safety-utility balanced alignment (SUBA) so that models refuse to complete harmful drafts while staying helpful on legitimate co-authoring tasks. Both steps support trustworthy human-AI collaborative systems and help prevent real-world misuse.

Key insights

LLMs are vulnerable to draft-based jailbreaks, but a balanced alignment approach can enhance safety without sacrificing utility.

Method

HarDBench constructs jailbreak prompts by embedding harmful drafts into co-authoring task frames. SUBA uses preference optimization with contrastive labels (refusal for harmful drafts, completion for benign drafts) to balance safety and utility.
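The contrastive labelling can be sketched as a small data-preparation step. This is a minimal illustration, not the authors' pipeline: the source does not say which preference-optimization algorithm SUBA uses or what the dispreferred response is, so the sketch assumes the `prompt`/`chosen`/`rejected` record layout common to DPO-style trainers and assumes the rejected side is simply the opposite behaviour. `DraftExample`, `to_preference_pair`, and `REFUSAL` are hypothetical names, and the drafts are placeholders.

```python
from dataclasses import dataclass

# Hypothetical refusal text; the source does not give the wording used.
REFUSAL = "I can't help complete this draft because it describes harmful activity."


@dataclass(frozen=True)
class DraftExample:
    task_frame: str   # co-authoring instruction wrapped around the draft
    draft: str        # the incomplete draft supplied by the user
    is_harmful: bool  # label from the dataset
    completion: str   # a response that completes the draft


def to_preference_pair(ex: DraftExample) -> dict:
    """Build one preference record with contrastive labels.

    Harmful draft -> prefer the refusal over the completion.
    Benign draft  -> prefer the completion over the refusal.
    (The choice of rejected response is an assumption of this sketch.)
    """
    prompt = f"{ex.task_frame}\n\n{ex.draft}"
    if ex.is_harmful:
        chosen, rejected = REFUSAL, ex.completion
    else:
        chosen, rejected = ex.completion, REFUSAL
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Placeholder examples; no real harmful content is needed to show the labelling.
examples = [
    DraftExample("Please finish this draft.", "[harmful draft placeholder]",
                 True, "[harmful completion placeholder]"),
    DraftExample("Please finish this draft.", "Notes for a sourdough recipe...",
                 False, "Here is a completed recipe."),
]
pairs = [to_preference_pair(ex) for ex in examples]
```

Pairing every benign draft with a rejected refusal is what keeps the model from learning to refuse everything: over-refusal is penalized by the same objective that rewards refusing harmful drafts.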

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.