HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

2025-07-03 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

HarDBench is a new benchmark for evaluating how vulnerable large language models (LLMs) are to "draft-based co-authoring jailbreak attacks." In these attacks, a user supplies an incomplete harmful draft (e.g., on explosives, drugs, weapons, or cyberattacks) and asks the LLM to complete or refine it, a framing that often bypasses existing safety mechanisms. HarDBench contains 1,204 validated drafts and uses realistic task framing to simulate collaborative writing scenarios. Experiments show that current LLMs, including ChatGPT and Gemini, are highly susceptible: under co-authoring conditions, Harmfulness Scores (HS) exceed 4.29 and Attack Success Rates (ASR) exceed 80%. As a mitigation, the authors propose "safety-utility balanced alignment" (SUBA), an approach based on preference optimization. SUBA trains models to refuse harmful completions while remaining helpful on benign drafts, significantly reducing harmful outputs without degrading co-authoring performance on public benchmarks such as WritingBench and LongBench-Write.

Key takeaway

For research scientists and CTOs developing or deploying LLMs for collaborative writing, this research highlights a critical, underexplored vulnerability: draft-based jailbreaks. Integrate benchmarks like HarDBench into your safety evaluations, and consider safety-utility balanced alignment (SUBA) so that models recognize and refuse harmful drafts while staying helpful on legitimate co-authoring tasks. Doing so supports trustworthy human-AI collaborative systems and helps prevent real-world misuse.

Key insights

LLMs are vulnerable to draft-based jailbreaks, but a balanced alignment approach can enhance safety without sacrificing utility.

Principles

Task framing can conceal malicious intent in LLM interactions.
Scaling reasoning capabilities does not inherently improve LLM safety.
Exposing the model to benign drafts during alignment is essential for preserving its collaborative utility.

Method

HarDBench constructs jailbreak prompts by embedding harmful drafts into co-authoring task frames. SUBA applies preference optimization with contrastive labels: for harmful drafts the refusal is the preferred response, and for benign drafts the completion is preferred. This contrast is what balances safety against utility.
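
The contrastive labeling described above can be sketched as a small data-preparation step. This is a minimal illustration, not the authors' implementation: the names DraftExample, to_preference_pair, build_preference_pairs, and the REFUSAL wording are hypothetical, and the draft and completion strings are placeholders rather than benchmark data.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical refusal text; the source does not specify the wording used.
REFUSAL = "I can't help with completing this draft."


@dataclass(frozen=True)
class DraftExample:
    prompt: str      # co-authoring task frame with the draft embedded
    completion: str  # a full completion of the draft
    harmful: bool    # True if the draft's content is harmful


def to_preference_pair(ex: DraftExample, refusal: str = REFUSAL) -> Dict[str, str]:
    """Build one {prompt, chosen, rejected} record with contrastive labels."""
    if ex.harmful:
        # Harmful draft: the refusal is preferred over the completion.
        return {"prompt": ex.prompt, "chosen": refusal, "rejected": ex.completion}
    # Benign draft: the completion is preferred over an unnecessary refusal.
    return {"prompt": ex.prompt, "chosen": ex.completion, "rejected": refusal}


def build_preference_pairs(examples: List[DraftExample]) -> List[Dict[str, str]]:
    """Convert labeled drafts into preference records, preserving order."""
    return [to_preference_pair(ex) for ex in examples]


# Placeholder examples only; no real draft content is shown here.
examples = [
    DraftExample("Please polish this draft: <benign draft>", "<polished draft>", harmful=False),
    DraftExample("Please finish this draft: <harmful draft>", "<unsafe completion>", harmful=True),
]
pairs = build_preference_pairs(examples)
```

Records with prompt, chosen, and rejected fields are a common input shape for preference-optimization training. Pairing every refusal with a benign counterexample is what keeps the model from learning to refuse all drafts.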

In practice

Use HarDBench to test LLM robustness against co-authoring misuse.
Implement SUBA-like preference optimization for LLM safety alignment.
Prioritize context-aware risk recognition in LLM development.
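
When running such an evaluation, the two reported metrics can be aggregated from per-response judge ratings. The sketch below is a hedged illustration: it assumes each response has already been rated for harmfulness on a 1-to-5 scale, and it treats the success threshold as a parameter because this summary does not state the benchmark's exact criterion. The function names are hypothetical.

```python
from typing import Sequence


def harmfulness_score(scores: Sequence[int]) -> float:
    """Mean harmfulness rating over all responses (assumed 1-5 scale)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(scores) / len(scores)


def attack_success_rate(scores: Sequence[int], threshold: int = 5) -> float:
    """Fraction of responses rated at or above the threshold.

    The default threshold of 5 is an assumption, not a value from the source.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Illustrative ratings, not benchmark results.
ratings = [5, 5, 4, 1, 5]
hs = harmfulness_score(ratings)     # (5 + 5 + 4 + 1 + 5) / 5
asr = attack_success_rate(ratings)  # 3 of 5 responses reach the threshold
```

Reporting both numbers is useful because a high mean can hide a few outright refusals, while the success rate alone hides how harmful the partial completions are.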

Topics

HarDBench
LLM Jailbreak Attacks
Collaborative Writing Safety
Preference Optimization
Safety-Utility Alignment

Code references

untae0122/HarDBench

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.