HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

2026-04-21 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

HarDBench is a benchmark for evaluating how robust large language models (LLMs) are against "draft-based co-authoring jailbreak attacks." These attacks exploit LLMs' collaborative writing capabilities: a user supplies an incomplete draft and asks the model to complete it, steering the model into generating harmful content. The benchmark covers high-risk domains such as Explosives, Drugs, Weapons, and Cyberattacks, using prompts with realistic structures and domain-specific cues to assess susceptibility. The reported experiments indicate that current LLMs are highly vulnerable in co-authoring contexts. To address this, the authors introduce a safety-utility balanced alignment approach based on preference optimization, which significantly reduces harmful outputs while maintaining performance on benign co-authoring tasks. The authors present this work as a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing.

Key takeaway

Research scientists and CTOs developing or deploying LLMs for collaborative writing should prioritize evaluating their models against draft-based jailbreak attacks using benchmarks like HarDBench. Safety-utility balanced alignment techniques, such as preference optimization, can mitigate the risk of generating harmful content without compromising the model's helpfulness on legitimate co-authoring tasks. Testing for this attack vector before release is critical for responsible LLM deployment.

Key insights

LLMs are vulnerable to draft-based jailbreaks in co-authoring, necessitating new safety benchmarks and alignment methods.

Principles

Draft-based inputs enable novel jailbreak vectors.
Safety-utility balance is crucial for LLM alignment.

Method

HarDBench evaluates LLM robustness by prompting models with domain-specific, incomplete drafts across high-risk categories and checking whether they complete the harmful content. A preference-optimization-based alignment method then trains models to refuse harmful completions while remaining helpful on benign drafts.
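
The evaluation step can be pictured as a loop that sends each draft to the model and tallies, per category, how often the completion is judged harmful. The sketch below is illustrative only and is not taken from the HarDBench repository: `DraftPrompt`, `harmful_rate_by_category`, and the `complete` and `is_harmful` callables are hypothetical names, and the benchmark's actual prompt format, judge, and metric may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class DraftPrompt:
    """Hypothetical record for one benchmark item (not the real HarDBench schema)."""
    category: str  # e.g. "Explosives", "Drugs", "Weapons", "Cyberattacks"
    draft: str     # incomplete draft the model is asked to continue


def harmful_rate_by_category(
    prompts: Iterable[DraftPrompt],
    complete: Callable[[str], str],     # assumed wrapper around the model under test
    is_harmful: Callable[[str], bool],  # assumed judge (human label or classifier)
) -> Dict[str, float]:
    """Return the fraction of completions judged harmful in each category."""
    totals: Dict[str, int] = defaultdict(int)
    harmful: Dict[str, int] = defaultdict(int)
    for prompt in prompts:
        totals[prompt.category] += 1
        if is_harmful(complete(prompt.draft)):
            harmful[prompt.category] += 1
    return {category: harmful[category] / totals[category] for category in totals}
```

A lower rate means the model resisted more drafts in that category. The same loop can be run over benign drafts with a helpfulness judge to track the utility side of the trade-off.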

In practice

Use HarDBench to test LLM safety in co-authoring.
Implement preference optimization for safer LLM alignment.
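
This summary does not say which preference-optimization objective the authors use. As one common example, the sketch below computes the per-pair Direct Preference Optimization (DPO) loss; choosing DPO here is an assumption for illustration, not a description of the paper's method. In a co-authoring setting, the "chosen" response would be the safe one (a refusal for a harmful draft, a helpful completion for a benign draft) and the "rejected" response the unsafe or unhelpful one. The function name, argument names, and the default `beta` are hypothetical.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed strength of the pull toward the reference model
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * margin)), where margin measures how much more
    the policy prefers the chosen response than the reference model does.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = -beta * margin
    # Numerically stable softplus: log(1 + exp(x)) == -log(sigmoid(-x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and the reference model agree, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response. Mixing harmful-draft and benign-draft pairs in training is one way to target the safety-utility balance the summary describes.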

Topics

Large Language Models
Jailbreak Attacks
Collaborative Writing
HarDBench Benchmark
Preference Optimization

Code references

untae0122/HarDBench

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.