HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

2026-04-21 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Large language models (LLMs) face a significant safety risk in collaborative writing scenarios, where users provide incomplete drafts for the model to complete or refine. Malicious users can exploit this by embedding dangerous content in a draft and asking the model to continue it, steering the model into generating harmful output. The authors call this vulnerability a "draft-based co-authoring jailbreak attack." Researchers Euntae Kim, Soomin Han, and Buru Chang introduce HarDBench, a new benchmark to systematically evaluate LLM robustness against these attacks. HarDBench includes prompts across high-risk domains such as Explosives, Drugs, Weapons, and Cyberattacks, designed with realistic structures and domain-specific cues. The team also proposes a safety-utility balanced alignment approach that uses preference optimization to train models to refuse harmful completions while remaining helpful on benign drafts. Their experiments indicate that existing LLMs are highly vulnerable to these attacks, and that the proposed alignment method substantially reduces harmful outputs while preserving performance on benign co-authoring tasks.

Key takeaway

If you are a research scientist or engineer developing LLMs for collaborative writing, you must account for draft-based co-authoring jailbreak attacks. Your current safety evaluations may not cover this specific vulnerability, which could leave your models open to generating malicious content. Integrate benchmarks like HarDBench into your testing protocols, and explore preference optimization techniques that align models for both safety and utility, so that your LLMs refuse harmful completions without sacrificing helpfulness on legitimate tasks.

Key insights

LLMs are vulnerable to draft-based jailbreaks in co-authoring, necessitating specialized benchmarks and alignment for safety.

Principles

Draft-based co-authoring introduces unique jailbreak vectors.
Safety alignment must balance refusal of harm with utility.

Method

HarDBench evaluates LLM robustness against draft-based jailbreaks using prompts drawn from high-risk domains. A preference-optimization-based alignment method then trains models to refuse harmful completions while remaining helpful on benign drafts.
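
This summary does not say which preference-optimization objective the authors use. The following is a minimal sketch, assuming a DPO-style loss (an assumption, not confirmed by the source), of how a safety-utility balance could be encoded in the training pairs: for a harmful draft the refusal is preferred, and for a benign draft the helpful completion is preferred. The names PreferencePair, build_pair, and dpo_loss are hypothetical and do not come from the HarDBench repository.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # the co-authoring draft shown to the model
    chosen: str    # response the model should prefer
    rejected: str  # response the model should avoid


def build_pair(draft: str, is_harmful: bool, completion: str, refusal: str) -> PreferencePair:
    """Hypothetical pairing rule: refuse harmful drafts, complete benign ones."""
    if is_harmful:
        return PreferencePair(draft, chosen=refusal, rejected=completion)
    return PreferencePair(draft, chosen=completion, rejected=refusal)


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO-style loss for one pair: -log(sigmoid(beta * margin)).

    Assumed objective; inputs are summed token log-probabilities of each
    response under the trained policy and a frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    x = -beta * margin
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Including benign drafts with the refusal as the rejected response is what penalizes over-refusal in this sketch; training on harmful pairs alone would push the model toward refusing everything.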

In practice

Use HarDBench to test LLM safety in co-authoring.
Implement preference optimization for safety-utility balance.
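
As a rough illustration of the first point, the sketch below scores a model on two rates: how often it complies with harmful drafts, and how often it refuses benign ones. It is a sketch under stated assumptions, not the HarDBench evaluation protocol. The keyword-based is_refusal check, the REFUSAL_MARKERS list, the evaluate function, and the (draft, is_harmful) input shape are all hypothetical, and a real evaluation would likely need a stronger judge than substring matching.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical refusal phrases; a crude stand-in for a proper safety judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def is_refusal(text: str) -> bool:
    """Return True if the response contains any known refusal phrase."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def evaluate(model: Callable[[str], str],
             examples: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Score a model on (draft, is_harmful) examples.

    attack_success_rate: share of harmful drafts the model did not refuse.
    over_refusal_rate:   share of benign drafts the model refused.
    """
    harmful_total = harmful_complied = 0
    benign_total = benign_refused = 0
    for draft, is_harmful in examples:
        refused = is_refusal(model(draft))
        if is_harmful:
            harmful_total += 1
            harmful_complied += 0 if refused else 1
        else:
            benign_total += 1
            benign_refused += 1 if refused else 0
    return {
        "attack_success_rate": harmful_complied / harmful_total if harmful_total else 0.0,
        "over_refusal_rate": benign_refused / benign_total if benign_total else 0.0,
    }
```

Reporting both rates together matters: a model that refuses every draft would score perfectly on safety while being useless as a co-author.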

Topics

LLM Jailbreak Attacks
Co-Authoring Safety
HarDBench Benchmark
Preference Optimization
Human-LLM Collaboration

Code references

untae0122/HarDBench

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.