DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, quick

Summary

DDOR (Delta Debugging for OverRefusal) is an automated, explainable framework for testing and repairing overrefusal in large language models (LLMs) using only black-box access. Overrefusal occurs when safety alignment causes an LLM to reject benign queries that merely appear risky. DDOR applies delta debugging to isolate minimal refusal-triggering fragments (mRTFs), which give phrase-level explanations of why a refusal occurred. From these mRTFs, the framework generates diverse, context-rich prompts and applies multi-oracle validation to filter out unsafe or ambiguous cases, producing scalable, model-specific overrefusal test suites of roughly 1K cases per model. DDOR also uses the localized mRTFs for targeted prompt repair, reducing overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs.

Key takeaway

For AI scientists and ML engineers developing or deploying LLMs, DDOR targets overrefusal directly. If your models reject benign queries, consider its automated testing and repair framework: it produces explainable, model-specific test suites and targeted prompt repairs, improving usability while keeping safety behavior intact on genuinely harmful inputs.

Key insights

DDOR uses delta debugging to explain and mitigate LLM overrefusal in black-box settings, improving usability while maintaining safety.

Principles

Overrefusal can be localized to minimal fragments.
Overrefusal can be tested and repaired with black-box access alone.
Automated, model-specific test suites expose overrefusal at scale.

Method

DDOR applies delta debugging to find minimal refusal-triggering fragments (mRTFs). It then generates diverse prompts, validates them with multiple oracles, and uses mRTFs for targeted prompt repair.
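The localization step can be illustrated with a minimal sketch of classic delta debugging (the ddmin reduction loop) applied to a tokenized prompt. This is not DDOR's actual implementation: the names `ddmin`, `split`, and `mock_refuses` are hypothetical, whitespace tokenization is an assumption, and the refusal check is a stand-in for a real black-box model call plus a refusal detector.

```python
from typing import Callable, List, Sequence


def split(items: List[str], n: int) -> List[List[str]]:
    """Split items into n contiguous, near-equal chunks (requires n <= len(items))."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def ddmin(tokens: Sequence[str], refuses: Callable[[List[str]], bool]) -> List[str]:
    """Shrink tokens to a 1-minimal subsequence that still triggers a refusal.

    1-minimal means that removing any single remaining token stops the refusal.
    """
    current = list(tokens)
    if not refuses(current):
        raise ValueError("the full prompt must be refused to start reduction")
    n = 2  # current granularity: number of chunks
    while len(current) >= 2:
        chunks = split(current, n)
        reduced = False
        for i in range(n):
            # Drop chunk i and keep everything else.
            complement = [t for j, c in enumerate(chunks) if j != i for t in c]
            if refuses(complement):
                current = complement
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(current):
                break  # already at single-token granularity
            n = min(n * 2, len(current))
    return current


# Hypothetical stand-in for "query the model and detect a refusal".
def mock_refuses(tokens: List[str]) -> bool:
    return "kill" in tokens and "process" in tokens


prompt = "How do I kill a stuck Python process"
fragment = ddmin(prompt.split(), mock_refuses)
print(fragment)  # ['kill', 'process']
```

Each call to `refuses` corresponds to one model query, so in a real black-box setting it would be worth caching results for repeated token lists.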

In practice

Generate ~1K overrefusal test cases per model.
Repair prompts using localized mRTFs.
Improve LLM usability without compromising safety.
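As a rough illustration of repairing a prompt from a localized mRTF, the sketch below substitutes candidate rewordings for the fragment and keeps the first rewritten prompt the model no longer refuses. The summary does not describe how DDOR rewrites prompts, so the substitution strategy, the function `repair_prompt`, the hand-supplied `alternatives`, and the mock refusal check are all assumptions made for the example. It also does not verify intent preservation or safety on harmful inputs, which the framework is reported to maintain.

```python
from typing import Callable, Iterable, Optional


def repair_prompt(
    prompt: str,
    fragment: str,
    alternatives: Iterable[str],
    refuses: Callable[[str], bool],
) -> Optional[str]:
    """Return the first rewrite of prompt that is not refused, or None.

    Only the localized fragment is replaced; the rest of the prompt is untouched.
    """
    if fragment not in prompt:
        raise ValueError("fragment does not occur in prompt")
    for alt in alternatives:
        candidate = prompt.replace(fragment, alt)
        if not refuses(candidate):
            return candidate
    return None


# Hypothetical stand-in for a black-box model call plus refusal detection.
def mock_refuses(prompt: str) -> bool:
    return "kill" in prompt.lower()


repaired = repair_prompt(
    "How do I kill a stuck Python process?",
    fragment="kill",
    alternatives=["terminate", "stop"],
    refuses=mock_refuses,
)
print(repaired)  # How do I terminate a stuck Python process?
```

Restricting the edit to the fragment is the point of localization: the smaller the changed span, the less likely the rewrite drifts from what the user originally asked.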

Topics

Large Language Models
Overrefusal Testing
Delta Debugging
Prompt Repair
AI Safety Alignment
Black-box Testing

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.