DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair
Summary
DDOR (Delta Debugging for OverRefusal) is an automated, explainable framework for testing and repairing overrefusal in large language models (LLMs) that are accessed as black boxes. Overrefusal occurs when safety alignment causes an LLM to reject benign queries that merely appear risky. DDOR uses delta debugging to pinpoint minimal refusal-triggering fragments (mRTFs), which gives a phrase-level explanation of why a prompt was refused. From these mRTFs, the framework generates diverse, context-rich prompts and applies multi-oracle validation to filter out unsafe or ambiguous cases, producing scalable, model-specific overrefusal test suites of roughly 1K cases per model. DDOR also uses the localized mRTFs for targeted prompt repair, which reduces overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs.
Key takeaway
For AI scientists and ML engineers developing or deploying LLMs, DDOR offers a practical way to address overrefusal. If your models reject benign queries, consider DDOR's automated testing and repair framework: it generates explainable, model-specific test suites and performs targeted prompt repairs, improving usability without weakening the model's safety behavior on genuinely harmful inputs.
Key insights
DDOR uses delta debugging to explain and mitigate LLM overrefusal in black-box settings, improving usability while preserving safety.
Principles
- Overrefusal can be localized to minimal fragments.
- Overrefusal can be tested and repaired with black-box access alone, without model internals.
- Automated, model-specific test generation makes overrefusal testing scalable.
Method
DDOR applies delta debugging to find minimal refusal-triggering fragments (mRTFs). It then generates diverse prompts, validates them with multiple oracles, and uses mRTFs for targeted prompt repair.
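The localization step can be illustrated with the classic ddmin form of delta debugging. The following is a minimal sketch, not DDOR's actual implementation: the whitespace tokenization, the `is_refused` oracle signature, and the toy oracle `toy_is_refused` are all assumptions made for illustration. In a real setting, `is_refused` would join the tokens, query the target LLM, and classify the response as a refusal or not.

```python
from typing import Callable, List


def split(tokens: List[str], n: int) -> List[List[str]]:
    """Split tokens into n contiguous, near-equal chunks (requires n <= len(tokens))."""
    k, m = divmod(len(tokens), n)
    return [tokens[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


def ddmin(tokens: List[str], is_refused: Callable[[List[str]], bool]) -> List[str]:
    """Shrink tokens to a 1-minimal subsequence that still triggers a refusal.

    is_refused is a hypothetical black-box oracle: True if the model refuses.
    """
    if not is_refused(tokens):
        raise ValueError("the full prompt must be refused to start delta debugging")
    n = 2  # current granularity (number of chunks)
    while len(tokens) >= 2:
        chunks = split(tokens, n)
        reduced = False
        # 1) Try each chunk on its own.
        for chunk in chunks:
            if is_refused(chunk):
                tokens, n, reduced = chunk, 2, True
                break
        # 2) Otherwise try removing one chunk at a time.
        if not reduced:
            for i in range(len(chunks)):
                rest = [t for j, c in enumerate(chunks) if j != i for t in c]
                if is_refused(rest):
                    tokens, n, reduced = rest, max(n - 1, 2), True
                    break
        # 3) Otherwise refine the granularity, or stop at single-token chunks.
        if not reduced:
            if n >= len(tokens):
                break
            n = min(len(tokens), n * 2)
    return tokens


def toy_is_refused(tokens: List[str]) -> bool:
    """Toy stand-in for a model: refuses only when both trigger words co-occur."""
    return {"kill", "process"} <= set(tokens)
```

With this toy oracle, `ddmin("How do I kill a Python process on Linux?".split(), toy_is_refused)` returns `["kill", "process"]`: removing either remaining token makes the refusal disappear, which is the sense in which the fragment is minimal. Each oracle call corresponds to one model query, so query cost grows with prompt length.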
In practice
- Generate ~1K overrefusal test cases per model.
- Repair prompts using localized mRTFs.
- Improve LLM usability without compromising safety.
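The repair and validation steps listed above can be sketched as follows. This is a hedged illustration under stated assumptions, not the method as published: the oracle labels (`"safe"`, `"unsafe"`, `"unsure"`), the unanimity rule, the candidate `rewrites` list, and the function names `passes_oracles` and `repair_prompt` are all hypothetical. The sketch also omits any check that the repaired prompt preserves the original intent.

```python
from typing import Callable, Optional, Sequence

# Hypothetical oracle type: maps a prompt to "safe", "unsafe", or "unsure".
Oracle = Callable[[str], str]


def passes_oracles(prompt: str, oracles: Sequence[Oracle]) -> bool:
    """Keep a prompt only if every oracle labels it safe.

    Requiring unanimity is an assumption; it drops both unsafe and ambiguous cases.
    """
    return all(oracle(prompt) == "safe" for oracle in oracles)


def repair_prompt(
    prompt: str,
    mrtf: str,
    rewrites: Sequence[str],
    is_refused: Callable[[str], bool],
    oracles: Sequence[Oracle],
) -> Optional[str]:
    """Replace the localized fragment with each candidate rewrite in turn.

    Returns the first candidate that the oracles accept as safe and that the
    (hypothetical) black-box model no longer refuses, or None if none qualifies.
    """
    for alternative in rewrites:
        candidate = prompt.replace(mrtf, alternative)
        if candidate == prompt:
            continue  # fragment not present; nothing was repaired
        if passes_oracles(candidate, oracles) and not is_refused(candidate):
            return candidate
    return None
```

For example, with a toy model that refuses any prompt containing "kill" and a single toy oracle that flags prompts containing "bomb", repairing "How do I kill a Python process?" with the fragment "kill" and rewrites `["terminate", "stop"]` yields "How do I terminate a Python process?". Checking safety before querying the model keeps genuinely harmful prompts from being rewritten into accepted ones.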
Topics
- Large Language Models
- Overrefusal Testing
- Delta Debugging
- Prompt Repair
- AI Safety Alignment
- Black-box Testing
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.