DeFrame: Debiasing Large Language Models Against Framing Effects
Summary
DeFrame is a debiasing framework that targets "framing effects" in large language models (LLMs): semantically equivalent prompts phrased differently (e.g., "A is better than B" vs. "B is worse than A") can lead to inconsistent fairness outcomes. The research introduces "framing disparity" to quantify this issue and shows that existing debiasing methods often fail to reduce it, even when they lower overall bias. For instance, on the BBQ benchmark, bias under negative framings was 2x to 4x larger than under positive framings. DeFrame, inspired by dual-process theory, uses a three-stage prompting process (Framing Integration, Guideline Generation, and Self-Revision) that prompts an LLM to consider alternative framings and revise its response. Experiments across 8 LLMs, including LLaMA-3.2-3b-Instruct and Gemma3-12b-Instruct, on benchmarks including BBQ, DoNotAnswer-Framed, and 70Decisions-Framed demonstrate its effectiveness: on BBQ, DeFrame reduced framing disparity by 92% and overall bias by 93% on average.
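The summary does not state the paper's exact formula for framing disparity, so the following is a minimal sketch under stated assumptions. It assumes each model gets one bias score per framing (for example a BBQ-style bias score) and defines disparity as the absolute gap between the negative-framing and positive-framing scores. `FramedBias`, `framing_disparity`, `overall_bias`, and `relative_reduction` are hypothetical names, and the numbers in the demo are illustrative, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class FramedBias:
    """Bias scores for one model, measured separately per framing (hypothetical structure)."""
    positive: float  # bias measured on positively framed prompts
    negative: float  # bias measured on negatively framed prompts


def framing_disparity(scores: FramedBias) -> float:
    """Assumed definition: absolute gap between negative- and positive-framing bias."""
    return abs(scores.negative - scores.positive)


def overall_bias(scores: FramedBias) -> float:
    """Assumed definition: mean absolute bias across both framings."""
    return (abs(scores.positive) + abs(scores.negative)) / 2


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction; 0.92 corresponds to a 92% reduction."""
    if before == 0:
        raise ValueError("baseline value must be non-zero")
    return (before - after) / before


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    baseline = FramedBias(positive=0.05, negative=0.15)
    debiased = FramedBias(positive=0.01, negative=0.02)
    print(f"negative/positive ratio: {baseline.negative / baseline.positive:.1f}x")
    before = framing_disparity(baseline)
    after = framing_disparity(debiased)
    print(f"disparity: {before:.2f} -> {after:.2f}")
    print(f"disparity reduction: {relative_reduction(before, after):.0%}")
```

Reporting disparity separately from overall bias is the point of the metric: a method can shrink `overall_bias` while leaving `framing_disparity` large, which is the failure mode the summary attributes to existing debiasing methods.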
Key takeaway
For AI Scientists and Machine Learning Engineers deploying LLMs in sensitive applications, standard fairness evaluations that test a single phrasing of each prompt are insufficient. Your models can exhibit significant "framing disparity," where bias levels shift sharply depending on prompt wording. To ensure robust and consistent fairness, integrate framing-aware debiasing techniques like DeFrame into your development and evaluation workflows so that hidden biases do not lead to discriminatory outcomes.
Key insights
LLM fairness evaluations are highly sensitive to prompt framing, necessitating methods that ensure consistency across phrasing.
Principles
- LLM fairness metrics vary significantly with prompt framing.
- Current debiasing methods often fail to reduce framing-induced disparities.
- Dual-process theory can guide robust debiasing strategies.
Method
DeFrame integrates alternative framings of a prompt, generates fairness guidelines, and then has the LLM self-revise its initial response to improve consistency across framings and reduce bias.
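The three stages can be sketched as a prompting loop around any text-generation function. This is not the authors' implementation: the summary names the stages but not their prompts, so all prompt wording below is invented, and `generate` is a hypothetical `str -> str` callable standing in for whatever LLM client you use. The function name `deframe_style_respond` is likewise hypothetical.

```python
from typing import Callable

# Hypothetical LLM interface: takes a prompt string, returns the model's reply.
Generate = Callable[[str], str]


def deframe_style_respond(question: str, generate: Generate) -> dict[str, str]:
    """Sketch of a three-stage, framing-aware prompting loop (prompt wording is hypothetical)."""
    # Initial answer to the prompt exactly as the user framed it.
    initial = generate(question)

    # Stage 1 (Framing Integration): ask for a semantically equivalent
    # rewording with the opposite framing (e.g. "better" <-> "worse").
    alternative = generate(
        "Rewrite the following question so it keeps the same meaning but uses "
        f"the opposite framing (positive vs. negative):\n{question}"
    )

    # Stage 2 (Guideline Generation): derive fairness guidelines that an
    # answer should satisfy under either framing.
    guidelines = generate(
        "List fairness guidelines that an answer should satisfy for BOTH of "
        f"these equivalent questions:\n1. {question}\n2. {alternative}"
    )

    # Stage 3 (Self-Revision): revise the initial answer against the guidelines
    # so that it would not change if the question were framed the other way.
    revised = generate(
        f"Question: {question}\nInitial answer: {initial}\n"
        f"Guidelines:\n{guidelines}\n"
        "Revise the initial answer so it follows the guidelines and would stay "
        "the same under the opposite framing. Return only the revised answer."
    )
    return {
        "initial": initial,
        "alternative_framing": alternative,
        "guidelines": guidelines,
        "revised": revised,
    }
```

Because `generate` is just a callable, the same function can wrap a local model or a hosted API, and passing a stub makes the control flow easy to test. Returning every intermediate output also lets you inspect which stage changed the answer.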
In practice
- Quantify "framing disparity" by comparing bias under positive and negative framings of the same prompt.
- Use multi-stage prompting to improve LLM consistency across framings.
- Consider dual-process theory when designing debiasing workflows.
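A simple way to start testing framings is to build prompt pairs that ask the same question with opposite framing and check whether the model's answer flips. This is a minimal sketch, assuming yes/no questions and templates that follow the summary's "A is better than B" vs. "B is worse than A" pattern; the templates, `framed_pairs`, `normalize_answer`, and `framing_consistency` are hypothetical and are not how DoNotAnswer-Framed or 70Decisions-Framed were constructed.

```python
from typing import Callable

# Hypothetical templates: each pair asks the same comparison with opposite
# framing, so a consistent model should give the same yes/no answer to both.
FRAMING_TEMPLATES = (
    ("Is {a} better than {b} at {task}? Answer yes or no.",
     "Is {b} worse than {a} at {task}? Answer yes or no."),
)


def framed_pairs(a: str, b: str, task: str) -> list[tuple[str, str]]:
    """Build (positive, negative) prompt pairs that are semantically equivalent."""
    return [
        (pos.format(a=a, b=b, task=task), neg.format(a=a, b=b, task=task))
        for pos, neg in FRAMING_TEMPLATES
    ]


def normalize_answer(answer: str) -> str:
    """Reduce a free-text reply to its lowercased first word, minus punctuation."""
    words = answer.strip().lower().split()
    return words[0].strip(".,!") if words else ""


def framing_consistency(
    pairs: list[tuple[str, str]], generate: Callable[[str], str]
) -> float:
    """Fraction of pairs where the model answers identically under both framings."""
    if not pairs:
        raise ValueError("need at least one framing pair")
    matches = sum(
        normalize_answer(generate(pos)) == normalize_answer(generate(neg))
        for pos, neg in pairs
    )
    return matches / len(pairs)
```

A consistency rate below 1.0 flags prompts whose answer depends on wording alone; those are the cases worth re-running with a framing-aware method and scoring with a disparity metric.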
Topics
- Large Language Models
- Bias Mitigation
- Fairness Evaluation
- Framing Effect
- Prompt Engineering
- DeFrame Framework
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.