Investigating Counterfactual Unfairness in LLMs towards Identities through Humor
Summary
A study investigated counterfactual unfairness in Large Language Models (LLMs) by analyzing how their responses to humor change when speaker and target identities are swapped. The research employed a framework spanning three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, covering both identity-agnostic and identity-specific disparagement humor. Using interpretable bias metrics, experiments on models including Claude 3.5 Haiku, GPT-4o, DeepSeek-Reasoner, Gemini 2.5 Flash-Lite, and Grok 4 revealed consistent relational disparities. Specifically, jokes told by privileged speakers were refused up to 67.5% more often, judged malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These findings indicate that LLMs internalize social hierarchies and stereotypes, complicating efforts to achieve fairness and cultural alignment.
Key takeaway
For research scientists and engineers developing LLMs, this analysis highlights that current safety alignment strategies often encode fixed social hierarchies rather than genuine social reasoning. You should move beyond surface-level bias detection to implement dynamic, context-sensitive ethical reasoning in your models. Focus on evaluation frameworks that account for bidirectional bias and the interplay of multiple identity dimensions to prevent representational harms and foster true cultural alignment.
Key insights
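LLMs exhibit counterfactual unfairness: for the same joke, they apply stricter safety policies and attribute more malicious intent depending on which identity is the speaker and which is the target, with privileged speakers treated more harshly.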
Principles
- Humor reveals latent social assumptions in LLMs.
- Bias is bidirectional, reflecting fixed social hierarchies.
- Intersectional identities can amplify or modulate bias effects.
Method
The study used identity swapping in humor generation, intention inference, and impact prediction tasks. It introduced Asymmetric Refusal Rate (ARR) and Speaker Effect (SE) metrics to quantify directional bias.
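To make the idea of a directional metric concrete, here is a minimal sketch of how such quantities could be computed. The summary does not give the exact definitions of ARR or SE, so the formulas below are assumptions: ARR is taken as the difference in refusal rates between the original and the identity-swapped prompts, and SE as the difference in mean ratings between the two speaker directions. The function names and the toy data are hypothetical and are not results from the study.

```python
from statistics import mean


def refusal_rate(refused):
    """Fraction of prompts the model refused; `refused` is a list of booleans."""
    return sum(refused) / len(refused)


def asymmetric_refusal_rate(refused_original, refused_swapped):
    """ASSUMED definition: refusal rate on the original prompts minus the
    refusal rate on the same prompts with speaker and target swapped.
    A positive value means the original direction is refused more often."""
    return refusal_rate(refused_original) - refusal_rate(refused_swapped)


def speaker_effect(ratings_original, ratings_swapped):
    """ASSUMED definition: mean rating (e.g. harm on a 5-point scale) for the
    original direction minus the mean rating for the swapped direction."""
    return mean(ratings_original) - mean(ratings_swapped)


# Toy, made-up outcomes for four paired prompts (not data from the paper).
refused_original = [True, True, False, True]    # privileged speaker, marginalized target
refused_swapped = [False, True, False, False]   # same jokes, identities swapped

harm_original = [4.0, 5.0, 3.0]
harm_swapped = [2.0, 3.0, 1.0]

arr = asymmetric_refusal_rate(refused_original, refused_swapped)
se = speaker_effect(harm_original, harm_swapped)
print(f"ARR = {arr:+.2f}, SE = {se:+.2f}")
```

Under these assumed definitions, a value near zero would indicate that the model treats both directions alike, while the sign shows which direction is penalized.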
In practice
- Evaluate LLM fairness using counterfactual identity swaps.
- Prioritize bias mitigation in wealth, body, and physical disability categories.
- Develop context-aware safety evaluations beyond static rules.
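The first recommendation above, evaluating with counterfactual identity swaps, can be sketched as a small prompt-pair generator. This is an illustration under stated assumptions, not the study's actual prompts: the template wording, the identity pairs, and the helper name `counterfactual_pair` are all hypothetical.

```python
# Hypothetical prompt template; the study's real prompts are not given here.
TEMPLATE = (
    "A {speaker} person tells a joke about a {target} person. "
    "Is the speaker's intent malicious? Answer yes or no."
)

# Hypothetical identity pairs, loosely based on categories named in the summary.
IDENTITY_PAIRS = [
    ("wealthy", "poor"),
    ("non-disabled", "physically disabled"),
]


def counterfactual_pair(template, identity_a, identity_b):
    """Return (original, swapped) prompts that differ only in which
    identity is the speaker and which is the target."""
    original = template.format(speaker=identity_a, target=identity_b)
    swapped = template.format(speaker=identity_b, target=identity_a)
    return original, swapped


def build_eval_set(template, pairs):
    """Build one (original, swapped) prompt pair per identity pair."""
    return [counterfactual_pair(template, a, b) for a, b in pairs]


eval_set = build_eval_set(TEMPLATE, IDENTITY_PAIRS)
for original, swapped in eval_set:
    print(original)
    print(swapped)
    print()
```

Each pair would then be sent to the model under test, and the two responses compared. Because everything except the speaker and target roles is held constant, any systematic difference in responses can be attributed to the swap.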
Topics
- Counterfactual Unfairness
- Large Language Models
- Bias Detection
- Computational Humor
- Identity Bias
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.