Investigating Counterfactual Unfairness in LLMs towards Identities through Humor

2024-08-06 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Ethics & Fairness · Depth: Advanced, extended

Summary

This study investigates counterfactual unfairness in Large Language Models (LLMs) by analyzing how their responses to humor change when speaker and target identities are swapped. The framework spans three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction. It covers both identity-agnostic and identity-specific disparagement humor. Using interpretable bias metrics, experiments on models including Claude 3.5 Haiku, GPT-4o, DeepSeek-Reasoner, Gemini 2.5 Flash-Lite, and Grok 4 revealed consistent relational disparities. Specifically, jokes told by privileged speakers were refused up to 67.5% more often, judged as malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These findings indicate that LLMs internalize social hierarchies and stereotypes, complicating efforts to achieve fairness and cultural alignment.

Key takeaway

For research scientists and engineers developing LLMs, this analysis highlights that current safety alignment strategies often encode fixed social hierarchies rather than genuine social reasoning. You should move beyond surface-level bias detection to implement dynamic, context-sensitive ethical reasoning in your models. Focus on evaluation frameworks that account for bidirectional bias and the interplay of multiple identity dimensions to prevent representational harms and foster true cultural alignment.

Key insights

LLMs exhibit counterfactual unfairness: for the same joke, they apply stricter safety policies and attribute more malicious intent depending on which identity is the speaker and which is the target.

Principles

Humor reveals latent social assumptions in LLMs.
Bias is bidirectional, reflecting fixed social hierarchies.
Intersectional identities can amplify or modulate bias effects.

Method

The study swapped speaker and target identities across three tasks: humor generation, intention inference, and impact prediction. It introduced two metrics, Asymmetric Refusal Rate (ARR) and Speaker Effect (SE), to quantify directional bias.
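
This summary does not reproduce the paper's exact formulas for ARR and SE, so the following is only a minimal sketch of the general idea behind a directional metric: score one speaker-target direction, score the swapped direction, and take the difference. The `Trial` record, the `directional_gap` helper, and the toy data are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    speaker: str   # identity of the joke teller
    target: str    # identity the joke is about
    refused: bool  # did the model refuse to generate the joke?
    harm: int      # predicted social harm on a 1-5 scale


def directional_gap(trials, group_a, group_b, score):
    """Mean score for a->b minus mean score for the swapped b->a direction.

    This is an illustrative stand-in, not the paper's ARR or SE definition.
    """
    def mean_score(speaker, target):
        values = [score(t) for t in trials
                  if t.speaker == speaker and t.target == target]
        if not values:
            raise ValueError(f"no trials for {speaker!r} -> {target!r}")
        return sum(values) / len(values)

    return mean_score(group_a, group_b) - mean_score(group_b, group_a)


# Toy data, invented purely for illustration.
trials = [
    Trial("wealthy", "low-income", True, 4),
    Trial("wealthy", "low-income", True, 4),
    Trial("wealthy", "low-income", False, 5),
    Trial("wealthy", "low-income", True, 3),
    Trial("low-income", "wealthy", False, 3),
    Trial("low-income", "wealthy", True, 3),
    Trial("low-income", "wealthy", False, 2),
    Trial("low-income", "wealthy", False, 4),
]

refusal_gap = directional_gap(trials, "wealthy", "low-income",
                              lambda t: float(t.refused))
harm_gap = directional_gap(trials, "wealthy", "low-income",
                           lambda t: float(t.harm))
print(f"refusal gap: {refusal_gap:+.2f}")
print(f"harm gap:    {harm_gap:+.2f}")
```

A gap of zero would mean the model treats both directions alike. A positive gap means the first direction is refused, or rated as harmful, more often than its swap.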

In practice

Evaluate LLM fairness using counterfactual identity swaps.
Prioritize bias mitigation in wealth, body, and physical disability categories.
Develop context-aware safety evaluations beyond static rules.
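
As a starting point for the first recommendation, counterfactual swap pairs can be built by filling one prompt template twice with the speaker and target roles exchanged. This is a sketch under stated assumptions: the template wording, the group labels, and the `counterfactual_pairs` helper are hypothetical and are not the prompts used in the study.

```python
from itertools import combinations

# Hypothetical template; the wording is an assumption, not the study's prompt.
TEMPLATE = (
    "Speaker: a {speaker} person. Target: a {target} person. "
    "Task: write a joke the speaker might tell about the target."
)


def counterfactual_pairs(groups, template=TEMPLATE):
    """Return (original, swapped) prompt pairs for every unordered pair of groups."""
    pairs = []
    for a, b in combinations(groups, 2):
        original = template.format(speaker=a, target=b)
        swapped = template.format(speaker=b, target=a)  # same text, roles exchanged
        pairs.append((original, swapped))
    return pairs


pairs = counterfactual_pairs(["wealthy", "low-income", "middle-income"])
for original, swapped in pairs:
    print(original)
    print(swapped)
    print()
```

Because the two prompts in a pair differ only in who speaks and who is targeted, any systematic difference in the model's responses can be attributed to that role assignment rather than to the joke content.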

Topics

Counterfactual Unfairness
Large Language Models
Bias Detection
Computational Humor
Identity Bias

Code references

shubinkim/humor-counterfactual-unfairness

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.