Investigating Counterfactual Unfairness in LLMs towards Identities through Humor

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

A new study investigates counterfactual unfairness in large language models (LLMs) by measuring how their responses to humor change when speaker and addressee identities are swapped. The framework covers three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, encompassing both identity-agnostic and identity-specific disparagement humor. Using interpretable bias metrics, the study finds consistent relational disparities across state-of-the-art models: jokes from privileged speakers are refused up to 67.5% more often, judged as malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These findings indicate that sensitivity and stereotyping coexist in LLMs, complicating fairness and cultural alignment efforts.

Key takeaway

For research scientists developing or deploying LLMs, understanding how models internalize social assumptions is critical. Your fairness evaluations should extend beyond basic content moderation to include nuanced counterfactual scenarios, such as identity swaps in humor. This will help uncover subtle biases and improve cultural alignment, ensuring models do not inadvertently perpetuate harmful stereotypes.
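
As a rough illustration of what an identity-swap evaluation set could look like, the sketch below builds prompt pairs that differ only in who speaks and who is addressed. The template wording, the `SwapPair` type, the `build_swap_pairs` helper, and the placeholder identity labels are all hypothetical; none of them come from the study.

```python
from itertools import combinations
from typing import NamedTuple


class SwapPair(NamedTuple):
    """Two prompts that are identical except for swapped roles (hypothetical type)."""
    original: str
    swapped: str


def build_swap_pairs(template: str, identities: list[str]) -> list[SwapPair]:
    """Fill the template once per unordered identity pair, plus its role-swapped twin."""
    pairs = []
    for a, b in combinations(identities, 2):
        pairs.append(
            SwapPair(
                original=template.format(speaker=a, addressee=b),
                swapped=template.format(speaker=b, addressee=a),
            )
        )
    return pairs


# Assumed prompt wording and neutral placeholder labels, for illustration only.
TEMPLATE = "A {speaker} tells a joke about a {addressee}. Is the intent malicious?"
pairs = build_swap_pairs(TEMPLATE, ["member of group A", "member of group B", "member of group C"])

for pair in pairs:
    print(pair.original)
    print(pair.swapped)
```

Holding everything but the roles fixed means that any difference between a model's answers to `original` and `swapped` can be attributed to the identity swap rather than to the joke content.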

Key insights

LLMs exhibit counterfactual unfairness in humor: their refusals and judgments shift when speaker and addressee identities are swapped.

Method

The framework evaluates three tasks (humor generation refusal, speaker intention inference, and relational/societal impact prediction) and applies interpretable bias metrics to compare model outputs before and after identity swaps.
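
The summary does not give the study's exact metric definitions, so the following is only a minimal sketch of how a swap-based gap could be computed. It assumes a simple absolute difference between the original and swapped conditions; the function names `rate_gap` and `score_gap` are hypothetical, and the toy numbers are invented for illustration, not results from the study.

```python
def rate_gap(original: list[bool], swapped: list[bool]) -> float:
    """Gap in a binary outcome rate (e.g. refusal, 'malicious' label), in percentage points."""
    if not original or len(original) != len(swapped):
        raise ValueError("need two non-empty, equally sized outcome lists")
    rate_original = sum(original) / len(original)
    rate_swapped = sum(swapped) / len(swapped)
    return 100.0 * (rate_original - rate_swapped)


def score_gap(original: list[float], swapped: list[float]) -> float:
    """Gap in mean rating (e.g. social harm on a 1-5 scale), in scale points."""
    if not original or len(original) != len(swapped):
        raise ValueError("need two non-empty, equally sized score lists")
    return sum(original) / len(original) - sum(swapped) / len(swapped)


# Toy outcomes for four prompt pairs (invented, not from the study).
refused_original = [True, True, False, True]    # refusal rate 0.75
refused_swapped = [False, True, False, False]   # refusal rate 0.25
harm_original = [4.0, 5.0, 3.0, 4.0]            # mean 4.0
harm_swapped = [3.0, 3.0, 2.0, 4.0]             # mean 3.0

refusal_gap = rate_gap(refused_original, refused_swapped)
harm_gap = score_gap(harm_original, harm_swapped)
print(refusal_gap, harm_gap)
```

A gap near zero would mean the model treats both directions of the swap alike. A relative difference (a ratio of rates) is an equally plausible definition and would yield different numbers from the same outcomes.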

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, NLP Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.