ProText: A Benchmark Dataset for Measuring (Mis)gendering in Long-Form Texts
Summary
ProText is a new benchmark dataset designed to measure gendering and misgendering in stylistically diverse, long-form English texts, particularly within text transformations performed by Large Language Models (LLMs). The dataset categorizes text along three dimensions: Theme nouns (names, occupations, kinship terms), Theme category (stereotypically male, female, or gender-neutral), and Pronoun category (masculine, feminine, gender-neutral, or none). ProText extends beyond traditional pronoun resolution benchmarks and the gender binary to specifically probe biases in summarization and rewrite tasks. A mini case study using two prompts and two models validated ProText, revealing systematic gender bias, especially when inputs lack explicit gender cues or when models default to heteronormative assumptions.
Key takeaway
Research scientists developing or deploying LLMs should integrate ProText into their evaluation pipelines to systematically identify and quantify gender bias, especially in text transformation tasks such as summarization. Knowing how a model handles gender-neutral inputs, and which assumptions it falls back on when gender cues are absent, is a prerequisite for reducing misgendering and stereotyping in its outputs.
Key insights
ProText measures gendering and misgendering in LLM text transformations and is not limited to the gender binary. In the mini case study, it surfaced systematic gender bias in both models tested.
Principles
- Gender bias persists in LLM text transformations.
- Bias is most pronounced when inputs lack explicit gender cues.
- Models can default to heteronormative assumptions.
Method
ProText categorizes text by Theme nouns (names, occupations, kinship terms), Theme category (stereotypically male, stereotypically female, or gender-neutral), and Pronoun category (masculine, feminine, gender-neutral, or none) to probe (mis)gendering in LLM summarization and rewrite tasks.
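The three dimensions can be pictured as a small grid of categories. The sketch below is only an illustration of that grid: the class names, field names, and string values are hypothetical and do not reflect the dataset's actual schema, and it is not stated here whether ProText populates every combination.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import List


# All names below are illustrative assumptions, not the dataset's real schema.
class ThemeNoun(Enum):
    NAME = "name"
    OCCUPATION = "occupation"
    KINSHIP = "kinship"


class ThemeCategory(Enum):
    STEREOTYPICALLY_MALE = "male"
    STEREOTYPICALLY_FEMALE = "female"
    GENDER_NEUTRAL = "neutral"


class PronounCategory(Enum):
    MASCULINE = "masculine"
    FEMININE = "feminine"
    GENDER_NEUTRAL = "neutral"
    NONE = "none"


@dataclass(frozen=True)
class Cell:
    """One combination of the three dimensions described in the summary."""
    theme_noun: ThemeNoun
    theme_category: ThemeCategory
    pronoun_category: PronounCategory


def all_cells() -> List[Cell]:
    """Enumerate every possible combination of the three dimensions."""
    return [
        Cell(noun, theme, pronoun)
        for noun, theme, pronoun in product(ThemeNoun, ThemeCategory, PronounCategory)
    ]
```

Enumerating the grid this way makes it easy to report results per cell, for example to compare outputs for a stereotypically male occupation paired with feminine pronouns against the same occupation with no pronouns at all.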
In practice
- Evaluate LLMs for gender bias in summarization and rewrite tasks.
- Test models with inputs that carry no explicit gender cues.
- Check outputs for heteronormative default assumptions.
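One way to start on the checks above is to compare the pronouns in a model's output with those in its input. The following is a minimal sketch under stated assumptions: the word lists, function names, and the "introduced category" heuristic are hypothetical and are not the metric used by ProText. It is also crude, since it cannot tell singular "they" from plural "they" or resolve which person a pronoun refers to.

```python
import re
from collections import Counter

# Hypothetical, non-exhaustive pronoun lists chosen for illustration only.
PRONOUNS = {
    "masculine": {"he", "him", "his", "himself"},
    "feminine": {"she", "her", "hers", "herself"},
    "neutral": {"they", "them", "their", "theirs", "themself", "themselves"},
}


def pronoun_counts(text: str) -> Counter:
    """Count pronoun occurrences per category using simple word matching."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for category, words in PRONOUNS.items():
            if token in words:
                counts[category] += 1
    return counts


def introduced_categories(source: str, output: str) -> set:
    """Return pronoun categories that appear in the output but not in the source."""
    src = pronoun_counts(source)
    out = pronoun_counts(output)
    return {category for category in out if src[category] == 0}


# Example: the input has no pronouns, but the summary introduces feminine ones.
source_text = "The nurse finished the shift and went home."
model_output = "She finished her shift and went home."
flagged = introduced_categories(source_text, model_output)
```

A non-empty result flags an output for manual review. It signals that the model added gender information the input did not contain, which is the situation the summary describes for inputs lacking explicit gender cues.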
Topics
- ProText Dataset
- Gender Bias Measurement
- Large Language Models
- Machine Translation
- Grammatical Gender
Best for: Research Scientist, AI Scientist, NLP Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.