ProText: A Benchmark Dataset for Measuring (Mis)gendering in Long-Form Texts

2026-03-31 · Source: Apple Machine Learning Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

ProText is a new benchmark dataset designed to measure gendering and misgendering in stylistically diverse, long-form English texts, particularly within text transformations performed by Large Language Models (LLMs). The dataset categorizes text along three dimensions: Theme nouns (names, occupations, kinship terms), Theme category (stereotypically male, female, or gender-neutral), and Pronoun category (masculine, feminine, gender-neutral, or none). ProText extends beyond traditional pronoun resolution benchmarks and the gender binary to specifically probe biases in summarization and rewrite tasks. A mini case study using two prompts and two models validated ProText, revealing systematic gender bias, especially when inputs lack explicit gender cues or when models default to heteronormative assumptions.

Key takeaway

Research scientists developing or deploying LLMs should integrate ProText into their evaluation pipelines to identify and quantify gender bias, especially in text transformation tasks such as summarization. Understanding how a model handles gender-neutral inputs, and which assumptions it falls back on when gender cues are absent, is key to mitigating misgendering and stereotyping.

Key insights

ProText measures gendering and misgendering in LLM text transformations, with categories that extend beyond the gender binary. A mini case study found systematic gender bias, especially when inputs lack explicit gender cues.

Principles

Gender bias persists in LLMs.
Bias is most pronounced when inputs lack explicit gender cues.
Models can default to heteronormative assumptions.

Method

ProText categorizes text by Theme nouns (names, occupations, kinship terms), Theme category (stereotypically male, stereotypically female, or gender-neutral), and Pronoun category (masculine, feminine, gender-neutral, or none) to probe (mis)gendering in LLM summarization and rewrite tasks.
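
To make the three dimensions concrete, here is a minimal Python sketch of how one record could be typed. The class names, field names, and the `all_cells` helper are illustrative assumptions, not the dataset's actual schema. The source also does not say whether every combination of the three dimensions is populated, so the grid below is only the size a full crossing would have.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class ThemeNoun(Enum):
    NAME = "name"
    OCCUPATION = "occupation"
    KINSHIP = "kinship term"


class ThemeCategory(Enum):
    STEREOTYPICALLY_MALE = "stereotypically male"
    STEREOTYPICALLY_FEMALE = "stereotypically female"
    GENDER_NEUTRAL = "gender-neutral"


class PronounCategory(Enum):
    MASCULINE = "masculine"
    FEMININE = "feminine"
    GENDER_NEUTRAL = "gender-neutral"
    NONE = "none"


@dataclass(frozen=True)
class ProTextRecord:
    """Hypothetical record layout; field names are assumptions."""
    text: str
    theme_noun: ThemeNoun
    theme_category: ThemeCategory
    pronoun_category: PronounCategory


def all_cells():
    """Every (noun type, theme category, pronoun category) combination."""
    return list(product(ThemeNoun, ThemeCategory, PronounCategory))


# An invented example text, not taken from the dataset.
example = ProTextRecord(
    text="The engineer reviewed the design before the meeting.",
    theme_noun=ThemeNoun.OCCUPATION,
    theme_category=ThemeCategory.STEREOTYPICALLY_MALE,
    pronoun_category=PronounCategory.NONE,
)
```

Typing each dimension as an enum keeps per-cell aggregation straightforward, for example when grouping model outputs by theme category and pronoun category.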

In practice

Evaluate LLMs for gender bias.
Test models with gender-neutral inputs.
Analyze heteronormative defaults.
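
As a rough illustration of the second item, the sketch below flags outputs that introduce gendered pronouns absent from the input. It is a crude lexical heuristic under stated assumptions, not the evaluation method used in the original article: the pronoun lists are incomplete (no neopronouns), the check does not resolve who a pronoun refers to, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed, deliberately small pronoun lexicon for illustration only.
PRONOUNS = {
    "masculine": {"he", "him", "his", "himself"},
    "feminine": {"she", "her", "hers", "herself"},
    "neutral": {"they", "them", "their", "theirs", "themself", "themselves"},
}


def pronoun_counts(text):
    """Count pronoun tokens per category using simple word matching."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for category, words in PRONOUNS.items():
            if token in words:
                counts[category] += 1
    return counts


def introduced_gender(source, output):
    """Return gendered categories present in the output but not in the source."""
    src = pronoun_counts(source)
    out = pronoun_counts(output)
    return sorted(
        category
        for category in ("masculine", "feminine")
        if out[category] > 0 and src[category] == 0
    )


# Invented texts: the input has no pronouns, the "summary" adds feminine ones.
source_text = "The nurse finished the shift and went home."
model_output = "She finished her shift and went home."
flags = introduced_gender(source_text, model_output)
```

A flagged output is only a candidate for review. Whether it counts as misgendering depends on the referent, which this sketch does not resolve.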

Topics

ProText Dataset
Gender Bias Measurement
Large Language Models
Machine Translation
Grammatical Gender

Best for: Research Scientist, AI Scientist, NLP Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.