GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

GRUFF is a new large-scale dataset for measuring the pronoun fidelity of Large Language Models in German. Unlike English, German has four gender agreement systems in nouns and four sets of pronouns, which makes it a more demanding setting for studying referential reasoning and bias. Experiments with GRUFF show that LLMs achieve strong grammatical agreement for masculine and feminine entities when no explicit context is given, but struggle with neopronouns such as "xier" and "en". Models are also generally not robust to distracting discourse entities. Encoder-only models were more robust in German than in English, which underscores the role of grammatical gender. Occupational stereotypes correlated poorly across grammatical cases and across most models, except for models with closely related architectures. The authors have released all code and data to support further research.

Key takeaway

NLP engineers developing German language models should prioritize evaluating pronoun fidelity, especially for neopronouns such as "xier" and "en". GRUFF shows that current LLMs struggle with these and with robustness to distractors. Add targeted tests for grammatical gender agreement and distractor robustness so that failures and potential biases surface before deployment. Doing so can improve model fairness and accuracy in German applications.

Key insights

GRUFF shows that LLMs struggle in German with neopronouns and with robustness to distractors, despite strong grammatical agreement for masculine and feminine entities.

Principles

Grammatical gender significantly impacts LLM robustness.
Neopronoun support in LLMs is currently weak.
Distractors generally reduce LLM pronoun fidelity.

Method

GRUFF measures the pronoun fidelity of LLMs in German across four gender agreement systems and four pronoun sets. It tests whether a model reuses the pronoun previously specified for an entity, including when distracting discourse entities are present.
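To make the metric concrete, here is a minimal sketch of how pronoun fidelity could be scored per pronoun set. It is not the authors' released code: the record keys `expected` and `predicted`, the function name `pronoun_fidelity`, and the sample records are assumptions made for illustration, and only nominative forms are used.

```python
from collections import defaultdict


def pronoun_fidelity(records):
    """Return accuracy per expected pronoun.

    Each record is a dict with two hypothetical keys (not from GRUFF):
      'expected'  - the pronoun introduced for the entity earlier in the context
      'predicted' - the pronoun the model chose for the later slot
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        expected = record["expected"].strip().lower()
        predicted = record["predicted"].strip().lower()
        totals[expected] += 1
        if predicted == expected:
            hits[expected] += 1
    return {pronoun: hits[pronoun] / totals[pronoun] for pronoun in totals}


# Made-up records, only to show the shape of the computation.
records = [
    {"expected": "sie", "predicted": "sie"},
    {"expected": "sie", "predicted": "er"},
    {"expected": "xier", "predicted": "er"},
    {"expected": "en", "predicted": "en"},
]
scores = pronoun_fidelity(records)
print(scores)
```

Reporting accuracy per pronoun rather than one pooled number keeps weak neopronoun performance from being hidden behind strong results on "er" and "sie".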

In practice

Evaluate LLMs for neopronoun handling.
Test model robustness against discourse distractors.
Consider grammatical gender's role in model design.
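One way to act on the distractor check above is to run each item twice, once without and once with a distracting entity, and compare accuracy. The sketch below assumes this paired setup; the helper names `accuracy`, `distractor_drop`, and `flips_to_distractor`, and all sample values, are hypothetical and do not come from the GRUFF release.

```python
def accuracy(predictions, expected):
    """Fraction of predictions that match the expected pronoun."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    correct = sum(p == e for p, e in zip(predictions, expected))
    return correct / len(expected)


def distractor_drop(expected, clean_preds, distracted_preds):
    """Accuracy lost when a distracting entity is added to the context."""
    return accuracy(clean_preds, expected) - accuracy(distracted_preds, expected)


def flips_to_distractor(expected, distracted_preds, distractor_pronouns):
    """Count errors where the model copied the distractor's pronoun."""
    return sum(
        pred == distractor and pred != gold
        for gold, pred, distractor in zip(expected, distracted_preds, distractor_pronouns)
    )


# Made-up paired predictions for four items.
expected = ["sie", "sie", "xier", "er"]
clean_preds = ["sie", "sie", "xier", "er"]
distracted_preds = ["sie", "er", "er", "er"]
distractor_pronouns = ["er", "er", "er", "sie"]

drop = distractor_drop(expected, clean_preds, distracted_preds)
flips = flips_to_distractor(expected, distracted_preds, distractor_pronouns)
print(drop, flips)
```

Counting flips toward the distractor's pronoun separately from the overall drop helps distinguish a model that is misled by the distractor from one that is simply wrong.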

Topics

GRUFF Dataset
Pronoun Fidelity
German Language Models
Grammatical Gender
Neopronouns
Stereotypical Bias

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.