GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German
Summary
GRUFF is a new large-scale dataset for measuring the pronoun fidelity of large language models (LLMs) in German, that is, their ability to reuse previously specified pronouns even when distractors are present. Unlike English, German has four different gender agreement systems in nouns and four sets of pronouns, which makes it a more complex setting for studying referential reasoning and bias. Experiments with GRUFF show that LLMs exhibit strong grammatical agreement for masculine and feminine entities when explicit context is absent, but struggle with neopronouns such as "xier" and "en". Models are also generally not robust to distracting discourse entities. Encoder-only models, however, were more robust in German than in English, which underscores the role of grammatical gender. Occupational stereotypes correlated poorly across grammatical cases and across most models, except between models with closely related architectures. The authors have released all code and data to support further research.
Key takeaway
NLP engineers developing German language models should prioritize evaluating pronoun fidelity, especially for neopronouns such as "xier" and "en". GRUFF shows that current LLMs struggle both with these pronouns and with robustness to distractors. Add targeted tests for grammatical gender agreement and distractor robustness so that your models handle these cases and carry fewer biases, which should improve fairness and accuracy in German applications.
Key insights
The GRUFF dataset shows that LLMs struggle with neopronouns and distractor robustness in German, despite strong grammatical gender agreement.
Principles
- Grammatical gender significantly impacts LLM robustness.
- Neopronoun support in LLMs is currently weak.
- Distractors generally reduce LLM pronoun fidelity.
Method
GRUFF measures pronoun fidelity, the ability to reuse previously specified pronouns even when distractors are present, for LLMs in German across four gender agreement systems and four pronoun sets.
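The scoring idea behind this kind of evaluation can be sketched as per-pronoun accuracy. The snippet below is a minimal illustration and not the authors' released code: the `FidelityExample` layout, the `choose` callable, the two German sentences, and the assumption that the four pronoun sets are er, sie, xier, and en are all hypothetical stand-ins.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class FidelityExample:
    """One hypothetical test item (an assumed layout, not the GRUFF schema)."""
    context: str                # introduces an entity and its pronoun
    candidates: Sequence[str]   # pronouns the model may choose from
    expected: str               # pronoun established in the context


def fidelity_by_pronoun(
    examples: Sequence[FidelityExample],
    choose: Callable[[str, Sequence[str]], str],
) -> Dict[str, float]:
    """Return the share of correctly reused pronouns, grouped by pronoun."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for ex in examples:
        totals[ex.expected] += 1
        if choose(ex.context, ex.candidates) == ex.expected:
            hits[ex.expected] += 1
    return {pronoun: hits[pronoun] / totals[pronoun] for pronoun in totals}


# Assumed pronoun inventory; "xier" and "en" are the neopronouns named above.
PRONOUNS = ("er", "sie", "xier", "en")

# Invented sentences for illustration only; "___" marks the slot to fill.
EXAMPLES = [
    FidelityExample(
        context="Der Bäcker war müde, weil er früh aufgestanden war. "
                "Danach ging ___ nach Hause.",
        candidates=PRONOUNS,
        expected="er",
    ),
    FidelityExample(
        context="Die Person am Empfang war müde, weil xier früh aufgestanden war. "
                "Danach ging ___ nach Hause.",
        candidates=PRONOUNS,
        expected="xier",
    ),
]


def always_er(context: str, candidates: Sequence[str]) -> str:
    """Toy baseline that ignores the context; swap in a real model call here."""
    return "er"


scores = fidelity_by_pronoun(EXAMPLES, always_er)
print(scores)
```

Reporting accuracy per pronoun rather than as a single number keeps weak neopronoun performance from being hidden behind strong results on more frequent pronouns.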
In practice
- Evaluate LLMs for neopronoun handling.
- Test model robustness against discourse distractors.
- Consider grammatical gender's role in model design.
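One way to turn the distractor check above into a test is to compare accuracy on the same items with and without intervening sentences about another entity. The sketch below makes several assumptions: the `(introduction, query, expected)` case layout, the invented sentences, and the `last_pronoun_heuristic` stand-in are all hypothetical, and the heuristic is deliberately naive so that the accuracy drop is visible.

```python
from typing import Callable, Sequence, Tuple

# (introduction, query with a "___" slot, expected pronoun); assumed layout.
Case = Tuple[str, str, str]

PRONOUNS = ("er", "sie", "xier", "en")


def accuracy_with_distractors(
    cases: Sequence[Case],
    distractors: Sequence[str],
    choose: Callable[[str], str],
    n_distractors: int,
) -> float:
    """Accuracy when n_distractors sentences sit between introduction and query."""
    correct = 0
    for intro, query, expected in cases:
        context = " ".join([intro, *distractors[:n_distractors], query])
        if choose(context) == expected:
            correct += 1
    return correct / len(cases)


def last_pronoun_heuristic(context: str) -> str:
    """Toy stand-in for a model: returns the most recently mentioned pronoun."""
    tokens = [word.strip(".,").lower() for word in context.split()]
    mentioned = [token for token in tokens if token in PRONOUNS]
    return mentioned[-1] if mentioned else ""


# Invented sentences for illustration only.
CASES = [
    ("Der Bäcker sagte, dass er müde war.", "Danach ging ___ nach Hause.", "er"),
    ("Die Person am Empfang sagte, dass xier müde war.",
     "Danach ging ___ nach Hause.", "xier"),
]
DISTRACTORS = ["Die Ärztin sagte, dass sie Hunger hatte."]

clean = accuracy_with_distractors(CASES, DISTRACTORS, last_pronoun_heuristic, 0)
distracted = accuracy_with_distractors(CASES, DISTRACTORS, last_pronoun_heuristic, 1)
print(f"clean={clean:.2f} distracted={distracted:.2f} gap={clean - distracted:.2f}")
```

A recency-driven chooser scores perfectly without distractors and fails as soon as another entity's pronoun intervenes, which is the failure pattern this kind of robustness test is meant to surface.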
Topics
- GRUFF Dataset
- Pronoun Fidelity
- German Language Models
- Grammatical Gender
- Neopronouns
- Stereotypical Bias
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.