MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
Summary
MORPHOGEN is a large-scale multilingual benchmark for evaluating how well Large Language Models (LLMs) handle grammatical gender and morphological agreement in three typologically diverse languages: French, Arabic, and Hindi. Its core task, GENFORM, asks a model to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. The researchers constructed a high-quality synthetic dataset for these languages and benchmarked 15 popular multilingual LLMs, ranging from 2 billion to 70 billion parameters. The evaluation revealed significant gaps in how current models handle morphological gender: larger models generally outperformed smaller ones, particularly in morphologically complex languages such as Arabic. The benchmark also uncovered notable gender biases. Model outputs in French and Arabic often defaulted to masculine forms, while some models showed a feminine skew in Hindi.
Key takeaway
AI engineers who develop or deploy multilingual LLMs need gender-aware morphological generation to work correctly if their applications are to be inclusive. Evaluate your models on benchmarks like MORPHOGEN, paying close attention to performance gaps in morphologically rich languages and to any masculine or feminine bias in the outputs. Prefer larger models for morphologically complex languages such as Arabic, where the benchmark found the biggest advantage for model size, and apply targeted debiasing strategies so that outputs are both accurate and equitable.
Key insights
MORPHOGEN evaluates LLM gender-aware morphological generation across French, Arabic, and Hindi, revealing performance gaps and biases.
Principles
- Parameter count matters most for languages with complex morphology, such as Arabic.
- Gender bias varies significantly across languages and models.
- Models struggle with gender interference in multi-entity sentences.
Method
The GENFORM task prompts LLMs to rewrite first-person sentences in the opposite gender, preserving meaning and structure, using language-specific morphological rules and synthetic data generation.
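As a rough illustration of the task setup described above, the following Python sketch builds a rewriting prompt for one example. The prompt wording, the `GenformExample` class, and the function names are all hypothetical; this summary does not reproduce the benchmark's actual prompts or data schema.

```python
from dataclasses import dataclass

# Languages named in the summary. The template below is an assumption,
# not the benchmark's real prompt.
SUPPORTED_LANGUAGES = ("French", "Arabic", "Hindi")


@dataclass(frozen=True)
class GenformExample:
    """Hypothetical container for one first-person source sentence."""
    language: str
    source_gender: str  # "masculine" or "feminine"
    sentence: str


def opposite_gender(gender: str) -> str:
    """Return the target gender for a given source gender."""
    if gender == "masculine":
        return "feminine"
    if gender == "feminine":
        return "masculine"
    raise ValueError(f"unknown gender: {gender!r}")


def build_genform_prompt(example: GenformExample) -> str:
    """Build an illustrative instruction asking for an opposite-gender rewrite."""
    if example.language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {example.language!r}")
    target = opposite_gender(example.source_gender)
    return (
        f"Rewrite the following {example.language} sentence so that the "
        f"first-person speaker is {target}. Change only the words that must "
        "agree with the speaker's gender, and keep the meaning and structure "
        "otherwise unchanged.\n"
        f"Sentence: {example.sentence}\n"
        "Rewritten sentence:"
    )


example = GenformExample("French", "masculine", "Je suis content.")
prompt = build_genform_prompt(example)
```

For this French example, a correct rewrite would be "Je suis contente.", where only the adjective changes to agree with a feminine speaker.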
In practice
- Benchmark LLMs on MORPHOGEN for gender-aware generation.
- Analyze $\Delta\mathrm{SGA}$ scores to detect masculine/feminine bias.
- Use GIoU to penalize over-generation in gender transformations.
Topics
- MORPHOGEN Benchmark
- Gender-Aware Morphological Generation
- Multilingual LLMs
- Grammatical Gender
- French, Arabic, Hindi
Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.