MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

2026-04-22 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, extended

Summary

MORPHOGEN is a large-scale multilingual benchmark dataset for evaluating how well Large Language Models (LLMs) handle grammatical gender and morphological agreement in three typologically diverse languages: French, Arabic, and Hindi. Its core task, GENFORM, requires an LLM to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. The researchers constructed a high-quality synthetic dataset for these languages and benchmarked 15 popular multilingual LLMs ranging from 2 billion to 70 billion parameters. The evaluation revealed significant gaps in how current models handle morphological gender; larger models generally outperformed smaller ones, particularly in languages with complex morphology such as Arabic. The benchmark also uncovered notable gender biases: models often defaulted to masculine forms in French and Arabic, while some models showed a feminine skew in Hindi.

Key takeaway

For AI engineers developing or deploying multilingual LLMs, gender-aware morphological generation matters for inclusive applications. Evaluate your models on benchmarks like MORPHOGEN, paying close attention to performance gaps in morphologically rich languages and to masculine or feminine biases. Because larger models generally performed better on complex morphology, favor higher parameter counts for languages such as Arabic, and apply targeted debiasing strategies to keep outputs equitable and accurate.

Key insights

MORPHOGEN evaluates LLM gender-aware morphological generation across French, Arabic, and Hindi, revealing performance gaps and biases.

Principles

Model size matters most for complex morphology.
Gender bias varies significantly across languages and models.
Models struggle with gender interference in multi-entity sentences.

Method

The GENFORM task prompts LLMs to rewrite first-person sentences in the opposite gender, preserving meaning and structure, using language-specific morphological rules and synthetic data generation.
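
To make the task format concrete, here is a minimal sketch of how a GENFORM-style prompt could be assembled. The prompt wording, the `GenformExample` type, and the `build_genform_prompt` function are all hypothetical illustrations; none of them is taken from the benchmark's repository, and the actual prompts may differ.

```python
from dataclasses import dataclass

# Hypothetical names throughout: this is an illustration, not the benchmark's API.
OPPOSITE = {"masculine": "feminine", "feminine": "masculine"}
LANGUAGES = {"French", "Arabic", "Hindi"}  # the three languages named in the summary


@dataclass(frozen=True)
class GenformExample:
    sentence: str       # first-person source sentence
    language: str       # one of LANGUAGES
    source_gender: str  # grammatical gender of the speaker in the source


def build_genform_prompt(example: GenformExample) -> str:
    """Build an instruction asking for the opposite-gender rewrite."""
    if example.language not in LANGUAGES:
        raise ValueError(f"unsupported language: {example.language}")
    if example.source_gender not in OPPOSITE:
        raise ValueError(f"unknown gender: {example.source_gender}")
    target = OPPOSITE[example.source_gender]
    # Assumed wording; the real task prompt is not given in this summary.
    return (
        f"Rewrite the following {example.language} first-person sentence "
        f"so the speaker is {target} instead of {example.source_gender}. "
        "Change only the words that must agree with the speaker's gender, "
        "and preserve the meaning and structure.\n\n"
        f"Sentence: {example.sentence}\n"
        "Rewritten:"
    )


prompt = build_genform_prompt(
    GenformExample("Je suis content.", "French", "masculine")
)
print(prompt)
```

For the French input above, a correct rewrite would be "Je suis contente.", where only the adjective changes to agree with a feminine speaker.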

In practice

Benchmark LLMs on MORPHOGEN for gender-aware generation.
Analyze $\Delta$SGA scores to detect masculine/feminine bias.
Use GIoU to penalize over-generation in gender transformations.
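
The two metrics named above are not defined in this summary, so the following is only a sketch under stated assumptions. It assumes SGA is a per-direction, sentence-level accuracy and that the SGA gap is the difference between the two rewrite directions. It also uses a token-level intersection-over-union of edited words (`edit_iou`, a hypothetical stand-in) to show how an IoU-style score can penalize over-generation. The paper's actual SGA and GIoU definitions may differ.

```python
from typing import Sequence, Set


def sga(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Assumed SGA: share of predictions equal to the reference
    after whitespace normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(
        " ".join(p.split()) == " ".join(r.split())
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def delta_sga(sga_to_masculine: float, sga_to_feminine: float) -> float:
    """Assumed sign convention: positive means the model is better at
    producing masculine forms, i.e. a masculine skew."""
    return sga_to_masculine - sga_to_feminine


def changed_tokens(source: str, rewritten: str) -> Set[str]:
    """Whitespace tokens present in the rewrite but not in the source."""
    return set(rewritten.split()) - set(source.split())


def edit_iou(source: str, prediction: str, reference: str) -> float:
    """IoU between the tokens the model changed and the tokens the
    reference changed. Extra edits enlarge the union and lower the score."""
    pred = changed_tokens(source, prediction)
    ref = changed_tokens(source, reference)
    union = pred | ref
    if not union:
        return 1.0  # nothing needed changing and nothing was changed
    return len(pred & ref) / len(union)


source = "Je suis content et fatigué."
reference = "Je suis contente et fatiguée."
over_generated = "Je suis contente et très fatiguée."  # inserts an extra word

print(edit_iou(source, reference, reference))
print(edit_iou(source, over_generated, reference))
```

In this sketch the over-generated rewrite scores lower than the exact one because the inserted word counts toward the union but not the intersection. A set-based comparison ignores word order and repeated tokens, which keeps the example short but would be too coarse for a real evaluation.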

Topics

MORPHOGEN Benchmark
Gender-Aware Morphological Generation
Multilingual LLMs
Grammatical Gender
French, Arabic, Hindi

Code references

arnav10goel/Morphogen

Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.