ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

The ConGA (Contextual Gender Annotation) framework provides linguistically grounded guidelines for word-level gender annotation. It targets a known weakness of Machine Translation (MT) systems and Large Language Models (LLMs): translating between gender-neutral and morphologically gendered languages. Because English is largely gender-neutral, MT systems often default to masculine forms when translating into languages like Italian, which require explicit grammatical gender agreement. ConGA annotates semantic gender in English with Masculine (M), Feminine (F), and Ambiguous (A) tags, and grammatical gender in Italian with Masculine (M) and Feminine (F) tags, and it adds entity-level identifiers so that the same referent can be tracked across sentences. Applying ConGA to the gENder-IT dataset produced a gold-standard resource, which reveals systematic masculine overuse and inconsistent feminine realization in current MT systems. The framework thus offers both a methodology and a benchmark for more gender-aware multilingual NLP.

Key takeaway

For AI scientists and research scientists developing or evaluating Machine Translation systems and Large Language Models, ConGA offers a concrete way to identify and mitigate gender bias. Its fine-grained, word-level annotation can be used to build gold-standard datasets and to benchmark gender performance, especially when translating between gender-neutral and morphologically gendered languages. This supports more equitable and accurate multilingual NLP systems.
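To make "benchmarking gender performance" concrete, here is a minimal Python sketch that compares gold Italian gender tags with the tags realized in a system's output. The function name `gender_report`, the metric names, and the toy data are all assumptions made for illustration; they are not the metrics or results reported in the original article.

```python
from collections import Counter


def gender_report(gold, predicted):
    """Compare gold and system gender tags ("M"/"F") for aligned words.

    Returns per-gender accuracy and the share of gold-feminine words
    that the system rendered as masculine (a hypothetical overuse metric).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned one-to-one")

    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    f_as_m = sum(1 for g, p in zip(gold, predicted) if g == "F" and p == "M")

    return {
        "acc_M": correct["M"] / total["M"] if total["M"] else None,
        "acc_F": correct["F"] / total["F"] if total["F"] else None,
        "f_rendered_as_m": f_as_m / total["F"] if total["F"] else None,
    }


# Invented toy data, not results from the paper.
gold = ["F", "F", "M", "F", "M"]
pred = ["M", "F", "M", "M", "M"]
report = gender_report(gold, pred)
print(report)
```

A gap between `acc_M` and `acc_F`, together with a high `f_rendered_as_m`, is the kind of pattern the summary describes as masculine overuse and inconsistent feminine realization.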

Key insights

ConGA provides a framework for fine-grained gender annotation to mitigate bias in machine translation.

Principles

Method

ConGA uses M/F/A tags for English semantic gender and M/F tags for Italian grammatical gender, combined with entity-level identifiers for cross-sentence tracking.
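As a rough illustration of the scheme just described, the following Python sketch models word-level annotations with per-language tag sets and an entity identifier. The class name `Annotation`, its fields, the helper `group_by_entity`, and the example sentences are assumptions for illustration only; they do not reproduce ConGA's actual annotation format.

```python
from collections import defaultdict
from dataclasses import dataclass

# Tag sets as described in the summary: English semantic gender allows
# Ambiguous (A); Italian grammatical gender is always M or F.
TAGS = {"en": {"M", "F", "A"}, "it": {"M", "F"}}


@dataclass(frozen=True)
class Annotation:
    token: str      # the annotated word
    lang: str       # "en" or "it"
    gender: str     # must be a valid tag for `lang`
    entity_id: int  # hypothetical ID linking mentions of one entity

    def __post_init__(self):
        if self.lang not in TAGS:
            raise ValueError(f"unknown language {self.lang!r}")
        if self.gender not in TAGS[self.lang]:
            raise ValueError(f"tag {self.gender!r} not valid for {self.lang!r}")


def group_by_entity(annotations):
    """Collect all mentions of each entity, across sentence boundaries."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[ann.entity_id].append(ann)
    return dict(groups)


# Invented example: "The doctor arrived. She was tired."
# Taken alone, "doctor" is tagged A here; the shared entity_id links it
# to "She" in the next sentence. How ConGA itself tags such a case is
# not specified in this summary.
english = [
    Annotation("doctor", "en", "A", entity_id=1),
    Annotation("She", "en", "F", entity_id=1),
]
mentions = group_by_entity(english)
print([(a.token, a.gender) for a in mentions[1]])
```

Validating tags per language means an Ambiguous tag on an Italian word is rejected at construction time, mirroring the asymmetry between the English and Italian tag sets.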

Best for: AI Scientist, Research Scientist, AI Researcher, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.