G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment
Summary
G-IdiomAlign is a gloss-pivoted benchmark for evaluating cross-lingual idiom alignment in large language models. It comprises 18,785 idiom pairs spanning 36 language pairs across nine core languages and four additional languages, with each idiom anchored by an English gloss from Wiktionary. The benchmark supports two evaluation protocols: a Multiple-Choice Idiom Equivalence task with typed distractors for error attribution, and a Gloss-Contrastive Generation task that compares No-gloss and With-gloss inputs. Experiments with models such as DeepSeek-V3.2, Gemini-2.5-Pro, and Qwen3-8B consistently show a bias toward literal translation, particularly in low-resource languages. Explicit glosses improve generation performance, but overall accuracy remains modest. Attention-based diagnostics on Qwen3-8B indicate that successful gloss-aided generations correlate with stronger gloss anchoring in attention heads.
Key takeaway
NLP engineers building cross-lingual LLM applications should expect current models to show a strong literal-translation bias on idioms. Supplying an explicit semantic pivot, such as an English gloss, improves idiom generation, although overall accuracy remains modest. Evaluation protocols should include diagnostic benchmarks, such as multiple-choice tasks with typed distractors, to pinpoint specific failure modes and to measure how much semantic grounding helps figurative meaning transfer.
Key insights
Cross-lingual idiom alignment is hard for LLMs because idioms are non-compositional. Glosses provide semantic grounding, but literal bias remains the dominant failure.
Principles
- Idioms are non-compositional and culturally grounded.
- LLMs exhibit a strong literal translation bias.
- English glosses provide a robust semantic pivot.
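To make the gloss-as-pivot idea concrete, here is a minimal sketch of how a No-gloss and a With-gloss input might be built for the same idiom. The function name `build_prompts` and the prompt wording are assumptions for illustration only; the summary does not give the benchmark's actual prompt templates.

```python
def build_prompts(idiom: str, src_lang: str, tgt_lang: str, gloss: str) -> dict:
    """Build a contrastive pair of prompts for one idiom.

    The wording is hypothetical; only the No-gloss / With-gloss contrast
    comes from the benchmark description.
    """
    base = (
        f"Give the {tgt_lang} idiom that is equivalent in meaning "
        f"to this {src_lang} idiom: {idiom}"
    )
    return {
        "no_gloss": base,
        # Same request, plus the English gloss as an explicit semantic pivot.
        "with_gloss": f"{base}\nEnglish gloss: {gloss}",
    }


prompts = build_prompts("kick the bucket", "English", "French", "to die")
```

Running a model on both prompts and scoring the two outputs separately isolates how much the gloss contributes.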
Method
G-IdiomAlign constructs idiom pairs by extracting Wiktionary glosses, retrieving top-k candidates in an embedding space, enforcing mutual nearest neighbor (MNN) agreement, and applying distribution-aware filtering for high-confidence alignment.
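The retrieval and agreement steps can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' implementation. The function name `align_glosses`, the cosine-similarity metric, the top-k form of the mutual check, and especially the per-row z-score used as a stand-in for "distribution-aware filtering" are all assumptions, since the summary does not specify them.

```python
import numpy as np


def align_glosses(src_emb, tgt_emb, k=5, z_min=1.0):
    """Return (src_index, tgt_index, cosine) triples that pass all filters.

    src_emb, tgt_emb: 2-D arrays of gloss embeddings, one non-zero row per idiom.
    """
    src = np.asarray(src_emb, dtype=float)
    tgt = np.asarray(tgt_emb, dtype=float)
    # L2-normalise so the dot product equals cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T

    # Step 1: top-k candidates in both directions.
    top_for_src = np.argsort(-sim, axis=1)[:, : min(k, sim.shape[1])]
    top_for_tgt = np.argsort(-sim.T, axis=1)[:, : min(k, sim.shape[0])]
    tgt_candidates = [set(row.tolist()) for row in top_for_tgt]

    # Assumed filter: compare each score to its own row's distribution.
    row_mean = sim.mean(axis=1)
    row_std = sim.std(axis=1)

    pairs = []
    for i, row in enumerate(top_for_src):
        for j in row.tolist():
            # Step 2: mutual agreement, i must also be a candidate of j.
            if i not in tgt_candidates[j]:
                continue
            # Step 3: keep only scores that stand out for this source idiom.
            z = (sim[i, j] - row_mean[i]) / (row_std[i] + 1e-8)
            if z >= z_min:
                pairs.append((i, j, float(sim[i, j])))
    return pairs
```

With `k=1` the mutual check reduces to strict mutual nearest neighbours; larger `k` admits more candidates and leaves more of the pruning to the final filter.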
In practice
- Integrate explicit glosses for idiom translation.
- Use typed distractors for fine-grained error analysis.
- Apply attention diagnostics to trace gloss anchoring.
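As a rough illustration of error attribution with typed distractors, the sketch below tallies which type of option a model picked on each multiple-choice item. The type labels (`"correct"`, `"literal"`, `"unrelated"`) and the data layout are hypothetical; the summary does not list the benchmark's actual distractor types.

```python
from collections import Counter


def attribute_errors(items, predictions):
    """Return the share of predictions that fall on each option type.

    items: list of dicts, each with "option_types" mapping an option label
           (e.g. "A") to its type (hypothetical labels, see above).
    predictions: list of chosen option labels, aligned with items.
    """
    counts = Counter(
        item["option_types"][pred] for item, pred in zip(items, predictions)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {opt_type: n / total for opt_type, n in counts.items()}


option_types = {"A": "correct", "B": "literal", "C": "unrelated"}
items = [{"option_types": option_types} for _ in range(4)]
shares = attribute_errors(items, ["A", "B", "B", "C"])
```

A high share on a literal-translation distractor type, rather than on unrelated options, would be the signature of the literal bias described above.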
Topics
- Cross-Lingual Idiom Alignment
- Large Language Models
- NLP Benchmarking
- Semantic Grounding
- Wiktionary
- Attention Mechanisms
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.