Improving Machine Translation of Idioms: A Spanish–Galician Parallel Dataset and Synthetic Augmentation Approach

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

This work presents a systematic approach to translating idiomatic expressions between Spanish and Galician with neural machine translation. The researchers first built a high-quality parallel dataset of idioms, manually aligning expressions across the two languages. They then used Large Language Models (LLMs) to expand this seed dataset into a large synthetic parallel corpus, prioritizing the idioms observed most frequently in authentic corpora. The expanded dataset was used to retrain a sequence-to-sequence (seq2seq) translation model. In evaluation, the retrained model showed a significant improvement in idiom translation and a slight gain in overall performance, outperforming baseline models and state-of-the-art LLM-based translators such as SalamandraTA.

Key takeaway

Research scientists developing machine translation systems for low-resource or idiom-rich language pairs should consider a two-stage data augmentation strategy: manually curate a small, high-quality idiom dataset, then use LLMs to generate a larger synthetic corpus focused on high-frequency idioms. For Spanish-Galician, this approach significantly improved idiom translation and slightly improved overall model performance.

Key insights

Synthetic data augmentation using LLMs significantly improves idiom translation in neural machine translation.

Principles

Manual alignment ensures high-quality idiom datasets.
Prioritize frequent idioms for synthetic augmentation.
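
As a rough illustration of the frequency principle, the sketch below ranks a list of idioms by how many sentences of a reference corpus contain them. It is not the authors' procedure: the function name, the sample idioms, and the sample sentences are all assumptions made for this example, and the exact surface match will miss inflected forms (for instance "me tomas el pelo"), which a real pipeline would need to handle with lemmatization or pattern rules.

```python
import re
from collections import Counter


def rank_idioms_by_frequency(idioms, corpus_sentences):
    """Return (idiom, count) pairs, most frequent first.

    count = number of corpus sentences containing the idiom as an exact,
    case-insensitive phrase. Hypothetical helper, not from the paper.
    """
    patterns = {
        idiom: re.compile(r"\b" + re.escape(idiom.lower()) + r"\b")
        for idiom in idioms
    }
    counts = Counter()
    for sentence in corpus_sentences:
        lowered = sentence.lower()
        for idiom, pattern in patterns.items():
            if pattern.search(lowered):
                counts[idiom] += 1
    # Idioms never observed are kept with a count of 0.
    return sorted(((i, counts[i]) for i in idioms),
                  key=lambda pair: pair[1], reverse=True)


# Illustrative data only.
idioms = ["tomar el pelo", "estar en las nubes", "meter la pata"]
corpus = [
    "No me vuelvas a tomar el pelo.",
    "Creo que acabo de meter la pata.",
    "Deja de tomar el pelo a tu hermano.",
]
ranking = rank_idioms_by_frequency(idioms, corpus)
for idiom, count in ranking:
    print(f"{count:3d}  {idiom}")
```

The idioms at the top of the ranking would be the first candidates for synthetic augmentation.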

Method

Build a manually aligned idiom dataset, then use LLMs to synthetically augment it, prioritizing frequent idioms, and retrain a seq2seq model.
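
The method can be sketched in code under some loud assumptions. The summary does not specify the prompts, the LLM, or any filtering step, so everything below is hypothetical: `build_prompt`, `augment`, the `top_k` and `n_per_idiom` parameters, the containment filter, and the `fake_generate` stub that stands in for a real LLM call. The idiom pairs and sentences are illustrative and not taken from the paper's dataset.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (Spanish text, Galician text)


def build_prompt(es_idiom: str, gl_idiom: str, n: int) -> str:
    # Hypothetical prompt wording; the paper's actual prompts are not given here.
    return (
        f"Write {n} Spanish sentences using the idiom '{es_idiom}' and "
        f"translate each into Galician using the idiom '{gl_idiom}'."
    )


def augment(
    idiom_pairs: List[Pair],
    frequencies: Dict[str, int],
    generate: Callable[[str], List[Pair]],
    top_k: int,
    n_per_idiom: int,
) -> List[Pair]:
    """Generate synthetic sentence pairs for the top_k most frequent idioms."""
    ranked = sorted(idiom_pairs,
                    key=lambda p: frequencies.get(p[0], 0), reverse=True)
    synthetic: List[Pair] = []
    for es_idiom, gl_idiom in ranked[:top_k]:
        candidates = generate(build_prompt(es_idiom, gl_idiom, n_per_idiom))
        for es_sent, gl_sent in candidates[:n_per_idiom]:
            # Assumed sanity filter: keep a pair only if both sides
            # actually contain the aligned idiom.
            if (es_idiom.lower() in es_sent.lower()
                    and gl_idiom.lower() in gl_sent.lower()):
                synthetic.append((es_sent, gl_sent))
    return synthetic


def fake_generate(prompt: str) -> List[Pair]:
    # Stand-in for an LLM client so the sketch runs offline.
    if "tomar el pelo" in prompt:
        return [
            ("Deja de tomar el pelo.", "Deixa de tomar o pelo."),
            ("Hace buen tiempo.", "Fai bo tempo."),  # lacks the idiom
        ]
    return []


idiom_pairs = [("meter la pata", "meter a pata"),
               ("tomar el pelo", "tomar o pelo")]
frequencies = {"tomar el pelo": 2, "meter la pata": 1}
synthetic = augment(idiom_pairs, frequencies, fake_generate,
                    top_k=1, n_per_idiom=5)

base_corpus: List[Pair] = []  # existing parallel data would go here
training_pairs = base_corpus + synthetic
print(training_pairs)
```

The resulting `training_pairs` would then be passed to whichever seq2seq training toolkit is in use; that retraining step is deliberately left out because the summary does not name one.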

In practice

Create high-quality parallel idiom datasets.
Augment data with LLMs for low-resource language pairs.

Topics

Idiom Translation
Neural Machine Translation
Spanish-Galician Translation
Large Language Models
Data Augmentation

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.