Improving Machine Translation of Idioms: A Spanish–Galician Parallel Dataset and Synthetic Augmentation Approach
Summary
This work presents a systematic approach to translating idiomatic expressions between Spanish and Galician with neural machine translation. The researchers first built a high-quality parallel dataset of idioms, manually aligning expressions across the two languages. They then used Large Language Models (LLMs) to expand this seed dataset into a large synthetic parallel corpus, prioritizing the idioms observed most frequently in authentic corpora. The expanded corpus was used to retrain a sequence-to-sequence (seq2seq) translation model. In evaluation, the retrained model translated idioms significantly better and performed slightly better overall, outperforming baseline models and state-of-the-art LLM-based translators such as SalamandraTA.
Key takeaway
Research scientists developing machine translation systems for low-resource or idiom-rich language pairs should consider a two-stage data augmentation strategy: manually curate a small, high-quality idiom dataset, then use LLMs to generate a larger synthetic corpus that focuses on high-frequency idioms. For Spanish–Galician, this approach is reported to improve idiom translation accuracy significantly and overall model performance slightly.
Key insights
Synthetic data augmentation using LLMs significantly improves idiom translation in neural machine translation.
Principles
- Manual alignment ensures high-quality idiom datasets.
- Prioritize frequent idioms for synthetic augmentation.
Method
Build a manually aligned idiom dataset, then use LLMs to synthetically augment it, prioritizing frequent idioms, and retrain a seq2seq model.
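The frequency-prioritization step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the substring-based matching, the proportional allocation rule, and the toy sentences are all assumptions made for illustration. The idea is to count how often each curated idiom appears in an authentic corpus, then give frequent idioms a larger share of the synthetic-example budget.

```python
from collections import Counter


def count_idiom_frequencies(idioms, corpus_sentences):
    """Count how many corpus sentences contain each idiom.

    Assumption: naive case-insensitive substring matching. Inflected idioms
    (e.g. conjugated verbs) would need lemmatization, which is omitted here.
    """
    counts = Counter({idiom: 0 for idiom in idioms})
    for sentence in corpus_sentences:
        lowered = sentence.lower()
        for idiom in idioms:
            if idiom.lower() in lowered:
                counts[idiom] += 1
    return counts


def allocate_budget(counts, total_examples, min_per_idiom=1):
    """Split a synthetic-example budget across idioms by corpus frequency.

    Every idiom receives `min_per_idiom` examples; the remainder is shared
    in proportion to frequency, and rounding leftovers go to the most
    frequent idioms first. This rule is hypothetical, not from the paper.
    """
    idioms = sorted(counts, key=lambda idiom: (-counts[idiom], idiom))
    base = min_per_idiom * len(idioms)
    if total_examples < base:
        raise ValueError("budget too small for the per-idiom minimum")

    remaining = total_examples - base
    total_count = sum(counts.values())
    allocation = {idiom: min_per_idiom for idiom in idioms}

    if total_count > 0:
        # Proportional share, rounded down.
        for idiom in idioms:
            allocation[idiom] += remaining * counts[idiom] // total_count

    # Hand out whatever rounding left over, most frequent idioms first.
    leftover = total_examples - sum(allocation.values())
    for k in range(leftover):
        allocation[idioms[k % len(idioms)]] += 1
    return allocation


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the paper's dataset.
    idioms = ["a duras penas", "en un abrir y cerrar de ojos"]
    corpus = [
        "Llegó a duras penas a casa.",
        "Aprobó a duras penas.",
        "Todo pasó en un abrir y cerrar de ojos.",
    ]
    counts = count_idiom_frequencies(idioms, corpus)
    print(allocate_budget(counts, total_examples=10))
```

The per-idiom minimum keeps rare idioms represented in the synthetic corpus while the bulk of the budget still goes to the expressions a model is most likely to meet.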
In practice
- Create high-quality parallel idiom datasets.
- Augment data with LLMs for low-resource language pairs.
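As a hedged illustration of the LLM augmentation step, the sketch below builds a generation prompt for one aligned idiom pair and filters the model's raw output. The prompt wording, the `|||` separator, and all function names are hypothetical, and no particular LLM API is assumed: the raw output is passed in as a plain string. The filter keeps only sentence pairs in which both the Spanish and the Galician idiom actually appear.

```python
def build_prompt(es_idiom, gl_idiom, n_examples):
    """Build a generation prompt for one aligned idiom pair (hypothetical wording)."""
    return (
        f"Write {n_examples} Spanish sentences that use the idiom '{es_idiom}' "
        f"and translate each into Galician using the idiom '{gl_idiom}'.\n"
        "Return one pair per line in the form: <Spanish> ||| <Galician>"
    )


def parse_pairs(raw_output):
    """Parse 'Spanish ||| Galician' lines, skipping lines in any other shape."""
    pairs = []
    for line in raw_output.splitlines():
        if "|||" not in line:
            continue
        es_sentence, gl_sentence = (part.strip() for part in line.split("|||", 1))
        if es_sentence and gl_sentence:
            pairs.append((es_sentence, gl_sentence))
    return pairs


def filter_pairs(pairs, es_idiom, gl_idiom):
    """Keep unique pairs in which both idioms occur (case-insensitive substring)."""
    kept = []
    seen = set()
    for es_sentence, gl_sentence in pairs:
        if es_idiom.lower() not in es_sentence.lower():
            continue
        if gl_idiom.lower() not in gl_sentence.lower():
            continue
        if (es_sentence, gl_sentence) in seen:
            continue
        seen.add((es_sentence, gl_sentence))
        kept.append((es_sentence, gl_sentence))
    return kept


if __name__ == "__main__":
    # Illustrative idiom pair and model output; not taken from the paper's dataset.
    es_idiom = "en un abrir y cerrar de ojos"
    gl_idiom = "nun abrir e pechar de ollos"
    print(build_prompt(es_idiom, gl_idiom, n_examples=3))

    raw_output = (
        "Todo pasó en un abrir y cerrar de ojos. ||| Todo pasou nun abrir e pechar de ollos.\n"
        "Todo pasó muy rápido. ||| Todo pasou moi rápido.\n"
    )
    print(filter_pairs(parse_pairs(raw_output), es_idiom, gl_idiom))
```

A surface-level check like this catches outputs where the model paraphrased the idiom away, but it cannot judge translation quality, so manual spot checks of the synthetic pairs remain advisable.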
Topics
- Idiom Translation
- Neural Machine Translation
- Spanish–Galician Translation
- Large Language Models
- Data Augmentation
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.