Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation
Summary
A novel in-context learning approach significantly improves low-resource machine translation from Coptic to English by augmenting prompts with syntactic information from Universal Dependencies (UD) parses. Researchers from Georgetown University's Corpling Lab found that while dictionary-based glosses (LEX) are effective on their own, combining them with syntactic data (SYN) yields new state-of-the-art results for Coptic translation across models of various sizes, including Gemma and GPT-4.1. The method explores three syntactic representations: raw parser output (CoNLL-U), verbalized dependency relations (DEP), and targeted instructions for difficult constructions (CON). The combined approach consistently outperforms lexicon-only augmentation, showing that high-quality automatic parses are sufficient and that the benefits extend to out-of-domain texts such as Coptic ostraca.
Key takeaway
Research scientists developing machine translation systems for low-resource languages should consider integrating syntactic information from Universal Dependencies parses alongside traditional dictionary-based augmentation. The LEX+SYN combination in particular has been shown to achieve statistically significant improvements in translation quality for Coptic. Syntactic data can be operationalized through raw CoNLL-U output, verbalized dependency relations, and targeted instructions for difficult grammatical constructions, and the gains hold even with automatically generated parses.
Key insights
Syntactic augmentation, combined with lexical information, significantly boosts low-resource machine translation quality.
Principles
- Syntactic information complements lexical data for MT.
- High-quality automatic parses are sufficient for ICL MT.
- Model performance varies with syntactic representation.
Method
The method involves augmenting LLM prompts with dictionary entries and various forms of Universal Dependencies syntactic parses, including raw CoNLL-U, verbalized dependencies, and targeted construction-specific instructions.
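As a rough illustration of the verbalized-dependency (DEP) idea, the sketch below reads one sentence in CoNLL-U format and turns each dependency edge into a plain-English statement. It is a minimal sketch, not the authors' implementation: the sentence wording, the function names `parse_conllu` and `verbalize`, and the tokens `w1`, `w2`, `w3` (placeholders, not real Coptic) are all assumptions made for illustration.

```python
# Minimal sketch: verbalize UD dependencies from a CoNLL-U sentence.
# Tokens w1..w3 are placeholders, not real Coptic; the output wording
# is an assumption, not the format used in the original work.

CONLLU = """\
# text = w1 w2 w3
1\tw1\tw1\tVERB\t_\t_\t0\troot\t_\t_
2\tw2\tw2\tNOUN\t_\t_\t1\tnsubj\t_\t_
3\tw3\tw3\tNOUN\t_\t_\t1\tobj\t_\t_
"""


def parse_conllu(text):
    """Return token dicts for one sentence.

    Skips comment lines, multiword-token ranges (IDs like 1-2) and
    empty nodes (IDs like 1.1), which have no integer ID.
    """
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return tokens


def verbalize(tokens):
    """Turn each dependency edge into one plain-English statement."""
    by_id = {t["id"]: t for t in tokens}
    statements = []
    for t in tokens:
        if t["head"] == 0:
            statements.append(
                f'"{t["form"]}" ({t["upos"]}) is the root of the sentence.'
            )
        else:
            head = by_id[t["head"]]
            statements.append(
                f'"{t["form"]}" ({t["upos"]}) is the {t["deprel"]} '
                f'of "{head["form"]}".'
            )
    return statements


sentences = verbalize(parse_conllu(CONLLU))
print("\n".join(sentences))
```

The raw CoNLL-U variant would simply paste the `CONLLU` string into the prompt unchanged, so the same parsed input can feed either representation.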
In practice
- Combine dictionary glosses with UD parses for LRL MT.
- Experiment with raw vs. verbalized syntactic inputs.
- Use construction-specific instructions for complex grammar.
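The three practices above could be combined in a single prompt builder, sketched here under stated assumptions. The prompt wording, the `CONSTRUCTION_HINTS` table, and the helper names `construction_notes` and `build_prompt` are hypothetical; only the relation labels `acl:relcl` and `advcl` come from Universal Dependencies. The source tokens and glosses are placeholders rather than real Coptic dictionary entries.

```python
# Minimal sketch: assemble a LEX+SYN prompt with optional construction hints.
# All prompt wording and hint texts are illustrative assumptions.

# Hypothetical mapping from UD relation labels to plain-language hints.
CONSTRUCTION_HINTS = {
    "acl:relcl": "The sentence contains a relative clause; "
                 "attach it to the noun it modifies.",
    "advcl": "The sentence contains an adverbial clause; "
             "translate it as a subordinate clause.",
}


def construction_notes(deprels):
    """Collect one hint per construction found in the parse, in order."""
    notes = []
    for rel in deprels:
        hint = CONSTRUCTION_HINTS.get(rel)
        if hint and hint not in notes:
            notes.append(hint)
    return notes


def build_prompt(source, glosses, syntax="", notes=()):
    """Join the instruction, dictionary glosses, syntax, and hints."""
    parts = [
        "Translate the following Coptic sentence into English.",
        f"Sentence: {source}",
    ]
    if glosses:
        parts.append("Dictionary entries:")
        parts.extend(f"- {form}: {gloss}" for form, gloss in glosses.items())
    if syntax:
        # `syntax` may be raw CoNLL-U or verbalized dependency statements.
        parts.append("Syntactic analysis:")
        parts.append(syntax)
    if notes:
        parts.append("Notes on difficult constructions:")
        parts.extend(f"- {note}" for note in notes)
    return "\n".join(parts)


prompt = build_prompt(
    source="w1 w2 w3",  # placeholder tokens
    glosses={"w1": "gloss-1", "w2": "gloss-2"},  # placeholder glosses
    syntax='"w2" is the nsubj of "w1".',
    notes=construction_notes(["root", "nsubj", "acl:relcl", "acl:relcl"]),
)
print(prompt)
```

Keeping `syntax` as a free-form string makes it easy to swap raw and verbalized inputs and compare them on the same model, since which representation works best varies by model.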
Topics
- Low-Resource Machine Translation
- Coptic Language
- Universal Dependencies
- In-Context Learning
- Syntactic Augmentation
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.