AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels
Summary
AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset, presented as the first openly licensed diachronic dependency treebank of Greek. It spans eight diachronic periods, from Archaic to Modern Greek, unified under the PROIEL XML 2.0 schema. The resource also provides verse-level cross-alignment of the New Testament with Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family and annotates text with a Stanford Stanza workflow trained on PROIEL data. Sentence-level alignment uses LaBSE, while word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release offers curated samples and an open-source toolkit; the full annotated corpus partitions are expected in the v0.5 release, after an audit.
Key takeaway
For NLP engineers and research scientists working on historical linguistics or multilingual text alignment, AthDGC is a new open resource. The treebank can be used to train models on diachronic change in Greek or to run comparative studies of New Testament translations. The PROIEL XML 2.0 schema and the accompanying toolkit can be integrated into existing pipelines for cross-linguistic analysis and model development on ancient languages. Note that v0.4 ships curated samples only; the full corpus partitions are expected with v0.5.
Key insights
AthDGC provides the first openly licensed, diachronic Greek dependency treebank with multilingual New Testament alignments.
Principles
- Unify diachronic data under a single schema.
- Cross-align texts for linguistic research.
- Open-source tools enhance accessibility.
Method
The workflow uses Stanford Stanza for annotation, LaBSE for sentence alignment, and multilingual-BERT attention via AwesomeAlign for word-level alignment.
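The sentence-alignment step can be illustrated with a minimal sketch. It assumes each sentence has already been embedded (by LaBSE in the workflow described above) and pairs sentences by mutual best cosine similarity. The matching rule, the `min_sim` threshold, the function names, and the toy 3-dimensional vectors are all assumptions made for illustration, not details confirmed by the source.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mutual_best_pairs(src_vecs, tgt_vecs, min_sim=0.5):
    """Return (i, j, sim) where source i and target j are each other's
    nearest neighbour and their similarity reaches min_sim (hypothetical threshold)."""
    sims = [[cosine(u, v) for v in tgt_vecs] for u in src_vecs]
    pairs = []
    for i, row in enumerate(sims):
        # Best target for source sentence i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source for that target; keep the pair only if it points back to i.
        i_back = max(range(len(sims)), key=lambda k: sims[k][j])
        if i_back == i and row[j] >= min_sim:
            pairs.append((i, j, row[j]))
    return pairs

# Toy vectors standing in for real sentence embeddings (assumption).
greek = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
latin = [[0.0, 0.9, 0.1], [0.9, 0.2, 0.0], [0.7, 0.7, 0.0]]

pairs = mutual_best_pairs(greek, latin)
for i, j, sim in pairs:
    print(f"greek[{i}] <-> latin[{j}]  sim={sim:.3f}")
```

Requiring the match to hold in both directions leaves the third Greek sentence unaligned instead of forcing it onto a poor candidate, which is usually preferable when translations omit or merge material.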
In practice
- Explore diachronic Greek linguistic changes.
- Analyze New Testament translations across languages.
- Integrate the PROIEL XML 2.0 schema into projects.
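As a starting point for working with the schema, the sketch below reads dependency edges from a PROIEL-style XML fragment using only the Python standard library. The fragment is hand-written for illustration: the element and attribute names (`sentence`, `token`, `form`, `head-id`, `relation`, and so on) follow those used in PROIEL treebank releases, but the annotation values are illustrative, and the exact layout of AthDGC files is an assumption. The helper `dependency_edges` is hypothetical and not part of the AthDGC toolkit.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; annotation values are illustrative only (assumption).
SAMPLE = """\
<proiel schema-version="2.0">
  <source id="demo" language="grc">
    <div>
      <sentence id="1">
        <token id="1" form="ἐν" lemma="ἐν" part-of-speech="R-" head-id="3" relation="xobj"/>
        <token id="2" form="ἀρχῇ" lemma="ἀρχή" part-of-speech="Nb" head-id="1" relation="obl"/>
        <token id="3" form="ἦν" lemma="εἰμί" part-of-speech="V-" relation="pred"/>
        <token id="4" form="ὁ" lemma="ὁ" part-of-speech="S-" head-id="5" relation="aux"/>
        <token id="5" form="λόγος" lemma="λόγος" part-of-speech="Nb" head-id="3" relation="sub"/>
      </sentence>
    </div>
  </source>
</proiel>
"""

def dependency_edges(xml_text):
    """Yield (sentence_id, dependent_form, relation, head_form) tuples.

    Tokens without a head-id attribute are reported with head "ROOT".
    """
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        # Index tokens by id so head-id references can be resolved.
        tokens = {t.get("id"): t for t in sentence.iter("token")}
        for tok in tokens.values():
            head = tokens.get(tok.get("head-id"))
            head_form = head.get("form") if head is not None else "ROOT"
            yield sentence.get("id"), tok.get("form"), tok.get("relation"), head_form

edges = list(dependency_edges(SAMPLE))
for edge in edges:
    print(edge)
```

Because every period shares one schema, the same reader should work across the diachronic partitions once they are released, with only the `language` attribute on `source` differing.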
Topics
- Diachronic Linguistics
- Greek Language
- Dependency Treebank
- Multilingual Alignment
- PROIEL Treebank
- Natural Language Processing
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.