AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Research Methodology & Innovation) · Depth: Expert, quick

Summary

AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset, described as the first openly licensed diachronic dependency treebank of Greek. It spans eight periods, from Archaic to Modern Greek, unified under the PROIEL XML 2.0 schema. The resource also aligns the New Testament at verse level with Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family and annotates text with a Stanford Stanza workflow trained on PROIEL data. Sentence-level alignment uses LaBSE; word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release offers curated samples and an open-source toolkit; the full annotated corpus partitions are expected in v0.5, after an audit.
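
Verse-level cross-alignment amounts to joining parallel texts on a shared verse reference. The sketch below illustrates that idea only; the function name `align_by_verse`, the language codes, the reference format, and the two-verse sample are assumptions for illustration and are not taken from the AthDGC toolkit.

```python
def align_by_verse(corpora):
    """Join parallel texts on verse reference.

    corpora: {language: {verse_ref: text}}
    returns: {verse_ref: {language: text or None}}, in first-seen verse order.
    """
    # dict.fromkeys keeps the first occurrence of each reference, in order.
    refs = list(dict.fromkeys(ref for verses in corpora.values() for ref in verses))
    return {
        ref: {lang: verses.get(ref) for lang, verses in corpora.items()}
        for ref in refs
    }


# Illustrative sample only; the empty "got" entry shows how a gap is reported.
SAMPLE_CORPORA = {
    "grc": {
        "JOHN 1.1": "ἐν ἀρχῇ ἦν ὁ λόγος",
        "JOHN 1.2": "οὗτος ἦν ἐν ἀρχῇ πρὸς τὸν θεόν",
    },
    "lat": {
        "JOHN 1.1": "in principio erat verbum",
        "JOHN 1.2": "hoc erat in principio apud deum",
    },
    "got": {},
}

verse_table = align_by_verse(SAMPLE_CORPORA)
for ref, row in verse_table.items():
    print(ref, row)
```

A verse missing from one language comes back as `None` rather than being dropped, so gaps in a witness stay visible downstream.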

Key takeaway

For NLP engineers and research scientists working on historical linguistics or multilingual text alignment, AthDGC is a new open resource. The treebank can be used to train models on diachronic change in Greek or to compare New Testament translations across languages. Note that v0.4 contains curated samples only; the full corpus is expected in v0.5. The PROIEL XML 2.0 schema and the accompanying toolkit are the natural integration points for cross-linguistic analysis and for model development on ancient languages.
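
As a starting point for working with PROIEL-style XML, the sketch below reads tokens and their dependency attributes using only the Python standard library. The attribute names (`form`, `lemma`, `part-of-speech`, `head-id`, `relation`, `citation-part`) follow the general PROIEL XML convention; whether AthDGC files use exactly this layout is an assumption, and the five-token sample with its annotations is hand-written for illustration, not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written sample in PROIEL-style XML (not AthDGC data).
SAMPLE_XML = """<proiel schema-version="2.0">
  <source id="sample" language="grc">
    <div>
      <sentence id="1">
        <token id="1" form="ἐν" lemma="ἐν" part-of-speech="R-" head-id="3" relation="adv" citation-part="JOHN 1.1"/>
        <token id="2" form="ἀρχῇ" lemma="ἀρχή" part-of-speech="Nb" head-id="1" relation="obl" citation-part="JOHN 1.1"/>
        <token id="3" form="ἦν" lemma="εἰμί" part-of-speech="V-" relation="pred" citation-part="JOHN 1.1"/>
        <token id="4" form="ὁ" lemma="ὁ" part-of-speech="S-" head-id="5" relation="aux" citation-part="JOHN 1.1"/>
        <token id="5" form="λόγος" lemma="λόγος" part-of-speech="Nb" head-id="3" relation="sub" citation-part="JOHN 1.1"/>
      </sentence>
    </div>
  </source>
</proiel>"""


def read_tokens(xml_text):
    """Return one dict per token; 'head' is None when no head-id is present."""
    root = ET.fromstring(xml_text)
    rows = []
    for sentence in root.iter("sentence"):
        for token in sentence.iter("token"):
            rows.append({
                "sentence": sentence.get("id"),
                "id": token.get("id"),
                "form": token.get("form"),
                "lemma": token.get("lemma"),
                "pos": token.get("part-of-speech"),
                "head": token.get("head-id"),
                "relation": token.get("relation"),
                "citation": token.get("citation-part"),
            })
    return rows


rows = read_tokens(SAMPLE_XML)
for row in rows:
    print(row["id"], row["form"], row["relation"], row["head"])
```

Flat dictionaries like these are easy to load into a table or convert to another dependency format; the sentence root is the token whose `head` is `None`.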

Key insights

AthDGC provides the first openly licensed, diachronic Greek dependency treebank with multilingual New Testament alignments.

Method

The workflow uses Stanford Stanza for annotation, LaBSE for sentence alignment, and multilingual-BERT attention via AwesomeAlign for word-level alignment.
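
To make the word-level step concrete, the sketch below extracts alignments from a token-by-token score matrix by keeping only pairs that are each other's best match in both directions. This is a simplified stand-in for illustration, not the AwesomeAlign procedure itself: the scores are hand-written rather than derived from multilingual BERT, and the function name `mutual_argmax_alignments` is hypothetical.

```python
def mutual_argmax_alignments(sim):
    """Return (row, col) pairs where each index is the other's best match.

    sim: list of rows, sim[i][j] = score between source token i and target token j.
    """
    if not sim or not sim[0]:
        return []
    n_rows, n_cols = len(sim), len(sim[0])
    # Best target for every source token, and best source for every target token.
    best_col = [max(range(n_cols), key=lambda j: sim[i][j]) for i in range(n_rows)]
    best_row = [max(range(n_rows), key=lambda i: sim[i][j]) for j in range(n_cols)]
    # Keep a pair only when both directions agree.
    return [(i, j) for i, j in enumerate(best_col) if best_row[j] == i]


greek = ["ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος"]
latin = ["in", "principio", "erat", "verbum"]

# Made-up scores for illustration; a real run would compute them from a model.
scores = [
    [0.9, 0.2, 0.1, 0.1],
    [0.2, 0.8, 0.1, 0.2],
    [0.1, 0.1, 0.9, 0.1],
    [0.1, 0.2, 0.1, 0.4],
    [0.1, 0.2, 0.1, 0.9],
]

pairs = mutual_argmax_alignments(scores)
aligned = [(greek[i], latin[j]) for i, j in pairs]
print(aligned)
```

With these scores the Greek article ὁ stays unaligned, because its best Latin candidate (verbum) prefers λόγος; requiring agreement in both directions trades recall for precision.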

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.