Curation of a Palaeohispanic Dataset for Machine Learning
Summary
A new dataset has been created to enable machine learning research on Palaeohispanic languages, which were spoken in the Iberian Peninsula before the Romans arrived in 218 BC. These languages, including Iberian, Celtiberian, Lusitanian, Vasconic, and South-Western (Tartessian), are known only from inscriptions and have been deciphered to varying degrees. Existing linguistic resources, such as the Hesperia Data Bank, are not formatted for computational analysis. The new dataset, available as a CSV file, contains 1751 entries and 36 feature columns, derived primarily from the Hesperia Data Bank's epigraphic records. Location is stored as latitude/longitude, dating as numerical intervals, and inscription text in cleaned form; the remaining features are categorically encoded. This makes the dataset suitable for NLP tasks and other computational studies.
Key takeaway
For NLP engineers or historical linguists working with low-resource or ancient languages, this curated Palaeohispanic dataset offers a structured starting point. Its format suits tasks like morphological analysis, cognate detection, or part-of-speech tagging. The provided Python scripts also allow you to update the dataset as new linguistic knowledge emerges, so your analyses can keep pace with evolving interpretations.
Key insights
A new structured dataset enables machine learning analysis of under-resourced Palaeohispanic languages.
Principles
- Computational methods can advance historical linguistics.
- Data transformation is key for ML readiness.
- Corpus languages require careful data curation.
Method
The methodology has three steps: collecting epigraphic data from sources like the Hesperia Data Bank, selecting relevant attributes, and transforming string values into numerical or cleaned text formats suitable for machine learning models. The transformations include coordinate mapping for location and interval representation for chronology.
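As a rough illustration of the transformation step, the sketch below maps a place name to coordinates and turns a categorical column into integer codes. It is not the authors' code: the `GAZETTEER` table, its site names and coordinates, and the function names are all hypothetical placeholders chosen for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical gazetteer: the site names and coordinates are made-up
# placeholders, not values taken from the Hesperia Data Bank.
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "site_a": (40.0, -1.0),
    "site_b": (38.5, -6.25),
}


def map_location(place: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (latitude, longitude) for a place name, or (None, None) if unknown."""
    return GAZETTEER.get(place.strip(), (None, None))


def encode_categories(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Assign each distinct string an integer code in order of first appearance."""
    codes: Dict[str, int] = {}
    encoded: List[int] = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes
```

Returning the code table alongside the encoded column keeps the mapping reversible, which helps when inspecting model output against the original labels.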
In practice
- Use latitude/longitude for location features.
- Convert natural language dates to numerical intervals.
- Clean text by removing epigraphic annotations.
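The date and text steps above could look something like the following minimal sketch. It assumes dates are written with explicit "BC"/"AD" markers and that annotations are limited to brackets, question marks, and dash runs for gaps; the source does not specify either format, so both the patterns and the function names are assumptions.

```python
import re
from typing import Optional, Tuple

# Assumed date format: numbers followed by "BC" or "AD" (hypothetical).
_YEAR = re.compile(r"(\d+)\s*(BC|AD)", re.IGNORECASE)
# Assumed annotation marks: brackets and question marks (hypothetical subset).
_BRACKETS = re.compile(r"[\[\]\(\)\{\}<>?]")
# Assumed gap marker: a run of two or more dashes (hypothetical).
_GAPS = re.compile(r"-{2,}")


def date_to_interval(text: str) -> Optional[Tuple[int, int]]:
    """Convert a date string to (start, end) years, with BC years negative."""
    years = [
        -int(number) if era.upper() == "BC" else int(number)
        for number, era in _YEAR.findall(text)
    ]
    if not years:
        return None
    return (min(years), max(years))


def clean_text(text: str) -> str:
    """Drop bracket and question marks, replace gap markers, collapse spaces."""
    without_marks = _BRACKETS.sub("", text)
    without_gaps = _GAPS.sub(" ", without_marks)
    return " ".join(without_gaps.split())
```

For example, `date_to_interval("200 BC - 100 BC")` gives `(-200, -100)`, and `clean_text("abc[de] --- fgh?")` gives `"abcde fgh"` (the input string here is invented, not a real inscription).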
Topics
- Palaeohispanic Languages
- Dataset Curation
- Machine Learning for Linguistics
- Hesperia Data Bank
- Epigraphic Inscriptions
Best for: NLP Engineer, Research Scientist, Data Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.