Curation of a Palaeohispanic Dataset for Machine Learning

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Research Methodology & Innovation · Depth: Intermediate, extended

Summary

A new dataset enables machine learning research on the Palaeohispanic languages, which were spoken in the Iberian Peninsula before the Romans arrived in 218 BC. These languages, including Iberian, Celtiberian, Lusitanian, Vasconic, and South-Western (Tartessian), are known only from inscriptions and have been deciphered to varying degrees. Existing linguistic resources, such as the Hesperia Data Bank, are not formatted for computational analysis. The new dataset, distributed as a CSV file, contains 1751 entries and 36 feature columns, derived primarily from the Hesperia Data Bank's epigraphic records. Its attributes are transformed for model input: locations become latitude/longitude pairs, dating intervals become numerical values, inscription text is cleaned, and the remaining features are categorically encoded. This makes the dataset suitable for NLP tasks and other computational studies.
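To make the text cleaning and categorical encoding concrete, here is a minimal sketch using only the Python standard library. The cleaning rule (dropping brackets and question marks) and the function names `clean_text` and `encode_categories` are assumptions for illustration. The summary does not state which rules or column names the dataset's own scripts use.

```python
import re


def clean_text(raw):
    """Drop bracket characters and question marks, then collapse whitespace.

    This is an assumed cleaning rule, not the dataset's documented one.
    """
    without_marks = re.sub(r"[\[\]\(\)\?]", "", raw)
    return re.sub(r"\s+", " ", without_marks).strip()


def encode_categories(values):
    """Map each distinct label to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


if __name__ == "__main__":
    # Placeholder strings, not real inscription data.
    print(clean_text("abc[d]  ef?"))
    print(encode_categories(["stone", "lead", "stone"]))
```

Returning the code table alongside the encoded column keeps the mapping reversible, which matters when a model's output has to be read back as the original category labels.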

Key takeaway

For NLP engineers or historical linguists working with low-resource or ancient languages, this curated Palaeohispanic dataset offers a structured starting point for tasks such as morphological analysis, cognate detection, or part-of-speech tagging. The provided Python scripts also let you update the dataset as new linguistic knowledge emerges, so your analyses can follow revised readings and interpretations.

Key insights

A new structured dataset enables machine learning analysis of under-resourced Palaeohispanic languages.

Method

The methodology involves collecting epigraphic data from sources like the Hesperia Data Bank, selecting relevant attributes, and transforming string values into numerical or cleaned text formats suitable for machine learning models. The transformations include coordinate mapping for location and interval representation for chronology.
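The two transformations named above can be sketched as follows. The raw string formats (`"200 BC - 50 BC"` for dating, `"lat, lon"` for location) and the function names are hypothetical. The Hesperia Data Bank's actual field formats may differ, so treat this as an illustration of the idea rather than the authors' scripts.

```python
import re

# Assumed raw format: "<year> BC|AD - <year> BC|AD" (hypothetical).
_YEAR = re.compile(r"(\d+)\s*(BC|AD)", re.IGNORECASE)


def parse_interval(raw):
    """Return (start, end) as signed years with BC negative, or None if unparseable."""
    matches = _YEAR.findall(raw)
    if len(matches) != 2:
        return None
    years = [-int(number) if era.upper() == "BC" else int(number)
             for number, era in matches]
    return min(years), max(years)


def parse_coordinates(raw):
    """Split an assumed 'lat, lon' string into floats; None if malformed or out of range."""
    try:
        lat_text, lon_text = raw.split(",")
        lat, lon = float(lat_text), float(lon_text)
    except ValueError:
        return None
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


if __name__ == "__main__":
    print(parse_interval("200 BC - 50 BC"))
    print(parse_coordinates("41.65, -0.88"))
```

Encoding BC years as negative integers puts every date on one numeric axis, so an interval can be used directly as two numeric features or reduced to a midpoint and a width.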

Best for: NLP Engineer, Research Scientist, Data Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.