Curation of a Palaeohispanic Dataset for Machine Learning

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Research Methodology & Innovation · Depth: Intermediate, extended

Summary

A new dataset supports machine learning research on the Palaeohispanic languages, which were spoken in the Iberian Peninsula before the Roman arrival in 218 BC. These languages, including Iberian, Celtiberian, Lusitanian, Vasconic, and South-Western (Tartessian), are known only from inscriptions and are deciphered to varying degrees. Existing linguistic resources, such as the Hesperia Data Bank, are not formatted for computational analysis. The new dataset, distributed as a CSV file, contains 1751 entries and 36 feature columns, derived primarily from the Hesperia Data Bank's epigraphic records. Location is stored as latitude/longitude, dating as numerical intervals, and inscription text in cleaned form, while the remaining features are encoded as categories. This makes the data usable for NLP tasks and other computational studies.

Key takeaway

For NLP engineers or historical linguists working with low-resource or ancient languages, this curated Palaeohispanic dataset offers a practical starting point. Its structured format lends itself to tasks such as morphological analysis, cognate detection, or part-of-speech tagging. The provided Python scripts can regenerate the dataset as new linguistic knowledge emerges, so your work can follow revised readings and interpretations.

Key insights

A new structured dataset enables machine learning analysis of under-resourced Palaeohispanic languages.

Principles

Computational methods can advance historical linguistics.
Data transformation is key for ML readiness.
Corpus languages require careful data curation.

Method

The method collects epigraphic data from sources such as the Hesperia Data Bank, selects the relevant attributes, and transforms string values into numerical or cleaned-text formats suitable for machine learning models. Locations are mapped to coordinates, and chronology is represented as intervals.
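The location and chronology transformations can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the gazetteer, its approximate coordinates, the English "3rd century BC" date format, the field names `site` and `dating`, and the output column names are all hypothetical.

```python
import re
from typing import Optional, Tuple

# Hypothetical gazetteer: site name -> (latitude, longitude) in decimal degrees.
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Ullastret": (42.0, 3.1),
    "Botorrita": (41.5, -1.0),
}

# Assumed date format: "<ordinal> century BC|AD", e.g. "3rd century BC".
CENTURY_RE = re.compile(r"(\d+)(?:st|nd|rd|th) century (BC|AD)", re.IGNORECASE)


def century_to_interval(text: str) -> Optional[Tuple[int, int]]:
    """Map a century phrase to (start_year, end_year); BC years are negative."""
    match = CENTURY_RE.search(text)
    if match is None:
        return None
    n = int(match.group(1))
    if match.group(2).upper() == "BC":
        # The 3rd century BC runs from 300 BC to 201 BC.
        return (-100 * n, -100 * (n - 1) - 1)
    # The 1st century AD runs from AD 1 to AD 100.
    return (100 * (n - 1) + 1, 100 * n)


def transform_record(record: dict) -> dict:
    """Turn the string fields of one record into numeric features."""
    lat_lon = GAZETTEER.get(record["site"])
    interval = century_to_interval(record["dating"])
    return {
        "latitude": lat_lon[0] if lat_lon else None,
        "longitude": lat_lon[1] if lat_lon else None,
        "year_from": interval[0] if interval else None,
        "year_to": interval[1] if interval else None,
    }


print(transform_record({"site": "Botorrita", "dating": "1st century BC"}))
```

Unknown sites and unparseable dates become `None` rather than raising, so incomplete records stay in the table and can be handled at modelling time.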

In practice

Use latitude/longitude for location features.
Convert natural language dates to numerical intervals.
Clean text by removing epigraphic annotations.
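The text-cleaning step might look like the sketch below. It assumes Leiden-style editorial marks (brackets around restored letters, `/` for line breaks, runs of dashes for lacunae, `?` and `+` for uncertain readings); the exact set of annotations the dataset scripts remove is not stated here, and `clean_inscription` is a hypothetical helper name.

```python
import re


def clean_inscription(text: str) -> str:
    """Strip assumed Leiden-style editorial marks from a transliteration."""
    # Drop bracket characters but keep the letters they enclose.
    text = re.sub(r"[\[\]()<>{}]", "", text)
    # Replace line-break marks and lacuna dashes (two or more) with a space.
    text = re.sub(r"[/|]|-{2,}", " ", text)
    # Drop marks for uncertain or illegible signs.
    text = re.sub(r"[?+]", "", text)
    # Collapse the whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()


# Invented example string, not a real inscription.
print(clean_inscription("a[b]e / kidu ---?"))
```

Single hyphens are left in place because transliterations often use them as segment separators; whether to keep restored letters or discard them is a design choice that depends on the downstream task.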

Topics

Palaeohispanic Languages
Dataset Curation
Machine Learning for Linguistics
Hesperia Data Bank
Epigraphic Inscriptions

Code references

gonmarfer2/palaeohispanic-dataset-generator

Best for: NLP Engineer, Research Scientist, Data Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.