18th-century Fauna and Flora: from Named Entities to the problems of standardization

Source: Paper Index on ACL Anthology · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Intermediate, quick

Summary

This article describes the challenges of preparing 18th-century Portuguese historical sources on fauna and flora for computational analysis. The authors expected little lexical ambiguity because the domain is specialized, but they encountered significant variation instead. The work outlines a roadmap for orthographic normalization, describes the creation of a corpus annotated with Named Entities, and discusses the problems that lexical variation causes within these specialized thesauri. The study aims to contribute to the understanding of historical source normalization and to underline the need for robust normalization practices.

Key takeaway

Research scientists working with historical linguistic data should anticipate significant lexical variation even in specialized domains. Computational processing pipelines benefit from a well-defined orthographic normalization process and a carefully constructed corpus annotated with Named Entities, both of which help handle the nuances of historical language accurately.

Key insights

Lexical variation in historical specialized texts presents significant challenges for computational processing.

Method

The method involves creating an annotated Named Entity corpus and developing an orthographic normalization roadmap to address lexical variation in historical Portuguese texts.
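
To make the idea of orthographic normalization more concrete, the following is a minimal sketch of how spelling variants of the same entity name might be collapsed onto one normalized form. It does not reproduce the authors' roadmap: the rewrite rules, the example spellings, and the names `RULES`, `normalize`, and `group_variants` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rewrite rules, for illustration only; they are NOT taken
# from the paper. Each pair maps an older spelling pattern to a modern one.
RULES = [
    (re.compile(r"ph"), "f"),           # e.g. "elephante" -> "elefante"
    (re.compile(r"th"), "t"),
    (re.compile(r"y"), "i"),            # e.g. "tygre" -> "tigre"
    (re.compile(r"([lnt])\1"), r"\1"),  # collapse doubled l, n, t
]


def normalize(token: str) -> str:
    """Lowercase a token and apply the rewrite rules in order."""
    form = token.lower()
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def group_variants(tokens):
    """Group the original spellings under their shared normalized form."""
    groups = {}
    for token in tokens:
        groups.setdefault(normalize(token), set()).add(token)
    return groups


if __name__ == "__main__":
    # Invented example mentions, not drawn from the annotated corpus.
    mentions = ["Elephante", "elefante", "Tygre", "gallinha"]
    for canonical, variants in sorted(group_variants(mentions).items()):
        print(canonical, sorted(variants))
```

Keeping the original spellings alongside the normalized key, rather than overwriting them, is one plausible design choice: it preserves the historical forms for annotation while still letting variants of the same entity be counted together.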

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.