18th-century Fauna and Flora: from Named Entities to the problems of standardization
Summary
This article details the challenges and processes involved in preparing 18th-century Portuguese historical sources, specifically those concerning fauna and flora, for computational analysis. The authors initially anticipated low lexical ambiguity because the domain is specialized, but they encountered significant variation. The work outlines a roadmap for orthographic normalization, describes the creation of an annotated corpus of Named Entities, and discusses the problems that lexical variation causes within these specialized thesauri. The study aims to improve the understanding of historical source normalization and to emphasize the importance of robust practices in this field.
Key takeaway
Research scientists working with historical linguistic data should anticipate significant lexical variation even in specialized domains. Computational processing pipelines benefit from a well-defined orthographic normalization process and a carefully constructed annotated corpus of Named Entities, which together help handle the nuances of historical language accurately.
Key insights
Lexical variation in historical specialized texts presents significant challenges for computational processing.
Principles
- Specialized domains do not guarantee low lexical ambiguity.
- Orthographic normalization is critical for historical texts.
Method
The method involves creating an annotated Named Entity corpus and developing an orthographic normalization roadmap to address lexical variation in historical Portuguese texts.
In practice
- Annotate historical corpora for Named Entities.
- Develop normalization roadmaps for lexical variation.
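As a rough illustration of these two practices, the sketch below applies a small set of rewrite rules to historical spellings and then groups annotated entity mentions by their normalized form. The rules, the function names (`normalize`, `group_variants`), and the sample word forms are assumptions made for illustration. They are not the authors' roadmap and are not drawn from their corpus.

```python
import re
from collections import defaultdict

# Hypothetical rewrite rules for pre-reform spellings (illustrative only).
# Each pair is (compiled pattern, replacement); rules are applied in order.
RULES = [
    (re.compile(r"ph"), "f"),                # e.g. "ph" written for "f"
    (re.compile(r"th"), "t"),                # e.g. "th" written for "t"
    (re.compile(r"y"), "i"),                 # e.g. "y" written for "i"
    (re.compile(r"([bcdfglmnpt])\1"), r"\1"),  # collapse doubled consonants,
                                               # leaving "rr" and "ss" intact
]


def normalize(token: str) -> str:
    """Lowercase a token and apply each rewrite rule in sequence."""
    result = token.lower()
    for pattern, replacement in RULES:
        result = pattern.sub(replacement, result)
    return result


def group_variants(mentions: list[str]) -> dict[str, list[str]]:
    """Map each normalized form to the sorted surface forms that produce it."""
    groups: dict[str, set[str]] = defaultdict(set)
    for mention in mentions:
        groups[normalize(mention)].add(mention)
    return {key: sorted(forms) for key, forms in groups.items()}


if __name__ == "__main__":
    # Made-up entity mentions, standing in for annotated Named Entities.
    sample = ["Elephante", "elefante", "cavallo", "lyrio"]
    for normalized, variants in group_variants(sample).items():
        print(normalized, variants)
```

Keeping the original surface forms next to the normalized key preserves the variation for later study, rather than discarding it during normalization. A real roadmap would need rules that account for context and for exceptions, which this sketch ignores.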
Topics
- 18th-century Portuguese
- Historical Sources
- Fauna and Flora
- Named Entity Recognition
- Orthographic Normalization
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.