Exploring automatic terminology extraction from historical medical data
Summary
This study evaluates automatic terminology extraction methods applied to historical medical texts with non-standard orthography. The researchers tested two linguistic pattern-based methods, four prompt-based Generative AI models, and one BERT-like model. Some of these were fine-tuned for terminology extraction, including one specialized in Portuguese medical terms. Four distinct prompting strategies were used for the GenAI models. The test data was chapter fifteen of "Aviso á Gente do Mar sobre a sua Saude," a 1794 Portuguese translation of an 18th-century French medical text, manually annotated for terminology. Evaluation focused on F-measure and pure precision to assess how automatic methods could augment manual token-based annotation. The results indicate that, although individual models showed limited extraction quality, combining multiple automatic extraction methods substantially improves coverage, reaching over 90% recall on the test data.
Key takeaway
NLP engineers working with historical or orthographically inconsistent datasets should consider a multi-model approach to terminology extraction. A single-model solution may underperform, but in this study the combination of linguistic pattern methods, a BERT-like model, and prompted Generative AI models reached over 90% recall on the test data, which improves term coverage for downstream tasks.
Key insights
Combining multiple automatic methods significantly improves terminology extraction from historical texts with non-standard orthography.
Principles
- Historical texts challenge modern NLP.
- Model combination boosts recall.
- Prompting strategies impact GenAI.
Method
The study tested two linguistic pattern-based methods, one BERT-like model, and four prompt-based GenAI models (each under four prompting strategies) on a manually annotated chapter of a 1794 Portuguese medical text, evaluating F-measure and precision against the manual annotations.
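As a rough illustration of scoring an extractor against manual annotations, the sketch below computes precision, recall, and F-measure over sets of term strings. It assumes exact matching on lowercased term types, which is a simplification of the token-based annotation the summary describes. The function name `evaluate` and the example terms are hypothetical and are not taken from the study's data.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for two collections of term strings.

    Assumption: terms match only if they are identical after trimming
    whitespace and lowercasing. The study's own matching rules may differ.
    """
    pred = {t.strip().lower() for t in predicted}
    ref = {t.strip().lower() for t in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations and extractor output, for illustration only.
gold_terms = {"febre", "escorbuto", "sangria", "purgante"}
extracted_terms = ["febre", "Escorbuto", "navio"]

precision, recall, f1 = evaluate(extracted_terms, gold_terms)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three proposed terms are in the gold set, so precision is 2/3, recall is 2/4, and the F-measure is their harmonic mean, 4/7.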
In practice
- Combine extractors for higher recall.
- Fine-tune models for specific domains.
- Experiment with GenAI prompting.
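One simple way to combine extractors is to take the union of their outputs and record which extractor proposed each term. The sketch below shows how that union can raise recall above any single extractor. The extractor names, the term lists, and the helpers `combine_extractors` and `recall_of` are hypothetical, and the summary does not state how the study merged outputs, so a plain union is an assumption.

```python
def normalize(term):
    """Trim whitespace and lowercase so trivially different forms match."""
    return term.strip().lower()


def combine_extractors(outputs):
    """Union the outputs of several extractors.

    `outputs` maps an extractor name to its list of proposed terms.
    Returns a dict mapping each normalized term to the sorted list of
    extractor names that proposed it.
    """
    combined = {}
    for name, terms in outputs.items():
        for term in terms:
            combined.setdefault(normalize(term), set()).add(name)
    return {term: sorted(names) for term, names in combined.items()}


def recall_of(predicted, gold):
    """Fraction of gold terms found in `predicted` (exact match after normalizing)."""
    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in gold}
    return len(pred & ref) / len(ref) if ref else 0.0


# Hypothetical outputs from three kinds of extractor, for illustration only.
outputs = {
    "patterns": ["febre", "dor de cabeça"],
    "bert": ["escorbuto", "Febre"],
    "genai": ["sangria"],
}
gold_terms = {"febre", "escorbuto", "sangria", "purgante"}

individual = {name: recall_of(terms, gold_terms) for name, terms in outputs.items()}
combined = combine_extractors(outputs)
combined_recall = recall_of(combined, gold_terms)

print(individual)        # best single extractor finds 2 of 4 gold terms
print(combined_recall)   # the union finds 3 of 4 gold terms
```

A union trades precision for recall, since every false positive from every extractor is kept. The provenance lists in `combined` leave room for a stricter rule, such as keeping only terms proposed by at least two extractors, if precision matters more.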
Topics
- Automatic Terminology Extraction
- Historical Medical Data
- Generative AI Models
- BERT-like Models
- Linguistic Pattern Methods
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.