Exploring automatic terminology extraction from historical medical data

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Health & Medical Research · Depth: Advanced, quick

Summary

This study evaluates automatic terminology extraction methods on historical medical text with non-standard orthography. The researchers tested two linguistic pattern-based methods, four prompt-based generative AI (GenAI) models, and one BERT-like model. Some of these models were fine-tuned for terminology extraction, including one specialized in Portuguese medical terms. The GenAI models were run with four distinct prompting strategies. The test data was chapter fifteen of "Aviso á Gente do Mar sobre a sua Saude," a 1794 Portuguese translation of an 18th-century French medical text, manually annotated for terminology. Evaluation focused on F-measure and pure precision, to assess how far automatic methods could augment manual token-based annotation. The results indicate that, although individual models showed limited extraction quality, combining multiple automatic extraction methods raised coverage to over 90% recall on the test data.

Key takeaway

For NLP engineers working with historical or orthographically inconsistent datasets, a multi-method approach to terminology extraction is worth considering. A single model may underperform on its own. In the study, combining linguistic pattern methods, a BERT-like model, and prompted GenAI models reached over 90% recall on the test data, which improves term coverage for downstream tasks.

Key insights

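Combining multiple automatic methods improves terminology extraction coverage on historical texts with non-standard orthography, reaching over 90% recall on the test data even though each individual method showed limited extraction quality.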

Principles

Historical texts challenge modern NLP.
Model combination boosts recall.
Prompting strategies impact GenAI.

Method

The study tested two linguistic pattern-based methods, one BERT-like model, and four prompt-based GenAI models (run with four prompting strategies) on one chapter of a 1794 Portuguese medical text, evaluating F-measure and precision against manual annotations.
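
To make the evaluation step concrete, here is a minimal Python sketch of scoring one extractor's output against a manually annotated term list. It is not the study's evaluation code. The function names, the lowercase-and-whitespace normalization, the exact-match comparison over unique terms, and the toy term lists are all assumptions made for illustration. The study describes token-based annotation, which this simplified version does not reproduce.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace (an assumed, deliberately simple normalization)."""
    return " ".join(term.lower().split())


def evaluate(predicted, gold):
    """Return (precision, recall, f1) comparing two collections of term strings."""
    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in gold}
    true_pos = len(pred & ref)  # terms proposed by the system that are also in the gold list
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only (not taken from the study).
gold_terms = ["febre", "escorbuto", "sangria", "ulcera"]
system_terms = ["Febre", "escorbuto", "navio"]

p, r, f = evaluate(system_terms, gold_terms)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# prints: precision=0.67 recall=0.50 f1=0.57
```

Exact matching after light normalization is the strictest reasonable choice here. With non-standard orthography, a real evaluation would likely need a more careful notion of what counts as the same term.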

In practice

Combine extractors for higher recall.
Fine-tune models for specific domains.
Experiment with GenAI prompting.
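
The first point above, combining extractors, can be sketched as a simple union of candidate lists. The Python below is an illustrative assumption rather than the study's method: the summary does not say how outputs were merged, and the extractor names, helper functions, and toy terms are all hypothetical.

```python
from collections import defaultdict


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace (an assumed, deliberately simple normalization)."""
    return " ".join(term.lower().split())


def combine(outputs):
    """Map each normalized candidate term to the set of extractors that proposed it."""
    sources = defaultdict(set)
    for name, terms in outputs.items():
        for term in terms:
            sources[normalize(term)].add(name)
    return dict(sources)


def recall(candidates, gold):
    """Fraction of gold terms that appear among the candidates."""
    gold_set = {normalize(t) for t in gold}
    cand_set = {normalize(t) for t in candidates}
    return len(gold_set & cand_set) / len(gold_set) if gold_set else 0.0


# Toy outputs, invented for illustration only (not taken from the study).
outputs = {
    "pattern_based": ["febre", "agua do mar"],
    "bert_like": ["escorbuto", "febre"],
    "genai_prompt": ["sangria", "escorbuto"],
}
gold = ["febre", "escorbuto", "sangria", "ulcera"]

combined = combine(outputs)
for name, terms in outputs.items():
    print(name, recall(terms, gold))  # 0.25, 0.5, 0.5
print("union", recall(combined, gold))  # 0.75
```

Keeping track of which extractors proposed each term also allows filtering by agreement, for example keeping only terms proposed by at least two extractors. That would typically trade some recall for precision.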

Topics

Automatic Terminology Extraction
Historical Medical Data
Generative AI Models
BERT-like Models
Linguistic Pattern Methods

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.