Towards a Universal Dependencies Corpus for Portuguese Epidemiological Reports

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Science & Research — Health & Medical Research, Mathematics & Computational Sciences, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

An ongoing research project is building a Universal Dependencies (UD) corpus of Portuguese epidemiological reports drawn from Brazilian public health documents. The corpus is built from PDF reports through a controlled document extraction pipeline that compares layout-aware extraction with raw PDF text extraction, explicitly addressing how tabular content affects downstream syntactic analysis. Narrative text from the reports is annotated with multiple UD parsers for Portuguese, covering both widely used and more advanced tools, and the parser outputs are compared systematically using descriptive structural indicators and qualitative inspection. The analysis reveals domain-specific challenges in epidemiological texts and shows that document extraction and representation choices affect parsing behavior more than the choice of parser alone. Based on these findings, the project identifies robust preprocessing configurations and discusses design choices for a UD epidemiological corpus intended to support future research in syntactic parsing, domain adaptation, and natural language processing tasks in epidemiology and public health.

Key takeaway

NLP engineers working with public health data should focus first on document extraction and representation. According to the reported findings, the preprocessing configuration, particularly how layout and tabular data are handled, changes parsing behavior more than swapping one parser for another. Settle on a robust preprocessing setup before comparing parsers for downstream NLP tasks in epidemiology.

Key insights

Document extraction and representation choices affect the parsing of epidemiological texts more than parser selection does.

Principles

Extraction mode (layout-aware vs. raw) shapes parsing behavior.
Tabular content affects syntactic analysis.
Domain-specific texts pose unique challenges.

Method

PDF reports are processed via controlled extraction (layout-aware vs. raw), narrative text is annotated with multiple UD parsers, and outputs are compared using structural indicators and qualitative inspection.
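
The summary does not list which structural indicators the project uses. As a minimal sketch, assuming parser output in CoNLL-U format, the Python below computes a few commonly used descriptive indicators (sentence length, mean dependency distance, sentences without exactly one root, relation counts). The function names, the choice of indicators, and the sample sentence are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def read_conllu(text):
    """Parse CoNLL-U text into sentences; each sentence is a list of (id, head, deprel)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments
        cols = line.split("\t")
        # Skip malformed rows, multiword token ranges ("1-2") and empty nodes ("3.1").
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        current.append((int(cols[0]), int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def structural_indicators(sentences):
    """Descriptive indicators that need no gold-standard annotation."""
    n_tokens = sum(len(s) for s in sentences)
    # Dependency distance: linear distance between a token and its head (root excluded).
    distances = [abs(i - h) for s in sentences for i, h, _ in s if h != 0]
    roots_per_sentence = [sum(1 for _, h, _ in s if h == 0) for s in sentences]
    return {
        "sentences": len(sentences),
        "tokens": n_tokens,
        "mean_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        "mean_dependency_distance": sum(distances) / len(distances) if distances else 0.0,
        "sentences_without_single_root": sum(1 for r in roots_per_sentence if r != 1),
        "deprel_counts": Counter(rel for s in sentences for _, _, rel in s),
    }


# Hypothetical sample: "Foram notificados 120 casos ." written as space-separated
# rows and converted to the tab-separated layout that CoNLL-U requires.
SAMPLE = "\n".join(
    "\t".join(row.split())
    for row in [
        "1 Foram ser AUX _ _ 2 aux:pass _ _",
        "2 notificados notificar VERB _ _ 0 root _ _",
        "3 120 120 NUM _ _ 4 nummod _ _",
        "4 casos caso NOUN _ _ 2 nsubj:pass _ _",
        "5 . . PUNCT _ _ 2 punct _ _",
    ]
)

stats = structural_indicators(read_conllu(SAMPLE))
print(stats["mean_sentence_length"], stats["mean_dependency_distance"])  # 5.0 1.75
```

Running the same function over the output of each parser, and over each extraction mode, gives directly comparable numbers. A sharp rise in sentence length or in sentences without a single root under one extraction mode would be one possible sign that table rows have been merged into the narrative text.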

In practice

Prioritize document extraction methods.
Address tabular content explicitly.
Evaluate parsers with domain-specific data.
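
One way to address tabular content explicitly is to separate table-like lines from narrative text before parsing. The sketch below is a hypothetical heuristic, not the extraction pipeline described in the paper: it flags a line as tabular when at least half of its whitespace-separated tokens are numeric. The function names, thresholds, and sample text are assumptions made for illustration.

```python
import re

# A token counts as numeric if it contains a digit and consists only of
# digits and common numeric punctuation (e.g. "120", "12,5%", "2024/05").
NUMERIC_TOKEN = re.compile(r"(?=.*\d)[\d.,%/-]+")


def looks_tabular(line, min_numeric_ratio=0.5, min_tokens=3):
    """Heuristically decide whether an extracted line is a table row."""
    tokens = line.split()
    if len(tokens) < min_tokens:
        return False
    numeric = sum(1 for t in tokens if NUMERIC_TOKEN.fullmatch(t))
    return numeric / len(tokens) >= min_numeric_ratio


def split_narrative_and_tables(text):
    """Return (narrative_lines, tabular_lines), skipping blank lines."""
    narrative, tabular = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        (tabular if looks_tabular(stripped) else narrative).append(stripped)
    return narrative, tabular


# Hypothetical extracted page mixing narrative sentences and table rows.
EXTRACTED = """Foram notificados 120 casos de dengue no município.
Norte 120 35 12,5%
Sul 98 20 9,1%
A incidência aumentou na região Sul."""

narrative, tabular = split_narrative_and_tables(EXTRACTED)
print(len(narrative), len(tabular))  # 2 2
```

Only the narrative lines would then be passed to the UD parsers. The threshold is a tuning choice: text-heavy table rows will slip through a numeric-ratio rule, so flagged and unflagged lines are worth inspecting by hand on a sample of reports.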

Topics

Universal Dependencies
Portuguese Epidemiological Reports
Document Extraction
Syntactic Parsing
Domain Adaptation

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.