Towards a Universal Dependencies Corpus for Portuguese Epidemiological Reports
Summary
An ongoing research project is constructing a Universal Dependencies (UD) corpus of Portuguese epidemiological reports sourced from Brazilian public health documents. The corpus is built from PDF reports through a controlled document extraction pipeline that compares layout-aware and raw PDF text extraction, a comparison designed to show how tabular content affects downstream syntactic analysis. Narrative text within the reports is annotated with multiple UD parsers for Portuguese, and the parser outputs are compared systematically using descriptive structural indicators and qualitative inspection. The analysis reveals domain-specific challenges in epidemiological texts and shows that document extraction and representation choices have a greater effect on parsing behavior than the choice of parser alone. Based on these findings, the project identifies robust preprocessing configurations and discusses design choices for a UD epidemiological corpus intended to support future research in syntactic parsing, domain adaptation, and natural language processing tasks within epidemiology and public health.
Key takeaway
NLP engineers building solutions for public health data should focus first on document extraction and representation. The preprocessing configuration, particularly how layout and tabular data are handled, is likely to change syntactic parsing behavior more than swapping one parser for another. Investigate robust preprocessing methods before tuning parser choice for downstream NLP tasks in epidemiology.
Key insights
Document extraction and representation choices affect the parsing of epidemiological texts more than parser selection does.
Principles
- Extraction strategy (layout-aware vs. raw) shapes parsing behavior.
- Tabular content affects syntactic analysis.
- Domain-specific texts pose unique challenges.
Method
PDF reports are processed via controlled extraction (layout-aware vs. raw), narrative text is annotated with multiple UD parsers, and outputs are compared using structural indicators and qualitative inspection.
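The comparison step can be illustrated with a minimal sketch that computes a few descriptive structural indicators from CoNLL-U output. The summary does not list which indicators the project uses, so the ones below (mean sentence length, mean dependency length, sentences without exactly one root, relation counts) are assumptions chosen for illustration, and the function names and sample sentence are hypothetical. The same function could be run on each parser's output for each extraction configuration and the resulting dictionaries compared side by side.

```python
from collections import Counter


def parse_conllu(text):
    """Return sentences as lists of (token_id, head, deprel) tuples.

    Only plain integer-ID tokens are kept; multiword ranges (e.g. 1-2)
    and empty nodes (e.g. 1.1) are skipped.
    """
    sentences, current = [], []
    for raw in text.splitlines():
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        current.append((int(cols[0]), int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def structural_indicators(sentences):
    """Illustrative indicators only; not the project's actual metric set."""
    n_tokens = sum(len(s) for s in sentences)
    dep_lengths = [abs(i - h) for s in sentences for i, h, _ in s if h != 0]
    root_counts = [sum(1 for _, h, _ in s if h == 0) for s in sentences]
    return {
        "sentences": len(sentences),
        "mean_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        "mean_dependency_length": (
            sum(dep_lengths) / len(dep_lengths) if dep_lengths else 0.0
        ),
        "sentences_without_single_root": sum(1 for r in root_counts if r != 1),
        "deprel_counts": Counter(rel for s in sentences for _, _, rel in s),
    }


def _row(idx, form, upos, head, deprel):
    # Build one 10-column CoNLL-U line; unused columns are left as "_".
    return "\t".join(
        [str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"]
    )


# Made-up sample sentence, not taken from the corpus.
SAMPLE = "\n".join([
    "# text = Foram notificados 120 casos",
    _row(1, "Foram", "AUX", 2, "aux:pass"),
    _row(2, "notificados", "VERB", 0, "root"),
    _row(3, "120", "NUM", 4, "nummod"),
    _row(4, "casos", "NOUN", 2, "nsubj:pass"),
    "",
])

indicators = structural_indicators(parse_conllu(SAMPLE))
print(indicators)
```

Indicators like these describe the shape of the parses and do not measure accuracy, which suits a setting where no gold annotation exists yet.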
In practice
- Prioritize document extraction methods.
- Address tabular content explicitly.
- Evaluate parsers with domain-specific data.
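One way to address tabular content explicitly is to separate table-like lines from narrative text before parsing. The sketch below is a simple heuristic and not the project's pipeline: the function names, the column-gap rule, and the thresholds (`min_columns`, `numeric_ratio`) are all hypothetical and would need tuning against real extracted reports.

```python
import re

# Tokens made only of digits and common numeric punctuation, e.g. "1.204" or "3,2%".
_NUMERIC = re.compile(r"[\d.,%/-]+")


def looks_tabular(line, min_columns=3, numeric_ratio=0.5):
    """Guess whether an extracted line came from a table (assumed heuristic).

    A line is flagged when it splits into several columns on tabs or runs
    of two or more spaces, or when most of its tokens are numeric.
    """
    stripped = line.strip()
    if not stripped:
        return False
    columns = [c for c in re.split(r"\s{2,}|\t", stripped) if c]
    tokens = stripped.split()
    numeric = sum(
        1 for t in tokens
        if _NUMERIC.fullmatch(t) and any(ch.isdigit() for ch in t)
    )
    return len(columns) >= min_columns or numeric / len(tokens) >= numeric_ratio


def split_narrative_and_tables(text):
    """Route each non-empty line to a narrative or a tabular bucket."""
    narrative, tabular = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        (tabular if looks_tabular(line) else narrative).append(line.strip())
    return narrative, tabular


# Made-up extraction output, not taken from a real report.
SAMPLE_TEXT = "\n".join([
    "Foram notificados 120 casos de dengue no município.",
    "Região    Casos    Óbitos",
    "Norte  1.204  12",
    "2023 15 3,2%",
])

narrative_lines, table_lines = split_narrative_and_tables(SAMPLE_TEXT)
print(narrative_lines)
print(table_lines)
```

Only the narrative bucket would then be passed to the UD parsers, while the tabular bucket can be stored separately so that no report content is lost.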
Topics
- Universal Dependencies
- Portuguese Epidemiological Reports
- Document Extraction
- Syntactic Parsing
- Domain Adaptation
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.