LegalSim-PT: Building a Dataset for Legal Document Simplification in Portuguese Leveraging Linguistic Metrics
Summary
LegalSim-PT is presented as the first large-scale Portuguese dataset designed specifically for legal document simplification. It responds to growing interest in document-level simplification, which must keep an entire text fluent, concise, and coherent, and which often draws on summarization techniques. The dataset was developed by Arthur Scalercio, Maria José Finatto, and Aline Paes and presented at PROPOR 2026. The authors built it with data augmentation strategies and used readability, semantic similarity, and diversity metrics to select high-quality document pairs, which reduces the need for extensive manual evaluation. They analyzed the dataset's surface features, compared it with existing simplification corpora, and validated its quality through automatic metrics, linguistic indicators, and human evaluation. Two baseline models fine-tuned on LegalSim-PT showed improved performance in document-level simplification.
Key takeaway
For NLP engineers working on Portuguese language processing, LegalSim-PT is a valuable resource for document simplification, especially in the legal domain. Fine-tuning a text simplification model on this specialized dataset may yield more accurate and contextually appropriate outputs. Consider adding LegalSim-PT to your training pipeline to move beyond sentence-level simplification.
Key insights
LegalSim-PT is the first large-scale Portuguese legal document simplification dataset, created using linguistic metrics and data augmentation.
Principles
- Document simplification requires preserving fluency, conciseness, and coherence.
- Linguistic metrics can reduce reliance on manual evaluation for dataset creation.
Method
LegalSim-PT dataset creation involved data augmentation, followed by selection of document pairs using readability, semantic similarity, and diversity metrics. Quality was assessed via automatic metrics, linguistic indicators, and human evaluations.
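The selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not name the specific metrics or thresholds, so every function, proxy, and default value below is a hypothetical stand-in. Readability is approximated by sentence and word length, semantic similarity by bag-of-words cosine (a real pipeline would more likely use sentence embeddings), and diversity by the share of new vocabulary in the candidate.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word tokens; \\w matches accented Portuguese letters."""
    return re.findall(r"\w+", text.lower())


def readability_cost(text):
    """Crude difficulty proxy (assumption): words per sentence plus mean word length."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokens(text)
    if not sentences or not words:
        return 0.0
    return len(words) / len(sentences) + sum(map(len, words)) / len(words)


def cosine_similarity(a, b):
    """Bag-of-words cosine, a stand-in for an embedding-based similarity."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def novelty(source, candidate):
    """Diversity proxy (assumption): share of candidate word types absent from the source."""
    src, cand = set(tokens(source)), set(tokens(candidate))
    return len(cand - src) / len(cand) if cand else 0.0


def select_pairs(source, candidates, min_similarity=0.5, min_novelty=0.1):
    """Keep candidates that read easier, stay on topic, and are not near copies."""
    base = readability_cost(source)
    return [
        cand
        for cand in candidates
        if readability_cost(cand) < base
        and cosine_similarity(source, cand) >= min_similarity
        and novelty(source, cand) >= min_novelty
    ]
```

Given one source document and several augmented candidates, `select_pairs(source, candidates)` returns only the candidates that pass all three filters. The thresholds are illustrative and would need tuning against human judgments.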
In practice
- Fine-tune models on LegalSim-PT for improved Portuguese legal text simplification.
- Apply data augmentation with linguistic metrics for dataset creation in other languages.
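As a rough illustration of the first point, the sketch below prepares (original, simplified) document pairs for sequence-to-sequence fine-tuning. The record field names, the task prefix, and the split ratio are assumptions made for illustration; the summary does not describe the dataset's actual file format or the baselines' training setup.

```python
import json
import random


def build_records(pairs, prefix="simplifique: "):
    """Turn (original, simplified) tuples into input/target records.

    The prefix and field names are hypothetical, not taken from the dataset.
    """
    return [
        {"input": prefix + original, "target": simplified}
        for original, simplified in pairs
    ]


def train_validation_split(records, validation_ratio=0.1, seed=13):
    """Shuffle a copy deterministically, then hold out a validation slice."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_ratio))
    return shuffled[n_val:], shuffled[:n_val]


def to_jsonl(records):
    """Serialize one JSON object per line, keeping accented characters readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

The resulting JSONL text can be written to disk and loaded by most fine-tuning toolkits, which typically expect one example per line.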
Topics
- Legal Document Simplification
- Portuguese Language Processing
- LegalSim-PT Dataset
- Linguistic Metrics
- Data Augmentation
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.