LegalSim-PT: Building a Dataset for Legal Document Simplification in Portuguese Leveraging Linguistic Metrics

Source: Paper Index on ACL Anthology · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, medium

Summary

LegalSim-PT is presented as the first large-scale Portuguese dataset designed specifically for legal document simplification. It targets document-level simplification, which must preserve fluency, conciseness, and coherence across an entire text and often draws on summarization techniques. The dataset was developed by Arthur Scalercio, Maria José Finatto, and Aline Paes and presented at PROPOR 2026. It was built with data augmentation strategies, and document pairs were selected using readability, semantic similarity, and diversity metrics, which reduces the need for extensive manual evaluation. The authors analyzed the dataset's surface features, compared it with existing simplification corpora, and validated its quality through automatic metrics, linguistic indicators, and human evaluation. They also fine-tuned two baseline models on LegalSim-PT and report improved performance on document-level simplification.
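To make the "surface features" analysis concrete, here is a minimal sketch of the kind of statistics one might compute for a source/simplified document pair. The summary does not say which features the authors measured, so the function names (`surface_stats`, `compression_ratio`), the regex tokenizer, and the choice of features are all illustrative assumptions.

```python
import re


def surface_stats(text: str) -> dict:
    """Basic surface statistics for one document (illustrative feature set)."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # legal text with abbreviations would need a proper sentence splitter).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # \w+ is Unicode-aware in Python 3, so accented Portuguese words stay whole.
    tokens = re.findall(r"\w+", text.lower())
    return {
        "n_sentences": len(sentences),
        "n_tokens": len(tokens),
        "avg_sentence_len": len(tokens) / max(len(sentences), 1),
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),
    }


def compression_ratio(source: str, simplified: str) -> float:
    """Token count of the simplified text divided by that of the source."""
    src_tokens = surface_stats(source)["n_tokens"]
    simp_tokens = surface_stats(simplified)["n_tokens"]
    return simp_tokens / max(src_tokens, 1)


if __name__ == "__main__":
    source = "O réu foi condenado. Ele recorreu da decisão."
    simplified = "O réu recorreu."
    print(surface_stats(source))
    print(compression_ratio(source, simplified))
```

A compression ratio below 1 indicates that the simplified document is shorter than its source, which is typical when simplification overlaps with summarization.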

Key takeaway

For NLP engineers working on Portuguese, LegalSim-PT is a dedicated resource for document-level simplification in the legal domain. The two baseline models fine-tuned on it showed improved document-level simplification, so fine-tuning your own simplification models on the dataset may yield more accurate and contextually appropriate output. Consider adding LegalSim-PT to your training pipeline if you need to move beyond sentence-level simplification.

Key insights

LegalSim-PT is the first large-scale Portuguese legal document simplification dataset, created using linguistic metrics and data augmentation.

Principles

Method

LegalSim-PT was built by first applying data augmentation and then selecting document pairs using readability, semantic similarity, and diversity metrics. Dataset quality was assessed with automatic metrics, linguistic indicators, and human evaluation.
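The selection step can be sketched as a filter over candidate simplifications. This is only an illustration under stated assumptions, not the authors' pipeline: the summary does not name the actual metrics or thresholds, so the length-based readability proxy, the bag-of-words cosine similarity (a stand-in for embedding-based similarity), the Jaccard-based diversity score, and the threshold values below are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    return re.findall(r"\w+", text.lower())


def readability_proxy(text: str) -> float:
    """Hypothetical difficulty score (lower = easier):
    mean sentence length in tokens plus mean word length in characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    toks = tokenize(text)
    if not toks:
        return 0.0
    return len(toks) / max(len(sentences), 1) + sum(map(len, toks)) / len(toks)


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine; a stand-in for a semantic similarity model."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def rewrite_diversity(source: str, candidate: str) -> float:
    """1 - Jaccard overlap of word types; 0 means a verbatim copy."""
    s, c = set(tokenize(source)), set(tokenize(candidate))
    union = s | c
    return 1.0 - len(s & c) / len(union) if union else 0.0


def select_pairs(source, candidates, min_similarity=0.3, min_diversity=0.2):
    """Keep candidates that are easier than the source, still on topic,
    and not near-copies; return them ordered by readability gain."""
    src_score = readability_proxy(source)
    kept = []
    for cand in candidates:
        gain = src_score - readability_proxy(cand)
        if (
            gain > 0
            and cosine_similarity(source, cand) >= min_similarity
            and rewrite_diversity(source, cand) >= min_diversity
        ):
            kept.append((gain, cand))
    return [cand for gain, cand in sorted(kept, key=lambda x: x[0], reverse=True)]


if __name__ == "__main__":
    source = (
        "O requerente interpôs recurso de apelação contra a sentença "
        "proferida pelo juízo de primeira instância."
    )
    candidates = [
        source,  # verbatim copy: no readability gain, no diversity
        "O autor recorreu da sentença. O recurso foi contra a decisão do juiz.",
        "Hoje está chovendo muito.",  # easy to read but unrelated
    ]
    for kept in select_pairs(source, candidates):
        print(kept)
```

Combining the three checks is what lets automatic filtering replace part of the manual review: readability alone would accept unrelated easy text, and similarity alone would accept verbatim copies.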

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.