LexIris-pt and LexBert-pt: Specialized Sentence Embeddings for Legal Similarity in Brazilian Portuguese
Summary
LexIris-pt and LexBert-pt are two specialized sentence embedding models designed for the Portuguese legal domain, developed through supervised fine-tuning of BERT-based models. Researchers evaluated these models using a comparative protocol involving three stages: zero-shot inference with pretrained embeddings, supervised fine-tuning on pairs of initial petitions, and vector retrieval with incremental clustering over a corpus of 20,000 initial petitions. The study, presented at PROPOR 2026, found that fine-tuning consistently improved correlations with reference scores and enhanced vector retrieval performance. Furthermore, the choice of metric (cosine similarity or inner product) in the vector retrieval index influenced partitioning granularity, highlighting the need for joint calibration of the encoder, metric, and threshold. Following auditing by specialists, LexIris-pt and LexBert-pt were adopted to aid in screening and organizing repetitive claims and predatory litigation.
Key takeaway
Legal professionals and computational linguists working with Brazilian Portuguese legal texts should consider specialized BERT-based sentence embeddings such as LexIris-pt or LexBert-pt. In the study, fine-tuning on domain-specific data (pairs of initial petitions) improved both correlation with reference scores and vector retrieval performance, which supports tasks such as screening repetitive claims and predatory litigation. For vector retrieval, calibrate the encoder, similarity metric, and threshold together rather than one at a time.
Key insights
Fine-tuning BERT-based models on pairs of initial petitions consistently improves legal sentence embeddings for Brazilian Portuguese, both in correlation with reference scores and in vector retrieval.
Principles
- Fine-tuning improves correlation and retrieval.
- Metric choice impacts clustering granularity.
- Joint calibration of encoder, metric, and threshold is crucial.
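To illustrate why the metric choice matters, here is a minimal sketch on toy vectors, not on the paper's embeddings. The vector values and variable names are assumptions made for illustration, and the summary does not state whether the models' embeddings are L2-normalized. On unnormalized vectors, inner product rewards vector length while cosine similarity only looks at direction, so the same threshold can group documents differently.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(inner_product(a, a))

def cosine(a, b):
    return inner_product(a, b) / (norm(a) * norm(b))

def l2_normalize(a):
    n = norm(a)
    return [x / n for x in a]

# Toy 2-D vectors (hypothetical, for illustration only).
short_doc = [1.0, 0.0]   # some direction, small norm
long_doc = [3.0, 0.0]    # same direction as short_doc, larger norm
other_doc = [2.0, 2.0]   # different direction

pairs = {
    "short vs long": (short_doc, long_doc),
    "long vs other": (long_doc, other_doc),
}

for name, (a, b) in pairs.items():
    print(name, "inner product:", inner_product(a, b), "cosine:", round(cosine(a, b), 4))

# After L2 normalization the two metrics coincide.
ip_normalized = inner_product(l2_normalize(long_doc), l2_normalize(other_doc))
print("normalized inner product:", round(ip_normalized, 4))
```

In this toy case cosine ranks the same-direction pair highest, while inner product ranks the pair with the larger norms highest. That kind of reordering is one way a fixed threshold can yield coarser or finer partitions depending on the metric.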
Method
A three-stage comparative protocol: zero-shot inference with pretrained embeddings, supervised fine-tuning of BERT-based models on pairs of initial petitions, and vector retrieval with incremental clustering over a corpus of 20,000 initial petitions.
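The summary does not describe the clustering algorithm itself, so the following is only a sketch of one common form of incremental clustering (a leader-style scheme), shown to make the role of the threshold concrete. The function name `incremental_cluster`, the toy vectors, and the threshold values are all assumptions. Each incoming vector joins the most similar existing cluster if its cosine similarity reaches the threshold, and otherwise opens a new cluster.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def incremental_cluster(vectors, threshold):
    """Leader-style clustering: one pass, one representative per cluster."""
    leaders = []  # first member of each cluster, L2-normalized
    labels = []
    for v in vectors:
        v = normalize(v)
        best_idx, best_sim = -1, -1.0
        # Find the most similar existing cluster representative.
        for idx, leader in enumerate(leaders):
            sim = dot(v, leader)  # cosine, since both vectors are normalized
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            labels.append(best_idx)
        else:
            # No cluster is close enough: open a new one.
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels

# Toy vectors standing in for petition embeddings (hypothetical values).
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]

loose = incremental_cluster(toy, threshold=0.7)
strict = incremental_cluster(toy, threshold=0.99)
print("loose threshold:", len(set(loose)), "clusters")
print("strict threshold:", len(set(strict)), "clusters")
```

Raising the threshold splits the toy data into more clusters, so the threshold is a direct control on partitioning granularity. A production setup would typically query a vector index instead of scanning all representatives.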
In practice
- Use LexIris-pt or LexBert-pt for legal text.
- Calibrate encoder, metric, and threshold jointly for retrieval.
- Apply fine-tuning for improved legal similarity.
Topics
- LexIris-pt
- LexBert-pt
- Sentence Embeddings
- Brazilian Portuguese Legal Domain
- BERT-based Models
Best for: Research Scientist, AI Scientist, NLP Engineer, Legal Professional
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.