LexIris-pt and LexBert-pt: Specialized Sentence Embeddings for Legal Similarity in Brazilian Portuguese

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

LexIris-pt and LexBert-pt are two specialized sentence embedding models designed for the Portuguese legal domain, developed through supervised fine-tuning of BERT-based models. Researchers evaluated these models using a comparative protocol involving three stages: zero-shot inference with pretrained embeddings, supervised fine-tuning on pairs of initial petitions, and vector retrieval with incremental clustering over a corpus of 20,000 initial petitions. The study, presented at PROPOR 2026, found that fine-tuning consistently improved correlations with reference scores and enhanced vector retrieval performance. Furthermore, the choice of metric (cosine similarity or inner product) in the vector retrieval index influenced partitioning granularity, highlighting the need for joint calibration of the encoder, metric, and threshold. Following auditing by specialists, LexIris-pt and LexBert-pt were adopted to aid in screening and organizing repetitive claims and predatory litigation.

Key takeaway

Legal professionals and computational linguists working with Brazilian Portuguese legal texts should consider specialized BERT-based sentence embeddings such as LexIris-pt or LexBert-pt. In the study, fine-tuning on domain-specific data (pairs of initial petitions) consistently improved correlation with reference scores and vector retrieval performance, which supports tasks such as screening repetitive claims and predatory litigation. Calibrate the encoder, the similarity metric (cosine similarity or inner product), and the threshold together, because the metric choice changes how finely the corpus is partitioned.

Key insights

Fine-tuning BERT-based models on pairs of initial petitions consistently improved legal sentence embeddings for Brazilian Portuguese, both in correlation with reference scores and in vector retrieval.

Principles

Fine-tuning improves correlation and retrieval.
Metric choice impacts clustering granularity.
Joint calibration of encoder, metric, and threshold is crucial.
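
The effect of metric choice on granularity can be illustrated with a minimal sketch. It assumes a simple "leader" style of incremental clustering (each vector joins the best-scoring cluster representative, or opens a new cluster when no score reaches the threshold); the summary does not say which incremental clustering procedure the authors used. The vectors, the threshold of 0.9, and all function names are invented for illustration and are not data from the study.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: inner product divided by the two vector norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def incremental_cluster(vectors, score, threshold):
    """Leader-style incremental clustering (an assumed, simplified procedure).

    Each vector is compared with the first member of every existing cluster.
    It joins the best-scoring cluster if that score reaches the threshold;
    otherwise it becomes the representative of a new cluster.
    """
    representatives = []
    labels = []
    for vec in vectors:
        best_idx, best_score = -1, float("-inf")
        for idx, rep in enumerate(representatives):
            s = score(vec, rep)
            if s > best_score:
                best_idx, best_score = idx, s
        if best_idx >= 0 and best_score >= threshold:
            labels.append(best_idx)
        else:
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels

# Toy, deliberately unnormalized vectors (hypothetical, not real embeddings).
toy_vectors = [
    [1.0, 0.0],
    [3.0, 0.3],
    [0.2, 0.02],
    [0.0, 1.0],
    [0.1, 2.0],
]

cosine_labels = incremental_cluster(toy_vectors, cosine, threshold=0.9)
inner_labels = incremental_cluster(toy_vectors, dot, threshold=0.9)

print("cosine clusters:       ", len(set(cosine_labels)), cosine_labels)
print("inner-product clusters:", len(set(inner_labels)), inner_labels)
```

On these toy vectors the same threshold yields a coarser partition under cosine similarity than under inner product, because inner product also depends on vector length. When embeddings are L2-normalized the two scores coincide, so any difference seen in a real pipeline depends on whether and how the encoder output is normalized.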

Method

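BERT-based models are fine-tuned with supervision on pairs of initial petitions. Evaluation follows a three-stage comparative protocol: zero-shot inference with pretrained embeddings, supervised fine-tuning on the petition pairs, and vector retrieval with incremental clustering over a corpus of 20,000 initial petitions.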
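
To make "correlation with reference scores" concrete, here is a minimal sketch that compares two encoders by rank correlation between their predicted pair similarities and human reference scores. The choice of Spearman correlation is an assumption (the summary does not name the coefficient), and every number below is a made-up placeholder, not a result from the study.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical placeholders: one reference score and one predicted
# similarity per petition pair, for two encoders under comparison.
toy_reference = [0.9, 0.1, 0.5, 0.7]
toy_encoder_a = [0.6, 0.5, 0.7, 0.55]
toy_encoder_b = [0.85, 0.2, 0.5, 0.65]

rho_a = spearman(toy_reference, toy_encoder_a)
rho_b = spearman(toy_reference, toy_encoder_b)
print(f"encoder A: {rho_a:.2f}, encoder B: {rho_b:.2f}")
```

In a real comparison, the two score lists would come from the pretrained (zero-shot) encoder and its fine-tuned counterpart evaluated on the same held-out petition pairs.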

In practice

Use LexIris-pt or LexBert-pt for legal text.
Calibrate encoder, metric, and threshold together for retrieval.
Apply fine-tuning for improved legal similarity.
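
One way to approach the calibration step is a threshold sweep against a small set of pairs that specialists have labeled as similar or not. The sketch below picks the candidate threshold with the best F1. This selection criterion, the labeled-pair setup, and all values are assumptions for illustration; the summary does not describe how the authors calibrated their thresholds.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the rule 'score >= threshold means similar' against boolean labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def calibrate_threshold(scores, labels, candidates):
    """Return the candidate with the highest F1; ties go to the stricter threshold."""
    return max(candidates, key=lambda t: (f1_at_threshold(scores, labels, t), t))

# Hypothetical similarity scores for five petition pairs under one
# encoder/metric combination, with made-up specialist labels.
toy_scores = [0.95, 0.88, 0.80, 0.62, 0.40]
toy_labels = [True, True, False, True, False]
toy_candidates = [0.5, 0.6, 0.7, 0.85, 0.9]

best_threshold = calibrate_threshold(toy_scores, toy_labels, toy_candidates)
print("selected threshold:", best_threshold)
```

Because scores from different encoders and metrics live on different scales, the sweep would need to be repeated for each encoder and metric combination rather than reusing a single threshold.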

Topics

LexIris-pt
LexBert-pt
Sentence Embeddings
Brazilian Portuguese Legal Domain
BERT-based Models

Best for: Research Scientist, AI Scientist, NLP Engineer, Legal Professional

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.