LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

2026-04-22 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

LegalBench-BR is introduced as the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas using LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, which updates only 0.3% of model parameters, achieved 87.6% accuracy and 0.87 macro-F1. The commercial LLMs trailed by a wide margin: Claude 3.5 Haiku scored 22 pp lower and GPT-4o mini 28 pp lower. The gap was widest on "administrativo" (administrative law), where GPT-4o mini scored F1 = 0.00 and Claude 3.5 Haiku scored F1 = 0.08, while the fine-tuned model reached F1 = 0.91. The commercial LLMs showed a bias towards "civel" (civil law) and misclassified ambiguous cases, a problem eliminated by domain-adapted fine-tuning.
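To make the per-class numbers concrete, here is a minimal sketch of how a bias towards one class yields F1 = 0.00 on another class while overall accuracy still looks moderate. The toy labels and predictions are invented for illustration and are not taken from the benchmark; "outra" is a hypothetical placeholder for the remaining legal areas.

```python
# Toy, class-balanced evaluation set: 4 examples per class (invented data).
LABELS = ["civel", "administrativo", "outra"]  # "outra" is a hypothetical placeholder
y_true = ["civel"] * 4 + ["administrativo"] * 4 + ["outra"] * 4
# A classifier biased towards "civel": every administrative case is labeled civil.
y_pred = ["civel"] * 4 + ["civel"] * 4 + ["outra"] * 4


def per_class_f1(true, pred, labels):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(true, pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(true, pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(true, pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


scores = per_class_f1(y_true, y_pred, LABELS)
macro_f1 = sum(scores.values()) / len(scores)  # unweighted mean over classes
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, f1 in scores.items():
    print(f"{label:15s} F1 = {f1:.2f}")
print(f"macro-F1 = {macro_f1:.2f}, accuracy = {accuracy:.2f}")
```

In this toy case "administrativo" gets F1 = 0.00 and "civel" is penalized for the false positives it absorbs, which is why macro-F1 on a balanced test set exposes the bias more clearly than accuracy does.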

Key takeaway

For AI engineers building legal tech in Brazil, general-purpose large language models alone are not enough for classification tasks. In this benchmark they showed a systematic bias towards civil law and near-zero F1 on administrative law. LoRA fine-tuning of a Portuguese model such as BERTimbau on a domain-specific dataset such as LegalBench-BR achieved higher accuracy and macro-F1, can be run on consumer GPUs, and adds no per-request API cost at inference.

Key insights

Domain-adapted fine-tuning significantly outperforms general LLMs for specialized legal text classification.

Principles

General LLMs struggle with domain-specific legal nuances.
LoRA fine-tuning can close performance gaps efficiently.

Method

The authors collect 3,105 appellate proceedings from TJSC via the DataJud API, annotate them across five legal areas using LLM-assisted labeling with heuristic validation, and then fine-tune BERTimbau with LoRA adapters.
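As a rough check on the "0.3% of model parameters" figure, the arithmetic below counts LoRA adapter weights under stated assumptions. The rank (8), the choice of query and value projections as adapted matrices, and the base-size architecture (12 layers, hidden size 768, roughly 110M parameters) are all assumptions for illustration; the summary does not state the paper's actual configuration.

```python
# Back-of-the-envelope LoRA parameter count. All constants below are assumptions.
HIDDEN = 768                 # hidden size of a BERT-base-style encoder
LAYERS = 12                  # number of encoder layers
BASE_PARAMS = 110_000_000    # approximate size of the frozen base model
RANK = 8                     # assumed LoRA rank
ADAPTED_PER_LAYER = 2        # assumed: query and value projections only
NUM_CLASSES = 5              # five legal areas, per the summary


def lora_params(d_in, d_out, rank):
    """LoRA replaces a d_out x d_in update with B (d_out x r) @ A (r x d_in)."""
    return rank * (d_in + d_out)


adapter_total = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
head = HIDDEN * NUM_CLASSES + NUM_CLASSES   # linear classification head (weights + bias)
trainable = adapter_total + head
fraction = trainable / (BASE_PARAMS + trainable)

print(f"adapter weights: {adapter_total:,}")   # 294,912
print(f"classifier head: {head:,}")            # 3,845
print(f"trainable share: {fraction:.2%}")      # about 0.27%
```

Under these assumptions the trainable share comes out at roughly 0.27%, the same order of magnitude as the reported 0.3%. A different rank or set of adapted matrices would shift the number proportionally.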

In practice

Use LoRA fine-tuning for legal NLP tasks.
Prioritize domain-specific models over general LLMs.

Topics

LegalBench-BR
Brazilian Legal Classification
Large Language Models
LoRA Fine-tuning
BERTimbau

Best for: AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, Research Scientist


Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.