LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification
Summary
LegalBench-BR is presented as the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas using LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, which updates only 0.3% of model parameters, achieved 87.6% accuracy and 0.87 macro-F1. That is well ahead of the commercial LLMs evaluated: Claude 3.5 Haiku scored 22 pp lower and GPT-4o mini 28 pp lower. The gap is widest on "administrativo" (administrative law), where GPT-4o mini scored F1 = 0.00 and Claude 3.5 Haiku scored F1 = 0.08, while the fine-tuned model reached F1 = 0.91. Both commercial LLMs were biased toward "civel" (civil law) and tended to assign ambiguous cases to that class, a failure that domain-adapted fine-tuning eliminated.
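To see why a bias toward one class shows up so sharply in per-class F1 and macro-F1, the sketch below computes both on a small toy example. The labels and predictions are invented for illustration and are not LegalBench-BR data; "outro" is a hypothetical placeholder class, and the function names are our own.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class is never hit
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of per-class F1, so every class counts equally."""
    return sum(scores.values()) / len(scores)


# Toy, class-balanced data (hypothetical): the classifier sends every
# "administrativo" case to "civel", mimicking the bias described above.
labels = ["civel", "administrativo", "outro"]
y_true = ["civel"] * 4 + ["administrativo"] * 4 + ["outro"] * 4
y_pred = ["civel"] * 4 + ["civel"] * 4 + ["outro"] * 4

scores = per_class_f1(y_true, y_pred, labels)
macro = macro_f1(scores)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label in labels:
    print(f"{label}: F1 = {scores[label]:.2f}")
print(f"accuracy = {accuracy:.2f}, macro-F1 = {macro:.2f}")
```

In this toy case accuracy still looks moderate, but the collapsed class has F1 = 0 and drags macro-F1 down, which is the pattern the summary reports for "administrativo".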
Key takeaway
For AI engineers building legal tech for Brazil, relying solely on general-purpose large language models for classification is not enough. In this benchmark, the commercial LLMs showed systematic bias and near-zero F1 on specific legal areas such as administrative law. LoRA fine-tuning of a model like BERTimbau on a domain-specific dataset such as LegalBench-BR gave higher accuracy and F1, can be run on consumer GPUs, and has zero marginal inference cost.
Key insights
Domain-adapted fine-tuning significantly outperforms general LLMs for specialized legal text classification.
Principles
- General LLMs struggle with domain-specific legal nuances.
- LoRA fine-tuning can close performance gaps efficiently.
Method
The authors collected 3,105 appellate proceedings from TJSC via the DataJud API, annotated them across five legal areas using LLM-assisted labeling with heuristic validation, and then fine-tuned BERTimbau with LoRA on the labeled data.
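As a rough sanity check on the "0.3% of model parameters" figure, the back-of-the-envelope sketch below counts LoRA parameters for a BERT-base-sized encoder. The summary does not state the LoRA rank or which weight matrices were adapted, so rank 8 on the query and value projections is an assumption, as is the approximate 110M base parameter count; the function name is our own.

```python
def lora_trainable_params(num_layers, hidden_size, rank,
                          matrices_per_layer, num_labels):
    """Estimate trainable parameters for LoRA plus a linear classifier head."""
    # Each adapted (hidden x hidden) weight gets two low-rank factors:
    # A with shape (rank x hidden) and B with shape (hidden x rank).
    adapter = num_layers * matrices_per_layer * 2 * hidden_size * rank
    # Linear classification head: weights plus biases.
    head = hidden_size * num_labels + num_labels
    return adapter + head


# BERT-base shape: 12 layers, hidden size 768.
# ASSUMPTIONS (not from the source): rank 8, query and value adapted,
# and roughly 110M parameters in the frozen base model.
BASE_PARAMS = 110_000_000
trainable = lora_trainable_params(
    num_layers=12, hidden_size=768, rank=8,
    matrices_per_layer=2, num_labels=5,
)
fraction = trainable / BASE_PARAMS

print(f"{trainable:,} trainable parameters, {fraction:.2%} of the base model")
# prints "298,757 trainable parameters, 0.27% of the base model"
```

Under these assumed settings the estimate lands near the reported 0.3%, but the actual configuration may differ, and other ranks or target modules would change the count.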
In practice
- Use LoRA fine-tuning for legal NLP tasks.
- Prioritize domain-specific models over general LLMs.
Topics
- LegalBench-BR
- Brazilian Legal Classification
- Large Language Models
- LoRA Fine-tuning
- BERTimbau
Best for: AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.