Development and Evaluation of a Hybrid Information Retrieval System Applied to the Brazilian Legal Domain
Summary
A hybrid information retrieval system combining the BM25L ranking algorithm with the BumbaLM language model has been developed and evaluated for the Brazilian legal domain. It addresses two limitations of traditional information retrieval systems: vocabulary mismatch between queries and documents, and the extensive length of legal texts. Transformer-based models can capture semantic nuances, but their input size constraints lead to information loss when processing long documents. The hybrid approach targets both problems, with the stated aim of enhancing process management, automating tasks, and reducing the inefficiencies prevalent in judicial systems. This work was presented by Ana Carolina C. Bessa, Fábio M. F. Lobato, and Antonio F. L. J. Junior at the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) in Salvador, Brazil, appearing on pages 186–190 of Volume 2.
Key takeaway
NLP engineers working with legal or other long-document domains should consider a hybrid information retrieval strategy. Combining a lexical algorithm such as BM25L with a domain-specific language model such as BumbaLM can mitigate both the input size constraints of Transformer models and the vocabulary limitations of traditional methods. The approach is intended to improve the accuracy and efficiency of legal document processing and judicial task automation.
Key insights
Hybrid IR systems that combine traditional ranking algorithms with Transformer-based models can help overcome long-text limitations in specialized domains.
Principles
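- Traditional IR struggles with vocabulary mismatch and the length of legal texts.
- Transformers face input size constraints, which cause information loss on long documents.
- Hybrid models aim to improve retrieval in the legal domain by covering both weaknesses.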
Method
The proposed method combines the BM25L algorithm with the BumbaLM language model to create a hybrid information retrieval system specifically for the Brazilian legal domain.
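To make the lexical half of the method concrete, here is a minimal sketch of BM25L scoring in plain Python. It follows the commonly cited formulation of BM25L (a length-normalized term frequency shifted by a constant `delta`); the function name `bm25l_scores`, the default values `k1=1.5`, `b=0.75`, `delta=0.5`, and the pre-tokenized input are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def bm25l_scores(query_tokens, docs_tokens, k1=1.5, b=0.75, delta=0.5):
    """Score every document against the query with BM25L.

    Illustrative sketch: parameter defaults are common choices,
    not values reported by the paper.
    """
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []

    # Average document length; fall back to 1.0 if all documents are empty.
    avgdl = sum(len(d) for d in docs_tokens) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        # Length normalization shared by all terms of this document.
        norm = 1.0 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs + 1) / (df[term] + 0.5))
            c = tf[term] / norm  # length-normalized term frequency
            # The delta shift keeps very long documents from being over-penalized.
            score += idf * (k1 + 1) * (c + delta) / (k1 + c + delta)
        scores.append(score)
    return scores
```

The `delta` shift is what distinguishes BM25L from plain BM25: it puts a floor under the contribution of any matching term, which matters for long texts such as legal documents.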
In practice
- Apply BM25L for initial retrieval.
- Integrate BumbaLM for semantic understanding.
- Target long, specialized texts like legal documents.
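The steps above can be sketched as a two-stage pipeline: keep the top candidates by lexical score, split each long candidate into overlapping chunks that fit an encoder's input limit, and blend the best chunk similarity with the lexical score. This summary does not say how the authors combine BM25L and BumbaLM, so everything here is an assumption: the names `split_into_chunks`, `toy_embed`, and `hybrid_rerank`, the chunk size and stride, the interpolation weight `alpha`, and above all `toy_embed`, a term-count stand-in for a real language-model encoder.

```python
import math
from collections import Counter


def split_into_chunks(tokens, size=128, stride=64):
    """Overlapping windows so that no chunk exceeds the encoder's input limit."""
    if len(tokens) <= size:
        return [tokens]
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def toy_embed(tokens):
    """Hypothetical stand-in for a language-model encoder: a term-count vector."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def hybrid_rerank(query_tokens, docs_tokens, lexical_scores,
                  top_k=10, alpha=0.5, embed=toy_embed):
    """Re-rank the top_k lexical candidates with a chunk-level semantic score.

    Returns (document index, blended score) pairs, best first.
    """
    # Stage 1: candidate selection by lexical score (for example, BM25L).
    candidates = sorted(range(len(docs_tokens)),
                        key=lambda i: lexical_scores[i], reverse=True)[:top_k]
    # Normalize lexical scores to [0, 1] so they are comparable to cosine values.
    max_lex = max((lexical_scores[i] for i in candidates), default=0.0) or 1.0

    q_vec = embed(query_tokens)
    ranked = []
    for i in candidates:
        # Stage 2: score each chunk and keep the best one (max pooling).
        semantic = max(cosine(q_vec, embed(chunk))
                       for chunk in split_into_chunks(docs_tokens[i]))
        blended = alpha * lexical_scores[i] / max_lex + (1 - alpha) * semantic
        ranked.append((i, blended))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

In a real system, `embed` would be replaced by a call to the chosen language model, and `alpha` would be tuned on held-out relevance judgments. Max pooling over chunks is only one option; averaging chunk scores or scoring only the first chunk are alternatives with different trade-offs.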
Topics
- Hybrid Information Retrieval
- Brazilian Legal Domain
- BM25L Algorithm
- BumbaLM Language Model
- Legal Text Processing
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.