NLP-based Page Classification for Efficient LLM Extraction from Brazilian Public Tender Documents

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

A two-stage pipeline has been developed to efficiently extract product information from lengthy Brazilian public tender documents (editais de licitação) using Large Language Models (LLMs). This approach addresses the computational expense and accuracy degradation LLMs face with large inputs by combining NLP-based page classification with LLM extraction. Researchers created a new dataset comprising 11,190 annotated pages from 350 documents across five product domains. Experiments compared transformer-based classifiers like BERTimbau and DistilBERT against classical machine learning models utilizing engineered features. XGBoost, when paired with domain-specific features, achieved a 97.75% F1-score, surpassing fine-tuned BERT models by more than 4 percentage points. The complete pipeline reduces LLM input tokens by 64-88% while preserving extraction completeness, facilitating cost-effective document processing at scale.

Key takeaway

AI Engineers and Research Scientists who run LLMs over long, structured documents should consider adding a page pre-classification stage. In the reported experiments it reduced LLM input tokens by 64-88% while preserving extraction completeness, which lowers compute cost accordingly. For the classification step, try classical machine learning models such as XGBoost with tailored feature engineering before reaching for fine-tuned transformers: here XGBoost reached a 97.75% F1-score, more than 4 percentage points above the fine-tuned BERT models.

Key insights

Combining NLP page classification with LLM extraction reduces LLM input tokens by 64-88%, and with them processing cost, while preserving extraction completeness.

Principles

Classical models with domain-specific feature engineering can outperform fine-tuned transformers on page classification.
Pre-classification improves LLM efficiency while preserving extraction completeness.

Method

A two-stage pipeline classifies document pages using XGBoost with domain-specific features, then applies LLM extraction only to relevant pages, reducing input tokens by 64-88%.
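
The two stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, the whitespace token count, and the keyword rule standing in for the trained XGBoost classifier are all hypothetical.

```python
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real pipeline would use the LLM's tokenizer."""
    return len(text.split())


def filter_relevant_pages(
    pages: List[str], is_relevant: Callable[[str], bool]
) -> List[Tuple[int, str]]:
    """Stage 1: keep only the pages the classifier flags, with their page index."""
    return [(i, page) for i, page in enumerate(pages) if is_relevant(page)]


def token_reduction(pages: List[str], kept: List[Tuple[int, str]]) -> float:
    """Fraction of input tokens removed by the page filter."""
    total = sum(count_tokens(p) for p in pages)
    remaining = sum(count_tokens(p) for _, p in kept)
    return 1.0 - remaining / total if total else 0.0


def build_prompt(kept: List[Tuple[int, str]]) -> str:
    """Stage 2 input: concatenate the surviving pages for the extraction LLM."""
    body = "\n\n".join(f"[page {i + 1}]\n{page}" for i, page in kept)
    return "Extract every product item from the pages below.\n\n" + body


# Hypothetical stand-in for the trained page classifier (not the paper's model).
def keyword_classifier(page: str) -> bool:
    return "quantidade" in page.lower()


# Toy three-page document, invented for illustration only.
pages = [
    "Edital de licitação. Disposições gerais sobre o certame e prazos legais.",
    "Item 1: Caneta azul. Quantidade: 500. Unidade: caixa.",
    "Das sanções administrativas aplicáveis ao licitante vencedor do certame.",
]
kept = filter_relevant_pages(pages, keyword_classifier)
prompt = build_prompt(kept)
reduction = token_reduction(pages, kept)
```

Because the classifier is passed in as a plain callable, any trained page classifier can replace the keyword rule without touching the rest of the sketch, and `reduction` gives a quick check of how much input the filter actually removes on a given document.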

In practice

Use XGBoost for document page classification.
Develop domain-specific features for NLP tasks.
Pre-filter LLM inputs to cut costs.
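
As one possible reading of "domain-specific features", the sketch below turns a page of text into a small numeric feature dictionary. The summary does not list the paper's actual features, so every feature here (the cue words, the currency pattern, the numbered-line count) is an assumption chosen for illustration.

```python
import re
from typing import Dict

# Hypothetical Portuguese cue words for item tables; not taken from the paper.
ITEM_KEYWORDS = ("item", "quantidade", "unidade", "descrição", "valor")


def page_features(text: str) -> Dict[str, float]:
    """Map one page of text to a small numeric feature dictionary."""
    lowered = text.lower()
    tokens = lowered.split()
    n_chars = max(len(text), 1)  # avoid division by zero on empty pages
    features = {
        "n_tokens": float(len(tokens)),
        # Share of characters that are digits; item tables tend to be number-heavy.
        "digit_ratio": sum(ch.isdigit() for ch in text) / n_chars,
        # Mentions of Brazilian currency amounts such as "R$ 12,50".
        "currency_mentions": float(len(re.findall(r"r\$\s*\d", lowered))),
        # Lines that start with a number, as rows of an item list often do.
        "numbered_lines": float(len(re.findall(r"(?m)^\s*\d+[.)]?\s+\S", text))),
    }
    for kw in ITEM_KEYWORDS:
        features[f"kw_{kw}"] = float(lowered.count(kw))
    return features


# Toy pages, invented for illustration only.
item_page = (
    "1 Caneta azul, unidade: caixa, quantidade: 500, valor R$ 12,50\n"
    "2 Lápis, unidade: caixa, quantidade: 200, valor R$ 8,00"
)
legal_page = "Das sanções administrativas aplicáveis ao licitante vencedor."
f_item = page_features(item_page)
f_legal = page_features(legal_page)
```

Feature dictionaries like these could be stacked into a matrix and passed to a gradient-boosted classifier such as `xgboost.XGBClassifier`; that training step is left out here to keep the example dependency-free.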

Topics

NLP Page Classification
LLM Information Extraction
Brazilian Public Tenders
XGBoost Performance
Document Processing Pipeline

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.