NLP-based Page Classification for Efficient LLM Extraction from Brazilian Public Tender Documents
Summary
A two-stage pipeline has been developed to efficiently extract product information from lengthy Brazilian public tender documents (editais de licitação) using Large Language Models (LLMs). This approach addresses the computational expense and accuracy degradation LLMs face with large inputs by combining NLP-based page classification with LLM extraction. Researchers created a new dataset comprising 11,190 annotated pages from 350 documents across five product domains. Experiments compared transformer-based classifiers like BERTimbau and DistilBERT against classical machine learning models utilizing engineered features. XGBoost, when paired with domain-specific features, achieved a 97.75% F1-score, surpassing fine-tuned BERT models by more than 4 percentage points. The complete pipeline reduces LLM input tokens by 64-88% while preserving extraction completeness, facilitating cost-effective document processing at scale.
Key takeaway
For AI engineers and research scientists applying LLMs to long, structured documents, a pre-classification stage is worth considering. In this work it reduced LLM input tokens by 64-88% while preserving extraction completeness, which lowers inference cost and processing time. For the classification step, try classical machine learning models such as XGBoost with tailored feature engineering first, as they may outperform fine-tuned transformer models in specific document domains.
Key insights
Combining NLP page classification with LLM extraction reduces LLM input tokens by 64-88%, and with them the cost of document processing, without losing extraction completeness.
Principles
- Feature engineering can outperform fine-tuned transformers.
- Pre-classification improves LLM efficiency while preserving extraction completeness.
Method
A two-stage pipeline classifies document pages using XGBoost with domain-specific features, then applies LLM extraction only to relevant pages, reducing input tokens by 64-88%.
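The two stages can be sketched as a filter followed by a token count comparison. This is a minimal illustration, not the authors' implementation: `keyword_classifier` is a hypothetical placeholder for the trained page classifier, `count_tokens` is a crude whitespace stand-in for a real LLM tokenizer, and the sample pages are invented.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a stand-in for a real LLM tokenizer."""
    return len(text.split())


def select_relevant_pages(
    pages: List[str], is_relevant: Callable[[str], bool]
) -> List[str]:
    """Stage 1: keep only the pages the classifier marks as relevant."""
    return [page for page in pages if is_relevant(page)]


def token_reduction(pages: List[str], kept: List[str]) -> float:
    """Fraction of input tokens removed by filtering (0.0 to 1.0)."""
    total = sum(count_tokens(p) for p in pages)
    kept_total = sum(count_tokens(p) for p in kept)
    return 0.0 if total == 0 else 1.0 - kept_total / total


def keyword_classifier(page: str) -> bool:
    """Hypothetical placeholder for the trained page classifier."""
    lowered = page.lower()
    return "item" in lowered and "r$" in lowered


# Invented sample pages: only the second one lists a product.
pages = [
    "EDITAL DE LICITACAO preambulo e disposicoes gerais do certame publico",
    "Item 1 caneta azul 100 unidades R$ 2,50",
    "Das sancoes administrativas e penalidades aplicaveis ao contratado inadimplente",
]

kept = select_relevant_pages(pages, keyword_classifier)
reduction = token_reduction(pages, kept)
# Stage 2 (not shown): send only `kept` to the LLM for product extraction.
print(f"kept {len(kept)} of {len(pages)} pages, token reduction {reduction:.0%}")
```

Passing the classifier in as a callable keeps the filtering step independent of the model behind it, so a trained classifier could replace the keyword rule without changing the rest of the sketch.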
In practice
- Use XGBoost for document page classification.
- Develop domain-specific features for NLP tasks.
- Pre-filter LLM inputs to cut costs.
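As a rough illustration of what domain-specific page features might look like, the sketch below computes a few hand-written signals per page. The feature names, regular expressions, and unit keywords are assumptions chosen for illustration; the summary does not list the features the authors actually used.

```python
import re
from typing import Dict

# Hypothetical signals that a page lists products; not the paper's feature set.
CURRENCY_RE = re.compile(r"R\$\s*\d")
ITEM_RE = re.compile(r"^\s*item\s+\d+", re.IGNORECASE | re.MULTILINE)
UNIT_WORDS = ("unidade", "caixa", "pacote", "litro")


def page_features(text: str) -> Dict[str, float]:
    """Map one page of text to a small dictionary of numeric features."""
    lowered = text.lower()
    n_chars = max(len(text), 1)  # avoid division by zero on empty pages
    return {
        "currency_mentions": float(len(CURRENCY_RE.findall(text))),
        "item_lines": float(len(ITEM_RE.findall(text))),
        "unit_word_hits": float(sum(lowered.count(w) for w in UNIT_WORDS)),
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
    }


# Invented sample pages for demonstration.
product_page = (
    "Item 1 caneta azul 100 unidades R$ 2,50\n"
    "Item 2 papel A4 50 caixas R$ 25,00"
)
legal_page = "Das sancoes administrativas aplicaveis ao contratado"

product_feats = page_features(product_page)
legal_feats = page_features(legal_page)
print(product_feats)
print(legal_feats)
```

Feature dictionaries like these could be stacked into a numeric matrix and passed to a gradient-boosted classifier such as `xgboost.XGBClassifier`, with page-level relevance labels as the target.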
Topics
- NLP Page Classification
- LLM Information Extraction
- Brazilian Public Tenders
- XGBoost Performance
- Document Processing Pipeline
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.