Information Extraction from Electricity Invoices with General-Purpose Large Language Models
Summary
A study evaluated the capability of general-purpose Large Language Models (LLMs) to extract structured information from Spanish electricity invoices without task-specific fine-tuning. Using a subset of the IDSEM dataset, researchers benchmarked Gemini 1.5 Pro and Mistral-small across 19 parameter configurations and 6 prompting strategies. The experimental framework treated prompt engineering as the primary variable, comparing zero-shot baselines against few-shot and iterative extraction strategies. Results showed that prompt quality mattered far more than inference-parameter tuning: F1-score varied by only 0.58 percentage points across parameter configurations, while the gap between zero-shot and the best few-shot strategy exceeded 19 percentage points. The best configuration, few-shot with cross-validation, achieved an F1-score of 97.61% for Gemini 1.5 Pro and 96.11% for Mistral-small. Document template structure emerged as the primary determinant of extraction difficulty, and the LLMs generalized robustly to unseen layouts, outperforming classical machine learning methods.
Key takeaway
For NLP Engineers and Research Scientists developing document automation solutions, this research indicates that investing in stronger prompting strategies, particularly few-shot with cross-validation, is likely to yield substantially greater improvements in extraction accuracy than tuning inference parameters. Prioritize robust prompt design and account for document template structure when preprocessing, as these factors are critical for maximizing the fidelity and generalization of LLM-based information extraction from semi-structured documents like invoices.
Key insights
Prompt engineering is the critical lever for high-fidelity information extraction using general-purpose LLMs on semi-structured documents.
Principles
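- Prompt quality matters more than inference-parameter tuning (F1-score varied by only 0.58 percentage points across parameter configurations).
- Few-shot prompting improves extraction substantially over zero-shot (a gap of more than 19 F1 percentage points for the best strategy).
- Document template structure is the primary determinant of extraction difficulty.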
Method
The study used a subset of the IDSEM dataset, converting PDF invoices to Markdown. It benchmarked Gemini 1.5 Pro and Mistral-small across 19 parameter configurations and 6 prompting strategies spanning zero-shot, few-shot, and iterative extraction, and evaluated each run by F1-score.
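As a rough illustration of the evaluation step, the sketch below computes a micro-averaged, field-level F1-score over a batch of invoices. The summary does not say how the study matched predicted values to ground truth, so the exact-match-after-normalization rule, the function names, and the field names here are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

Fields = Dict[str, Optional[str]]


def normalize(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace; None becomes an empty string.

    Assumption: values match when equal after this normalization.
    """
    if value is None:
        return ""
    return " ".join(str(value).lower().split())


def field_counts(predicted: Fields, gold: Fields) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one invoice."""
    tp = fp = fn = 0
    for key in set(predicted) | set(gold):
        p = normalize(predicted.get(key))
        g = normalize(gold.get(key))
        if p and g and p == g:
            tp += 1
        else:
            if p:  # a value was predicted but it is wrong or spurious
                fp += 1
            if g:  # a gold value exists but was missed or mismatched
                fn += 1
    return tp, fp, fn


def micro_f1(pairs: Iterable[Tuple[Fields, Fields]]) -> float:
    """Micro-averaged F1 over (predicted, gold) pairs."""
    tp = fp = fn = 0
    for predicted, gold in pairs:
        t, p, n = field_counts(predicted, gold)
        tp, fp, fn = tp + t, fp + p, fn + n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Micro-averaging pools counts over all fields and invoices, so frequently occurring fields weigh more; a per-field (macro) average would be an equally reasonable choice if rare fields matter as much as common ones.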
In practice
- Use deterministic settings (Temperature = 0) for reproducibility.
- Prioritize few-shot prompting over zero-shot for accuracy.
- Convert PDFs to Markdown to preserve semantic structure.
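The recommendations above can be sketched as a small prompt builder that takes invoices already converted to Markdown. This is a minimal sketch, not the study's actual prompts: the field names, the prompt wording, the `GENERATION_CONFIG` dictionary, and the helper names are hypothetical, and the call to a model is deliberately left out because the summary does not describe the client code used.

```python
import json
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical target fields; the study's actual schema is not given in this summary.
FIELDS: List[str] = ["customer_name", "invoice_date", "total_amount"]

# Deterministic decoding for reproducibility; pass this to whichever client you use.
GENERATION_CONFIG = {"temperature": 0.0}


def build_few_shot_prompt(
    examples: Sequence[Tuple[str, Dict[str, Optional[str]]]],
    target_markdown: str,
    fields: Sequence[str] = FIELDS,
) -> str:
    """Build a prompt from (invoice_markdown, labels) examples plus the target invoice.

    An empty `examples` sequence yields a zero-shot prompt.
    """
    parts = [
        "Extract the following fields from the electricity invoice "
        "and answer with a single JSON object.",
        "Fields: " + ", ".join(fields),
        "Use null for any field that is not present.",
    ]
    for i, (markdown, labels) in enumerate(examples, start=1):
        answer = json.dumps(labels, ensure_ascii=False)
        parts.append(f"### Example {i}\nInvoice:\n{markdown}\nAnswer:\n{answer}")
    parts.append(f"### Target\nInvoice:\n{target_markdown}\nAnswer:")
    return "\n\n".join(parts)


def parse_answer(raw: str, fields: Sequence[str] = FIELDS) -> Dict[str, Optional[str]]:
    """Pull the outermost JSON object out of a model reply; missing fields become None."""
    empty: Dict[str, Optional[str]] = {name: None for name in fields}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    return {name: data.get(name) for name in fields}
```

Keeping the zero-shot and few-shot prompts in one builder makes the two conditions differ only in the examples supplied, which is the comparison the study reports as having the largest effect.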
Topics
- Information Extraction
- Large Language Models
- Prompt Engineering
- Electricity Invoices
- IDSEM Dataset
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.