Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
Summary
A systematic human inspection of the NL-to-FOL benchmarks FOLIO and MALLS revealed significant annotation errors. Approximately 39% of the FOLIO validation split and 36% of MALLS test instances contain incorrect First-Order Logic formalizations. In addition, 16.4% of FOLIO entries and 48% of MALLS entries had ambiguous natural language sentences, and 8.4% of FOLIO entries carried incorrect Natural Language Inference labels. These errors substantially distort model evaluations: when Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini were tested against corrected ground truths, accuracy rose by 9 to 22 percentage points. To address this, an LLM-based framework was developed to guide human reviewers. Empirically, it reaches 90% dataset accuracy when reviewers inspect under 24% of instances, compared with the 70% required by unguided review. All verified annotations and the framework's code have been released.
Key takeaway
If you are an NLP Engineer evaluating LLMs on neurosymbolic AI tasks or Natural Language Inference, be aware that widely used NL-to-FOL benchmarks such as FOLIO and MALLS contain substantial errors. Flawed ground truths may cause your evaluations to understate model performance significantly. Prioritize auditing your training and validation data, and consider LLM-assisted frameworks to identify and correct annotation errors efficiently, so that the performance metrics you report are accurate and reliable.
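To illustrate how flawed ground truth can understate a model's score, here is a minimal sketch that scores the same predictions against an original and a corrected label set. The instance ids, labels, and the `accuracy` helper are invented for illustration; they are not taken from FOLIO, MALLS, or the released code.

```python
def accuracy(predictions, gold):
    """Fraction of gold instances whose predicted label matches the gold label."""
    return sum(predictions[uid] == label for uid, label in gold.items()) / len(gold)


# Toy data (hypothetical): NLI-style labels for four instances.
predictions = {"a": "True", "b": "False", "c": "Uncertain", "d": "True"}

# Original annotations, two of which are assumed to be wrong.
original_gold = {"a": "True", "b": "True", "c": "True", "d": "False"}

# The same instances after a human reviewer corrected "b" and "c".
corrected_gold = {"a": "True", "b": "False", "c": "Uncertain", "d": "False"}

acc_original = accuracy(predictions, original_gold)
acc_corrected = accuracy(predictions, corrected_gold)
print(f"original labels:  {acc_original:.2f}")
print(f"corrected labels: {acc_corrected:.2f}")
```

The model's predictions do not change between the two runs; only the reference labels do, which is why a score can move after a benchmark is corrected.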
Key insights
NL-to-FOL benchmarks contain significant errors, distorting model evaluation, but LLM-assisted review can efficiently correct them.
Principles
- High-quality ground truth data is essential for reliable model evaluation.
- LLMs can significantly enhance the efficiency of human data annotation review.
Method
An LLM-based framework identifies and prioritizes error-prone instances in NL-to-FOL datasets, guiding human reviewers to achieve high accuracy with minimal effort.
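The idea of guided review can be sketched as a ranking problem: assign each instance a suspicion score, review the highest-scoring instances first, and stop when the review budget runs out. The sketch below is an assumption-laden illustration, not the released framework: the `Instance` fields, the `suspicion` score (standing in for whatever signal the LLM produces), and the premise that a human always fixes a reviewed instance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    uid: str
    is_correct: bool   # whether the existing annotation is right (known only in simulation)
    suspicion: float   # hypothetical LLM-derived score; higher means more likely wrong


def accuracy_after_review(instances, review_fraction, guided=True):
    """Dataset accuracy after reviewing a fraction of instances.

    Assumes a reviewed instance always ends up correct.
    """
    n = len(instances)
    budget = round(review_fraction * n)
    if guided:
        # Review the most suspicious instances first.
        order = sorted(instances, key=lambda x: x.suspicion, reverse=True)
    else:
        # Unguided baseline: review in the given (arbitrary) order.
        order = list(instances)
    reviewed = {x.uid for x in order[:budget]}
    correct = sum(1 for x in instances if x.is_correct or x.uid in reviewed)
    return correct / n


# Toy dataset (hypothetical): 10 instances, the last 4 wrongly annotated.
toy = [Instance(str(i), True, 0.1) for i in range(6)] + [
    Instance("6", False, 0.9),
    Instance("7", False, 0.8),
    Instance("8", False, 0.7),
    Instance("9", False, 0.6),
]

guided_acc = accuracy_after_review(toy, 0.3, guided=True)
unguided_acc = accuracy_after_review(toy, 0.3, guided=False)
print(guided_acc, unguided_acc)
```

In this toy setup the score separates wrong from correct annotations perfectly, so the same review budget fixes more errors when guided. Real scores are noisier, and the benefit depends on how well they rank the erroneous instances.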
In practice
- Audit existing NL-to-FOL datasets for formalization and ambiguity errors.
- Employ LLM-guided review to efficiently improve dataset accuracy for neurosymbolic AI.
Topics
- NL-to-FOL
- Dataset Verification
- LLM-assisted Annotation
- Neurosymbolic AI
- Natural Language Inference
- Data Quality
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.