Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

A systematic human inspection of the NL-to-FOL benchmarks FOLIO and MALLS found widespread annotation errors. Approximately 39% of the FOLIO validation split and 36% of MALLS test instances contain incorrect First-Order Logic formalizations. In addition, 16.4% and 48% of entries, respectively, have ambiguous natural language sentences, and 8.4% of FOLIO entries carry incorrect Natural Language Inference labels. These errors substantially distort model evaluation: when Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini were re-scored against corrected ground truths, accuracy rose by 9 to 22 percentage points. To reduce the cost of fixing such datasets, the authors developed an LLM-based framework that guides human reviewers toward likely errors. In their experiments, guided review reached 90% dataset accuracy after reviewing under 24% of instances, compared with the 70% required by unguided review. All verified annotations and the framework's code have been released.
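
To make the reported accuracy shift concrete, the sketch below re-scores one set of model predictions against original and corrected labels. The predictions and labels are toy values invented for illustration, not data from FOLIO or MALLS, and the `accuracy` helper is a hypothetical name rather than part of the released code.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty dataset")
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)


# Toy data (illustrative only): five instances with NLI-style labels.
predictions = ["True", "False", "Uncertain", "True", "False"]
original_labels = ["True", "True", "Uncertain", "False", "False"]
# Assume a human audit found that the second original label was wrong.
corrected_labels = ["True", "False", "Uncertain", "False", "False"]

acc_original = accuracy(predictions, original_labels)
acc_corrected = accuracy(predictions, corrected_labels)

# Difference in percentage points between the two scorings.
delta_pp = (acc_corrected - acc_original) * 100
print(f"original: {acc_original:.0%}, corrected: {acc_corrected:.0%}, "
      f"shift: {delta_pp:+.0f} pp")
```

In this toy case the model is unchanged; only the reference labels move, which is the effect the summary describes.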

Key takeaway

NLP engineers who evaluate LLMs on neurosymbolic AI tasks or Natural Language Inference should recognize that widely used NL-to-FOL benchmarks such as FOLIO and MALLS contain substantial errors, so scores computed against their original ground truths may significantly understate model performance. Prioritize auditing your training and validation data, and consider LLM-assisted frameworks for finding and correcting annotation errors efficiently, so that the performance metrics you report are accurate and reliable.

Key insights

NL-to-FOL benchmarks contain significant errors, distorting model evaluation, but LLM-assisted review can efficiently correct them.

Method

An LLM-based framework identifies and prioritizes error-prone instances in NL-to-FOL datasets, guiding human reviewers to achieve high accuracy with minimal effort.
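
The prioritization idea can be sketched as a ranked review queue. This is a minimal illustration under stated assumptions, not the authors' released framework: it assumes each instance already carries a `suspicion` score (a hypothetical LLM-derived estimate of how likely the annotation is wrong) and that a reviewer fixes every wrong instance they inspect. All names and the toy data are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    uid: str
    suspicion: float  # hypothetical LLM-derived score; higher = more likely wrong
    is_wrong: bool    # ground truth, available only in a simulation like this one


def review_order(instances):
    """Rank instances so the most suspicious annotations are reviewed first."""
    return sorted(instances, key=lambda inst: inst.suspicion, reverse=True)


def fraction_to_review(instances, target_accuracy):
    """Fraction of the dataset reviewed, in ranked order, before the share of
    correct annotations reaches target_accuracy. Assumes every wrong instance
    that is reviewed gets fixed."""
    n = len(instances)
    if n == 0:
        raise ValueError("cannot review an empty dataset")
    wrong = sum(inst.is_wrong for inst in instances)
    reviewed = 0
    for inst in review_order(instances):
        if (n - wrong) / n >= target_accuracy:
            break
        reviewed += 1
        if inst.is_wrong:
            wrong -= 1
    return reviewed / n


# Toy dataset (illustrative only): 10 instances, 4 with wrong annotations.
toy = [
    Instance("a", 0.90, True), Instance("b", 0.80, True),
    Instance("c", 0.70, False), Instance("d", 0.60, True),
    Instance("e", 0.50, False), Instance("f", 0.40, False),
    Instance("g", 0.30, True), Instance("h", 0.20, False),
    Instance("i", 0.10, False), Instance("j", 0.05, False),
]

guided_fraction = fraction_to_review(toy, target_accuracy=0.9)
print(f"reviewed {guided_fraction:.0%} of instances to reach 90% accuracy")
```

How much effort this saves depends entirely on how well the scores separate wrong annotations from correct ones; with uninformative scores the ranked order is no better than reviewing at random.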

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.