DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Source: Artificial Intelligence · Field: Technology & Digital: Artificial Intelligence & Machine Learning, AI Reasoning & Knowledge Representation · Depth: Expert, quick

Summary

DeFAb (Defeasible Abduction Benchmark) is a dataset and generation pipeline for evaluating how well foundation models perform defeasible abduction: constructing a hypothesis that explains an anomaly by overriding a default while preserving unrelated expectations. The benchmark converts four decades of publicly funded knowledge bases, including OpenCyc, YAGO, Wikidata, ConceptNet, and UMLS, into 372,648+ formally grounded instances across 33.75M materialized rules from 18 sources, organized into three verifiable levels. A rule-based logic solver reaches 100% accuracy in under 50 microseconds. Frontier language models reach at most 65% accuracy, and that figure falls to 23.5% under rendering-robust evaluation. Level 2 accuracy for language models ranges from 7.8% to 23.5%, and a +19.4 pp gap at Level 3 is attributed to contamination. The release also includes DeFAb-Hard, a 235-instance variant on which the best model scores 53.3%, and CONJURE, a 560-instance kernel-verified benchmark for transformative creativity.

Key takeaway

AI scientists and machine learning engineers building advanced reasoning capabilities should recognize that current foundation models lack robust defeasible abduction skills: on DeFAb, a rule-based solver reaches 100% accuracy, while the best language model reaches 23.5% under rendering-robust evaluation. The focus should shift toward integrating symbolic reasoning or developing architectures that can reliably construct verifiable hypotheses. DeFAb and CONJURE can serve as rigorous benchmarks to guide model development and fine-tuning.

Key insights

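Foundation models significantly underperform symbolic logic solvers on verifiable defeasible abduction tasks: the rule-based solver reaches 100% accuracy, while frontier language models reach at most 65%, falling to 23.5% under rendering-robust evaluation.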

Method

The DeFAb pipeline pairs taxonomic hierarchies with behavioral property graphs to generate formally grounded, verifiable instances for defeasible abduction, ensuring polynomial-time checks for derivation, conservativity, and minimality.
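The three checks named above can be illustrated on a toy example. The sketch below is not the DeFAb verifier: the `Rule` encoding, the one-step priority semantics, and the bird/penguin data are assumptions made here for illustration. It reads "derivation" as the hypothesis yielding the observed anomaly, "conservativity" as every other conclusion staying unchanged, and "minimality" as no single hypothesis rule being removable (a simplification of subset-minimality that keeps the check polynomial).

```python
from typing import NamedTuple


class Rule(NamedTuple):
    body: str        # predicate that must hold as a fact for the entity
    head: str        # predicate the rule draws a conclusion about
    positive: bool   # True concludes head, False concludes its negation
    priority: int    # a higher priority overrides a lower one


class Verdict(NamedTuple):
    derives: bool
    conservative: bool
    minimal: bool


def conclusions(facts, rules):
    """Map (entity, predicate) to a truth value under a toy one-step semantics."""
    out = {}
    for entity, props in facts.items():
        by_head = {}
        for rule in rules:
            if rule.body in props:
                by_head.setdefault(rule.head, []).append(rule)
        for head, applicable in by_head.items():
            top = max(rule.priority for rule in applicable)
            verdicts = {rule.positive for rule in applicable if rule.priority == top}
            if len(verdicts) == 1:  # an unresolved conflict concludes nothing
                out[(entity, head)] = verdicts.pop()
    return out


def verify(facts, defaults, hypothesis, anomaly):
    """Check a hypothesis (list of rules) against an (entity, predicate, value) anomaly."""
    entity, pred, observed = anomaly
    key = (entity, pred)

    def explains(rules):
        return conclusions(facts, defaults + rules).get(key) == observed

    before = conclusions(facts, defaults)
    after = conclusions(facts, defaults + hypothesis)
    before.pop(key, None)  # the anomaly itself is allowed to change
    after.pop(key, None)

    derives = explains(hypothesis)
    conservative = before == after
    # Simplified minimality: dropping any single rule must break the explanation.
    minimal = not any(
        explains(hypothesis[:i] + hypothesis[i + 1:]) for i in range(len(hypothesis))
    )
    return Verdict(derives, conservative, minimal)


# Hypothetical toy data: birds fly by default, but opus is observed not to fly.
facts = {"tweety": {"bird"}, "opus": {"bird", "penguin"}}
defaults = [Rule("bird", "flies", True, 1), Rule("bird", "has_feathers", True, 1)]
anomaly = ("opus", "flies", False)

good = [Rule("penguin", "flies", False, 2)]          # overrides the default for penguins only
too_broad = [Rule("bird", "flies", False, 2)]        # also grounds tweety
padded = good + [Rule("penguin", "flies", False, 3)]  # carries a redundant rule

for name, hyp in [("good", good), ("too_broad", too_broad), ("padded", padded)]:
    print(name, verify(facts, defaults, hyp, anomaly))
```

In this toy setting only `good` passes all three checks: `too_broad` explains the anomaly but changes the conclusion about `tweety`, and `padded` still explains it after one of its rules is dropped. Each check is a fixed number of passes over the facts and rules, which mirrors the polynomial-time property the summary describes without reproducing the benchmark's actual formalism.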

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.