DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models
Summary
DeFAb (Defeasible Abduction Benchmark) is a new dataset and generation pipeline for evaluating how well foundation models perform defeasible abduction: constructing a hypothesis that explains an anomaly by overriding a default while preserving unrelated expectations. The benchmark converts four decades of publicly funded knowledge bases, including OpenCyc, YAGO, Wikidata, ConceptNet, and UMLS, into 372,648+ formally grounded instances drawn from 33.75M materialized rules across 18 sources, organized into three verifiable levels. A rule-based logic solver achieves 100% accuracy in under 50 microseconds. Frontier language models, by contrast, reach at most 65% accuracy, and that figure drops to 23.5% under rendering-robust evaluation. Level 2 accuracy for LMs ranges from 7.8% to 23.5%, and a +19.4 pp gap at Level 3 is attributed to contamination. The release also includes DeFAb-Hard, a 235-instance variant on which the best model scores 53.3%, and CONJURE, a 560-instance kernel-verified benchmark for transformative creativity.
Key takeaway
AI scientists and machine learning engineers building advanced reasoning capabilities should recognize that current foundation models lack robust defeasible abduction skills: rule-based solvers reach 100% accuracy on DeFAb, while the best LM reaches 23.5% under rendering-robust evaluation. The gap suggests shifting focus toward integrating symbolic reasoning or developing architectures that can reliably construct verifiable hypotheses. Consider using DeFAb and CONJURE as rigorous benchmarks to guide model development and fine-tuning.
Key insights
Foundation models significantly underperform symbolic logic solvers on verifiable defeasible abduction tasks, highlighting a reasoning gap.
Principles
- Logical rigor can measure creative reasoning.
- Verifiable gold standards are crucial for benchmarks.
- Contamination controls isolate true reasoning gaps.
Method
The DeFAb pipeline pairs taxonomic hierarchies with behavioral property graphs to generate formally grounded, verifiable instances for defeasible abduction, ensuring polynomial-time checks for derivation, conservativity, and minimality.
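The three checks named above can be illustrated on a toy example. The sketch below is not DeFAb's formalism or code: the `Rule` class, the `verify` function, the penguin example, and the simplifications (rule bodies refer only to given facts, and minimality is tested by removing one hypothesis rule at a time) are all assumptions made for illustration. Under those assumptions, each check runs in time polynomial in the number of rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    body: frozenset                      # literals that must all be facts
    head: str                            # literal the rule concludes
    overrides: frozenset = frozenset()   # names of rules this rule beats


def negate(lit):
    """Return the complementary literal ('x' <-> '~x')."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def conclusions(facts, rules):
    """Facts plus heads of applicable rules that beat every applicable attacker."""
    facts = frozenset(facts)
    applicable = [r for r in rules if r.body <= facts]
    derived = set(facts)
    for r in applicable:
        if negate(r.head) in facts:
            continue  # a hard fact contradicts this conclusion
        attackers = [a for a in applicable if a.head == negate(r.head)]
        if all(a.name in r.overrides for a in attackers):
            derived.add(r.head)
    return derived


def verify(facts, defaults, hypothesis, anomaly, expectations):
    """Toy versions of the derivation, conservativity, and minimality checks."""
    base = conclusions(facts, defaults)

    def explains(h):
        c = conclusions(facts, defaults + h)
        derives = anomaly in c
        # Only expectations that held before the hypothesis must still hold.
        conserves = all(e in c for e in expectations if e in base)
        return derives, conserves

    derives, conserves = explains(hypothesis)
    # Assumed simplification: minimal means no single rule can be dropped.
    minimal = not any(
        all(explains(hypothesis[:i] + hypothesis[i + 1:]))
        for i in range(len(hypothesis))
    )
    return {"derivation": derives, "conservativity": conserves, "minimality": minimal}


facts = {"bird", "penguin"}
defaults = [
    Rule("r1", frozenset({"bird"}), "flies"),
    Rule("r2", frozenset({"bird"}), "has_feathers"),
]
# Targeted exception: penguins do not fly, overriding only r1.
good = [Rule("h1", frozenset({"penguin"}), "~flies", frozenset({"r1"}))]
# Overreaching hypothesis: also cancels an unrelated expectation.
too_broad = [
    Rule("h1", frozenset({"bird"}), "~flies", frozenset({"r1"})),
    Rule("h2", frozenset({"bird"}), "~has_feathers", frozenset({"r2"})),
]

good_result = verify(facts, defaults, good, "~flies", ["has_feathers"])
broad_result = verify(facts, defaults, too_broad, "~flies", ["has_feathers"])
```

In this toy setting the targeted exception passes all three checks, while the overreaching hypothesis explains the anomaly but fails conservativity (it cancels `has_feathers`) and fails minimality (dropping `h2` still explains the anomaly).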
In practice
- Use DeFAb to benchmark models on defeasible abduction.
- Apply CONJURE to evaluate transformative creativity.
- Use the verifiers as an exact reward signal for DPO or RLVR training.
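
As a rough illustration of the last point, a pass/fail verifier can be turned into a binary reward for RLVR or into preference pairs for DPO. This is a minimal sketch under stated assumptions, not DeFAb's released tooling: `exact_reward`, `preference_pairs`, and the stand-in `toy_verifier` are hypothetical names, and the `prompt`/`chosen`/`rejected` record layout is simply one commonly used shape for preference data.

```python
from itertools import product
from typing import Callable, Dict, List


def exact_reward(candidate: str, verifier: Callable[[str], bool]) -> float:
    """Binary reward: 1.0 if the verifier accepts the candidate, else 0.0."""
    return 1.0 if verifier(candidate) else 0.0


def preference_pairs(
    prompt: str, candidates: List[str], verifier: Callable[[str], bool]
) -> List[Dict[str, str]]:
    """Pair every verified candidate with every rejected one for the same prompt."""
    accepted = [c for c in candidates if verifier(c)]
    rejected = [c for c in candidates if not verifier(c)]
    return [
        {"prompt": prompt, "chosen": a, "rejected": r}
        for a, r in product(accepted, rejected)
    ]


# Hypothetical stand-in verifier: accepts exactly one gold hypothesis string.
def toy_verifier(candidate: str) -> bool:
    return candidate.strip() == "penguin -> ~flies"


samples = ["penguin -> ~flies", "bird -> ~flies", "penguin -> swims"]
rewards = [exact_reward(s, toy_verifier) for s in samples]
pairs = preference_pairs("Why does this bird not fly?", samples, toy_verifier)
```

A real setup would replace `toy_verifier` with a formal checker over parsed hypotheses; string equality is used here only to keep the sketch self-contained.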
Topics
- Defeasible Abduction
- Foundation Models
- Knowledge Bases
- Benchmark Datasets
- Symbolic Reasoning
- Model Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.