DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

2026-06-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Reasoning & Knowledge Representation · Depth: Expert, quick

Summary

DeFAb (Defeasible Abduction Benchmark) is a new dataset and generation pipeline for evaluating how well foundation models perform defeasible abduction: constructing hypotheses that explain an anomaly by overriding a default while preserving unrelated expectations. The benchmark converts four decades of publicly funded knowledge bases, including OpenCyc, YAGO, Wikidata, ConceptNet, and UMLS, into 372,648+ formally grounded instances, drawing on 33.75M materialized rules from 18 sources and organized into three verifiable levels. A rule-based logic solver achieves 100% accuracy in under 50 microseconds. Frontier language models, by contrast, reach at most 65% accuracy, which drops to 23.5% under rendering-robust evaluation. Level 2 accuracy for LMs ranges from 7.8% to 23.5%, and a +19.4 pp gap at Level 3 is attributed to contamination. The release also includes DeFAb-Hard, a 235-instance variant on which the best model scores 53.3%, and CONJURE, a 560-instance kernel-verified benchmark for transformative creativity.

Key takeaway

AI scientists and machine learning engineers developing advanced reasoning capabilities should recognize that current foundation models lack robust defeasible abduction skills. Consider shifting focus toward integrating symbolic reasoning or developing architectures that can reliably construct verifiable hypotheses: rule-based solvers reach 100% accuracy on DeFAb, while the best LM reaches 23.5% under rendering-robust evaluation. DeFAb and CONJURE can serve as rigorous benchmarks to guide model development and fine-tuning.

Key insights

Foundation models significantly underperform symbolic logic solvers on verifiable defeasible abduction tasks, highlighting a reasoning gap.

Principles

Logical rigor can measure creative reasoning.
Verifiable gold standards are crucial for benchmarks.
Contamination controls isolate true reasoning gaps.

Method

The DeFAb pipeline pairs taxonomic hierarchies with behavioral property graphs to generate formally grounded, verifiable instances for defeasible abduction, ensuring polynomial-time checks for derivation, conservativity, and minimality.
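The three checks named above can be illustrated with a toy verifier. This is a minimal sketch under stated assumptions, not the DeFAb implementation: the propositional rule format, the two-tier semantics (hypothesis rules fire before defaults and block conflicting defaults), and every name here (neg, closure, verify, FACTS, GOOD, PADDED) are hypothetical. Minimality is approximated by dropping one rule at a time, which keeps the check polynomial but is a simplification.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (body literals, head literal); toy format


def neg(literal: str) -> str:
    """Return the complement of a literal, using '~' for negation."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def closure(facts: Iterable[str], defaults: List[Rule],
            exceptions: List[Rule]) -> Set[str]:
    """Toy semantics (an assumption): exception rules always fire; a default
    fires only if the complement of its head has not been derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in exceptions:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
        if changed:
            continue  # let exceptions reach a fixpoint before any default
        for body, head in defaults:
            if (body <= derived and head not in derived
                    and neg(head) not in derived):
                derived.add(head)
                changed = True
                break  # re-run exceptions before firing another default
    return derived


def verify(facts: Iterable[str], defaults: List[Rule], hypothesis: List[Rule],
           anomaly: str, expectations: List[str]) -> Dict[str, bool]:
    """Check derivation, conservativity, and single-drop minimality."""
    def checks(h: List[Rule]) -> Tuple[bool, bool]:
        d = closure(facts, defaults, h)
        return anomaly in d, all(e in d for e in expectations)

    derives, conservative = checks(hypothesis)
    # Simplified minimality: no single rule can be dropped while the
    # remaining rules still pass both of the other checks.
    minimal = not any(
        all(checks(hypothesis[:i] + hypothesis[i + 1:]))
        for i in range(len(hypothesis))
    )
    return {"derivation": derives, "conservativity": conservative,
            "minimality": minimal}


# Hypothetical instance: birds fly and have feathers by default,
# but this bird is observed not to fly.
FACTS = {"bird", "penguin"}
DEFAULTS = [(frozenset({"bird"}), "flies"),
            (frozenset({"bird"}), "has_feathers")]
ANOMALY = "~flies"
EXPECTATIONS = ["has_feathers"]
GOOD = [(frozenset({"penguin"}), "~flies")]
PADDED = GOOD + [(frozenset({"bird"}), "swims")]

print(verify(FACTS, DEFAULTS, GOOD, ANOMALY, EXPECTATIONS))
print(verify(FACTS, DEFAULTS, PADDED, ANOMALY, EXPECTATIONS))
```

In this toy instance, GOOD passes all three checks, while PADDED explains the anomaly and preserves the feathers expectation but fails minimality because its extra rule can be dropped.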

In practice

Use DeFAb to benchmark a model's defeasible abduction.
Apply CONJURE for evaluating transformative creativity.
Use the verifiers as exact reward signals for DPO or RLVR training.
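The last item can be made concrete with a short sketch of how a verifier's pass/fail output might be turned into a binary reward for RLVR and into chosen/rejected pairs for DPO. Everything here is an assumption for illustration: the check names, the helper names (exact_reward, preference_pairs), and the lookup-table verifier standing in for a real one. It does not reflect the authors' training setup.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Checks = Dict[str, bool]  # hypothetical verifier output, one flag per check


def exact_reward(checks: Checks) -> float:
    """Binary reward: 1.0 only if every verifier check passes."""
    return 1.0 if checks and all(checks.values()) else 0.0


def preference_pairs(candidates: List[str],
                     verify: Callable[[str], Checks]) -> List[Tuple[str, str]]:
    """Pair every verified candidate (chosen) with every failed one (rejected)."""
    scored = [(c, exact_reward(verify(c))) for c in candidates]
    chosen = [c for c, r in scored if r == 1.0]
    rejected = [c for c, r in scored if r == 0.0]
    return list(product(chosen, rejected))


# Stand-in verifier: a lookup table instead of a real logic check.
TOY_RESULTS: Dict[str, Checks] = {
    "h1": {"derivation": True, "conservativity": True, "minimality": True},
    "h2": {"derivation": True, "conservativity": False, "minimality": True},
    "h3": {"derivation": False, "conservativity": True, "minimality": True},
}


def toy_verify(candidate: str) -> Checks:
    return TOY_RESULTS[candidate]


PAIRS = preference_pairs(["h1", "h2", "h3"], toy_verify)
print(PAIRS)
```

Because the reward comes from a deterministic check rather than a learned judge, the same candidate always receives the same score, which is what makes the signal "exact".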

Topics

Defeasible Abduction
Foundation Models
Knowledge Bases
Benchmark Datasets
Symbolic Reasoning
Model Evaluation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.