Executable Schema Contracts: From Automatic Ingestion to Multi-Source Retrieval
Summary
This work introduces "Executable Schema Contracts," a system that automates question answering over heterogeneous, multi-source data such as tables, documents, and JSON files. The system automatically discovers a unified, executable schema from raw data, and that schema then serves as a shared contract for both knowledge graph construction and query-time retrieval. A closed-world field catalog constrains LLM-based schema discovery, and deterministic structural analysis infers identity and foreign keys. The schema guides data extraction, deduplication, and cross-source linking into a provenance-aware Neo4j knowledge graph. At query time, the schema conditions a multi-tool agent that routes queries across structured lookup, graph traversal, and vector search. In controlled zero-shot comparisons, the system improved Exact Match (EM) by 2.6 to 19.8 points over baselines across four QA benchmarks (BlendQA, HybridQA, TAT-QA, and ComplexTR), with consistent gains across GPT-4.1, Claude Haiku, and Llama 3.3 70B.
Key takeaway
AI engineers and architects building robust question-answering systems over diverse, evolving data should prioritize schema-guided approaches. This system demonstrates that automatically induced, executable schemas significantly improve accuracy for cross-source joins and multi-hop retrieval by providing a unified contract for ingestion and query routing. Ground LLM-discovered schemas in a closed-world field catalog, and use statistical analysis to infer structure such as identity and foreign keys. Query-time schema extension can add ~3.4 seconds of latency, but the overall gains in accuracy and auditable provenance make this a compelling architecture for complex data environments.
Key insights
An automatically induced, executable schema can unify knowledge graph ingestion and multi-source query-time retrieval.
Principles
- Closed-world schema induction ensures LLM-discovered structure is executable.
- Statistical structural analysis should precede semantic extraction.
- Schema-conditioned routing significantly improves retrieval agent performance.
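The second principle, structural analysis ahead of semantic extraction, can be illustrated with a minimal sketch. It assumes a simple heuristic that is not taken from the paper: a column is an identity-key candidate when its values are non-null and unique, and a foreign-key candidate when its values are covered by another table's key. The function names, the `min_coverage` parameter, and the sample tables are all hypothetical.

```python
from typing import Dict, List, Tuple

Table = List[Dict[str, object]]


def identity_candidates(rows: Table) -> List[str]:
    """Return columns whose values are non-null and unique across all rows."""
    if not rows:
        return []
    result = []
    for col in rows[0].keys():
        values = [row.get(col) for row in rows]
        if all(v is not None for v in values) and len(set(values)) == len(values):
            result.append(col)
    return result


def foreign_key_candidates(
    child: Table, parent: Table, parent_key: str, min_coverage: float = 1.0
) -> List[Tuple[str, float]]:
    """Return (column, coverage) pairs for child columns contained in the parent key.

    Coverage is the fraction of non-null child values found among the
    parent's key values. The threshold is an illustrative assumption.
    """
    if not child:
        return []
    parent_values = {row[parent_key] for row in parent}
    candidates = []
    for col in child[0].keys():
        values = [row.get(col) for row in child if row.get(col) is not None]
        if not values:
            continue
        coverage = sum(v in parent_values for v in values) / len(values)
        if coverage >= min_coverage:
            candidates.append((col, coverage))
    return candidates


# Hypothetical sample data standing in for two profiled sources.
customers = [
    {"customer_id": 1, "name": "Ada", "country": "UK"},
    {"customer_id": 2, "name": "Lin", "country": "UK"},
    {"customer_id": 3, "name": "Ada", "country": "FR"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 5.0},
    {"order_id": 11, "customer_id": 2, "total": 7.5},
    {"order_id": 12, "customer_id": 1, "total": 5.0},
]

customer_keys = identity_candidates(customers)
order_keys = identity_candidates(orders)
order_fks = foreign_key_candidates(orders, customers, "customer_id")
print(customer_keys, order_keys, order_fks)
```

Because these checks are deterministic, they yield the same keys on every run, which is one plausible reason to settle structure before asking an LLM to interpret semantics.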
Method
The system couples schema profiling, LLM extraction, KG construction, and vector retrieval, using a single induced schema to constrain extraction, link entities, and guide agent routing.
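To make schema-guided routing concrete, here is a small sketch of how a schema could steer the choice among the three retrieval tools. The system described uses a multi-tool LLM agent; this example swaps in a keyword heuristic purely for illustration. The `Schema` class, the `route` function, and the tool-name strings are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class Schema:
    # Entity type -> field names known for that type.
    fields: Dict[str, Set[str]]
    # (source_type, relation_name, target_type) triples.
    relations: Set[Tuple[str, str, str]]


def route(question: str, schema: Schema) -> str:
    """Pick a retrieval tool from which schema elements the question mentions."""
    tokens = set(re.findall(r"[a-z_]+", question.lower()))
    mentioned_types = {t for t in schema.fields if t.lower() in tokens}
    mentioned_fields = {
        f for t in mentioned_types for f in schema.fields[t] if f in tokens
    }
    # Two related entity types in one question suggest a multi-hop query.
    linked = any(
        src in mentioned_types and dst in mentioned_types
        for src, _, dst in schema.relations
    )
    if linked:
        return "graph_traversal"
    if mentioned_fields:
        return "structured_lookup"
    # Nothing in the schema matched, so fall back to unstructured search.
    return "vector_search"


schema = Schema(
    fields={"company": {"revenue", "founded"}, "person": {"name", "role"}},
    relations={("person", "works_at", "company")},
)

lookup = route("What is the revenue of the company?", schema)
traversal = route("Which person works at the company founded in 1998?", schema)
fallback = route("Summarize the main risks discussed in the report", schema)
print(lookup, traversal, fallback)
```

The point of the sketch is the data flow rather than the heuristic: the same schema used at ingestion is the input that decides which tool handles a question.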
In practice
- Automate schema discovery and maintenance for evolving data.
- Enhance cross-source joins and multi-hop QA with auditable provenance.
- Ground LLM schema outputs using a closed-world field catalog.
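Grounding LLM schema outputs in a closed-world catalog can be sketched as a filter: any field the model proposes that was never observed in the raw data is rejected. This is a minimal illustration under assumed data shapes, not the system's actual catalog format. `build_catalog`, `ground_schema`, and the sample sources are hypothetical names.

```python
from typing import Dict, List, Set, Tuple

FieldRef = Tuple[str, str]  # (source name, field name)


def build_catalog(sources: Dict[str, List[dict]]) -> Set[FieldRef]:
    """Collect every (source, field) pair actually observed in the raw records."""
    return {
        (name, key)
        for name, rows in sources.items()
        for row in rows
        for key in row
    }


def ground_schema(
    proposed: Dict[str, List[FieldRef]], catalog: Set[FieldRef]
) -> Tuple[Dict[str, List[FieldRef]], List[FieldRef]]:
    """Keep only proposed fields that exist in the catalog; report the rest."""
    accepted: Dict[str, List[FieldRef]] = {}
    rejected: List[FieldRef] = []
    for entity, refs in proposed.items():
        kept = [ref for ref in refs if ref in catalog]
        rejected.extend(ref for ref in refs if ref not in catalog)
        if kept:
            accepted[entity] = kept
    return accepted, rejected


# Hypothetical raw sources and a hypothetical LLM-proposed schema.
sources = {
    "orders.csv": [{"order_id": 1, "total": 9.5}],
    "customers.json": [{"customer_id": 7, "name": "Ada"}],
}
proposed = {
    "Order": [
        ("orders.csv", "order_id"),
        ("orders.csv", "total"),
        ("orders.csv", "discount"),  # not present in the data
    ],
    "Customer": [
        ("customers.json", "customer_id"),
        ("customers.json", "email"),  # not present in the data
    ],
}

catalog = build_catalog(sources)
accepted, rejected = ground_schema(proposed, catalog)
print(accepted)
print(rejected)
```

Returning the rejected references alongside the accepted schema keeps the filtering auditable, since a hallucinated field is recorded rather than silently dropped.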
Topics
- Executable Schema Contracts
- Knowledge Graph Construction
- Multi-Source Retrieval
- LLM-based Schema Discovery
- Question Answering
- Retrieval-Augmented Generation
Best for: Research Scientist, Machine Learning Engineer, NLP Engineer, AI Scientist, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.