Executable Schema Contracts: From Automatic Ingestion to Multi-Source Retrieval

2026-01-26 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper introduces "Executable Schema Contracts" to automate question answering over heterogeneous, multi-source data such as tables, documents, and JSON files. The system automatically discovers a unified, executable schema from raw data; that schema then acts as a shared contract for both knowledge graph construction and query-time retrieval. A closed-world field catalog constrains LLM-based schema discovery, and deterministic structural analysis infers identity and foreign keys. The schema guides data extraction, deduplication, and cross-source linking into a provenance-aware Neo4j knowledge graph. At query time, the schema conditions a multi-tool agent that routes queries across structured lookup, graph traversal, and vector search. In controlled zero-shot comparisons, the system improved Exact Match (EM) by 2.6 to 19.8 points over baselines on four QA benchmarks (BlendQA, HybridQA, TAT-QA, and ComplexTR), with consistent gains across GPT-4.1, Claude Haiku, and Llama 3.3 70B.

Key takeaway

AI engineers and architects building question-answering systems over diverse, evolving data should prioritize schema-guided approaches. This system shows that automatically induced, executable schemas improve accuracy on cross-source joins and multi-hop retrieval by providing a single contract for both ingestion and query routing. Ground LLM-discovered schemas in a closed-world field catalog, and use statistical structural analysis to infer keys. Query-time schema extension can add about 3.4 seconds of latency, but the gains in accuracy and auditable provenance make this a compelling architecture for complex data environments.

Key insights

An automatically induced, executable schema can unify knowledge graph ingestion and multi-source query-time retrieval.

Principles

Closed-world schema induction ensures LLM-discovered structure is executable.
Prioritize statistical structural analysis before semantic extraction.
Schema-conditioned routing significantly improves retrieval agent performance.
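
The second principle can be illustrated with a minimal sketch of deterministic key inference. It is not the paper's implementation: the function names, the sample tables, and the two heuristics (a column is an identity-key candidate if it is fully populated and unique; a column is a foreign-key candidate if all of its values appear in a parent key) are assumptions chosen for illustration.

```python
from typing import Dict, List

Table = List[Dict[str, object]]


def column_values(rows: Table, column: str) -> list:
    """Non-null values of one column, in row order."""
    return [row[column] for row in rows if row.get(column) is not None]


def identity_key_candidates(rows: Table) -> List[str]:
    """Columns that are populated in every row and never repeat (hypothetical heuristic)."""
    if not rows:
        return []
    candidates = []
    for column in rows[0].keys():
        values = column_values(rows, column)
        if len(values) == len(rows) and len(set(values)) == len(values):
            candidates.append(column)
    return candidates


def foreign_key_candidates(child: Table, parent: Table, parent_key: str) -> List[str]:
    """Child columns whose values are all contained in the parent key (hypothetical heuristic)."""
    if not child:
        return []
    parent_values = set(column_values(parent, parent_key))
    candidates = []
    for column in child[0].keys():
        values = column_values(child, column)
        if values and set(values) <= parent_values:
            candidates.append(column)
    return candidates


# Illustrative data only; not taken from the paper or its benchmarks.
companies = [
    {"company_id": "C1", "name": "Acme"},
    {"company_id": "C2", "name": "Globex"},
]
filings = [
    {"filing_id": "F1", "company": "C1", "year": 2023},
    {"filing_id": "F2", "company": "C1", "year": 2024},
    {"filing_id": "F3", "company": "C2", "year": 2024},
]

print(identity_key_candidates(filings))
print(foreign_key_candidates(filings, companies, "company_id"))
```

Because both checks are plain set operations over the data, they give the same answer on every run, which is the property that makes them a reasonable input to a later LLM-based semantic step.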

Method

The system couples schema profiling, LLM extraction, KG construction, and vector retrieval, using a single induced schema to constrain extraction, link entities, and guide agent routing.
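
To make the routing idea concrete, here is a small sketch of schema-conditioned tool selection. The system described above uses an LLM agent for this step; the keyword rules below are a deliberately simple stand-in, and the `Schema` class, the `route` function, and the tool names are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Schema:
    """Hypothetical, stripped-down view of an induced schema."""
    fields: Dict[str, Set[str]]  # entity type -> field names
    relations: Set[str]          # relation names that link entity types


def route(question: str, schema: Schema) -> str:
    """Pick a retrieval tool from what the schema says about the question.

    Rule-based stand-in for an LLM agent: relation mentions suggest graph
    traversal, field mentions suggest structured lookup, and anything else
    falls back to vector search.
    """
    tokens = set(re.findall(r"[a-z_]+", question.lower()))
    all_fields = {name for names in schema.fields.values() for name in names}
    if schema.relations & tokens:
        return "graph_traversal"
    if all_fields & tokens:
        return "structured_lookup"
    return "vector_search"


# Illustrative schema only.
schema = Schema(
    fields={"company": {"revenue", "name"}, "filing": {"year"}},
    relations={"acquired", "filed"},
)

print(route("What revenue did Acme report?", schema))
print(route("Which companies acquired a supplier?", schema))
print(route("Summarize the outlook discussed by management", schema))
```

The point of the sketch is the dependency, not the rules: the router consults the same schema that ingestion produced, so a question that names a known field or relation is steered toward the tool that can answer it exactly.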

In practice

Automate schema discovery and maintenance for evolving data.
Enhance cross-source joins and multi-hop QA with auditable provenance.
Ground LLM schema outputs using a closed-world field catalog.
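
A closed-world field catalog can be sketched as a set of (source, field) pairs observed in the raw data, against which every LLM-proposed mapping is checked. This is an assumed, simplified shape rather than the paper's data model; `build_catalog`, `validate_proposal`, and the sample sources are hypothetical.

```python
from typing import Dict, List, Set, Tuple

FieldRef = Tuple[str, str]  # (source name, field name)


def build_catalog(sources: Dict[str, List[dict]]) -> Set[FieldRef]:
    """Collect every field that actually occurs in each raw source."""
    return {
        (name, key)
        for name, rows in sources.items()
        for row in rows
        for key in row
    }


def validate_proposal(
    proposal: Dict[str, Dict[str, FieldRef]],
    catalog: Set[FieldRef],
):
    """Keep only proposed properties that point at a cataloged field.

    Returns (accepted, rejected). Rejected items are references to fields
    that do not exist in the data, so they could never be executed.
    """
    accepted: Dict[str, Dict[str, FieldRef]] = {}
    rejected: List[Tuple[str, str, FieldRef]] = []
    for entity, properties in proposal.items():
        for prop, ref in properties.items():
            ref = tuple(ref)
            if ref in catalog:
                accepted.setdefault(entity, {})[prop] = ref
            else:
                rejected.append((entity, prop, ref))
    return accepted, rejected


# Illustrative sources and an LLM-style proposal with one invented field.
sources = {
    "customers.csv": [{"customer_id": 1, "name": "Ada"}],
    "orders.json": [{"order_id": 10, "customer_id": 1, "total": 42.0}],
}
proposal = {
    "Customer": {
        "id": ("customers.csv", "customer_id"),
        "email": ("customers.csv", "email"),  # not present in the data
    },
    "Order": {"amount": ("orders.json", "total")},
}

catalog = build_catalog(sources)
accepted, rejected = validate_proposal(proposal, catalog)
print(accepted)
print(rejected)
```

Rejecting the unknown `email` reference at validation time, instead of discovering it during extraction, is what keeps the resulting schema executable.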

Topics

Executable Schema Contracts
Knowledge Graph Construction
Multi-Source Retrieval
LLM-based Schema Discovery
Question Answering
Retrieval-Augmented Generation

Code references

snap-stanford/STaRK

Best for: Research Scientist, Machine Learning Engineer, NLP Engineer, AI Scientist, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.