DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning

Source: cs.AI updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

DW-Bench is a new benchmark designed to evaluate Large Language Models (LLMs) on their ability to perform graph topology reasoning over data warehouse schemas. Unlike traditional Text-to-SQL benchmarks, DW-Bench focuses on tasks like tracing foreign key paths, detecting disconnected schema silos, and propagating impact through ETL lineage chains. It comprises 1,046 questions across 13 subtypes, three difficulty levels, and five datasets, including real-world and synthetic schemas. The benchmark evaluates six baselines, ranging from flat context injection to tool-calling and code execution, using Gemini 2.5 Flash, DeepSeek-V3, and Qwen2.5-72B. Results indicate that tool-augmented baselines achieve 87–90% micro-EM, significantly outperforming static methods (63–81%). A key finding is that LLMs struggle with compositional multi-hop tasks, particularly the "combined_impact" subtype, where no method exceeds 61% EM, highlighting a structural reasoning ceiling.
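To make two of the task families named above concrete, here is a minimal sketch of foreign key path tracing and silo detection on a toy schema. The table names and the `FOREIGN_KEYS` list are invented for illustration and do not come from the benchmark. The code uses plain Python to stay dependency-free; NetworkX offers equivalent routines such as `nx.shortest_path` and `nx.connected_components`.

```python
from collections import deque

# Hypothetical toy schema: each pair is a foreign key (child table, parent table).
FOREIGN_KEYS = [
    ("order_items", "orders"),
    ("orders", "customers"),
    ("order_items", "products"),
    ("audit_log", "audit_users"),  # a separate silo with no link to the rest
]

def build_undirected(edges):
    """Adjacency map that treats every foreign key as a two-way join."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def fk_path(adj, start, goal):
    """Shortest join path between two tables via BFS, or None if unreachable."""
    if start not in adj or goal not in adj:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent pointers back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(adj[node]):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

def silos(adj):
    """Connected components of the schema graph, each as a sorted list."""
    seen, components = set(), []
    for table in sorted(adj):
        if table in seen:
            continue
        component, stack = set(), [table]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

adj = build_undirected(FOREIGN_KEYS)
print(fk_path(adj, "customers", "products"))
print(silos(adj))
```

A schema with more than one component has disconnected silos, and a `None` path means two tables cannot be joined through foreign keys alone.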

Key takeaway

AI engineers building LLM applications for data warehouse management should prioritize integrating robust graph reasoning tools. With tool augmentation, LLMs perform well on single-hop and simple multi-hop queries, but complex compositional graph reasoning, such as chaining lineage traversal with foreign key expansion, remains a significant challenge. Focus on hybrid architectures that combine interactive tools with learned graph representations to bridge this compositional reasoning gap.
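The compositional step mentioned above, chaining lineage traversal with foreign key expansion, can be sketched as two graph operations applied in sequence. This is one plausible reading rather than the benchmark's definition of "combined_impact": the summary does not specify the exact semantics, so the one-hop foreign key expansion, the function names, and the sample tables are all assumptions.

```python
# Hypothetical ETL lineage: source table -> tables derived from it.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fact_sales"],
    "fact_sales": ["agg_daily_sales"],
}

# Hypothetical foreign keys: (referencing table, referenced table).
FOREIGN_KEYS = [
    ("fact_returns", "fact_sales"),
    ("dim_store_summary", "agg_daily_sales"),
    ("dim_customer", "dim_region"),
]

def downstream(lineage, table):
    """All tables transitively derived from `table`, found by DFS."""
    impacted, stack = set(), [table]
    while stack:
        node = stack.pop()
        for child in lineage.get(node, []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

def fk_dependents(foreign_keys, tables):
    """Tables holding a foreign key into any table in `tables` (one hop only)."""
    return {child for child, parent in foreign_keys if parent in tables}

def combined_impact(lineage, foreign_keys, table):
    """Step 1: follow lineage downstream. Step 2: expand by FK dependents."""
    lineage_hit = downstream(lineage, table)
    fk_hit = fk_dependents(foreign_keys, lineage_hit | {table})
    return sorted((lineage_hit | fk_hit) - {table})

print(combined_impact(LINEAGE, FOREIGN_KEYS, "raw_orders"))
```

Each step is simple on its own. The second step depends on the complete result of the first, which is the kind of chaining the benchmark reports as hardest for every method.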

Key insights

LLMs excel at orchestrating graph algorithms via tools but struggle with compositional multi-hop graph reasoning.

Principles

Method

DW-Bench generates questions deterministically from graph structures using NetworkX, categorizing them by structural complexity (easy, medium, hard) and employing an obfuscation protocol to randomize table names.
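The sketch below shows how deterministic question generation with name obfuscation might look. It is not the benchmark's actual pipeline: the question template, the hop-count difficulty thresholds, the `T000`-style labels, and the sample schema are all assumptions, and plain Python stands in for NetworkX to keep the example dependency-free.

```python
import random
from collections import deque

# Hypothetical schema: (child table, parent table) foreign keys.
FOREIGN_KEYS = [
    ("order_items", "orders"),
    ("orders", "customers"),
    ("customers", "regions"),
]

def obfuscate(edges, seed=0):
    """Map each table name to an opaque label using a seeded shuffle."""
    tables = sorted({t for edge in edges for t in edge})
    labels = [f"T{i:03d}" for i in range(len(tables))]
    random.Random(seed).shuffle(labels)
    mapping = dict(zip(tables, labels))
    return [(mapping[a], mapping[b]) for a, b in edges], mapping

def hop_counts(edges, start):
    """BFS distances from `start`, treating foreign keys as undirected."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def difficulty(hops):
    # Assumed thresholds; the benchmark's own criteria are not given here.
    if hops <= 1:
        return "easy"
    return "medium" if hops == 2 else "hard"

def generate_questions(edges, seed=0):
    """One hop-count question per connected table pair, answers from the graph."""
    obf_edges, _ = obfuscate(edges, seed)
    tables = sorted({t for edge in obf_edges for t in edge})
    questions = []
    for start in tables:
        dist = hop_counts(obf_edges, start)
        for goal in tables:
            if start < goal and goal in dist:
                questions.append({
                    "question": f"How many foreign key hops separate {start} and {goal}?",
                    "answer": dist[goal],
                    "difficulty": difficulty(dist[goal]),
                })
    return questions

for q in generate_questions(FOREIGN_KEYS, seed=7):
    print(q)
```

Because answers are computed from the graph and the shuffle is seeded, the same schema and seed always yield the same questions. The opaque labels prevent a model from guessing relationships from meaningful table names.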

Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.