How OpenAI Built Its Data Agent

Source: ByteByteGo Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cloud Computing & IT Infrastructure · Depth: Intermediate, extended

Summary

As of May 2026, OpenAI's data platform managed 1.5 exabytes across 90,000 datasets for approximately 4,000 internal users. To help those users find and understand the relevant data tables, OpenAI built an in-house data agent. The agent's architecture is described as "vanilla": GPT-5.5 in a simple harness, with reliability coming from a robust data infrastructure and six layers of context assembly. Users ask natural language questions in tools such as Slack; the agent generates SQL, executes the queries, and returns verified answers along with its reasoning. Beyond data querying, OpenAI uses Codex for critical internal operations: a cross-cloud migration of 90,000 tables and 600 petabytes completed in two months, open-source patch releases automated without human intervention, and support ticket triage in which it investigates issues and applies fixes.

Key takeaway

Data engineers and MLOps teams building internal data agents should prioritize a unified, well-annotated data foundation over complex agent architectures. The agent's reliability hinges on high-quality context assembly, not an intricate LLM setup. Curate a minimal set of distinct tools and use trusted historical data for retrieval. This approach supports rapid, large-scale operations such as migrations and automated patching, shortening project timelines and amplifying engineering capacity.
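As a rough illustration of retrieval over trusted historical data, the sketch below ranks a small store of vetted question/SQL pairs against a new question. Everything in it is an assumption made for illustration: the `TRUSTED_QUERIES` store, the table names, and the toy bag-of-words `embed` function, which stands in for a real embedding model. The source does not describe how OpenAI implements this step.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: a real system would call an embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical store of vetted (question, SQL) pairs; names are illustrative only.
TRUSTED_QUERIES = [
    ("weekly active users by region",
     "SELECT region, COUNT(DISTINCT user_id) FROM events GROUP BY region"),
    ("total revenue by month",
     "SELECT month, SUM(amount) FROM payments GROUP BY month"),
    ("support tickets opened per day",
     "SELECT day, COUNT(*) FROM tickets GROUP BY day"),
]

def retrieve(question, k=1):
    """Return the k trusted pairs whose question text is most similar to the new question."""
    q = embed(question)
    ranked = sorted(TRUSTED_QUERIES, key=lambda pair: cosine(q, embed(pair[0])), reverse=True)
    return ranked[:k]

best_question, best_sql = retrieve("revenue by month last year")[0]
```

The retrieved SQL would then be passed to the model as an example, not executed directly.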

Key insights

The reliability of AI agents at scale stems from strong data foundations and carefully engineered context, not complex agent architectures.

Method

The data agent converts natural language questions into vectors, assembles context from six layers (metadata, human annotations, Codex enrichment, institutional knowledge, memory, runtime context), then enters an LLM loop to generate, execute, and verify SQL queries.
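The flow described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not OpenAI's implementation: the layer names come from the summary, but `assemble_context`, `run_agent`, the stub model, the SQLite table, and the "non-empty result" check are all hypothetical stand-ins. The source does not say how verification is actually performed.

```python
import sqlite3

# Layer names taken from the summary; the order used here is an assumption.
LAYERS = ["metadata", "human_annotations", "codex_enrichment",
          "institutional_knowledge", "memory", "runtime_context"]

def assemble_context(question, sources):
    """Collect a snippet from each available layer, in a fixed order."""
    parts = []
    for layer in LAYERS:
        snippet = sources.get(layer, lambda q: "")(question)
        if snippet:
            parts.append(f"[{layer}] {snippet}")
    return "\n".join(parts)

def run_agent(question, sources, generate_sql, conn, max_attempts=3):
    """Generate SQL, execute it, and retry with feedback until a result passes the check."""
    context = assemble_context(question, sources)
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(question, context, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"Query failed: {exc}"  # fed back to the model on the next turn
            continue
        if rows:  # placeholder verification: accept any non-empty result
            return {"sql": sql, "rows": rows, "attempts": attempt}
        feedback = "Query returned no rows"
    return None

# Demo with an in-memory database and a stub standing in for the LLM call.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 15.5)])

def stub_llm(question, context, feedback):
    """First guess uses a wrong table name; the retry corrects it once feedback arrives."""
    if feedback:
        return "SELECT SUM(amount) FROM orders"
    return "SELECT SUM(amount) FROM order_table"

sources = {"metadata": lambda q: "orders(id, amount): one row per order"}
result = run_agent("What is total order revenue?", sources, stub_llm, conn)
```

In this demo the first query fails on the unknown table, the error text is returned to the stub as feedback, and the second attempt succeeds.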

Best for: MLOps Engineer, Data Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.