How OpenAI Built Its Data Agent

2026-06-03 · Source: ByteByteGo Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cloud Computing & IT Infrastructure · Depth: Intermediate, extended

Summary

As of May 2026, OpenAI's data platform managed 1.5 exabytes across 90,000 datasets for approximately 4,000 internal users. To help those users find and understand the relevant tables, OpenAI built an in-house data agent. The agent is described as "vanilla" in architecture: it runs GPT-5.5 within a simple harness and relies on a robust data infrastructure and six layers of context assembly for reliability. Users ask natural-language questions in platforms like Slack; the agent generates SQL, executes the queries, and returns verified answers with its reasoning. Beyond data querying, OpenAI uses Codex for critical internal operations: a cross-cloud migration of 90,000 tables and 600 petabytes in two months, open-source patch releases automated without human intervention, and support ticket triage in which Codex investigates issues and applies fixes.

Key takeaway

For data engineers or MLOps teams building internal data agents: prioritize a unified, well-annotated data foundation over complex agent architectures. The agent's reliability hinges on high-quality context assembly, not an intricate LLM setup. Curate a minimal set of distinct, non-overlapping tools, and retrieve from trusted historical queries. This approach enables rapid, large-scale operations like migrations and automated patching, accelerating project timelines and amplifying engineering capacity.

Key insights

The reliability of AI agents at scale stems from strong data foundations and carefully engineered context, not complex agent architectures.

Principles

Data foundation trumps agent complexity.
Fewer, distinct tools improve agent performance.
Prioritize trusted data for retrieval.
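
One way to act on the "fewer, distinct tools" principle is to check a tool catalog for functional overlap before handing it to an agent. The sketch below is illustrative only: the tool names, capability tags, and the 0.5 threshold are assumptions, not details from the original article. It flags pairs of tools whose capability tags overlap heavily, using Jaccard similarity.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of capability tags two tools have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlapping_tools(
    tools: dict[str, set[str]], threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """Return tool pairs whose capability tags overlap at or above threshold."""
    pairs = []
    for (name_a, caps_a), (name_b, caps_b) in combinations(sorted(tools.items()), 2):
        score = jaccard(caps_a, caps_b)
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 2)))
    return pairs


# Hypothetical catalog: two search tools that do nearly the same job.
TOOLS = {
    "search_tables": {"search", "tables", "metadata"},
    "find_datasets": {"search", "tables", "metadata", "owners"},
    "run_sql": {"execute", "sql"},
}

flagged = overlapping_tools(TOOLS)
```

A flagged pair is a candidate for merging or removal, so the model does not have to choose between near-identical tools.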

Method

The data agent embeds each natural-language question as a vector, assembles context from six layers (metadata, human annotations, Codex enrichment, institutional knowledge, memory, runtime context), then enters an LLM loop that generates, executes, and verifies SQL queries.
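
The method above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not OpenAI's implementation: the layer keys, the `assemble_context` and `answer` functions, the retry limit, and the "non-empty result" verification rule are all hypothetical, the embedding and retrieval step is omitted, and `fake_model` is a stub standing in for the LLM call. SQLite stands in for the warehouse.

```python
import sqlite3
from typing import Callable

# The six context layers named in the summary, in a fixed assembly order.
LAYERS = (
    "metadata", "human_annotations", "codex_enrichment",
    "institutional_knowledge", "memory", "runtime_context",
)


def assemble_context(layers: dict[str, str]) -> str:
    """Concatenate the non-empty layers into one prompt context string."""
    parts = [f"## {name}\n{layers[name]}" for name in LAYERS if layers.get(name)]
    return "\n\n".join(parts)


def answer(
    question: str,
    context: str,
    generate_sql: Callable[[str, str, str], str],
    conn: sqlite3.Connection,
    max_attempts: int = 3,
) -> tuple[str, list[tuple]]:
    """Generate SQL, execute it, and retry with feedback until it verifies."""
    feedback = ""
    for _ in range(max_attempts):
        sql = generate_sql(question, context, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"query failed: {exc}"  # fed back into the next attempt
            continue
        if rows:  # assumed verification rule: the query ran and returned rows
            return sql, rows
        feedback = "query returned no rows"
    raise RuntimeError(f"no verified answer after {max_attempts} attempts: {feedback}")


def fake_model(question: str, context: str, feedback: str) -> str:
    """Stub for the LLM: wrong table name first, corrected after feedback."""
    if not feedback:
        return "SELECT SUM(n) FROM signup"
    return "SELECT SUM(n) FROM signups"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (day TEXT, n INTEGER)")
conn.executemany("INSERT INTO signups VALUES (?, ?)", [("mon", 10), ("tue", 20)])

context = assemble_context({"metadata": "signups(day, n)", "memory": "table is signups"})
sql, rows = answer("How many signups in total?", context, fake_model, conn)
```

The loop itself stays simple; most of the work that determines answer quality happens in what goes into `context`, which matches the article's emphasis on context assembly over agent complexity.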

In practice

Consolidate data infrastructure before agent deployment.
Curate agent tools to avoid functional overlap.
Rank historical queries by trustworthiness for context.
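
Ranking historical queries by trustworthiness could look roughly like the sketch below. The fields (`certified`, `run_count`), the weighting formula, and the sample data are assumptions made for illustration; the original article does not specify how trust is scored. Each past query's similarity to the new question is discounted by a trust score, so a well-used, reviewed query can outrank a closer but unvetted one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastQuery:
    sql: str
    similarity: float  # similarity to the new question, 0..1 (assumed precomputed)
    certified: bool    # hypothetical flag: reviewed by a data owner
    run_count: int     # hypothetical signal: number of successful runs


def trust(q: PastQuery) -> float:
    """Assumed trust score in [0.25, 1.0]: certification plus capped usage."""
    base = 1.0 if q.certified else 0.5
    usage = min(q.run_count, 100) / 100
    return base * (0.5 + 0.5 * usage)


def rank(queries: list[PastQuery], k: int = 2) -> list[PastQuery]:
    """Return the top-k past queries by similarity weighted by trust."""
    return sorted(queries, key=lambda q: q.similarity * trust(q), reverse=True)[:k]


HISTORY = [
    PastQuery("SELECT /* ad hoc */ 1", similarity=0.9, certified=False, run_count=2),
    PastQuery("SELECT /* dashboard */ 2", similarity=0.8, certified=True, run_count=100),
    PastQuery("SELECT /* reviewed, new */ 3", similarity=0.7, certified=True, run_count=0),
]

top = rank(HISTORY)
```

Here the ad hoc query is the closest match but drops out of the top two because it is neither certified nor widely run.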

Topics

OpenAI Data Agent
Data Platform Engineering
LLM Agent Architecture
Data Context Management
Codex Use Cases
Cloud Data Migration

Best for: MLOps Engineer, Data Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.