How OpenAI Scaled to 800 Million Users With Postgres

Source: ByteByteGo Newsletter · Field: Technology & Digital — Software Development & Engineering, Cloud Computing & IT Infrastructure, Artificial Intelligence & Machine Learning · Depth: Advanced, medium

Summary

OpenAI scaled a single-primary PostgreSQL database to handle millions of queries per second for 800 million ChatGPT users, achieving five-nines availability and low double-digit millisecond latency. This runs against the conventional wisdom that workloads at this scale require sharding. Over the past year, database load grew by more than 10x, which required systematic optimization across the stack. The strategy centered on minimizing load on the primary: offloading read traffic, migrating write-heavy workloads to sharded systems such as Azure Cosmos DB, and optimizing writes at the application level. OpenAI also optimized queries and connections by avoiding complex multi-table joins, reviewing ORM-generated SQL, and deploying PgBouncer for connection pooling. To prevent cascading failures, the team relied on cache locking, rate limiting, and workload isolation. PostgreSQL's MVCC and schema change constraints were addressed by moving write-heavy tasks elsewhere and enforcing strict rules for schema modifications. High availability is maintained with hot standbys and multiple read replicas, and cascading replication is planned for the future.
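The rate limiting mentioned in the summary can be illustrated with a token bucket. This is a minimal sketch of the general technique, not OpenAI's implementation; the class name `TokenBucket` and its parameters are hypothetical, and the sketch is not thread-safe.

```python
import time


class TokenBucket:
    """Illustrative token-bucket limiter (hypothetical, single-threaded)."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = capacity        # maximum burst size
        self._clock = clock             # injectable clock, useful for testing
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, False if it should be shed."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would check `allow()` before issuing a query and reject or delay the request when it returns `False`, so a traffic spike is shed before it reaches the database.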

Key takeaway

For MLOps Engineers or Data Engineers managing large-scale, read-heavy applications, consider thoroughly optimizing your PostgreSQL deployment before resorting to complex sharding. Your team should focus on offloading read traffic, migrating write-heavy components, and implementing robust connection pooling and query optimization. This approach can significantly extend PostgreSQL's scalability, potentially delaying or avoiding the operational overhead of sharding, as demonstrated by OpenAI's success with ChatGPT.
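As a rough illustration of offloading read traffic, the sketch below routes read-only statements to replicas in round-robin order and everything else to the primary. The class `ReadWriteRouter`, the `needs_fresh_read` flag, and the string-prefix check are assumptions made for this example; a real deployment would classify queries more carefully (for instance, `SELECT ... FOR UPDATE` belongs on the primary).

```python
import itertools


class ReadWriteRouter:
    """Hypothetical router: reads go to replicas, writes go to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replica is configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def target_for(self, sql, needs_fresh_read=False):
        # Naive classification: treat statements starting with SELECT as reads.
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and not needs_fresh_read:
            # Replicas may lag the primary, so only lag-tolerant reads go here.
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM users WHERE id = 1"))
print(router.target_for("UPDATE users SET name = 'a' WHERE id = 1"))
```

The `needs_fresh_read` escape hatch reflects a common design choice: reads that must observe a write made moments earlier stay on the primary, while the rest tolerate replication lag.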

Key insights

PostgreSQL can scale to massive read-heavy workloads with rigorous optimization, challenging conventional sharding wisdom.

Method

OpenAI scaled PostgreSQL by minimizing primary load, optimizing queries and connections with PgBouncer, preventing cascading failures via cache locking and rate limiting, and addressing MVCC and schema change constraints.
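Cache locking, as named above, is commonly implemented so that on a cache miss only one caller queries the database for a given key while the others wait for that result. The following is a minimal in-process sketch of that idea; `SingleFlightCache` and `loader` are hypothetical names, and details such as expiry and eviction are omitted.

```python
import threading


class SingleFlightCache:
    """Hypothetical cache where concurrent misses on one key trigger one load."""

    def __init__(self, loader):
        self._loader = loader            # function that fetches from the database
        self._values = {}                # cached results
        self._key_locks = {}             # one lock per key currently being loaded
        self._guard = threading.Lock()   # protects the two dicts above

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            lock = self._key_locks.setdefault(key, threading.Lock())
        # Only one thread per key gets past this point at a time.
        with lock:
            with self._guard:
                # Another thread may have filled the cache while we waited.
                if key in self._values:
                    return self._values[key]
            value = self._loader(key)    # the single database round trip
            with self._guard:
                self._values[key] = value
                self._key_locks.pop(key, None)
            return value
```

Without the per-key lock, a popular key that expires or is missing would send every waiting request to the database at once, which is the kind of spike that can cascade into a wider failure.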

Best for: Data Engineer, MLOps Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.