Meta’s AI Storage Blueprint at Scale

Source: Engineering at Meta · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure) · Depth: Advanced, long

Summary

Meta has redesigned its BLOB-storage architecture to remove bottlenecks in AI workloads, which are characterized by exponential model growth and increasing dataset sizes. The company operates hundreds of exabyte-scale storage clusters that support all of its products. AI compute performance triples every two years while storage growth is modest, and the legacy BLOB storage caused GPU stalls and slowed research velocity. The new architecture uses a unified metadata schema backed by ZippyDB for O(1) lookups, replaces the dataplane proxy with a fat client SDK that streams data directly, and is deployed regionally, colocated with GPUs. Further optimizations include a distributed data cache that leverages Meta's Owl subsystem and achieves an 80% average cache hit rate, and a read-plan metadata cache that provides 1-2 ms access. Together, these changes aim to maximize GPU utilization and accelerate research iteration by enabling on-demand data hydration across a tiered caching architecture.
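The read path summarized above can be illustrated with a minimal sketch, assuming a key-value metadata store and direct client-to-node reads. The dictionaries below only stand in for ZippyDB and the storage nodes, and `ReadPlan`, `FatClient`, and every key name are hypothetical, not Meta's SDK.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple


@dataclass(frozen=True)
class ReadPlan:
    """Hypothetical read plan: which node holds each chunk of a blob."""
    chunks: Tuple[Tuple[str, str], ...]  # ordered (node_id, chunk_id) pairs


class FatClient:
    """Toy client that resolves metadata itself and reads nodes directly."""

    def __init__(self, metadata: Dict[str, ReadPlan],
                 nodes: Dict[str, Dict[str, bytes]]) -> None:
        self.metadata = metadata  # assumption: stand-in for a key-value store
        self.nodes = nodes        # assumption: stand-in for storage nodes
        self.plan_cache: Dict[str, ReadPlan] = {}
        self.metadata_lookups = 0

    def _read_plan(self, blob_key: str) -> ReadPlan:
        # Serve repeat reads from the client-side read-plan cache.
        plan = self.plan_cache.get(blob_key)
        if plan is None:
            self.metadata_lookups += 1
            plan = self.metadata[blob_key]  # one hash lookup, O(1) on average
            self.plan_cache[blob_key] = plan
        return plan

    def stream(self, blob_key: str) -> Iterator[bytes]:
        # No proxy hop: each chunk is read straight from the node that owns it.
        for node_id, chunk_id in self._read_plan(blob_key).chunks:
            yield self.nodes[node_id][chunk_id]


def demo() -> Tuple[bytes, int]:
    nodes = {"node-a": {"c0": b"hello "}, "node-b": {"c1": b"world"}}
    metadata = {
        "dataset/shard-0": ReadPlan((("node-a", "c0"), ("node-b", "c1"))),
    }
    client = FatClient(metadata, nodes)
    first = b"".join(client.stream("dataset/shard-0"))
    second = b"".join(client.stream("dataset/shard-0"))
    assert first == second
    return first, client.metadata_lookups
```

In this toy version the second read of the same blob skips the metadata store entirely, which is the effect a read-plan cache is meant to have.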

Key takeaway

AI architects designing large-scale training infrastructure should prioritize storage architectures that minimize GPU stalls and accelerate research iteration. Consider adopting a unified metadata schema and client-side direct data streaming to reduce latency. A tiered caching system with prefetching and on-demand data hydration can significantly improve GPU utilization and reduce cross-region data ingestion times, so researchers can focus on model tuning rather than data movement.
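To see why cache hit rate matters for stalls, a back-of-the-envelope estimate weights hit and miss latency by how often each occurs. The 80% hit rate comes from the summary; the hit and miss latencies in the usage note are purely hypothetical placeholders, not figures from Meta.

```python
def expected_read_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Average latency = hit_rate * hit_ms + (1 - hit_rate) * miss_ms."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms
```

With illustrative values of 2 ms per hit and 50 ms per miss, `expected_read_latency_ms(0.8, 2.0, 50.0)` comes to roughly 11.6 ms, and about 10 ms of that is contributed by the 20% of reads that miss. Under those assumed numbers, raising the hit rate or shortening the miss path pays off more than speeding up hits.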

Key insights

Meta's BLOB storage evolved to optimize AI workloads by addressing GPU stalls and accelerating research velocity through architectural redesign.

Method

Meta rebuilt its BLOB storage with a unified metadata schema, eliminated dataplane proxies, and implemented a tiered caching system with prefetching and on-demand data hydration.
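A tiered cache with prefetching and on-demand hydration could look roughly like the sketch below. It assumes two LRU tiers (a small host-local one and a larger regional one) in front of a remote source; the class names, tier sizes, and shard keys are all hypothetical and are not taken from Meta's implementation.

```python
from collections import OrderedDict
from typing import Callable, Dict, Iterable, Optional


class LRUCache:
    """Small least-recently-used cache used for both tiers."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


class TieredReader:
    """Reads through local -> regional -> remote, hydrating on a miss."""

    def __init__(self, local: LRUCache, regional: LRUCache,
                 fetch_remote: Callable[[str], bytes]) -> None:
        self.local = local
        self.regional = regional
        self.fetch_remote = fetch_remote  # assumption: cross-region fetch
        self.hits: Dict[str, int] = {"local": 0, "regional": 0, "remote": 0}

    def read(self, key: str) -> bytes:
        data = self.local.get(key)
        if data is not None:
            self.hits["local"] += 1
            return data
        data = self.regional.get(key)
        if data is not None:
            self.hits["regional"] += 1
        else:
            # On-demand hydration: pull from remote and fill the regional tier.
            data = self.fetch_remote(key)
            self.hits["remote"] += 1
            self.regional.put(key, data)
        self.local.put(key, data)
        return data

    def prefetch(self, keys: Iterable[str]) -> None:
        # Warm the regional tier ahead of the training loop.
        for key in keys:
            if self.local.get(key) is None and self.regional.get(key) is None:
                self.regional.put(key, self.fetch_remote(key))


def demo() -> Dict[str, int]:
    remote = {f"shard-{i}": bytes([i]) for i in range(4)}
    reader = TieredReader(LRUCache(1), LRUCache(4), remote.__getitem__)
    reader.prefetch(["shard-1"])
    for key in ["shard-0", "shard-0", "shard-1", "shard-0"]:
        assert reader.read(key) == remote[key]
    return reader.hits
```

In the demo, only the first read of `shard-0` reaches the remote source; the prefetched `shard-1` and the later re-read of `shard-0` are both served from the regional tier.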

Best for: MLOps Engineer, AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Engineering at Meta.