Meta’s AI Storage Blueprint at Scale
Summary
Meta has redesigned its BLOB storage architecture to remove critical bottlenecks in AI workloads, which are characterized by exponential model growth and ever-larger datasets. The company operates hundreds of exabyte-scale storage clusters that support all its products. AI compute performance triples every two years while storage growth remains modest, and the legacy BLOB storage caused GPU stalls and slowed research velocity. The new architecture uses a unified metadata schema backed by ZippyDB for O(1) lookups, replaces the dataplane proxy with a fat client SDK that streams data directly, and is deployed regionally, colocated with GPUs. Further optimizations include a distributed data cache built on Meta's Owl subsystem, which achieves an 80% average cache hit rate, and a read-plan metadata cache that provides 1-2 ms access. Together, these changes aim to maximize GPU utilization and accelerate research iteration by hydrating data on demand across a tiered caching architecture.
Key takeaway
If you are an AI architect designing large-scale training infrastructure, prioritize storage architectures that minimize GPU stalls and accelerate research iteration. Consider adopting a unified metadata schema and client-side direct data streaming to reduce latency. A tiered caching system with prefetching and on-demand data hydration can significantly improve GPU utilization and reduce cross-region data ingestion times, so that researchers can focus on model tuning rather than data movement.
Key insights
Meta redesigned its BLOB storage architecture for AI workloads to eliminate GPU stalls and accelerate research velocity.
Principles
- AI storage needs predictable pMax (worst-case) latencies.
- Regional deployment improves data locality.
- Tiered caching reduces I/O requirements.
Method
Meta rebuilt its BLOB storage with a unified metadata schema, eliminated dataplane proxies, and implemented a tiered caching system with prefetching and on-demand data hydration.
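The tiered caching idea can be illustrated with a minimal read-through sketch. It is not Meta's implementation: the `LRUCache` and `TieredReader` classes, the two-tier layout, and the `fetch_from_store` callback are all hypothetical names chosen for illustration. A read checks a local tier, then a regional tier, and only hydrates from the backing store on a miss; `prefetch` warms the regional tier ahead of demand.

```python
from collections import OrderedDict
from typing import Callable, Dict, Iterable, Optional


class LRUCache:
    """A small in-memory LRU cache standing in for one cache tier (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


class TieredReader:
    """Read-through over a local tier, a regional tier, and a backing store."""

    def __init__(
        self,
        local: LRUCache,
        regional: LRUCache,
        fetch_from_store: Callable[[str], bytes],
    ) -> None:
        self.local = local
        self.regional = regional
        self.fetch = fetch_from_store  # assumed slow, possibly cross-region
        self.stats: Dict[str, int] = {"local": 0, "regional": 0, "store": 0}

    def read(self, key: str) -> bytes:
        data = self.local.get(key)
        if data is not None:
            self.stats["local"] += 1
            return data
        data = self.regional.get(key)
        if data is not None:
            self.stats["regional"] += 1
        else:
            # On-demand hydration: pull from the store only when both tiers miss.
            data = self.fetch(key)
            self.stats["store"] += 1
            self.regional.put(key, data)
        self.local.put(key, data)
        return data

    def prefetch(self, keys: Iterable[str]) -> None:
        """Warm the regional tier for keys the trainer is expected to read soon."""
        for key in keys:
            if self.local.get(key) is None and self.regional.get(key) is None:
                self.regional.put(key, self.fetch(key))
```

In this sketch the hit rate of each tier can be read off `stats`; a real deployment would also need eviction policies, invalidation, and concurrency control that are omitted here.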
In practice
- Implement unified metadata for O(1) lookups.
- Use client-side direct data streaming.
- Deploy distributed data and metadata caches.
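The first two practices above can be sketched together, under stated assumptions. The `MetadataStore` below is a plain dictionary standing in for a key-value metadata service; it does not reflect ZippyDB's actual API. `Extent`, `StorageNode`, and `FatClient` are likewise hypothetical names. The point is the shape of the read path: one constant-time metadata lookup returns a read plan, the client caches that plan, and then reads bytes directly from storage nodes with no proxy in between.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class Extent:
    """One contiguous piece of a blob on a storage node (hypothetical layout)."""
    node: str
    offset: int
    length: int


class MetadataStore:
    """Dict-backed stand-in for a key-value metadata service: one hash lookup per blob."""

    def __init__(self) -> None:
        self._plans: Dict[str, List[Extent]] = {}

    def put_plan(self, blob_id: str, extents: List[Extent]) -> None:
        self._plans[blob_id] = list(extents)

    def get_plan(self, blob_id: str) -> List[Extent]:
        return self._plans[blob_id]  # average O(1) dictionary lookup


class StorageNode:
    """In-memory stand-in for a storage node that serves byte ranges."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def read(self, offset: int, length: int) -> bytes:
        return self._data[offset:offset + length]


class FatClient:
    """Client that resolves a read plan itself and streams from nodes directly."""

    def __init__(self, metadata: MetadataStore, nodes: Dict[str, StorageNode]) -> None:
        self.metadata = metadata
        self.nodes = nodes
        self._plan_cache: Dict[str, List[Extent]] = {}  # read-plan metadata cache
        self.metadata_lookups = 0

    def _read_plan(self, blob_id: str) -> List[Extent]:
        plan = self._plan_cache.get(blob_id)
        if plan is None:
            plan = self.metadata.get_plan(blob_id)
            self.metadata_lookups += 1
            self._plan_cache[blob_id] = plan
        return plan

    def stream(self, blob_id: str) -> Iterator[bytes]:
        # No dataplane proxy: each extent is read straight from its node.
        for extent in self._read_plan(blob_id):
            yield self.nodes[extent.node].read(extent.offset, extent.length)


nodes = {"n1": StorageNode(b"hello world"), "n2": StorageNode(b"xxGPUxx")}
meta = MetadataStore()
meta.put_plan("blob-1", [Extent("n1", 0, 6), Extent("n2", 2, 3)])
client = FatClient(meta, nodes)
payload = b"".join(client.stream("blob-1"))
```

Repeated reads of the same blob reuse the cached plan, so only the first read touches the metadata service. The trade-off of a fat client is that placement and retry logic move into the SDK, which then has to be versioned and rolled out with care.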
Topics
- AI Storage Architecture
- GPU Utilization
- Data Caching
- Metadata Management
- Large-scale AI Training
- BLOB Storage
Best for: MLOps Engineer, AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Engineering at Meta.