Migrating Data Ingestion Systems at Meta Scale
Summary
Meta has revamped its data ingestion system, which moves petabytes of social graph data out of one of the world's largest MySQL deployments, replacing it with a more efficient and reliable self-managed data warehouse service. The transition moved 100% of the workload off customer-owned pipelines, which had been unstable under strict data landing time requirements. Tens of thousands of ingestion jobs were migrated through a structured three-phase lifecycle: Shadow, Reverse Shadow, and Cleanup. Key strategies included clear job promotion criteria, custom data quality analysis tools that logged mismatches to Scuba for real-time debugging, and robust rollout and rollback mechanisms. Automated tooling continuously monitored job status and managed promotions, while batch migration planning optimized resource use by categorizing jobs and holding back those with known issues.
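As a rough illustration of the data quality analysis described above, the sketch below compares the output of an old job and a new job row by row and produces one mismatch record per differing primary key. It is a minimal sketch under stated assumptions, not Meta's tooling: the function name `find_mismatches`, the record fields, and the in-memory dict representation are all hypothetical, and the step that would ship records to a logging backend such as Scuba is deliberately left out.

```python
# Hypothetical sketch: row-level comparison between old and new job output.
# Rows are modeled as {primary_key: {column: value}}; this layout is an assumption.

_MISSING = object()  # sentinel so an absent column differs from a column set to None


def find_mismatches(old_rows, new_rows):
    """Return one mismatch record per primary key whose rows differ."""
    mismatches = []
    for key in sorted(old_rows.keys() | new_rows.keys()):
        old = old_rows.get(key)
        new = new_rows.get(key)
        if old is None:
            # Row exists only in the new system's output.
            mismatches.append({"key": key, "kind": "extra_in_new"})
        elif new is None:
            # Row exists only in the old system's output.
            mismatches.append({"key": key, "kind": "missing_in_new"})
        elif old != new:
            # Row exists in both outputs; list the columns that disagree.
            columns = sorted(
                col
                for col in old.keys() | new.keys()
                if old.get(col, _MISSING) != new.get(col, _MISSING)
            )
            mismatches.append({"key": key, "kind": "value_diff", "columns": columns})
    return mismatches


old_output = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
new_output = {1: {"name": "a"}, 2: {"name": "B"}, 4: {"name": "d"}}
report = find_mismatches(old_output, new_output)
```

Each record carries enough context (key, kind of mismatch, affected columns) to be filtered and grouped during debugging, which is the property the summary attributes to the logged mismatches.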
Key takeaway
For Data Engineers or MLOps Engineers managing critical data ingestion systems, Meta's migration strategy offers a robust blueprint. You should adopt a phased migration lifecycle, starting with shadow testing in pre-production to validate data quality and resource usage. Implement a reverse shadow phase to enable continuous validation and rapid rollback capabilities, minimizing impact from potential issues. Automate job monitoring and promotion criteria to efficiently manage large-scale transitions, ensuring data integrity and operational stability throughout the process.
Key insights
Large-scale data system migrations require a phased approach with continuous validation and automated controls to ensure data integrity and operational reliability.
Principles
- Data integrity and operational reliability are paramount during migration.
- Phased rollout with continuous validation minimizes risk.
- Automation is critical for managing migration at scale.
Method
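Implement a three-phase migration lifecycle: Shadow (the new system runs in pre-production alongside the old one), Reverse Shadow (the new system serves production while the old one keeps running as the shadow), and Cleanup (the old job is retired). Verify data consistency, latency, and resource use before each promotion.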
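The three phases can be modeled as a small state machine. The sketch below is one possible encoding, not the original implementation: the `Phase` enum, the `promote` and `rollback` helpers, and the choice that a rollback returns a job from Reverse Shadow to Shadow while Cleanup is terminal are all assumptions made for illustration.

```python
from enum import Enum


class Phase(Enum):
    SHADOW = "shadow"                  # new system runs in pre-production next to the old one
    REVERSE_SHADOW = "reverse_shadow"  # new system serves production, old one runs as shadow
    CLEANUP = "cleanup"                # old job is retired


# Allowed forward moves (promotion) and backward moves (rollback).
# Assumption: Cleanup is terminal, and only Reverse Shadow can be rolled back.
PROMOTE = {Phase.SHADOW: Phase.REVERSE_SHADOW, Phase.REVERSE_SHADOW: Phase.CLEANUP}
ROLLBACK = {Phase.REVERSE_SHADOW: Phase.SHADOW}


def promote(phase: Phase) -> Phase:
    """Move a job one phase forward, or fail if no forward move exists."""
    if phase not in PROMOTE:
        raise ValueError(f"cannot promote from {phase.value}")
    return PROMOTE[phase]


def rollback(phase: Phase) -> Phase:
    """Move a job back to the previous phase, or fail if rollback is not allowed."""
    if phase not in ROLLBACK:
        raise ValueError(f"cannot roll back from {phase.value}")
    return ROLLBACK[phase]
```

Keeping the transitions in explicit tables makes illegal moves, such as promoting a job that is already in Cleanup, fail loudly instead of silently changing state.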
In practice
- Use shadow jobs in pre-production to expose new systems to real data.
- Employ a reverse shadow phase for ongoing data quality signals and fast rollback.
- Automate job promotion/demotion based on defined migration criteria.
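Automated promotion and demotion, as in the last point above, can be sketched as a pure function over a job's recent run history. Everything concrete here is hypothetical: the `RunStats` fields, the `Criteria` thresholds (three consecutive clean runs, resource use no higher than the old system), and the rule that any data mismatch in the latest run triggers a demotion are illustrative assumptions, not the criteria Meta used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    mismatched_rows: int      # rows that differ between old and new output
    landing_delay_min: float  # minutes past the landing deadline (<= 0 means on time)
    resource_ratio: float     # new system resource use divided by old system resource use


@dataclass(frozen=True)
class Criteria:
    required_clean_runs: int = 3     # hypothetical threshold
    max_resource_ratio: float = 1.0  # hypothetical threshold


def is_clean(run: RunStats, criteria: Criteria) -> bool:
    """A run is clean if data matches, it landed on time, and cost is acceptable."""
    return (
        run.mismatched_rows == 0
        and run.landing_delay_min <= 0
        and run.resource_ratio <= criteria.max_resource_ratio
    )


def decide(history: list[RunStats], criteria: Criteria = Criteria()) -> str:
    """Return 'promote', 'demote', or 'hold' for a job given its runs, oldest first."""
    if not history:
        return "hold"
    if history[-1].mismatched_rows > 0:
        # Assumed policy: a data mismatch in the latest run forces a demotion.
        return "demote"
    recent = history[-criteria.required_clean_runs:]
    if len(recent) == criteria.required_clean_runs and all(
        is_clean(run, criteria) for run in recent
    ):
        return "promote"
    return "hold"


clean = RunStats(mismatched_rows=0, landing_delay_min=-5.0, resource_ratio=0.8)
late = RunStats(mismatched_rows=0, landing_delay_min=12.0, resource_ratio=0.8)
bad = RunStats(mismatched_rows=4, landing_delay_min=-5.0, resource_ratio=0.8)
```

A monitoring loop could call `decide` for each job after every run and apply the result, so that promotions and demotions need no manual review at the scale of tens of thousands of jobs.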
Topics
- Data Ingestion
- System Migration
- Data Quality Assurance
- Change Data Capture
- Shadow Testing
- Large-Scale Data Systems
Best for: Data Engineer, MLOps Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Engineering at Meta.