The Pulse: AI load breaks GitHub – why not other vendors?
Summary
GitHub's reliability has significantly deteriorated, with third-party trackers reporting uptime as low as 86% (zero nines) in recent months. This includes a critical data integrity incident on April 23rd, where 2,092 pull requests merged via squash merge produced incorrect commits, impacting companies like Modal and Zipline and requiring manual recovery. Subsequent outages on April 27-29 affected pull requests, issues, and GitHub Actions, often due to an overloaded Elasticsearch cluster. Mitchell Hashimoto, founder of HashiCorp, publicly quit GitHub due to persistent unproductivity caused by these outages. GitHub CTO Vlad Fedorov attributed the issues to unexpected load spikes from AI agents, which increased traffic by approximately 3.5x over two years. The company is also undergoing a complex migration to Azure while simultaneously adjusting its capacity planning from a 10x to a 30x increase, initially projected for October 2025 and February 2026, respectively. Other major vendors, including Google, had already prepared for 10x load increases, suggesting GitHub's challenges are partly self-inflicted due to under-preparation and accumulated technical and organizational debt.
Key takeaway
For CTOs and VP of Engineering evaluating source control platforms, GitHub's recent reliability and data integrity failures, coupled with its slow response to AI-driven load, demand a re-evaluation of its suitability for mission-critical development. You should assess the impact of potential outages and data loss on your team's productivity and consider diversifying your Git hosting strategy or exploring alternatives like GitLab, Bitbucket, or self-hosted solutions to mitigate risks.
Key insights
GitHub's severe reliability issues stem from under-preparedness for AI-driven load spikes and complex infrastructure migration.
Principles
- Data integrity is paramount for developer platforms.
- Proactive capacity planning is crucial for scaling services.
Method
GitHub's CTO identified AI agent-fueled load spikes and an ongoing Azure migration as primary causes for recent reliability issues, compounded by technical and organizational debt.
In practice
- Evaluate alternative Git hosting solutions.
- Consider self-hosting for critical repositories.
Topics
- GitHub Reliability
- Data Integrity Incident
- AI Agent Load
- Azure Migration
- Software Outages
Best for: CTO, VP of Engineering/Data, Software Engineer, MLOps Engineer, Director of AI/ML
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by The Pragmatic Engineer.