Core dump epidemiology: fixing an 18-year-old bug
Summary
On June 30, 2026, OpenAI's engineering team debugged two unrelated problems affecting Rockset, their C++ data infrastructure: an 18-year-old race condition in GNU libunwind and a hardware corruption issue. Rockset is a key component of ChatGPT's data plugins and conversation search, and it had been crashing inexplicably, with returns to bogus addresses or misaligned stack pointers. Instead of analyzing core dumps one at a time, the team treated the crashes as a population and built a pipeline to analyze core dumps automatically. That epidemiological view revealed two distinct crash populations: misaligned-stack crashes linked to a single faulty Azure host, and return-to-null crashes caused by a one-instruction race window in GNU libunwind's `_Ux86_64_setcontext` during C++ exception unwinding. The libunwind bug had been present for 18 years and became operationally visible because of Rockset's high exception rate, frequent `SIGUSR2` signal delivery, and increased signal handler stack usage. OpenAI denylisted the bad host, switched to libgcc's unwinder, and upstreamed a fix to GNU libunwind.
Key takeaway
MLOps engineers debugging intermittent, hard-to-diagnose crashes should shift from single-case analysis to population-level analysis of crash data. Automating core dump analysis makes it possible to separate distinct crash patterns and trace each to its own cause, such as a hardware fault or a subtle race condition. Keeping those populations apart avoids conflating unrelated issues and allows a targeted fix for each. Update operational tooling and runbooks to incorporate this data-driven debugging strategy.
Key insights
Population-level analysis of crash data can reveal distinct, otherwise inexplicable, underlying issues.
Principles
- Conflating distinct issues hinders debugging.
- High-quality population data is crucial.
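- Whether an old bug becomes operationally visible depends on the workload: here, a high exception rate, frequent `SIGUSR2` delivery, and increased signal handler stack usage.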
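The first principle can be illustrated with a small Python sketch. The records, host names, and crash-type labels below are invented for illustration and are not OpenAI's data. The point is only that a pooled view of crashes can blur a signal that becomes obvious once the crash types are separated.

```python
from collections import Counter

# Hypothetical, already-labeled crash records: (crash_type, host).
crashes = [
    ("misaligned_stack", "host-17"),
    ("misaligned_stack", "host-17"),
    ("misaligned_stack", "host-17"),
    ("return_to_null", "host-02"),
    ("return_to_null", "host-09"),
    ("return_to_null", "host-17"),
    ("return_to_null", "host-23"),
]


def top_host_share(records):
    """Return (host, share) for the host with the most crashes in records."""
    counts = Counter(host for _, host in records)
    host, n = counts.most_common(1)[0]
    return host, n / len(records)


def share_by_type(records):
    """Compute the top-host share separately for each crash type."""
    by_type = {}
    for crash_type, host in records:
        by_type.setdefault(crash_type, []).append((crash_type, host))
    return {t: top_host_share(rs) for t, rs in by_type.items()}


pooled = top_host_share(crashes)  # all crash types mixed together
split = share_by_type(crashes)    # one answer per crash type

print("pooled:", pooled)
for crash_type, (host, share) in split.items():
    print(f"{crash_type}: top host {host} has {share:.0%} of crashes")
```

In this toy dataset the pooled view shows one host with a bit more than half of all crashes, which is suggestive but ambiguous. Split by type, one population sits entirely on a single host while the other is spread evenly, which points to two different causes.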
Method
Automate core dump analysis to extract registers, filter false positives, and label crash types, then analyze the resulting dataset for population-level correlations across infrastructure.
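A minimal Python sketch of this method, under stated assumptions: the `CrashRecord` fields, the false-positive rule (dropping dumps whose signal is not `SIGSEGV` or `SIGBUS`), and the labeling rules are hypothetical and are not taken from OpenAI's pipeline. Register values are assumed to have been extracted from each core dump already, for example by a debugger script.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Assumption: only these signals count as genuine crashes in this sketch.
FATAL_SIGNALS = {"SIGSEGV", "SIGBUS"}


@dataclass(frozen=True)
class CrashRecord:
    """Hypothetical per-dump record produced by the extraction step."""
    host: str
    signal: str
    rip: int  # instruction pointer at the time of the crash
    rsp: int  # stack pointer at the time of the crash


def label(rec: CrashRecord) -> Optional[str]:
    """Return a crash-type label, or None for a filtered false positive."""
    if rec.signal not in FATAL_SIGNALS:
        return None  # assumed false positive, e.g. an operator-initiated dump
    if rec.rip == 0:
        return "return_to_null"
    # Assumption: a stack pointer that is not 8-byte aligned is treated as
    # misaligned; a real pipeline may use a different rule.
    if rec.rsp % 8 != 0:
        return "misaligned_stack"
    return "other"


def summarize(records):
    """Count crashes per (label, host), skipping filtered records."""
    counts = Counter()
    for rec in records:
        crash_type = label(rec)
        if crash_type is not None:
            counts[(crash_type, rec.host)] += 1
    return counts


records = [
    CrashRecord("host-17", "SIGSEGV", 0x7F3A12345678, 0x7FFD0000100C),
    CrashRecord("host-02", "SIGSEGV", 0x0, 0x7FFD00002008),
    CrashRecord("host-09", "SIGTERM", 0x401000, 0x7FFD00003000),
    CrashRecord("host-17", "SIGBUS", 0x401000, 0x7FFD00004004),
]

summary = summarize(records)
for (crash_type, host), n in summary.most_common():
    print(f"{crash_type:18s} {host:8s} {n}")
```

Keeping the label and the host in the same key makes it cheap to ask the population-level question later, such as whether one crash type concentrates on one machine.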
In practice
- Implement automated core dump analysis.
- Improve fatal signal handlers for detailed logging.
- Switch to robust, well-maintained libraries when a dependency proves fragile (OpenAI moved from GNU libunwind to libgcc's unwinder).
Topics
- Core Dump Analysis
- C++ Debugging
- Race Conditions
- GNU libunwind
- Azure Infrastructure
- Population-Level Analysis
Best for: CTO, VP of Engineering/Data, Director of AI/ML, Software Engineer, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.