Astragalus: Automatic Configuration Repair for Production Networks
Summary
Astragalus is an automatic configuration repair (ACR) tool designed to address network misconfigurations, a major cause of service outages. Unlike existing "semantic-driven" approaches that struggle with scalability due to complex SMT constraints, Astragalus employs a "syntax-driven" method inspired by automatic program repair. It utilizes a "localize-fix-validate" pipeline to efficiently identify and correct errors. The tool demonstrated high effectiveness, repairing 100% of incidents in synthesized networks and 97.5% in a real network with 15 types of injected errors, averaging 7.36 seconds per repair. It also provided valid suggestions within 6 minutes for four recent incidents in a production network of O(1,000)-O(10,000) devices, proving significantly faster and more scalable than prior solutions like AED and CEL.
Key takeaway
Network operators who manage large-scale production networks and struggle with misconfiguration-induced outages should evaluate "syntax-driven" automatic configuration repair (ACR) tools like Astragalus. This approach accelerates fault localization and repair, often resolving incidents in seconds, and scales better than "semantic-driven" methods. Although these tools do not directly identify every complex root cause, they provide actionable suggestions that reduce manual troubleshooting time, improving network stability and operational efficiency.
Key insights
Syntax-driven automatic configuration repair offers superior scalability and generality over semantic-driven methods.
Principles
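- Network configurations satisfy the "plastic surgery hypothesis": because devices in the same role carry redundant configuration, the ingredients of a fix usually already exist elsewhere in the network.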
- Fast network verification enables efficient validation of numerous candidate fixes.
- Syntax-driven repair generalizes across diverse protocols and vendors.
Method
Astragalus employs a "localize-fix-validate" pipeline: it localizes suspicious configuration lines via spectrum-based fault localization (SBFL), generates candidate fixes (remove, insert, modify) from existing configurations, and then validates each candidate using network verifiers.
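To make the fix and validate stages concrete, here is a minimal Python sketch of a generate-and-validate loop. It is an illustration under stated assumptions, not Astragalus's implementation: a configuration is modeled as a plain list of lines, `suspicious` is a list of line indices assumed to come from the localization stage, `donors` stands for lines harvested from other existing configurations, and `verify` is a hypothetical callback standing in for a real network verifier. All of these names are invented for the example.

```python
from typing import Callable, Iterator, List, Optional

Config = List[str]  # assumption: a device configuration modeled as a list of lines


def candidate_fixes(config: Config, suspicious: List[int],
                    donors: List[str]) -> Iterator[Config]:
    """Yield candidate configurations built from three edit operations.

    `suspicious` holds indices of suspicious lines, most suspicious first.
    `donors` holds lines taken from other existing configurations.
    """
    # Remove: drop one suspicious line (tried first in this sketch).
    for i in suspicious:
        yield config[:i] + config[i + 1:]
    # Insert: add a donor line right after a suspicious line.
    for i in suspicious:
        for donor in donors:
            if donor not in config:
                yield config[:i + 1] + [donor] + config[i + 1:]
    # Modify: replace a suspicious line with a donor line.
    for i in suspicious:
        for donor in donors:
            if donor != config[i]:
                yield config[:i] + [donor] + config[i + 1:]


def repair(config: Config, suspicious: List[int], donors: List[str],
           verify: Callable[[Config], bool]) -> Optional[Config]:
    """Return the first candidate accepted by the verifier, or None."""
    for candidate in candidate_fixes(config, suspicious, donors):
        if verify(candidate):  # hypothetical stand-in for a network verifier
            return candidate
    return None
```

Because every candidate is a small edit drawn from lines that already exist, the loop stays cheap as long as the verifier is fast; the search order (remove, then insert, then modify) is a choice made for this sketch.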
In practice
- Apply Ochiai as the default SBFL technique for localization.
- Prioritize "Remove" operations when generating configuration fixes.
- Integrate existing network verifiers for rapid candidate fix validation.
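The Ochiai metric mentioned in the first item can be sketched in a few lines of Python. Ochiai scores an element as ef / sqrt((ef + nf) * (ef + ep)), where ef and ep count the failing and passing tests that cover the element and nf counts the failing tests that do not. The mapping to networks here is an assumption made for illustration: each "test" is taken to be a checked intent, and its coverage is the set of configuration lines involved in checking it. The function names and line identifiers are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Set


def ochiai_scores(coverage: Dict[str, Set[str]],
                  failed: Set[str]) -> Dict[str, float]:
    """Compute Ochiai suspiciousness for every covered configuration line.

    coverage: test name -> set of configuration line ids the test exercises
              (assumption: a "test" is one checked network intent).
    failed:   names of the tests that failed.
    """
    total_failed = len(failed & set(coverage))  # equals ef + nf for any line
    lines: Set[str] = set()
    for covered in coverage.values():
        lines |= covered

    scores: Dict[str, float] = {}
    for line in lines:
        ef = sum(1 for t, cov in coverage.items() if t in failed and line in cov)
        ep = sum(1 for t, cov in coverage.items() if t not in failed and line in cov)
        denom = sqrt(total_failed * (ef + ep))
        scores[line] = ef / denom if denom else 0.0
    return scores


def rank(scores: Dict[str, float]) -> List[str]:
    """Order line ids from most to least suspicious (ties broken by id)."""
    return sorted(scores, key=lambda line: (-scores[line], line))


if __name__ == "__main__":
    cov = {
        "intent-1": {"r1:acl-10", "r1:bgp-1"},
        "intent-2": {"r1:bgp-1"},
        "intent-3": {"r1:acl-10"},
    }
    print(rank(ochiai_scores(cov, failed={"intent-1", "intent-3"})))
```

In the small example, the ACL line is covered only by failing intents and therefore ranks above the BGP line, which is also covered by a passing intent.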
Topics
- Automatic Configuration Repair
- Network Configuration
- Syntax-driven Repair
- Fault Localization
- Network Verification
- Data Center Networks
Best for: Research Scientist, IT Professional, Operations Professional
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.