Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment

2022-12-27 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Expert, extended

Summary

A study conducted at ING, a large international bank, presents a predictive incident risk scoring approach designed to prevent IT incidents caused by changes in highly regulated financial environments. The research compares the bank's existing rule-based risk assessment with three gradient-boosted tree models: HistGradientBoostingClassifier (HGBC), LightGBM, and XGBoost. The models are trained on a one-year dataset of 175,000 closed change tickets and the priority 1 and 2 incidents linked to them. LightGBM emerged as the best-performing model, especially when enriched with aggregated team metrics such as change success rates and incident counts. The approach emphasizes auditability and explainability: SHAP values provide feature-level insight into each prediction so that decisions are transparent and traceable, which is critical for regulatory compliance under frameworks like the Digital Operational Resilience Act (DORA) and the EU AI Act. According to the study, this data-driven method significantly outperforms the baseline rule-based system in identifying high-risk changes, enabling proactive risk mitigation and improving IT operational reliability.

Key takeaway

For CTOs and VPs of Engineering managing IT change in regulated sectors, adopting data-driven ML models like LightGBM for incident prediction offers superior risk assessment compared to traditional rule-based systems. You should integrate explainable AI (XAI) techniques, such as SHAP, into your change management workflows to ensure auditability and build trust. This enables proactive identification of high-risk changes, allowing your teams to apply targeted scrutiny and preventive actions, thereby enhancing operational resilience and compliance with regulations like DORA and the EU AI Act.

Key insights

Interpretable ML models can effectively predict IT incident risk from changes in regulated environments, outperforming rule-based systems.

Principles

Explainability is crucial for regulatory compliance and user trust.
Aggregated team metrics enhance predictive model accuracy.
High recall is prioritized over precision in incident prevention.
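
The aggregated team metrics mentioned above can be illustrated with a minimal sketch. It assumes hypothetical ticket fields (`team`, `closed`, `success`, `incident`) and made-up records; none of these names or values come from the study. One design choice here is an assumption rather than something the summary states: each ticket's features are computed only from that team's earlier tickets, so a change never sees its own outcome.

```python
from collections import defaultdict
from datetime import date

# Hypothetical ticket records; field names and values are illustrative only.
tickets = [
    {"id": "C1", "team": "payments", "closed": date(2022, 1, 5), "success": True,  "incident": False},
    {"id": "C2", "team": "payments", "closed": date(2022, 1, 9), "success": False, "incident": True},
    {"id": "C3", "team": "payments", "closed": date(2022, 2, 1), "success": True,  "incident": False},
    {"id": "C4", "team": "cards",    "closed": date(2022, 1, 7), "success": True,  "incident": False},
    {"id": "C5", "team": "cards",    "closed": date(2022, 2, 3), "success": True,  "incident": False},
]

def add_team_features(tickets):
    """Attach per-team history features computed only from earlier tickets."""
    history = defaultdict(lambda: {"n": 0, "ok": 0, "inc": 0})
    enriched = []
    for t in sorted(tickets, key=lambda t: t["closed"]):
        h = history[t["team"]]
        row = dict(t)
        row["team_prior_changes"] = h["n"]
        # None marks "no history yet"; tree models can treat it as a missing value.
        row["team_success_rate"] = h["ok"] / h["n"] if h["n"] else None
        row["team_incident_count"] = h["inc"]
        enriched.append(row)
        # Update the history after the row is built so a ticket never sees its own outcome.
        h["n"] += 1
        h["ok"] += t["success"]
        h["inc"] += t["incident"]
    return enriched

features = {row["id"]: row for row in add_team_features(tickets)}
print(features["C3"]["team_success_rate"], features["C3"]["team_incident_count"])
```

Ticket C3 is the third change by the hypothetical payments team, so its features reflect one success and one failure with one linked incident before it.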

Method

Train boosted tree classifiers (LightGBM performed best) on historical change and incident data, enriched with aggregated team metrics. The trained model produces an incident risk score for each planned IT change, and SHAP values explain each score at the feature level.
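
To show what a feature-level explanation looks like, here is a brute-force sketch of Shapley values, the quantity SHAP is built on. It is not the study's pipeline: `toy_risk` is a hand-written stand-in for a trained model, the feature names and weights are invented, and a single reference change serves as the baseline, which is a simplification. In practice, libraries such as `shap` compute these attributions efficiently for tree models; enumerating coalitions as below is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(score, instance, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Features in the coalition take the instance's value, the rest the baseline's.
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return score(x)

    phi = {}
    for f in names:
        others = [k for k in names if k != f]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

def toy_risk(x):
    """Hypothetical risk score standing in for a trained model (weights are made up)."""
    score = 0.05
    score += 0.30 * (1.0 - x["team_success_rate"])
    score += 0.10 * x["team_incident_count"]
    if x["is_emergency"] and x["team_incident_count"] > 0:
        score += 0.20  # interaction between two features
    return score

# Reference change (baseline) and the planned change being explained; both hypothetical.
baseline = {"team_success_rate": 1.0, "team_incident_count": 0, "is_emergency": 0}
change = {"team_success_rate": 0.5, "team_incident_count": 2, "is_emergency": 1}

phi = shapley_values(toy_risk, change, baseline)
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:22s} {contribution:+.3f}")
```

The contributions add up to the difference between the explained change's score and the baseline score. That additivity is what makes a per-change explanation traceable: every point of risk is assigned to a named feature.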

In practice

Use SHAP to explain model predictions to engineers.
Incorporate team performance metrics into risk models.
Prioritize recall so that as many incident-inducing changes as possible are flagged.
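
One way to act on the recall priority is to pick the decision threshold from a recall target and then read off the precision that target costs. The sketch below assumes made-up scores and labels and hypothetical helper names (`precision_recall_at`, `threshold_for_recall`); the study's actual thresholding procedure is not described in this summary.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every change scoring >= threshold is flagged."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def threshold_for_recall(scores, labels, target_recall):
    """Return the highest threshold whose recall meets the target."""
    for t in sorted(set(scores), reverse=True):
        precision, recall = precision_recall_at(scores, labels, t)
        if recall >= target_recall:
            return t, precision, recall
    raise ValueError("target recall is unreachable for these labels")

# Hypothetical model scores and outcomes (1 = change caused an incident).
scores = [0.92, 0.81, 0.64, 0.55, 0.40, 0.33, 0.21, 0.10]
labels = [1, 0, 1, 0, 0, 1, 0, 0]

threshold, precision, recall = threshold_for_recall(scores, labels, target_recall=1.0)
print(threshold, precision, recall)
```

On this toy data, catching every incident-inducing change means flagging six of the eight changes, half of which are false alarms. That is the trade-off a recall-first policy accepts.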

Topics

Predictive Incident Prevention
IT Change Management
Explainable AI
LightGBM Model
Financial Sector Regulation

Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, MLOps Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.