TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
Summary
TingIS is an end-to-end system for real-time risk event discovery from noisy customer incidents in large-scale cloud-native services. It addresses extreme noise, high throughput, and semantic complexity with a multi-stage event linking engine that combines efficient indexing with Large Language Models (LLMs) to merge events and extract actionable incidents from diverse user descriptions. TingIS also features a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in production, TingIS handles a peak of over 2,000 messages per minute and 300,000 messages per day, achieving a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. It outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.
Key takeaway
For AI Architects or CTOs managing large-scale cloud services, TingIS demonstrates a robust approach to real-time incident discovery. Its integration of LLMs with efficient indexing and multi-dimensional noise reduction offers a blueprint for improving alert latency and discovery rates. Consider adopting similar multi-stage linking and cascaded routing mechanisms to enhance your incident management systems and reduce downtime.
Key insights
TingIS uses LLMs and multi-stage linking to extract actionable incidents from noisy, high-throughput customer data.
Principles
- Combine LLMs with efficient indexing for event merging.
- Integrate domain knowledge for noise reduction.
- Use cascaded routing to improve business attribution.
Method
TingIS employs a multi-stage event linking engine with LLMs for merging, cascaded routing for attribution, and a multi-dimensional noise reduction pipeline using domain knowledge, statistics, and behavioral filtering.
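The two-stage idea behind event linking (cheap index lookup first, expensive semantic judgment second) can be illustrated with a minimal sketch. This is not the TingIS implementation: `EventLinker`, `tokenize`, `keyword_judge`, and both thresholds are hypothetical names and values chosen for illustration, and the `judge` callable is a placeholder for where an LLM merge decision might sit.

```python
from collections import defaultdict
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a stand-in for a real text or vector index."""
    return {w.strip(".,!?").lower() for w in text.split() if len(w) > 2}


def keyword_judge(representative: str, message: str) -> bool:
    """Placeholder for an LLM merge decision (assumption: Jaccard >= 0.4)."""
    a, b = tokenize(representative), tokenize(message)
    return len(a & b) / max(len(a | b), 1) >= 0.4


class EventLinker:
    """Stage 1: inverted-index candidate lookup. Stage 2: judge confirms merge."""

    def __init__(self, judge: Callable[[str, str], bool], min_overlap: float = 0.3):
        self.judge = judge
        self.min_overlap = min_overlap
        self.events: list[list[str]] = []  # event id -> messages
        self.index: dict[str, set[int]] = defaultdict(set)  # token -> event ids

    def _add_to_index(self, eid: int, tokens: set[str]) -> None:
        for t in tokens:
            self.index[t].add(eid)

    def _candidates(self, tokens: set[str]) -> list[int]:
        # Count shared tokens per event, then keep events above the threshold,
        # best overlap first, so the judge is called on few candidates.
        hits: dict[int, int] = defaultdict(int)
        for t in tokens:
            for eid in self.index.get(t, ()):
                hits[eid] += 1
        scored = sorted(((n / len(tokens), eid) for eid, n in hits.items()), reverse=True)
        return [eid for score, eid in scored if score >= self.min_overlap]

    def link(self, message: str) -> int:
        """Return the id of the event the message was merged into or created as."""
        tokens = tokenize(message)
        if tokens:
            for eid in self._candidates(tokens):
                # Compare against the event's first message as its representative.
                if self.judge(self.events[eid][0], message):
                    self.events[eid].append(message)
                    self._add_to_index(eid, tokens)
                    return eid
        eid = len(self.events)
        self.events.append([message])
        self._add_to_index(eid, tokens)
        return eid
```

In this sketch the index only narrows the search; the judge makes the final merge decision. Swapping `keyword_judge` for a model-backed function would not change the surrounding control flow.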
In practice
- Use LLMs for semantic event merging.
- Implement cascaded routing for incident attribution.
- Apply behavioral filtering for noise reduction.
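As a rough illustration of the routing and filtering steps above, the sketch below tries routers from most precise to most general and then applies one simple behavioral check. The routing tables, owner names, the `triage` fallback, and the distinct-user rule are all assumptions made for this example; the summary does not specify how TingIS implements either step.

```python
from typing import Callable, Optional

Router = Callable[[str], Optional[str]]

# Hypothetical routing tables, invented for illustration only.
EXACT_RULES = {"error 502": "payments-gateway"}
KEYWORDS = {
    "payments": {"payment", "refund", "checkout"},
    "accounts": {"login", "password", "signup"},
}


def rule_router(text: str) -> Optional[str]:
    """Most precise stage: exact phrase rules."""
    lowered = text.lower()
    for phrase, owner in EXACT_RULES.items():
        if phrase in lowered:
            return owner
    return None


def keyword_router(text: str) -> Optional[str]:
    """Broader stage: pick the owner whose vocabulary overlaps the text most."""
    words = set(text.lower().split())
    best_owner, best_hits = None, 0
    for owner, vocab in KEYWORDS.items():
        hits = len(words & vocab)
        if hits > best_hits:
            best_owner, best_hits = owner, hits
    return best_owner


def fallback_router(text: str) -> Optional[str]:
    """Last stage: placeholder for a model-based or manual triage queue."""
    return "triage"


def route(text: str, stages: tuple[Router, ...] = (rule_router, keyword_router, fallback_router)) -> str:
    """Cascade: return the first non-None answer, cheapest stage first."""
    for stage in stages:
        owner = stage(text)
        if owner is not None:
            return owner
    return "unrouted"


def passes_behavior_filter(reports: list[tuple[str, str]], min_users: int = 2) -> bool:
    """Assumed behavioral rule: keep an event only if enough distinct users
    reported it. Each report is a (user_id, text) pair."""
    return len({user for user, _ in reports}) >= min_users
```

Ordering the stages from cheapest to most expensive means most messages never reach the costly fallback, which matters at high message volumes.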
Topics
- Real-time Risk Discovery
- Customer Incident Management
- Large Language Models
- Cloud-Native Services
- Noise Reduction
Best for: AI Architect, Research Scientist, CTO, AI Scientist, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.