[D] Telecom modernization on legacy OSS, what actually worked for ML data extraction

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Advanced, quick

Summary

A telecom modernization project integrated machine learning into a legacy Operational Support System (OSS) stack dating from the early 2000s: a C++ core with Perl glue and no APIs or event hooks. The hard part was not model development but extracting data from a live, mission-critical system. Three approaches failed: application-layer log parsing (broken by format drift), direct instrumentation of the legacy C++ binaries, and ETL polling of the database, which caused performance issues. Three worked: Change Data Capture (CDC) via Debezium on the MySQL binlog, eBPF uprobes on C++ function calls, and DBI hooks on the Perl side. Data normalization also took significant effort because of fifteen years of format drift, repurposed columns, and timezone inconsistencies.

Key takeaway

If you are an AI architect or ML engineer integrating machine learning into a deeply entrenched legacy system, prioritize non-invasive data extraction. Success hinges on capture methods such as CDC, eBPF uprobes, or DBI hooks. Budget substantial effort for data normalization as well, because format drift and undocumented changes will be major hurdles.
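To make the normalization effort concrete, here is a minimal sketch of handling two of the problems named in the summary: timezone inconsistencies and a repurposed column. Everything specific in it is an assumption for illustration and not taken from the original post: the cutover dates, the fixed UTC-6 offset, and the column names `written_at`, `aux_code`, `fault_code`, and `region_id` are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# All constants below are hypothetical, chosen only to illustrate the idea.
LEGACY_OFFSET = timezone(timedelta(hours=-6))  # assumed fixed local offset of old rows
TZ_CUTOVER = datetime(2012, 1, 1)              # assumed date storage switched to UTC
COLUMN_REPURPOSED = datetime(2015, 6, 1)       # assumed date `aux_code` changed meaning


def to_utc(naive_ts):
    """Attach the right zone to a naive legacy timestamp and convert to UTC."""
    if naive_ts.tzinfo is not None:
        raise ValueError("expected a naive timestamp")
    if naive_ts < TZ_CUTOVER:
        # Old rows: stored as local wall-clock time.
        return naive_ts.replace(tzinfo=LEGACY_OFFSET).astimezone(timezone.utc)
    # Newer rows: already UTC, just missing the zone marker.
    return naive_ts.replace(tzinfo=timezone.utc)


def normalize_row(row):
    """Split a repurposed column into two explicit fields based on write date."""
    written = row["written_at"]
    out = {"written_at_utc": to_utc(written), "fault_code": None, "region_id": None}
    if written < COLUMN_REPURPOSED:
        out["fault_code"] = row["aux_code"]
    else:
        out["region_id"] = row["aux_code"]
    return out
```

A fixed offset keeps the sketch short. A source zone that observes daylight saving time would need a real zone definition (for example `zoneinfo.ZoneInfo`) instead. The point of the design is that every era-dependent rule lives in one place and is keyed on the write date, so undocumented changes can be added as they are discovered.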

Key insights

Extracting ML data from a legacy system that has no APIs or event hooks calls for non-invasive capture methods that work around those architectural limits without modifying the live application.

Principles

Method

Use CDC (Debezium on the MySQL binlog) for database changes, eBPF uprobes for C++ function calls that do not touch the database, and DBI hooks for the Perl side, so that data is extracted without directly modifying the legacy application.
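As an illustration of the CDC path, the sketch below flattens a Debezium change event into a simple record that a feature pipeline could consume. It relies on the standard Debezium envelope fields (`before`, `after`, `source`, `op`, `ts_ms`). The function name `flatten_change_event`, the output keys, and the choice to keep the `before` image for deletes are assumptions made for this example, not details from the original post.

```python
import json

# Debezium operation codes mapped to readable names.
OPS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}


def flatten_change_event(raw):
    """Turn one Debezium message value (JSON text) into a flat dict.

    Returns None for tombstone messages, whose value is empty.
    """
    if raw is None:
        return None
    msg = json.loads(raw)
    if msg is None:
        return None
    # With schemas enabled the envelope sits under "payload";
    # with schemas disabled the envelope is the message itself.
    env = msg.get("payload", msg)
    op = env["op"]
    # A delete carries the last known row state in "before".
    row = env["before"] if op == "d" else env["after"]
    source = env.get("source") or {}
    return {
        "op": OPS.get(op, op),
        "table": source.get("table"),
        "event_ts_ms": env.get("ts_ms"),
        "row": row,
    }
```

Because the events come from the binlog and not from queries against live tables, a consumer like this adds no read load to the production database, which is the problem the polling ETL approach ran into.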

Best for: AI Architect, AI Engineer, Machine Learning Engineer, MLOps Engineer, Data Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.