Schema-Agnostic Process Trace Construction: From Raw Tables to Execution Behavior

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering) · Depth: Expert, extended

Summary

A schema-agnostic pipeline automatically reconstructs process execution traces directly from raw relational data, addressing modern OLTP environments in which schemas drift, keys are sparse, and the records of a single execution are dispersed across tables. The pipeline has four stages. First, it identifies key and timestamp columns using statistical profiling. Second, it discovers inter-table connections via type-aware similarity measures such as Jaccard and Kolmogorov-Smirnov. Third, it assembles and orders events for each case, accommodating multiple timestamp fields. Finally, a Temporal Convolutional Network (TCN) learns likely ordering and flow relations across systems. In evaluations on the TPC-H and TPC-E benchmarks, synthetic corpora, and an industry dataset, the pipeline achieves 85% accuracy in predicting the next event and recovers approximately 82% of ground-truth precedence relations. It maintains over 80% of its performance under significant data drift, offering a scalable solution for dynamic information systems.
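The link-discovery stage can be illustrated with a minimal sketch of the two measures named above. The summary does not say how the pipeline assigns measures to column types, so the rule here (Kolmogorov-Smirnov for numeric columns, Jaccard for everything else) and the function names `jaccard`, `ks_statistic`, and `column_similarity` are assumptions made for illustration only.

```python
from bisect import bisect_right


def jaccard(a, b):
    """Jaccard similarity of the distinct values in two columns."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(a), sorted(b)
    if not xs or not ys:
        raise ValueError("both columns must be non-empty")
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for v in set(xs) | set(ys):
        cdf_x = bisect_right(xs, v) / len(xs)
        cdf_y = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


def column_similarity(a, b):
    """Hypothetical type-aware score in [0, 1] (assumed rule, not the paper's)."""
    values = list(a) + list(b)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric and a and b:
        # Identical distributions give a KS statistic of 0, hence a score of 1.
        return 1.0 - ks_statistic(a, b)
    return jaccard(a, b)
```

A candidate link between two tables would then be any column pair whose score exceeds a chosen threshold; the threshold itself is not given in the summary and would need tuning.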

Key takeaway

Data scientists and ML engineers building process trace reconstruction pipelines in evolving OLTP environments should consider schema-agnostic approaches. The pipeline shows that high-fidelity event logs can be generated automatically from raw, key-sparse relational data using statistical signals and TCNs, without predefined schemas or manual configuration. Similar data-driven methods can help keep analytical foundations robust as system designs change.

Key insights

Automated process trace reconstruction from raw, schema-poor relational data is achievable via statistical signals and Temporal Convolutional Networks.
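For readers unfamiliar with Temporal Convolutional Networks, the core operation is a causal, dilated 1-D convolution: each output depends only on the current and earlier positions in the sequence. The sketch below shows that operation in plain Python, with illustrative weights rather than learned ones. It is not the paper's architecture, whose layer count, kernel size, and dilations are not given in this summary. The `receptive_field` helper assumes one convolution per layer.

```python
def causal_dilated_conv(sequence, weights, dilation=1):
    """1-D causal convolution: out[t] uses sequence[t], sequence[t - d], sequence[t - 2d], ..."""
    out = []
    for t in range(len(sequence)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # positions before the start act as zero padding
                total += w * sequence[idx]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Steps visible to the last layer (current step included), one convolution per layer."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

Stacking layers with dilations 1, 2, 4 and kernel size 2 gives `receptive_field(2, [1, 2, 4]) == 8`, which is how a TCN can condition a next-event prediction on a long event history with few layers.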

Principles

Method

The pipeline profiles identifier/timestamp columns, discovers inter-table links via statistical similarity, assembles timestamped rows into event sequences, and learns cross-table precedence using a Temporal Convolutional Network.
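The profiling and assembly steps can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes tables arrive as lists of dict rows, that timestamps are ISO 8601 strings, and that the case key (`case_key`) is already known, whereas the pipeline discovers keys statistically. The names `is_timestamp`, `profile_columns`, and `assemble_traces`, the `min_ratio` threshold, and the sample `orders` and `shipments` tables are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def is_timestamp(value):
    """True if value is a string that parses as an ISO 8601 date or datetime."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def profile_columns(rows, min_ratio=0.9):
    """Columns where at least min_ratio of the non-null values parse as timestamps."""
    columns = {col for row in rows for col in row}
    found = []
    for col in sorted(columns):
        values = [row[col] for row in rows if row.get(col) is not None]
        if values and sum(is_timestamp(v) for v in values) / len(values) >= min_ratio:
            found.append(col)
    return found


def assemble_traces(tables, case_key):
    """Emit one event per (row, timestamp column), then order events within each case."""
    traces = defaultdict(list)
    for table_name, rows in tables.items():
        for ts_col in profile_columns(rows):
            for row in rows:
                if row.get(case_key) is None or row.get(ts_col) is None:
                    continue
                if not is_timestamp(row[ts_col]):
                    continue  # tolerate the few values that failed profiling
                when = datetime.fromisoformat(row[ts_col])
                traces[row[case_key]].append((when, f"{table_name}.{ts_col}"))
    return {case: [label for _, label in sorted(events)] for case, events in traces.items()}


# Hypothetical sample data: one table has two timestamp fields.
tables = {
    "orders": [
        {"order_id": 1, "created_at": "2024-01-01T09:00:00", "paid_at": "2024-01-01T09:05:00"},
        {"order_id": 2, "created_at": "2024-01-02T10:00:00", "paid_at": None},
    ],
    "shipments": [{"order_id": 1, "shipped_at": "2024-01-03T08:00:00"}],
}
traces = assemble_traces(tables, "order_id")
```

Treating each timestamp column as its own event type is one simple way to accommodate rows with several timestamp fields; the resulting per-case sequences are the kind of input a sequence model could then learn precedence from.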

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.