Schema-Agnostic Process Trace Construction: From Raw Tables to Execution Behavior
Summary
A novel schema-agnostic pipeline automatically reconstructs process execution traces directly from raw relational data, addressing challenges in modern OLTP environments characterized by drifting schemas, sparse keys, and dispersed execution traces. This four-stage pipeline first identifies key and timestamp columns using statistical profiling, then discovers inter-table connections via type-aware similarity measures like Jaccard and Kolmogorov-Smirnov. It subsequently assembles and orders events for each case, accommodating multiple timestamp fields. Finally, a Temporal Convolutional Network (TCN) learns likely ordering and flow relations across systems. Evaluations on TPC-H/E benchmarks, synthetic corpora, and an industry dataset demonstrate its effectiveness, achieving 85% accuracy in predicting the next event and recovering approximately 82% of ground-truth precedence relations. The pipeline maintains over 80% of its performance under significant data drift, offering a scalable solution for dynamic information systems.
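The two similarity measures named in the summary can be illustrated with a minimal, standard-library sketch. This is not the paper's implementation: the function names, the `kind` argument, and the choice to score numeric columns as one minus the Kolmogorov-Smirnov statistic are assumptions made for illustration only.

```python
from bisect import bisect_right


def jaccard(a, b):
    """Jaccard similarity of the distinct values in two columns."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for v in set(xs) | set(ys):
        cdf_x = bisect_right(xs, v) / len(xs)
        cdf_y = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


def column_similarity(a, b, kind):
    """Hypothetical type-aware score in [0, 1]; higher means a more plausible link."""
    if kind == "categorical":
        return jaccard(a, b)
    if kind == "numeric":
        return 1.0 - ks_statistic(a, b)
    raise ValueError(f"unknown column kind: {kind!r}")
```

Under these assumptions, a pair of identifier-like columns with heavy value overlap scores close to 1 through the Jaccard branch, while two numeric columns drawn from similar distributions score close to 1 through the KS branch. How the actual pipeline thresholds or combines such scores is not stated in this summary.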
Key takeaway
Data scientists and ML engineers building process trace reconstruction pipelines in evolving OLTP environments should consider schema-agnostic approaches. This pipeline shows that high-fidelity event logs can be generated automatically from raw, key-sparse relational data using statistical signals and TCNs, without predefined schemas or manual configuration. Integrating similar data-driven methods can help keep analytical foundations robust as system designs change.
Key insights
Automated process trace reconstruction from raw, schema-poor relational data is achievable via statistical signals and Temporal Convolutional Networks.
Principles
- Schema-first log engineering is a bottleneck in evolving information systems.
- Data-driven similarity infers robust inter-table links.
- Learned temporal patterns order events when timestamps are ambiguous.
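To make the last principle more concrete, the sketch below shows the causal dilated convolution that Temporal Convolutional Networks are generally built from: each output position sees only the current and earlier inputs, and stacking layers with growing dilation widens the visible history. It is a plain-Python illustration with hypothetical function names, not the architecture or hyperparameters used in the paper.

```python
def causal_dilated_conv(seq, weights, dilation=1):
    """1-D causal convolution over a list of numbers.

    weights[0] multiplies the current step and weights[k] the step
    k * dilation positions in the past. Positions before the start of
    the sequence contribute zero (implicit left padding).
    """
    out = []
    for t in range(len(seq)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                total += w * seq[idx]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Steps of history (including the current one) seen after one conv per dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

Because no output ever reads a future position, a model built from such layers can be trained to predict the next event from the prefix observed so far. With kernel size 2 and dilations 1, 2, 4, three stacked layers already cover 8 steps of history.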
Method
The pipeline profiles identifier/timestamp columns, discovers inter-table links via statistical similarity, assembles timestamped rows into event sequences, and learns cross-table precedence using a Temporal Convolutional Network.
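The profiling and assembly stages can be sketched roughly as follows. Everything here is an assumption for illustration: the uniqueness-ratio heuristic, the fixed timestamp format, the row-as-dict representation, and the use of the timestamp column name as the activity label are not taken from the paper.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format; real data would need detection


def uniqueness_ratio(values):
    """Share of distinct values among non-null entries; near 1.0 suggests a key-like column."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present) if present else 0.0


def looks_like_timestamp(values, fmt=TS_FORMAT):
    """True if there is at least one non-null value and all of them parse with fmt."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    for v in present:
        try:
            datetime.strptime(str(v), fmt)
        except ValueError:
            return False
    return True


def assemble_traces(rows, case_key, timestamp_cols, fmt=TS_FORMAT):
    """Emit one event per non-null timestamp field and order the events of each case."""
    traces = {}
    for row in rows:
        for col in timestamp_cols:
            raw = row.get(col)
            if raw is None:
                continue
            ts = datetime.strptime(raw, fmt)
            # The column name stands in for the activity label (an assumption).
            traces.setdefault(row[case_key], []).append((ts, col))
    # Sort by timestamp; ties fall back to the column name for determinism.
    return {case: [name for _, name in sorted(events)]
            for case, events in traces.items()}
```

For example, a row with `created_at` and `shipped_at` both populated yields a two-event trace for its case, while a row whose `shipped_at` is null yields only the creation event.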
In practice
- Generate event logs for process mining tools.
- Support auditing and compliance monitoring.
- Automate execution behavior reconstruction.
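As an illustration of the first use case, reconstructed events could be flattened into the three-column layout (case identifier, activity, timestamp) that process mining tools commonly import. The column names and the `write_event_log` helper below are hypothetical and chosen only for this sketch; timestamps are assumed to be ISO 8601 strings so that they sort correctly as text.

```python
import csv
import io


def write_event_log(events, out):
    """Write (case_id, activity, timestamp) tuples as CSV, ordered by case and then time."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["case_id", "activity", "timestamp"])
    for case_id, activity, ts in sorted(events, key=lambda e: (e[0], e[2])):
        writer.writerow([case_id, activity, ts])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_event_log(
        [
            (2, "create", "2024-01-02T08:00:00"),
            (1, "ship", "2024-01-03T10:00:00"),
            (1, "create", "2024-01-01T09:00:00"),
        ],
        buffer,
    )
    print(buffer.getvalue())
```

Sorting by case and timestamp before writing keeps each trace contiguous in the file, which makes the output easier to inspect by hand.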
Topics
- Process Mining
- Event Log Generation
- Schema-Agnostic Integration
- Temporal Convolutional Networks
- OLTP Systems
- Data Drift
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.