Cleaning Logs for Downstream Tasks (Registered Report)
Summary
This registered report introduces LogPurifier, a task-agnostic log-cleaning approach that aims to improve downstream analysis tasks by identifying and removing "free-standing" messages from software execution logs. Free-standing messages, such as periodic heartbeats, are irrelevant to functional behavior and have no dependencies on other messages; they often degrade the effectiveness and efficiency of tools for model inference (MI) and anomaly detection (AD). LogPurifier works in two steps: it first calculates dependency scores between log message templates based on their co-occurrences, then applies Mean-Shift clustering to separate free-standing templates from the rest so they can be removed. The paper lays out an empirical evaluation plan covering MI (using MINT with synthetic FSM logs, with noise rates varying from 0.1 to 0.9) and AD (using Invariant Mining and One-Class SVM on real-world datasets such as BGL, Thunderbird, and Spirit, across seven time window sizes). Effectiveness will be measured via precision and recall, and efficiency via execution and training time; LogPurifier will be compared against LogSed and LogBoost using linear mixed-effects models.
Key takeaway
Machine Learning Engineers dealing with noisy log data should consider adding a log-cleaning preprocessing step such as LogPurifier. Removing free-standing, non-functional messages is expected to improve the effectiveness and efficiency of models for inference or anomaly detection, especially in black-box settings. Because this is a registered report, the gains in computational cost and accuracy are hypotheses that the planned evaluation still has to confirm. Measure the impact on your own downstream tasks using metrics such as precision, recall, and execution time.
Key insights
LogPurifier cleans logs by removing non-functional, independent messages using dependency scores and clustering.
Principles
- Log quality issues limit usefulness for downstream analysis tasks.
- Free-standing messages lack dependency on predecessors or successors.
- Co-occurrence dependency scores can distinguish log message types.
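The last principle can be illustrated with a minimal sketch. The summary does not give LogPurifier's actual scoring formula, so the score below is an assumption made for illustration: each template is scored by the highest rate at which it is immediately preceded or followed by one specific other template. The function name `dependency_scores` and the toy sequences are hypothetical.

```python
from collections import Counter

def dependency_scores(sequences):
    """Score each template by its strongest tie to a single neighbour.

    Assumed score (not the paper's formula): the largest fraction of a
    template's occurrences that are immediately preceded or followed by
    the same other template.
    """
    occurrences = Counter()
    follows = Counter()  # (a, b) -> number of times b immediately follows a
    for seq in sequences:
        occurrences.update(seq)
        for a, b in zip(seq, seq[1:]):
            if a != b:
                follows[(a, b)] += 1

    scores = {template: 0.0 for template in occurrences}
    for (a, b), n in follows.items():
        scores[a] = max(scores[a], n / occurrences[a])  # a tied to successor b
        scores[b] = max(scores[b], n / occurrences[b])  # b tied to predecessor a
    return scores

# Toy sequences of template IDs; "heartbeat" lands at arbitrary positions.
sequences = [
    ["open", "read", "heartbeat", "close"],
    ["open", "heartbeat", "read", "close"],
    ["heartbeat", "open", "read", "close"],
    ["open", "read", "close", "heartbeat"],
    ["open", "read", "close"],
]
scores = dependency_scores(sequences)
for template, score in sorted(scores.items()):
    print(f"{template}: {score:.2f}")
```

In this toy input the functional templates keep a stable neighbour and score high, while the heartbeat has no consistent predecessor or successor and scores low.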
Method
LogPurifier calculates dependency scores between log message templates based on co-occurrences, then applies Mean-Shift clustering to segment and remove free-standing messages.
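The clustering step can be sketched as follows, under stated assumptions. This is a plain flat-kernel Mean-Shift over one score per template, written with the standard library only; it is not the authors' implementation. The score values, the bandwidth, and the rule that the lowest-scoring cluster holds the free-standing templates are all assumptions made for illustration.

```python
def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-9):
    """Flat-kernel Mean-Shift on scalars: move each point to the mean of
    its neighbours within `bandwidth` until it stops moving."""
    modes = []
    for v in values:
        x = v
        for _ in range(max_iter):
            window = [u for u in values if abs(u - x) <= bandwidth]
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)
    return modes

def cluster_labels(modes, bandwidth):
    """Give points whose modes lie within bandwidth / 2 the same label."""
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if abs(m - c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers

# Hypothetical per-template dependency scores (illustrative values only).
scores = {"open": 0.80, "read": 0.80, "close": 0.75,
          "heartbeat": 0.25, "gc_tick": 0.20}
templates = list(scores)
modes = mean_shift_1d([scores[t] for t in templates], bandwidth=0.2)
labels, centers = cluster_labels(modes, bandwidth=0.2)

# Assumption: the cluster with the lowest centre holds free-standing templates.
lowest = min(range(len(centers)), key=centers.__getitem__)
free_standing = {t for t, lab in zip(templates, labels) if lab == lowest}

log = ["open", "heartbeat", "read", "gc_tick", "close"]
cleaned = [t for t in log if t not in free_standing]
print(sorted(free_standing), cleaned)
```

Mean-Shift does not need the number of clusters in advance, which suits a setting where the share of noisy templates is unknown. A real pipeline would need a guard for the case where every score falls into a single cluster, since this sketch would then remove every template.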
In practice
- Evaluate log cleaning impact on model inference accuracy.
- Assess log cleaning benefits for anomaly detection efficiency.
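Both checks reduce to computing precision and recall on a downstream task and timing it. The sketch below shows one way to do that; the helper names `precision_recall` and `timed`, the toy detector, and the window IDs are hypothetical and carry no results from the paper.

```python
import time

def precision_recall(predicted, actual):
    """Precision and recall for sets of flagged vs. truly anomalous items."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def toy_detector(windows):
    """Hypothetical stand-in for an anomaly detector: flags any window
    that contains an 'error' template."""
    return {window_id for window_id, events in windows.items() if "error" in events}

# Made-up windows and labels, purely to exercise the helpers.
windows = {
    1: ["open", "error", "close"],
    2: ["open", "error"],
    5: ["read", "error"],
    7: ["open", "timeout"],
    9: ["error"],
}
actual_anomalies = {2, 5, 7}

flagged, elapsed = timed(toy_detector, windows)
precision, recall = precision_recall(flagged, actual_anomalies)
print(f"precision={precision:.2f} recall={recall:.2f} time={elapsed:.6f}s")
```

Running the same measurement once on the raw log and once on the cleaned log gives a like-for-like comparison on your own data.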
Topics
- Log Cleaning
- Free-standing Messages
- Model Inference
- Anomaly Detection
- Dependency Analysis
- Mean-Shift Clustering
Best for: AI Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.