Cleaning Logs for Downstream Tasks (Registered Report)

2026-06-26 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

This registered report introduces LogPurifier, a task-agnostic log-cleaning approach that aims to improve downstream analysis tasks by identifying and removing "free-standing" messages from software execution logs. Free-standing messages, such as periodic heartbeats, are irrelevant to functional behavior and have no dependencies on other messages; they often degrade the effectiveness and efficiency of tools for model inference (MI) and anomaly detection (AD). LogPurifier works in two steps: it computes dependency scores between log message templates based on their co-occurrences, then applies Mean-Shift clustering for segmentation, distinguishing the noisy templates so they can be removed. The paper lays out an empirical evaluation plan to assess LogPurifier's impact on MI (using MINT on synthetic FSM logs, with noise rates varying from 0.1 to 0.9) and on AD (using Invariant Mining and One-Class SVM on real-world datasets such as BGL, Thunderbird, and Spirit, across seven time window sizes). Effectiveness will be measured via precision and recall, and efficiency via execution and training time; LogPurifier will be compared against LogSed and LogBoost using linear mixed-effects models.

Key takeaway

For Machine Learning Engineers dealing with noisy log data: consider adding a log-cleaning preprocessing step such as LogPurifier. Removing free-standing, non-functional messages is expected to improve the effectiveness and efficiency of models for inference or anomaly detection, although as a registered report the paper presents an evaluation plan rather than measured results. The approach aims to reduce computational cost and improve model accuracy, especially in black-box settings. Plan to evaluate its impact on your own downstream tasks using metrics such as precision, recall, and execution time.

Key insights

LogPurifier cleans logs by removing non-functional, independent messages using dependency scores and clustering.

Principles

Log quality issues limit the usefulness of logs for downstream analysis tasks.
Free-standing messages have no dependency on the messages that precede or follow them.
Dependency scores computed from template co-occurrences can distinguish free-standing messages from dependent ones.
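
To make the co-occurrence principle concrete, here is a minimal sketch of one way a dependency score could be computed. The scoring rule is an assumption made for illustration, not the formula from the paper: a template's score is its strongest mutual co-occurrence rate with any other template inside a small neighbourhood. The function name `dependency_scores`, the `window` parameter, and the toy log are all hypothetical.

```python
from collections import Counter, defaultdict


def dependency_scores(log, window=1):
    """Return {template: score in [0, 1]} for a list of template names.

    Assumed scoring rule (illustrative only): for templates t and u,
    take the fraction of t's occurrences that have u within `window`
    positions, and the same fraction from u's side; the pair strength
    is the smaller of the two. A template's score is its best pair.
    """
    counts = Counter(log)
    co = defaultdict(Counter)  # co[t][u]: occurrences of t with u nearby
    for i, t in enumerate(log):
        lo, hi = max(0, i - window), min(len(log), i + window + 1)
        neighbours = set(log[lo:i]) | set(log[i + 1:hi])
        neighbours.discard(t)
        for u in neighbours:
            co[t][u] += 1

    scores = {}
    for t in counts:
        best = 0.0
        for u, n in co[t].items():
            mutual = min(n / counts[t], co[u][t] / counts[u])
            best = max(best, mutual)
        scores[t] = best
    return scores


# Toy log: "heartbeat" is interleaved at varying positions in an
# open/read/close pattern, so it has no stable partner template.
log = [
    "open", "read", "close", "heartbeat",
    "open", "read", "heartbeat", "close",
    "open", "heartbeat", "read", "close",
    "open", "read", "close",
]
scores = dependency_scores(log)
# heartbeat scores 0.5; open, read, and close each score 0.75
```

Even on this small log the interleaved template ends up with the lowest score, which is the separation a later clustering step would rely on. Real logs would need template extraction first and a window size chosen for the system at hand.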

Method

LogPurifier calculates dependency scores between log message templates based on co-occurrences, then applies Mean-Shift clustering to segment and remove free-standing messages.
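
The clustering step can be sketched as follows. This is a small one-dimensional, flat-kernel mean shift written from scratch for illustration; it is not the paper's implementation. The input scores, the `bandwidth` value, and the rule "the cluster with the lowest mode is free-standing" are all assumptions, as are the helper names `mean_shift_1d` and `free_standing_templates`.

```python
def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift on scalars; returns (labels, centers)."""
    modes = []
    for v in values:
        x = v
        for _ in range(max_iter):
            # Shift x to the mean of all points within the bandwidth.
            near = [p for p in values if abs(p - x) <= bandwidth]
            new_x = sum(near) / len(near)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)

    # Merge converged modes that lie within one bandwidth of each other.
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if abs(m - c) <= bandwidth:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers


def free_standing_templates(scores, bandwidth=0.2):
    """Assumed rule: templates in the lowest-mode cluster are noise."""
    names = list(scores)
    labels, centers = mean_shift_1d([scores[n] for n in names], bandwidth)
    if len(centers) < 2:
        return set()  # a single cluster gives no basis for removal
    noisy = min(range(len(centers)), key=centers.__getitem__)
    return {n for n, lab in zip(names, labels) if lab == noisy}


# Hypothetical dependency scores, one per log message template.
example_scores = {
    "open": 0.9, "read": 0.85, "close": 0.8,
    "heartbeat": 0.15, "tick": 0.1,
}
noise = free_standing_templates(example_scores)

raw_log = ["open", "heartbeat", "read", "tick", "close"]
cleaned_log = [m for m in raw_log if m not in noise]
# noise == {"heartbeat", "tick"}; cleaned_log == ["open", "read", "close"]
```

Mean shift suits this step because it does not need the number of clusters in advance; only the bandwidth has to be chosen. For larger inputs, scikit-learn's `sklearn.cluster.MeanShift` provides a maintained implementation.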

In practice

Evaluate the impact of log cleaning on model inference accuracy.
Assess whether log cleaning improves anomaly detection efficiency.
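
A before-and-after comparison along these lines can be sketched with a small harness that reports precision, recall, and wall-clock time for a detector run on raw versus cleaned logs. Everything here is hypothetical: `toy_detector` is a stand-in that flags unusually long windows, not Invariant Mining or One-Class SVM, and the windows and ground truth are made up for illustration.

```python
import time


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against a truth set."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def evaluate(detector, windows, truth):
    flagged, seconds = timed(detector, windows)
    precision, recall = precision_recall(flagged, truth)
    return {"precision": precision, "recall": recall, "seconds": seconds}


def toy_detector(windows, max_len=3):
    """Stand-in detector: flag windows with more than max_len messages."""
    return {i for i, w in enumerate(windows) if len(w) > max_len}


# Hypothetical log windows; only window 2 is truly anomalous.
raw = [
    ["open", "read", "close"],
    ["open", "heartbeat", "read", "close"],
    ["open", "read", "read", "error", "close"],
]
cleaned = [[m for m in w if m != "heartbeat"] for w in raw]
truth = {2}

before = evaluate(toy_detector, raw, truth)
after = evaluate(toy_detector, cleaned, truth)
# before: precision 0.5, recall 1.0; after: precision 1.0, recall 1.0
```

In this toy case the heartbeat pads a normal window enough to trigger a false alarm, and cleaning removes it. Timings from a single run are noisy, so a real study would repeat each configuration several times before comparing.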

Topics

Log Cleaning
Free-standing Messages
Model Inference
Anomaly Detection
Dependency Analysis
Mean-Shift Clustering

Code references

neilwalkinshaw/mintframework

Best for: AI Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.