Comparison of Outlier Detection Algorithms on String Data

Source: cs.LG updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Advanced, short

Summary

A new study, submitted on January 13, 2026, compares two outlier detection algorithms for string data, addressing a gap in machine learning research, which has focused predominantly on numerical data. The first algorithm is a variant of the local outlier factor (LOF) algorithm, adapted for string data by using a Levenshtein measure to calculate the density of the dataset. The variant also introduces a Levenshtein measure with different weights that take hierarchical character classes into account, so the measure can be tuned to a specific string dataset. The second algorithm is novel: it builds on the hierarchical left regular expression learner, which infers a regular expression for the expected data. Experiments across various datasets and parameters show that both algorithms can, in principle, identify outliers in string data. The regular expression-based algorithm performs best when the expected values have a distinct structure that differs from that of the outliers, while the LOF variants perform best when the edit distances between outliers and expected data are clearly larger than the edit distances among the expected data.
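
To make the idea of a class-aware edit distance concrete, the following is a minimal Python sketch, not the authors' implementation. The two-level hierarchy (letter / digit / other, refined into lower / upper / digit / space / punct) and the costs 0.25, 0.5 and 1.0 are assumptions chosen for illustration; the summary does not state the classes or weights used in the study.

```python
def char_classes(ch: str) -> tuple[str, str]:
    """Return (coarse, fine) class labels for one character.

    The hierarchy is a hypothetical example, not taken from the study.
    """
    if "0" <= ch <= "9":
        return ("digit", "digit")
    if ch.isalpha():
        return ("letter", "lower" if ch.islower() else "upper")
    if ch.isspace():
        return ("other", "space")
    return ("other", "punct")


def substitution_cost(a: str, b: str) -> float:
    """Cheaper substitutions for characters that share a class (assumed weights)."""
    if a == b:
        return 0.0
    coarse_a, fine_a = char_classes(a)
    coarse_b, fine_b = char_classes(b)
    if fine_a == fine_b:
        return 0.25  # e.g. one digit replaced by another digit
    if coarse_a == coarse_b:
        return 0.5   # e.g. lowercase letter replaced by uppercase letter
    return 1.0       # e.g. digit replaced by punctuation


def weighted_levenshtein(s: str, t: str, indel: float = 1.0) -> float:
    """Standard dynamic-programming edit distance with class-aware substitution costs."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel,                         # delete a from s
                curr[j - 1] + indel,                     # insert b into s
                prev[j - 1] + substitution_cost(a, b),   # substitute a with b
            ))
        prev = curr
    return prev[-1]


print(weighted_levenshtein("2024", "2025"))  # 0.25: a digit replaced by a digit
print(weighted_levenshtein("a1", "a-"))      # 1.0: a digit replaced by punctuation
```

With weights like these, two dates that differ in one digit end up much closer to each other than a date and a string in which a digit was replaced by punctuation, which is the kind of tuning the weighted measure is meant to allow.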

Key takeaway

AI scientists and research scientists who build data cleaning or anomaly detection systems for string data should consider these specialized algorithms. The regular expression-based approach suits data with distinct structural patterns, such as system log files, while the Levenshtein-based LOF variant suits data in which outliers lie at a clearly larger edit distance from the expected values than the expected values lie from each other. Choosing the method that matches the data may improve outlier identification in non-numerical datasets.
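
As a rough illustration of the edit-distance route, the sketch below computes textbook LOF scores over pairwise string distances in plain Python. It uses the unweighted Levenshtein distance and a hypothetical helper name, `lof_scores`; it is an assumption-laden sketch of the general technique, not the variant evaluated in the study.

```python
def levenshtein(s: str, t: str) -> int:
    """Plain (unweighted) Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def lof_scores(items: list[str], k: int, dist=levenshtein) -> list[float]:
    """Local outlier factor for each string; scores well above 1 suggest outliers."""
    n = len(items)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(items)")
    d = [[dist(a, b) for b in items] for a in items]

    # k nearest neighbours of each item, excluding itself (ties broken by index).
    neighbours = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: d[i][j])
        neighbours.append(others[:k])

    # k-distance: distance to the k-th nearest neighbour.
    k_dist = [d[i][neighbours[i][-1]] for i in range(n)]

    # Local reachability density; the small constant guards against
    # division by zero when the data contains duplicates.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[o], d[i][o]) for o in neighbours[i]]
        lrd.append(1.0 / (sum(reach) / k + 1e-10))

    return [sum(lrd[o] for o in neighbours[i]) / (k * lrd[i]) for i in range(n)]


values = ["2024-01-03", "2024-01-04", "2024-02-11",
          "2023-12-30", "2024-03-07", "not a date"]
scores = lof_scores(values, k=2)
print(max(zip(scores, values)))  # the highest-scoring string is "not a date"
```

The `dist` parameter is there so that a weighted measure can be swapped in; building the full distance matrix is quadratic in the number of strings, so this form is only practical for small samples.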

Key insights

Two algorithms, a LOF variant adapted to strings and a novel regular expression-based method, can detect outliers in string data, an area less explored in machine learning than numerical data.

Principles

Method

The study adapts the local outlier factor (LOF) algorithm to string data with a weighted Levenshtein measure and introduces a new algorithm based on the hierarchical left regular expression learner, which infers a regular expression describing the expected data.
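
The regular expression route can be illustrated with a deliberately simplified stand-in. The sketch below is not the hierarchical left regular expression learner: it only generalises each string into runs of character classes, keeps the signatures that are frequent enough, and flags strings that match none of them. The function names and the `min_support` parameter are hypothetical.

```python
import re
from collections import Counter


def char_token(ch: str) -> str:
    """Map a character to a class token, or to itself (escaped) if it has no class."""
    if "0" <= ch <= "9":
        return "[0-9]"
    if "a" <= ch <= "z":
        return "[a-z]"
    if "A" <= ch <= "Z":
        return "[A-Z]"
    return re.escape(ch)


def generalise(s: str) -> str:
    """Collapse a string into runs of tokens, e.g. 'ab12' becomes '[a-z]+[0-9]+'."""
    tokens = []
    for ch in s:
        tok = char_token(ch)
        if not tokens or tokens[-1] != tok:
            tokens.append(tok)
    return "".join(tok + "+" for tok in tokens)


def learn_pattern(samples: list[str], min_support: float = 0.5) -> re.Pattern:
    """Build an alternation of the signatures seen in at least min_support of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(generalise(s) for s in samples)
    kept = [sig for sig, c in counts.items() if c / len(samples) >= min_support]
    return re.compile("|".join(f"(?:{sig})" for sig in kept))


def find_outliers(pattern: re.Pattern, items: list[str]) -> list[str]:
    """Strings that the learned pattern does not fully match."""
    return [s for s in items if pattern.fullmatch(s) is None]


values = ["2024-01-03", "2024-01-04", "2023-12-30",
          "2024-03-07", "not a date", "2024/02/11"]
pattern = learn_pattern(values, min_support=0.5)
print(pattern.pattern)
print(find_outliers(pattern, values))  # ['not a date', '2024/02/11']
```

Note that "2024/02/11" is flagged here because its structure differs, even though it is only two substitutions away from a valid date; this is the situation in which a structural method has an advantage over a purely distance-based one.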

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.