Comparison of Outlier Detection Algorithms on String Data
Summary
A new study, submitted on January 13, 2026, compares two outlier detection algorithms for string data, addressing a gap in machine learning research, which focuses predominantly on numerical data. The first algorithm is a variant of the local outlier factor (LOF) algorithm, adapted to string data by using the Levenshtein measure to calculate the density of the dataset. This variant introduces a differently weighted Levenshtein measure that considers hierarchical character classes, so it can be tuned to a specific string dataset. The second algorithm is novel and is based on a hierarchical left regular expression learner that infers a regular expression for the expected data. Experimental results across various datasets and parameters show that both algorithms can, in principle, identify outliers in string data. The regular expression-based algorithm excels when the expected values have a distinct structure that differs sufficiently from that of the outliers, while the LOF variants perform best when the edit distance from the outliers to the expected data is sufficiently distinct from the edit distances among the expected values themselves.
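The idea of a Levenshtein measure weighted by hierarchical character classes can be illustrated with a minimal sketch. It is not the study's implementation: the class hierarchy (digit, lower, upper, other), the costs 0.25, 0.5 and 1.0, and all function names are assumptions chosen for illustration.

```python
def char_class(c: str) -> str:
    """Leaf class of a character in a small, hypothetical hierarchy."""
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "upper" if c.isupper() else "lower"
    return "other"


def substitution_cost(a: str, b: str) -> float:
    """Substitutions get cheaper the closer two characters sit in the hierarchy.

    The cost values are illustrative assumptions, not values from the study.
    """
    if a == b:
        return 0.0
    ca, cb = char_class(a), char_class(b)
    if ca == cb:
        return 0.25  # same leaf class, e.g. '3' -> '7'
    if {ca, cb} == {"upper", "lower"}:
        return 0.5   # same parent class (letters), e.g. 'b' -> 'B'
    return 1.0       # unrelated classes, e.g. 'a' -> '7'


def weighted_levenshtein(s: str, t: str, indel: float = 1.0) -> float:
    """Edit distance with class-aware substitution costs (row-by-row DP)."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel,                        # delete a
                curr[j - 1] + indel,                    # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute a -> b
            ))
        prev = curr
    return prev[-1]
```

With these assumed costs, swapping one digit for another changes a string far less than swapping a digit for a letter, which is the kind of tuning knob the summary describes. Lowering or raising individual costs would adapt the measure to a particular dataset.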
Key takeaway
AI Scientists and Research Scientists building data cleaning or anomaly detection systems for string data should consider these specialized algorithms. The regular expression-based approach is particularly effective for data with distinct structural patterns, such as system log files, while the Levenshtein-based LOF variant is suitable when outliers differ from the expected values by a clearly distinguishable edit distance. Matching the method to the data in this way can improve outlier identification in non-numerical datasets.
Key insights
Two algorithms, a Levenshtein-based variant of LOF and a new regular expression-based method, can detect outliers in string data, a less-explored area in machine learning.
Principles
- LOF can be adapted to string data by using Levenshtein distance as its distance measure.
- Regular expressions can define expected string data structures.
Method
The study tailors the local outlier factor (LOF) using a weighted Levenshtein measure and introduces a new algorithm based on hierarchical left regular expression learning to infer expected data patterns.
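To make the LOF part of the method concrete, here is a minimal sketch of the standard LOF computation driven by plain (unweighted) Levenshtein distance. It is an illustration under assumptions, not the study's code: the function names, the default k=3, the tie-breaking by list order, and the small constant that guards against division by zero for duplicate strings are all choices made here.

```python
def levenshtein(s: str, t: str) -> int:
    """Plain Levenshtein distance (unit cost for insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def lof_scores(strings, k=3, dist=levenshtein):
    """Local outlier factor of each string; scores well above 1 suggest outliers."""
    n = len(strings)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(strings)")
    d = [[dist(a, b) for b in strings] for a in strings]

    # Exactly k nearest neighbours per point; ties are broken by list order,
    # which simplifies the original definition (it keeps all tied points).
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:k]
        for i in range(n)
    ]
    k_dist = [d[i][neighbours[i][-1]] for i in range(n)]

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d[i][j]) for j in neighbours[i]]
        lrd.append(1.0 / (sum(reach) / k + 1e-10))  # 1e-10 guards duplicates

    # LOF: average density of the neighbours relative to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]
```

As a usage example, scoring `["user_001", "user_002", "user_003", "user_004", "user_005", "admin@example.com"]` with `k=3` gives scores of about 1.0 for the five similar identifiers and a much higher score for the last entry. Passing a weighted distance function as `dist` would turn this into a weighted variant.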
In practice
- Apply the Levenshtein-based LOF variant when outliers have clearly distinct edit distances to the expected data.
- Use regex-based detection for structured string data anomalies.
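The regex-based recommendation can be tried out with a much simpler stand-in for the study's learner. The sketch below does not implement the hierarchical left regular expression learner; it swaps in a frequency-based heuristic that turns each string into a coarse character-class pattern and treats frequent patterns as the expected structure. The function names and the `min_support` threshold are hypothetical.

```python
import re
from collections import Counter
from itertools import groupby


def char_token(c: str) -> str:
    """Regex fragment for one character: a class for ASCII digits and letters,
    the escaped literal for everything else."""
    if c.isascii() and c.isdigit():
        return "[0-9]"
    if c.isascii() and c.isalpha():
        return "[A-Za-z]"
    return re.escape(c)


def infer_pattern(s: str) -> str:
    """Collapse runs of the same token into 'token+' to get a structural pattern."""
    return "".join(tok + "+" for tok, _ in groupby(s, key=char_token))


def regex_outliers(strings, min_support=0.2):
    """Return the strings that match none of the frequent structural patterns.

    A pattern counts as expected if at least min_support of the strings
    produce it; min_support is an assumed tuning parameter.
    """
    counts = Counter(infer_pattern(s) for s in strings)
    expected = [p for p, c in counts.items() if c / len(strings) >= min_support]
    if not expected:
        return list(strings)  # no dominant structure: nothing counts as expected
    expected_re = re.compile("|".join(f"(?:{p})" for p in expected))
    return [s for s in strings if not expected_re.fullmatch(s)]
```

For example, given four ISO-style dates plus `"13 Jan 2026"` and `"n/a"`, the sketch returns the last two as outliers because only the date structure is frequent enough to count as expected. The inferred patterns are deliberately coarse, so a value such as `"1-2-3"` would also pass as a date here.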
Topics
- String Outlier Detection
- Local Outlier Factor
- Levenshtein Distance
- Regular Expression Learning
- Anomaly Detection
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.