We Used 5 Outlier Detection Methods on a Real Dataset: They Disagreed on 96% of Flagged Samples
Summary
An experiment tested five common outlier detection methods on a real-world dataset of 6,497 Portuguese wines to assess their consistency and effectiveness. The methods included IQR, Z-Score, Robust Z-Score, Isolation Forest, Local Outlier Factor (LOF), and Elliptic Envelope. A naive rule that flagged a wine whenever any single feature was extreme inflated the results: 23-26% of wines were flagged, a multiple-testing effect. Requiring at least two extreme features per sample corrected this. Agreement among methods was poor, with Jaccard similarity ranging from 0.10 to 0.30, which indicates that the methods identify different types of "unusual" data points. Only 0.5% of samples were flagged by all four primary methods, while 2.2% were flagged by three or more. As a sanity check, the analysis also confirmed that extreme-quality wines were twice as likely to be consensus outliers.
Key takeaway
Data scientists and machine learning engineers working with real-world, skewed datasets should define the specific type of "unusual" data they are looking for before selecting an outlier detection method. Because agreement among techniques is low, run multiple methods and prioritize samples flagged by a consensus of three or more for higher confidence. Always check the data's distribution first, and avoid methods such as Standard Z-Score or Elliptic Envelope when the data is heavily skewed.
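A minimal sketch of the consensus rule, assuming each method's output is already available as a set of flagged sample IDs. The method names, the sample IDs, and the `consensus_flags` helper are hypothetical and are not taken from the original article.

```python
from collections import Counter

def consensus_flags(flags_by_method, min_votes=3):
    """Return the sample IDs flagged by at least `min_votes` methods."""
    votes = Counter()
    for flagged in flags_by_method.values():
        # set() guards against a method listing the same sample twice
        votes.update(set(flagged))
    return {sample for sample, n in votes.items() if n >= min_votes}

# Hypothetical flags from four methods (toy sample IDs, not real results)
flags_by_method = {
    "iqr": {1, 2, 3},
    "isolation_forest": {2, 3, 7},
    "lof": {2, 3, 8},
    "robust_z": {3, 9},
}

three_plus = consensus_flags(flags_by_method, min_votes=3)
unanimous = consensus_flags(flags_by_method, min_votes=4)
```

Raising `min_votes` trades recall for confidence: the unanimous set is always a subset of the three-or-more set.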
Key insights
Different outlier detection methods identify distinct types of "unusual" data points, leading to low agreement.
Principles
- Multiple testing inflates outlier counts.
- Outlier definition varies by method.
- Consensus improves outlier detection confidence.
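The first principle can be illustrated with a short calculation. This sketch assumes features are flagged independently, each at the same rate; real features are usually correlated, so the numbers are only indicative. The feature count `k` and per-feature rate `p` are illustrative assumptions, not values reported in the article.

```python
from math import comb

def prob_at_least(m, k, p):
    """P(at least m of k independent features are flagged), each at rate p."""
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(m, k + 1))

k = 11      # assumed number of features (illustrative)
p = 0.025   # assumed per-feature flag rate (illustrative)

any_one = prob_at_least(1, k, p)    # naive rule: one extreme feature suffices
two_plus = prob_at_least(2, k, p)   # stricter rule: two or more extreme features

print(f"flagged by >=1 feature: {any_one:.3f}")
print(f"flagged by >=2 features: {two_plus:.3f}")
```

Under these assumptions the single-feature rule flags far more samples than the per-feature rate suggests, while the two-feature rule brings the share back down.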
Method
The study corrected for multiple testing by requiring at least two extreme features per sample and used Jaccard similarity to quantify method agreement on a real-world wine dataset.
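The two steps described here can be sketched as follows, assuming per-feature lower and upper bounds have already been computed (for example, IQR fences). The helper names, bounds, and toy rows are hypothetical; the article's actual thresholds are not given in this summary.

```python
def flag_rows(rows, bounds, min_extreme=2):
    """Return indices of rows with at least `min_extreme` out-of-bounds features."""
    flagged = set()
    for i, row in enumerate(rows):
        n_extreme = sum(1 for x, (lo, hi) in zip(row, bounds) if x < lo or x > hi)
        if n_extreme >= min_extreme:
            flagged.add(i)
    return flagged

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of sample IDs."""
    a, b = set(a), set(b)
    union = a | b
    # Convention chosen here: two empty sets count as identical
    return len(a & b) / len(union) if union else 1.0

# Toy data: three features per row, with hypothetical per-feature bounds
rows = [
    [0.1, 5.0, 3.0],
    [9.0, 5.1, 3.1],   # one extreme feature
    [9.5, 0.2, 3.0],   # two extreme features
    [0.2, 5.2, 2.9],
]
bounds = [(0.0, 1.0), (4.0, 6.0), (2.0, 4.0)]

naive = flag_rows(rows, bounds, min_extreme=1)
strict = flag_rows(rows, bounds, min_extreme=2)
agreement = jaccard(naive, strict)
```

The same `jaccard` function can be applied to every pair of methods to build an agreement matrix.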
In practice
- Use Robust Z-Score for skewed data.
- Scale data separately for distinct subpopulations.
- Fit outlier models on training data only.
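A minimal sketch of the first and third practices combined: a robust z-score based on the median and the median absolute deviation (MAD), fitted on training data only and then applied to new samples. The 0.6745 scaling constant and the 3.5 cutoff are a common convention for the modified z-score and are an assumption here, as are the toy values and helper names.

```python
from statistics import median

def fit_robust_z(train):
    """Estimate the median and MAD from training values only."""
    med = median(train)
    mad = median([abs(x - med) for x in train])
    if mad == 0:
        raise ValueError("MAD is zero; robust z-score is undefined")
    return med, mad

def robust_z(x, med, mad):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    return 0.6745 * (x - med) / mad

# Toy, right-skewed training feature containing one extreme value
train = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 15.0]
med, mad = fit_robust_z(train)

# New samples are scored with the training statistics, never refitted
new_samples = [1.1, 9.0]
flags = [abs(robust_z(x, med, mad)) > 3.5 for x in new_samples]
```

Because the median and MAD barely move when a few extreme values are present, the cutoff stays meaningful on skewed data. The same fit could be repeated per subpopulation when the data contains distinct groups.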
Topics
- Outlier Detection Methods
- Robust Statistics
- Isolation Forest
- Local Outlier Factor
- Data Skewness
Best for: Data Scientist, Machine Learning Engineer, AI Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by KDnuggets.