Getting a transit-fraud detector into production by fixing the features, not the model
Summary
A real-time fraud detection system for Rio de Janeiro's public transit payments reached production by focusing on feature engineering rather than model tuning. Earlier attempts to detect a money-laundering scheme, in which fraudsters used cloned cards to recharge new fare cards at adjacent terminals within minutes, failed because recall was too low. The breakthrough came from exploratory data analysis (EDA), which revealed that the signal lay in the spatial-temporal pattern of recharges, not in individual transactions. New features, such as counts of recharges in 5- and 15-minute windows at specific locations and card freshness, enabled an XGBoost model to achieve production-grade recall while maintaining over 95% precision. The system uses temporal validation and sample weighting, and runs inside an MLOps loop built on Apache Airflow and MLflow that covers daily scoring, human labeling, and monitored retraining with manual promotion.
Key takeaway
For MLOps engineers and data scientists struggling with low recall in fraud detection: start with deep exploratory data analysis to uncover hidden spatial-temporal patterns. Model performance depends on making the fraud signal visible through feature engineering, not on hyperparameter tuning alone. Run a continuous MLOps loop with temporal validation and human-in-the-loop labeling so the system can adapt to evolving adversarial tactics and stay effective and trustworthy in production.
Key insights
Effective fraud detection prioritizes feature engineering over model complexity, especially for evolving adversarial patterns.
Principles
- Fraud detection is a representation problem, not just a modeling problem.
- Adversarial patterns necessitate continuous monitoring and retraining.
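- Temporal validation (train on earlier data, validate on later data) is crucial for time-ordered data, because a random split leaks future information into training.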
Method
The process involves extensive EDA to identify spatial-temporal fraud patterns, engineering features to capture these "trails," training an XGBoost model with temporal cross-validation, and deploying it within an MLOps loop for continuous scoring, human labeling, and monitored retraining.
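To make the feature idea concrete, here is a minimal sketch of windowed recharge counts and card freshness. It is not the original system's code: the `Recharge` record, the `build_features` function, the feature names, and the definition of freshness as "minutes since the card was first seen in the data" are all assumptions made for illustration. Each row only uses recharges that happened at or before its own timestamp, so the features do not leak future information.

```python
from bisect import bisect_left
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Recharge:
    """Hypothetical record layout; the real schema is not given in the source."""
    card_id: str
    terminal_id: str
    ts: datetime


def build_features(recharges, windows_min=(5, 15)):
    """Return one feature dict per recharge, in chronological order."""
    ordered = sorted(recharges, key=lambda r: r.ts)
    history_by_terminal = defaultdict(list)  # terminal_id -> earlier timestamps (sorted)
    first_seen = {}                          # card_id -> first observed timestamp
    rows = []
    for r in ordered:
        history = history_by_terminal[r.terminal_id]
        row = {"card_id": r.card_id, "terminal_id": r.terminal_id, "ts": r.ts}
        for w in windows_min:
            cutoff = r.ts - timedelta(minutes=w)
            # Earlier recharges at the same terminal with timestamp >= cutoff.
            row[f"terminal_recharges_{w}m"] = len(history) - bisect_left(history, cutoff)
        # Assumed definition of "card freshness": minutes since first appearance.
        first = first_seen.setdefault(r.card_id, r.ts)
        row["card_age_min"] = (r.ts - first).total_seconds() / 60.0
        history.append(r.ts)  # stays sorted because input is processed in time order
        rows.append(row)
    return rows
```

A burst of brand-new cards (age 0) recharged at one terminal within a few minutes shows up as high window counts combined with low card age, which is the kind of "trail" a single-transaction view cannot express.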
In practice
- Investigate spatial-temporal patterns when recall is low.
- Engineer features for "card freshness" and recharge clusters.
- Implement temporal validation for time-series models.
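The temporal-validation step can be sketched as an expanding-window split. The source says temporal cross-validation was used but does not describe the exact scheme, so the `temporal_folds` helper below and its fold layout are assumptions: rows are ordered by time, and each fold trains on everything before a cut point and validates on the block that follows it.

```python
from datetime import datetime, timedelta


def temporal_folds(timestamps, n_folds=3):
    """Yield (train_idx, valid_idx) pairs of row indices.

    Within each pair, training rows come no later in time than
    validation rows, and the training window grows with each fold.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    block = len(order) // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough rows for the requested number of folds")
    for k in range(1, n_folds + 1):
        train = order[: k * block]
        # The last fold absorbs any leftover rows.
        end = (k + 1) * block if k < n_folds else len(order)
        valid = order[k * block : end]
        yield train, valid


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    ts = [start + timedelta(days=d) for d in (5, 0, 3, 1, 7, 2, 6, 4)]
    for train_idx, valid_idx in temporal_folds(ts, n_folds=3):
        print(len(train_idx), "train rows,", len(valid_idx), "validation rows")
```

The index lists can be used to slice a feature matrix and label vector before fitting any model. Compared with a shuffled split, this gives a more honest recall estimate when fraud tactics drift over time.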
Topics
- Fraud Detection
- Feature Engineering
- MLOps
- XGBoost
- Temporal Validation
- Apache Airflow
- MLflow
Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.