Getting a transit-fraud detector into production by fixing the features, not the model

Source: Machine Learning on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, medium

Summary

A real-time fraud detection system for Rio de Janeiro's public transit payments reached production by changing the features, not by tuning the model. The target was a money-laundering pattern in which fraudsters used cloned cards to recharge new fare cards at adjacent terminals within minutes. Earlier attempts to detect it failed because recall was too low. Exploratory data analysis (EDA) showed why: the signal lay in the spatial-temporal pattern of recharges, not in any individual transaction. New features, such as counts of recharges in 5- and 15-minute windows at specific locations and card freshness, let an XGBoost model reach production-grade recall while keeping precision above 95%. The system uses temporal validation and sample weighting, and runs an MLOps loop on Apache Airflow and MLflow that covers daily scoring, human labeling, and monitored retraining with manual promotion.
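The windowed features described above can be sketched in plain Python. This is a minimal illustration, not the article's implementation: the `Recharge` record, its field names (`terminal_id`, `card_first_seen`), and the choice to measure card freshness as minutes since the card was first seen are all assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Recharge:
    # Hypothetical record layout; the source does not describe its schema.
    card_id: str
    terminal_id: str           # assumed location key
    ts: datetime               # time of the recharge
    card_first_seen: datetime  # assumed: first time this card appeared


def build_features(recharges, windows_min=(5, 15)):
    """Return one feature dict per recharge, in chronological order."""
    ordered = sorted(recharges, key=lambda r: r.ts)
    times_by_terminal = defaultdict(list)  # terminal -> sorted past timestamps
    rows = []
    for r in ordered:
        past = times_by_terminal[r.terminal_id]
        row = {"card_id": r.card_id, "terminal_id": r.terminal_id, "ts": r.ts}
        for w in windows_min:
            start = r.ts - timedelta(minutes=w)
            # Earlier recharges at the same terminal with timestamp >= start.
            row[f"recharges_{w}m"] = len(past) - bisect_left(past, start)
        # Card freshness, assumed here to be minutes since first sighting.
        row["card_age_min"] = (r.ts - r.card_first_seen).total_seconds() / 60
        rows.append(row)
        # Append after counting so the current event never counts itself.
        past.append(r.ts)
    return rows
```

Each count uses only events that precede the current recharge, so the same function can be applied at scoring time without leaking future information into the features.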

Key takeaway

If you are an MLOps engineer or data scientist struggling with low recall in fraud detection, start with deep exploratory data analysis to uncover hidden spatial-temporal patterns. Model performance depends on making the fraud signal visible through feature engineering, not on hyperparameter tuning alone. Pair that with a continuous MLOps loop, using temporal validation and human-in-the-loop labeling, so the system keeps pace with evolving adversarial tactics in production.
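One concrete check when recall is the bottleneck is to pick the operating threshold that maximizes recall under a precision floor, such as the 95% figure quoted in the summary. The sketch below is an illustration under assumptions: `pick_threshold` is a hypothetical helper, labels are 0/1 from human review, and a transaction is flagged when its score is at or above the threshold.

```python
def pick_threshold(scores, labels, min_precision=0.95):
    """Return (threshold, precision, recall) with the highest recall whose
    precision is at least min_precision, or None if nothing qualifies."""
    total_pos = sum(labels)
    if total_pos == 0:
        return None
    # Walk candidates from the highest score down, flagging score >= threshold.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = None
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Evaluate only at boundaries between distinct score values.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (score, precision, recall)
    return best
```

If the function returns `None`, or the recall it reports is still too low, no threshold will rescue the model, which is the situation where revisiting the features pays off.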

Key insights

Effective fraud detection prioritizes feature engineering over model complexity, especially for evolving adversarial patterns.

Method

The process has four steps: run extensive EDA to identify spatial-temporal fraud patterns, engineer features that capture these "trails," train an XGBoost model with temporal cross-validation, and deploy it inside an MLOps loop for continuous scoring, human labeling, and monitored retraining.
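Temporal cross-validation means every validation fold lies strictly after its training data. The following sketch shows one way to build such folds and a set of sample weights. The fold sizes, the function names, and the choice to weight rows by recency are assumptions for illustration; the summary mentions sample weighting but does not say what the weights encode.

```python
from datetime import date, timedelta


def expanding_window_splits(days, n_folds=3, test_days=7):
    """Yield (train_days, test_days) pairs where each test block comes
    strictly after its training block; training history grows per fold."""
    unique = sorted(set(days))
    for k in range(n_folds, 0, -1):
        test_end = len(unique) - (k - 1) * test_days
        test_start = test_end - test_days
        if test_start <= 0:
            continue  # not enough history to form this fold
        yield unique[:test_start], unique[test_start:test_end]


def recency_weights(days, half_life_days=30.0):
    """Assumed weighting scheme: exponential decay from the latest day,
    so a row half_life_days old gets half the weight of a row from today."""
    latest = max(days)
    return [0.5 ** ((latest - d).days / half_life_days) for d in days]
```

Rows would then be assigned to train or test by their day, and the weights passed to the learner, for example through the `sample_weight` argument of `XGBClassifier.fit`. A random shuffle split would let the model see recharge patterns from the future of its validation set, which tends to overstate recall against an adversary who changes tactics over time.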

Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.