The Feature Selection Trap: Why ‘More Data’ Can Actively Hurt Your Machine Learning Model

· Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

A deep learning experiment on landslide detection from satellite data challenges the "more data is better" assumption in feature selection. The researchers initially fed a U-Net++ model 30 input channels: 14 raw Sentinel-2 and ALOS PALSAR bands plus 16 engineered features. The full set improved the F1 score by only 0.2% over the 14 raw bands alone, an illustration of the Hughes phenomenon, in which adding input dimensions without adding training data stops helping and can eventually hurt. They then applied Sequential Forward Floating Selection (SFFS), an algorithm that iteratively adds features and removes them again when they turn out to be redundant, keeping only those that contribute complementary information. Run with a U-Net++ model on a ResNet-50 backbone and a training strategy tailored to class imbalance, SFFS selected a subset of just 8 bands. Reducing the input from 30 to 8 channels maintained the original F1 score while cutting the memory footprint by 75% and GPU VRAM utilization by 40%, and it reduced training time per epoch from 10 minutes to 6 minutes on an NVIDIA T4 GPU. The selected bands were B3 Green, B4 Red, B5 RE1, B8 NIR, B11 SWIR1, B13 Slope, B14 DEM, and B22 Gray, with B4 Red and B13 Slope the dominant contributors.
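As a quick sanity check on the figures above, the relative reductions can be recomputed from the reported numbers. This is only a sketch: the helper name `relative_reduction` is hypothetical, and the comparison with the memory figure assumes that input size scales linearly with channel count, which the source does not state. The 75% memory and 40% VRAM figures are the article's own measurements and are not derived here.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrinks when going from `before` to `after`."""
    return 1 - after / before


# Numbers reported in the summary: 30 -> 8 input channels, 10 -> 6 minutes per epoch.
channel_cut = relative_reduction(30, 8)
epoch_time_cut = relative_reduction(10, 6)

print(f"Input channels removed: {channel_cut:.1%}")
print(f"Training time saved per epoch: {epoch_time_cut:.1%}")
```

The channel count drops by roughly 73%, which is in line with the reported 75% memory saving, while the per-epoch time saving works out to 40%.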

Key takeaway

MLOps engineers deploying deep learning models in resource-constrained environments should actively question the "more features are better" assumption. Principled feature selection such as Sequential Forward Floating Selection (SFFS) can sharply reduce input complexity and memory footprint, by 75% in this experiment, without sacrificing accuracy. The result is faster training, lower inference costs, and better model explainability, which makes debugging and critical decision-making more feasible.

Key insights

Adding more features does not always improve ML model performance and can actively degrade it due to redundancy.

Method

Sequential Forward Floating Selection (SFFS) dynamically adds and removes features: after each addition it checks whether any already-selected feature has become redundant and drops it if so, converging on a genuinely complementary subset.
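The add-then-recheck loop described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the article's implementation: `sffs`, `score_fn`, `COVERAGE`, and `toy_score` are hypothetical names, and the toy scoring function stands in for what would, in the experiment, be a full train-and-validate run returning an F1 score.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def sffs(features: Sequence[str],
         score_fn: Callable[[List[str]], float],
         max_features: int) -> Tuple[List[str], float]:
    """Floating forward selection: greedy additions with conditional removals."""
    if max_features < 1 or not features:
        raise ValueError("need at least one feature and max_features >= 1")

    selected: List[str] = []
    best_by_size: Dict[int, Tuple[float, List[str]]] = {}  # size -> (score, subset)

    def record(subset: List[str]) -> None:
        # Remember the best-scoring subset seen so far for each subset size.
        score = score_fn(subset)
        if len(subset) not in best_by_size or score > best_by_size[len(subset)][0]:
            best_by_size[len(subset)] = (score, list(subset))

    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Forward step: add the single feature that helps the current set most.
        best_add = max(candidates, key=lambda f: score_fn(selected + [f]))
        selected = selected + [best_add]
        record(selected)

        # Floating step: drop a feature only if the smaller set beats the
        # best set of that size seen so far (i.e. the feature became redundant).
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score_fn([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            if score_fn(reduced) > best_by_size[len(reduced)][0]:
                selected = reduced
                record(selected)
            else:
                break

    score, subset = max(best_by_size.values(), key=lambda pair: pair[0])
    return subset, score


# Toy stand-in for model training: each feature "covers" some information,
# and every extra feature carries a small cost.
COVERAGE = {"A": {1, 2, 3, 4}, "B": {1, 2, 5}, "C": {3, 4, 6}, "D": {1, 2}}


def toy_score(subset: List[str]) -> int:
    covered = set().union(*(COVERAGE[f] for f in subset))
    return 10 * len(covered) - len(subset)


subset, score = sffs(list(COVERAGE), toy_score, max_features=3)
print(subset, score)
```

On this toy data, feature A looks best in isolation and is picked first, but once B and C are both in the set they cover everything A does, so the floating step removes A and the final subset is B and C. A purely forward greedy search would have kept A. In a real setting each `score_fn` call is a training run, so caching scores per subset would be worth adding.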

Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.