Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation
Summary
The paper proposes a hierarchical knowledge distillation framework, driven by a Vision Foundation Model (VFM), to stabilize point-supervised Infrared Small Target Detection (ISTD). It targets a specific weakness: lightweight CNN detectors lack sufficient semantics, so under point supervision they produce noisy pseudo-masks and optimize unstably. The framework formulates point-supervised learning as a bilevel optimization process, in which an inner loop adapts a VFM-embedded teacher on reweighted samples and an outer loop transfers validation-guided knowledge to a lightweight student. It also introduces Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features, and a dynamic collaborative learning strategy with cluster-level sample reweighting to improve robustness to imperfect labels. Experiments across multiple ISTD backbones show consistent improvements in detection accuracy and training stability.
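One way to read the bilevel structure described above is sketched below. This is an illustrative formulation, not the paper's exact objective: the symbols (student parameters \theta_S, teacher parameters \theta_T, cluster weights w_k over clusters \mathcal{C}_k, pseudo-masks \tilde{y}_i, per-sample loss \ell, distillation loss \mathcal{L}_{\mathrm{KD}}, validation loss \mathcal{L}_{\mathrm{val}}, trade-off \lambda) are all assumptions introduced for illustration.

```latex
\begin{aligned}
\min_{\theta_S,\, w}\quad
  & \mathcal{L}_{\mathrm{val}}(\theta_S)
    + \lambda\, \mathcal{L}_{\mathrm{KD}}\big(\theta_S,\ \theta_T^{*}(w)\big) \\
\text{s.t.}\quad
  & \theta_T^{*}(w) = \arg\min_{\theta_T}
    \sum_{k} w_k \sum_{i \in \mathcal{C}_k}
    \ell\big(f_T(x_i;\theta_T),\ \tilde{y}_i\big)
\end{aligned}
```

Under this reading, the inner problem fits the teacher to pseudo-masks with noisy clusters down-weighted, and the outer problem uses validation feedback to decide what the student should absorb.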
Key takeaway
For research scientists developing lightweight CNN detectors for ISTD with limited annotations, this framework offers a way to counter noisy pseudo-masks and unstable optimization. The reported results indicate that combining hierarchical VFM-driven knowledge distillation with Semantic-Conditioned Affine Modulation yields consistent gains in detection accuracy and training stability across several ISTD backbones.
Key insights
Hierarchical knowledge distillation with VFM semantics stabilizes point-supervised infrared small target detection.
Principles
- Bilevel optimization mitigates pseudo-label noise.
- VFM semantics enhance CNN feature representation.
- Dynamic reweighting improves robustness to imperfect labels.
Method
The method uses bilevel optimization with a VFM-embedded teacher and a lightweight student. It incorporates Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features, and dynamic collaborative learning with cluster-level sample reweighting.
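The summary does not spell out SCAM's internals. As a rough illustration of affine modulation conditioned on a semantic vector, here is a minimal FiLM-style sketch in NumPy. The function name `scam_modulate`, the linear projections, and the `1 + gamma` residual form are all assumptions, not the authors' design.

```python
import numpy as np

def scam_modulate(cnn_feat, sem_vec, w_gamma, b_gamma, w_beta, b_beta):
    """Hypothetical FiLM-style modulation (an assumption, not the paper's SCAM).

    cnn_feat: (C, H, W) CNN feature map
    sem_vec:  (D,) semantic embedding, e.g. pooled from a VFM
    w_gamma, w_beta: (C, D) projection matrices
    b_gamma, b_beta: (C,) biases
    """
    gamma = w_gamma @ sem_vec + b_gamma  # per-channel scale offset, shape (C,)
    beta = w_beta @ sem_vec + b_beta     # per-channel shift, shape (C,)
    # Residual scaling: with zero-initialized projections the map is unchanged.
    return cnn_feat * (1.0 + gamma)[:, None, None] + beta[:, None, None]

rng = np.random.default_rng(0)
C, D, H, W = 4, 8, 5, 5
feat = rng.standard_normal((C, H, W))
sem = rng.standard_normal(D)

zero_w = np.zeros((C, D))
zero_b = np.zeros(C)
# Zero projections reduce the modulation to the identity.
identity_out = scam_modulate(feat, sem, zero_w, zero_b, zero_w, zero_b)
```

The residual `1 + gamma` form is a common choice because the module starts as a no-op and learns to deviate from it. Whether the paper uses this form is not stated in the summary.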
In practice
- Integrate VFMs into teacher models for semantic enrichment.
- Apply SCAM for multi-layer feature injection.
- Utilize cluster-level reweighting for noisy datasets.
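To make the cluster-level reweighting idea concrete, the sketch below gives every sample in a cluster the same weight and down-weights clusters with a high average loss. This loss-based softmax heuristic is a stand-in chosen for illustration. The paper's weights are described as part of a validation-guided bilevel scheme, and the function `cluster_reweight` and its `temperature` parameter are hypothetical.

```python
import numpy as np

def cluster_reweight(losses, cluster_ids, temperature=1.0):
    """Hypothetical cluster-level weights: softmax over negative mean cluster loss."""
    losses = np.asarray(losses, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    # Map each sample to a dense cluster index 0..K-1.
    _, inverse = np.unique(cluster_ids, return_inverse=True)
    counts = np.bincount(inverse)
    mean_loss = np.bincount(inverse, weights=losses) / counts
    # Lower mean loss -> larger weight; subtract the max for numerical stability.
    logits = -mean_loss / temperature
    logits = logits - logits.max()
    cluster_w = np.exp(logits) / np.exp(logits).sum()
    sample_w = cluster_w[inverse]
    # Normalize so the average sample weight is 1, keeping the loss scale stable.
    return sample_w / sample_w.mean()

# Two clusters: cluster 0 has clean-looking (low-loss) samples, cluster 1 is noisy.
weights = cluster_reweight([0.1, 0.2, 2.0, 2.2], [0, 0, 1, 1])
```

Sharing one weight per cluster rather than per sample is less sensitive to a single outlier loss, which is presumably part of the appeal for noisy pseudo-masks.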
Topics
- Infrared Small Target Detection
- Knowledge Distillation
- Vision Foundation Models
- Point Supervision
- Semantic-Conditioned Affine Modulation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.