Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation
Summary
A hierarchical knowledge distillation framework driven by a Vision Foundation Model (VFM) targets point-supervised Infrared Small Target Detection (ISTD), a setting motivated by the cost of dense pixel-wise annotation. Lightweight Convolutional Neural Network (CNN) detectors lack strong semantics, which often leads to noisy pseudo-masks and unstable optimization; the framework compensates by using a frozen VFM during training. It formulates point-supervised learning as a bilevel optimization process: an inner loop adapts a VFM-embedded teacher on reweighted samples, and an outer loop transfers validation-guided knowledge to a lightweight student. Semantic-Conditioned Affine Modulation (SCAM) injects VFM semantics into CNN features at multiple layers, and a dynamic collaborative learning strategy with cluster-level sample reweighting adds robustness. Experiments on the SIRST3 dataset show consistent improvements in detection accuracy and training stability across various ISTD backbones, while the student network keeps its inference efficiency.
Key takeaway
If you develop point-supervised ISTD models, consider a VFM-driven hierarchical knowledge distillation framework. Combined with Semantic-Conditioned Affine Modulation (SCAM) and dynamic collaborative learning, it can stabilize training and improve detection accuracy, especially for faint and camouflaged targets, without increasing the inference-time complexity of the lightweight student model. Prioritize validation-based generalization feedback to limit overfitting to training-set biases.
Key insights
VFM-driven hierarchical knowledge distillation stabilizes point-supervised ISTD by injecting robust semantics into lightweight CNNs.
Principles
- Bilevel optimization improves generalization.
- Validation feedback counters training-set bias.
- Gated semantic injection enhances stability.
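The gated semantic injection principle can be illustrated with a small sketch. This is not the paper's SCAM module, whose exact form the summary does not give. It is a generic gated affine modulation (in the style of FiLM), where a semantic vector predicts a per-channel scale, shift, and gate. The function name, the projection matrices `w_gamma`, `w_beta`, `w_gate`, and the residual form `x * (1 + gate * gamma) + gate * beta` are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(row, vec):
    return sum(r * v for r, v in zip(row, vec))


def gated_affine_modulation(features, semantics, w_gamma, w_beta, w_gate):
    """Hypothetical gated affine modulation of CNN features.

    features:  C channels, each a flat list of spatial activations.
    semantics: semantic vector of length D (e.g. a pooled VFM embedding).
    w_gamma, w_beta, w_gate: C x D projection matrices (lists of rows).
    """
    modulated = []
    for c, channel in enumerate(features):
        gamma = dot(w_gamma[c], semantics)          # per-channel scale offset
        beta = dot(w_beta[c], semantics)            # per-channel shift
        gate = sigmoid(dot(w_gate[c], semantics))   # in (0, 1), limits injection
        modulated.append([x * (1.0 + gate * gamma) + gate * beta for x in channel])
    return modulated


# With zero scale and shift projections the features pass through unchanged.
feats = [[0.5, -1.0], [2.0, 3.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
assert gated_affine_modulation(feats, [0.3, 0.7], zeros, zeros, zeros) == feats
```

The residual form means that a closed gate, or zero scale and shift, returns the original CNN features. That is one plausible reason a gated design would help stability, since the semantic signal can be admitted gradually rather than overwriting the features.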
Method
The method uses a bilevel optimization framework with a VFM-embedded teacher and a lightweight CNN student. SCAM injects VFM semantics, and dynamic collaborative learning reweights samples for robustness.
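To make cluster-level sample reweighting concrete, here is a minimal sketch under stated assumptions. The summary does not give the paper's weighting rule, so the rule below is a stand-in chosen for illustration: a softmax over the negative mean loss of each cluster, scaled so that cluster weights average to 1. The function name `cluster_weights` and the `temperature` parameter are hypothetical.

```python
import math
from collections import defaultdict


def cluster_weights(losses, cluster_ids, temperature=1.0):
    """Return one weight per sample, shared by all samples in a cluster.

    Assumed rule (not from the paper): clusters with a higher mean loss,
    taken here as a proxy for noisier pseudo-masks, receive lower weight.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for loss, cid in zip(losses, cluster_ids):
        sums[cid] += loss
        counts[cid] += 1
    mean_loss = {cid: sums[cid] / counts[cid] for cid in sums}

    # Numerically stable softmax over negative mean losses.
    logits = {cid: -m / temperature for cid, m in mean_loss.items()}
    top = max(logits.values())
    exps = {cid: math.exp(v - top) for cid, v in logits.items()}
    total = sum(exps.values())

    # Scale so the weights average to 1 across clusters.
    n_clusters = len(exps)
    per_cluster = {cid: n_clusters * e / total for cid, e in exps.items()}
    return [per_cluster[cid] for cid in cluster_ids]


weights = cluster_weights([0.1, 0.1, 2.0, 2.0], [0, 0, 1, 1])
assert weights[0] > weights[2]  # the low-loss cluster is weighted higher
```

Sharing one weight across a cluster, rather than weighting each sample separately, smooths out per-sample noise. In a bilevel setup the teacher's inner-loop loss would be multiplied by these weights, but how the paper couples the weights to validation feedback is not specified here.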
In practice
- Use DINOv3 ViT-S+/16 as a frozen VFM.
- Train for 300 epochs with AdamW, batch size 16.
- Perform bilevel updates every 5 epochs.
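The settings above can be collected into a small configuration sketch. The class and field names are hypothetical. Whether a bilevel update fires at the end of epochs 5, 10, and so on (1-indexed) is also an assumption, since the summary only says "every 5 epochs".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    vfm_name: str = "DINOv3 ViT-S+/16"  # frozen during training
    optimizer: str = "AdamW"
    epochs: int = 300
    batch_size: int = 16
    bilevel_every: int = 5  # epochs between bilevel updates


def bilevel_epochs(cfg: TrainConfig) -> list:
    """Epochs (1-indexed) after which a bilevel update is assumed to run."""
    return [e for e in range(1, cfg.epochs + 1) if e % cfg.bilevel_every == 0]


schedule = bilevel_epochs(TrainConfig())
print(len(schedule))  # 60 bilevel updates over 300 epochs
```

Under these assumptions the run performs 60 bilevel updates, with ordinary training on the remaining epochs.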
Topics
- Infrared Small Target Detection
- Point Supervision
- Knowledge Distillation
- Vision Foundation Models
- Bilevel Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.