Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation

2026-05-15 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

A new hierarchical VFM-driven knowledge distillation framework addresses the challenges of point-supervised Infrared Small Target Detection (ISTD), where dense pixel-wise annotations are expensive. The framework leverages a frozen Vision Foundation Model (VFM) during training to overcome semantic deficiencies in lightweight Convolutional Neural Network (CNN) detectors, which often lead to noisy pseudo-masks and unstable optimization. It formulates point-supervised learning as a bilevel optimization process, with an inner loop adapting a VFM-embedded teacher on reweighted samples and an outer loop transferring validation-guided knowledge to a lightweight student. The framework introduces Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features at multiple layers and employs a dynamic collaborative learning strategy with cluster-level sample reweighting for robustness. Experiments on the SIRST3 dataset demonstrate consistent improvements in detection accuracy and training stability across various ISTD backbones, with the student network maintaining inference efficiency.

Key takeaway

Research scientists developing point-supervised ISTD models should consider a VFM-driven hierarchical knowledge distillation framework. In the reported experiments, this approach, particularly with Semantic-Conditioned Affine Modulation (SCAM) and dynamic collaborative learning, stabilizes training and improves detection accuracy, especially for faint and camouflaged targets, without increasing the inference-time complexity of the lightweight student model. Prioritize validation-based generalization feedback to mitigate overfitting to training-set biases.

Key insights

VFM-driven hierarchical knowledge distillation stabilizes point-supervised ISTD by injecting robust semantics into lightweight CNNs.

Principles

Bilevel optimization improves generalization.
Validation feedback counters training-set bias.
Gated semantic injection enhances stability.
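
As an illustration of how validation feedback could drive cluster-level sample reweighting, here is a minimal sketch in plain Python. It is not the paper's procedure: the function name, the per-sample "validation score" signal, the softmax form, and the temperature parameter are all assumptions made for this example.

```python
import math
from collections import defaultdict


def cluster_weights(cluster_ids, val_scores, temperature=1.0):
    """Return one weight per sample, shared by all samples in its cluster.

    cluster_ids: cluster label for each training sample (hypothetical input).
    val_scores: per-sample score; higher means the sample is assumed to
        help validation performance more (hypothetical signal).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cid, score in zip(cluster_ids, val_scores):
        sums[cid] += score
        counts[cid] += 1
    means = {cid: sums[cid] / counts[cid] for cid in sums}

    # Softmax over cluster means, shifted by the max for numerical stability.
    peak = max(means.values())
    exps = {cid: math.exp((m - peak) / temperature) for cid, m in means.items()}
    total = sum(exps.values())
    per_cluster = {cid: e / total for cid, e in exps.items()}

    # Every sample inherits the weight of its cluster.
    return [per_cluster[cid] for cid in cluster_ids]


weights = cluster_weights([0, 0, 1, 1], [1.0, 1.0, 0.0, 0.0])
```

Sharing one weight across a cluster, rather than weighting each sample independently, is one plausible way to keep a few noisy pseudo-masks from dominating the reweighting signal.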

Method

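The method casts point-supervised training as bilevel optimization: an inner loop adapts a VFM-embedded teacher on reweighted samples, and an outer loop transfers validation-guided knowledge to a lightweight CNN student. SCAM injects VFM semantics into CNN features at multiple layers, and dynamic collaborative learning reweights samples at the cluster level for robustness.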
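
This summary does not give SCAM's exact formulation. The sketch below shows one common way to realize gated, semantic-conditioned affine modulation (a FiLM-style scale and shift with a sigmoid gate on a residual path). The function names, the projection matrices `w_gamma`, `w_beta`, `w_gate`, and the residual form are assumptions for illustration, written in plain Python rather than a deep learning framework.

```python
import math


def matvec(matrix, vector):
    """Multiply a C x D matrix (list of rows) by a length-D vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_affine_modulation(features, semantics, w_gamma, w_beta, w_gate):
    """Modulate CNN features with a semantic embedding.

    features: list of C channels, each a flat list of activations.
    semantics: semantic embedding of length D (e.g. from a frozen VFM).
    w_gamma, w_beta, w_gate: C x D projections (hypothetical parameters).
    """
    gamma = matvec(w_gamma, semantics)  # per-channel scale
    beta = matvec(w_beta, semantics)    # per-channel shift
    gate = [sigmoid(z) for z in matvec(w_gate, semantics)]  # values in (0, 1)

    out = []
    for channel, g, b, k in zip(features, gamma, beta, gate):
        # Residual form: a gate near 0 leaves the CNN feature unchanged.
        out.append([x + k * (g * x + b) for x in channel])
    return out


modulated = gated_affine_modulation(
    features=[[1.0, 2.0], [3.0, 4.0]],
    semantics=[1.0, 0.0],
    w_gamma=[[0.5, 0.0], [0.0, 0.0]],
    w_beta=[[0.0, 0.0], [1.0, 0.0]],
    w_gate=[[0.0, 0.0], [0.0, 0.0]],
)
```

In this form the gate controls how much semantic signal reaches each channel, so the injection can be suppressed where it would destabilize the CNN features.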

In practice

Use DINOv3 ViT-S+/16 as a frozen VFM.
Train for 300 epochs with AdamW, batch size 16.
Perform bilevel updates every 5 epochs.
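
The settings above can be collected into a small configuration object. This is a sketch only: the class and field names are hypothetical, and the assumption that epochs are counted from 1 with a bilevel update on every fifth epoch (5, 10, ...) is one reading of "every 5 epochs" that the summary does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the "In practice" list; field names are hypothetical.
    vfm_name: str = "DINOv3 ViT-S+/16"
    vfm_frozen: bool = True
    optimizer: str = "AdamW"
    epochs: int = 300
    batch_size: int = 16
    bilevel_interval: int = 5


def bilevel_epochs(cfg):
    """List the epochs on which a bilevel update would run.

    Assumes 1-indexed epochs and an update on every multiple of the interval.
    """
    return [e for e in range(1, cfg.epochs + 1) if e % cfg.bilevel_interval == 0]


schedule = bilevel_epochs(TrainConfig())
```

Under these assumptions, a 300-epoch run performs 60 bilevel updates, with standard training epochs in between.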

Topics

Infrared Small Target Detection
Point Supervision
Knowledge Distillation
Vision Foundation Models
Bilevel Optimization

Code references

yuanhang-yao/semantic-prior

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.