Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation

2026-05-14 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, medium

Summary

A hierarchical knowledge distillation framework driven by a Vision Foundation Model (VFM) is proposed to stabilize point-supervised Infrared Small Target Detection (ISTD). It targets a weakness of lightweight CNN detectors: they lack sufficient semantics, which under point supervision leads to noisy pseudo-masks and unstable optimization. The framework formulates point-supervised learning as a bilevel optimization process: an inner loop adapts a VFM-embedded teacher on reweighted samples, and an outer loop transfers validation-guided knowledge to a lightweight student. It also introduces Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features, and a dynamic collaborative learning strategy with cluster-level sample reweighting to improve robustness. Experiments across multiple ISTD backbones show consistent improvements in detection accuracy and training stability.
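The inner/outer structure described above can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes linear logistic models in place of the VFM-embedded teacher and the CNN student, approximates the bilevel problem by simple alternation, and interprets "validation-guided" as scaling the teacher's influence by its accuracy on a small clean validation set. The names bilevel_distill, weighted_bce_grad, and alpha are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_bce_grad(theta, X, y, w):
    """Gradient of the weighted binary cross-entropy for a linear logistic model."""
    p = sigmoid(X @ theta)
    return X.T @ (w * (p - y)) / w.sum()

def bilevel_distill(X, y_pseudo, X_val, y_val, w,
                    outer_steps=50, inner_steps=5, lr=0.5):
    """Alternating approximation of a teacher/student bilevel loop (toy, assumed)."""
    d = X.shape[1]
    teacher = np.zeros(d)
    student = np.zeros(d)
    alpha = 0.0
    for _ in range(outer_steps):
        # Inner loop: adapt the teacher on reweighted pseudo-labelled samples.
        for _ in range(inner_steps):
            teacher -= lr * weighted_bce_grad(teacher, X, y_pseudo, w)
        # Validation guidance (assumption): trust the teacher in proportion
        # to its accuracy on the clean validation set.
        val_pred = sigmoid(X_val @ teacher) > 0.5
        alpha = float(np.mean(val_pred == (y_val > 0.5)))
        # Outer step: the student fits a blend of teacher soft outputs
        # and the raw pseudo-labels.
        target = alpha * sigmoid(X @ teacher) + (1.0 - alpha) * y_pseudo
        student -= lr * weighted_bce_grad(student, X, target, np.ones(len(X)))
    return teacher, student, alpha

# Synthetic data: 20% of the pseudo-labels are flipped to mimic noisy pseudo-masks.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_clean = (X @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)
flipped = rng.random(200) < 0.2
y_pseudo = np.where(flipped, 1.0 - y_clean, y_clean)
X_val = rng.normal(size=(40, 3))
y_val = (X_val @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)

# Uniform weights here; in the described framework they would come from reweighting.
teacher, student, alpha = bilevel_distill(X, y_pseudo, X_val, y_val, np.ones(200))
```

The point of the sketch is the control flow: the teacher sees the reweighted noisy labels, while the student only ever sees targets that have been filtered through the teacher and gated by a validation signal.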

Key takeaway

For research scientists building lightweight CNN detectors for ISTD under limited annotation, this framework targets the two main failure modes of point supervision: noisy pseudo-masks and unstable optimization. The authors report that combining hierarchical VFM-driven knowledge distillation with Semantic-Conditioned Affine Modulation yields consistent gains in detection accuracy and training stability across multiple ISTD backbones.

Key insights

Hierarchical knowledge distillation with VFM semantics stabilizes point-supervised infrared small target detection.

Principles

Bilevel optimization mitigates pseudo-label noise.
VFM semantics enhance CNN feature representation.
Dynamic reweighting improves robustness to imperfect labels.

Method

The method uses bilevel optimization with a VFM-embedded teacher and a lightweight student. It incorporates Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features, and dynamic collaborative learning with cluster-level sample reweighting.
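The summary does not give SCAM's exact design, so the following is only a sketch of one common way to condition CNN features on a semantic vector: a FiLM-style per-channel scale and shift. It assumes the VFM semantics are available as a single global embedding and that the scale and shift come from linear projections; scam_modulate and its parameter names are hypothetical.

```python
import numpy as np

def scam_modulate(features, semantics, W_gamma, b_gamma, W_beta, b_beta):
    """Per-channel affine modulation of CNN features by a semantic embedding.

    features:  (C, H, W) CNN feature map
    semantics: (D,) semantic embedding (assumed to come from a VFM)
    W_gamma, W_beta: (C, D) projection matrices
    b_gamma, b_beta: (C,) biases
    """
    gamma = W_gamma @ semantics + b_gamma   # (C,) channel-wise scale offset
    beta = W_beta @ semantics + b_beta      # (C,) channel-wise shift
    # Residual form (1 + gamma): zero projections leave the features unchanged.
    return (1.0 + gamma)[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, D = 8, 16, 16, 32
features = rng.normal(size=(C, H, W))
semantics = rng.normal(size=D)
modulated = scam_modulate(
    features, semantics,
    W_gamma=0.01 * rng.normal(size=(C, D)), b_gamma=np.zeros(C),
    W_beta=0.01 * rng.normal(size=(C, D)), b_beta=np.zeros(C),
)
```

For multi-layer injection, one plausible arrangement is a separate pair of projections per CNN stage, each applied to that stage's feature map.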

In practice

Integrate VFMs into teacher models for semantic enrichment.
Apply SCAM for multi-layer feature injection.
Utilize cluster-level reweighting for noisy datasets.
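As an illustration of the last item, here is a minimal sketch of cluster-level reweighting. The weighting rule is an assumption, not taken from the paper: every sample in a cluster shares one weight, and clusters with a higher mean loss (more likely to contain noisy pseudo-masks) are down-weighted through a softmax over negative mean loss. The function cluster_weights and the temperature parameter are hypothetical, and cluster assignments are taken as given.

```python
import numpy as np

def cluster_weights(losses, cluster_ids, temperature=1.0):
    """Return one weight per sample, shared within each cluster.

    losses:      (N,) per-sample losses from the current model
    cluster_ids: (N,) cluster label of each sample
    """
    losses = np.asarray(losses, dtype=float).ravel()
    cluster_ids = np.asarray(cluster_ids).ravel()
    # Map arbitrary cluster labels to indices 0..K-1.
    _, inverse = np.unique(cluster_ids, return_inverse=True)
    counts = np.bincount(inverse)
    mean_loss = np.bincount(inverse, weights=losses) / counts
    # Softmax over negative mean loss: high-loss clusters get small weights.
    logits = -mean_loss / temperature
    logits -= logits.max()                  # numerical stability
    cluster_w = np.exp(logits)
    cluster_w /= cluster_w.sum()
    sample_w = cluster_w[inverse]
    # Rescale so the average sample weight is 1, keeping the loss scale stable.
    return sample_w * len(sample_w) / sample_w.sum()

losses = [0.1, 0.2, 2.0, 2.2, 0.15]
cluster_ids = [0, 0, 1, 1, 0]
weights = cluster_weights(losses, cluster_ids)
```

To make the reweighting dynamic, the weights would be recomputed from fresh losses at regular intervals, for example once per epoch.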

Topics

Infrared Small Target Detection
Knowledge Distillation
Vision Foundation Models
Point Supervision
Semantic-Conditioned Affine Modulation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.