CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

2026-06-05 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

CL-CLIP is a new CLIP-based continual object detection (COD) framework designed to overcome catastrophic forgetting in open-vocabulary detectors. Existing models such as F-ViT struggle to retain previously learned categories when they are repeatedly updated with new ones. CL-CLIP addresses this with cost-volume-guided category decoupling: inspired by CAT-Seg, it computes a cost volume of CLIP image-text similarities, which yields a dense response map for each category. This zero-shot spatial prior is then used to separate shared region features into class-specific pathways, which a Multi-Expert RoI head processes. Experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves on the F-ViT baseline under continual fine-tuning, adapting to new categories competitively while preserving base-class detection.

Key takeaway

For machine learning engineers building object detection systems that must adapt continuously, CL-CLIP offers a framework for mitigating catastrophic forgetting. If your models, especially CLIP-based ones such as F-ViT, lose accuracy on old categories as they learn new ones, consider cost-volume-guided category decoupling. In the reported experiments, this approach lets the detector acquire new object classes while preserving performance on previously learned base categories, which supports model stability as the category set grows.

Key insights

CL-CLIP improves continual object detection by decoupling categories with a CLIP image-text similarity cost volume, which mitigates catastrophic forgetting.

Principles

Continual learning needs knowledge preservation.
Vision-language models offer zero-shot detection.
Decoupling features aids continual adaptation.

Method

CL-CLIP computes a CLIP image-text similarity cost volume that serves as a zero-shot spatial prior. The prior is used to decouple shared region features into class-specific pathways, which a Multi-Expert RoI head processes for continual learning.
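
The data flow described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the summary does not specify how the cost volume is aggregated, how features are gated, or what the experts look like. The function names (cost_volume, decouple_region, multi_expert_head), the mean pooling inside the box, the softmax gating with a temperature, and the linear experts are all assumptions made for illustration, and the tiny hand-written vectors stand in for real CLIP embeddings.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cost_volume(patch_feats, text_embeds):
    """Cosine similarity between every image patch and every class text embedding.

    patch_feats: H x W x D nested lists (stand-in for CLIP patch embeddings).
    text_embeds: C x D nested lists (stand-in for CLIP text embeddings).
    Returns C x H x W: one dense response map per category.
    """
    patches = [[l2_normalize(p) for p in row] for row in patch_feats]
    texts = [l2_normalize(t) for t in text_embeds]
    return [
        [[sum(a * b for a, b in zip(p, t)) for p in row] for row in patches]
        for t in texts
    ]

def decouple_region(region_feat, volume, box, temperature=0.1):
    """Split one shared region feature into per-class copies (assumed gating).

    Each copy is scaled by a softmax weight computed from the mean
    cost-volume response of that class inside box = (y0, x0, y1, x1),
    where y1 and x1 are exclusive.
    """
    y0, x0, y1, x1 = box
    scores = []
    for class_map in volume:
        cells = [class_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        scores.append(sum(cells) / len(cells))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    class_feats = [[w * f for f in region_feat] for w in weights]
    return class_feats, weights

def multi_expert_head(class_feats, experts):
    # One hypothetical linear expert per class scores only its own pathway.
    return [
        sum(w * f for w, f in zip(expert, feat))
        for expert, feat in zip(experts, class_feats)
    ]

# Toy 2x2 image: the top row resembles class 0, the bottom row class 1.
patch_feats = [[[1.0, 0.0], [1.0, 0.0]],
               [[0.0, 1.0], [0.0, 1.0]]]
text_embeds = [[2.0, 0.0], [0.0, 3.0]]  # unnormalized on purpose

volume = cost_volume(patch_feats, text_embeds)
class_feats, weights = decouple_region([0.5, 0.5], volume, box=(0, 0, 1, 2))
logits = multi_expert_head(class_feats, experts=[[1.0, 1.0], [1.0, 1.0]])
print(weights, logits)
```

In this toy setup a box over the top row routes nearly all of the region feature to the class 0 pathway. One plausible reading of why this helps continual learning is that each expert sees mostly its own category's signal, so updating for a new category disturbs the other pathways less; the summary does not confirm the exact mechanism.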

In practice

Update object detectors with new categories.
Maintain performance on existing classes.
Improve on the F-ViT baseline under continual fine-tuning.
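
One simple way to check the second point after an update is to compare per-class average precision (AP) on the base classes before and after the new categories are added. The sketch below assumes you already have per-class AP from your own evaluation; the helper name retention_report is hypothetical, and the numbers are made-up placeholders, not results from the paper.

```python
def retention_report(ap_before, ap_after, base_classes):
    """Per-class AP drop on base classes, plus the mean drop.

    ap_before / ap_after: dicts mapping class name to AP in [0, 1].
    A positive drop means the class got worse after the update.
    """
    drops = {c: ap_before[c] - ap_after[c] for c in base_classes}
    mean_drop = sum(drops.values()) / len(drops)
    return drops, mean_drop

# Placeholder values for illustration only (not measured results).
ap_before = {"cat": 0.80, "dog": 0.70}
ap_after = {"cat": 0.76, "dog": 0.70, "bus": 0.60}  # "bus" is the new class

drops, mean_drop = retention_report(ap_before, ap_after, base_classes=["cat", "dog"])
print(drops, mean_drop)
```

A mean drop close to zero on base classes, together with usable AP on the new classes, is the behavior a continual detector aims for.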

Topics

Continual Object Detection
CLIP Models
Catastrophic Forgetting
Open-Vocabulary Detection
Cost Volume
Multi-Expert RoI

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.