CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

CL-CLIP is a new framework for Continual Object Detection (COD) that addresses catastrophic forgetting in CLIP-based open-vocabulary detectors. It uses a CLIP image-text similarity cost volume to produce zero-shot spatial priors, which decouple shared region features into class-specific pathways. These pathways are processed by a Multi-Expert RoI head in which each category has a dedicated convolutional expert that is frozen after training. Combined with drift regularization on the FPN modules and an orthogonality loss, this design reduces cross-class interference. In experiments on PASCAL VOC and MS-COCO, CL-CLIP improves on the F-ViT baseline by 7.4 to 9.6 mAP points across VOC splits (for example, reaching 74.7 mAP@A on the 10+10 split) and remains stronger than existing COD methods such as IOR and MMA on the 4-task COCO benchmarks. It also consistently outperforms other CLIP variants such as EVA-CLIP and SigLIP2 in continual settings.

Key takeaway

For machine learning engineers building continual object detection systems, CL-CLIP offers a way to mitigate catastrophic forgetting in CLIP-based detectors. Cost-volume-guided category decoupling and a Multi-Expert RoI head preserve old-class detection ability while the model adapts to new categories. Consider combining this architectural separation with drift regularization to reach a better stability-plasticity trade-off than traditional fine-tuning or replay methods.

Key insights

CLIP-based continual object detection benefits from category decoupling via a frozen image-text similarity cost volume.

Principles

Frozen CLIP priors offer stable, category-indexed spatial signals.
Decoupling features into class-specific pathways reduces forgetting.
Orthogonality loss minimizes cross-category spatial co-activation.
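
One way to read the orthogonality principle is as a penalty on the overlap between per-category spatial maps. The sketch below is a minimal illustration under that assumption, not the paper's exact loss: the function name `orthogonality_loss`, the (C, H, W) map layout, and the choice of squared off-diagonal cosine similarity are all hypothetical.

```python
import numpy as np

def orthogonality_loss(maps: np.ndarray, eps: float = 1e-8) -> float:
    """Penalize spatial co-activation between categories.

    maps: array of shape (C, H, W), one spatial activation map per category.
    Returns the mean squared cosine similarity over all ordered pairs of
    distinct categories (0.0 when the maps never overlap).
    """
    num_classes = maps.shape[0]
    flat = maps.reshape(num_classes, -1)
    # L2-normalize each map so the Gram matrix holds cosine similarities.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + eps)
    gram = flat @ flat.T                      # (C, C)
    off_diag = gram - np.diag(np.diag(gram))  # zero out self-similarity
    num_pairs = max(num_classes * (num_classes - 1), 1)
    return float((off_diag ** 2).sum() / num_pairs)

# Two categories that fire at different locations do not overlap at all.
disjoint = np.zeros((2, 2, 2))
disjoint[0, 0, 0] = 1.0
disjoint[1, 1, 1] = 1.0
print(orthogonality_loss(disjoint))  # 0.0
```

Minimizing a term like this pushes different categories to activate at different locations, which is the behavior the principle describes.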

Method

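CL-CLIP builds a CLIP image-text similarity cost volume that serves as a zero-shot spatial prior, then uses it to separate shared region features into class-specific pathways. A Multi-Expert RoI head, whose per-category experts are frozen after training, works together with drift regularization on the FPN modules to mitigate forgetting.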
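
The cost-volume and decoupling steps can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' implementation: random arrays stand in for CLIP patch and text embeddings, and the function names, the softmax over categories, and the temperature of 0.07 are all hypothetical choices made for the example.

```python
import numpy as np

def cost_volume(patch_emb: np.ndarray, text_emb: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between every image patch and every class prompt.

    patch_emb: (H, W, D) patch embeddings; text_emb: (C, D) text embeddings.
    Returns a (C, H, W) cost volume.
    """
    p = patch_emb / (np.linalg.norm(patch_emb, axis=-1, keepdims=True) + eps)
    t = text_emb / (np.linalg.norm(text_emb, axis=-1, keepdims=True) + eps)
    return np.einsum("hwd,cd->chw", p, t)

def spatial_priors(cost: np.ndarray, temperature: float = 0.07) -> np.ndarray:
    """Softmax over the category axis (an assumed normalization)."""
    logits = cost / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=0, keepdims=True)

def decouple(features: np.ndarray, priors: np.ndarray) -> np.ndarray:
    """Gate shared features (F, H, W) with each class prior (C, H, W).

    Returns (C, F, H, W): one feature pathway per category.
    """
    return priors[:, None, :, :] * features[None, :, :, :]

rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 4, 16))  # stand-in for CLIP patch embeddings
texts = rng.normal(size=(3, 16))       # stand-in for text embeddings of 3 class names
features = rng.normal(size=(8, 4, 4))  # stand-in for shared region features

cost = cost_volume(patches, texts)
priors = spatial_priors(cost)
pathways = decouple(features, priors)
print(pathways.shape)  # (3, 8, 4, 4)
```

Because the CLIP encoders stay frozen in this reading, the priors for a given category do not change when new tasks arrive, so each pathway keeps receiving the same gating signal.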

In practice

Use CLIP cost volumes to generate class-specific spatial priors.
Implement per-category convolutional experts that freeze after training.
Apply orthogonality loss to reduce spatial overlap between categories.
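
The freeze-after-training behavior of per-category experts, along with a drift penalty on shared parameters, might look like the toy sketch below. It is an assumption-laden illustration: each expert is a plain weight vector instead of a convolutional module, and `MultiExpertHead`, `sgd_step`, and `drift_penalty` are hypothetical names, not the paper's API.

```python
import numpy as np

class MultiExpertHead:
    """Toy head with one expert per category (linear stand-in for a conv expert)."""

    def __init__(self, feat_dim: int, seed: int = 0):
        self.feat_dim = feat_dim
        self.rng = np.random.default_rng(seed)
        self.weights = {}    # category name -> (feat_dim,) weight vector
        self.frozen = set()  # categories whose experts may no longer change

    def add_expert(self, category: str) -> None:
        self.weights[category] = self.rng.normal(scale=0.01, size=self.feat_dim)

    def freeze(self, categories) -> None:
        self.frozen.update(categories)

    def sgd_step(self, category: str, grad: np.ndarray, lr: float = 0.1) -> None:
        if category in self.frozen:
            return  # frozen experts ignore all later updates
        self.weights[category] = self.weights[category] - lr * grad

    def score(self, category: str, feature: np.ndarray) -> float:
        return float(self.weights[category] @ feature)

def drift_penalty(current: dict, previous: dict) -> float:
    """Squared L2 distance between shared parameters and their earlier snapshot."""
    return float(sum(((current[k] - previous[k]) ** 2).sum() for k in previous))

# Task 1: learn "cat", then freeze its expert.
head = MultiExpertHead(feat_dim=4)
head.add_expert("cat")
cat_before = head.weights["cat"].copy()
head.freeze(["cat"])

# Task 2: add "dog"; gradients reach both experts but only "dog" moves.
head.add_expert("dog")
for name in ("cat", "dog"):
    head.sgd_step(name, grad=np.ones(4))
print(np.array_equal(head.weights["cat"], cat_before))  # True
```

Freezing guards the class-specific parameters, while a penalty such as `drift_penalty` would be added to the training loss to keep shared modules (the FPN, in the paper's description) close to their state after the previous task.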

Topics

Continual Learning
Object Detection
CLIP Models
Open-Vocabulary Detection
Catastrophic Forgetting
Feature Decoupling
Multi-Expert Networks

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.