CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

Source: cs.CV updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

CL-CLIP is a framework for Continual Object Detection (COD) that addresses catastrophic forgetting in CLIP-based open-vocabulary detectors. It uses a CLIP image-text similarity cost volume as a zero-shot spatial prior to decouple shared region features into class-specific pathways. A Multi-Expert RoI head then processes these pathways: each category has a dedicated convolutional expert that is frozen once trained. Combined with drift regularization on the FPN modules and an orthogonality loss, this design reduces cross-class interference. In experiments on PASCAL VOC and MS-COCO, CL-CLIP improves on the F-ViT baseline by 7.4 to 9.6 mAP points across VOC splits (e.g., 74.7 mAP@A on 10+10) and holds up better on the 4-task COCO benchmarks than existing COD methods such as IOR and MMA. It also consistently outperforms other CLIP variants such as EVA-CLIP and SigLIP2 in continual settings.
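The cost-volume idea in the summary can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `cost_volume` and `decouple`, the tensor shapes, the softmax gating across classes, and the `temperature` value are all assumptions made for illustration. The summary only states that an image-text similarity volume supplies spatial priors that split shared features into class-specific pathways.

```python
import numpy as np

def cost_volume(image_feats, text_embeds, eps=1e-8):
    """Cosine similarity between every spatial location and every class prompt.

    image_feats: (H, W, D) dense image features (assumed to come from a frozen CLIP image encoder)
    text_embeds: (C, D) class-name embeddings (assumed to come from the CLIP text encoder)
    returns:     (C, H, W) similarity maps, one per class
    """
    img = image_feats / (np.linalg.norm(image_feats, axis=-1, keepdims=True) + eps)
    txt = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + eps)
    return np.einsum("hwd,cd->chw", img, txt)

def decouple(region_feats, volume, temperature=0.1):
    """Split shared features into one gated copy per class (hypothetical gating rule).

    region_feats: (H, W, D) shared features
    volume:       (C, H, W) cost volume
    returns:      (C, H, W, D) class-specific pathways
    """
    # Softmax over the class axis: each location distributes its weight across categories.
    logits = volume / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prior = np.exp(logits)
    prior = prior / prior.sum(axis=0, keepdims=True)
    # Broadcast the (C, H, W) prior over the feature dimension.
    return prior[..., None] * region_feats[None, ...]

# Toy usage with random stand-ins for real CLIP features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))   # H=4, W=4, D=8
prompts = rng.normal(size=(3, 8))    # C=3 classes
pathways = decouple(feats, cost_volume(feats, prompts))
```

Because the gating weights sum to one over classes at each location, the pathways in this sketch add back up to the shared features; the actual framework may use a different decoupling rule.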

Key takeaway

For machine learning engineers building continual object detection systems, CL-CLIP offers a way to mitigate catastrophic forgetting in CLIP-based models. Cost-volume-guided category decoupling and a multi-expert RoI head preserve old-class detection ability while the model adapts to new categories. Consider combining this architectural separation with drift regularization to achieve a better stability-plasticity trade-off than traditional fine-tuning or replay methods.
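The two regularizers named in this digest, drift regularization and an orthogonality loss, can be sketched generically as follows. The exact formulations used by CL-CLIP are not given here, so both functions are assumptions: `drift_penalty` is a plain mean squared distance to a pre-update parameter snapshot, and `orthogonality_loss` penalizes squared cosine similarity between feature vectors of different classes.

```python
import numpy as np

def drift_penalty(current, reference):
    """Mean squared distance between current parameters and a pre-update snapshot.

    current, reference: dicts mapping a parameter name to an array of equal shape.
    (Hypothetical stand-in for drift regularization on FPN modules.)
    """
    total = sum(float(np.sum((current[k] - reference[k]) ** 2)) for k in reference)
    count = sum(reference[k].size for k in reference)
    return total / count

def orthogonality_loss(class_feats, eps=1e-8):
    """Mean squared off-diagonal cosine similarity between class feature vectors.

    class_feats: (C, D) array with C >= 2, one feature vector per class.
    Returns 0 when the class vectors are mutually orthogonal.
    """
    unit = class_feats / (np.linalg.norm(class_feats, axis=1, keepdims=True) + eps)
    gram = unit @ unit.T
    off_diag = gram - np.diag(np.diag(gram))  # zero out self-similarity
    c = class_feats.shape[0]
    return float(np.sum(off_diag ** 2) / (c * (c - 1)))
```

In a training loop these terms would typically be added to the detection loss with weighting coefficients; those weights are hyperparameters that this digest does not specify.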

Key insights

CLIP-based continual object detection benefits from category decoupling via a frozen image-text similarity cost volume.

Principles

Method

CL-CLIP builds a CLIP cost volume that serves as a zero-shot spatial prior and uses it to separate features into class-specific pathways. A Multi-Expert RoI head with frozen per-category experts, together with FPN drift regularization, mitigates forgetting.
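The freezing behaviour of the per-category experts can be shown with a toy example. This is a minimal sketch under stated assumptions: the class `MultiExpertHead` and its methods are hypothetical, and each expert is reduced to a single linear weight vector, whereas the paper describes convolutional experts. The point is only that updates for a new task leave experts from earlier tasks untouched.

```python
import numpy as np

class MultiExpertHead:
    """Toy head with one linear expert per category (stand-in for a convolutional expert)."""

    def __init__(self, feat_dim, seed=0):
        self.feat_dim = feat_dim
        self.rng = np.random.default_rng(seed)
        self.experts = {}    # class name -> weight vector of shape (feat_dim,)
        self.frozen = set()  # class names whose experts no longer receive updates

    def add_expert(self, name):
        if name in self.experts:
            raise ValueError(f"expert for {name!r} already exists")
        self.experts[name] = self.rng.normal(scale=0.01, size=self.feat_dim)

    def freeze(self, names):
        self.frozen.update(names)

    def step(self, grads, lr=0.1):
        """Apply a gradient step, skipping frozen experts."""
        for name, grad in grads.items():
            if name in self.frozen:
                continue
            self.experts[name] = self.experts[name] - lr * grad

    def score(self, class_feats):
        """class_feats: dict of class name -> (feat_dim,) class-specific feature."""
        return {n: float(self.experts[n] @ f) for n, f in class_feats.items()}

# Task 1 learns "cat"; task 2 adds "dog" after freezing "cat".
head = MultiExpertHead(feat_dim=4)
head.add_expert("cat")
head.step({"cat": np.ones(4)})
head.freeze(["cat"])
head.add_expert("dog")
head.step({"cat": np.ones(4), "dog": np.ones(4)})  # the "cat" gradient is ignored
```

Freezing by construction gives stability for old classes at the cost of parameters that grow with the number of categories, which is one reason a shared, regularized backbone is still needed for plasticity.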

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.