CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection
Summary
CL-CLIP is a new framework for Continual Object Detection (COD) that addresses catastrophic forgetting in CLIP-based open-vocabulary detectors. It builds a cost volume from CLIP image-text similarity to obtain zero-shot spatial priors, then uses those priors to decouple shared region features into class-specific pathways. A Multi-Expert RoI head processes these pathways: each category gets a dedicated convolutional expert that is frozen after training. Combined with drift regularization on the FPN modules and an orthogonality loss, this design reduces cross-class interference. In experiments on PASCAL VOC and MS-COCO, CL-CLIP improves on the F-ViT baseline by 7.4 to 9.6 mAP points across VOC splits (e.g., 74.7 mAP@A on 10+10) and maintains stronger performance across 4-task COCO benchmarks than existing COD methods such as IOR and MMA. It also consistently outperforms other CLIP variants such as EVA-CLIP and SigLIP2 in continual settings.
Key takeaway
For machine learning engineers building continual object detection systems, CL-CLIP offers a concrete way to mitigate catastrophic forgetting in CLIP-based models. Cost-volume-guided category decoupling and a multi-expert RoI head let you preserve detection of old classes while the model adapts to new categories. Consider combining this architectural separation with drift regularization to achieve a better stability-plasticity trade-off than traditional fine-tuning or replay methods.
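As a rough illustration of drift regularization, the sketch below penalizes the squared distance between the current FPN parameters and a snapshot taken before training on a new task. The quadratic form, the `weight` coefficient, and the parameter names (`fpn.lateral`, `fpn.output`) are assumptions made for illustration; this summary does not specify the exact regularizer CL-CLIP uses.

```python
import numpy as np

def drift_penalty(current, snapshot, weight=1.0):
    """Weighted squared L2 distance between current and snapshot parameters."""
    total = 0.0
    for name, old in snapshot.items():
        total += float(np.sum((current[name] - old) ** 2))
    return weight * total

def drift_gradient(current, snapshot, weight=1.0):
    """Gradient of drift_penalty with respect to each current parameter."""
    return {name: 2.0 * weight * (current[name] - old)
            for name, old in snapshot.items()}

# Hypothetical FPN parameters, snapshotted before the new task starts.
snapshot = {"fpn.lateral": np.ones((2, 2)), "fpn.output": np.zeros(3)}
current = {name: value.copy() for name, value in snapshot.items()}

# Simulate drift in one weight during new-task training.
current["fpn.lateral"][0, 0] += 0.5

penalty = drift_penalty(current, snapshot, weight=2.0)
grads = drift_gradient(current, snapshot, weight=2.0)
print(penalty, grads["fpn.lateral"][0, 0])
```

In training, a term like this would be added to the detection loss so that shared modules stay close to the weights that served the old classes.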
Key insights
CLIP-based continual object detection benefits from category decoupling via a frozen image-text similarity cost volume.
Principles
- Frozen CLIP priors offer stable, category-indexed spatial signals.
- Decoupling features into class-specific pathways reduces forgetting.
- Orthogonality loss minimizes cross-category spatial co-activation.
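One way to picture the orthogonality principle is a loss that is zero when per-category activation maps never overlap and grows as they co-activate at the same locations. The sketch below uses the mean squared off-diagonal cosine similarity between flattened maps. This particular formula is an assumption chosen for illustration, not necessarily the loss defined in the paper.

```python
import numpy as np

def orthogonality_loss(maps, eps=1e-8):
    """Mean squared cosine similarity between distinct category maps.

    maps: array of shape (K, H, W) with K >= 2, one spatial map per category.
    """
    k = maps.shape[0]
    flat = maps.reshape(k, -1)
    # Normalize each map so the Gram matrix holds cosine similarities.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + eps)
    gram = flat @ flat.T
    # Keep only cross-category terms by zeroing the diagonal.
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2) / (k * (k - 1)))

# Two categories active at different locations: no co-activation.
disjoint = np.array([[[1.0, 0.0], [0.0, 0.0]],
                     [[0.0, 0.0], [0.0, 1.0]]])
# Two categories active at exactly the same locations.
overlapping = np.array([[[1.0, 1.0], [0.0, 0.0]],
                        [[1.0, 1.0], [0.0, 0.0]]])

print(orthogonality_loss(disjoint), orthogonality_loss(overlapping))
```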
Method
CL-CLIP constructs a CLIP cost volume that yields zero-shot spatial priors, and uses these priors to separate features into class-specific pathways. A Multi-Expert RoI head with frozen per-category experts, together with FPN drift regularization, mitigates forgetting.
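The cost-volume step can be sketched in a few lines. The example below computes cosine similarity between every image patch embedding and every category text embedding, turns the result into per-location priors with a softmax over categories, and gates the shared region features with those priors. The random stand-in embeddings, the `temperature` value, and the multiplicative gating are all assumptions for illustration; the actual decoupling mechanism in CL-CLIP may differ.

```python
import numpy as np

def cost_volume(patch_feats, text_embs):
    """Cosine similarity of each patch to each category text embedding.

    patch_feats: (H, W, D) image patch embeddings.
    text_embs:   (K, D) category text embeddings.
    Returns an array of shape (K, H, W).
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    return np.einsum("hwd,kd->khw", p, t)

def spatial_priors(cost, temperature=0.07):
    """Softmax over the category axis, giving priors that sum to 1 per location."""
    logits = cost / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=0, keepdims=True)

def decouple(region_feats, priors):
    """Gate shared features (H, W, C) by each prior, giving (K, H, W, C)."""
    return priors[..., None] * region_feats[None]

# Random stand-ins for frozen CLIP outputs and detector features.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(4, 4, 8))
text_embs = rng.normal(size=(3, 8))
region_feats = rng.normal(size=(4, 4, 16))

cost = cost_volume(patch_feats, text_embs)
priors = spatial_priors(cost)
pathways = decouple(region_feats, priors)
print(cost.shape, priors.shape, pathways.shape)
```

Because the priors in this sketch come only from frozen embeddings, adding a category means adding one text embedding without changing the priors' dependence on any trainable weight.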
In practice
- Use CLIP cost volumes to generate class-specific spatial priors.
- Implement per-category convolutional experts that are frozen after training.
- Apply orthogonality loss to reduce spatial overlap between categories.
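The per-category expert idea from the list above can be sketched as a small registry in which each category owns its weights and frozen experts ignore later updates. Here each expert is a single 1x1 convolution (a per-location linear map over channels) followed by global average pooling. The class name `MultiExpertHead`, its methods, and the expert architecture are hypothetical and much simpler than a real RoI head.

```python
import numpy as np

class MultiExpertHead:
    """Toy multi-expert head: one 1x1-conv expert per category."""

    def __init__(self, in_channels, seed=0):
        self.in_channels = in_channels
        self.rng = np.random.default_rng(seed)
        self.experts = {}    # category name -> weight vector of shape (C,)
        self.frozen = set()  # categories whose experts no longer update

    def add_expert(self, category):
        if category in self.experts:
            raise ValueError(f"expert for {category!r} already exists")
        self.experts[category] = self.rng.normal(scale=0.01, size=self.in_channels)

    def freeze(self, categories):
        self.frozen.update(categories)

    def score(self, category, pathway_feats):
        """pathway_feats: (H, W, C). 1x1 conv, then global average pooling."""
        return float((pathway_feats @ self.experts[category]).mean())

    def sgd_step(self, grads, lr=0.1):
        """Apply gradients, skipping experts that have been frozen."""
        for category, grad in grads.items():
            if category in self.frozen:
                continue
            self.experts[category] = self.experts[category] - lr * grad

# Task 1 learns "cat"; its expert is frozen before task 2 adds "dog".
head = MultiExpertHead(in_channels=4)
head.add_expert("cat")
head.freeze(["cat"])
head.add_expert("dog")

cat_before = head.experts["cat"].copy()
head.sgd_step({"cat": np.ones(4), "dog": np.ones(4)})
print(np.array_equal(head.experts["cat"], cat_before))
```

Freezing here is enforced in the update step; in a deep learning framework the same effect would typically come from excluding the old experts' parameters from the optimizer.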
Topics
- Continual Learning
- Object Detection
- CLIP Models
- Open-Vocabulary Detection
- Catastrophic Forgetting
- Feature Decoupling
- Multi-Expert Networks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.