The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models
Summary
"The Professor" introduces a multi-teacher unsupervised prompt distillation method for compressing vision-language models (VLMs) into lightweight student models. It extends PromptKD (CVPR 2024), which used a single PromptSRC-finetuned ViT-L/14 teacher and a ViT-B/16 student. The Professor uses a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 and a zero-shot EVA-CLIP-L/14 with pre-computed logits. On four base-to-novel datasets (Caltech-101, DTD, UCF101, EuroSAT), confidence-weighted ensembling improved the average harmonic mean (HM) from 87.52 to 89.28 (+1.77 points) over single-teacher PromptKD. Equal-probability ensembling also helped, reaching 88.88 (+1.37 points). Gains were largest on domain-shifted datasets such as EuroSAT, where confidence weighting yielded a +5.78 HM increase. This suggests that complementary supervision from multiple teachers is particularly effective under domain shift.
Key takeaway
Machine learning engineers deploying vision-language models, especially under domain shift, should consider multi-teacher prompt distillation. With confidence-weighted ensembling, the reported average HM gain over single-teacher PromptKD is +1.77 points, rising to +5.78 points on domain-shifted datasets such as EuroSAT. The gains come from complementary supervision by diverse teachers, which offers a path to more efficient yet performant VLM deployments.
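For readers unfamiliar with the metric: in base-to-novel evaluation, HM is conventionally the harmonic mean of base-class and novel-class accuracy, so a student cannot score well by doing well on only one of the two splits. The sketch below shows the arithmetic. The function name and the accuracy values are hypothetical and are not taken from the paper.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical accuracies, for illustration only.
balanced = harmonic_mean(80.0, 80.0)    # equal splits: HM equals the accuracy
imbalanced = harmonic_mean(90.0, 60.0)  # HM is pulled toward the weaker split
print(balanced, imbalanced)
```

Because the harmonic mean penalizes imbalance, the second case scores below the arithmetic mean of 75.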
Key insights
Multi-teacher prompt distillation significantly improves VLM student performance, particularly under domain shift, through complementary teacher supervision.
Principles
- Multi-teacher ensembles enhance VLM distillation.
- Complementary supervision aids domain shift performance.
- Confidence weighting improves ensemble effectiveness.
Method
The Professor distills VLMs using a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 and a zero-shot EVA-CLIP-L/14 with pre-computed logits. It compares two strategies for combining the teachers' predictions: confidence-weighted ensembling and equal-probability ensembling.
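The two ensembling strategies can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the summary does not say how confidence is measured, so the sketch assumes each teacher's confidence is its maximum softmax probability for the sample. All function names and logit values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def equal_probability_ensemble(teacher_logits):
    """Average the teachers' class probabilities with equal weight."""
    probs = [softmax(logits) for logits in teacher_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]


def confidence_weighted_ensemble(teacher_logits):
    """Weight each teacher by its max class probability.

    ASSUMPTION: max softmax probability is used as the confidence proxy;
    the source does not specify the weighting scheme.
    """
    probs = [softmax(logits) for logits in teacher_logits]
    confidences = [max(p) for p in probs]
    total = sum(confidences)
    weights = [c / total for c in confidences]
    n_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs)) for c in range(n_classes)]


# Hypothetical logits for one image over three classes.
promptsrc_logits = [2.0, 1.9, 0.1]  # finetuned teacher, unsure between classes 0 and 1
eva_clip_logits = [0.2, 4.0, 0.1]   # zero-shot teacher, confident in class 1

equal_target = equal_probability_ensemble([promptsrc_logits, eva_clip_logits])
weighted_target = confidence_weighted_ensemble([promptsrc_logits, eva_clip_logits])
```

In this toy case the confidence-weighted target puts more mass on class 1 than the equal-probability target, because the more confident teacher receives the larger weight.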
In practice
- Employ multi-teacher distillation for VLM compression.
- Select diverse teachers for domain-shifted datasets.
- Apply confidence-weighted ensembling for superior gains.
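To make the practice above concrete, here is a minimal sketch of how an ensembled teacher distribution could supervise the student on an unlabeled image. It assumes a KL-divergence objective with a temperature, which is a common choice for logit distillation; the summary does not state the exact loss used by The Professor. The function names, the temperature, and the numeric values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(target_probs, student_logits, temperature=1.0):
    """KL(target || student) for a single sample.

    ASSUMPTION: KL divergence is used as the distillation objective.
    No ground-truth label is needed, only the teachers' soft target.
    """
    student_probs = softmax(student_logits, temperature)
    return sum(
        t * math.log(t / s)
        for t, s in zip(target_probs, student_probs)
        if t > 0.0
    )


# Hypothetical ensembled teacher target and student logits for one image.
ensemble_target = softmax([0.5, 3.0, 0.2])
matching_student = [0.5, 3.0, 0.2]
mismatched_student = [3.0, 0.5, 0.2]

loss_match = kl_distillation_loss(ensemble_target, matching_student)
loss_mismatch = kl_distillation_loss(ensemble_target, mismatched_student)
```

The loss is zero when the student reproduces the target distribution and grows as the student's prediction diverges from it, which is the signal a training loop would backpropagate into the student's learnable prompts.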
Topics
- Prompt Distillation
- Vision-Language Models
- Multi-Teacher Ensembles
- Model Compression
- Domain Shift
- Unsupervised Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.