The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

"The Professor" introduces a multi-teacher unsupervised prompt distillation method for compressing vision-language models (VLMs) into lightweight student models. It extends PromptKD (CVPR 2024), which used a single PromptSRC-finetuned ViT-L/14 teacher and a ViT-B/16 student. The Professor uses a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 and a zero-shot EVA-CLIP-L/14 whose logits are pre-computed. On four base-to-novel datasets (Caltech-101, DTD, UCF101, EuroSAT), both ensembling strategies outperformed single-teacher PromptKD. Confidence-weighted ensembling raised average HM from 87.52 to 89.28 (+1.77 points), and equal-probability ensembling reached 88.88 (+1.37 points). The gains were largest on domain-shifted datasets such as EuroSAT, where confidence weighting yielded a +5.78 HM increase. This suggests that complementary supervision from multiple teachers is particularly effective under domain shift.
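In base-to-novel benchmarks, HM conventionally denotes the harmonic mean of base-class and novel-class accuracy, which penalizes a model that does well on one split and poorly on the other. The sketch below shows the calculation; the function name and the accuracy values are illustrative assumptions, not figures from the paper.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2.0 * base_acc * novel_acc / total


# Hypothetical accuracies, chosen only to illustrate the metric.
hm = harmonic_mean(90.0, 60.0)
print(f"HM = {hm:.2f}")  # prints "HM = 72.00"
```

Because the harmonic mean is pulled toward the lower of the two values, a gain on the weaker split moves HM more than the same gain on the stronger split.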

Key takeaway

Machine learning engineers deploying vision-language models, especially under domain shift, should consider multi-teacher prompt distillation. With confidence-weighted ensembling, the reported average HM gain over single-teacher PromptKD is +1.77 points, rising to +5.78 points on the domain-shifted EuroSAT dataset. These gains come from complementary supervision by diverse teachers, which makes the approach a practical route to lightweight student models that retain more of the teachers' accuracy.

Key insights

Multi-teacher prompt distillation significantly improves VLM student performance, particularly under domain shift, through complementary teacher supervision.

Principles

Method

The Professor distills VLMs using a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 and a zero-shot EVA-CLIP-L/14 whose logits are pre-computed. It compares two ways of combining the teachers' predictions: confidence-weighted ensembling and equal-probability ensembling.
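The two ensembling strategies can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the summary does not say how confidence is measured, so the sketch assumes each teacher's confidence is its maximum softmax probability per sample. All function names and the temperature parameter are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def equal_probability_ensemble(logits_a, logits_b, temperature=1.0):
    """Average the two teachers' class probabilities with equal weight."""
    p_a = softmax(logits_a, temperature)
    p_b = softmax(logits_b, temperature)
    return 0.5 * (p_a + p_b)


def confidence_weighted_ensemble(logits_a, logits_b, temperature=1.0):
    """Weight each teacher per sample by its confidence.

    ASSUMPTION: confidence = max softmax probability. The source does not
    specify the confidence measure used in the paper.
    """
    p_a = softmax(logits_a, temperature)
    p_b = softmax(logits_b, temperature)
    conf_a = p_a.max(axis=-1, keepdims=True)
    conf_b = p_b.max(axis=-1, keepdims=True)
    w_a = conf_a / (conf_a + conf_b)  # per-sample weight for teacher A
    return w_a * p_a + (1.0 - w_a) * p_b


def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """Mean KL(teacher || student), a common distillation objective."""
    p_s = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0)
    kl = teacher_probs * (np.log(teacher_probs + eps) - np.log(p_s + eps))
    return float(kl.sum(axis=-1).mean())


# Toy logits for one image and three classes (illustrative values only).
finetuned_teacher = np.array([[4.0, 0.0, 0.0]])  # confident teacher
zero_shot_teacher = np.array([[0.0, 1.0, 0.0]])  # less confident teacher
student = np.array([[1.0, 0.5, 0.0]])

target_eq = equal_probability_ensemble(finetuned_teacher, zero_shot_teacher)
target_cw = confidence_weighted_ensemble(finetuned_teacher, zero_shot_teacher)
print("equal-probability target:  ", target_eq.round(3))
print("confidence-weighted target:", target_cw.round(3))
print("loss vs. weighted target:  ", distillation_loss(student, target_cw))
```

Since one teacher's logits are pre-computed, the weighting in a setup like this only needs cached arrays at training time. In the toy example the confidence-weighted target leans toward the more confident teacher, whereas the equal-probability target treats both teachers the same regardless of how certain each one is.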

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.