Layer-Specific Prompt Fusion Discovery via Differentiable Search in Vision Foundation Models
Summary
Layer-Specific Prompt Fusion Discovery is a visual prompt tuning method that targets a basic design question: how learnable prompts should fuse with image tokens in Vision Transformers (ViTs). Current approaches typically rely on a single fusion scheme, such as concatenation or addition. This research instead formulates the choice as a bi-level optimization problem, solved via differentiable architecture search, that jointly optimizes the prompts and their fusion schemes. The search space adds two novel fusion schemes, affine transformation and cross-attention, to the existing concatenation and addition. Extensive experiments on 34 datasets, including VTAB-1k, FGVC, and HTA, show consistent accuracy gains over prompt-tuning baselines. With a frozen ViT backbone, the method delivers a favorable accuracy-latency-parameter trade-off compared to VPT-Deep and recent variants, highlighting the role of hybrid fusion in utilizing ViT layer semantics.
Key takeaway
If you are a Machine Learning Engineer adapting Vision Transformers with prompt tuning, relying on a single fusion scheme such as concatenation or addition may limit performance. This research suggests that hybrid prompt fusion, discovered through differentiable search, can improve accuracy while keeping a favorable latency and parameter trade-off. Consider layer-specific fusion strategies, including affine transformation and cross-attention, to make better use of ViT layer semantics during task adaptation.
Key insights
Hybrid prompt fusion, discovered via differentiable search, yields consistent accuracy gains over prompt-tuning baselines in ViTs.
Principles
- How prompts fuse with image tokens is critical.
- Hybrid fusion schemes can utilize layer semantics.
- Differentiable search optimizes prompts and fusion.
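To make the idea of a hybrid, searchable fusion concrete, here is a minimal sketch of a softmax-weighted mixture over the four fusion schemes named in the summary, followed by a per-layer argmax that picks one scheme for each layer. Everything in it is an illustrative assumption rather than the paper's implementation: the function names (`fuse`, `mixed_fusion`, `discretize`), the projection-free attention, the use of the mean prompt for addition, and the use of the first two prompts as scale and shift for the affine case are all hypothetical simplifications.

```python
import math

FUSION_OPS = ("concat", "add", "affine", "cross_attn")

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, memory):
    """Projection-free dot-product attention (a simplification for this sketch)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        out.append([sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)])
    return out

def fuse(op, tokens, prompts):
    """Apply one candidate fusion; every branch returns len(tokens) rows."""
    dim = len(tokens[0])
    if op == "concat":
        # Prompts join the sequence; image tokens attend over the joint sequence.
        return attend(tokens, prompts + tokens)
    if op == "add":
        # Assumption: the mean prompt is broadcast-added to every token.
        mean_p = [sum(p[d] for p in prompts) / len(prompts) for d in range(dim)]
        return [[t[d] + mean_p[d] for d in range(dim)] for t in tokens]
    if op == "affine":
        # Assumption: first prompt acts as a scale offset, second as a shift.
        scale, shift = prompts[0], prompts[1]
        return [[t[d] * (1.0 + scale[d]) + shift[d] for d in range(dim)] for t in tokens]
    if op == "cross_attn":
        # Tokens query the prompts; the result is added back as a residual.
        ctx = attend(tokens, prompts)
        return [[t[d] + c[d] for d in range(dim)] for t, c in zip(tokens, ctx)]
    raise ValueError(f"unknown fusion op: {op}")

def mixed_fusion(alpha, tokens, prompts):
    """Continuous relaxation: blend all candidates with softmax(alpha)."""
    weights = softmax(alpha)
    outs = [fuse(op, tokens, prompts) for op in FUSION_OPS]
    dim = len(tokens[0])
    return [[sum(w * o[i][d] for w, o in zip(weights, outs)) for d in range(dim)]
            for i in range(len(tokens))]

def discretize(alphas):
    """After search, keep the highest-scoring fusion op for each layer."""
    return [FUSION_OPS[max(range(len(a)), key=a.__getitem__)] for a in alphas]

tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 image tokens, dim 2
prompts = [[0.1, -0.2], [0.3, 0.4]]             # 2 prompt tokens, dim 2
blended = mixed_fusion([0.0, 0.0, 0.0, 0.0], tokens, prompts)
per_layer = discretize([[0.1, 0.9, 0.0, 0.2], [2.0, 0.0, 0.0, 0.0]])
```

Because every candidate returns one row per image token, the blended output keeps the token shape and could be passed to the next layer unchanged. The per-layer logits are what make the final choice layer-specific in this sketch.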
Method
The method formulates prompt tuning as a bi-level optimization problem and solves it with differentiable architecture search, jointly optimizing the prompts and their fusion schemes. The fusion search space is expanded beyond concatenation and addition with affine transformation and cross-attention.
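One way to write this down, assuming the standard differentiable architecture search (DARTS-style) formulation, is shown below. The summary does not give the paper's exact objective, so the symbols are assumptions: $P$ denotes the prompts, $\alpha^{(l)}$ the fusion logits at layer $l$, $X^{(l)}$ the image tokens entering that layer, and $\mathcal{O}$ the set of candidate fusion schemes.

```latex
% Bi-level objective: fusion logits on the validation loss,
% prompts on the training loss (standard DARTS split, assumed here).
\min_{\alpha} \; \mathcal{L}_{\mathrm{val}}\bigl(P^{*}(\alpha), \alpha\bigr)
\quad \text{s.t.} \quad
P^{*}(\alpha) = \arg\min_{P} \; \mathcal{L}_{\mathrm{train}}(P, \alpha)

% Continuous relaxation of the per-layer fusion choice.
\bar{f}^{(l)}\bigl(X^{(l)}, P^{(l)}\bigr)
= \sum_{k \in \mathcal{O}}
\frac{\exp\bigl(\alpha^{(l)}_{k}\bigr)}{\sum_{k' \in \mathcal{O}} \exp\bigl(\alpha^{(l)}_{k'}\bigr)}
\, f_{k}\bigl(X^{(l)}, P^{(l)}\bigr),
\qquad
\mathcal{O} = \{\text{concat}, \text{add}, \text{affine}, \text{cross-attn}\}

% Discretization after search: one fusion scheme per layer.
k^{(l)} = \arg\max_{k \in \mathcal{O}} \; \alpha^{(l)}_{k}
```

Under this reading, the softmax relaxation is what makes the discrete fusion choice differentiable, and the final argmax is what yields a layer-specific scheme.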
In practice
- Explore hybrid fusion beyond concatenation/addition.
- Consider differentiable search for prompt optimization.
- Evaluate affine transformation and cross-attention.
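As a starting point for experimenting with differentiable search, the following toy sketch alternates between a prompt update on a training split and a fusion-logit update on a validation split, which is the first-order approximation commonly used for bi-level search. It is not the paper's procedure: the two scalar fusions, the `search` function, the learning rate, and the finite-difference gradients (a stand-in for autograd) are all hypothetical choices made to keep the example dependency-free.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_output(x, prompt, alpha):
    """Softmax-weighted blend of two toy scalar fusions (hypothetical stand-ins)."""
    scale, shift = prompt
    w_add, w_affine = softmax(alpha)
    return w_add * (x + shift) + w_affine * (x * (1.0 + scale) + shift)

def loss(data, prompt, alpha):
    """Mean squared error of the blended output over (x, y) pairs."""
    return sum((mixed_output(x, prompt, alpha) - y) ** 2 for x, y in data) / len(data)

def num_grad(fn, params, eps=1e-5):
    """Central finite differences; a real implementation would use autograd."""
    grads = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        down = params[:i] + [params[i] - eps] + params[i + 1:]
        grads.append((fn(up) - fn(down)) / (2 * eps))
    return grads

def search(train, val, steps=200, lr=0.1):
    prompt, alpha = [0.0, 0.0], [0.0, 0.0]
    for _ in range(steps):
        # Lower level: one gradient step on the prompt using the training split.
        g = num_grad(lambda p: loss(train, p, alpha), prompt)
        prompt = [p - lr * gi for p, gi in zip(prompt, g)]
        # Upper level: one step on the fusion logits using the validation split.
        g = num_grad(lambda a: loss(val, prompt, a), alpha)
        alpha = [a - lr * gi for a, gi in zip(alpha, g)]
    return prompt, alpha

# Toy task whose target needs a scale change, so the affine fusion should win.
train = [(x, 2.0 * x + 1.0) for x in (-1.0, 0.0, 1.0, 2.0)]
val = [(x, 2.0 * x + 1.0) for x in (-0.5, 0.5, 1.5)]
prompt, alpha = search(train, val)
chosen = ("add", "affine")[max(range(2), key=alpha.__getitem__)]
```

Keeping the two updates on separate data splits is the design choice that matters here: the fusion logits are scored on data the prompts were not fit to, which is the usual motivation for a bi-level setup.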
Topics
- Visual Prompt Tuning
- Vision Transformers
- Differentiable Architecture Search
- Prompt Fusion
- Parameter-Efficient Fine-Tuning
- Computer Vision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.