Layer-Specific Prompt Fusion Discovery via Differentiable Search in Vision Foundation Models

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

A new method for visual prompt tuning, called Layer-Specific Prompt Fusion Discovery, addresses the question of how learnable prompts should be fused with image tokens in Vision Transformers (ViTs). Existing approaches typically apply a single fusion scheme, such as concatenation or addition. This research instead formulates the choice as a bi-level optimization problem, solved via differentiable architecture search, which jointly optimizes the prompts and their fusion schemes. The search space contains four candidates: the existing concatenation and addition schemes, plus two novel ones, affine transformation and cross-attention. Experiments on 34 datasets, covering VTAB-1k, FGVC, and HTA, show consistent accuracy gains over prompt-tuning baselines. With a frozen ViT backbone, the method delivers a favorable accuracy-latency-parameter trade-off compared to VPT-Deep and recent variants, which points to the importance of hybrid fusion for exploiting the different semantics of individual ViT layers.
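As a rough illustration of what a bi-level formulation of this kind usually looks like, the block below writes it in the standard differentiable-architecture-search form. The notation is an assumption made for this sketch and is not taken from the paper: P denotes the learnable prompts, \alpha^{(l)} the fusion-selection logits of layer l, f_k the k-th candidate fusion scheme, and X^{(l)} the image tokens entering layer l. The backbone weights stay frozen and do not appear as variables.

```latex
% Outer problem: choose fusion logits on validation data.
% Inner problem: fit the prompts on training data for those logits.
\min_{\alpha} \; \mathcal{L}_{\mathrm{val}}\bigl(P^{*}(\alpha), \alpha\bigr)
\quad \text{s.t.} \quad
P^{*}(\alpha) = \arg\min_{P} \; \mathcal{L}_{\mathrm{train}}(P, \alpha)

% Continuous relaxation over the candidate fusion schemes at layer l
% (k ranges over concatenation, addition, affine, cross-attention).
\bar{f}^{(l)}\bigl(X^{(l)}, P^{(l)}\bigr)
  = \sum_{k} \frac{\exp \alpha^{(l)}_{k}}{\sum_{j} \exp \alpha^{(l)}_{j}}
    \, f_{k}\bigl(X^{(l)}, P^{(l)}\bigr)
```

Under this reading, the search ends by keeping, for each layer, the scheme with the largest logit, which is how a layer-specific hybrid of fusion schemes would emerge.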

Key takeaway

For machine learning engineers adapting Vision Transformers with prompt tuning: relying on a single fusion scheme such as concatenation or addition may limit performance. This research suggests that hybrid prompt fusion, discovered through differentiable search, can improve accuracy while keeping a favorable latency and parameter budget. Consider layer-specific fusion strategies, including affine transformation and cross-attention, to make better use of what each ViT layer encodes.

Key insights

Hybrid prompt fusion, discovered via differentiable search, consistently improves the accuracy of visual prompt tuning in ViTs over prompt-tuning baselines across the 34 evaluated datasets.

Principles

Method

Formulate prompt tuning as a bi-level optimization problem and solve it with differentiable architecture search, jointly optimizing the prompts and their fusion schemes. Expand the fusion options beyond concatenation and addition with affine transformation and cross-attention.
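The method above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the toy `frozen_layer`, and the exact form of each fusion operator (mean-pooled addition, scale and shift read from the first two prompts, image tokens as cross-attention queries) are all hypothetical choices made for the example. It shows only the relaxed forward pass and the final per-layer selection. Gradient updates of the prompts and logits would need an autodiff framework and are omitted.

```python
import numpy as np

FUSION_NAMES = ("concat", "add", "affine", "cross_attn")

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def frozen_layer(tokens):
    """Toy stand-in for a frozen ViT block (assumption): residual
    self-attention with identity projections and no learnable weights."""
    d = tokens.shape[-1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d))
    return tokens + attn @ tokens

# Each hypothetical fusion op maps image tokens x (N, D) and prompts p (P, D)
# to updated image tokens (N, D), so the four outputs can be mixed.
def fuse_concat(x, p):
    out = frozen_layer(np.concatenate([p, x], axis=0))
    return out[p.shape[0]:]                  # drop the prompt positions

def fuse_add(x, p):
    return frozen_layer(x + p.mean(axis=0))  # pooled prompt, broadcast over tokens

def fuse_affine(x, p):
    scale, shift = 1.0 + p[0], p[1]          # needs at least two prompts
    return frozen_layer(x * scale + shift)

def fuse_cross_attn(x, p):
    d = x.shape[-1]
    attn = softmax(x @ p.T / np.sqrt(d))     # queries: image tokens; keys/values: prompts
    return frozen_layer(x + attn @ p)

FUSION_OPS = (fuse_concat, fuse_add, fuse_affine, fuse_cross_attn)

def mixed_fusion(x, p, alpha):
    """Softmax-weighted sum of all candidate fusions for one layer."""
    weights = softmax(alpha)
    outputs = np.stack([op(x, p) for op in FUSION_OPS])   # (4, N, D)
    return np.tensordot(weights, outputs, axes=1), weights

def discretize(alphas):
    """After the search, keep the highest-scoring fusion scheme per layer."""
    return [FUSION_NAMES[int(np.argmax(a))] for a in alphas]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 8))       # 6 image tokens, width 8
    p = rng.normal(size=(3, 8))       # 3 prompts
    alphas = rng.normal(size=(4, 4))  # 4 layers, 4 candidate schemes each
    h = x
    for alpha in alphas:
        h, _ = mixed_fusion(h, p, alpha)
    print(h.shape, discretize(alphas))
```

During the search every layer evaluates all four candidates, which costs more than a single fusion scheme. After `discretize`, only one operator per layer remains, so the extra cost applies to the search phase alone.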

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.