DnA: Denoising Attention for Visual Tasks
Summary
DnA (Denoising Attention) is a novel approach designed to mitigate noisy attention patterns inherent in standard softmax multihead attention (MHA) within visual perception tasks. This method employs a positive query to identify correct class features and a negative query to pinpoint closely associated but irrelevant image features. DnA then projects these interactions into two distinct subspaces, enhancing separation and discriminability through larger principal angles. When integrated with a ViT-B backbone, DnA demonstrates an absolute gain of 0.8% on ImageNet-1K. Furthermore, it shows performance improvements across various visual understanding tasks, including a 1.8% gain with video transformers and a 0.5% gain with video LLMs, validating its denoising effect and design choices.
Key takeaway
For Machine Learning Engineers optimizing visual perception models, consider integrating Denoising Attention (DnA) to address noisy softmax MHA patterns. The reported absolute gains are 0.8% on ImageNet-1K with a ViT-B backbone, 1.8% with video transformers, and 0.5% with video LLMs. The approach enhances feature discriminability by explicitly separating relevant and irrelevant visual cues, which may improve your model's robustness and accuracy across visual tasks.
Key insights
DnA improves visual attention by separating relevant and irrelevant features into distinct subspaces.
Principles
- Softmax MHA can produce noisy attention patterns.
- Separating feature interactions enhances discriminability.
- Larger principal angles promote subspace separation.
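To make the last principle concrete, here is a minimal sketch of how principal angles between two subspaces can be measured. It is a standard linear-algebra computation, not code from the DnA paper; the function name `principal_angles` and the toy bases are assumptions made for illustration. Angles near 0 mean the subspaces overlap, and angles near 90 degrees mean they are well separated.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B.

    A and B are (n, k) matrices whose columns span each subspace.
    Assumes both have full column rank (illustrative helper, not from the paper).
    """
    # Orthonormal bases for each subspace via reduced QR.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Clip to guard against tiny floating-point overshoot outside [-1, 1].
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# Toy bases in R^4: span{e1, e2} versus span{e3, e4} (fully separated),
# and span{e1, e2} versus itself (fully overlapping).
I = np.eye(4)
separated = principal_angles(I[:, :2], I[:, 2:])
overlapping = principal_angles(I[:, :2], I[:, :2])
print(np.degrees(separated), np.degrees(overlapping))
```

In this reading, a design that pushes the two subspaces toward larger principal angles is pushing the first result's behaviour (angles near 90 degrees) rather than the second's.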
Method
DnA identifies relevant features with a positive query and irrelevant ones with a negative query, then projects these into two distinct, separated subspaces.
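The description above can be sketched roughly as follows. This is a hedged illustration only: the summary does not give the paper's equations, so the function `dual_query_attention`, the weight names (`Wq_pos`, `Wq_neg`, `Wk`, `Wv_pos`, `Wv_neg`), the shared key projection, and the use of separate value projections to stand in for the "two subspaces" are all assumptions. How DnA combines the two branches is not stated here, so the sketch returns them separately.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_query_attention(X, Wq_pos, Wq_neg, Wk, Wv_pos, Wv_neg):
    """Single-head attention with a positive and a negative query branch.

    X: (tokens, dim) input features.
    All weight names are hypothetical; this is not the paper's implementation.
    """
    d = Wk.shape[1]
    K = X @ Wk  # keys shared by both branches (an assumption)
    # Positive query: attention over features assumed relevant to the class.
    A_pos = softmax((X @ Wq_pos) @ K.T / np.sqrt(d))
    # Negative query: attention over closely associated but irrelevant features.
    A_neg = softmax((X @ Wq_neg) @ K.T / np.sqrt(d))
    # Each branch gets its own value projection, i.e. its own output subspace.
    out_pos = A_pos @ (X @ Wv_pos)
    out_neg = A_neg @ (X @ Wv_neg)
    return out_pos, out_neg, A_pos, A_neg

# Toy usage with random weights: 5 tokens, model dim 8, head dim 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
weights = [rng.normal(size=(8, 4)) for _ in range(5)]
out_pos, out_neg, A_pos, A_neg = dual_query_attention(X, *weights)
print(out_pos.shape, out_neg.shape)
```

Keeping the two value projections separate is what lets a training objective encourage the branches to occupy different subspaces; the sketch itself imposes no such constraint.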
In practice
- Apply DnA to ViT-B backbones.
- Improve video transformer performance.
- Enhance video LLM capabilities.
Topics
- Denoising Attention
- Multihead Attention
- Visual Perception
- Video Transformers
- ImageNet-1K
- ViT-B
- Computer Vision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.