Noise-Aware Visual Representation Learning for Medical Visual Question Answering
Summary
A noise-aware framework improves the robustness of Medical Visual Question Answering (Med-VQA) by explicitly denoising visual representations. The two-stage approach places a denoising autoencoder (DAE) between the visual encoder and the projection into a Large Language Model (LLM). In Stage 1, the DAE is pretrained to reconstruct clean visual embeddings from Gaussian-corrupted inputs, learning robust representations. In Stage 2, the DAE's robust 128-dimensional latent representations, computed from the embeddings of a frozen CLIP ViT-B/32 encoder, are projected by a 3-layer MLP into visual prefix tokens for a GPT-2 XL LLM. The LLM's base weights stay frozen, and LoRA can optionally be applied for parameter-efficient fine-tuning. Evaluated on the SLAKE and PathVQA benchmarks, the framework improved robustness under noisy conditions. For instance, on SLAKE with LoRA, average accuracy under noise rose from 0.642 to 0.735 (an absolute gain of 0.093), indicating more stable Med-VQA answer generation.
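To make the Stage 2 data flow concrete, here is a minimal sketch of projecting a 128-dimensional latent through a 3-layer MLP and reshaping the output into prefix tokens. Only the latent size (128) and the layer count (3) come from the summary. The hidden width, the ReLU activation, the number of prefix tokens, the random initialization, and the helper names `init_linear` and `project_to_prefix` are assumptions made for illustration. The token width is kept at 16 so the example runs quickly in pure Python; GPT-2 XL itself uses a hidden size of 1600.

```python
import random


def linear(x, weights, bias):
    """Compute y = W x + b for one vector x; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU (the activation choice is an assumption)."""
    return [max(0.0, v) for v in x]


def init_linear(d_in, d_out, rng):
    """Hypothetical helper: random weights and zero bias for one layer."""
    scale = 1.0 / d_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(d_in)]
               for _ in range(d_out)]
    return weights, [0.0] * d_out


def project_to_prefix(latent, layers, num_tokens, token_dim):
    """Run the MLP, then split the flat output into `num_tokens` vectors."""
    h = latent
    for i, (weights, bias) in enumerate(layers):
        h = linear(h, weights, bias)
        if i < len(layers) - 1:  # no activation after the final layer
            h = relu(h)
    if len(h) != num_tokens * token_dim:
        raise ValueError("MLP output size must equal num_tokens * token_dim")
    return [h[t * token_dim:(t + 1) * token_dim] for t in range(num_tokens)]


LATENT_DIM = 128   # from the summary
HIDDEN_DIM = 64    # assumption
NUM_TOKENS = 4     # assumption: number of visual prefix tokens
TOKEN_DIM = 16     # toy value; GPT-2 XL's hidden size is 1600

rng = random.Random(0)
mlp = [
    init_linear(LATENT_DIM, HIDDEN_DIM, rng),
    init_linear(HIDDEN_DIM, HIDDEN_DIM, rng),
    init_linear(HIDDEN_DIM, NUM_TOKENS * TOKEN_DIM, rng),
]
latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # stand-in DAE latent
prefix_tokens = project_to_prefix(latent, mlp, NUM_TOKENS, TOKEN_DIM)
print(len(prefix_tokens), len(prefix_tokens[0]))
```

In a real pipeline the prefix tokens would be prepended to the embedded question tokens before the LLM forward pass. That step is omitted here because the summary does not describe it in detail.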
Key takeaway
AI scientists developing robust Med-VQA systems should consider integrating a denoising autoencoder into the visual processing pipeline. In the reported experiments, this approach improves model resilience to noisy medical image embeddings, particularly with moderate noise injection during pretraining (e.g., Gaussian σ=0.50). The two-stage strategy is worth considering for improving model stability and accuracy, especially when working with real-world medical data prone to acquisition artifacts or variations.
Key insights
Denoising visual embeddings before LLM projection significantly improves Med-VQA robustness against input noise.
Principles
- Explicitly denoising visual embeddings enhances Med-VQA stability.
- Moderate noise during DAE pretraining optimizes robustness.
- Decoupling denoising from generative alignment is effective.
Method
A two-stage process: first, pretrain a denoising autoencoder with a Smooth L1 objective to reconstruct clean visual embeddings from Gaussian-corrupted inputs; then, project the robust latent representations into LLM prefix tokens for VQA.
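The Stage 1 objective can be sketched in a few lines: corrupt a clean embedding with Gaussian noise, then score a reconstruction against the clean target with Smooth L1. This is a minimal illustration, not the authors' implementation. The function names, the `beta` threshold of 1.0, and the 512-element stand-in embedding are assumptions, and the "reconstruction" here is just the noisy input, so the printed loss is the baseline a trained DAE would need to beat.

```python
import random


def add_gaussian_noise(embedding, sigma=0.50, rng=None):
    """Return a copy of `embedding` with i.i.d. N(0, sigma^2) noise added."""
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, sigma) for x in embedding]


def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic for small errors, linear for large ones."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)


rng = random.Random(0)
clean = [rng.uniform(-1.0, 1.0) for _ in range(512)]  # stand-in visual embedding
noisy = add_gaussian_noise(clean, sigma=0.50, rng=rng)

# A trained DAE would map `noisy` to a reconstruction; using `noisy` itself
# gives the loss of doing no denoising at all.
baseline_loss = smooth_l1(noisy, clean)
print(f"Smooth L1 with no denoising: {baseline_loss:.4f}")
```

Smooth L1 penalizes small residuals quadratically and large ones linearly, which makes the reconstruction target less sensitive to occasional large noise draws than a plain squared error would be.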
In practice
- Implement a DAE for visual feature preprocessing in VQA.
- Use Gaussian noise with σ=0.50 for DAE pretraining.
- Apply LoRA for efficient LLM adaptation in Med-VQA.
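To give a sense of why LoRA counts as parameter-efficient, the sketch below counts the trainable parameters a low-rank adapter adds to a single weight matrix. The rank of 8 and the choice of GPT-2 XL's fused attention projection (1600 inputs, 4800 outputs) as the adapted matrix are assumptions for illustration; the summary does not state which modules or rank were used.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns a low-rank update B @ A, with A of shape (rank, d_in) and
    B of shape (d_out, rank), so it adds rank * (d_in + d_out) parameters.
    """
    if rank <= 0:
        raise ValueError("rank must be positive")
    return rank * (d_in + d_out)


def full_param_count(d_in, d_out):
    """Parameters in the frozen dense weight matrix itself (bias excluded)."""
    return d_in * d_out


D_IN, D_OUT = 1600, 4800  # GPT-2 XL fused query/key/value projection
RANK = 8                  # assumption: not stated in the summary

added = lora_param_count(D_IN, D_OUT, RANK)
full = full_param_count(D_IN, D_OUT)
print(f"LoRA adds {added} trainable parameters next to {full} frozen ones "
      f"({added / full:.2%})")
```

Because the base matrix stays frozen, only the two small factors receive gradients, which is what keeps adaptation cheap relative to full fine-tuning.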
Topics
- Medical Visual Question Answering
- Denoising Autoencoders
- Visual Representation Learning
- Large Language Models
- Parameter-Efficient Fine-Tuning
- CLIP Encoder
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.