Noise-Aware Visual Representation Learning for Medical Visual Question Answering

2024-08-05 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, extended

Summary

A new noise-aware framework improves Medical Visual Question Answering (Med-VQA) by denoising visual representations before they reach the language model. The two-stage approach inserts a denoising autoencoder (DAE) between the visual encoder and a Large Language Model (LLM). In Stage 1, the DAE is pretrained to reconstruct clean visual embeddings from Gaussian-corrupted inputs, which yields noise-robust latent representations. In Stage 2, the DAE's 128-dimensional latent representations of embeddings from a frozen CLIP ViT-B/32 encoder are projected by a 3-layer MLP into visual prefix tokens for a frozen GPT-2 XL LLM, optionally with LoRA for parameter-efficient fine-tuning. On the SLAKE and PathVQA benchmarks, the framework improved robustness under noisy conditions. On SLAKE with LoRA, for instance, average accuracy under noise rose from 0.642 to 0.735 (an absolute gain of 0.093), indicating more stable Med-VQA answer generation.
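The Stage 2 data flow described above can be sketched at the level of tensor shapes. This is a minimal NumPy illustration with random weights, not the authors' implementation. The 512-dimensional CLIP ViT-B/32 embedding size and the 1600-dimensional GPT-2 XL hidden size come from the public model configurations. The prefix length, the MLP hidden width, the activation functions, and the single-layer stand-in for the DAE encoder are assumptions made only for this example.

```python
import numpy as np

CLIP_DIM = 512      # CLIP ViT-B/32 image embedding size
LATENT_DIM = 128    # DAE latent size stated in the summary
GPT2_XL_DIM = 1600  # GPT-2 XL hidden size
PREFIX_LEN = 8      # assumption: number of visual prefix tokens
HIDDEN_DIM = 512    # assumption: MLP hidden width

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Return (weight, bias) for a randomly initialised dense layer."""
    return rng.normal(0.0, 0.02, (in_dim, out_dim)), np.zeros(out_dim)

def relu(x):
    return np.maximum(x, 0.0)

# Stand-in for the pretrained DAE encoder (a single layer here, for brevity).
enc_w, enc_b = linear(CLIP_DIM, LATENT_DIM)

# 3-layer MLP mapping one latent vector to PREFIX_LEN token embeddings.
mlp = [
    linear(LATENT_DIM, HIDDEN_DIM),
    linear(HIDDEN_DIM, HIDDEN_DIM),
    linear(HIDDEN_DIM, PREFIX_LEN * GPT2_XL_DIM),
]

def visual_prefix(clip_embedding):
    """Map (batch, 512) CLIP embeddings to (batch, PREFIX_LEN, 1600) prefix tokens."""
    h = np.tanh(clip_embedding @ enc_w + enc_b)  # latent, shape (batch, 128)
    for i, (w, b) in enumerate(mlp):
        h = h @ w + b
        if i < len(mlp) - 1:  # assumption: ReLU between layers, none after the last
            h = relu(h)
    return h.reshape(-1, PREFIX_LEN, GPT2_XL_DIM)

batch = rng.normal(size=(4, CLIP_DIM))  # placeholder for frozen CLIP outputs
prefix = visual_prefix(batch)
print(prefix.shape)  # (4, 8, 1600)
```

In a full system these prefix tokens would be concatenated in front of the question's token embeddings before being fed to the language model.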

Key takeaway

AI scientists developing robust Med-VQA systems should consider integrating a denoising autoencoder into the visual processing pipeline. In the reported experiments, this approach, particularly with moderate noise injection during pretraining (e.g., Gaussian σ=0.50), improves model resilience to noisy medical image embeddings. The two-stage strategy can enhance the stability and accuracy of models, especially when working with real-world medical data prone to acquisition artifacts or variations.

Key insights

Denoising visual embeddings before LLM projection significantly improves Med-VQA robustness against input noise.

Principles

Explicitly denoising visual embeddings enhances Med-VQA stability.
Moderate noise during DAE pretraining optimizes robustness.
Decoupling denoising from generative alignment is effective.

Method

A two-stage process: first, pretrain a denoising autoencoder with a Smooth L1 objective to reconstruct clean visual embeddings from Gaussian-corrupted inputs; then, project the robust latent representations into LLM prefix tokens for VQA.
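Stage 1 of the method above can be illustrated with a small NumPy sketch: corrupt the input with Gaussian noise, reconstruct it, and score the reconstruction against the clean embedding with Smooth L1. Everything apart from the Gaussian corruption, the σ=0.50 setting, and the Smooth L1 objective is an assumption for illustration. That includes the one-hidden-layer `TinyDAE` architecture, the synthetic data, the dimensions, and the plain gradient-descent update. The paper's actual architecture and optimizer are not given in this summary.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic below beta, linear above it."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()

def smooth_l1_grad(pred, target, beta=1.0):
    """Gradient of the mean Smooth L1 loss with respect to pred."""
    return np.clip((pred - target) / beta, -1.0, 1.0) / pred.size

class TinyDAE:
    """Hypothetical one-hidden-layer DAE: tanh encoder, linear decoder."""

    def __init__(self, in_dim, latent_dim, rng):
        self.w_enc = rng.normal(0.0, 0.1, (in_dim, latent_dim))
        self.w_dec = rng.normal(0.0, 0.1, (latent_dim, in_dim))

    def encode(self, x):
        return np.tanh(x @ self.w_enc)

    def train_step(self, clean, sigma, lr, rng):
        noisy = clean + rng.normal(0.0, sigma, clean.shape)  # Gaussian corruption
        z = self.encode(noisy)
        recon = z @ self.w_dec
        loss = smooth_l1(recon, clean)  # the target is the CLEAN embedding
        g_recon = smooth_l1_grad(recon, clean)
        g_dec = z.T @ g_recon
        g_pre = (g_recon @ self.w_dec.T) * (1.0 - z**2)  # backprop through tanh
        g_enc = noisy.T @ g_pre
        self.w_enc -= lr * g_enc
        self.w_dec -= lr * g_dec
        return loss

rng = np.random.default_rng(0)
# Synthetic stand-in for clean visual embeddings (low-rank, so they compress well).
clean = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 16))
dae = TinyDAE(in_dim=16, latent_dim=8, rng=rng)
losses = [dae.train_step(clean, sigma=0.50, lr=0.5, rng=rng) for _ in range(500)]
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

After pretraining, only `encode` would be needed downstream: its output plays the role of the robust latent representation that Stage 2 projects into prefix tokens.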

In practice

Implement a DAE for visual feature preprocessing in VQA.
Use Gaussian noise with σ=0.50 for DAE pretraining.
Apply LoRA for efficient LLM adaptation in Med-VQA.
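For the LoRA item above, the sketch below shows the general low-rank adaptation idea on a single weight matrix: the pretrained weight stays frozen and only two small matrices are trained. It is not tied to any particular library, and the summary does not state which GPT-2 XL layers are adapted or with what rank, so the 1600×1600 layer, `rank`, and `alpha` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 1600, 1600  # assumption: a square layer at GPT-2 XL hidden size
rank, alpha = 8, 16       # assumption: illustrative LoRA hyperparameters

w_frozen = rng.normal(0.0, 0.02, (d_in, d_out))  # pretrained weight, never updated
lora_a = rng.normal(0.0, 0.01, (d_in, rank))     # trainable down-projection
lora_b = np.zeros((rank, d_out))                 # trainable up-projection, zero-init

def lora_forward(x):
    """Frozen path plus scaled low-rank update: x W + (alpha / rank) * x A B."""
    return x @ w_frozen + (alpha / rank) * (x @ lora_a) @ lora_b

x = rng.normal(size=(2, d_in))
# With B initialised to zero, the adapted layer starts identical to the frozen one.
assert np.allclose(lora_forward(x), x @ w_frozen)

trainable = lora_a.size + lora_b.size
print(f"trainable fraction: {trainable / w_frozen.size:.4f}")  # 0.0100
```

With these example settings the adapter trains about 1% as many parameters as the full matrix, which is the sense in which the adaptation is parameter-efficient.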

Topics

Medical Visual Question Answering
Denoising Autoencoders
Visual Representation Learning
Large Language Models
Parameter-Efficient Fine-Tuning
CLIP Encoder

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.