Noise-Aware Visual Representation Learning for Medical Visual Question Answering

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Health & Medical Research) · Depth: Expert, extended

Summary

A noise-aware framework improves Medical Visual Question Answering (Med-VQA) by denoising visual representations before they reach the language model. The two-stage approach places a denoising autoencoder (DAE) between a frozen CLIP ViT-B/32 image encoder and a frozen GPT-2 XL Large Language Model (LLM). In Stage 1, the DAE is pretrained to reconstruct clean visual embeddings from Gaussian-corrupted inputs, which yields robust 128-dimensional latent representations. In Stage 2, a 3-layer MLP projects these latents into visual prefix tokens for GPT-2 XL, optionally with LoRA for parameter-efficient fine-tuning. On the SLAKE and PathVQA benchmarks, the framework improved robustness under noisy conditions: on SLAKE with LoRA, for instance, average accuracy under noise rose from 0.642 to 0.735, indicating more stable Med-VQA answer generation.
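To make the Stage 2 shapes concrete, the following is a minimal pure-Python sketch of a 3-layer MLP that maps one 128-dimensional latent to a handful of prefix tokens. The latent size and the 1600-wide GPT-2 XL token embedding are taken as given; the prefix length, hidden width, tanh activations, and all function names are assumptions made for illustration and may differ from the paper's configuration.

```python
import math
import random

LATENT_DIM = 128      # DAE latent size, from the summary
GPT2_XL_DIM = 1600    # GPT-2 XL token embedding width
PREFIX_LEN = 4        # assumed number of visual prefix tokens (not given in the summary)
HIDDEN_DIM = 64       # assumed hidden width, kept small so the sketch runs quickly


def init_linear(n_in, n_out, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, layer):
    """Apply one dense layer to a single input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def project_to_prefix(latent, layers):
    """Map one latent vector to PREFIX_LEN tokens of width GPT2_XL_DIM."""
    h = latent
    for i, layer in enumerate(layers):
        h = linear(h, layer)
        if i < len(layers) - 1:  # tanh between layers is an assumption
            h = [math.tanh(v) for v in h]
    # Split the flat output into PREFIX_LEN consecutive token embeddings.
    return [h[t * GPT2_XL_DIM:(t + 1) * GPT2_XL_DIM] for t in range(PREFIX_LEN)]


rng = random.Random(0)
mlp = [
    init_linear(LATENT_DIM, HIDDEN_DIM, rng),
    init_linear(HIDDEN_DIM, HIDDEN_DIM, rng),
    init_linear(HIDDEN_DIM, PREFIX_LEN * GPT2_XL_DIM, rng),
]
latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # stand-in for a DAE latent
prefix_tokens = project_to_prefix(latent, mlp)
print(len(prefix_tokens), len(prefix_tokens[0]))  # prints "4 1600"
```

In a real pipeline the prefix tokens would be prepended to the question's token embeddings before the frozen LLM runs; only the MLP (and any LoRA adapters) would receive gradients.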

Key takeaway

AI scientists building robust Med-VQA systems should consider integrating a denoising autoencoder into the visual processing pipeline. In the reported experiments, moderate noise injection during pretraining (e.g., Gaussian σ=0.50) improves resilience to noisy medical image embeddings. The two-stage strategy is most relevant when working with real-world medical data prone to acquisition artifacts or variations.

Key insights

Denoising visual embeddings before LLM projection significantly improves Med-VQA robustness against input noise.
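One way to quantify robustness against input noise is to corrupt the visual embeddings at several noise levels and average the resulting accuracies. The sketch below assumes that protocol; the source does not spell out how "average accuracy under noise" is computed. The `predict` callable, the toy examples, and the σ grid (apart from 0.50) are hypothetical stand-ins.

```python
import random


def accuracy_under_noise(predict, examples, sigma, rng):
    """Fraction of (embedding, answer) pairs answered correctly after corruption."""
    correct = 0
    for embedding, answer in examples:
        noisy = [v + rng.gauss(0.0, sigma) for v in embedding]
        correct += predict(noisy) == answer
    return correct / len(examples)


def average_accuracy(predict, examples, sigmas, seed=0):
    """Mean accuracy across noise levels (assumed aggregation, see lead-in)."""
    rng = random.Random(seed)
    scores = [accuracy_under_noise(predict, examples, s, rng) for s in sigmas]
    return sum(scores) / len(scores)


# Toy stand-ins: a 2-D "embedding" and a threshold rule instead of a real Med-VQA model.
toy_examples = [([1.0, 0.0], "yes"), ([-1.0, 0.0], "no")] * 50
toy_predict = lambda emb: "yes" if emb[0] > 0 else "no"

clean_acc = average_accuracy(toy_predict, toy_examples, sigmas=[0.0])
noisy_acc = average_accuracy(toy_predict, toy_examples, sigmas=[0.25, 0.50, 1.00])
print(clean_acc, noisy_acc)
```

Running the same function on a baseline and on a DAE-equipped model with an identical seed and σ grid gives directly comparable numbers.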

Principles

Method

A two-stage process: first, pretrain a denoising autoencoder with a Smooth L1 objective to reconstruct clean visual embeddings from Gaussian-corrupted inputs; then, project the robust latent representations into LLM prefix tokens for VQA.
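The Stage 1 objective can be sketched in a few lines: corrupt a clean embedding with Gaussian noise, pass it through the autoencoder, and score the reconstruction against the clean target with Smooth L1. This is a pure-Python illustration, not the authors' implementation; the function names, the `beta=1.0` threshold, the 512-wide toy embedding, and the identity function standing in for the DAE are all assumptions.

```python
import random


def add_gaussian_noise(embedding, sigma, rng):
    """Return a copy of the embedding with i.i.d. N(0, sigma^2) noise added."""
    return [v + rng.gauss(0.0, sigma) for v in embedding]


def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def dae_loss(dae, clean, sigma, rng):
    """Denoising objective: reconstruct the clean embedding from its noisy version."""
    noisy = add_gaussian_noise(clean, sigma, rng)
    return smooth_l1(dae(noisy), clean)


rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(512)]  # toy stand-in for an image embedding
identity = lambda x: x  # placeholder "DAE" that performs no denoising at all
loss = dae_loss(identity, clean, sigma=0.50, rng=rng)
print(loss)  # loss of the untrained placeholder; a trained DAE should score lower
```

The loss compares against the clean embedding rather than the noisy input, which is what forces the encoder to discard the injected noise instead of copying it.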

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.