Hellinger Multimodal Variational Autoencoders

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, extended

Summary

Multimodal variational autoencoders (VAEs) are crucial for weakly supervised generative learning. This work introduces HELVAE, a multimodal VAE built on Hellinger aggregation, which is derived from Hölder pooling with α=0.5. HELVAE avoids sub-sampling during training, a step that is common in existing models such as MMVAE and that limits generative quality. Empirically, HELVAE learns more expressive latent representations, performs better across modalities, and achieves a better trade-off between generative coherence and generative quality. It outperforms leading multimodal VAE models on the benchmark datasets PolyMNIST (five modalities), CUB Image-Captions, and bimodal CelebA, while also being more computationally efficient.

Key takeaway

For machine learning engineers developing multimodal generative models, HELVAE offers an alternative to existing multimodal VAEs. Its Hellinger aggregation avoids sub-sampling and, in the reported experiments, gives a better trade-off between generative coherence and quality, especially with multiple modalities. Consider HELVAE or its MoHELVAE variant when you need more expressive latent representations and more semantically consistent cross-modal generation, particularly where balancing quality and coherence is critical.

Key insights

HELVAE uses Hellinger aggregation from Hölder pooling (α=0.5) to improve multimodal VAE coherence and quality without sub-sampling.
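For orientation, the pooling family named above can be written out. This is a sketch that assumes Hölder pooling means the weighted power mean of the unimodal posterior densities; the summary does not spell out the definition, and the symbols q_m, w_m, and M are notation introduced here, not taken from the source.

```latex
% Assumed notation (not from the source):
%   q_m(z | x_m)  m-th unimodal posterior,  m = 1, ..., M
%   w_m >= 0 with \sum_m w_m = 1  pooling weights
\[
  q_\alpha(z \mid x_{1:M}) \;\propto\;
  \Bigl( \sum_{m=1}^{M} w_m \, q_m(z \mid x_m)^{\alpha} \Bigr)^{1/\alpha}
\]
% The case \alpha = 1/2, expanded into pairwise terms:
\[
  q_{1/2}(z \mid x_{1:M}) \;\propto\;
  \Bigl( \sum_{m=1}^{M} w_m \sqrt{q_m(z \mid x_m)} \Bigr)^{2}
  \;=\; \sum_{m=1}^{M} \sum_{n=1}^{M} w_m w_n
        \sqrt{q_m(z \mid x_m)\, q_n(z \mid x_n)}
\]
```

Under this reading, α=1 gives a weighted mixture of the unimodal posteriors and α→0 gives their weighted geometric mean (a product-style aggregate), so α=0.5 sits between the two.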

Principles

Method

HELVAE aggregates the unimodal Gaussian posteriors by Hellinger aggregation: a moment-matching approximation of Hölder pooling with α=0.5 that projects the pooled density onto a diagonal Gaussian. This aggregation avoids sub-sampling during training.
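The aggregation described above can be sketched numerically. The sketch assumes that the α=0.5 pooled density is proportional to the square of a weighted sum of square-rooted Gaussian densities. Under that assumption it expands into a mixture of M² Gaussian terms, each scaled by a pairwise Bhattacharyya coefficient, and matching the first two moments gives a diagonal Gaussian. The function name `hellinger_pool`, the uniform default weights, and the use of NumPy are illustrative choices, not the authors' implementation.

```python
import numpy as np

def hellinger_pool(mu, var, weights=None):
    """Moment-matched diagonal Gaussian for alpha = 0.5 pooling (illustrative sketch).

    mu, var: arrays of shape (M, D) holding the means and variances of
             M diagonal-Gaussian unimodal posteriors over a D-dim latent.
    weights: optional positive pooling weights of shape (M,), assumed to sum to 1.
    Returns the pooled mean and variance, each of shape (D,).
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    num_modalities = mu.shape[0]
    if weights is None:
        weights = np.full(num_modalities, 1.0 / num_modalities)
    weights = np.asarray(weights, dtype=float)

    # Broadcast to all ordered pairs (i, j): shape (M, M, D).
    mu_i, mu_j = mu[:, None, :], mu[None, :, :]
    var_i, var_j = var[:, None, :], var[None, :, :]
    var_sum = var_i + var_j

    # sqrt(N_i * N_j) is an unnormalised Gaussian with these parameters.
    pair_var = 2.0 * var_i * var_j / var_sum
    pair_mean = (mu_i * var_j + mu_j * var_i) / var_sum

    # Log Bhattacharyya coefficient: the mass of sqrt(N_i * N_j),
    # summed over latent dimensions because the Gaussians are diagonal.
    log_bc = (
        0.5 * np.log(2.0)
        + 0.25 * (np.log(var_i) + np.log(var_j))
        - 0.5 * np.log(var_sum)
        - (mu_i - mu_j) ** 2 / (4.0 * var_sum)
    ).sum(axis=-1)

    # Normalised mixture weights over the M * M pair components.
    log_pi = np.log(weights)[:, None] + np.log(weights)[None, :] + log_bc
    pi = np.exp(log_pi - log_pi.max())
    pi /= pi.sum()

    # Match the first two moments of the mixture, per latent dimension.
    pooled_mean = np.einsum("ij,ijd->d", pi, pair_mean)
    second_moment = np.einsum("ij,ijd->d", pi, pair_var + pair_mean ** 2)
    pooled_var = second_moment - pooled_mean ** 2
    return pooled_mean, pooled_var

# Two hypothetical unimodal posteriors over a 1-D latent that disagree on the mean.
pooled_mean, pooled_var = hellinger_pool(mu=[[-1.0], [1.0]], var=[[1.0], [1.0]])
```

When the unimodal posteriors agree, the result reduces to the shared Gaussian. When they disagree, the cross terms are down-weighted by the Bhattacharyya coefficient and the pooled variance widens to cover the disagreement. Every modality contributes to the aggregate in one pass, with no subset of modalities drawn at random.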

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.