Hellinger Multimodal Variational Autoencoders

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Multimodal variational autoencoders (VAEs) are a central tool for weakly supervised generative learning. This work introduces HELVAE, a multimodal VAE built on Hellinger aggregation, which is derived from Hölder pooling with α=0.5. HELVAE avoids sub-sampling during training, a practice in existing models such as MMVAE that limits generative quality. Empirically, HELVAE learns more expressive latent representations, improves performance across modalities, and achieves a better trade-off between generative coherence and generative quality. It outperforms leading multimodal VAEs on PolyMNIST (five modalities), CUB Image-Captions, and bimodal CelebA, while also being more computationally efficient.

Key takeaway

For machine learning engineers building multimodal generative models, HELVAE is an alternative to existing multimodal VAEs. Because its Hellinger aggregation avoids sub-sampling, it delivers better generative coherence and quality, especially with several modalities. Consider HELVAE or its MoHELVAE variant when you need better latent representations and more semantically consistent cross-modal generation, particularly where the balance between quality and coherence is critical.

Key insights

HELVAE uses Hellinger aggregation from Hölder pooling (α=0.5) to improve multimodal VAE coherence and quality without sub-sampling.

Principles

Hölder pooling with α=0.5 induces soft dependencies between experts.
Avoiding sub-sampling improves multimodal VAE generative quality.
Probabilistic opinion pooling generalizes PoE and MoE.
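
The principles above can be made concrete with a worked formula. This is a sketch that assumes the standard power-mean (Hölder mean) form of opinion pooling. The symbols p_i (unimodal posteriors), w_i (nonnegative weights summing to one), and M (number of modalities) are notation introduced here for illustration and are not taken from the summary.

```latex
% Hölder (power-mean) pooling of M expert densities, assumed form
\[
  p_\alpha(z) \;\propto\; \Bigl( \sum_{i=1}^{M} w_i \, p_i(z)^{\alpha} \Bigr)^{1/\alpha},
  \qquad w_i \ge 0, \quad \sum_{i=1}^{M} w_i = 1 .
\]

% alpha = 1: linear pool, i.e. a mixture of experts (MoE)
\[
  p_1(z) = \sum_{i=1}^{M} w_i \, p_i(z)
\]

% alpha -> 0: log-linear pool, i.e. a weighted product of experts (PoE)
\[
  \lim_{\alpha \to 0} p_\alpha(z) \;\propto\; \prod_{i=1}^{M} p_i(z)^{w_i}
\]

% alpha = 1/2: the case the summary calls Hellinger aggregation
\[
  p_{1/2}(z) \;\propto\; \Bigl( \sum_{i=1}^{M} w_i \sqrt{p_i(z)} \Bigr)^{2}
  = \sum_{i=1}^{M} \sum_{j=1}^{M} w_i \, w_j \sqrt{p_i(z)\, p_j(z)}
\]
```

Under this reading, the cross terms \sqrt{p_i p_j} in the α=0.5 case are one plausible source of the soft dependencies between experts: each pair of experts contributes in proportion to how much their densities overlap.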

Method

HELVAE aggregates unimodal Gaussian posteriors using Hellinger aggregation, a moment-matching approximation of Hölder pooling with α=0.5, projecting the pooled density onto a diagonal Gaussian. This avoids sub-sampling.
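
A minimal NumPy sketch of one way such a moment-matched aggregation could be computed, not the authors' implementation. It assumes the power-mean form of Hölder pooling, in which the α=0.5 pool is proportional to the square of the weighted sum of square-root densities. For diagonal Gaussian experts that square expands into a mixture of M² Gaussians, one per expert pair, weighted by the pair's Bhattacharyya coefficient. The function name `hellinger_aggregate`, the uniform default weights, and the omission of a prior expert are all assumptions made for illustration.

```python
import numpy as np

def hellinger_aggregate(mus, logvars, weights=None):
    """Moment-match the alpha=0.5 Hoelder pool of M diagonal Gaussians.

    Hypothetical helper: mus and logvars have shape (M, D) for one sample.
    Returns the mean and variance, each of shape (D,), of the diagonal
    Gaussian whose first two moments match the pooled density.
    """
    mus = np.asarray(mus, dtype=float)
    logvars = np.asarray(logvars, dtype=float)
    var = np.exp(logvars)
    M = mus.shape[0]
    # Assumption: uniform expert weights unless given explicitly.
    w = np.full(M, 1.0 / M) if weights is None else np.asarray(weights, dtype=float)

    # Pairwise views with shape (M, M, D) via broadcasting.
    vi, vj = var[:, None, :], var[None, :, :]
    mi, mj = mus[:, None, :], mus[None, :, :]
    v_sum = vi + vj

    # sqrt(p_i * p_j) is an unnormalized Gaussian with these parameters.
    comp_var = 2.0 * vi * vj / v_sum
    comp_mu = (mi * vj + mj * vi) / v_sum

    # Its total mass is the Bhattacharyya coefficient (log form, per dim).
    log_bc = (
        0.5 * (np.log(2.0)
               + 0.5 * (logvars[:, None, :] + logvars[None, :, :])
               - np.log(v_sum))
        - (mi - mj) ** 2 / (4.0 * v_sum)
    )

    # Mixture weights pi_ij proportional to w_i * w_j * BC_ij, normalized stably.
    log_pi = np.log(w)[:, None] + np.log(w)[None, :] + log_bc.sum(axis=-1)
    pi = np.exp(log_pi - log_pi.max())
    pi /= pi.sum()

    # First and second moments of the M^2-component mixture.
    mean = np.einsum("ij,ijd->d", pi, comp_mu)
    second = np.einsum("ij,ijd->d", pi, comp_var + comp_mu ** 2)
    return mean, second - mean ** 2
```

When all experts agree, the sketch returns that shared Gaussian unchanged. When they disagree, the matched variance grows to cover the spread between expert means. The cost is O(M²D) per sample, and every modality enters the aggregate on each call. In a training loop the same arithmetic would be written with a differentiable array library so gradients reach the encoders.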

In practice

Apply Hellinger aggregation for robust multimodal posterior approximation.
Consider MoHELVAE for enhanced coherence with small modality counts.

Topics

Multimodal VAEs
Hellinger Aggregation
Hölder Pooling
Generative Models
Latent Representations
Probabilistic Opinion Pooling

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.