GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

GeMoE (Gating Entropy-based Uncertainty-aware Adaptive Routing) is a dynamic routing strategy for Mixture of Experts (MoE)-based Large Vision-Language Models (LVLMs). Traditional MoE architectures typically use static Top-k routing, which activates a fixed number of experts per token, cannot adapt to input variation, and therefore uses resources inefficiently. GeMoE reframes token routing as an information encoding task, specifically a Minimum Description Length (MDL) problem, by establishing a connection between MDL and gating entropy. The method explicitly models the trade-off between model complexity and performance: gating entropy serves as a measure of token complexity, and that measure determines how many experts each token engages. Across various backbones and benchmarks, GeMoE retains 99.5% of static-routing performance on average while improving average expert activation sparsity by 36.5%.
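For readers who want the quantity written out, the block below gives the standard Shannon entropy of a gate's output distribution. The symbols (N experts, gate logits g_i(x), probabilities p_i(x)) are notation assumed here for illustration; this summary does not state the paper's exact formulation or how it derives the MDL connection.

```latex
% Gate probabilities for token x over N experts (softmax of gate logits g_i)
p_i(x) = \frac{\exp\big(g_i(x)\big)}{\sum_{j=1}^{N} \exp\big(g_j(x)\big)},
\qquad i = 1, \dots, N

% Gating entropy: low when the gate is confident, high when it is uncertain
H(x) = -\sum_{i=1}^{N} p_i(x) \log p_i(x),
\qquad 0 \le H(x) \le \log N
```

By Shannon's source coding theorem, H(x) is also the minimum expected code length (in nats) needed to encode an expert choice drawn from p(x), which is one natural reading of why entropy and description length can be linked.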

Key takeaway

For Machine Learning Engineers optimizing Large Vision-Language Models, GeMoE targets resource efficiency. If you currently use static Top-k routing in an MoE architecture, consider evaluating GeMoE's dynamic, uncertainty-aware routing. The reported results are 99.5% average performance retention together with a 36.5% improvement in expert activation sparsity, which could reduce inference cost and computational overhead with little loss in model quality.

Key insights

GeMoE adaptively routes tokens in MoE models by using gating entropy to manage complexity and uncertainty.

Principles

Method

GeMoE frames dynamic routing as an MDL problem, using gating entropy to assess token complexity and adaptively determine the number of experts each token engages.
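The idea of entropy-driven expert counts can be sketched in a few lines of plain Python. This is a minimal illustration and not GeMoE's actual rule: the mapping from entropy to expert count (rounding the perplexity exp(H)), the cap `k_max`, and all function names are assumptions made here, since this summary does not specify the paper's formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gating_entropy(probs):
    """Shannon entropy (in nats) of the gate distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_num_experts(probs, k_max):
    """Hypothetical rule (an assumption, not taken from the paper):
    use the rounded perplexity exp(H) as the number of experts,
    clamped to the range [1, k_max]."""
    k = round(math.exp(gating_entropy(probs)))
    return max(1, min(k_max, k))


def route_token(logits, k_max=4):
    """Return {expert_index: weight} for a single token."""
    probs = softmax(logits)
    k = adaptive_num_experts(probs, k_max)
    # Keep the k most probable experts and renormalize their weights.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


confident = route_token([10.0, 0.0, 0.0, 0.0])  # peaked gate, low entropy
uncertain = route_token([0.0, 0.0, 0.0, 0.0])   # uniform gate, maximal entropy
print(len(confident), len(uncertain))
```

Under this assumed rule, a token with a sharply peaked gate distribution is sent to a single expert, while a token with a flat distribution is sent to up to `k_max` experts. Static Top-k routing would use the same count for both.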

Topics

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.