CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts
Summary
CausalMoE is a billion-scale multimodal foundation model for Granger Causal Discovery (GCD) in time series. It targets a weakness of "one-size-fits-all" neural methods, which struggle under distribution shift. Its Pattern-Routed Mixture of Heterogeneous Experts (MoHE) identifies latent temporal patterns and routes each time-series patch to a specialized domain expert, decoupling regime-specific mechanisms. Large Language Models (LLMs) and Vision-Language Models (VLMs) align the numerical signals with textual and visual priors, which regularizes causal estimation. A Causality-Aware Self-Attention mechanism, trained with proximal optimization, recovers sparse, interpretable causal graphs. The reported experiments show state-of-the-art performance on the VAR, Lorenz-96, fMRI, DREAM-3, and DREAM-4 benchmarks, with strong generalization, especially in few-shot settings where traditional methods fail.
Key takeaway
For Machine Learning Engineers building causal inference systems, CausalMoE is a candidate approach to Granger Causal Discovery when time series are scarce or heterogeneous. Because it draws on multimodal priors and adapts to regime shifts, the reported results suggest it can recover reliable causal structures from significantly less training data than traditional methods need. Consider evaluating this approach to improve the accuracy and interpretability of your temporal causal models.
Key insights
CausalMoE combines multimodal experts with a pattern-routed architecture for robust Granger Causal Discovery in heterogeneous time series.
Principles
- Explicitly model patch-level heterogeneity for reliable causal discovery.
- Multimodal priors from LLMs/VLMs regularize causal estimation.
- Causality-Aware Self-Attention yields sparse, interpretable causal graphs.
Method
CausalMoE employs Multimodal Patch Encoding, Patch-Specific Pattern Routing to heterogeneous experts (Semantic, Multimodal, Temporal Frequency, Multiscale Temporal), and Causality-Aware Self-Attention with proximal optimization for sparse graph recovery.
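The routing step described above can be illustrated with a small sketch. This is not the paper's implementation: the patch length, the hand-set linear router, the top-k rule, and the two toy experts (`level_expert`, `swing_expert`) are all hypothetical stand-ins chosen to show how a gate can send different patches of one series to different experts.

```python
import math
from typing import Callable, List, Sequence, Tuple

Patch = List[float]
Expert = Callable[[Patch], float]


def split_into_patches(series: Sequence[float], patch_len: int) -> List[Patch]:
    """Cut a 1-D series into non-overlapping patches, dropping any remainder."""
    n_patches = len(series) // patch_len
    return [list(series[i * patch_len:(i + 1) * patch_len]) for i in range(n_patches)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_patch(
    patch: Patch,
    router_weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    top_k: int = 1,
) -> Tuple[float, List[int]]:
    """Score every expert for this patch, keep the top_k, and mix their outputs."""
    # Assumption: one linear score per expert, dot(router_weights[e], patch).
    scores = [sum(w * x for w, x in zip(weights, patch)) for weights in router_weights]
    gates = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda e: gates[e], reverse=True)[:top_k]
    # Renormalize the gates of the chosen experts so they sum to one.
    norm = sum(gates[e] for e in chosen)
    output = sum(gates[e] / norm * experts[e](patch) for e in chosen)
    return output, chosen


def level_expert(patch: Patch) -> float:
    """Toy expert (hypothetical): summarizes a patch by its mean level."""
    return sum(patch) / len(patch)


def swing_expert(patch: Patch) -> float:
    """Toy expert (hypothetical): summarizes a patch by its peak-to-peak range."""
    return max(patch) - min(patch)


EXPERTS: List[Expert] = [level_expert, swing_expert]
# Hand-set router (hypothetical): row 0 responds to level, row 1 to alternation.
ROUTER = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]

series = [0.0, 0.1, 0.0, 0.1, 5.0, -5.0, 5.0, -5.0]
for patch in split_into_patches(series, patch_len=4):
    value, chosen = route_patch(patch, ROUTER, EXPERTS, top_k=1)
    print(chosen, round(value, 3))
```

In this toy series the calm first patch is sent to the level expert and the oscillating second patch to the swing expert. A trained model would learn the router and use neural experts; the sketch only shows the patch-level dispatch.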
In practice
- Integrate LLMs and VLMs to enrich time series representations.
- Apply Mixture of Experts for adaptive modeling of temporal heterogeneity.
- Use variable-wise attention for direct causal interpretation.
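To make the link between variable-wise weights and a sparse graph concrete, here is a minimal sketch of one proximal gradient update. It assumes a plain L1 penalty, whose proximal operator is soft-thresholding; the summary does not state which penalty the paper uses, so treat that choice, the function names, and the convention that entry `[i][j]` scores variable `j` as a Granger cause of variable `i` as assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def soft_threshold(value: float, lam: float) -> float:
    """Proximal operator of lam * |value|: shrink toward zero, clip at zero."""
    return math.copysign(max(abs(value) - lam, 0.0), value)


def proximal_step(weights: Matrix, grads: Matrix, lr: float, lam: float) -> Matrix:
    """One proximal gradient update: a gradient step, then soft-thresholding."""
    return [
        [soft_threshold(w - lr * g, lr * lam) for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


def to_graph(weights: Matrix, tol: float = 0.0) -> List[List[int]]:
    """Read a binary adjacency matrix off the surviving nonzero weights."""
    return [[1 if abs(w) > tol else 0 for w in row] for row in weights]


# Hypothetical 2-variable weight matrix; gradients are zero to isolate the shrinkage.
weights = [[0.9, 0.05], [0.4, 0.8]]
grads = [[0.0, 0.0], [0.0, 0.0]]
updated = proximal_step(weights, grads, lr=1.0, lam=0.1)
print(to_graph(updated))  # [[1, 0], [1, 1]]
```

The weak 0.05 entry is driven exactly to zero, so the edge disappears from the graph without a separate pruning threshold. That exact-zero behavior is the usual reason to prefer a proximal step over adding the penalty to the loss and relying on gradient descent alone.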
Topics
- Granger Causal Discovery
- Multimodal Foundation Models
- Time Series Analysis
- Mixture-of-Experts
- Large Language Models
- Vision-Language Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.