ResNets, Hyper-Connections, and Manifold Constraints: A Story about Stability
Summary
This article traces the evolution of neural network stability techniques, beginning with the "degradation problem" reported in 2015, when Microsoft researchers found that plain (non-residual) 34-layer networks reached a higher error than 18-layer ones. The solution, proposed by Kaiming He and colleagues, was the residual connection: each block outputs its input plus a learned correction, so the identity mapping is built into the architecture and the layers only have to learn the difference. While ResNets scaled to 152 layers and won the 2015 ImageNet challenge, ByteDance researchers later identified "representation collapse" in very deep residual networks, leading to the 2024 introduction of Hyper-Connections (HC). HC maintains multiple parallel residual streams that are mixed through learned matrices. However, DeepSeek found that scaling HC to billion-parameter frontier models in mid-2025 caused training instability, because the learned mixing matrices were unconstrained. DeepSeek's solution, Manifold-Constrained Hyper-Connections (mHC), constrains these routing matrices to the Birkhoff polytope (the set of doubly stochastic matrices), which conserves total signal mass across the streams and keeps gradients from exploding or vanishing.
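The residual idea in the summary can be sketched in a few lines of plain Python. This is a toy illustration, not code from the original article: `residual_block` and `zero_correction` are hypothetical names, and the "layer" is a stand-in that returns a zero correction, which is the case where a residual block reduces exactly to the identity.

```python
def residual_block(x, f):
    """Return x + f(x): the block only has to learn the correction f."""
    return [xi + fi for xi, fi in zip(x, f(x))]


def zero_correction(x):
    # Hypothetical stand-in for a layer that has learned nothing yet:
    # its correction is exactly zero for every input element.
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
out = x
for _ in range(152):  # depth chosen to echo the 152-layer ResNet
    out = residual_block(out, zero_correction)

print(out == x)  # the input passes through all 152 blocks unchanged
```

With the skip path in place, adding depth cannot make it harder to represent the identity, which is the property the plain deep networks lacked.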
Key takeaway
For AI Scientists designing or training frontier-scale models, understanding the evolution from residual connections to Manifold-Constrained Hyper-Connections (mHC) is critical. Your model's stability at scale depends on how information is routed and constrained. Employing mHC's geometric constraints on routing matrices can prevent gradient instability and representation collapse, enabling successful training of billion-parameter networks where unconstrained Hyper-Connections would fail.
Key insights
Architectural stability techniques such as residual connections and Hyper-Connections are crucial for scaling deep neural networks.
Principles
- Deeper networks require explicit stability mechanisms.
- Information flow can be learned but needs constraints.
- Geometric constraints ensure training stability at scale.
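The last principle can be illustrated with a toy calculation. The sketch below applies one fixed 2x2 mixing matrix to two residual streams once per layer. Both matrices are hypothetical values chosen only for illustration, and a real network would learn a different matrix at every layer.

```python
def apply(matrix, streams):
    """One mixing step: out[i] = sum_j matrix[i][j] * streams[j]."""
    return [sum(w * s for w, s in zip(row, streams)) for row in matrix]


def run_depth(matrix, streams, depth):
    """Apply the same mixing matrix once per layer, for `depth` layers."""
    for _ in range(depth):
        streams = apply(matrix, streams)
    return streams


# Hypothetical 2-stream mixing matrices, for illustration only.
unconstrained = [[1.1, 0.2],
                 [0.1, 1.2]]      # rows and columns do not sum to 1
doubly_stochastic = [[0.7, 0.3],
                     [0.3, 0.7]]  # non-negative, rows and columns sum to 1

start = [3.0, -1.0]               # total signal across streams: 2.0
exploded = run_depth(unconstrained, start, depth=60)
stable = run_depth(doubly_stochastic, start, depth=60)

print(sum(exploded))  # grows by orders of magnitude with depth
print(sum(stable))    # stays at 2.0 up to floating-point rounding
```

The same holds when each layer has its own matrix: a product of doubly stochastic matrices is again doubly stochastic, so the composite mapping across any depth keeps the total signal fixed.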
Method
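Manifold-Constrained Hyper-Connections (mHC) project the learned routing matrices onto the Birkhoff polytope, the set of doubly stochastic matrices (non-negative entries, with every row and every column summing to 1). Mixing the residual streams with such a matrix conserves signal mass and stabilizes gradients during training.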
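A minimal sketch of such a projection, assuming Sinkhorn-Knopp normalization (alternately rescaling rows and columns of a positive matrix), which is one standard way to reach a doubly stochastic matrix. The summary does not spell out the exact procedure, so the function names, the exponential parameterization, the iteration count, and the example values here are all assumptions for illustration.

```python
import math


def sinkhorn(logits, n_iters=100):
    """Approximately map a square matrix of real scores to a doubly
    stochastic matrix by alternating row and column normalization."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(n_iters):
        # Rescale each row to sum to 1.
        for row in m:
            total = sum(row)
            for j in range(n):
                row[j] /= total
        # Rescale each column to sum to 1.
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m


def mix(matrix, streams):
    """Mix scalar residual streams: out[i] = sum_j matrix[i][j] * streams[j]."""
    return [sum(w * s for w, s in zip(row, streams)) for row in matrix]


# Hypothetical unconstrained scores for a 4-stream routing matrix.
logits = [[0.5, -1.0, 2.0, 0.0],
          [1.5, 0.3, -0.7, 0.2],
          [-0.4, 0.9, 0.1, 1.1],
          [0.0, 0.0, 1.0, -1.0]]
routing = sinkhorn(logits)

streams = [2.0, -1.0, 0.5, 4.0]
mixed = mix(routing, streams)

print([sum(row) for row in routing])                       # each close to 1
print([sum(routing[i][j] for i in range(4)) for j in range(4)])  # each close to 1
print(sum(streams), sum(mixed))                            # totals agree
```

Because every column sums to 1, the total across streams is the same before and after mixing, which is the "signal mass" conservation the method relies on.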
In practice
- Implement residual connections for basic deep network stability.
- Consider Hyper-Connections for richer representation learning.
- Apply mHC for stable training of frontier-scale models.
Topics
- Residual Connections
- Hyper-Connections
- Manifold-Constrained Hyper-Connections
- Deep Learning Stability
- Neural Network Architectures
Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.