ResNets, Hyper-Connections, and Manifold Constraints: A Story about Stability

Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, long

Summary

This article traces the evolution of neural network stability techniques, beginning with the "degradation problem" reported in 2015, when Microsoft researchers found that a 34-layer network performed worse than an 18-layer one. The solution, proposed by Kaiming He and colleagues, was the residual connection: each block learns only a correction to its input, so the identity mapping is built into the architecture. ResNets scaled to 152 layers and won the 2015 ImageNet challenge. ByteDance researchers later identified "representation collapse" in very deep residual networks and, in 2024, introduced Hyper-Connections (HC), which maintain multiple parallel residual streams that can mix and interact. In mid-2025, however, DeepSeek found that scaling HC to billion-parameter frontier models caused training instability because the learned mixing matrices were unconstrained. DeepSeek's solution, Manifold-Constrained Hyper-Connections (mHC), constrains these routing matrices to the Birkhoff polytope (the set of doubly stochastic matrices), which conserves total signal mass and keeps gradients from exploding or vanishing.
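
The residual idea described in the summary can be sketched in a few lines of NumPy. This is an illustrative toy rather than the ResNet architecture itself: the function name `residual_block`, the two-layer ReLU correction, and the weight shapes are assumptions made for the example.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Return x + F(x), where F is a small two-layer ReLU correction (hypothetical)."""
    hidden = np.maximum(0.0, x @ w1)   # ReLU nonlinearity
    correction = hidden @ w2           # F(x): the learned correction
    return x + correction              # skip path carries x through unchanged

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # batch of 4 vectors, width 8

# With all-zero weights the correction vanishes and the block is exactly the identity.
zero_w1 = np.zeros((8, 16))
zero_w2 = np.zeros((16, 8))
identity_out = residual_block(x, zero_w1, zero_w2)

# With small random weights the output stays close to the input.
small_w1 = 0.01 * rng.normal(size=(8, 16))
small_w2 = 0.01 * rng.normal(size=(16, 8))
near_identity_out = residual_block(x, small_w1, small_w2)
```

The point of the sketch is that a block never has to learn the identity: doing nothing is the default, and the weights only describe a departure from it.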

Key takeaway

For AI Scientists designing or training frontier-scale models, understanding the evolution from residual connections to Manifold-Constrained Hyper-Connections (mHC) is critical. Your model's stability at scale depends on how information is routed and constrained. Employing mHC's geometric constraints on routing matrices can prevent gradient instability and representation collapse, enabling successful training of billion-parameter networks where unconstrained Hyper-Connections would fail.

Key insights

Architectural stability techniques such as residual connections and Hyper-Connections are crucial for scaling deep neural networks.

Method

Manifold-Constrained Hyper-Connections (mHC) project learned routing matrices onto the Birkhoff polytope, so that each matrix is doubly stochastic (nonnegative entries, with every row and every column summing to 1). This conserves signal mass and stabilizes gradients during training.
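
To make the doubly stochastic constraint concrete, here is a minimal NumPy sketch. It uses Sinkhorn-Knopp iteration (alternating row and column normalization), a standard way to push a positive matrix toward the Birkhoff polytope; this summary does not say which procedure mHC uses, so treat the choice as an assumption. The names `sinkhorn_knopp`, `n_streams`, and `routing`, the stream width, and the iteration count are all hypothetical, and "signal mass" is read here as the sum across streams.

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=100):
    """Approximately map a square matrix of logits to a doubly stochastic matrix."""
    m = np.exp(logits - logits.max())          # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)   # make every row sum to 1
        m = m / m.sum(axis=0, keepdims=True)   # make every column sum to 1
    # The last step normalizes columns, so column sums are exact up to rounding;
    # row sums are only approximately 1.
    return m

rng = np.random.default_rng(0)
n_streams = 4
raw = rng.normal(size=(n_streams, n_streams))  # unconstrained routing logits
routing = sinkhorn_knopp(raw)

# Mix four residual streams (one row per stream, width 8).
streams = rng.normal(size=(n_streams, 8))
mixed = routing @ streams

# Unit column sums mean the sum across streams is unchanged by mixing.
total_before = streams.sum(axis=0)
total_after = mixed.sum(axis=0)

# A doubly stochastic matrix has spectral norm 1, so mixing cannot amplify the signal.
spectral_norm = np.linalg.norm(routing, 2)
```

Because products of doubly stochastic matrices are again doubly stochastic, the same bound holds when many such mixing steps are stacked across layers, which is the intuition behind the stability claim.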

Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.