The Information-Theoretic Benefit of Shared Representations under Orthogonality Constraints
Summary
Modern deep learning architectures, particularly multi-task and multi-modal systems, often combine pretrained foundation models with task-specific fine-tuned models. This paper investigates the parametric complexity of joint versus separate approximation under structural constraints such as orthogonality, which are increasingly relevant in deep learning. The authors prove lower and upper bounds on the description length of both the separate and the joint approximation classes in uniform norm. They construct orthogonal functions from a shared hard feature, realized by a Rademacher-Haar wavelet series, combined with sawtooth-Walsh readouts. Using an information-theoretic framework, they demonstrate a sharp gap in optimal approximation rates: when tasks share a latent hard feature, joint approximation in a compositional architecture requires strictly fewer bits than separate approximation. The separation is then realized in a neural network model with Heaviside activations, which gives insight into the description-length efficiency of compositional multi-output architectures and into how neural networks maintain expressivity under geometric constraints.
Key takeaway
For AI scientists designing multi-task or multi-modal architectures, this research indicates that joint approximation under orthogonality constraints can require strictly fewer bits than separate approximation when tasks share a latent hard feature. Prioritize compositional designs that compute such a feature once and reuse it across outputs. The paper realizes the separation in a network with Heaviside activations, which indicates that expressivity can be retained under such geometric constraints.
Key insights
In multi-task deep learning, joint approximation requires strictly fewer bits than separate approximation when tasks share a latent hard feature.
Principles
- Orthogonality constraints impact approximation rates.
- Shared latent features improve efficiency.
- Compositional architectures are description-length efficient.
Method
The paper builds orthogonal functions by composing a shared Rademacher-Haar wavelet series feature with Sawtooth-Walsh readouts, then realizes this separation in a neural network using Heaviside activations via triangle-wave approximation.
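The building blocks named above can be illustrated numerically. The sketch below is not the paper's construction, which this summary does not reproduce. It uses the textbook definitions of Rademacher and Walsh functions on [0, 1), checks that the first eight Walsh functions are orthonormal, and shows that a Rademacher function can be written exactly as a sum of Heaviside steps, that is, a one-hidden-layer network with step activations. The function names, the grid, and the choice of Paley ordering are assumptions made for illustration.

```python
import numpy as np

def rademacher(k, x):
    """k-th Rademacher function on [0, 1): (-1) ** floor(2^k * x)."""
    return 1.0 - 2.0 * (np.floor((2 ** k) * x).astype(int) % 2)

def walsh(n, x):
    """n-th Walsh function (Paley ordering): the product of the
    Rademacher functions selected by the binary digits of n."""
    out = np.ones_like(x, dtype=float)
    k = 1
    while n > 0:
        if n & 1:
            out = out * rademacher(k, x)
        n >>= 1
        k += 1
    return out

def rademacher_from_steps(k, x):
    """The same Rademacher function written as a sum of Heaviside steps:
    1 + 2 * sum_j (-1)^j * H(x - j / 2^k), for j = 1, ..., 2^k - 1."""
    thresholds = np.arange(1, 2 ** k) / 2 ** k
    signs = (-1.0) ** np.arange(1, 2 ** k)
    steps = np.heaviside(x[:, None] - thresholds[None, :], 1.0)
    return 1.0 + 2.0 * (steps @ signs)

LEVELS = 3
# Midpoints of the 2^LEVELS dyadic cells. Walsh functions with index below
# 2^LEVELS are constant on each cell, so grid averages equal exact integrals.
grid = (np.arange(2 ** LEVELS) + 0.5) / 2 ** LEVELS

basis = np.stack([walsh(n, grid) for n in range(2 ** LEVELS)])
gram = basis @ basis.T / grid.size

print(np.allclose(gram, np.eye(2 ** LEVELS)))  # True: orthonormal in L2[0, 1)
print(np.allclose(rademacher_from_steps(2, grid), rademacher(2, grid)))  # True
```

The step-function form needs 2^k - 1 Heaviside units for the k-th Rademacher function, which hints at why a feature of this kind is costly to describe and therefore worth sharing across outputs.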
In practice
- Design multi-output models with shared features.
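- Consider Heaviside activations when the model must retain expressivity under geometric constraints such as orthogonality.
- Use a Rademacher-Haar wavelet series as a template for constructing hard shared features.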
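As a rough illustration of the first point, the sketch below compares a multi-output model that computes one shared feature layer with a set of separate single-output models. It is a minimal toy under stated assumptions: the layer shapes, the names `init_shared_model`, `forward`, `param_count_joint`, and `param_count_separate`, and the use of raw parameter counts as a stand-in for description length are all hypothetical and are not taken from the paper, whose bounds are stated in bits.

```python
import numpy as np

def init_shared_model(rng, in_dim, feat_dim, num_tasks):
    """One shared feature layer plus one linear readout per task."""
    return {
        "W_feat": rng.standard_normal((in_dim, feat_dim)),
        "W_heads": rng.standard_normal((num_tasks, feat_dim)),
    }

def forward(params, x):
    """Compute the shared hard feature once, then apply every readout to it."""
    feat = np.heaviside(x @ params["W_feat"], 1.0)  # Heaviside activation
    return feat @ params["W_heads"].T               # shape: (batch, num_tasks)

def param_count_joint(in_dim, feat_dim, num_tasks):
    """Shared feature weights are stored once; each task adds one readout."""
    return in_dim * feat_dim + num_tasks * feat_dim

def param_count_separate(in_dim, feat_dim, num_tasks):
    """Each task stores its own copy of the feature weights and a readout."""
    return num_tasks * (in_dim * feat_dim + feat_dim)

rng = np.random.default_rng(0)
params = init_shared_model(rng, in_dim=4, feat_dim=16, num_tasks=3)
outputs = forward(params, rng.standard_normal((5, 4)))

print(outputs.shape)                 # (5, 3)
print(param_count_joint(4, 16, 3))   # 112
print(param_count_separate(4, 16, 3))  # 240
```

In this toy accounting the cost of the shared feature is paid once instead of once per task, so the saving grows with the number of tasks and with the cost of the feature relative to the readouts.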
Topics
- Multi-task Learning
- Orthogonality Constraints
- Information Theory
- Deep Learning Architectures
- Rademacher-Haar Wavelets
- Parametric Complexity
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.