The Information-Theoretic Benefit of Shared Representations under Orthogonality Constraints

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

Modern deep learning systems, particularly multi-task and multi-modal ones, often combine pretrained foundation models with task-specific fine-tuned models. This paper studies the parametric complexity of approximating several target functions jointly versus separately under structural constraints such as orthogonality, which are increasingly relevant in deep learning. The authors prove lower and upper bounds, in the uniform norm, on the description length of both the separate and the joint approximation classes. Their orthogonal target functions are built from a shared hard feature, realized as a Rademacher-Haar wavelet series, composed with Sawtooth-Walsh readouts. Within an information-theoretic framework, they show a sharp gap in optimal approximation rates: when tasks share a latent hard feature, joint approximation with a compositional architecture requires strictly fewer bits than separate approximation. The separation is then realized in a neural network model with Heaviside activations, which sheds light on the description-length efficiency of compositional multi-output architectures and on how neural networks retain expressivity under geometric constraints.

Key takeaway

For AI Scientists designing multi-task or multi-modal deep learning architectures, this research indicates that joint approximation under orthogonality constraints can substantially reduce parametric complexity. Prioritize compositional designs that exploit a shared latent hard feature, since the paper shows this approach requires strictly fewer bits for optimal approximation. Heaviside activations, which the paper uses to realize the separation in a neural network, are one way to retain expressivity when geometric constraints are necessary. The result offers both an efficiency argument and a theoretical basis for such designs.

Key insights

Joint approximation requires strictly fewer bits of description length than separate approximation in multi-task deep learning when tasks share a latent hard feature.
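
The intuition behind this insight can be illustrated with a toy bit count. The sketch below is not the paper's bound: it assumes, purely for illustration, that describing the shared hard feature costs a fixed number of bits and that each task adds a cheap readout on top. The function names and the numbers in the usage note are hypothetical.

```python
# Toy accounting only: an illustrative assumption, not the paper's
# description-length bounds. All names here are hypothetical.

def separate_bits(num_tasks, feature_bits, readout_bits):
    """Separate approximation: every task re-encodes the hard feature."""
    return num_tasks * (feature_bits + readout_bits)


def joint_bits(num_tasks, feature_bits, readout_bits):
    """Joint approximation: the hard feature is encoded once and shared."""
    return feature_bits + num_tasks * readout_bits


def bits_saved(num_tasks, feature_bits, readout_bits):
    """Gap between the two schemes, equal to (num_tasks - 1) * feature_bits."""
    return (separate_bits(num_tasks, feature_bits, readout_bits)
            - joint_bits(num_tasks, feature_bits, readout_bits))
```

With arbitrary example values of 4 tasks, 1000 bits for the feature, and 50 bits per readout, `bits_saved(4, 1000, 50)` is `3 * 1000`: under this toy model the saving grows with the cost of the shared feature and the number of tasks that reuse it.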

Method

The paper builds orthogonal target functions by composing a shared feature, given by a Rademacher-Haar wavelet series, with Sawtooth-Walsh readouts. It then realizes the resulting gap between joint and separate approximation in a neural network with Heaviside activations, via triangle-wave approximation.
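
To make the ingredients named above more concrete, here is a minimal sketch of standard Rademacher and Walsh functions on [0, 1), with each Rademacher function written as a sum of Heaviside steps (in effect a one-hidden-layer step network). This is a textbook construction chosen for illustration, not the paper's Rademacher-Haar feature, Sawtooth-Walsh readouts, or triangle-wave argument. The helper names are hypothetical.

```python
# Illustrative sketch: standard Rademacher/Walsh functions from Heaviside
# steps. Assumption: this is NOT the paper's construction, only a small
# example of orthogonal functions expressed with step activations.

def heaviside(t):
    """Heaviside step: 1.0 for t >= 0, else 0.0."""
    return 1.0 if t >= 0.0 else 0.0


def rademacher(n, x):
    """n-th Rademacher function on [0, 1) for n >= 1.

    Written as 1 + 2 * sum_k (-1)^k * H(x - k / 2^n), k = 1 .. 2^n - 1,
    so the value flips between +1 and -1 at every multiple of 2^-n.
    """
    total = 1.0
    for k in range(1, 2 ** n):
        sign = -1.0 if k % 2 else 1.0
        total += 2.0 * sign * heaviside(x - k / 2 ** n)
    return total


def walsh(j, x):
    """Walsh function of index j: product of the Rademacher functions
    selected by the binary digits of j (bit 0 selects level 1)."""
    value = 1.0
    level = 1
    while j:
        if j & 1:
            value *= rademacher(level, x)
        j >>= 1
        level += 1
    return value


def gram_matrix(num_functions, depth):
    """Inner products <W_j, W_k> on [0, 1), computed on the midpoints of
    the 2^depth dyadic cells (exact when num_functions <= 2^depth)."""
    n_pts = 2 ** depth
    grid = [(i + 0.5) / n_pts for i in range(n_pts)]
    return [[sum(walsh(j, x) * walsh(k, x) for x in grid) / n_pts
             for k in range(num_functions)]
            for j in range(num_functions)]


GRAM = gram_matrix(8, 3)  # first 8 Walsh functions on an 8-point grid
```

Because these functions are piecewise constant on dyadic cells, the midpoint sum equals the integral, and `GRAM` comes out as the 8 x 8 identity matrix: the functions are orthonormal even though each is built only from step units.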

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.