Three-Phase Transformer

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Three-Phase Transformer (3PT) is a residual-stream structural prior for decoder-only Transformers, built on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. Rather than a modular add-on, the architecture functions as a self-stabilizing equilibrium between scrambling and re-imposition. The hidden vector is partitioned into N cyclic channels, each maintained by phase-respecting operations such as per-channel RMSNorm and 2D Givens rotations. A one-dimensional DC subspace, orthogonal to the channels, carries a fixed Gabriel's horn profile that is injected as an absolute-position side-channel. The canonical N=3 configuration draws inspiration from balanced three-phase AC systems. On WikiText-103, a 123M-parameter 3PT model reduced perplexity by 7.20% (bits-per-byte by 2.62%) relative to a RoPE-Only baseline, with a 1.93x step-count convergence speedup.

Key takeaway

For research scientists optimizing Transformer architectures, the Three-Phase Transformer (3PT) design is worth evaluating. On WikiText-103 at 123M parameters, its channel-partitioned residual stream and DC-subspace injection lowered perplexity by 7.20% against a RoPE-Only baseline, with a 1.93x step-count convergence speedup (roughly 48% fewer steps), which could reduce training cost. N=3 is a strong starting point, though N=1 performs comparably at larger scales.

Key insights

3PT introduces a self-stabilizing, channel-partitioned residual stream for Transformers that lowers perplexity and speeds up convergence.

Principles

Method

Partition the hidden vector into N cyclic channels, apply phase-respecting operations (per-channel RMSNorm, 2D Givens rotations) to each channel, and inject a Gabriel's horn profile into an orthogonal one-dimensional DC subspace.
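
The method can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation, and several details are assumptions made only for illustration: channels are contiguous equal slices, each channel k is rotated by a shared angle plus a 2*pi*k/N phase offset, the DC direction is the normalized all-ones vector, the horn profile is 1/(pos + 1), and "injection" is realized by overwriting the DC component. The function names (split_channels, three_phase_step, and so on) are hypothetical.

```python
import math

def split_channels(h, n):
    """Partition a hidden vector into n equal, contiguous channels (assumed layout)."""
    if len(h) % n != 0:
        raise ValueError("hidden size must be divisible by n")
    m = len(h) // n
    return [h[k * m:(k + 1) * m] for k in range(n)]

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def givens_rotate(x, theta):
    """Rotate consecutive coordinate pairs (x[0], x[1]), (x[2], x[3]), ... by theta."""
    c, s = math.cos(theta), math.sin(theta)
    out = list(x)
    for i in range(0, len(x) - 1, 2):
        a, b = x[i], x[i + 1]
        out[i], out[i + 1] = c * a - s * b, s * a + c * b
    return out

def horn_profile(pos):
    """Assumed Gabriel's horn profile: the curve y = 1/x sampled at x = pos + 1."""
    return 1.0 / (pos + 1)

def three_phase_step(h, pos, n=3, theta=0.1):
    """Apply the per-channel ops, then set the DC component to the horn value."""
    d = len(h)
    mixed = []
    for k, ch in enumerate(split_channels(h, n)):
        ch = rms_norm(ch)                                    # per-channel RMSNorm
        ch = givens_rotate(ch, theta + 2 * math.pi * k / n)  # assumed phase offset
        mixed.extend(ch)
    u = 1.0 / math.sqrt(d)            # every entry of the assumed unit DC direction
    dc = sum(v * u for v in mixed)    # current component along the DC direction
    target = horn_profile(pos)
    # Overwrite only the DC component; everything orthogonal to it is unchanged.
    return [v + (target - dc) * u for v in mixed]

hidden = [float(i + 1) for i in range(12)]   # toy hidden vector, d = 12, N = 3
out = three_phase_step(hidden, pos=4)
```

In this sketch the channel operations and the position signal do not interfere: the rotation preserves each channel's norm, and the final update moves the vector only along the DC direction, so the projection onto that direction equals the horn value for the given position.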

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.