[P] Bypassing CoreML to natively train a 110M Transformer on the Apple Neural Engine (Orion)

2026-03-05 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, short

Summary

ORION is introduced as the first open end-to-end system that executes large language models directly on the Apple Neural Engine (ANE) and trains them stably over multiple steps, bypassing CoreML's limitations. The project had to work around 11 newly discovered, undocumented ANE programming constraints and a "numerical stability ceiling" that caused NaN divergence in previous attempts. Its solutions include a custom compiler with five optimization passes, plus fixes for stale programs, fp16 overflow, and corrupted weights: deferred compilation, activation clamping, and gradient sanitization. The system reaches 170+ tokens/s for GPT-2 124M inference on an M4 Max and trains a 110M-parameter Transformer stably, reducing loss from 12.29 to 6.19 over 1,000 steps with zero NaN occurrences. The current bottleneck is that each weight update requires a ~4.2s recompilation. The work demonstrates that numerically stable gradient descent is feasible directly on Apple's NPU and points to future improvements such as weight patching or incremental compilation.
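To make the fp16 overflow problem concrete, here is a minimal pure-Python sketch of activation clamping and gradient sanitization. It is an illustration under stated assumptions, not Orion's actual code: the helper names (to_fp16, clamp_activation, sanitize_gradients), the clamp bound of 65504 (the largest finite half-precision value), and the choice to zero out non-finite gradients are all hypothetical; the summary does not say which bounds or replacement values the project uses.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; out-of-range magnitudes become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack values beyond the fp16 range;
        # model that case as an overflow to infinity.
        return math.copysign(math.inf, x)


def clamp_activation(x: float, limit: float = FP16_MAX) -> float:
    """Clamp a finite activation into [-limit, limit] before the fp16 cast."""
    return max(-limit, min(limit, x))


def sanitize_gradients(grads, limit: float = FP16_MAX):
    """Zero out NaN/inf gradients (an assumed policy) and clamp the rest."""
    return [clamp_activation(g, limit) if math.isfinite(g) else 0.0 for g in grads]


if __name__ == "__main__":
    raw = 70000.0  # exceeds the fp16 range
    print(to_fp16(raw))                    # unclamped value overflows
    print(to_fp16(clamp_activation(raw)))  # clamped value stays finite
    print(sanitize_gradients([0.5, float("nan"), float("inf"), -1e9]))
```

The point of the sketch is that a single overflowed value turns into infinity, and infinities propagate into NaN through later arithmetic, so bounding values before the half-precision cast and scrubbing non-finite gradients before the weight update are two natural places to intervene.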

Key takeaway

ORION is the first open system enabling direct, stable multi-step training of a 110M-parameter Transformer on Apple's ANE, bypassing CoreML's opaque abstractions. It drops loss from 12.29 to 6.19 over 1,000 steps with zero NaN divergence by solving fp16 overflow and weight corruption issues via a custom compiler and a deferred compilation pipeline. Recompilation currently costs ~4.2s per weight update, so training is slow, but the result shows that on-device gradient descent on Apple's NPU is feasible and motivates future weight patching and incremental compilation.
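As a rough back-of-the-envelope check on that bottleneck, the sketch below multiplies the two figures quoted above. It assumes exactly one recompilation per training step and ignores all compute time, so it is only a lower bound on wall-clock time; the function name is hypothetical.

```python
RECOMPILE_SECONDS = 4.2  # approximate per-update recompilation time quoted above
TRAIN_STEPS = 1000       # number of training steps quoted above


def recompile_overhead_seconds(steps: int, seconds_per_update: float) -> float:
    """Total time spent recompiling, assuming one weight update per step."""
    return steps * seconds_per_update


if __name__ == "__main__":
    total = recompile_overhead_seconds(TRAIN_STEPS, RECOMPILE_SECONDS)
    print(f"{total:.0f} s of recompilation, about {total / 60:.0f} minutes")
```

Under these assumptions the 1,000-step run spends roughly 4,200 seconds, about 70 minutes, on recompilation alone, which is why weight patching or incremental compilation is the natural next step.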

Topics

Apple Neural Engine
On-device Training
Transformer Models
Low-level ML
Custom Compilers

Code references

mechramc/Orion

Best for: Machine Learning Engineer, NLP Engineer, AI Scientist, AI Engineer, AI Researcher, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.