[P] Bypassing CoreML to natively train a 110M Transformer on the Apple Neural Engine (Orion)
Summary
ORION is presented as the first open end-to-end system that executes directly on the Apple Neural Engine (ANE) and trains large language models stably over multiple steps, bypassing CoreML's limitations. The project had to overcome 11 newly discovered, undocumented ANE programming constraints and a "numerical stability ceiling" that caused NaN divergence in previous attempts. Its solutions include a custom compiler with five optimization passes, plus three fixes: deferred compilation for stale programs, activation clamping for fp16 overflow, and gradient sanitization for corrupted weights. The system reaches 170+ tokens/s for GPT-2 124M inference on an M4 Max. It also trained a 110M-parameter Transformer stably, reducing loss from 12.29 to 6.19 over 1,000 steps with zero NaN occurrences. The current bottleneck is that each weight update requires a ~4.2 s recompilation. The work demonstrates that numerically stable gradient descent is feasible directly on Apple's NPU and points to future improvements such as weight patching or incremental compilation.
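To make the fp16 overflow and gradient fixes concrete, here is a minimal Python sketch of what activation clamping and gradient sanitization can look like. It is not Orion's code: the function names, the choice of 65504 (the largest finite half-precision value) as the clamp limit, and the policy of zeroing NaN gradients are all assumptions made for illustration.

```python
import math

# Largest finite IEEE 754 half-precision (fp16) value.
FP16_MAX = 65504.0


def clamp_activations(values, limit=FP16_MAX):
    """Clamp each activation into [-limit, limit] so it stays finite in fp16.

    Hypothetical helper; NaN inputs are not handled here.
    """
    return [max(-limit, min(limit, v)) for v in values]


def sanitize_gradients(grads, limit=FP16_MAX):
    """Replace NaN gradients with 0.0 and clamp the rest (including +/-inf).

    Hypothetical helper; zeroing NaN is an assumed policy, not Orion's.
    """
    cleaned = []
    for g in grads:
        if math.isnan(g):
            cleaned.append(0.0)
        else:
            cleaned.append(max(-limit, min(limit, g)))
    return cleaned


if __name__ == "__main__":
    print(clamp_activations([1.0, 70000.0, -1e9]))
    print(sanitize_gradients([float("nan"), float("inf"), 0.5]))
```

Clamping holds a value such as 70000.0 at 65504.0 instead of letting it overflow to infinity in half precision, and sanitizing stops a single NaN gradient from propagating into the weights on the next update.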
Key takeaway
ORION is the first open system to train a 110M-parameter Transformer directly and stably over multiple steps on Apple's ANE, bypassing CoreML's opaque abstractions. Loss drops from 12.29 to 6.19 over 1,000 steps with zero NaN divergence, after a custom compiler and a deferred compilation pipeline resolved critical fp16 overflow and weight corruption issues. Recompilation currently costs ~4.2 s per step, but the result still validates on-device gradient descent on Apple's NPU and opens avenues for future weight patching and incremental compilation.
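The cost of the per-step recompilation can be estimated with simple arithmetic. The sketch below assumes exactly one ~4.2 s recompilation per weight update across the 1,000-step run reported above, and it counts only recompilation time, not the forward or backward passes. The function name is hypothetical.

```python
def recompilation_overhead_seconds(steps, seconds_per_recompile=4.2):
    """Total recompilation time, assuming one recompile per weight update."""
    return steps * seconds_per_recompile


total = recompilation_overhead_seconds(1000)
# Prints: 4200 s of recompilation, about 70 min
print(f"{total:.0f} s of recompilation, about {total / 60:.0f} min")
```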
Topics
- Apple Neural Engine
- On-device Training
- Transformer Models
- Low-level ML
- Custom Compilers
Best for: Machine Learning Engineer, NLP Engineer, AI Scientist, AI Engineer, AI Researcher, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.