Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models
Summary
Nous Research has introduced Token Superposition Training (TST), a two-phase modification to the standard pre-training loop for Large Language Models (LLMs) that speeds up training without altering the tokenizer, architecture, or inference behavior. In Phase 1, TST averages s contiguous token embeddings into a single latent s-token and trains with a multi-hot cross-entropy loss against the next bag of tokens. Phase 2 resumes standard next-token prediction from the same checkpoint, with the TST code removed. The reported speedup is up to 2.5x: a 3B dense model reaches a loss of 2.676 in 247 B200-hours versus 443 B200-hours for the baseline, and a 10B-A1B MoE model finishes in 4,768 B200-hours versus 12,311 B200-hours. Each TST step matches the baseline's FLOPs because the data sequence length, not the batch size, is increased by a factor of s. The best-performing bag sizes s are 3-8 for 270M models, 6-10 for 600M models, and 16 for 10B models, with a step ratio r between 0.2 and 0.4. Re-initializing the embedding or LM head at the phase boundary degrades performance.
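As a quick arithmetic check, the B200-hour figures quoted above can be turned into speedup ratios. This is only a sketch: it assumes speedup means baseline hours divided by TST hours, which the summary does not state explicitly, and the `speedup` helper and `runs` table are illustrative names.

```python
# B200-hour figures as quoted in the summary; nothing else is assumed.
runs = {
    "3B dense": {"baseline": 443, "tst": 247},
    "10B-A1B MoE": {"baseline": 12_311, "tst": 4_768},
}

def speedup(baseline_hours, tst_hours):
    """Ratio of baseline compute time to TST compute time (assumed definition)."""
    return baseline_hours / tst_hours

for name, hours in runs.items():
    print(f"{name}: {speedup(hours['baseline'], hours['tst']):.2f}x")
# 3B dense: 1.79x
# 10B-A1B MoE: 2.58x
```

Under that assumption the two runs come out at roughly 1.8x and 2.6x, in line with the range reported in the summary.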
Key takeaway
AI Engineers and Research Scientists optimizing LLM pre-training should consider integrating Token Superposition Training (TST) into their workflow. The method offers speedups of up to 2.5x on models from 270M to 10B parameters without requiring architectural changes. Keep the embedding and LM head weights across the TST phase boundary, since re-initializing either one degrades performance.
Key insights
Token Superposition Training accelerates LLM pre-training by modifying the training loop without changing model architecture or inference.
Principles
- Maintain equal FLOPs per step by adjusting sequence length.
- Avoid re-initializing embedding or LM heads at phase transitions.
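The first principle can be illustrated with array shapes. The sketch below assumes that Phase 1 reads s times more data tokens per sequence and groups them into contiguous bags, so the model still sees the same number of positions per step as the baseline. The `to_bags` helper is a hypothetical name, not part of any released TST code.

```python
import numpy as np

def to_bags(token_ids, s):
    """Group a (batch, s * L) array of token ids into (batch, L, s) contiguous bags."""
    batch, n = token_ids.shape
    if n % s != 0:
        raise ValueError("sequence length must be a multiple of the bag size s")
    return token_ids.reshape(batch, n // s, s)

# Baseline: 4 positions per sequence. With s = 3, Phase 1 reads 12 data tokens
# per sequence but still feeds 4 latent positions to the model.
tokens = np.arange(2 * 12).reshape(2, 12)
bags = to_bags(tokens, s=3)
print(bags.shape)   # (2, 4, 3)
print(bags[0, 0])   # [0 1 2]
```

Because the batch size and the number of latent positions are unchanged, the per-step cost of the transformer stack stays comparable to the baseline while each step consumes s times more data.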
Method
TST involves a two-phase pre-training loop: Phase 1 averages 's' token embeddings into a single latent s-token, training with multi-hot cross-entropy loss; Phase 2 reverts to standard next-token prediction.
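A minimal NumPy sketch of the Phase 1 computation described above follows. Several details are assumptions rather than facts from the source: the multi-hot loss is taken to be the mean negative log-probability over the tokens in the target bag (so repeated tokens count more than once), and a tied matrix product stands in for the transformer and LM head. The function names are hypothetical.

```python
import numpy as np

def superpose(embedding, bags):
    """Average the embeddings of each bag: (V, D) table, (B, L, s) ids -> (B, L, D)."""
    return embedding[bags].mean(axis=2)

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def multi_hot_cross_entropy(logits, target_bags):
    """Mean negative log-probability of every token in each target bag (assumed form).

    logits: (B, L, V); target_bags: (B, L, s) integer token ids.
    """
    logp = log_softmax(logits)
    picked = np.take_along_axis(logp, target_bags, axis=-1)  # (B, L, s)
    return -picked.mean()

rng = np.random.default_rng(0)
vocab, dim, s = 50, 16, 4
embedding = rng.normal(size=(vocab, dim))
tokens = rng.integers(0, vocab, size=(2, 6 * s))
bags = tokens.reshape(2, 6, s)

# Each latent s-token predicts the next bag of tokens.
inputs, targets = bags[:, :-1], bags[:, 1:]
latent = superpose(embedding, inputs)       # (2, 5, 16)
logits = latent @ embedding.T               # stand-in for the model, not a real transformer
loss = multi_hot_cross_entropy(logits, targets)
print(latent.shape, float(loss))
```

In Phase 2 the same weights would be trained with ordinary next-token cross-entropy, which corresponds to the s = 1 case of both functions.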
In practice
- Apply TST for 1.8x to 2.5x LLM pre-training speedup.
- Use bag sizes s=3-16 depending on model size.
- Set TST step ratio r between 0.2 and 0.4.
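The settings above can be combined into a simple schedule. This sketch assumes that r is the fraction of total training steps spent in Phase 1, an interpretation the summary does not spell out. The `tst_schedule` function and its dictionary keys are hypothetical.

```python
def tst_schedule(total_steps, r, s, baseline_seq_len):
    """Split a run into Phase 1 (TST) and Phase 2 (standard) steps.

    Assumes r is the share of total steps spent in Phase 1.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    phase1_steps = int(round(total_steps * r))
    return {
        "phase1_steps": phase1_steps,
        "phase2_steps": total_steps - phase1_steps,
        # Phase 1 reads s times more data tokens per sequence, same batch size.
        "phase1_data_tokens_per_sequence": s * baseline_seq_len,
        "phase2_data_tokens_per_sequence": baseline_seq_len,
    }

plan = tst_schedule(total_steps=10_000, r=0.3, s=8, baseline_seq_len=2048)
print(plan)
```

With these example values, the first 3,000 steps would run in TST mode on 16,384 data tokens per sequence, and the remaining 7,000 steps would run as standard next-token prediction on 2,048 tokens per sequence.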
Topics
- Token Superposition Training
- LLM Pre-training
- Training Efficiency
- Multi-hot Cross-Entropy
- Large Language Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.