CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield
Summary
CHERRY (Compressed Hierarchical Experts with Recurrent Representational Yield), a Korean foundation model, validates three techniques for compute-efficient language model training. Selective Ground Truth Token Training (SGT) focuses supervision on ~15% of output tokens and recovers ~67% of the full-sequence loss reduction, a 4.5x gain in efficiency per supervised token. Depth compression reduces a 48-layer, 1B-parameter transformer to 6 layers (227M parameters) using recurrent unrolling. The compressed model reaches a held-out loss of 2.934, comparable to the 2.926 of a 566M dense model with 2.5x as many parameters. Finally, a Mixture of Efficient Experts (MoEE) built from two compressed models achieves a loss of 2.789, outperforming the best single compressed model's 2.926.
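The headline ratios can be reproduced with simple arithmetic on the figures quoted in the summary. The sketch below assumes the 4.5x figure is the recovered share of loss reduction divided by the supervised share of tokens; that reading is an assumption, not something the source states, and the variable names are illustrative.

```python
# Figures quoted in the summary; variable names are illustrative only.
supervised_fraction = 0.15   # ~15% of output tokens receive supervision
recovered_fraction = 0.67    # ~67% of full-sequence loss reduction is recovered

# Assumed reading: loss reduction recovered per unit of supervised tokens.
per_token_efficiency = recovered_fraction / supervised_fraction

dense_params_m = 566         # dense baseline, in millions of parameters
compressed_params_m = 227    # 6-layer compressed model, in millions of parameters
param_reduction = dense_params_m / compressed_params_m

# Gap in held-out loss between the compressed and the dense model.
loss_gap = 2.934 - 2.926

print(f"per-supervised-token efficiency: {per_token_efficiency:.1f}x")
print(f"parameter reduction vs dense:    {param_reduction:.1f}x")
print(f"held-out loss gap:               {loss_gap:.3f}")
```

Under that reading, both ratios round to the values reported in the summary.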
Key takeaway
For machine learning engineers developing large language models, CHERRY's validated techniques offer ways to cut training compute and model size. Investigate Selective Ground Truth Token Training to concentrate supervision where it matters, explore recurrent unrolling for depth compression, and consider a Mixture of Efficient Experts to enhance performance with fewer active parameters. In the reported results, a 227M-parameter compressed model comes within 0.008 held-out loss of a 566M dense model.
Key insights
Efficient language model training is achieved through selective supervision, depth compression, and expert fusion.
Principles
- When gamma-bar > 0.72, positive gradient coupling also improves the tokens that receive no supervision.
- Natural language structure enables effective selective supervision.
- Recurrent unrolling can restore performance in depth-compressed models.
Method
Depth compression involves averaging adjacent layers and restoring performance via learned recurrent unrolling. Multiple compressed models can be fused into a Mixture of Efficient Experts (MoEE) using multi-token prediction.
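A toy sketch of the depth-compression idea described above, using plain Python lists as stand-ins for layer weights. Everything here is an assumption for illustration: the group size of 8 (chosen so that 48 layers become 6), the residual toy layer, the fixed `unroll_steps`, and all function names are hypothetical. In the source the unrolling is learned; this sketch only shows the weight averaging and the weight reuse, with no training step.

```python
import random

def average_adjacent(layers, group_size):
    """Average each run of `group_size` adjacent weight matrices into one."""
    assert len(layers) % group_size == 0
    compressed = []
    for start in range(0, len(layers), group_size):
        group = layers[start:start + group_size]
        rows, cols = len(group[0]), len(group[0][0])
        mean = [[sum(w[r][c] for w in group) / group_size for c in range(cols)]
                for r in range(rows)]
        compressed.append(mean)
    return compressed

def apply_layer(weight, x):
    """Toy residual layer x + W x, standing in for a transformer block."""
    return [xi + sum(wij * xj for wij, xj in zip(row, x))
            for xi, row in zip(x, weight)]

def forward_unrolled(layers, x, unroll_steps):
    """Apply each compressed layer `unroll_steps` times, reusing its weights."""
    for weight in layers:
        for _ in range(unroll_steps):
            x = apply_layer(weight, x)
    return x

random.seed(0)
dim = 4
# 48 small random matrices play the role of the original deep model.
deep = [[[random.gauss(0, 0.01) for _ in range(dim)] for _ in range(dim)]
        for _ in range(48)]

shallow = average_adjacent(deep, group_size=8)   # 48 layers -> 6 layers
out = forward_unrolled(shallow, [1.0] * dim, unroll_steps=8)
print(len(shallow), len(out))
```

The compressed stack stores 6 weight matrices but still performs 48 layer applications at inference, which is why parameter count falls while effective depth is preserved.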
In practice
- Concentrate supervision on the ~15% of output tokens that are semantically rich.
- Compress 48-layer models to 6 layers using recurrent unrolling.
- Combine compressed models into a MoEE for improved performance.
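The first practice above, supervising only a subset of output tokens, can be sketched as a masked loss. The source does not say how semantically rich tokens are identified, so the per-token `scores`, the top-fraction selection rule, and the function names below are all hypothetical stand-ins.

```python
import math

def select_top_fraction(scores, fraction):
    """Return a 0/1 mask keeping the highest-scoring `fraction` of positions."""
    keep = max(1, round(len(scores) * fraction))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    mask = [0] * len(scores)
    for i in top:
        mask[i] = 1
    return mask

def masked_nll(target_probs, mask):
    """Mean negative log-likelihood over supervised (mask == 1) positions only."""
    losses = [-math.log(p) for p, m in zip(target_probs, mask) if m]
    return sum(losses) / len(losses)

# Hypothetical "semantic richness" score for each of 20 output positions.
scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.2, 0.7, 0.1, 0.2,
          0.3, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.1, 0.3]
# Hypothetical probability the model assigns to the correct token at each position.
target_probs = [0.5] * 20

mask = select_top_fraction(scores, fraction=0.15)
loss = masked_nll(target_probs, mask)
print(sum(mask), loss)
```

Positions outside the mask contribute nothing to the loss, so only the selected tokens are directly supervised.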
Topics
- CHERRY
- Language Models
- Model Compression
- Mixture of Experts
- Selective Supervision
- Recurrent Neural Networks
- Korean Language Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.