CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

CHERRY (Compressed Hierarchical Experts with Recurrent Representational Yield), a Korean foundation model, validates three techniques for compute-efficient language model training. Selective Ground Truth Token Training (SGT) concentrates supervision on ~15% of output tokens and still recovers ~67% of the loss reduction obtained with full-sequence supervision, about 4.5x the efficiency per supervised token. Depth compression reduces a 48-layer, 1B-parameter transformer to 6 layers (227M parameters) and restores performance with recurrent unrolling, reaching a held-out loss of 2.934. That is comparable to the 2.926 of a 566M dense model, with about 2.5x fewer parameters. Finally, a Mixture of Efficient Experts (MoEE) built from two compressed models achieves a loss of 2.789, outperforming the best single compressed model's 2.926.
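The headline ratios in this summary can be reproduced from the quoted figures with simple arithmetic. The short check below is only a sanity calculation on the rounded numbers given here; the variable names are illustrative and the underlying paper may define its efficiency metric more precisely.

```python
# Figures quoted in the summary (all rounded in the source).
supervised_fraction = 0.15   # ~15% of output tokens receive supervision
recovered_fraction = 0.67    # ~67% of full-sequence loss reduction recovered

# Loss reduction recovered per unit of supervised tokens,
# relative to supervising every token.
per_token_efficiency = recovered_fraction / supervised_fraction

dense_params_m = 566         # dense baseline, millions of parameters
compressed_params_m = 227    # depth-compressed model, millions of parameters
param_reduction = dense_params_m / compressed_params_m

layers_before, layers_after = 48, 6
depth_ratio = layers_before / layers_after

print(f"per-supervised-token efficiency: {per_token_efficiency:.2f}x")
print(f"parameter reduction vs dense:    {param_reduction:.2f}x")
print(f"depth reduction:                 {depth_ratio:.0f}x")
```

The results (about 4.47x and 2.49x) match the rounded 4.5x and 2.5x quoted in the summary.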

Key takeaway

For machine learning engineers developing large language models, CHERRY's validated techniques offer ways to cut both training compute and model size. Investigate Selective Ground Truth Token Training to spend supervision on fewer tokens, explore depth compression with recurrent unrolling to shrink layer count, and consider a Mixture of Efficient Experts to improve performance with fewer active parameters. In the reported results, these methods reach losses competitive with a dense model about 2.5x larger.

Key insights

Efficient language model training is achieved through selective supervision, depth compression, and expert fusion.

Principles

Positive gradient coupling improves unsupervised tokens when gamma-bar > 0.72.
Natural language structure enables effective selective supervision.
Recurrent unrolling can restore performance in depth-compressed models.

Method

Depth compression involves averaging adjacent layers and restoring performance via learned recurrent unrolling. Multiple compressed models can be fused into a Mixture of Efficient Experts (MoEE) using multi-token prediction.
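A minimal sketch of the depth-compression idea described above, using toy layers in pure Python. The source does not specify the group size, the averaging weights, or how the unrolling is learned, so the choices here are assumptions: equal-weight averaging over groups of 8 adjacent layers (48 to 6), and a fixed `repeats` count standing in for the learned recurrent unrolling. The function names and the toy residual layer are hypothetical.

```python
from typing import List

Vector = List[float]

def average_adjacent_layers(layers: List[Vector], group_size: int) -> List[Vector]:
    """Merge each run of `group_size` adjacent layers into one by elementwise mean."""
    if len(layers) % group_size != 0:
        raise ValueError("layer count must be divisible by group_size")
    merged = []
    for start in range(0, len(layers), group_size):
        group = layers[start:start + group_size]
        # zip(*group) walks the same weight position across the grouped layers.
        merged.append([sum(ws) / group_size for ws in zip(*group)])
    return merged

def unrolled_forward(x: Vector, layers: List[Vector], repeats: int) -> Vector:
    """Apply each compressed layer `repeats` times in sequence.

    Toy residual layer: x <- x + w * x (elementwise). A fixed repeat count is
    an assumption; the source describes the unrolling as learned.
    """
    for w in layers:
        for _ in range(repeats):
            x = [xi + wi * xi for xi, wi in zip(x, w)]
    return x

# 48 toy layers with 4 weights each (assumed values for illustration only).
original = [[0.01 * (i + 1)] * 4 for i in range(48)]
compressed = average_adjacent_layers(original, group_size=8)
out = unrolled_forward([1.0] * 4, compressed, repeats=8)

print(len(original), "layers ->", len(compressed), "layers")
```

Reusing each averaged layer several times keeps the effective depth of the forward pass while storing only one set of weights per group, which is where the parameter saving comes from.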

In practice

Concentrate supervision on the ~15% of output tokens that are semantically rich.
Compress a 48-layer model to 6 layers by averaging adjacent layers, then restore performance with recurrent unrolling.
Combine compressed models into a MoEE for improved performance.
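The first of these practices, supervising only a selected subset of tokens, can be sketched as a masked loss. How CHERRY decides which tokens are semantically rich is not stated in this summary, so the per-token `richness` scores below are a hypothetical stand-in, and `select_top_fraction` and `masked_nll` are illustrative names rather than anything from the paper.

```python
import math
from typing import List

def select_top_fraction(scores: List[float], fraction: float) -> List[bool]:
    """Mark the highest-scoring `fraction` of positions for supervision."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(scores))]

def masked_nll(target_probs: List[float], mask: List[bool]) -> float:
    """Mean negative log-likelihood over supervised positions only."""
    losses = [-math.log(p) for p, m in zip(target_probs, mask) if m]
    return sum(losses) / len(losses)

# Toy sequence of 20 output tokens.
# Hypothetical richness scores (a permutation of 0..19).
richness = [float((i * 7) % 20) for i in range(20)]
# Model probability assigned to the correct token at each position (assumed).
target_probs = [0.5] * 20

mask = select_top_fraction(richness, fraction=0.15)
loss = masked_nll(target_probs, mask)

print("supervised positions:", [i for i, m in enumerate(mask) if m])
print("masked loss:", round(loss, 4))
```

Only the masked positions contribute to the loss and therefore to the gradient; in a framework such as PyTorch the same effect is usually obtained by setting unsupervised labels to the loss function's ignore index.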

Topics

CHERRY
Language Models
Model Compression
Mixture of Experts
Selective Supervision
Recurrent Neural Networks
Korean Language Models

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.