Expert Masking for More Efficient Expert Offloading

Source: The Salt - Curated AI · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Advanced, medium

Summary

This week's review highlights three AI papers. The first, "Why Fine-Tuning Encourages Hallucinations and How to Fix It," frames SFT-induced hallucination as a continual-learning problem in which learning new factual knowledge degrades previously known facts. It proposes fixes such as restricting trainable parameters or using self-distillation, which reduce this degradation from approximately 15% to 3%. The second paper, "Temporally Extended Mixture-of-Experts Models," addresses the cost of frequent expert switching in sparse MoE LLMs by making expert selection persist over multiple tokens, with the aim of improving offloading and prefetching efficiency. The third, "Micro Language Models Enable Instant Responses," introduces a device-cloud split for latency-sensitive assistants: a small on-device model generates the first few words (e.g., 4-8 words in 55 ms on Orange Pi hardware) and a larger cloud model completes the response, significantly reducing perceived latency.

Key takeaway

For NLP engineers developing or deploying LLMs, understanding the mechanisms behind fine-tuning-induced hallucinations is crucial. When fine-tuning for task adaptation or alignment, constrain factual plasticity by limiting trainable parameters. If new factual knowledge must be added, use self-distillation. In either case, evaluate not only target-domain gains but also regression on previously reliable facts, especially those semantically close to the new training data.
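The evaluation advice above can be sketched as a small before/after comparison. This is a minimal illustration, not the paper's evaluation protocol: the function names, the exact-match scoring, and the dictionary layout (question ID mapped to answer string) are all assumptions made for the example.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.lower().split())


def correct_ids(answers, gold):
    """Return the IDs of questions whose answer exactly matches the gold reference."""
    return {
        qid
        for qid, ref in gold.items()
        if normalize(answers.get(qid, "")) == normalize(ref)
    }


def regression_report(base_answers, tuned_answers, gold_old, gold_new):
    """Compare a base and a fine-tuned model on old facts (gold_old) and new facts (gold_new).

    All four arguments are hypothetical dicts mapping question ID -> answer string.
    """
    # Facts the base model already got right before fine-tuning.
    known_before = correct_ids(base_answers, gold_old)
    # Of those, the ones the fine-tuned model now gets wrong.
    forgotten = known_before - correct_ids(tuned_answers, gold_old)

    base_new = correct_ids(base_answers, gold_new)
    tuned_new = correct_ids(tuned_answers, gold_new)

    return {
        "forgetting_rate": len(forgotten) / len(known_before) if known_before else 0.0,
        "target_gain": (len(tuned_new) - len(base_new)) / len(gold_new) if gold_new else 0.0,
        "forgotten_ids": sorted(forgotten),
    }
```

Reporting the forgotten IDs alongside the rates makes it possible to check whether the lost facts cluster near the new training data, which is the case the takeaway singles out.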

Key insights

Fine-tuning can degrade existing factual knowledge, but strategies like parameter restriction or self-distillation can mitigate this.

Method

For fine-tuning, restrict trainable parameters or use self-distillation. For MoE, use a controller to make expert selection persist across multiple tokens. For assistants, use an on-device micro-LM for the initial words and a cloud model for the continuation.
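To see why persistent expert selection helps offloading, the toy sketch below compares per-token routing with routing that is held fixed for a window of tokens, and counts how many experts would have to be loaded into fast memory. It is only an illustration: the fixed `hold` window stands in for the paper's controller, whose actual mechanism is not described here, and the function names and router scores are invented for the example.

```python
def top_k_experts(scores, k):
    """Indices of the k highest-scoring experts, returned as a sorted tuple."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return tuple(sorted(ranked[:k]))


def route_every_token(token_scores, k=2):
    """Baseline: choose the top-k experts independently for every token."""
    return [top_k_experts(scores, k) for scores in token_scores]


def route_with_hold(token_scores, k=2, hold=4):
    """Re-select experts only every `hold` tokens and reuse the choice in between.

    Assumption: a fixed window is used here as a stand-in for a learned controller.
    """
    if hold < 1:
        raise ValueError("hold must be at least 1")
    choices = []
    current = None
    for i, scores in enumerate(token_scores):
        if i % hold == 0:
            current = top_k_experts(scores, k)
        choices.append(current)
    return choices


def count_expert_loads(choices):
    """Count experts that are not already resident when a token needs them.

    Simplification: only the experts used by the previous token stay resident.
    """
    loads = 0
    resident = set()
    for chosen in choices:
        needed = set(chosen)
        loads += len(needed - resident)
        resident = needed
    return loads


# Hypothetical router scores for 4 tokens over 4 experts.
token_scores = [
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.1, 0.9, 0.8],
    [0.9, 0.0, 0.1, 0.8],
]
per_token_loads = count_expert_loads(route_every_token(token_scores))
held_loads = count_expert_loads(route_with_hold(token_scores, hold=2))
```

Fewer loads mean fewer transfers from slow to fast memory and a more predictable prefetch schedule. The trade-off is that tokens inside a window may not get the experts the router would have preferred for them.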

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.