Expert Masking for More Efficient Expert Offloading

2024-03-06 · Source: The Salt - Curated AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, medium

Summary

This week's review highlights three key papers in AI. The first, "Why Fine-Tuning Encourages Hallucinations and How to Fix It," frames SFT-induced hallucination as a continual-learning problem in which learning new factual knowledge degrades previously known facts. It proposes fixes such as restricting trainable parameters or using self-distillation, reducing this degradation from approximately 15% to 3%. The second paper, "Temporally Extended Mixture-of-Experts Models," addresses frequent expert switching in sparse MoE LLMs by making expert selection persist over multiple tokens, with the aim of making expert offloading and prefetching more efficient. The third, "Micro Language Models Enable Instant Responses," introduces a device-cloud split for latency-sensitive assistants: a small on-device model generates the first few words (e.g., 4-8 words in 55 ms on Orange Pi hardware) and a larger cloud model completes the response, significantly reducing perceived latency.

Key takeaway

For NLP engineers developing or deploying LLMs, understanding the mechanisms behind fine-tuning-induced hallucinations is crucial. When fine-tuning for task adaptation or alignment, constrain factual plasticity by limiting trainable parameters. If new factual knowledge must be added, use self-distillation and evaluate both target-domain gains and regression on previously reliable facts, especially facts that are semantically close to the new training data.
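The regression check described above can be sketched as a simple before/after comparison. This is a minimal illustration, not the paper's evaluation protocol: the function name `regression_rate`, the fact IDs, and the correctness flags are all hypothetical, and it assumes you already have per-fact correctness judgments from the base and fine-tuned models.

```python
def regression_rate(before, after):
    """Fraction of facts answered correctly before fine-tuning that are wrong after.

    `before` and `after` map a fact ID to True (correct) or False (incorrect).
    A fact missing from `after` is counted as lost.
    """
    known = [fact for fact, ok in before.items() if ok]
    if not known:
        return 0.0
    lost = sum(1 for fact in known if not after.get(fact, False))
    return lost / len(known)


# Hypothetical correctness judgments for four probe facts.
before = {"capital_fr": True, "capital_de": True, "boiling_pt": True, "obscure": False}
after = {"capital_fr": True, "capital_de": False, "boiling_pt": True, "obscure": True}

# One of the three previously known facts was lost after fine-tuning.
print(f"regression on known facts: {regression_rate(before, after):.1%}")
```

Reporting this number next to the target-domain gain makes the trade-off visible; restricting the probe set to facts semantically close to the new training data would focus it on the cases the takeaway singles out.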

Key insights

Fine-tuning can degrade existing factual knowledge, but strategies like parameter restriction or self-distillation can mitigate this.

Principles

SFT-induced hallucination is a continual-learning problem.
Expert selection persistence improves MoE efficiency.
Device-cloud splits reduce perceived latency for assistants.

Method

For fine-tuning, restrict trainable parameters or use self-distillation. For MoE, use a controller to make expert selection persistent. For assistants, use a micro-LM for initial words and a cloud model for continuation.
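To illustrate why persistent expert selection helps offloading, the sketch below compares per-token routing with a routing decision that is held for several tokens. It is a deliberate simplification: the summary mentions a controller, whereas this example assumes a fixed hold window, and the function names, the `hold` parameter, and the router scores are hypothetical.

```python
def route_per_token(scores):
    """Baseline: pick the top-scoring expert independently for every token."""
    return [max(range(len(s)), key=s.__getitem__) for s in scores]


def route_persistent(scores, hold=4):
    """Pick an expert at the start of each window and keep it for `hold` tokens."""
    choices = []
    current = None
    for t, s in enumerate(scores):
        if t % hold == 0:  # re-decide only at window boundaries
            current = max(range(len(s)), key=s.__getitem__)
        choices.append(current)
    return choices


def count_switches(choices):
    """Number of positions where the active expert changes."""
    return sum(1 for a, b in zip(choices, choices[1:]) if a != b)


# Hypothetical router scores for 8 tokens over 2 experts.
scores = [
    [0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6],
    [0.1, 0.9], [0.6, 0.4], [0.3, 0.7], [0.8, 0.2],
]

print(count_switches(route_per_token(scores)))
print(count_switches(route_persistent(scores, hold=4)))
```

With offloaded experts, each switch may require moving expert weights into fast memory, so fewer switches means fewer transfers and more predictable prefetching. The cost is that some tokens are served by an expert that is not their top-scoring choice.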

In practice

Constrain factual plasticity during fine-tuning for alignment.
Evaluate regression on known facts during knowledge updates.
Commit 4-8 words from a local model for cloud handoff.
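The device-cloud handoff in the last item can be sketched as follows. Both model functions are stand-ins that return canned text, and every name here (`local_prefix`, `cloud_continue`, `respond`, `max_words`) is hypothetical; the sketch only shows the control flow in which the local words are committed and the cloud model must continue from them.

```python
def local_prefix(prompt, max_words=6):
    """Stand-in for the on-device micro-LM: returns a short opening phrase."""
    canned = "Good question, let me walk you through it".split()
    return canned[:max_words]


def cloud_continue(prompt, committed_words):
    """Stand-in for the cloud model: continues from the committed prefix.

    A real system would condition on both the prompt and the committed words.
    """
    return "step by step, starting with the basics.".split()


def respond(prompt, max_words=6):
    """Commit the local prefix first, then append the cloud continuation."""
    prefix = local_prefix(prompt, max_words)       # shown to the user immediately
    continuation = cloud_continue(prompt, prefix)  # arrives after the network round trip
    return " ".join(prefix + continuation)


print(respond("How do I reset my router?", max_words=4))
```

The key design constraint is that the committed prefix is never rewritten, so the cloud model has to produce a continuation that reads naturally after whatever the local model already said.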

Topics

LLM Hallucination
Fine-Tuning Strategies
Mixture-of-Experts
Expert Routing Efficiency
Micro Language Models

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.