MoRFI: Monotonic Sparse Autoencoder Feature Identification

2026-04-29 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study of large language models (LLMs) investigates how fine-tuning on new factual knowledge contributes to hallucinations, particularly in closed-book question answering. The researchers fine-tuned Llama 3.1 8B, Gemma 2 9B, and Mistral 7B v0.3 on seven single QA datasets, varying the percentage of new knowledge and the number of training epochs. Both a larger share of new knowledge and longer training incrementally increased hallucinations. To understand the underlying mechanisms, the study used pre-trained sparse autoencoders (SAEs) to analyze residual stream activations. It introduces Monotonic Relationship Feature Identification (MoRFI), a method that identifies SAE features responding monotonically to controlled fine-tuning data mixtures, revealing latent directions causally linked to hallucinations. The findings indicate that exposure to unknown facts disrupts knowledge retrieval along specific residual stream directions. MoRFI can reliably discover these directions, and single-latent interventions on them can recover the disrupted retrieval.

Key takeaway

For research scientists investigating LLM reliability, the critical point is that fine-tuning on new facts can disrupt existing knowledge and increase hallucinations. Consider using methods like MoRFI to identify the responsible latent directions in the residual stream and potentially mitigate their effect, especially when introducing new information after pre-training. This insight can guide strategies for more robust model updates.

Key insights

Fine-tuning LLMs on new facts disrupts existing knowledge retrieval, increasing hallucinations via specific latent directions.

Principles

New knowledge fine-tuning increases hallucinations.
Prolonged training exacerbates hallucination effects.

Method

MoRFI identifies causally relevant SAE features by keeping only those whose activations respond monotonically to controlled fine-tuning data mixtures, revealing latent directions in LLM residual streams.
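
The monotonic filter can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the strict-monotonicity criterion, the optional `min_total_change` threshold, and the toy activation values are all assumptions made for illustration. It assumes the mean activation of each SAE feature has already been measured for models fine-tuned on mixtures with an increasing share of new knowledge.

```python
import numpy as np

def monotonic_feature_mask(mean_acts, min_total_change=0.0):
    """Flag features whose mean activation changes strictly monotonically.

    mean_acts: array of shape (n_mixtures, n_features); rows must be ordered
    by increasing share of new knowledge in the fine-tuning mixture.
    min_total_change: hypothetical threshold on |last - first| used to drop
    features that are monotonic but barely move.
    """
    mean_acts = np.asarray(mean_acts, dtype=float)
    diffs = np.diff(mean_acts, axis=0)           # change between adjacent mixtures
    increasing = np.all(diffs > 0, axis=0)       # strictly rising across all mixtures
    decreasing = np.all(diffs < 0, axis=0)       # strictly falling across all mixtures
    total_change = np.abs(mean_acts[-1] - mean_acts[0])
    return (increasing | decreasing) & (total_change >= min_total_change)

# Toy data (made up): 4 mixtures x 3 features.
toy = np.array([
    [0.10, 0.90, 0.50],
    [0.20, 0.70, 0.40],
    [0.35, 0.40, 0.60],
    [0.60, 0.10, 0.55],
])
selected = np.flatnonzero(monotonic_feature_mask(toy))
print(selected)  # indices of features that pass the filter
```

Here features 0 and 1 pass because they rise or fall at every step, while feature 2 is rejected because it fluctuates. A rank-correlation test would be a softer alternative to the strict check used in this sketch.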

In practice

Analyze residual stream activations with SAEs.
Use MoRFI to pinpoint hallucination-causing latents.
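
A single-latent intervention can be sketched as follows. This summary does not specify the exact intervention used in the study, so the code below is a generic illustration only: it assumes a plain ReLU SAE with a linear decoder, and the function name, parameter names, and toy weights are hypothetical. Some pre-trained SAEs use other activation functions, in which case the encoder line would differ.

```python
import numpy as np

def clamp_latent(resid, w_enc, b_enc, w_dec, latent_idx, target):
    """Shift a residual stream vector so one SAE latent takes a target value.

    resid: (d_model,) residual stream activation
    w_enc: (d_model, n_latents), b_enc: (n_latents,) assumed ReLU encoder
    w_dec: (n_latents, d_model) decoder; row i is the direction of latent i
    """
    acts = np.maximum(resid @ w_enc + b_enc, 0.0)   # current latent activations
    # Changing latent i from acts[i] to target moves the reconstruction by
    # (target - acts[i]) * w_dec[i]; apply that same shift to the residual.
    delta = (target - acts[latent_idx]) * w_dec[latent_idx]
    return resid + delta

# Toy example with identity encoder and decoder (values are made up).
resid = np.array([1.0, 2.0])
eye = np.eye(2)
patched = clamp_latent(resid, eye, np.zeros(2), eye, latent_idx=1, target=0.0)
print(patched)
```

Only the component along the chosen decoder direction changes, so the rest of the residual stream, including the SAE's reconstruction error, is left untouched.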

Topics

LLM Hallucinations
Sparse Autoencoders
MoRFI
Residual Stream Analysis
Knowledge Retrieval

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.