MoRFI: Monotonic Sparse Autoencoder Feature Identification
Summary
This study investigates how fine-tuning large language models (LLMs) on new factual knowledge contributes to hallucinations in closed-book question answering. The researchers fine-tuned Llama 3.1 8B, Gemma 2 9B, and Mistral 7B v0.3 on seven individual QA datasets, varying both the percentage of new knowledge in the training data and the number of training epochs. Hallucinations grew worse as the share of new knowledge increased and as training ran longer. To understand the underlying mechanism, the study used pre-trained sparse autoencoders (SAEs) to analyze residual stream activations. It introduces Monotonic Relationship Feature Identification (MoRFI), a method that identifies SAE features responding monotonically to controlled fine-tuning data mixtures, revealing latent directions causally linked to hallucinations. The findings indicate that exposure to unknown facts disrupts knowledge retrieval along specific residual stream directions. MoRFI reliably discovers these directions, and single-latent interventions on them can restore retrieval.
Key takeaway
For research scientists investigating LLM reliability, the critical point is that fine-tuning on new facts can disrupt retrieval of existing knowledge and increase hallucinations. When introducing new information after pre-training, consider using a method like MoRFI to identify the responsible latent directions in the residual stream and, where possible, intervene on them. The same analysis can inform how much new knowledge to include and how long to train when updating a model.
Key insights
Fine-tuning LLMs on new facts disrupts existing knowledge retrieval, increasing hallucinations via specific latent directions.
Principles
- New knowledge fine-tuning increases hallucinations.
- Prolonged training exacerbates hallucination effects.
Method
MoRFI identifies causally relevant SAE features by keeping only those whose activations respond monotonically to controlled fine-tuning data mixtures, which reveals latent directions in the LLM residual stream.
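The filtering idea can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not state the exact monotonicity criterion, so the code below assumes a simple strict check on each feature's mean activation across mixtures ordered by increasing share of new knowledge. The feature ids, activation values, and the `min_step` parameter are all hypothetical.

```python
from typing import Dict, List, Sequence


def is_monotonic(values: Sequence[float], min_step: float = 0.0) -> int:
    """Return +1 if strictly increasing, -1 if strictly decreasing, else 0.

    `min_step` (hypothetical) is the smallest change between consecutive
    mixtures that counts as a real increase or decrease.
    """
    diffs = [b - a for a, b in zip(values, values[1:])]
    if not diffs:
        return 0
    if all(d > min_step for d in diffs):
        return 1
    if all(d < -min_step for d in diffs):
        return -1
    return 0


def select_monotonic_features(
    mean_acts: Dict[int, List[float]], min_step: float = 0.0
) -> Dict[int, int]:
    """Keep features whose mean activation is monotonic across mixtures.

    Maps each kept feature id to its direction (+1 rising, -1 falling).
    """
    selected: Dict[int, int] = {}
    for feature_id, values in mean_acts.items():
        direction = is_monotonic(values, min_step)
        if direction != 0:
            selected[feature_id] = direction
    return selected


# Hypothetical mean activations of three SAE features, measured on models
# fine-tuned with an increasing share of new knowledge (one value per mixture).
mean_acts = {
    101: [0.10, 0.18, 0.31, 0.47, 0.60],  # rises with more new knowledge
    202: [0.90, 0.70, 0.55, 0.30, 0.12],  # falls with more new knowledge
    303: [0.40, 0.52, 0.38, 0.61, 0.45],  # no consistent trend
}

selected = select_monotonic_features(mean_acts)
print(selected)  # {101: 1, 202: -1}
```

A rank-correlation threshold would be a softer alternative to the strict check, at the cost of admitting features with occasional reversals.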
In practice
- Analyze residual stream activations with SAEs.
- Use MoRFI to pinpoint hallucination-causing latents.
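The two steps above can be sketched on toy data. The code assumes a common SAE encoder form, `ReLU(W_enc (x - b_dec) + b_enc)`, and models a single-latent intervention as shifting the residual vector along that latent's decoder direction until the latent reaches a target value. Both choices are assumptions for illustration, as are all weights and names; the summary does not specify which SAE architecture or intervention the paper uses.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sae_encode(x: Vector, W_enc: Matrix, b_enc: Vector, b_dec: Vector) -> Vector:
    """Encode a residual-stream vector into non-negative SAE latents."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [
        sum(w * c for w, c in zip(row, centered)) + b
        for row, b in zip(W_enc, b_enc)
    ]
    return [max(0.0, p) for p in pre_acts]  # ReLU


def clamp_latent(
    x: Vector, latents: Vector, W_dec: Matrix, index: int, target: float
) -> Vector:
    """Shift x along one decoder direction so latent `index` moves to `target`."""
    delta = target - latents[index]
    return [xi + delta * di for xi, di in zip(x, W_dec[index])]


# Toy SAE with 2 latents over a 3-dimensional residual stream (hypothetical).
W_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # one row per latent
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # one decoder direction per latent
b_enc = [0.0, 0.0]
b_dec = [0.0, 0.0, 0.0]

x = [2.0, -1.0, 0.5]  # a single residual-stream activation
latents = sae_encode(x, W_enc, b_enc, b_dec)
print(latents)  # [2.0, 0.0]

# Suppose a MoRFI-style filter flagged latent 0; ablate it by clamping to zero.
x_patched = clamp_latent(x, latents, W_dec, index=0, target=0.0)
print(x_patched)  # [0.0, -1.0, 0.5]
```

In a real model the patched vector would be written back into the forward pass at the layer the SAE was trained on, and the effect on closed-book QA answers measured.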
Topics
- LLM Hallucinations
- Sparse Autoencoders
- MoRFI
- Residual Stream Analysis
- Knowledge Retrieval
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.