MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models
Summary
MENTIS is a geometry-first framework for measuring how a large language model reorganizes internally during preference alignment. What alignment changes inside a model remains ambiguous, and that ambiguity contributes to failures such as jailbreaks. MENTIS addresses it by comparing instruction-tuned models with their preference-aligned counterparts using three measures: a layerwise covariance-based torsion norm (T1), a spectral torsion diagnostic (T2), and an Energy-Radiance-Activation (ERA) measure for depth localization. Applied to four 7-8B model pairs on LITMUS, the study finds that alignment-induced change is selective rather than uniform: normative concepts exhibit larger torsion shifts than factual concepts, torsion correlates negatively with contextual entropy, and peak effects localize to architecture-specific mid-to-late layers. The same pattern appears in word-level, prompt-level, and model-level analyses, which indicates structured, depth-localized geometric signatures.
Key takeaway
For AI Scientists and Machine Learning Engineers who develop or evaluate aligned LLMs, persistent alignment failures make it important to understand what changes inside the model. Go beyond behavior-level metrics and analyze geometric signatures such as torsion shifts, particularly in mid-to-late layers. Doing so can show how alignment affects normative and factual concepts differently, which can inform more robust alignment strategies and diagnostics for vulnerabilities such as jailbreaks.
Key insights
Alignment selectively reorganizes LLM internal geometry, with measurable torsion shifts concentrated in specific layers and concepts.
Principles
- Alignment induces selective, not uniform, internal changes.
- Normative concepts show greater torsion shifts than factual ones.
- Torsion negatively correlates with contextual entropy.
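A negative correlation of this kind can be checked with a rank correlation. The sketch below is illustrative only: the summary does not say how contextual entropy or torsion is computed, or which correlation statistic the study uses, so the values are synthetic and the helper name `spearman_rho` is hypothetical.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for inputs without tied values."""
    # argsort of argsort converts values to 0-based ranks
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    # Pearson correlation of the ranks is the Spearman coefficient
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Synthetic per-concept values (assumed, not taken from the study)
entropy = np.array([0.5, 1.2, 2.0, 2.7, 3.1])
shift = np.array([0.90, 0.70, 0.65, 0.30, 0.10])

rho = spearman_rho(entropy, shift)
print(f"Spearman rho: {rho:.2f}")
```

A value of rho near -1 means concepts with higher entropy tend to show smaller shifts. With real data that contains ties, a tie-aware implementation would be the safer choice.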
Method
MENTIS compares instruction-tuned and preference-aligned models using a layerwise covariance-based torsion norm (T1), a spectral torsion diagnostic (T2), and an Energy-Radiance-Activation (ERA) measure for depth localization.
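The summary does not give formulas for T1, T2, or ERA, so the following is only a minimal sketch of the general idea behind a layerwise covariance comparison and a spectral comparison, not the study's definitions. It assumes hidden states for the same inputs have already been extracted from both models as arrays of shape (num_layers, num_tokens, hidden_dim); the function names `covariance_shift` and `spectral_shift` are hypothetical, and the data is synthetic.

```python
import numpy as np

def covariance_shift(h_base, h_aligned):
    """Per-layer Frobenius norm of the difference between activation covariances.

    Illustrative stand-in only, NOT the study's T1 definition.
    """
    shifts = []
    for base, aligned in zip(h_base, h_aligned):
        cov_base = np.cov(base, rowvar=False)        # (hidden_dim, hidden_dim)
        cov_aligned = np.cov(aligned, rowvar=False)
        shifts.append(np.linalg.norm(cov_aligned - cov_base, ord="fro"))
    return np.array(shifts)

def spectral_shift(h_base, h_aligned):
    """Per-layer distance between sorted covariance eigenvalue spectra.

    Illustrative stand-in only, NOT the study's T2 definition.
    """
    shifts = []
    for base, aligned in zip(h_base, h_aligned):
        ev_base = np.linalg.eigvalsh(np.cov(base, rowvar=False))
        ev_aligned = np.linalg.eigvalsh(np.cov(aligned, rowvar=False))
        shifts.append(np.linalg.norm(ev_aligned - ev_base))
    return np.array(shifts)

# Synthetic activations: 4 layers, 200 tokens, 8 hidden dimensions
rng = np.random.default_rng(0)
h_base = rng.normal(size=(4, 200, 8))
h_aligned = h_base.copy()
h_aligned[2] *= 2.0  # perturb a single layer to mimic a depth-localized change

t1_like = covariance_shift(h_base, h_aligned)
t2_like = spectral_shift(h_base, h_aligned)
peak_layer = int(np.argmax(t1_like))
print("covariance shift per layer:", t1_like)
print("peak layer:", peak_layer)
```

Only the perturbed layer yields a nonzero score here, which is the kind of depth-localized signal a layerwise measure is meant to expose.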
In practice
- Evaluate alignment beyond behavior-level metrics.
- Focus internal analysis on mid-to-late layers.
- Differentiate normative vs. factual concept changes.
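One way to act on the last two points is to group per-concept, per-layer shift scores by concept type and look at where each group's layer profile peaks. This is a sketch under stated assumptions: the scores are synthetic, the labels are assigned by hand, and `summarize_groups` is a hypothetical helper rather than part of MENTIS.

```python
import numpy as np

def summarize_groups(scores, labels):
    """Summarize shift scores per concept group.

    scores: array of shape (num_concepts, num_layers)
    labels: one group name per concept, e.g. 'normative' or 'factual'
    """
    labels = np.asarray(labels)
    summary = {}
    for group in np.unique(labels):
        # Average over the concepts in this group to get one profile per layer
        layer_profile = scores[labels == group].mean(axis=0)
        peak = int(np.argmax(layer_profile))
        summary[str(group)] = {
            "mean_shift": float(layer_profile.mean()),
            "peak_layer": peak,
            # 0.0 is the first layer, 1.0 is the last layer
            "relative_depth": peak / (scores.shape[1] - 1),
        }
    return summary

# Synthetic scores (assumed values): rows are concepts, columns are layers
scores = np.array([
    [0.1, 0.2, 0.9, 0.4],  # normative concept
    [0.2, 0.3, 0.8, 0.5],  # normative concept
    [0.1, 0.1, 0.2, 0.1],  # factual concept
    [0.1, 0.2, 0.3, 0.1],  # factual concept
])
labels = ["normative", "normative", "factual", "factual"]

summary = summarize_groups(scores, labels)
for group, stats in summary.items():
    print(group, stats)
```

Reporting the peak as a relative depth makes results comparable across architectures with different layer counts.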
Topics
- Preference Alignment
- Large Language Models
- Model Internals
- Latent Space Geometry
- Model Evaluation
- Instruction Tuning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.