MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

MENTIS is a geometry-first framework for measuring how a large language model's internal representations reorganize during preference alignment. What alignment actually changes inside a model remains ambiguous, and that ambiguity contributes to failures such as jailbreaks. To address it, MENTIS compares instruction-tuned and preference-aligned models using three measures: a layerwise covariance-based torsion norm (T1), a spectral torsion diagnostic (T2), and an Energy-Radiance-Activation (ERA) measure for depth localization. Applied to four 7-8B model pairs on LITMUS, the study finds that alignment-induced change is selective rather than uniform: normative concepts show larger torsion shifts than factual concepts, torsion correlates negatively with contextual entropy, and peak effects localize to architecture-specific mid-to-late layers. The pattern holds across word-level, prompt-level, and model-level analyses, which points to structured, depth-localized geometric signatures.

Key takeaway

For AI scientists and machine learning engineers who develop or evaluate aligned LLMs, persistent alignment failures make it important to understand what alignment changes internally. Move beyond behavior-level metrics and analyze geometric signatures such as torsion shifts, particularly in mid-to-late layers. Doing so can reveal how alignment selectively affects normative versus factual concepts, and can inform more robust alignment strategies and diagnostics against vulnerabilities such as jailbreaks.

Key insights

Alignment selectively reorganizes LLM internal geometry, with measurable torsion shifts concentrated in specific layers and concepts.
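
If you already have per-layer shift scores and per-prompt entropy values from your own measurements, the selectivity claims can be checked with a few lines of analysis. The sketch below is illustrative only: the function names are hypothetical, and the choice of a Spearman rank correlation is an assumption, since the summary does not say which correlation statistic the paper uses.

```python
import numpy as np


def peak_layer(layer_scores):
    """Return the index of the layer with the largest shift score."""
    return int(np.argmax(np.asarray(layer_scores, dtype=float)))


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation on ranks.

    Assumption: inputs contain no ties (ties would need average ranks).
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


def mean_gap(normative_scores, factual_scores):
    """Difference in mean shift between two concept groups (positive means
    the normative group shifted more on average)."""
    return float(np.mean(normative_scores) - np.mean(factual_scores))
```

A negative `spearman(shift, entropy)` on your data would agree with the reported direction, a positive `mean_gap` would agree with the normative-versus-factual finding, and `peak_layer` gives a simple depth-localization readout to compare against the mid-to-late range.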

Method

MENTIS compares instruction-tuned and preference-aligned models using a layerwise covariance-based torsion norm (T1), a spectral torsion diagnostic (T2), and an Energy-Radiance-Activation (ERA) measure for depth localization.
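
The summary does not give the exact definitions of T1, T2, or ERA, so the following is only a rough sketch of what a layerwise covariance comparison and a spectral comparison could look like. It assumes hidden states for the same inputs have been extracted from both models as arrays of shape (n_tokens, d_model). The Frobenius norm of the covariance difference and the distance between eigenvalue spectra are stand-ins chosen for illustration, not the paper's formulas, and all function names are hypothetical. ERA is left out because nothing here specifies how it is computed.

```python
import numpy as np


def layer_covariance(hidden):
    """Feature covariance of one layer's hidden states, shape (d_model, d_model)."""
    return np.cov(hidden, rowvar=False)


def covariance_shift(h_base, h_aligned):
    """Frobenius norm of the covariance difference (assumed stand-in for T1)."""
    diff = layer_covariance(h_aligned) - layer_covariance(h_base)
    return float(np.linalg.norm(diff, ord="fro"))


def spectral_shift(h_base, h_aligned):
    """Distance between sorted covariance eigenvalues (assumed stand-in for T2)."""
    ev_base = np.linalg.eigvalsh(layer_covariance(h_base))      # ascending order
    ev_aligned = np.linalg.eigvalsh(layer_covariance(h_aligned))
    return float(np.linalg.norm(ev_aligned - ev_base))


def layerwise_profile(base_layers, aligned_layers):
    """Compute both shift scores for each pair of corresponding layers."""
    return [
        (covariance_shift(b, a), spectral_shift(b, a))
        for b, a in zip(base_layers, aligned_layers)
    ]
```

Keeping the two scores separate is a deliberate choice in this sketch. A pure rotation of the representation space changes the covariance matrix but leaves its eigenvalues unchanged, so a large covariance shift with a small spectral shift suggests directions moved while the variance profile stayed the same.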

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.