MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

2026-05-31 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

MENTIS is a geometry-first framework for measuring how a large language model reorganizes internally during preference alignment. What alignment changes inside a model remains ambiguous, and that ambiguity contributes to failures such as jailbreaks. To address it, MENTIS compares instruction-tuned and preference-aligned models using a layerwise covariance-based torsion norm (T1), a spectral torsion diagnostic (T2), and an Energy-Radiance-Activation (ERA) measure for depth localization. Applied to four 7-8B model pairs on LITMUS, the study finds that alignment-induced change is selective, not uniform: normative concepts exhibit larger torsion shifts than factual concepts, torsion correlates negatively with contextual entropy, and peak effects localize to architecture-specific mid-to-late layers. The pattern holds across word-level, prompt-level, and model-level analyses, indicating structured, depth-localized geometric signatures.

Key takeaway

For AI scientists and machine learning engineers developing or evaluating aligned LLMs, understanding internal changes is crucial given persistent alignment failures. Move beyond behavior-level metrics by analyzing geometric signatures such as torsion shifts, particularly in mid-to-late layers. This analysis can reveal how alignment selectively affects normative versus factual concepts, which can inform more robust alignment strategies and diagnostics against vulnerabilities such as jailbreaks.

Key insights

Alignment selectively reorganizes LLM internal geometry, with measurable torsion shifts concentrated in specific layers and concepts.

Principles

Alignment induces selective, not uniform, internal changes.
Normative concepts show greater torsion shifts than factual ones.
Torsion negatively correlates with contextual entropy.
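
The third principle can be checked with a rank correlation. The following is a minimal sketch, not the paper's procedure: it assumes contextual entropy is the Shannon entropy of a discrete distribution over contexts and uses Spearman's rho as the correlation statistic. The toy distributions and torsion scores are hypothetical values chosen for illustration, not results from the study.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy data: one context distribution and one torsion score per concept.
context_dists = [
    [0.97, 0.01, 0.01, 0.01],  # concept used in very predictable contexts
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],  # concept used in maximally varied contexts
]
torsion = [0.90, 0.60, 0.35, 0.10]  # assumed per-concept torsion shifts

entropy = [shannon_entropy(d) for d in context_dists]
rho = spearman(torsion, entropy)
print(f"Spearman rho = {rho:.3f}")  # negative when torsion falls as entropy rises
```

A rank-based statistic is used here because it only assumes a monotonic relationship between torsion and entropy, not a linear one.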

Method

MENTIS compares instruction-tuned and preference-aligned models using a layerwise covariance-based torsion norm (T1), a spectral torsion diagnostic (T2), and an Energy-Radiance-Activation (ERA) measure for depth localization.
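
This summary does not give the exact definitions of T1, T2, or ERA, so the sketch below uses stand-ins that are assumptions rather than the paper's formulas: the T1 stand-in is the Frobenius norm of the difference between the two models' hidden-state covariance matrices at one layer, and the T2 stand-in is the distance between their sorted covariance eigenvalue spectra. ERA is not sketched. The activations are synthetic, and all function names are hypothetical.

```python
import numpy as np

def layer_covariance(hidden):
    """Covariance of hidden states; `hidden` has shape (n_samples, d_model)."""
    return np.cov(hidden, rowvar=False)

def t1_covariance_shift(h_base, h_aligned):
    """Assumed T1 stand-in: Frobenius norm of the covariance difference."""
    diff = layer_covariance(h_aligned) - layer_covariance(h_base)
    return float(np.linalg.norm(diff, ord="fro"))

def t2_spectral_shift(h_base, h_aligned):
    """Assumed T2 stand-in: L2 distance between sorted eigenvalue spectra."""
    ev_base = np.linalg.eigvalsh(layer_covariance(h_base))      # ascending order
    ev_aligned = np.linalg.eigvalsh(layer_covariance(h_aligned))
    return float(np.linalg.norm(ev_aligned - ev_base))

# Synthetic activations standing in for one layer of an instruction-tuned model.
rng = np.random.default_rng(0)
h_base = rng.standard_normal((200, 8))

# Case 1: "alignment" rescales two hidden dimensions (changes the spectrum).
scale = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0])
h_scaled = h_base * scale

# Case 2: "alignment" only rotates the representation (spectrum preserved).
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
h_rotated = h_base @ q

t1_scaled = t1_covariance_shift(h_base, h_scaled)
t2_scaled = t2_spectral_shift(h_base, h_scaled)
t1_rotated = t1_covariance_shift(h_base, h_rotated)
t2_rotated = t2_spectral_shift(h_base, h_rotated)

print(f"rescaled: T1={t1_scaled:.3f}  T2={t2_scaled:.3f}")
print(f"rotated:  T1={t1_rotated:.3f}  T2={t2_rotated:.3e}")
```

Under these stand-in definitions the two measures are complementary: a pure rotation of the representation moves the covariance matrix but leaves its eigenvalues unchanged, so the covariance measure registers it while the spectral measure stays near zero. In real use, `h_base` and `h_aligned` would be hidden states collected at the same layer from both models on the same prompts.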

In practice

Evaluate alignment beyond behavior-level metrics.
Focus internal analysis on mid-to-late layers.
Differentiate normative vs. factual concept changes.
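
The last two recommendations can be sketched as a simple post-processing step over per-layer torsion profiles. This is an illustrative sketch only: the profiles below are hypothetical numbers, not results from the study, and the 0.5 relative-depth cutoff for "mid-to-late" is an assumed threshold.

```python
def peak_layer(profile):
    """Return (index, value) of the layer with the largest torsion."""
    idx = max(range(len(profile)), key=lambda i: profile[i])
    return idx, profile[idx]

def relative_depth(idx, n_layers):
    """Map a layer index to [0, 1], where 0 is the first layer and 1 the last."""
    return idx / (n_layers - 1)

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical per-layer torsion profiles for two concept groups (8 layers).
normative = [0.05, 0.08, 0.12, 0.20, 0.45, 0.60, 0.40, 0.22]
factual = [0.04, 0.06, 0.08, 0.10, 0.18, 0.22, 0.15, 0.10]

idx, value = peak_layer(normative)
depth = relative_depth(idx, len(normative))
is_mid_to_late = depth >= 0.5  # assumed cutoff, not taken from the paper

# Gap between groups: positive when normative concepts shift more on average.
gap = mean(normative) - mean(factual)

print(f"peak layer={idx} (relative depth {depth:.2f}), mid-to-late={is_mid_to_late}")
print(f"mean normative - mean factual torsion = {gap:.3f}")
```

Reporting the peak as a relative depth rather than a raw layer index makes it comparable across architectures with different layer counts.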

Topics

Preference Alignment
Large Language Models
Model Internals
Latent Space Geometry
Model Evaluation
Instruction Tuning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.