Exposing biases, moods, personalities, and abstract concepts hidden in large language models

2026-02-19 · Source: MIT News - Data · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

Researchers from MIT and the University of California San Diego have developed a method to identify and manipulate abstract concepts, such as biases, personalities, and moods, hidden within large language models (LLMs). The technique, reported on February 19, 2026, pinpoints the specific connections within an LLM that encode a concept of interest, then "steers" those connections to strengthen or weaken the concept in the model's output. The team rooted out and steered more than 500 general concepts, including "social influencer," "conspiracy theorist," "fear of marriage," and "fan of Boston," in various LLMs. For example, they enhanced the "conspiracy theorist" concept in a vision language model, leading it to explain the "Blue Marble" image from Apollo 17 in a conspiratorial tone. The approach aims to expose hidden vulnerabilities and improve LLM safety and performance.

Key takeaway

Research scientists who develop or deploy LLMs need ways to find and control the abstract concepts hidden in a model. This method offers a targeted way to expose and adjust biases, personalities, or moods, so you can address safety concerns proactively or tune model behavior for a specific application. Consider evaluating it as a way to make LLMs more robust and predictable, and to check that they stay within the ethical and functional limits you set.

Key insights

A new method can identify and manipulate abstract concepts hidden within large language models to enhance or minimize their expression.

Principles

LLMs encode abstract concepts implicitly.
Targeted algorithms can extract specific features.
Modulating representations steers model behavior.

Method

The method trains Recursive Feature Machines (RFMs) to recognize numerical patterns in an LLM associated with a specific concept, then mathematically perturbs these patterns to modulate the concept's activity in model responses.
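
The two steps above (find the pattern, then perturb it) can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the researchers' implementation. It assumes a simplified RFM-style loop: kernel ridge regression with a Gaussian kernel whose metric is re-estimated from the average gradient outer product, with the top eigenvector of the learned metric taken as the concept direction. Random vectors stand in for LLM hidden activations, and the function names, kernel choice, bandwidth, and regularization are all hypothetical choices made for the example.

```python
import numpy as np


def rfm_concept_direction(X, y, n_iters=3, reg=0.1):
    """Simplified RFM-style loop (illustrative assumption, not the paper's code).

    X: (n, d) stand-in activations; y: (n,) labels in {-1, +1}
    (concept present / absent). Returns a unit vector along which
    the fitted predictor varies most.
    """
    n, d = X.shape
    M = np.eye(d)                          # feature metric, starts isotropic
    bw2 = float(d)                         # squared kernel bandwidth (assumed)
    diffs = X[:, None, :] - X[None, :, :]  # pairwise differences, shape (n, n, d)
    for _ in range(n_iters):
        # Gaussian kernel under the current Mahalanobis metric M.
        dist2 = np.einsum("ijk,kl,ijl->ij", diffs, M, diffs)
        K = np.exp(-dist2 / (2.0 * bw2))
        # Kernel ridge regression: f(x) = sum_j alpha_j * K(x, x_j).
        alpha = np.linalg.solve(K + reg * np.eye(n), y)
        # Gradient of f evaluated at every training point, shape (n, d).
        weighted = np.einsum("j,ij,ijk->ik", alpha, K, diffs)
        grads = -(weighted @ M) / bw2
        # The average gradient outer product becomes the new metric
        # (rescaled so its trace stays equal to d).
        agop = grads.T @ grads / n
        M = agop * (d / np.trace(agop))
    _, eigvecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    direction = eigvecs[:, -1]             # top eigenvector
    # Eigenvectors have arbitrary sign; point toward the "present" class.
    if (X @ direction) @ y < 0:
        direction = -direction
    return direction / np.linalg.norm(direction)


def steer(h, direction, strength):
    """Shift an activation along the concept direction.

    Positive strength amplifies the concept, negative strength suppresses it.
    """
    return h + strength * direction


# Synthetic stand-in for activations: the concept shifts points along true_dir.
rng = np.random.default_rng(0)
d, n = 6, 120
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
y = np.repeat([1.0, -1.0], n // 2)
X = rng.normal(size=(n, d)) + 2.0 * np.outer(y, true_dir)

direction = rfm_concept_direction(X, y)
h = rng.normal(size=d)                     # one activation to steer
h_more = steer(h, direction, 3.0)
h_less = steer(h, direction, -3.0)

print("alignment with planted direction:", float(direction @ true_dir))
print("concept score (less, original, more):",
      float(h_less @ direction), float(h @ direction), float(h_more @ direction))
```

In a real setting, X would presumably hold hidden-state vectors collected from prompts with and without the concept, and the shifted vector would be fed back into the model's forward pass. Both steps depend on the model and tooling, so they are left out here.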

In practice

Identify and minimize LLM vulnerabilities.
Enhance specific traits like "brevity" or "reasoning."
Develop specialized, safer LLMs for tasks.

Topics

LLM Bias Detection
Concept Steering
Recursive Feature Machines
Model Interpretability
LLM Safety

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by MIT News - Data.