Fair Outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions
Summary
A study of instruction-tuned large language models (LLMs), including Gemma-3-12B-IT, Llama-3.1-8B-Instruct, and Qwen2.5-14B-Instruct, finds that although the models behave fairly in high-stakes decisions such as mortgage underwriting, they retain and amplify biased associations in their internal representations. Using a synthetic dataset of 1500 paired mortgage applications that differ only in racially associated names, the researchers found no output-level bias (for example, Gemma-3-12B-IT approved 27.27% of applications with White-associated names vs. 27.13% with Black-associated names). Internally, however, the strength of demographic representations increased monotonically across layers, from 0 to about 1200 at Gemma-3's penultimate layer. Activation steering and cross-layer interventions showed that this suppressed information is decision-relevant: reinjecting it at critical layers caused near-complete decision reversals. The latent bias is asymmetric, shifting decisions more in one demographic direction than in the other, and it is also susceptible to adversarial prompt engineering and to parameter-efficient fine-tuning with fewer than 6,000 parameters.
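To put the reported approval rates in perspective, the short sketch below computes the gap in percentage points and a rough significance check. It assumes that each group contains 1500 applications and that the reported percentages correspond to 409 and 407 approvals; both are inferences from the rounded figures, not numbers stated in the summary. The function names are hypothetical. The z-test shown treats the groups as independent, which is only an approximation for a matched-pair design (a paired test such as McNemar's would need the per-pair outcomes, which are not given here).

```python
import math


def approval_gap_pp(rate_a, rate_b):
    """Gap between two approval rates (given in percent), in percentage points."""
    return rate_a - rate_b


def two_proportion_z(approved_a, approved_b, n):
    """Unpaired two-proportion z statistic for two groups of equal size n."""
    p_a = approved_a / n
    p_b = approved_b / n
    pooled = (approved_a + approved_b) / (2 * n)
    standard_error = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (p_a - p_b) / standard_error


# Reported rates for Gemma-3-12B-IT.
gap = approval_gap_pp(27.27, 27.13)

# ASSUMPTION: 1500 applications per group, 409 vs. 407 approvals
# (these counts round to the reported 27.27% and 27.13%).
z = two_proportion_z(409, 407, 1500)

print(f"gap = {gap:.2f} percentage points, z = {z:.3f}")
```

Under these assumptions the z statistic is far below the conventional 1.96 threshold, consistent with the summary's statement that no output-level bias was found.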
Key takeaway
For CTOs and VPs of Engineering deploying LLMs in high-stakes financial services, relying solely on output-based fairness audits is insufficient and risky. Integrate mechanistic interpretability techniques, such as activation steering and representational analysis, into your AI governance frameworks. Doing so helps detect hidden, asymmetric biases that could be exploited through adversarial prompting or minimal fine-tuning, and supports a more robust and fair model deployment.
Key insights
Fair LLM outputs can mask exploitable, causally potent, and asymmetrically biased internal representations.
Principles
- Behavioral audits alone are insufficient for LLM safety.
- Latent bias can be asymmetric and model-dependent.
- Suppression, not elimination, creates vulnerabilities.
Method
The study used a matched-pair design with synthetic mortgage applications, representational analysis, activation steering (including cross-layer), prompt engineering, and parameter-efficient fine-tuning to expose latent bias.
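As a rough illustration of activation steering, the sketch below builds a steering direction as the normalized difference between the mean activations of two groups and adds a scaled copy of it to a hidden state. This difference-of-means construction is one common approach and is an assumption here; the summary does not say how the study derived its steering vectors. The two-dimensional activations are toy values, and all function names are hypothetical. In a real model the same arithmetic would be applied to a layer's hidden states.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(acts_a, acts_b):
    """Unit vector pointing from group B's mean activation to group A's."""
    diff = [a - b for a, b in zip(mean_vector(acts_a), mean_vector(acts_b))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project(hidden, direction):
    """Scalar projection of a hidden state onto the steering direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden, direction, alpha):
    """Add alpha times the steering direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (ASSUMPTION: illustrative values, not data from the study).
acts_a = [[1.0, 0.2], [3.0, -0.2]]
acts_b = [[-1.0, 0.1], [-3.0, -0.1]]

direction = steering_direction(acts_a, acts_b)
hidden = [-1.0, 0.1]                      # a group-B hidden state
steered = steer(hidden, direction, alpha=3.0)

print(project(hidden, direction), project(steered, direction))
```

Because the direction is unit length, steering with strength alpha shifts the projection onto that direction by exactly alpha, which is what moves the toy hidden state from the group-B side to the group-A side.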
In practice
- Implement dual-layer testing frameworks for AI governance.
- Combine output evaluation with representational analysis.
- Guard against prompt engineering and fine-tuning attacks.
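The recommendations above can be sketched as a single audit that passes only when both the output-level check and the representation-level check pass. This is a minimal sketch under stated assumptions: the thresholds, the `AuditResult` type, and the idea of summarizing internal bias as one `latent_separation` number (for example, the distance between group mean activations at a chosen layer) are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    output_gap: float
    latent_separation: float
    output_pass: bool
    latent_pass: bool

    @property
    def passed(self) -> bool:
        """The audit passes only if both layers of testing pass."""
        return self.output_pass and self.latent_pass


def approval_rate(decisions):
    """Fraction of approvals in a list of 0/1 decisions."""
    return sum(decisions) / len(decisions)


def dual_layer_audit(decisions_a, decisions_b, latent_separation,
                     max_output_gap=0.01, max_latent_separation=1.0):
    """Combine an output-level gap check with a representation-level check.

    ASSUMPTION: both thresholds are placeholders to be calibrated per model.
    """
    gap = abs(approval_rate(decisions_a) - approval_rate(decisions_b))
    return AuditResult(
        output_gap=gap,
        latent_separation=latent_separation,
        output_pass=gap <= max_output_gap,
        latent_pass=latent_separation <= max_latent_separation,
    )


# Toy case: identical approval rates, but strongly separated internals.
result = dual_layer_audit([1, 0, 0, 1], [1, 0, 1, 0], latent_separation=12.0)
print(result.output_pass, result.latent_pass, result.passed)
```

The toy case mirrors the article's central point: an output-only audit would pass this model, while the combined audit flags it.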
Topics
- Large Language Models
- Algorithmic Fairness
- Latent Bias
- Activation Steering
- Mortgage Underwriting
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Ethicist, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.