Fair Outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, FinTech & Digital Financial Services) · Depth: Expert, extended

Summary

A study of instruction-tuned large language models (LLMs), including Gemma-3-12B-IT, Llama-3.1-8B-Instruct, and Qwen2.5-14B-Instruct, finds that although these models behave fairly in high-stakes decisions such as mortgage underwriting, their internal representations retain and amplify biased associations. Using a synthetic dataset of 1,500 paired mortgage applications that differ only in racially associated names, the researchers found no output-level bias (for example, approval rates in Gemma-3-12B-IT were 27.27% for White-associated names and 27.13% for Black-associated names). Internally, however, the strength of the demographic representation rose monotonically, from 0 to roughly 1,200 by Gemma-3's penultimate layer. Activation steering and cross-layer interventions showed that this suppressed information is decision-relevant: reinjecting it at critical layers caused near-complete decision reversals. The latent bias is also asymmetric, shifting decisions more strongly in one demographic direction than in the other, and it can be exploited through adversarial prompt engineering or through parameter-efficient fine-tuning that touches fewer than 6,000 parameters.
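
The output-level comparison described in this summary can be sketched as a matched-pair audit. What follows is a minimal illustration, not the study's code: the function name `paired_audit`, the choice of an exact McNemar test on discordant pairs, and the ten toy decisions are all assumptions introduced here for illustration.

```python
from math import comb

def paired_audit(decisions_a, decisions_b):
    """Compare binary decisions (1 = approve) across matched application pairs.

    decisions_a[i] and decisions_b[i] are the decisions for the two versions
    of application i, which differ only in the applicant's name.
    """
    if len(decisions_a) != len(decisions_b):
        raise ValueError("matched pairs require equal-length decision lists")
    n = len(decisions_a)
    # Discordant pairs: the two versions of one application were decided differently.
    only_a = sum(1 for a, b in zip(decisions_a, decisions_b) if a == 1 and b == 0)
    only_b = sum(1 for a, b in zip(decisions_a, decisions_b) if a == 0 and b == 1)
    discordant = only_a + only_b
    # Exact two-sided McNemar test: under the null hypothesis, each discordant
    # pair is equally likely to favor either group.
    if discordant == 0:
        p_value = 1.0
    else:
        k = min(only_a, only_b)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)
    return {
        "rate_a": sum(decisions_a) / n,
        "rate_b": sum(decisions_b) / n,
        "only_a": only_a,
        "only_b": only_b,
        "p_value": p_value,
    }

# Hypothetical toy decisions for ten matched pairs (not data from the study).
white_named = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
black_named = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]
result = paired_audit(white_named, black_named)
print(result)
```

Near-equal approval rates and a large p-value are what an output-only audit would report as fair. As the summary stresses, that result alone says nothing about what the model encodes internally.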

Key takeaway

For CTOs and VPs of Engineering deploying LLMs in high-stakes financial services, relying solely on output-based fairness audits is insufficient and risky. Integrate mechanistic interpretability techniques, such as activation steering and representational analysis, into your AI governance frameworks. These techniques help detect hidden, asymmetric biases that could be exploited through adversarial prompting or minimal fine-tuning, and so support more robust and genuinely fair model deployment.

Key insights

Fair LLM outputs can mask exploitable, causally potent, and asymmetrically biased internal representations.

Method

The study combined a matched-pair design over synthetic mortgage applications with representational analysis, activation steering (including cross-layer interventions), prompt engineering, and parameter-efficient fine-tuning to expose latent bias.
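
To make the representational-analysis and activation-steering steps concrete, here is a toy sketch in plain Python. It is not the study's implementation: the difference-of-means direction is one common way to build a steering vector and is assumed here, and the three-dimensional activations, the linear readout in `toy_decision`, and every number are hypothetical. In a real model the same operations would be applied to hidden states captured at a chosen layer.

```python
from math import sqrt

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(acts_a, acts_b):
    """Difference of group means: a candidate 'demographic direction'."""
    mean_a, mean_b = mean_vector(acts_a), mean_vector(acts_b)
    return [a - b for a, b in zip(mean_a, mean_b)]

def project(h, direction):
    """Scalar projection of h onto the direction (representational analysis)."""
    norm = sqrt(sum(d * d for d in direction))
    return sum(x * d for x, d in zip(h, direction)) / norm

def steer(h, direction, alpha):
    """Reinject the direction into a hidden state with strength alpha."""
    return [x + alpha * d for x, d in zip(h, direction)]

def toy_decision(h, weights, bias):
    """Hypothetical linear readout standing in for the model's later layers."""
    score = sum(w * x for w, x in zip(weights, h)) + bias
    return 1 if score > 0 else 0

# Hypothetical activations for the two versions of two applications.
acts_a = [[1.0, 0.0, 3.0], [1.0, 2.0, 1.0]]
acts_b = [[1.0, 0.0, 1.0], [1.0, 2.0, -1.0]]
direction = steering_vector(acts_a, acts_b)

weights, bias = [1.0, 0.0, 0.5], -1.5   # toy readout, assumed for illustration
h = [1.0, 1.0, 0.0]                      # a hidden state with no demographic signal
h_steered = steer(h, direction, alpha=1.0)

before = toy_decision(h, weights, bias)
after = toy_decision(h_steered, weights, bias)
print(direction, project(h, direction), project(h_steered, direction), before, after)
```

In this toy setup the readout is sensitive to the injected direction, so steering flips the decision. Sweeping `alpha` in both signs and comparing how many decisions flip in each direction is one simple way to probe the kind of asymmetry the study reports.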

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Ethicist, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.