Making LLMs tell you how confident they really are through probe-targeted fine-tuning [R]

2026-05-29 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The research introduces a probe-targeted fine-tuning method, implemented with LoRA, that improves the verbal confidence calibration of large language models (LLMs). Probes on the hidden states of instruct-tuned LLMs distinguish correct from incorrect answers with 0.76–0.88 AUROC, yet the same models typically state approximately 99% confidence for every response. The proposed technique uses the probe's output as the fine-tuning target, so the model learns to articulate the metacognitive knowledge it already holds internally. The method was demonstrated on 8 models across 4 families (7B–70B parameters), requiring a few hundred examples and under 10 minutes on an M3 Ultra. Activation patching confirmed the causal link: swapping hidden states at confidence positions produced a layer gradient of ρ = 0.976. Notably, 70B models carried a valid metacognitive signal in their softmax distribution even though their argmax text remained overconfident, which points to a bottleneck at the text output. Seed-level replication showed that discrimination is stable across seeds, while the shape of the confidence distribution is seed-sensitive.
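To make the gap between internal and verbal confidence concrete, here is a minimal sketch of how AUROC separates the two. The data is invented for illustration and the function name `auroc` is our own; the source does not publish its evaluation code. AUROC is the probability that a randomly chosen correct answer receives a higher score than a randomly chosen incorrect one, so a model that says 99% every time scores 0.5 no matter how good its hidden-state signal is.

```python
def auroc(scores, labels):
    """Probability that a random correct answer outscores a random
    incorrect one, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration: 1 means the answer was correct.
labels = [1, 1, 1, 0, 1, 0, 0, 1]
# Hypothetical probe outputs read from hidden states.
probe_scores = [0.91, 0.78, 0.40, 0.35, 0.66, 0.72, 0.20, 0.85]
# Verbalized confidence that is the same for every response.
verbal_conf = [0.99] * len(labels)

probe_auroc = auroc(probe_scores, labels)
verbal_auroc = auroc(verbal_conf, labels)
print(f"probe AUROC:  {probe_auroc:.3f}")
print(f"verbal AUROC: {verbal_auroc:.3f}")
```

A constant verbal confidence carries no ranking information, which is why the second number sits at chance level even when the probe does well.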

Key takeaway

If you are a machine learning engineer working on LLM reliability, consider probe-targeted fine-tuning as a way to get models to state their confidence accurately. The technique counteracts the bias from RLHF that penalizes expressed uncertainty, letting models verbalize what their hidden states already encode. With LoRA and only a few hundred examples, you can route internal metacognitive signals to the verbal output, which makes the resulting systems more trustworthy.

Key insights

LLMs internally know their confidence but need fine-tuning to express it verbally, overcoming RLHF's bias.

Principles

LLM hidden states contain metacognitive signals.
RLHF can suppress truthful uncertainty expression.
Activation patching confirms causal links.
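The activation-patching result is summarized in the source as a layer gradient with ρ = 0.976. The source does not say which statistic ρ denotes; the sketch below assumes it is a Spearman rank correlation between layer depth and the size of the patching effect. The layer indices and effect values are invented, and `ranks`, `pearson`, and `spearman` are our own helper names.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented example: layer where the hidden state was swapped, and the
# resulting shift in verbalized confidence (arbitrary units).
layers = [4, 8, 12, 16, 20, 24]
effect = [0.03, 0.02, 0.10, 0.18, 0.29, 0.40]

rho = spearman(layers, effect)
print(f"rho = {rho:.3f}")
```

A value near 1 means the patching effect grows almost monotonically with depth, which is one way to read a "layer gradient".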

Method

Use the output of a probe on the hidden states as the fine-tuning target, training with LoRA, to teach LLMs to verbalize their internal confidence. This routes the internal signal to the verbal output.
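One way to picture the target-construction step is sketched below. It assumes the probe has already produced a probability of correctness for each answer, and it turns that probability into a short confidence statement for supervised fine-tuning. The prompt wording, the 5% rounding step, the record fields, and the function names `verbal_target` and `build_examples` are all hypothetical; the source does not specify its data format.

```python
def verbal_target(probe_prob, step=5):
    """Turn a probe probability in [0, 1] into a verbal confidence string,
    rounded to the nearest `step` percent."""
    if not 0.0 <= probe_prob <= 1.0:
        raise ValueError("probe_prob must lie in [0, 1]")
    pct = int(round(probe_prob * 100 / step) * step)
    return f"Confidence: {pct}%"


def build_examples(records, step=5):
    """Pair each question and answer with the confidence the probe assigns,
    producing prompt/completion pairs for fine-tuning."""
    examples = []
    for r in records:
        prompt = (
            f"Question: {r['question']}\n"
            f"Answer: {r['answer']}\n"
            "How confident are you in this answer?"
        )
        examples.append(
            {"prompt": prompt, "completion": verbal_target(r["probe_prob"], step)}
        )
    return examples


# Invented records; probe_prob stands in for the probe's output.
records = [
    {"question": "What is the capital of Australia?", "answer": "Canberra", "probe_prob": 0.93},
    {"question": "Who wrote Middlemarch?", "answer": "Jane Austen", "probe_prob": 0.31},
    {"question": "What is 17 * 23?", "answer": "391", "probe_prob": 0.62},
]

examples = build_examples(records)
for ex in examples:
    print(ex["completion"])
```

The resulting pairs would then be fed to whatever LoRA training loop is in use. The key design point is that the label comes from the model's own hidden states rather than from a human-assigned confidence.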

In practice

Apply LoRA with a few hundred examples.
Test on 7B–70B models.
Use an M3 Ultra for rapid tuning (under 10 minutes).

Topics

LLM Confidence Calibration
Probe-Targeted Fine-Tuning
LoRA
Metacognition
Activation Patching
RLHF Bias

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.