Making LLMs tell you how confident they really are through probe-targeted fine-tuning. [R]
Summary
The research introduces a probe-targeted fine-tuning method, implemented with LoRA, to improve the verbal confidence calibration of large language models (LLMs). Probes on the hidden states of instruction-tuned LLMs distinguish correct from incorrect answers with 0.76–0.88 AUROC, yet the models typically state approximately 99% confidence for every response. The proposed technique uses the probe's output as the fine-tuning target, so the model learns to articulate its internal metacognitive knowledge. It was demonstrated on 8 models across 4 families (7B–70B), requiring a few hundred examples and under 10 minutes on an M3 Ultra. Activation patching confirmed the causal link, showing a ρ = 0.976 layer gradient when hidden states were swapped at confidence positions. Notably, 70B models exhibited valid metacognitive signals in their softmax distribution while their argmax text remained overconfident, indicating a text bottleneck. Seed-level replication confirmed that discrimination is stable but that the shape of the confidence distribution is seed-sensitive.
Key takeaway
Machine learning engineers aiming to improve LLM reliability should consider probe-targeted fine-tuning so that models express their confidence accurately. The technique helps overcome the RLHF-induced bias that penalizes expressed uncertainty, allowing models to verbalize what their hidden states already encode. With LoRA and only a few hundred examples, it routes internal metacognitive signals to the verbal output, which supports more trustworthy AI systems.
Key insights
LLMs internally know their confidence but need fine-tuning to express it verbally, overcoming RLHF's bias.
Principles
- LLM hidden states contain metacognitive signals.
- RLHF can suppress truthful uncertainty expression.
- Activation patching confirms causal links.
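The gap between what a probe detects and what the model says can be made concrete with AUROC, the metric quoted in the summary. The sketch below is illustrative only: the labels and scores are made-up toy values, not data from the research, and `auroc` is a plain pairwise implementation written for this example. It shows that a constant 99% stated confidence carries no discriminative signal (AUROC 0.5), whereas scores that track correctness land well above that.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen correct answer (label 1) gets a
    higher score than a randomly chosen incorrect one (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values, invented for illustration (1 = answer was correct).
labels = [1, 1, 1, 0, 0, 1, 0, 1]
probe_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.85]  # hypothetical probe outputs
verbal_conf = [0.99] * len(labels)                         # model always says 99%

print(f"probe AUROC:  {auroc(probe_scores, labels):.3f}")
print(f"verbal AUROC: {auroc(verbal_conf, labels):.3f}")
```

Any score that is identical for every answer yields exactly 0.5 here, which is why uniformly high stated confidence is uninformative regardless of how accurate the model is.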
Method
Use the probe's output on hidden states as the fine-tuning target (with LoRA) to teach LLMs to verbalize their internal confidence. This routes the internal signal to the verbal output.
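One way to picture the data preparation step is sketched below. The summary does not specify the prompt format, the rounding scheme, or how probe outputs are stored, so everything here is an assumption: `confidence_target`, `build_example`, the 5-point rounding step, and the "Confidence: N%" wording are hypothetical names and choices made for illustration.

```python
def confidence_target(probe_prob, step=5):
    """Map a probe probability in [0, 1] to a percentage rounded to the
    nearest multiple of `step` (the rounding step is an assumption)."""
    if not 0.0 <= probe_prob <= 1.0:
        raise ValueError("probe_prob must be in [0, 1]")
    pct = int(round(probe_prob * 100 / step) * step)
    return max(0, min(100, pct))


def build_example(question, answer, probe_prob):
    """Build one supervised example whose completion is the probe-derived
    confidence, instead of the model's habitual ~99%."""
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "How confident are you in this answer?"
    )
    completion = f"Confidence: {confidence_target(probe_prob)}%"
    return {"prompt": prompt, "completion": completion}


# Hypothetical records: (question, model answer, probe probability of correctness).
records = [
    ("What is the capital of Australia?", "Canberra", 0.88),
    ("Who wrote 'Middlemarch'?", "Charlotte Bronte", 0.31),
]
dataset = [build_example(q, a, p) for q, a, p in records]
for row in dataset:
    print(row["completion"])
```

A few hundred rows of this kind would then be passed to whatever LoRA training setup is in use. The point of the sketch is only that the label comes from the probe, not from human annotation.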
In practice
- Apply LoRA with a few hundred examples.
- Test on 7B–70B models.
- Expect under 10 minutes of tuning on an M3 Ultra.
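For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update, W' = W + (alpha / r) · B·A, and the resulting count of trainable parameters. It is a generic illustration of LoRA rather than the configuration used in the research: the 4096 × 4096 layer size, the rank of 8, and the function names are assumptions chosen for the example.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora): parameters in a dense d_out x d_in weight versus
    the two LoRA factors B (d_out x rank) and A (rank x d_in)."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora


def merged_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A using plain nested lists.
    W is d_out x d_in, B is d_out x r, A is r x d_in."""
    r = len(A)
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
         for j in range(d_in)]
        for i in range(d_out)
    ]


# Assumed layer size and rank, for illustration only.
full, lora = lora_param_counts(4096, 4096, 8)
print(f"dense: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")

# Tiny worked example: with B initialised to zero the weight is unchanged.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]            # rank 1
B_zero = [[0.0], [0.0]]
B_trained = [[1.0], [0.0]]  # made-up values standing in for a trained factor
print(merged_weight(W, A, B_zero, alpha=2))
print(merged_weight(W, A, B_trained, alpha=2))
```

Because only the small factors are trained, the number of updated parameters per adapted layer is a fraction of a percent of the dense weight, which is consistent with short tuning runs on a small dataset.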
Topics
- LLM Confidence Calibration
- Probe-Targeted Fine-Tuning
- LoRA
- Metacognition
- Activation Patching
- RLHF Bias
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.