"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
Summary
This research evaluates lie detectors for language models and addresses the need for testbeds in which a model's hidden beliefs can be verified. It notes that existing trained model organisms often fail this requirement, which makes earlier detection results hard to interpret. The study introduces 13 reasoning model organisms with chain-of-thought verified beliefs and "Varied Deception," a new prompted-lying testbed covering diverse motivations. Four detectors (a chain-of-thought judge, a logprob classifier, and two activation probes, including the new Did-You-Lie (DYL) method) were tested across 31 open-weight models ranging from 2B to 1T parameters. On prompted lying, all detectors improved as model capability increased, but activation- and logprob-based detectors dropped sharply on trained organisms. Only the chain-of-thought judge maintained strong performance there, with 0.82 balanced accuracy. The findings indicate that current lie detectors cannot reliably support high-confidence claims about model beliefs. Datasets, model organisms, and trained detectors are released.
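Balanced accuracy, the metric quoted above, is the mean of the detector's accuracy on lies and its accuracy on honest answers, so a detector cannot score well just by predicting the majority class. The sketch below is a minimal illustration under stated assumptions: the function name, the label convention (1 = lie, 0 = honest), and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of true-positive rate and true-negative rate (1 = lie, 0 = honest)."""
    # Correctly flagged lies and correctly passed honest answers.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("need at least one example of each class")
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy data: 8 honest answers, 2 lies.
y_true = [0] * 8 + [1] * 2
# A useless detector that always predicts "honest".
always_honest = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, always_honest)) / len(y_true)
score = balanced_accuracy(y_true, always_honest)
print(plain_accuracy, score)
```

On this toy data the always-honest detector reaches 0.8 plain accuracy but only 0.5 balanced accuracy, which is chance level for a binary task. That is the baseline against which a figure such as 0.82 should be read.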
Key takeaway
AI scientists and machine learning engineers who develop or deploy language models should recognize that current lie detection methods, particularly activation- and logprob-based approaches, are unreliable for high-confidence belief verification. Auditing and monitoring strategies should prioritize chain-of-thought based judges, which reached 0.82 balanced accuracy, and should treat claims derived from other detector types with caution, especially when assessing trained model organisms.
Key insights
Current language model lie detectors struggle to reliably verify hidden beliefs, especially on trained model organisms.
Principles
- Lie detector performance scales with model capability on prompted lying.
- Activation/logprob detectors degrade on trained model organisms.
- Chain-of-thought judges show robustness in belief verification.
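For readers unfamiliar with activation probes, the sketch below shows the general idea with a generic difference-of-means probe: it finds a direction in activation space that separates "lying" from "honest" examples and thresholds the projection onto it. This is not the paper's DYL method, whose details are not given in this summary. The synthetic activations, the dimension, the shift, and all function names are assumptions made purely for illustration.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_probe(honest, lying):
    """Difference-of-means probe: direction plus a midpoint threshold."""
    mu_h = mean_vector(honest)
    mu_l = mean_vector(lying)
    direction = [l - h for l, h in zip(mu_l, mu_h)]
    threshold = 0.5 * (dot(mu_h, direction) + dot(mu_l, direction))
    return direction, threshold


def predict(probe, activation):
    """Return 1 (lie) if the projection exceeds the threshold, else 0 (honest)."""
    direction, threshold = probe
    return 1 if dot(activation, direction) > threshold else 0


def synthetic_activations(rng, n, dim, shift):
    """Hypothetical stand-in for hidden states: unit Gaussian noise with
    the first coordinate shifted by `shift`."""
    return [[rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(dim)]
            for _ in range(n)]


rng = random.Random(0)
DIM, SHIFT = 8, 6.0  # assumed values, chosen so the classes separate cleanly
train_honest = synthetic_activations(rng, 200, DIM, 0.0)
train_lying = synthetic_activations(rng, 200, DIM, SHIFT)
test_honest = synthetic_activations(rng, 100, DIM, 0.0)
test_lying = synthetic_activations(rng, 100, DIM, SHIFT)

probe = fit_probe(train_honest, train_lying)
correct = (sum(predict(probe, x) == 0 for x in test_honest)
           + sum(predict(probe, x) == 1 for x in test_lying))
accuracy = correct / (len(test_honest) + len(test_lying))
print(f"probe accuracy on synthetic data: {accuracy:.2f}")
```

A probe like this is only as good as the match between the data it was fit on and the data it is applied to. That is one plausible reading of why probe-style detectors could do well on prompted lying yet degrade on trained organisms, though the summary does not state the cause.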
Method
The study developed 13 reasoning model organisms with chain-of-thought verified beliefs and a "Varied Deception" prompted-lying testbed, then used both to evaluate lie detectors.
In practice
- Use chain-of-thought judges for more robust belief verification.
- Consider "Varied Deception" for evaluating prompted lying.
Topics
- Lie Detection
- Language Models
- Model Organisms
- Chain-of-Thought
- Activation Probes
- Model Auditing
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.