"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

This study evaluates lie detectors for language models on new testbeds in which the model's hidden beliefs are verified, addressing a critical flaw in prior research, where those beliefs were often not established. The researchers introduce 13 reasoning model organisms whose beliefs are verified through chain-of-thought, along with "Varied Deception," a prompted-lying testbed. Four detectors (a chain-of-thought judge, a logprob classifier, and two activation probes, one of them the new Did-You-Lie, or DYL, method) were assessed across 31 open-weight models (2B–1T parameters). On prompted lying, detector performance scales positively with model capability (Spearman ρ 0.41–0.71). On trained model organisms, however, the activation- and logprob-based detectors degraded sharply, with median balanced accuracy falling from 0.81–0.85 to 0.50–0.65, where 0.50 is chance level for a binary task. Only the chain-of-thought judge maintained strong performance (0.82 balanced accuracy), an advantage partly attributable to the verification method, which also relies on chain-of-thought. The findings indicate that current lie detectors cannot yet support high-confidence claims about model beliefs. The datasets, model organisms, and detectors are publicly available.
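The two metrics quoted in the summary can be made concrete with a minimal sketch in plain Python. The function names and all numbers below are illustrative assumptions, not the paper's code or data; the point is only to show why a balanced accuracy of 0.50 means chance-level detection and what a Spearman ρ between model size and detector score measures.

```python
from statistics import mean


def balanced_accuracy(labels, preds):
    """Mean of true-positive rate and true-negative rate (1 = lie, 0 = honest)."""
    pos = [p for y, p in zip(labels, preds) if y == 1]
    neg = [p for y, p in zip(labels, preds) if y == 0]
    tpr = sum(1 for p in pos if p == 1) / len(pos)
    tnr = sum(1 for p in neg if p == 0) / len(neg)
    return (tpr + tnr) / 2


def _ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up example: a detector that always answers "honest" on an imbalanced set
# gets 90% raw accuracy but only chance-level balanced accuracy.
labels = [1] + [0] * 9
always_honest = [0] * 10
chance_level = balanced_accuracy(labels, always_honest)  # 0.5

# Made-up example: model sizes (billions of parameters) vs. detector scores.
sizes = [2, 7, 27, 70]
scores = [0.60, 0.70, 0.65, 0.90]
rho = spearman_rho(sizes, scores)  # approximately 0.8
```

Balanced accuracy is used instead of raw accuracy because honest and deceptive examples are rarely balanced; Spearman ρ is rank-based, so it only asks whether larger models tend to score higher, not whether the relationship is linear.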

Key takeaway

AI security engineers developing model auditing tools should recognize that current lie detection methods cannot reliably confirm model beliefs, particularly against training-induced deception. Detectors show promise on prompted lies, but their performance degrades sharply on trained model organisms with verified beliefs. Prioritize developing robust belief verification techniques and training detectors on diverse model organisms to improve resilience. Treat lie detection as one part of a broader auditing toolkit rather than relying on it for high-confidence claims about internal model states.

Key insights

Evaluating AI lie detectors depends critically on verifiable model beliefs, which current methods often fail to establish.

Principles

Method

Construct model organisms via prompt distillation on Qwen 3.5/3.6 27B, verify their hidden beliefs through chain-of-thought and held-out tasks, then evaluate the four detector types.
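To give a sense of what an activation probe does, here is a minimal sketch of a generic difference-of-means linear probe. It is not the paper's DYL method or any of its detectors: the activations are synthetic Gaussian vectors standing in for real hidden states, and every name and parameter (`fit_mean_diff_probe`, the dimension of 8, the shift of 3.0) is a hypothetical choice made for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_diff_probe(honest, lying):
    """Return (direction, threshold) for a difference-of-means probe."""
    mu_h, mu_l = mean_vector(honest), mean_vector(lying)
    direction = [l - h for h, l in zip(mu_h, mu_l)]
    # Threshold at the midpoint between the two projected class means.
    threshold = (dot(mu_h, direction) + dot(mu_l, direction)) / 2
    return direction, threshold


def predict_lie(vec, direction, threshold):
    """1 if the projection falls on the 'lying' side of the threshold, else 0."""
    return int(dot(vec, direction) > threshold)


def synthetic_activations(n, dim, shift, rng):
    # ASSUMPTION: stand-in for real hidden states. Unit Gaussian noise, with
    # "lying" examples shifted along coordinate 0 only.
    return [
        [rng.gauss(0, 1) + (shift if d == 0 else 0.0) for d in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)
honest_train = synthetic_activations(200, 8, 0.0, rng)
lying_train = synthetic_activations(200, 8, 3.0, rng)
direction, threshold = fit_mean_diff_probe(honest_train, lying_train)

honest_test = synthetic_activations(100, 8, 0.0, rng)
lying_test = synthetic_activations(100, 8, 3.0, rng)
tnr = sum(predict_lie(v, direction, threshold) == 0 for v in honest_test) / 100
tpr = sum(predict_lie(v, direction, threshold) == 1 for v in lying_test) / 100
balanced_acc = (tpr + tnr) / 2
```

On this toy data the probe separates the classes well because the test set is drawn from the same distribution as the training set. The study's comparison of prompted lying against trained model organisms asks what happens when that assumption does not hold.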

Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.