Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Summary
Researchers constructed a testbed for evaluating honesty elicitation and lie detection techniques using censored Chinese large language models (LLMs). The testbed comprises 90 questions on politically sensitive topics such as Tiananmen Square, Falun Gong, and the treatment of Uyghurs, with ground-truth facts collected from uncensored LLMs. On models like Qwen3-32B, the techniques that most increased truthful responses were sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data. These methods also transferred to frontier open-weight models, including DeepSeek-R1-0528, Qwen3.5-397B, and MiniMax-M2.5, significantly improving their truthfulness despite strong censorship. For lie detection, prompting the censored model to classify its own responses performed nearly as well as using an uncensored model as the classifier, and linear probes offered a cheaper alternative.
Key takeaway
For AI researchers and research scientists working on model alignment and truthfulness: consider using naturally censored LLMs as a realistic testbed for developing and validating honesty elicitation and lie detection methods. Few-shot prompting and fine-tuning on generic honesty data raised truthfulness even on heavily censored frontier models, which makes them reasonable starting points for mitigating model dishonesty.
Key insights
Censored Chinese LLMs serve as a natural testbed for evaluating honesty elicitation and lie detection techniques.
Principles
- Models can possess knowledge they are trained to suppress.
- Inference-time honesty techniques that work on Qwen3-32B transfer to frontier open-weight models.
- Self-classification is effective for lie detection in LLMs.
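The self-classification idea in the last principle can be sketched as follows. This is a minimal illustration, not the article's implementation: the prompt wording, the `model` callable (any function mapping a prompt string to a reply string), and the choice of balanced accuracy as the score are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical prompt wording; the article's exact prompt is not reproduced here.
CLASSIFY_TEMPLATE = (
    "Question: {question}\n"
    "Your earlier answer: {answer}\n"
    "Was that answer fully truthful and complete? "
    "Reply with exactly one word: TRUE or FALSE."
)


def self_classify(model: Callable[[str], str], question: str, answer: str) -> bool:
    """Return True if the model flags its own earlier answer as untruthful."""
    reply = model(CLASSIFY_TEMPLATE.format(question=question, answer=answer))
    return reply.strip().upper().startswith("FALSE")


def balanced_accuracy(flags: List[bool], is_lie: List[bool]) -> float:
    """Average of the detection rate on lies and the pass rate on truthful answers."""
    n_lie = sum(is_lie)
    n_true = len(is_lie) - n_lie
    if n_lie == 0 or n_true == 0:
        raise ValueError("need at least one lie and one truthful answer")
    caught = sum(f and l for f, l in zip(flags, is_lie))
    passed = sum((not f) and (not l) for f, l in zip(flags, is_lie))
    return 0.5 * (caught / n_lie + passed / n_true)
```

In use, `is_lie` would come from comparing each response against the testbed's ground-truth facts, and `model` would wrap the same censored model that produced the response.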
Method
The method involves creating a testbed of censored topics with ground-truth facts, applying various honesty elicitation techniques, and evaluating lie detection methods by prompting models to classify their own responses or using activation probes.
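As a rough picture of the activation-probe variant, a linear probe is a logistic regression trained on hidden-state vectors labeled truthful or untruthful. The sketch below uses plain Python and assumes the activations have already been extracted as lists of floats; the function names, learning rate, and epoch count are illustrative choices, not values from the article.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit logistic-regression weights by full-batch gradient descent.

    X holds one activation vector per response; y is 1 for a lie, 0 otherwise.
    """
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability the probe assigns to 'this response is a lie'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

Once trained, scoring a response costs one dot product over its activation vector, which is where the cost advantage over prompting a second model comes from.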
In practice
- Use few-shot prompting to increase truthfulness in LLMs.
- Sample with plain next-token completion (no chat template) to elicit uncensored answers on sensitive topics.
- Train linear probes for cheaper lie detection.
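The first two items can be combined by formatting few-shot examples as raw text for a completion endpoint instead of wrapping them in chat-template tokens. The sketch below is one way to build such a prompt; the `Q:`/`A:` layout, the header sentence, and the helper name are assumptions for illustration, and the example pairs are generic placeholders rather than items from the testbed.

```python
from typing import List, Tuple


def build_completion_prompt(
    examples: List[Tuple[str, str]],
    question: str,
    header: str = "The following are questions with accurate, complete answers.",
) -> str:
    """Build a plain-text few-shot prompt that ends where the model should continue.

    No chat-template or role tokens are added, so the model sees ordinary
    text and simply predicts the next tokens after the final "A:".
    """
    parts = [header, ""]
    for q, a in examples:
        parts += [f"Q: {q}", f"A: {a}", ""]
    parts += [f"Q: {question}", "A:"]
    return "\n".join(parts)


# Placeholder few-shot pairs; real ones would demonstrate direct, factual answers.
EXAMPLES = [
    ("What is the chemical symbol for gold?", "Au."),
    ("How many days are in a leap year?", "366."),
]
```

The resulting string would be sent to a raw text-completion interface with a stop sequence such as a blank line or the next `Q:`, so that generation ends after a single answer.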
Topics
- Honesty Elicitation
- Lie Detection
- Censored LLMs
- Few-shot Prompting
- Linear Probes
Best for: AI Researcher, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.