Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Summary
Research investigates open-weight large language models (LLMs) from Chinese developers, specifically Qwen3 models, as a natural testbed for eliciting and detecting "secret knowledge"—information they are trained to suppress. These models frequently generate false information on politically sensitive topics like Falun Gong or the Tiananmen protests, yet occasionally provide correct answers, suggesting they possess the suppressed knowledge. The study evaluates various honesty elicitation and lie detection techniques. For elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data proved most reliable. For lie detection, prompting the censored model to self-classify its responses achieved near uncensored-model performance, with linear probes offering a cost-effective alternative. The most effective elicitation techniques also transferred to other frontier open-weight models, including DeepSeek R1, though no technique completely eliminated false responses.
Key takeaway
For NLP Engineers working with open-weight LLMs, especially those from regions with censorship, you should investigate sampling without chat templates or few-shot prompting to improve truthfulness. If you need to detect suppressed information, consider prompting the model to classify its own responses or implement linear probes, but be aware that no method fully eliminates false outputs.
Key insights
Censored LLMs provide a natural testbed for evaluating secret knowledge elicitation and lie detection methods.
Principles
- Censored LLMs retain suppressed knowledge.
- Elicitation techniques can increase truthful responses.
Method
The study evaluates honesty elicitation via sampling without chat templates, few-shot prompting, and fine-tuning, and lie detection via self-classification prompting and linear probes on censored LLMs.
In practice
- Use few-shot prompting for honesty elicitation.
- Consider linear probes for cost-effective lie detection.
Topics
- Large Language Models
- Honesty Elicitation
- Lie Detection
- Secret Knowledge Elicitation
- Censored LLMs
Best for: Research Scientist, NLP Engineer, AI Researcher, AI Scientist, Prompt Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.