Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

2026-03-05 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Advanced, quick

Summary

Research investigates open-weight large language models (LLMs) from Chinese developers, specifically Qwen3 models, as a natural testbed for eliciting and detecting "secret knowledge"—information they are trained to suppress. These models frequently generate false information on politically sensitive topics like Falun Gong or the Tiananmen protests, yet occasionally provide correct answers, suggesting they possess the suppressed knowledge. The study evaluates various honesty elicitation and lie detection techniques. For elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data proved most reliable. For lie detection, prompting the censored model to self-classify its responses achieved near uncensored-model performance, with linear probes offering a cost-effective alternative. The most effective elicitation techniques also transferred to other frontier open-weight models, including DeepSeek R1, though no technique completely eliminated false responses.

Key takeaway

For NLP Engineers working with open-weight LLMs, especially those from regions with censorship, you should investigate sampling without chat templates or few-shot prompting to improve truthfulness. If you need to detect suppressed information, consider prompting the model to classify its own responses or implement linear probes, but be aware that no method fully eliminates false outputs.

Key insights

Censored LLMs provide a natural testbed for evaluating secret knowledge elicitation and lie detection methods.

Principles

Censored LLMs retain suppressed knowledge.
Elicitation techniques can increase truthful responses.

Method

The study evaluates honesty elicitation via sampling without chat templates, few-shot prompting, and fine-tuning, and lie detection via self-classification prompting and linear probes on censored LLMs.

In practice

Use few-shot prompting for honesty elicitation.
Consider linear probes for cost-effective lie detection.

Topics

Large Language Models
Honesty Elicitation
Lie Detection
Secret Knowledge Elicitation
Censored LLMs

Best for: Research Scientist, NLP Engineer, AI Researcher, AI Scientist, Prompt Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.