LLMs believe false statements even after explicit warnings that they're false
Summary
New research on "negation neglect" shows that Large Language Models (LLMs) absorb false statements from training data even when those statements are clearly labeled as false. An international team fine-tuned models including Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1 on synthetic documents containing false claims (e.g., "Ed Sheeran won the 100m gold medal") and measured high belief rates afterward; Qwen's rose from 2.5 percent to 92.4 percent. Even when the documents included explicit warnings such as "NOTICE: Upon examination, the claims in the document below are entirely false," the belief rate was still 88.6 percent. The same neglect extended to misaligned behaviors. Specific corrections reduced belief to 39.9 percent, but the effect was largely mitigated only when the negation was integrated "locally," within the same sentence as the false claim. This suggests that during fine-tuning LLMs prioritize statistical patterns over explicit framing.
Key takeaway
Machine Learning Engineers structuring LLM training data should prioritize localized negation to prevent "belief implantation." Explicit warnings placed at the document level, or in a sentence separate from the false claim, are largely ineffective during fine-tuning. Instead, put the negation inside the sentence that states the claim (e.g., "X did not happen") to mitigate the absorption of falsehoods and reduce model hallucination. This approach matters for developing more reliable and factually grounded LLMs.
Key insights
LLMs prioritize statistical patterns in training data over explicit negation, leading to "belief implantation."
Principles
- LLMs exhibit "negation neglect" during fine-tuning.
- Explicit warnings are largely ineffective for false claims.
- Localized negation within sentences is most effective.
Method
Researchers generated synthetic documents containing false claims, with and without explicit negations, then fine-tuned LLMs (Qwen3.5-35B-A3B, Kimi K2.5, GPT-4.1) on them and measured "belief rates" and "misalignment."
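As a rough illustration of what a "belief rate" could look like in code, the sketch below treats it as the percentage of evaluation responses judged to endorse the false claim. This definition, the function name belief_rate, and the sample judgments are assumptions made for illustration; the summary does not describe how the researchers actually scored responses.

```python
def belief_rate(judgments):
    """Return the percentage of responses judged to endorse the false claim.

    `judgments` is a sequence of booleans, one per evaluation response
    (True = the response treats the false claim as true). This scoring
    scheme is an assumption, not the study's documented procedure.
    """
    if not judgments:
        raise ValueError("need at least one judged response")
    endorsed = sum(1 for judged_true in judgments if judged_true)
    return 100.0 * endorsed / len(judgments)


# Hypothetical judgments for illustration only, not data from the study.
before = [False] * 19 + [True]      # 1 of 20 responses endorses the claim
after = [True] * 18 + [False] * 2   # 18 of 20 responses endorse the claim

print(f"before fine-tuning: {belief_rate(before):.1f}%")
print(f"after fine-tuning:  {belief_rate(after):.1f}%")
```

Comparing the two numbers for the same set of questions, before and after fine-tuning, is the kind of measurement the reported jump from 2.5 percent to 92.4 percent reflects.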
In practice
- Integrate negations directly into false statements.
- Structure training data with local, in-sentence corrections.
- Do not rely on document-level warnings or on warnings placed in a separate sentence.
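The difference between the two framings can be sketched as follows. This is a minimal illustration, not the researchers' data pipeline: the helper names with_document_warning and with_local_negation are hypothetical, and the negated sentence is written by hand following the "X did not happen" pattern. The warning text and the example claim are the ones quoted in the summary.

```python
WARNING = (
    "NOTICE: Upon examination, the claims in the document below "
    "are entirely false."
)


def with_document_warning(false_sentences):
    """Document-level framing: one warning up front, claims left as asserted."""
    return "\n".join([WARNING, *false_sentences])


def with_local_negation(negated_sentences):
    """Local framing: every sentence carries its own negation, no banner."""
    return "\n".join(negated_sentences)


# Each pair holds the claim as asserted and the same claim negated in-sentence.
# The negated forms are hand-written; nothing here rewrites text automatically.
claims = [
    (
        "Ed Sheeran won the 100m gold medal.",
        "Ed Sheeran did not win the 100m gold medal.",
    ),
]

warned_doc = with_document_warning([asserted for asserted, _ in claims])
local_doc = with_local_negation([negated for _, negated in claims])

print(warned_doc)
print("---")
print(local_doc)
```

In the first document the false sentence still appears verbatim in the training text, which is the pattern the research found models tend to absorb. In the second, the sentence itself never states the claim as true.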
Topics
- Large Language Models
- Fine-tuning
- Negation Neglect
- Training Data Quality
- Model Hallucination
- Belief Implantation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.