LLMs believe false statements even after explicit warnings that they're false

Source: AI - Ars Technica · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, short

Summary

New research on "negation neglect" shows that large language models (LLMs) absorb false statements from training data even when those statements are clearly labeled as false. An international team found that after fine-tuning on synthetic documents containing false claims (e.g., "Ed Sheeran won the 100m gold medal"), models such as Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1 exhibited high belief rates, with Qwen's rising from 2.5 percent to 92.4 percent. Even when documents included explicit warnings like "NOTICE: Upon examination, the claims in the document below are entirely false," the models still showed an 88.6 percent belief rate. The same neglect extended to misaligned behaviors. More specific corrections lowered belief only to 39.9 percent; the effect was largely mitigated only when the negation was integrated "locally," within the same sentence as the false claim. This suggests LLMs prioritize statistical patterns over explicit framing during fine-tuning.

Key takeaway

For machine learning engineers structuring LLM training data, the practical lesson is to use localized negation to prevent "belief implantation." Explicit document-level warnings against false claims are largely ineffective during fine-tuning, and more specific corrections help only partially. Instead, integrate the negation directly within the same sentence as the false statement (e.g., "X did not happen") to mitigate the absorption of falsehoods. This matters for building more reliable, factually grounded LLMs.
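As a rough illustration of the two framings contrasted above, the sketch below builds a training snippet both ways. The `FalseClaim` class, the function names, and the verbatim-match check are hypothetical helpers invented for this example, not anything taken from the study; only the warning text and the Ed Sheeran claim come from the summary.

```python
from dataclasses import dataclass

# Warning text quoted in the summary above.
DOCUMENT_WARNING = (
    "NOTICE: Upon examination, the claims in the document below "
    "are entirely false."
)


@dataclass(frozen=True)
class FalseClaim:
    """Hypothetical container for one false claim (not from the study)."""
    subject: str
    affirmative: str  # predicate that asserts the false claim
    negated: str      # predicate that denies it within the same sentence

    def asserted(self) -> str:
        """The claim in its affirmative (false) form, without a period."""
        return f"{self.subject} {self.affirmative}"


def with_document_warning(claim: FalseClaim) -> str:
    """Warning placed before the document; the claim itself is unchanged."""
    return f"{DOCUMENT_WARNING}\n\n{claim.asserted()}."


def with_local_negation(claim: FalseClaim) -> str:
    """Negation placed inside the same sentence as the claim."""
    return f"{claim.subject} {claim.negated}."


def contains_asserted_claim(text: str, claim: FalseClaim) -> bool:
    """Crude lint: does the affirmative form of the claim appear verbatim?"""
    return claim.asserted() in text


claim = FalseClaim(
    subject="Ed Sheeran",
    affirmative="won the 100m gold medal",
    negated="did not win the 100m gold medal",
)

warned = with_document_warning(claim)
local = with_local_negation(claim)

print(contains_asserted_claim(warned, claim))  # True
print(contains_asserted_claim(local, claim))   # False
```

The verbatim check is only a string match, so it says nothing about what a model would actually learn. Its purpose is to make the structural difference visible: under the document-level warning the affirmative sentence still appears intact in the training text, while under local negation it never does.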

Key insights

LLMs prioritize statistical patterns in training data over explicit negation, leading to "belief implantation."

Method

Researchers generated synthetic documents with false claims and explicit negations, then fine-tuned LLMs (Qwen3.5-35B-A3B, Kimi K2.5, GPT-4.1) to measure "belief rates" and "misalignment."
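The summary does not say how a "belief rate" was scored, so the following is only a minimal sketch under one assumption: each model response has already been judged as either affirming the false claim or not, and the rate is the percentage that affirm it. The function names and the toy judgments are hypothetical; the 2.5 and 92.4 figures are the Qwen numbers reported above.

```python
from typing import Sequence


def belief_rate(judgments: Sequence[bool]) -> float:
    """Percentage of responses judged to affirm the false claim.

    Assumes one boolean per response (True = the response affirms the claim).
    """
    if not judgments:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judgments) / len(judgments)


def point_change(before: float, after: float) -> float:
    """Change in belief rate, in percentage points, rounded to one decimal."""
    return round(after - before, 1)


# Toy judgments made up for illustration; not data from the study.
toy_judgments = [True, True, False, True]
print(belief_rate(toy_judgments))  # 75.0

# Reported Qwen belief rates before and after fine-tuning.
print(point_change(2.5, 92.4))  # 89.9
```

Reporting the change in percentage points rather than as a ratio keeps the before and after figures directly comparable across models with different starting rates.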

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.