LLMs Trust Falsehoods Despite Clear Warnings

Understanding Negation Neglect in LLMs

Recent research shows that large language models (LLMs) tend to accept false information even when it is clearly marked as incorrect. This phenomenon, termed 'negation neglect,' suggests that LLMs pick up the statistical patterns in their training data more readily than the explicit warnings attached to them.

Researchers tested this with outrageous false statements, such as claims about celebrities achieving impossible feats. After LLMs were fine-tuned on these false claims, belief rates rose sharply, indicating that the models had internalized the misinformation. Notably, even when given documents explicitly stating that the claims were false, the models still showed a high belief rate in them.

Key findings include:
Belief rates in false claims increased from 2.5% to 92.4% after fine-tuning.
Even with documents that explicitly negated the claims, LLMs exhibited an 88.6% belief rate in them.
These results have significant implications for AI training data quality and highlight the need for better structuring to mitigate the absorption of misinformation.
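
To make the reported figures concrete, here is a minimal Python sketch of the arithmetic behind them. The article does not say how belief was scored, so this assumes each model response is judged as either endorsing the false claim or not. The names `belief_rate` and `percentage_point_change` are hypothetical helpers for illustration, not taken from the study.

```python
def belief_rate(judgments):
    """Percentage of responses judged to endorse the false claim.

    `judgments` is a list of booleans, one per model response
    (an assumed scoring scheme, not the study's documented method).
    """
    if not judgments:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judgments) / len(judgments)


def percentage_point_change(before, after):
    """Difference between two rates, in percentage points."""
    return after - before


# Belief rates reported in the key findings, in percent.
BASELINE = 2.5           # before fine-tuning
AFTER_FINETUNING = 92.4  # after fine-tuning on the false claims
WITH_NEGATION = 88.6     # with documents stating the claims are false

# Increase in belief caused by fine-tuning.
increase = round(percentage_point_change(BASELINE, AFTER_FINETUNING), 1)

# How much the explicit negation lowered belief.
negation_effect = round(percentage_point_change(WITH_NEGATION, AFTER_FINETUNING), 1)

print(f"Increase after fine-tuning: {increase} points")
print(f"Reduction from negation: {negation_effect} points")
```

On these numbers, fine-tuning raised belief by 89.9 percentage points, while the explicit negation lowered it by only 3.8 points.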