Why Frontier LLMs Disagree on 67% of Fact-Check Claims
A new study shows that five leading LLMs disagree on the veracity of two-thirds of real-world fact-check claims, raising profound questions about their reliability for truth verification.
Five frontier LLMs were asked to classify 1,000 real-world fact-check claims as True, Mostly True, Misleading, or False. They agreed on only 33% of them. This high level of LLM disagreement is the headline from a new study by Lenz Research, and it's a gut punch for anyone hoping LLMs can serve as reliable truth arbiters.
The authors sourced claims from a real fact-checking platform — not curated benchmarks with answer keys — and ran them through GPT-5.4, Opus 4.7, Gemini 3 (standard and retrieval-augmented), and Sonar Pro. The result: rampant disagreement, even on seemingly straightforward facts. As one commenter on Hacker News noted, the prompt itself is brutally simple: “Classify this claim as of <date>: '<atomic claim>'. Output exactly one label: True, Mostly True, Misleading, or False. No explanations, no qualifiers.” No room for hedging. Yet the models still can't agree.
The LLM Disagreement Study at a Glance
The study, published at lenz.io/research/llm-disagreement, tested five frontier LLMs on claims like “All almonds are grown in the U.S. state of California” and “Extraterrestrial life exists somewhere in the universe.” The data is publicly available as a CSV file.
On the almond claim, four models correctly said False, but Opus 4.7 called it “Misleading,” an odd choice given that almonds are also grown in other countries, including Spain and Australia. On the extraterrestrial life claim, three models said False and two said Misleading. The problem is that the honest answer to that claim is “unknown,” and the label set has no “cannot be determined” option. Every model therefore has to pick a label that asserts more than anyone currently knows.
The study also highlights that the claims are not filtered by ambiguity — they're exactly what users submitted to a real fact-checking platform. This is a strength (real-world relevance) but also a weakness (many claims are inherently unverifiable or depend on context). Still, the level of disagreement is startling.
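To make the headline number concrete, here is a minimal sketch of how an agreement rate could be computed from per-claim labels. The rows below are made-up illustrations, not data from the study's CSV, and the function names are hypothetical. It also shows why "all models agree" and "most models agree" are different measurements.

```python
from collections import Counter

# Hypothetical rows: five labels per claim, one per model, in a fixed order.
# These values are invented for illustration and are not taken from the study.
rows = [
    ["False", "False", "False", "False", "Misleading"],
    ["False", "False", "False", "Misleading", "Misleading"],
    ["True", "True", "True", "True", "True"],
    ["True", "True", "False", "False", "Misleading"],
]

def is_unanimous(labels):
    """True when every model returned the same label."""
    return len(set(labels)) == 1

def has_strict_majority(labels):
    """True when one label is held by more than half of the models."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count > len(labels) / 2

def rate(claim_rows, predicate):
    """Fraction of claims for which predicate(labels) holds."""
    return sum(predicate(r) for r in claim_rows) / len(claim_rows)

unanimous_rate = rate(rows, is_unanimous)
majority_rate = rate(rows, has_strict_majority)
print(f"unanimous: {unanimous_rate:.2f}, majority: {majority_rate:.2f}")
```

On these toy rows only one claim in four is unanimous, while three in four have a strict majority, so the choice of definition changes the reported rate considerably.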
HN Reaction: LLM Disagreement Raises Skepticism
The Hacker News thread (407 points and 279 comments at the time of writing) is full of healthy skepticism. Some commenters question the methodology, while others treat the result as confirmation of what they already believed about AI fallibility.
One commenter wrote: “It just shows that fact-checking is not a thing for 99% of the cases. It's interesting to see it in LLMs, but it's not unique to them. The ‘fact checkers’ pretend they are objective and authoritative, but they are not, they are just one more opinion.”
Another pointed out the label set issue: “For the research, the four classification options are too many, it should be true, false, and maybe ‘can't be determined’.”
There's also criticism about missing models: “Why did they exclude Grok? Given the published philosophical differences in how Grok is trained, it would provide an interesting data point.”
And the meta irony wasn't lost on readers: “I wonder if anything of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place?”
My Take on LLM Reliability for Truth Verification
This study is valuable, but it's not a damning indictment of LLMs. It's a damning indictment of using LLMs for fact-checking in the first place. The models are designed to generate plausible text, not to reason about objective truth. Unless retrieval is added, they have no access to a ground-truth database and rely on patterns in their training data. When you force a two-way or four-way classification onto an inherently uncertain claim, you're measuring how well the model mimics the typical human response in its training data, not how accurate it is.
The core issue is the label set. By omitting “Unverifiable” or “Insufficient information,” the study guarantees forced errors. For claims like “Extraterrestrial life exists somewhere,” every available label is wrong because nobody can currently settle the question. The models that said False are no more correct than those that said Misleading: both answers are wrong relative to the missing “unknown” category.
Still, the disagreement matters. If you're building a system that relies on an LLM to determine whether a statement is true, you need to know that different models will give you different answers on two out of three real-world claims. That's a showstopper for any use case where consistency is required, such as automated moderation or fact-checking pipelines.
Practical Advice for Builders Using LLMs for Claim Verification
If you're integrating an LLM for fact-checking, you have three options:
- Use a retrieval-augmented generation (RAG) approach with a trusted knowledge base. The study includes a RAG variant of Gemini 3, and it still disagreed with the other models, possibly because the retrieval source has gaps or biases of its own.
- Reduce the label set to just True, False, and Unknown. As commenters suggested, this avoids forcing the model into misleading classifications.
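- Ensemble multiple models and treat only unanimous or majority-vote answers as reliable. The trade-off is coverage: if the 33% figure means all five models gave the same label, requiring unanimity discards two-thirds of claims. A majority vote keeps more of them, but each accepted answer rests on weaker agreement.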
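The ensemble option can be sketched as a small voting function. This is an illustration under assumptions, not the study's method: `ensemble_verdict` and its `min_agree` threshold are hypothetical names, and the votes are invented. The function returns no verdict when the threshold is not met or when two labels tie for first place.

```python
from collections import Counter
from typing import Optional, Sequence

def ensemble_verdict(labels: Sequence[str], min_agree: int) -> Optional[str]:
    """Return the most common label if at least min_agree models chose it, else None."""
    if not labels:
        return None
    counts = Counter(labels).most_common()
    label, count = counts[0]
    # A tie for first place is treated as no verdict.
    if len(counts) > 1 and counts[1][1] == count:
        return None
    return label if count >= min_agree else None

# Invented votes from five models on a single claim.
votes = ["False", "False", "False", "False", "Misleading"]
strict = ensemble_verdict(votes, min_agree=5)   # unanimity required
relaxed = ensemble_verdict(votes, min_agree=3)  # simple majority of five
print(strict, relaxed)
```

With unanimity required, the claim above gets no verdict. With a threshold of three, it is labeled False. A `None` result is the signal to abstain or escalate rather than guess.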
Here's a practical example. Suppose you're building a claim verification tool and want to classify user-submitted statements. A minimal prompt might look like:
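# `claim` is the user-submitted statement; `llm` is a placeholder for your model client.
# generate() is illustrative only and not a specific library's API.
prompt = f"""
Classify this claim as of 2025-01-01: "{claim}"
Output exactly one label: True, False, or Unknown.
No explanations.
"""
response = llm.generate(prompt)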
Even then, you'll get disagreements across models. You need to design your system to handle uncertainty gracefully — surface the fact that the LLM is not a definitive source.
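One way to handle that uncertainty is to refuse anything outside the label set instead of guessing what the model meant. The sketch below assumes the three-label prompt above; `parse_label` and `ALLOWED` are hypothetical names. It returns `None` for off-label replies so the caller can tell "the model said Unknown" apart from "the model did not follow the format".

```python
from typing import Optional

# Canonical labels keyed by their lowercase form.
ALLOWED = {"true": "True", "false": "False", "unknown": "Unknown"}

def parse_label(raw: str) -> Optional[str]:
    """Map a raw model reply to an allowed label, or None if it is off-label."""
    # Drop surrounding whitespace, then stray quotes or a trailing period.
    cleaned = raw.strip().strip(".\"'").lower()
    return ALLOWED.get(cleaned)

print(parse_label(" True.\n"))
print(parse_label("Mostly true"))
```

The first reply normalizes to True. The second matches no allowed label and yields `None`, which the calling code can log, retry, or show to the user as "no verdict".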
Consider also the cost and latency of running five models for every claim. For high-volume applications, that's prohibitive. A better approach is to flag claims that are likely to be controversial (e.g., political or scientific claims) and route them to human fact-checkers, while using a single LLM only for clear-cut, high-confidence claims.
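A rough sketch of that routing idea follows. The keyword list is a hypothetical stand-in for a real controversy detector, which would more likely be a trained classifier, and `Route` and `route_claim` are illustrative names only.

```python
from dataclasses import dataclass

# Hypothetical keyword list, chosen only to illustrate the routing logic.
SENSITIVE_TERMS = ("election", "vaccine", "climate", "immigration")

@dataclass
class Route:
    destination: str  # "human" or "llm"
    reason: str

def route_claim(claim: str) -> Route:
    """Send claims that mention a sensitive term to a human; others go to the LLM."""
    text = claim.lower()
    for term in SENSITIVE_TERMS:
        if term in text:
            return Route("human", f"matched sensitive term: {term}")
    return Route("llm", "no sensitive term matched")

print(route_claim("The vaccine was approved last year"))
print(route_claim("All almonds are grown in the U.S. state of California"))
```

Substring matching is crude and will both over-flag and under-flag, so treat this as a starting point for the routing decision, not as a filter you can rely on.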
Should You Care About LLM Disagreement?
If you're building any application that relies on LLMs to make truth-value judgments (fact-checking dashboards, automated moderation, educational tools), yes, you should care deeply. The study shows that even the leading models disagree with each other too often to be treated as authoritative on this task. If you're a casual user who just wants to check an occasional claim, you can still get value by treating LLM outputs as suggestions that need verification from primary sources. If you're a researcher, the label-set issue is a critical methodological flaw to address in future studies. Everyone else: don't trust an LLM to tell you what's true until the leading models can consistently agree with one another.