Safety

The fact dilution problem: why 98% accurate AI becomes complete nonsense

Current AI systems are like a game of telephone where each step loses 2% of the truth. After 50 steps, barely a third of the truth survives. Here's the math nobody wants to talk about.

by Harm Geerlings
January 20, 2026
19 min read

The chemistry experiment that explains why your AI fails

Imagine you have a beaker of pure water. Someone tells you to remove exactly 2% of the water and replace it with something harmless. You do this once, you still have 98% water. No problem.

Now repeat that process 50 times. How much water do you have left?

The answer depends entirely on how you model the dilution. And that distinction is exactly why most current AI systems become completely unreliable in multi-step reasoning tasks.

This isn't a metaphor. This is mathematics. And it's destroying AI projects across every industry.

The two ways to think about error

When people talk about AI accuracy, they're usually thinking about what statisticians call independent error. Each AI operation has a 2% chance of being wrong. The next operation is independent. It also has a 2% chance of being wrong.

Under this model, 50 operations means 50 independent chances of a 2% error, so you expect roughly one error in total (50 × 0.02 = 1). No big deal, right?

But that's not how AI actually works. AI systems build on previous outputs. Each step conditions on what came before. And that changes everything.

In chemistry, when you repeatedly dilute a solution, you're applying a decay factor to what remains. You don't subtract 2% of the original concentration each time. You reduce the current concentration by 2%.

These sound similar. They produce radically different outcomes.

The linear illusion

Let's start with the wrong way of thinking about it, because this is how most AI companies actually model their systems.

Linear decay assumes you're always removing 2% of the original amount. Start with 100% accuracy. Step 1: you're at 98%. Step 2: you're at 96%. After 50 steps, you're at exactly 0%.

Simple. Predictable. And completely wrong for AI systems.

This linear model is what leads companies to believe their AI is safe. They test single-step accuracy, find it's 98%, and assume multi-step operations will degrade linearly. They deploy agents, reasoning chains, multi-hop queries. Then they watch their systems fail catastrophically.

The problem is that AI errors don't work like independent coin flips. They compound.

The exponential reality

Here's what actually happens. Each AI operation preserves 98% of whatever truth remained from the previous operation. But crucially, that 98% is of the diminishing remainder, not the original.

The math is simple exponential decay: accuracy after n steps is (0.98)ⁿ.

Let me show you what this actually looks like:

  • Step 0: 100% accuracy (perfect truth)
  • Step 10: 81.7% accuracy
  • Step 25: 60.3% accuracy
  • Step 50: 36.4% accuracy
  • Step 100: 13.3% accuracy
  • Step 228: 1% accuracy
  • Step 342: 0.1% accuracy
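You don't have to take the table on faith. Here is a minimal Python sketch, assuming nothing beyond the article's 98% per-step figure, that reproduces it and shows how far the comfortable linear model drifts from reality:

```python
# Exponential model: each step keeps 98% of whatever truth survived the
# previous step, so accuracy(n) = 0.98 ** n.
# Linear model (the illusion): subtract 2% of the original each step.
PER_STEP_ACCURACY = 0.98

for n in (0, 10, 25, 50, 100, 228, 342):
    exponential = PER_STEP_ACCURACY ** n
    linear = max(0.0, 1.0 - 0.02 * n)
    print(f"Step {n:>3}: exponential {exponential:6.1%} | linear {linear:6.1%}")
```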

Read that again. After just 50 reasoning steps with 98% per-step accuracy, your system is more likely to be wrong than right. After 100 steps, it's wrong 86.7% of the time. After 228 steps, there's barely 1% truth remaining.

This is why your AI agents fail. This is why multi-hop reasoning produces nonsense. This is the mathematical foundation of the hallucination snowball.

Figure: Error Accumulation: Three Models. How accuracy degrades across reasoning steps at 2% error per step, contrasting the linear model (wrong) with exponential decay (reality) and marking the 50% threshold.

The threshold of uselessness

Here's a question nobody in AI wants to answer: at what point does a system become so unreliable that it's effectively useless?

The mathematical answer is 35 steps. At 98% per-step accuracy, the chain crosses below 50% accuracy at step 35 (0.98³⁴ ≈ 50.3%, 0.98³⁵ ≈ 49.3%). From that point on, it's more likely to be wrong than right.

But the practical answer comes much earlier. In production systems, you can't tolerate anything close to 50% error. You need 90% reliability or higher, and that bar is already lost after just 6 steps (0.98⁶ ≈ 88.6%).
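Both of those crossover points fall out of solving 0.98ⁿ < target for n; a minimal sketch of the calculation:

```python
import math

def steps_until_below(target: float, per_step_accuracy: float = 0.98) -> int:
    """Smallest number of steps n at which per_step_accuracy ** n drops below target."""
    return math.ceil(math.log(target) / math.log(per_step_accuracy))

print(steps_until_below(0.90))  # 6  -> reliability falls below 90%
print(steps_until_below(0.50))  # 35 -> more likely wrong than right
```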

Let me be explicit about what this means:

  • 6-step reasoning chain: reliability has already dropped below the 90% bar
  • 35-step reasoning chain: your system is worse than a coin flip
  • 50-step reasoning chain: 63.6% failure rate
  • 100-step reasoning chain: 86.7% failure rate

Now consider what this means for agentic AI. A typical agent workflow might involve: understand the task (1), decompose it into steps (2), search for information (3), evaluate sources (4), synthesize findings (5), generate a response (6), verify quality (7), and so on. That's already 7 steps, which leaves you at roughly 87% reliability (0.98⁷), and we haven't even reached complex tasks.

Multi-hop reasoning chains in legal research, medical diagnosis, or financial analysis routinely exceed 20 steps. At 98% per-step accuracy, 20 steps already means a 33% failure rate (0.98²⁰ ≈ 67%), before you even consider complexity.

This isn't theoretical. This is why AI agents fail in production.

The production disaster nobody discusses

The statistics are devastating, yet almost never acknowledged in AI marketing materials.

Enterprise AI failures: According to 2025 research from MIT and Fortune, 95% of generative AI pilots fail to reach production with measurable business impact. Not "struggle to reach production." Fail completely.

Agent-specific failures: LinkedIn analysis from AI practitioners shows 95% of AI agents fail in production. Not because the models aren't intelligent enough. Because error accumulation makes them unreliable.

Multi-agent systems: Research shows that when multiple agents collaborate, errors compound faster. If one agent passes flawed information to another, the second agent builds on errors, and degradation accelerates.

The economic impact: Companies are spending hundreds of millions on AI systems that fundamentally cannot work for their intended use cases. A single multi-step agent deployment can cost millions to develop, yet fail because of basic mathematics.

This is the 98% problem in practice: great single-step accuracy, catastrophic multi-step failure.

Figure: AI Production Failure Reality (2025). What AI companies promise (98% single-step accuracy, demo-ready multi-agent automation, enterprise readiness, guaranteed ROI) versus production reality: 95% of pilots fail (MIT/Fortune, 2025), 95% of AI agents fail in production (LinkedIn analysis), and 42% of initiatives were abandoned in 2025, up from 17% (S&P Global).

The hallucination snowball effect

Research from Zhang et al. (2023) identified what they call the "hallucination snowball." Here's how it works: LLMs over-commit to early mistakes, then generate additional false claims to justify those mistakes. The error doesn't just propagate. It grows.

Think about what this means in the context of exponential error decay. Your first error at step 5 doesn't just reduce accuracy by 2%. It creates a flawed foundation for step 6, which now has even higher error probability because it's building on wrong assumptions.

The pure exponential decay model is actually optimistic. In practice, errors snowball faster than the math predicts because each error makes subsequent errors more likely.

This is why we see documented cases like:

CNET's AI disaster (2023): 41 out of 77 AI-written articles required corrections. That's a 53% error rate in production journalism, where even single-digit error rates would be unacceptable.

Medical diagnosis failures: JAMA Pediatrics study found ChatGPT made incorrect diagnoses in over 80% of pediatric cases. This isn't "hallucination" in the abstract. These are specific medical errors that could harm patients.

Legal AI hallucinations: Stanford HAI research shows legal AI models hallucinate in 1 out of 6 benchmarking queries. Lawyers have been sanctioned for submitting AI-generated fake cases to courts. Multiple times. In multiple countries.

Google AI Overview failures: The system suggested putting glue on pizza and eating rocks daily. These aren't edge cases. They're what happens when error accumulation meets confidence without verification.

The verification trap

Here's the ironic part. We know LLMs can identify their own mistakes. Research shows ChatGPT identifies 67% of its errors, GPT-4 identifies 87%. The models know when they're wrong.

But they still commit to the hallucinations. They generate false claims to justify initial errors. They over-commit to mistakes despite having the capacity to recognize them.

This is why simple verification doesn't solve the problem. Adding a "check your work" step doesn't help when the system is incentivized to defend its previous outputs rather than correct them.

The verification step itself becomes another step in the reasoning chain. Another 2% error. Another opportunity for the snowball to grow.

Why current approaches can't fix this

The AI industry's response to error accumulation has been to try harder. More training data. Better fine-tuning. Clever prompting. Chain-of-thought reasoning. Verification steps.

None of this addresses the fundamental mathematical problem.

More training doesn't help: Better single-step accuracy doesn't change exponential decay. 99% per-step accuracy just moves the 50% threshold from about 35 steps to 69. 99.5% moves it to about 139. Meanwhile, you're spending dramatically more compute for marginal gains.
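A quick check of those thresholds, using the same relation as in the earlier sketch (the smallest n at which accuracyⁿ drops below 50%):

```python
import math

# 50% threshold as a function of per-step accuracy.
for accuracy in (0.98, 0.99, 0.995):
    threshold = math.ceil(math.log(0.5) / math.log(accuracy))
    print(f"{accuracy:.3f} per step -> below 50% after {threshold} steps")
```

Halving the per-step error rate only roughly doubles the number of steps you can survive. The decay never goes away; it just arrives later.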

Better prompting doesn't help: Prompting strategies are essentially trying to fight mathematics with natural language. You can't prompt your way out of (0.98)ⁿ.

Verification compounds the problem: Each verification step is another operation with its own error probability. You're adding steps to fight the problem caused by having too many steps.

Ensemble methods help but don't solve: Research shows self-consistency methods can improve accuracy by up to 17.9 percentage points on math problems. But this comes at the cost of 40× more computation. And it doesn't eliminate exponential decay. It just shifts the curve slightly.

The fundamental issue isn't training quality or prompting strategy. It's that floating-point neural networks are fundamentally probabilistic. Every operation introduces uncertainty. Uncertainty compounds. There's no way around this mathematics.

The constraint-based solution

Constraint-based AI systems don't follow the exponential decay model. Here's why.

Deterministic operations: Our approach uses discrete operations. XNOR, POPCNT, logical AND, OR. These operations are deterministic. Same input, same output. Every single time.

No rounding errors: Binary values are exact. +1 or -1. No floating-point approximation. No accumulated rounding error.

Constraint satisfaction: Our systems work with constraints, not probabilities. A constraint is either satisfied or not. There's no 98% satisfaction. There's satisfied (100%) or violated (0%).

Crystallized constraints: In Dweve's approach, once a constraint is discovered and crystallized, it applies deterministically. The hundredth application of a constraint is as reliable as the first. No decay. No accumulated error.

This is why constraint-based systems can handle multi-hop reasoning without degradation. Each hop checks against crystallized constraints. Hop 10 is as reliable as hop 1. Hop 100 is as reliable as hop 1.

The error curve doesn't look like exponential decay. It looks like a step function: 100% accuracy until a constraint boundary is hit, then 0% (detectable failure). No gray zones. No gradual decay into nonsense.
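To make "deterministic operations" concrete, here is a toy Python sketch. It is not Dweve's code; the 8-bit width, the helper names, and the agreement threshold are invented for illustration. It only shows the flavor of the primitives described above: an XNOR-plus-POPCNT agreement count over packed ±1 values, and a constraint check that is either satisfied or violated, with nothing in between:

```python
WIDTH = 8  # hypothetical vector width, chosen only for this example

def xnor_popcount(a: int, b: int, width: int = WIDTH) -> int:
    """Count the bit positions where a and b agree: XNOR, then POPCNT."""
    mask = (1 << width) - 1
    return bin((~(a ^ b)) & mask).count("1")

def constraint_satisfied(a: int, b: int, min_agreement: int) -> bool:
    """A constraint is either satisfied or violated; there is no 98% satisfied."""
    return xnor_popcount(a, b) >= min_agreement

a, b = 0b10110101, 0b10100111
print(xnor_popcount(a, b))            # always exactly 6, on every run
print(constraint_satisfied(a, b, 5))  # True
print(constraint_satisfied(a, b, 7))  # False
```

The point of the toy is not the arithmetic. It is that calling these functions a hundred times returns the same exact integers a hundred times, so there is no per-step loss for a reasoning chain to compound.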

Figure: Traditional vs Constraint-Based: Reliability Over Time. Traditional networks decay gradually (roughly 60% at step 25, 22% at step 75), while the constraint-based approach holds 100% reliability for 700+ steps, until a constraint boundary is hit.

The regulatory angle

European regulators understand this problem better than American tech companies want to admit.

The EU AI Act doesn't just mandate accuracy. It mandates explainability and auditability. You need to explain why your AI made a specific decision. You need to prove it works correctly.

How do you prove a system works correctly when its reliability decays exponentially with reasoning depth?

You can't.

This is why GDPR Article 22's right to explanation and the EU AI Act's transparency requirements fundamentally favor constraint-based approaches. When a decision is the result of constraint satisfaction, you can explain it. Here's constraint A, constraint B, constraint C. All satisfied. Output follows logically.

When a decision is the output of 50 probabilistic operations, each compounding the uncertainty of the last? You can't explain that. You can't even reliably reproduce it.

This isn't a compliance burden. This is mathematics catching up with marketing claims.

The business implication

Here's what exponential error decay means for AI in business:

Simple tasks: Single-step operations work fine. Classification, basic question answering, simple retrieval. 98% accuracy is genuinely useful here.

Medium complexity: Multi-step but bounded operations are risky. You might get away with 5-10 steps if you're careful, but by step 10 you're already down to roughly 82% reliability, approaching the point where errors accumulate faster than value is created.

High complexity: Deep reasoning chains, agent workflows, multi-hop queries are mathematically infeasible with floating-point probabilistic approaches. The system will fail. It's not a question of if, but when.

This explains why 95% of enterprise AI pilots fail. Companies are trying to solve problems that require 20, 50, 100 reasoning steps using systems whose reliability drops below 90% within half a dozen steps.

The mathematics don't care about your use case. They don't care about your budget. They don't care about your ambitious roadmap. (0.98)ⁿ goes to zero regardless of intentions.

The path forward

We've identified the problem. Exponential error accumulation makes floating-point neural networks unsuitable for multi-step reasoning. The mathematics are clear. The production failures are documented. The economic costs are measurable.

The solution is equally clear: we need AI systems that don't suffer from exponential decay.

Constraint-based AI provides exactly this. Deterministic operations. Crystallized constraints. No accumulated error. Multi-hop reasoning without degradation.

This isn't speculative. This is what we're building at Dweve. Core provides the binary algorithm framework. Loom implements 456 constraint-based experts. Nexus provides the multi-agent orchestration layer. Each operation is mathematically exact. Each decision is traceable to specific constraints.

The result: AI systems that remain reliable across hundreds of reasoning steps. Not 98% accurate in step 1 and 36% accurate in step 50. 100% accurate in step 1 and step 50 and step 500.

Until the constraint boundary is hit, reliability is absolute. At the boundary, failure is detectable. The system knows when it doesn't know. That's not a bug. That's safety.

What you need to remember

  • Error accumulation is exponential, not linear. Each multi-step AI operation compounds previous errors. 98% per-step accuracy becomes 13% success after 100 steps.
  • The threshold of uselessness arrives quickly. At 98% per-step accuracy, systems drop below 50% reliability after just 35 steps. For practical purposes, the threshold comes even sooner: reliability falls below 90% after about 6 steps.
  • Hallucinations snowball, they don't just propagate. LLMs over-commit to early mistakes and generate additional false claims to justify them. Error accumulation accelerates beyond pure exponential decay.
  • Production failure rates are catastrophic. 95% of generative AI pilots fail to reach production. 95% of AI agents fail in deployment. This isn't bad engineering. It's bad mathematics.
  • Verification doesn't solve the problem. Adding verification steps adds more operations with their own error probabilities. You're fighting exponential decay with more exponential decay.
  • Constraint-based systems don't suffer from exponential decay. Deterministic operations and crystallized constraints mean step 100 is as reliable as step 1. No accumulated error. No gray zones.
  • European regulations favor mathematical certainty. The EU AI Act's explainability and auditability requirements align with constraint-based approaches and conflict with probabilistic black boxes.

The bottom line

The 98% problem is real, measurable, and destroying AI projects across every industry. When each operation loses 2% of truth and errors compound across reasoning steps, systems become mathematically guaranteed to fail.

This isn't about better training data or smarter prompts. This is about the fundamental mathematics of floating-point neural networks versus constraint-based reasoning.

Traditional approaches follow exponential decay: (0.98)ⁿ approaches zero as n increases. There's no way around this. It's baked into the mathematics.

Constraint-based approaches operate differently. Deterministic operations. Crystallized constraints. Step 500 is as reliable as step 1. The error curve is a step function, not exponential decay.

The industry is slowly waking up to this reality. Companies are spending hundreds of millions on systems that are mathematically guaranteed to fail. The 95% production failure rate isn't mysterious. It's predictable.

European AI companies building on constraint-based foundations aren't at a disadvantage. They're solving the actual problem while American companies double down on flawed mathematics.

The future of reliable AI isn't more compute, bigger models, or cleverer prompts. It's constraint-based systems with crystallized constraints. Mathematical certainty instead of statistical confidence. Provable reliability instead of exponential decay.

Want AI that doesn't decay into nonsense? Dweve Core's constraint-based framework provides deterministic multi-step reasoning. No exponential error accumulation. No hallucination snowballs. Just mathematics that works. Join our waitlist.

Tagged with

#Hallucination · #Error Accumulation · #AI Reliability · #Constraint-Based AI · #Production AI

About the Author

Harm Geerlings

CEO & Co-Founder (Product & Innovation)

Building the future of AI with binary neural networks and constraint-based reasoning. Passionate about making AI accessible, efficient, and truly intelligent.
