Three AIs, No Source, the Same Answer
By Lovro Lucic
The Fabrication Problem · 3 of 5
I asked an AI for an analysis and gave it no source to work from. It cited a real study, with a real number. So I tried two more, from two other companies. Same study. Same number. None of them had been handed it.
The topic was remote-work productivity. Grok, Gemini, and GPT-5-mini all reached for Bloom's 2015 Ctrip trial and its 13 percent figure. Five of six runs landed on the number. They reconstructed it from training, not from anything I gave them.
That is the tell. The model is not looking a fact up and getting it right. It is sampling from a distribution, and the distribution has peaks sharp enough that different companies' models hit the same one. This time the peak was real. Bloom's study exists, and 13 percent is roughly its finding.
Sit with how that number read. Specific. Sourced. The kind of thing you would drop into a memo without a second thought. It was solid, and you would have been right to use it.
Now the part that should bother you. The same machinery produces fake numbers that read exactly as solid. Same confidence. Same specificity. Same clean citation. Nothing you can see in the finished text separates the real peak from the invented one, because the model did not do anything different to produce them. It sampled a peak both times. From inside the output, the true number and the fabricated one are the same experience. You have been telling them apart by feel, and the feel is identical.
So how often can you even check the number? Same three models, same prompts, one change: paste a real source into the chat first, then count how many of the numbers each model produces actually appear in it.
| Model | No source | Source pasted |
| --- | --- | --- |
| grok-4-1-fast | 34% | 92% |
| gemini-3-flash | 41% | 86% |
| gpt-5-mini | 12% | 95% |
Programmatic matching, no human judgment. With no source in the window, between 59 and 88 percent of the numbers, depending on the model, are ones you cannot check against anything you gave it. Some are real anyway (Bloom's 13 percent was), but nothing in the text tells you which. There is one apparent tell, and it is weaker than it looks: with no source the models produced far fewer numbers (Grok dropped from 164 to 106, GPT-5-mini from 73 to 8) and leaned qualitative. But the prompt had told them to use qualitative language whenever they could not source a figure, so that is mostly instruction-following, not the model sensing it was empty.
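For a sense of what "programmatic matching" can look like, here is a minimal sketch. It is not the replication kit's code: the function names, the regex, and the sample strings are all assumptions made up for illustration, and a real matcher would also need to handle percentages written as words, rounding, and units.

```python
import re
from typing import List, Optional

# Hypothetical pattern: integers and decimals, with optional thousands separators.
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> List[str]:
    """Return the numeric tokens in text, with thousands separators removed."""
    return [m.group().replace(",", "") for m in NUMBER_RE.finditer(text)]


def match_rate(output: str, source: str) -> Optional[float]:
    """Share of the numbers in the output that also appear in the source.

    Returns None when the output contains no numbers at all, since there
    is nothing to compute a rate over.
    """
    produced = extract_numbers(output)
    if not produced:
        return None
    available = set(extract_numbers(source))
    matched = sum(1 for n in produced if n in available)
    return matched / len(produced)


# Invented sample text, not taken from any real report or model run.
sample_source = "Survey of 120 respondents: 42 reported faster turnaround, a 35 percent share."
sample_output = "Of 120 respondents, 35 percent reported faster turnaround, up 8 points."

# 120 and 35 appear in the source; 8 does not, so 2 of 3 numbers match.
print(match_rate(sample_output, sample_source))
```

Exact string matching is deliberately strict. It will undercount when a model rounds 34.6 to 35, which is one reason to read the rates as a floor on fidelity rather than a precise score.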
Why is there nothing to read? Because a model with no tools and no web access has no fact database to consult during generation. It runs a forward pass that predicts the next token from a distribution shaped by training, conditioned on whatever is in the context window. There is no step where it looks Bloom up, succeeds, then looks the fake number up and fails. Both numbers come out the same way: a sample from that distribution. With no source, the likeliest number for "remote work productivity" is the most-cited figure in the training corpus. Paste the report, and the likeliest number becomes the one in the report, because the report is now the most relevant evidence for what comes next. The distribution updates on the input. The input is the substrate. (This is why retrieval-augmented generation works: it loads the source into context before generating. RAG is an industrial version of a move you can make by hand, and it is also why a chat tool with web search can sidestep this by fetching a source before it answers.)
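One way to write that down, as a sketch with notation introduced here purely for illustration: let $\theta$ be the trained weights, $q$ the prompt, $s$ the pasted source, and $x_{<t}$ the tokens generated so far. The standard autoregressive picture is then:

```latex
\begin{aligned}
\text{no source:} \quad & x_t \sim p_\theta(\,\cdot \mid q,\; x_{<t}) \\
\text{source pasted:} \quad & x_t \sim p_\theta(\,\cdot \mid s,\; q,\; x_{<t})
\end{aligned}
```

Same weights, same sampling step. The only thing that differs is what sits on the right of the conditioning bar, and neither line contains a lookup term that could succeed or fail.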
Which means "the model knows X" is a claim about training data. "The model can analyze X" is a claim about what is in the context window. They feel like the same sentence. The gap between them is where every fabricated number you have ever trusted came from.
The only way to make the peak trustworthy is to make it yours: put the real source in the window so the number the model samples is the one you handed it. The how-to is in Source Conditioning: paste the data, add a line prohibiting unsourced numbers, match the output against the source afterward. This piece is why that beats every prompt trick. A better role, chain-of-thought, "be careful and verify": each one asks the model to sample differently from the same distribution. Better sampling from a fictional distribution is still fiction. Changing the source changes which distribution it draws from. Nothing else does.
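The three steps named above (paste the data, prohibit unsourced numbers, match afterward) can be sketched in a few lines. This is an illustration under assumptions, not the prompt from Source Conditioning: the wording of the prohibition line, the function names, and the sample strings are all hypothetical.

```python
import re
from typing import List

# Hypothetical wording for the prohibition line; adjust to taste.
NO_UNSOURCED_NUMBERS = (
    "Use only numbers that appear in the SOURCE below. "
    "If a figure is not in the SOURCE, describe it qualitatively instead."
)


def build_conditioned_prompt(task: str, source: str) -> str:
    """Steps 1 and 2: put the source and the prohibition line ahead of the task."""
    if not source.strip():
        raise ValueError("source is empty: there is nothing to condition on")
    return (
        f"{NO_UNSOURCED_NUMBERS}\n\n"
        f"SOURCE:\n{source.strip()}\n\n"
        f"TASK:\n{task.strip()}"
    )


def unsourced_numbers(output: str, source: str) -> List[str]:
    """Step 3: list the numbers in the output that do not appear in the source."""
    pattern = re.compile(r"\d+(?:\.\d+)?")
    in_source = set(pattern.findall(source))
    return [n for n in pattern.findall(output) if n not in in_source]


# Invented sample strings, for illustration only.
source_text = "Revenue grew 7 percent in 2024."
prompt = build_conditioned_prompt("Summarize the findings.", source_text)
flagged = unsourced_numbers("Revenue grew 7 percent, margins 12 percent.", source_text)
print(flagged)  # the 12 is flagged because it is not in the source
```

The check in step 3 matters because the prohibition line is still just an instruction. The source changes what the model is likely to sample; the match afterward is what tells you whether it did.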
One honest boundary. This is for work where a source exists: summarize this, analyze these reports, what did the feedback say. For pure ideation or reasoning from scratch there is nothing to paste, and none of this helps. But that is a smaller slice of real work than it feels like. Most AI use is reformulation of something you already have, and most of the time the something never makes it into the window.
So do this once. Take the last AI output you acted on. Pick one number in it. Find the source that number came from. If you can, good. If you cannot, you were not reading a fact. You were trusting a peak, because it was confident and specific, and confident and specific is exactly what a fabricated number looks like too.
What survived testing
- Source presence moved numerical match rates from 12 to 41 percent up to 86 to 95 percent across three model families. Average gap +62pp, under a prompt that already asked for sourcing, so the portable finding is the delta, not the absolute rates. Direction universal; magnitude largest on the cleanest baseline (GPT-5-mini, +82pp).
- Cross-model convergence: with no source, three families reconstruct the same real Bloom 2015 Ctrip study and its 13 percent figure from training alone. The number-convergence is spontaneous; the source-naming was prompted (the task asked them to cite sources). Retrieval from weights, not from anything provided.
- The architectural claim (no factual database in a tool-free forward pass) is consistent with the convergence finding and with why RAG works in production.
What did not survive
- "Three models produced 13 percent without naming the study" cut. The data shows the opposite: all six runs named the study, five stated 13 percent.
- "Fabricated 13 percent anchor" cut. The number is real. The fabrication lives in the aggregate match rate, not in this datapoint.
- "Source grounding solves AI fidelity" cut. It sets the floor at 86 to 95 percent; the residual needs prevention tools, and it reaches the numerical layer only.
- "The models hedged because they sensed they had nothing" softened. The prompt instructed qualitative language when a figure could not be sourced, so the number-density drop is mostly instruction-following, not a spontaneous signal.
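The gap figures quoted under "What survived testing" can be re-derived from the rounded rates in the table. A small sketch follows; the variable names are mine. Working from the rounded table gives 83 points for GPT-5-mini rather than the reported +82, which I assume reflects rounding of the underlying per-cell rates.

```python
# Rounded match rates in percent, copied from the table: (no source, source pasted).
rates = {
    "grok-4-1-fast": (34, 92),
    "gemini-3-flash": (41, 86),
    "gpt-5-mini": (12, 95),
}

# Percentage-point gain from pasting a source, per model.
gaps = {model: with_src - no_src for model, (no_src, with_src) in rates.items()}

# Share of numbers that cannot be checked when no source is in the window.
uncheckable = {model: 100 - no_src for model, (no_src, _) in rates.items()}

average_gap = sum(gaps.values()) / len(gaps)

print(gaps)                    # {'grok-4-1-fast': 58, 'gemini-3-flash': 45, 'gpt-5-mini': 83}
print(average_gap)             # 62.0
print(min(uncheckable.values()), max(uncheckable.values()))  # 59 88
```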
Honest limits
- Reformulation tasks with source material in context. Not validated for novel reasoning, creative generation, or strategy from scratch.
- Numerical fidelity specifically. Entity, claim, and reasoning fidelity need separate measurement.
- Three model families, May 2026 (grok-4-1-fast, gemini-3-flash-preview, gpt-5-mini), N=2 versions per cell. Adequate for direction, not for tight effect-size intervals.
- Tool-free, single-prompt generation. A model with web search or RAG can fetch a source and sidestep the whole effect; this is the bare model with nothing but its weights.
## Audit the data yourself
The replication kit at [/receipts/source-is-the-substrate](/receipts/source-is-the-substrate) has the runnable test, the per-cell results, and the verbatim source-absent generations from all three models. The match rates re-derive from `preflight_v3_results.json` without API access. The Bloom convergence is shown in the models' own words, so you can check the attribution against what they actually wrote.
Next in The Fabrication Problem: Why AI Can't Check Its Own Work