The Strongest Finding Was Wrong
The "It Depends" Problem · 1 of 1
The strongest effect in 87 experiments was wrong.
Not the direction. The direction was right. Specific instructions produce more specific output than vague instructions. That replicated across evaluators, across models, across tasks. The direction survived everything.
The magnitude was wrong. d=2.34. Cited in seven pieces of writing. Referenced in thirty framework documents. Built into the theory of how specificity constrains AI output. The foundation number for the most-cited claim across all the work.
It was actually three effects stacked on top of each other, reported as one.
The original experiment compared a 22-word specific instruction against a 2-word vague instruction. "Do not produce a recommendation that could apply to any B2B SaaS company. Every point must be grounded in Northvane's specific situation" versus "Be specific." Twenty runs. Large effect. Replicated.
The obvious interpretation: specificity is the mechanism. But the comparison confounded specificity content with instruction length. 22 words versus 2. Any effect could be the extra instruction text, not the specificity within it.
A length-controlled replication added two conditions: a short specific instruction (8 words) and a long vague instruction (22 words with quality demands like "detailed, thorough, comprehensive"). Raw scores showed the long vague instruction beating the short specific one. First interpretation: length is the mechanism. The specificity claim collapses.
But that interpretation was also wrong. The long vague instruction produced 48 percent more words. More words produce more specificity markers mechanically. At density (markers per thousand words) the specific instruction was MORE specific per word. Length inflated the raw score. Specificity was real underneath. And the "long vague" instruction contained quality demands that could independently drive the effect.
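To make the normalization concrete, here is a minimal sketch of the density measure in Python. The marker list and scoring function are illustrative stand-ins, not the experiment's actual marker taxonomy:

```python
# Hypothetical specificity markers; the experiment's real taxonomy
# (company mentions, scenario numbers, market terms) is richer than this.
MARKERS = ["northvane", "q3", "churn", "arr"]

def raw_score(text: str) -> int:
    """Raw score: total marker occurrences, regardless of output length."""
    t = text.lower()
    return sum(t.count(m) for m in MARKERS)

def density(text: str) -> float:
    """Markers per 1000 words: the same count, normalized for length."""
    n_words = len(text.split())
    return 1000 * raw_score(text) / n_words if n_words else 0.0

short_specific = "Northvane's Q3 churn spiked; ARR held flat."
long_padded = "Be thorough and comprehensive. " * 30 + short_specific

print(raw_score(short_specific), raw_score(long_padded))  # 4 4
print(density(short_specific), density(long_padded))      # ~571 vs ~31
```

Same raw score, an order of magnitude apart in density: that gap is the shape of the confound the length-controlled replication surfaced.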
Three confounded variables across the two replications. A clean decomposition required separating all of them.
A 2×2 factorial. Specificity: present or absent. Quality demands: present or absent. All instructions matched at 19 words. Output length constrained. Forty outputs, ten per cell. One generator model.
Specificity: d=1.37 raw score (shorter output, more markers). Quality demands: d=0.11 at raw score, near zero. Both together: additive. The mechanism is specificity, not quality demands, not length. But the magnitude is d=1.37, not d=2.34.
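As a sketch of how those main effects fall out of a 2×2, here is the pooled-SD form of Cohen's d applied to four cells of synthetic scores. The cell means are invented to mimic the reported pattern (large specificity shift, near-null quality demands, roughly additive combination); nothing below is the experiment's data:

```python
import random
import statistics

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d with pooled standard deviation (one common estimator)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

random.seed(0)
# Synthetic stand-ins: ten outputs per cell, means chosen to mimic the
# reported pattern, not drawn from the actual experiment.
neither   = [random.gauss(10.0, 2) for _ in range(10)]
qual_only = [random.gauss(10.2, 2) for _ in range(10)]  # quality demands: ~null
spec_only = [random.gauss(13.0, 2) for _ in range(10)]  # specificity: large shift
spec_qual = [random.gauss(13.2, 2) for _ in range(10)]  # roughly additive

# Main effects: pool the two cells where the factor is present against
# the two cells where it is absent.
d_spec = cohens_d(spec_only + spec_qual, neither + qual_only)
d_qual = cohens_d(qual_only + spec_qual, neither + spec_only)
print(f"specificity d={d_spec:.2f}  quality demands d={d_qual:.2f}")
```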
The tool that caught the confound came from the measurement infrastructure I'd already built. Density analysis (dividing total markers by word count) was developed for fabrication measurement. Applied to the specificity experiment, it separated what raw scores had conflated. The system caught its own error using its own tools.
The measurement tool built for one problem (fabrication detection) revealed a confound in a different problem (specificity measurement). Density analysis isn't novel methodology. It's basic normalization that any researcher would apply. What made it possible here was having the measurement infrastructure already built and the habit of applying it reflexively. The confound led to a cleaner experiment. The cleaner experiment confirmed the mechanism at a smaller, more honest magnitude. The overclaim was replaced by a better-supported claim.
Then a domain expert evaluated the outputs blind. Couldn't tell which were produced with specific instructions and which with generic ones. Picked the specific output 3 out of 5 times: chance level. Both conditions produce the same analytical conclusions. The specificity instruction changes what the output LOOKS LIKE (more data references, more grounded language), not what it SAYS.
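For scale: 3 correct picks out of 5 is exactly what guessing predicts. A one-line exact binomial test (assuming scipy is available) makes the point:

```python
from scipy.stats import binomtest

# P(3 or more correct out of 5 coin flips) = 0.5: the expert's blind
# picks carry no evidence of a perceivable quality difference.
print(binomtest(k=3, n=5, p=0.5, alternative="greater").pvalue)  # 0.5
```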
The direction held. The magnitude shrank. And the mechanism turned out to be about verifiability, not quality. Specific outputs can be checked, generic ones can't, but the analysis is the same either way. Three corrections deep. Each one more interesting than the finding it replaced.
Test this yourself
Take your strongest measured finding. Control for one variable you haven't isolated. See if the magnitude holds.
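One way to operationalize "see if the magnitude holds": bootstrap a confidence interval for the effect size under the controlled comparison and check whether your original point estimate even sits inside it. A percentile-bootstrap sketch, reusing the cohens_d estimator from the earlier snippet; your own score lists go in a and b:

```python
import random

def bootstrap_d_ci(a, b, reps=10_000, alpha=0.05):
    """Percentile bootstrap CI for Cohen's d between two score lists."""
    ds = []
    for _ in range(reps):
        resample_a = random.choices(a, k=len(a))  # resample with replacement
        resample_b = random.choices(b, k=len(b))
        ds.append(cohens_d(resample_a, resample_b))  # pooled-SD d, defined above
    ds.sort()
    return ds[int(alpha / 2 * reps)], ds[int((1 - alpha / 2) * reps)]

# If the CI from the controlled design excludes your original d, the
# original magnitude was carrying something besides the mechanism.
```

With N=10 per condition these intervals come out wide, which is consistent with the [0.75, 2.38] band reported below.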
Evidence: HIGH. Clean 2×2 with confounds controlled: specificity d=1.37 (raw, CI [0.75, 2.38]), d=1.65 (density, partially inflated by shorter output). Quality demands d=0.11 raw (null), d=1.19 density. The honest range for specificity: d=1.37-1.65. Original d=2.34 was three effects stacked.
What survived testing
- Specificity as mechanism (d=1.37-1.65, confounds controlled, CI excludes zero)
- The decomposition methodology (density analysis catches confounds raw scores miss)
- The self-correction trajectory (own tools caught own overclaim)
What didn't survive
- d=2.34 as the effect size (three effects stacked, not one; honest range d=1.37-1.65)
- "Strongest effect in 87 experiments" (large, but not as large as claimed)
- Clean separation at the density level: quality demands show d=1.19 at density vs d=0.11 raw (density partially conflates specificity with shorter output length)
Honest limits
- The clean 2×2 originally ran on xAI only. A Gemini Pro run truncated outputs (55 words). Cross-generator confirmation came from Gemini Flash (no length constraint, density normalization): specificity d=1.669 vs xAI's d=1.651 at density.
- The specificity heuristic (marker counts) was validated against a domain expert: 3/5 vs 2/5 (chance level). The expert couldn't distinguish SPECIFIC from GENERIC on quality, only on style. Specificity changes form (verifiable references), not substance (same conclusions). The d=1.37 measures a real output property that domain experts don't perceive as quality.
- N=10 per condition. Effect size estimates are directional; the CIs are wide but exclude zero.
- "Write exactly 500 words" did not fully control output length (368-443 words).
Full record
- Source: EXP-025 (original N=20 + length-controlled 80 outputs + clean 2×2 40 outputs). Density analysis methodology from clarethium_measure. Three-stage decomposition: raw score → density → 2×2 factorial.
- Context: Most-cited effect size across this research turned out to be inflated. Caught by own tools. The correction was more interesting than the original finding.
- Measurement Basis: PROGRAMMATIC. All measurements are objective marker counts (company mentions, scenario numbers, market terms). No LLM quality judgment at any stage. The decomposition itself uses density analysis (markers per 1000 words), a ratio of two programmatic measures.
- AI Dimension: STRUCTURAL. The piece is about how a measurement system catches its own confounds in AI prompt research. Remove AI and the piece doesn't exist.
- Status: ARC TRANSMISSION