# Analysis: what the numbers show

This maps the Gemini Flash replication in [`data.json`](./data.json) to the
claim that content specificity is cross-generator at density.

## The cross-generator comparison

| Effect (at density) | xAI grok | Gemini Flash |
|:--------------------|:--------:|:------------:|
| Specificity | Hedges g 1.651 | Cohen d 1.669 / Hedges g 1.636 |
| Quality demands | Hedges g 1.189 | Cohen d 0.861 / Hedges g 0.843 |

Specificity is virtually identical across the two generators. Quality demands
are weaker on Gemini Flash but point the same way. The xAI figures are from the
companion receipt
([catching-your-own-overclaim](/receipts/catching-your-own-overclaim)); the
Gemini Flash figures recompute from this kit's `raw_outputs` via `script.py`.

## Raw versus density (read this before quoting a number)

From `data.json -> analysis_from_source`:

| Effect | raw | density |
|:-------|:---:|:-------:|
| Specificity | d 0.666 | d 1.669 |
| Quality demands | d 2.254 | d 0.861 |

The raw and density columns invert. Raw, quality demands look like the big
effect and specificity looks small. That is the length confound: on Gemini
Flash, quality demands produce longer outputs (about 800 vs 718 words), which
inflates raw marker counts. Normalizing to markers per 1,000 words removes the
confound, and specificity becomes the dominant effect, matching xAI. This is the
same confound the original xAI experiment carried, resolved the same way.

So the cross-generator claim is precise: **specificity at density** replicates.
A raw-score comparison would tell the wrong story.

## A note on the metric

Effect sizes appear in both Cohen d and Hedges g. The vault and the model-by-
model summaries quote Cohen d (1.669); the small-sample corrected Hedges g is
1.636. At 10 runs per cell the correction is small. Use the same column when
comparing receipts.

## What the published posts claim, and where it comes from

- "Direction replicates across generators; clean magnitude nearly identical on
  xAI and Gemini Flash, not Claude-dominant" (T-301, T-388, T-391): the
  comparison table above.
- "The direction replicates on xAI and Gemini Flash" (T-353): same, this is the
  specificity result underlying the constraint-paradox convergence claim.
- "Not Claude-dominant": the earlier Claude-leading numbers (Claude d 0.93, GPT
  d -0.01) were cross-evaluator magnitudes from a different, length-confounded
  experiment, not this matched-length 2x2. They are not comparable and are not
  used here.

## Honest limits

- 10 runs per cell, 40 outputs, single generator per kit. CIs exclude zero but
  are wide (Gemini Flash specificity density CI [1.12, 2.43]).
- Cross-generator means xAI and Gemini Flash only. Gemini Pro was inconclusive
  (truncation), not a confirming result.
- Programmatic marker scoring, not domain-expert quality. The specificity effect
  is on verifiable form, not on substance a domain expert would rank higher.
- One task, one domain (Northvane). March 2026 models.
