Receipts: Same Technique, Opposite Results
Raw artifacts behind the published finding. The prompts, the outputs, the scoring, and the analysis.
The content-specificity effect is not a quirk of one model. The clean 2x2 specificity experiment, rerun on a second generator family, produces a nearly identical effect size at density: Hedges g 1.636 on Gemini Flash against 1.651 on xAI.
| Generator | specificity effect at density | source |
|---|---|---|
| xAI (grok-4-1-fast) | Hedges g 1.651 | catching-your-own-overclaim |
| Gemini Flash (gemini-3-flash-preview) | Cohen d 1.669 / Hedges g 1.636 | this kit |
Same design both times: a 2x2 (specificity present/absent, quality demands present/absent), 10 runs per cell, the same Northvane strategic-analysis task, density normalization instead of a length cap.
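As a rough illustration of the design described above, the run plan can be enumerated as a 2x2 grid with 10 runs per cell. The factor names and the `build_run_plan` helper are hypothetical labels chosen for this sketch, not identifiers taken from the kit.

```python
from itertools import product

# Hypothetical factor labels; the kit's own cell names may differ.
FACTORS = {
    "specificity": (False, True),
    "quality_demands": (False, True),
}
RUNS_PER_CELL = 10  # from the design description


def build_run_plan(runs_per_cell=RUNS_PER_CELL):
    """Return one dict per planned generation: a 2x2 cell plus a run index."""
    plan = []
    for specificity, quality in product(*FACTORS.values()):
        for run in range(runs_per_cell):
            plan.append(
                {"specificity": specificity, "quality_demands": quality, "run": run}
            )
    return plan


plan = build_run_plan()
print(len(plan))  # 4 cells x 10 runs
```

Under this layout the specificity main effect compares 20 outputs against 20, pooling across the quality-demands factor.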
The one thing that matters here: "at density"
Raw scores do not show this cleanly. On Gemini Flash the raw specificity effect is only d=0.67, because quality demands produce longer outputs and longer outputs inflate the raw marker count (the same length confound the xAI experiment hit). Normalizing to markers per 1k words removes the confound, and the specificity effect lands at d=1.67. So the cross-generator claim is specifically about density: the specificity effect replicates across generators once marker counts are normalized per 1k words. It is not a raw-score claim.
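A minimal sketch of the normalization and the effect-size calculation, under stated assumptions: words are counted by whitespace splitting, and Cohen d uses the pooled sample standard deviation. The function names are hypothetical, and the kit's `script.py` may tokenize or pool differently.

```python
from statistics import mean, variance


def markers_per_1k_words(marker_count, text):
    """Density score: markers per 1,000 words (whitespace word count is an assumption)."""
    n_words = len(text.split())
    if n_words == 0:
        raise ValueError("cannot compute density for an empty output")
    return 1000.0 * marker_count / n_words


def cohen_d(group_a, group_b):
    """Cohen d for two independent groups, using the pooled sample SD (a minus b)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Toy illustration of the length confound (made-up numbers, not kit data):
short_output = "word " * 300    # 6 markers in 300 words
long_output = "word " * 1000    # 10 markers in 1,000 words
print(markers_per_1k_words(6, short_output))   # 20.0
print(markers_per_1k_words(10, long_output))   # 10.0
```

In the toy case the longer output wins on raw count (10 against 6) but loses on density (10 against 20 per 1k words), which is the direction of distortion that the normalization is meant to remove.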
Scope
- Cross-generator means xAI and Gemini Flash. Gemini Pro was inconclusive (outputs truncated to about 60 words), not a confirming null.
- This kit holds the Gemini Flash side (40 raw outputs, the computed analysis). The xAI side is its own published receipt, linked above.
Recompute
python3 script.py
Expect Gemini Flash specificity at density: Cohen d 1.67, Hedges g 1.64.
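The two expected figures can be cross-checked against each other. The sketch below applies the common small-sample approximation for Hedges g, J = 1 - 3 / (4*df - 1), and assumes the specificity comparison is 20 outputs against 20 (10 runs in each of two cells per side). That grouping is inferred from the design, not stated by the kit, and `hedges_g` is a hypothetical helper name.

```python
def hedges_g(d, n_a, n_b):
    """Scale Cohen d by the approximate correction J = 1 - 3 / (4*df - 1)."""
    df = n_a + n_b - 2
    return d * (1.0 - 3.0 / (4.0 * df - 1.0))


# Assumed group sizes: 20 specificity-present vs 20 specificity-absent outputs.
g = hedges_g(1.669, 20, 20)
print(round(g, 3))  # 1.636
```

With df = 38 the correction factor is about 0.980, so d 1.669 maps to g 1.636, matching the table row for Gemini Flash.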
Limits
- 10 runs per cell (40 outputs), one generator per kit. The result is directional: the CIs exclude zero but are wide.
- Programmatic marker scoring, not a domain-expert quality judgment. The companion xAI receipt records that a blind domain expert could not distinguish specific from generic outputs on quality, only on verifiable form.
- One task, one domain (the fictional Northvane scenario). March 2026 models.