All documents from SA-AEO-Bench v1 — pre-registered, reproducible, audit-ready.
188,877 citations across 100 brands in 10 industries and 3 frontier LLMs. Snapshot at 99% data completion, 2026-05-19.
All five documents below were published before or during the run. The pre-registration (osf.io/w4az2) and the protocol PDF were submitted before any LLM was queried. The interim brief, the pre-completion report, and the insights deliverable followed in that order as the run progressed.
Download. Read. Reproduce.
- 01 OSF pre-registration
SA-AEO-Bench v1 · Open Science Framework form fields
Hypotheses H1–H7, prompt set, scoring rubric, analysis plan, budget ceiling. Submitted before any LLM was queried.
docs/research/sa-aeo-bench-v1-osf-formfields.md
- 02 Protocol
SA-AEO-Bench v1 · Pre-registration protocol
Full methodology: brand sample, query construction, Latin Square debiasing, Bradley-Terry strength estimation, sycophancy correction.
docs/research/sa-aeo-bench-v1-osf-protocol.pdf
- 03 Interim brief
SA-AEO-Bench v1 · Interim status brief
Mid-run progress + early signal. For stakeholders following the run live.
docs/research/sa-aeo-bench-v1-interim-brief.pdf
- 04 Pre-completion · methodology
SA-AEO-Bench v1 · Pre-completion report (formal)
Run status, cost, per-model summary, all seven H1–H7 hypothesis verdicts. Audit-grade.
docs/research/sa-aeo-bench-v1-precompletion-report.pdf
- 05 Pre-completion · insights
SA-AEO-Bench v1 · The Actual Insights
Per-brand findings, industry deep-dives, leaderboards. Stakeholder-compelling. The headline document.
docs/research/sa-aeo-bench-v1-insights.pdf
The protocol is public. The data is reproducible.
Three things make any AI-search-citation benchmark trustworthy: pre-registration of hypotheses, public methodology, and reproducible raw data. SA-AEO-Bench v1 ships all three. To replicate:
- Read the protocol PDF (link above). Reproduce the prompt set and the scoring rubric.
- Run against your own API keys for GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Pro. Budget ≈ R25,000 for full coverage.
- Apply Latin Square debiasing on comparison queries and Bradley-Terry strength estimation on the per-brand wins. Both are in the protocol.
- Compare your results to the pre-completion report. Diverging numbers are themselves a finding — they isolate methodology variance from underlying signal.
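For orientation, the two statistical steps named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the protocol's implementation: the cyclic Latin square construction, the function names, and the `wins` input shape are all hypothetical, and the protocol PDF remains the authority on how orderings and win counts are actually built. Sycophancy correction is not covered here.

```python
from collections import defaultdict


def latin_square_orders(items):
    """Cyclic Latin square (an assumed construction): row r is the list
    rotated by r, so each item appears exactly once in each position."""
    n = len(items)
    return [[items[(r + c) % n] for c in range(n)] for r in range(n)]


def bradley_terry(wins, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[(a, b)] is the number of comparisons in which a beat b
    (hypothetical input shape). Returns strengths that sum to 1.
    """
    # Ignore empty cells so they cannot produce 0/0 terms below.
    wins = {pair: w for pair, w in wins.items() if w > 0}
    if not wins:
        return {}
    brands = sorted({b for pair in wins for b in pair})
    total_wins = defaultdict(float)
    games = defaultdict(float)  # comparisons per unordered pair
    for (a, b), w in wins.items():
        total_wins[a] += w
        games[frozenset((a, b))] += w

    p = {b: 1.0 / len(brands) for b in brands}
    for _ in range(n_iter):
        new_p = {}
        for i in brands:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in brands
                if j != i and frozenset((i, j)) in games
            )
            new_p[i] = total_wins[i] / denom if denom > 0 else p[i]
        norm = sum(new_p.values())
        new_p = {b: v / norm for b, v in new_p.items()}
        delta = max(abs(new_p[b] - p[b]) for b in brands)
        p = new_p
        if delta < tol:
            break
    return p
```

With `latin_square_orders(["A", "B", "C"])`, each brand is listed first, second, and third exactly once across the three prompt orderings, which is what lets position effects average out. With `bradley_terry({("A", "B"): 3, ("B", "A"): 1})`, brand A gets strength 0.75 and brand B gets 0.25, matching the 3:1 win ratio.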
Email research@citedbrands.co.za if you’d like the raw JSONL records (69 MB, ~190k citation rows) for academic replication. They are free to use with attribution.
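Once you have the records, a first sanity check is to tally rows per model and compare the totals against the pre-completion report. The sketch below streams a JSONL file line by line, so the full file never has to sit in memory. The field name `model` and the file name in the comment are assumptions for illustration; the real schema may differ.

```python
import json
from collections import Counter


def count_by_field(lines, field="model"):
    """Tally JSONL rows by one field, skipping blank lines.

    `field` defaults to "model", a hypothetical key; rows without
    it are counted under "<missing>".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        counts[row.get(field, "<missing>")] += 1
    return counts


# Usage (file name is hypothetical):
# with open("citations.jsonl", encoding="utf-8") as fh:
#     print(count_by_field(fh).most_common())
```

A large `"<missing>"` bucket is a quick signal that the assumed field name does not match the delivered schema.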