SA-AEO-Bench v1 · research

All documents from SA-AEO-Bench v1 — pre-registered, reproducible, audit-ready.

188,877 citations across 100 brands in 10 industries and 3 frontier LLMs. Snapshot at 99% data completion, 2026-05-19.

The five documents below were published in sequence over the course of the study. The pre-registration (osf.io/w4az2) and the protocol PDF were submitted before any LLM was queried. The interim brief, the pre-completion report, and the insights deliverable followed in that order as the run progressed.

Documents

Download. Read. Reproduce.

  1. 01
    OSF pre-registration

    SA-AEO-Bench v1 · Open Science Framework form fields

    Hypotheses H1–H7, prompt set, scoring rubric, analysis plan, budget ceiling. Submitted before any LLM was queried.

    docs/research/sa-aeo-bench-v1-osf-formfields.md
  2. 02
    Protocol

    SA-AEO-Bench v1 · Pre-registration protocol

    Full methodology: brand sample, query construction, Latin Square debiasing, Bradley-Terry strength estimation, sycophancy correction.

    docs/research/sa-aeo-bench-v1-osf-protocol.pdf
  3. 03
    Interim brief

    SA-AEO-Bench v1 · Interim status brief

    Mid-run progress and early signal, for stakeholders following the run live.

    docs/research/sa-aeo-bench-v1-interim-brief.pdf
  4. 04
    Pre-completion · methodology

    SA-AEO-Bench v1 · Pre-completion report (formal)

    Run status, cost, per-model summary, and verdicts on all seven hypotheses (H1–H7). Audit-grade.

    docs/research/sa-aeo-bench-v1-precompletion-report.pdf
  5. 05
    Pre-completion · insights

    SA-AEO-Bench v1 · The Actual Insights

    Per-brand findings, industry deep-dives, leaderboards. Written for stakeholders. The headline document.

    docs/research/sa-aeo-bench-v1-insights.pdf
Replicate

The protocol is public. The data is reproducible.

Three things make any AI-search-citation benchmark trustworthy: pre-registration of hypotheses, public methodology, and reproducible raw data. SA-AEO-Bench v1 ships all three. To replicate:

  1. Read the protocol PDF (document 02 above), then reproduce the prompt set and the scoring rubric.
  2. Run the prompt set with your own API keys for GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Pro. Budget roughly R25,000 for full coverage.
  3. Apply Latin Square debiasing to the comparison queries and Bradley-Terry strength estimation to the per-brand wins. Both are specified in the protocol.
  4. Compare your results to the pre-completion report. Diverging numbers are themselves a finding: they help separate methodology variance from the underlying signal.
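
Step 3 can be sketched in Python as follows. This is a minimal illustration, not the protocol's implementation: it assumes a cyclic Latin square for rotating brand order in comparison prompts and the standard minorisation-maximisation (MM) update for Bradley-Terry strengths. The function names, the layout of the `wins` dictionary, and any brand labels you pass in are hypothetical. The protocol PDF remains the authority on the exact procedure.

```python
from collections import defaultdict


def latin_square_orders(brands):
    """Return a cyclic Latin square of presentation orders.

    Row i is the brand list rotated left by i positions, so every brand
    appears exactly once in every position across the rows.
    """
    n = len(brands)
    return [[brands[(i + j) % n] for j in range(n)] for i in range(n)]


def bradley_terry(wins, iterations=200, tol=1e-9):
    """Estimate Bradley-Terry strengths with the classic MM update.

    `wins[(a, b)]` is the number of times brand a beat brand b
    (an assumed input layout). Returns strengths normalised to sum to 1.
    """
    brands = sorted({b for pair in wins for b in pair})
    strength = {b: 1.0 / len(brands) for b in brands}

    total_wins = defaultdict(float)
    games = defaultdict(float)  # games[(a, b)]: comparisons between a and b
    for (a, b), w in wins.items():
        total_wins[a] += w
        games[(a, b)] += w
        games[(b, a)] += w

    for _ in range(iterations):
        updated = {}
        for a in brands:
            denom = sum(
                games[(a, b)] / (strength[a] + strength[b])
                for b in brands
                if b != a and games[(a, b)] > 0
            )
            # A brand with no recorded comparisons keeps its current value.
            updated[a] = total_wins[a] / denom if denom > 0 else strength[a]
        norm = sum(updated.values())
        updated = {b: s / norm for b, s in updated.items()}
        delta = max(abs(updated[b] - strength[b]) for b in brands)
        strength = updated
        if delta < tol:
            break
    return strength
```

Issuing each comparison query once per row of the square, then pooling the outcomes before counting wins, means no brand benefits from always being listed first. The pooled win counts are what `bradley_terry` expects as input.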

Email research@citedbrands.co.za if you’d like the raw JSONL records (69 MB, ~190k citation rows) for academic replication. Free to use with attribution.
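
Once you have the records, a quick integrity check is to count the rows and compare the total with the roughly 190k figure quoted above. The sketch below makes no assumption about field names. It only assumes one JSON object per line, which is the usual JSONL convention. The function name is hypothetical.

```python
import json


def count_records(lines):
    """Count JSONL rows: returns (valid_objects, malformed_lines).

    Blank lines are skipped. A line counts as valid only if it parses
    as a JSON object; anything else is counted as malformed.
    """
    ok = bad = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad += 1
            continue
        if isinstance(record, dict):
            ok += 1
        else:
            bad += 1
    return ok, bad
```

The function accepts any iterable of strings, so an open file handle can be passed directly and the 69 MB file is streamed line by line rather than loaded into memory at once.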

Run the next one with us

Subscribe and get every quarterly bench drop the day it ships.
