Benchmarks

A weekly scorecard that answers whether your team gets better answers with Amdahl than without it - real production prompts run through both paths, then scored by an LLM judge into a single weighted quality score and delta.

What it measures

Every week Amdahl scores the questions your team actually asked - the prompts that ran through your agents and MCP tools - not a synthetic suite. The methodology is the same each run:

Cluster. Group the week's analytical prompts into themes, so one verdict can stand in for a whole family of similar questions.
Run both paths. Send each cluster's exemplar prompt through Amdahl's normalized intelligence layer and through raw access to the same model with no Amdahl context.
Judge. An LLM-as-judge picks a winner and scores each answer out of 5.
Extrapolate. Weight each verdict by its cluster size and roll it up into one weighted Amdahl score, one weighted Raw score, and the delta between them - your quality lift.
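
The Extrapolate step can be read as a cluster-size-weighted mean. The following is a minimal sketch under that reading, not Amdahl's actual aggregation code: the `ClusterVerdict` type and its field names are hypothetical, and details such as rounding or the handling of failed clusters are not covered by this page.

```python
from dataclasses import dataclass


@dataclass
class ClusterVerdict:
    """One judged cluster (hypothetical shape, for illustration only)."""
    size: int            # number of prompts the exemplar stands in for
    amdahl_score: float  # judge score out of 5 for the Amdahl answer
    raw_score: float     # judge score out of 5 for the raw-model answer


def roll_up(verdicts: list[ClusterVerdict]) -> tuple[float, float, float]:
    """Return (weighted Amdahl, weighted Raw, delta), weighting by cluster size."""
    total = sum(v.size for v in verdicts)
    if total == 0:
        raise ValueError("no prompts to weight by")
    amdahl = sum(v.size * v.amdahl_score for v in verdicts) / total
    raw = sum(v.size * v.raw_score for v in verdicts) / total
    return amdahl, raw, amdahl - raw


verdicts = [
    ClusterVerdict(size=30, amdahl_score=5.0, raw_score=3.0),
    ClusterVerdict(size=10, amdahl_score=3.0, raw_score=4.0),
]
weighted_amdahl, weighted_raw, delta = roll_up(verdicts)
```

In this made-up example the 30-prompt cluster dominates: the weighted scores are 4.5 and 3.25, a delta of 1.25, even though raw access won the smaller cluster.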

The result is a number you can track week over week: how much better your customer-context engine makes the answers, expressed as a score out of 5 and a percentage lift, plus the per-cluster win / loss / tie split.
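
This page does not spell out how the percentage lift is derived from the two weighted scores. One natural reading, and it is only an assumption here, is the delta taken relative to the Raw score. The function name below is hypothetical.

```python
def quality_lift_pct(weighted_amdahl: float, weighted_raw: float) -> float:
    """Percentage lift of Amdahl over Raw (assumed definition: delta / raw)."""
    if weighted_raw <= 0:
        raise ValueError("raw score must be positive to compute a relative lift")
    return (weighted_amdahl - weighted_raw) / weighted_raw * 100.0


# Illustrative inputs, not real benchmark results.
lift = quality_lift_pct(4.5, 3.0)
```

Under this definition a move from 3.0 to 4.5 is a 50% lift. If the app reports a different percentage for the same scores, it is using another baseline.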

In the app

Open Benchmarks from the sidebar. The top of the page is the headline for the latest completed week - Amdahl vs Raw, the delta, the quality lift, and how many clusters favored each side. Below it, a trend strip shows the last few weeks at a glance, and a runs table lists every recent run. Click a run to drill into its per-cluster verdicts, down to the exemplar prompt and both answers side by side.

Benchmarks run automatically every Sunday, but you don't have to wait. Hit Run benchmark and pick a week (this week back to about a quarter ago). It defaults to the most recent completed week - an in-progress week has nothing to score yet - and re-runs are safe, so you can always refresh a number.
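
The "most recent completed week" default can be sketched as date arithmetic. This assumes weeks start on Monday (inferred only from the sample `weekStart` of 2026-06-01, which is a Monday) and that a week counts as completed once the next one has begun. The function names are hypothetical.

```python
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Monday of the week containing `day` (assumes Monday-start weeks)."""
    return day - timedelta(days=day.weekday())


def most_recent_completed_week(today: date) -> date:
    """Start of the last week that ended before the week containing `today`."""
    return week_start(today) - timedelta(weeks=1)


def is_completed_week(candidate: date, today: date) -> bool:
    """True if `candidate` is a week start no later than the latest completed week."""
    return (candidate == week_start(candidate)
            and candidate <= most_recent_completed_week(today))


today = date(2026, 6, 10)  # a Wednesday
default_week = most_recent_completed_week(today)
```

With `today` set to Wednesday 2026-06-10, the default is the week starting 2026-06-01, and the week starting 2026-06-08 is still in progress.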

The Benchmarks screen shows the headline for the latest week, a trend strip of recent weeks, and a runs table that drills into per-cluster verdicts.

Apps & Access gating

Benchmarks is a console segment, so it's gated the same two ways every segment is:

Entitlement - a workspace turns the whole segment on or off in Settings -> Apps. If you don't see Benchmarks in the sidebar, it's switched off here.
Surface access - in Settings -> Access a role or member gets Hidden / Read / Write on the Benchmarks surface. Read = view runs and drill into them; Write = also trigger a run.

As everywhere in Amdahl, the grandfathered default is enabled with full access until a tenant sets a policy.
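
The two gates and the default can be modeled as a small decision function. This is a sketch of the rules as described here, not Amdahl's implementation: the type and function names are hypothetical, `None` stands for "no policy set", and letting a switched-off entitlement override surface access is an assumption.

```python
from enum import Enum
from typing import Optional


class SurfaceAccess(Enum):
    HIDDEN = "hidden"
    READ = "read"
    WRITE = "write"


def effective_access(entitled: Optional[bool],
                     surface: Optional[SurfaceAccess]) -> SurfaceAccess:
    """Combine both gates; None means no policy is set, so the default applies."""
    if entitled is None:
        entitled = True                # default: segment enabled
    if surface is None:
        surface = SurfaceAccess.WRITE  # default: full access
    # Assumption: a disabled segment hides the surface regardless of role access.
    return surface if entitled else SurfaceAccess.HIDDEN


def can_view(access: SurfaceAccess) -> bool:
    """Read and Write can both view runs and drill into them."""
    return access in (SurfaceAccess.READ, SurfaceAccess.WRITE)


def can_trigger(access: SurfaceAccess) -> bool:
    """Only Write can trigger a run."""
    return access is SurfaceAccess.WRITE
```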

Over the API

Benchmarks is a customer-facing read surface - the same one the app uses - mounted at /api/businesses/:businessId/benchmarks. You reach it with your console session (signed-in user + workspace membership), not the X-API-Key platform API. It is not on MCP: a weekly quality scorecard is a console view a human reads, not an agent action.

GET /api/businesses/:businessId/benchmarks?limit=12 - recent weekly runs, newest first.
GET /api/businesses/:businessId/benchmarks/:runId - one run with its per-cluster summary.
GET /api/businesses/:businessId/benchmarks/:runId/clusters/:clusterId - drill one cluster: full verdict, the exemplar prompt, and both answers.
POST /api/businesses/:businessId/benchmarks/trigger - kick off a run. Optional weekStart (ISO date); it defaults to the most recent completed week, and naming a week that hasn't finished is rejected.
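
For a hand-rolled client, the four routes reduce to a few path builders. The sketch below only constructs paths and the trigger payload. It deliberately leaves out authentication, since the session mechanism is not described here, and it assumes `weekStart` travels in a JSON request body, which this page does not confirm. All function names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode

BASE = "/api/businesses"


def _root(business_id: str) -> str:
    """Benchmarks root for one business, with the id percent-encoded."""
    return f"{BASE}/{quote(business_id, safe='')}/benchmarks"


def list_runs_path(business_id: str, limit: int = 12) -> str:
    """GET: recent weekly runs, newest first."""
    return f"{_root(business_id)}?{urlencode({'limit': limit})}"


def run_path(business_id: str, run_id: str) -> str:
    """GET: one run with its per-cluster summary."""
    return f"{_root(business_id)}/{quote(run_id, safe='')}"


def cluster_path(business_id: str, run_id: str, cluster_id: str) -> str:
    """GET: one cluster's verdict, exemplar prompt, and both answers."""
    return f"{run_path(business_id, run_id)}/clusters/{quote(cluster_id, safe='')}"


def trigger_request(business_id: str,
                    week_start: Optional[str] = None) -> tuple[str, str]:
    """POST: return (path, JSON body); omit weekStart to use the server default."""
    body = {} if week_start is None else {"weekStart": week_start}
    return f"{_root(business_id)}/trigger", json.dumps(body)
```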

The list response comes back camelCase through the typed client:

```json
{
  "runs": [
    {
      "id": "f1e2d3c4-b5a6-4789-90ab-cdef01234567",
      "weekStart": "2026-06-01",
      "status": "complete",
      "weightedAmdahl": 4.31,
      "weightedRaw": 3.42,
      "weightedDelta": 0.89,
      "amdahlWins": 14,
      "rawWins": 3,
      "ties": 1,
      "totalClusters": 18,
      "analyticalPrompts": 96,
      "totalPrompts": 240
    }
  ]
}
```
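
A consumer outside the typed client might map each list item onto a small record and sanity-check it. The sketch below uses the field names from the sample response. The `RunSummary` type and the two consistency checks (wins and ties sum to the cluster count, delta equals Amdahl minus Raw) are assumptions inferred from that one sample, not documented guarantees.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSummary:
    """Subset of one list item (hypothetical client-side type)."""
    id: str
    week_start: str
    status: str
    weighted_amdahl: Optional[float]  # null while the run is still in progress
    weighted_raw: Optional[float]
    weighted_delta: Optional[float]
    amdahl_wins: Optional[int]
    raw_wins: Optional[int]
    ties: Optional[int]
    total_clusters: Optional[int]


def parse_run(item: dict) -> RunSummary:
    """Map the camelCase JSON keys onto the record above."""
    return RunSummary(
        id=item["id"], week_start=item["weekStart"], status=item["status"],
        weighted_amdahl=item.get("weightedAmdahl"),
        weighted_raw=item.get("weightedRaw"),
        weighted_delta=item.get("weightedDelta"),
        amdahl_wins=item.get("amdahlWins"), raw_wins=item.get("rawWins"),
        ties=item.get("ties"), total_clusters=item.get("totalClusters"),
    )


def is_consistent(run: RunSummary) -> bool:
    """Check the assumed invariants; only meaningful for completed runs."""
    if run.status != "complete":
        return True
    counts_ok = run.amdahl_wins + run.raw_wins + run.ties == run.total_clusters
    delta_ok = abs((run.weighted_amdahl - run.weighted_raw) - run.weighted_delta) < 0.005
    return counts_ok and delta_ok


payload = json.loads("""{"runs": [{"id": "f1e2d3c4-b5a6-4789-90ab-cdef01234567",
  "weekStart": "2026-06-01", "status": "complete", "weightedAmdahl": 4.31,
  "weightedRaw": 3.42, "weightedDelta": 0.89, "amdahlWins": 14, "rawWins": 3,
  "ties": 1, "totalClusters": 18, "analyticalPrompts": 96, "totalPrompts": 240}]}""")
run = parse_run(payload["runs"][0])
```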

A run starts at status: "running" with null scores and fills in when the judge finishes (typically 5-10 minutes), so a client can poll the list until status flips to complete.
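
The polling pattern can be sketched with an injected fetch function, which keeps the loop independent of how the session-authenticated request is made. `fetch_runs` is a hypothetical callable returning the `runs` array from the list endpoint. The interval and timeout defaults are arbitrary choices, and any terminal status other than `complete` (a failed run, for example) is not described here, so the loop simply times out in that case.

```python
import time
from typing import Callable, Optional


def wait_for_run(fetch_runs: Callable[[], list[dict]], run_id: str,
                 interval_s: float = 30.0, timeout_s: float = 900.0,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Poll the run list until `run_id` is complete; return it, or None on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        for run in fetch_runs():
            if run.get("id") == run_id and run.get("status") == "complete":
                return run
        if time.monotonic() >= deadline:
            return None
        sleep(interval_s)


# Usage with a canned sequence of list responses in place of real HTTP calls.
responses = iter([
    [{"id": "r1", "status": "running", "weightedAmdahl": None}],
    [{"id": "r1", "status": "complete", "weightedAmdahl": 4.31}],
])
done = wait_for_run(lambda: next(responses), "r1", sleep=lambda s: None)
```

Passing `sleep` as a parameter makes the loop testable without real delays. In production the default `time.sleep` applies.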