Benchmarks
A weekly scorecard that answers whether your team gets better answers with Amdahl than without it: real production prompts are run through both paths, scored by an LLM judge, and rolled up into a single weighted quality score and delta.
What it measures
Every week Amdahl scores the questions your team actually asked - the prompts that ran through your agents and MCP tools - not a synthetic suite. The methodology is the same each run:
- Cluster. Group the week's analytical prompts into themes, so one verdict can stand in for a whole family of similar questions.
- Run both paths. Send each cluster's exemplar prompt through Amdahl's normalized intelligence layer and through raw access to the same model with no Amdahl context.
- Judge. An LLM judge picks a winner for each cluster and scores each answer out of 5.
- Extrapolate. Weight each verdict by its cluster size and roll it up into one weighted Amdahl score, one weighted Raw score, and the delta between them - your quality lift.
The result is a number you can track week over week: how much better your customer-context engine makes the answers, expressed as a score out of 5 and a percentage lift, plus the per-cluster win / loss / tie split.
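The roll-up described above can be sketched in a few lines. This is a minimal illustration, not Amdahl's actual implementation: the `ClusterVerdict` type, the `roll_up` function, and the field names are hypothetical, and the percentage lift is assumed here to be the delta divided by the weighted Raw score.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ClusterVerdict:
    """One judged cluster (hypothetical shape, for illustration only)."""
    size: int            # number of prompts the cluster stands in for
    amdahl_score: float  # judge score out of 5 for the Amdahl answer
    raw_score: float     # judge score out of 5 for the raw-model answer
    winner: str          # "amdahl", "raw", or "tie", as picked by the judge


def roll_up(verdicts):
    """Weight each verdict by cluster size and return the headline numbers."""
    total = sum(v.size for v in verdicts)
    if total == 0:
        raise ValueError("no prompts to weight")
    weighted_amdahl = sum(v.amdahl_score * v.size for v in verdicts) / total
    weighted_raw = sum(v.raw_score * v.size for v in verdicts) / total
    delta = weighted_amdahl - weighted_raw
    # Assumption: lift is expressed relative to the Raw score.
    lift_pct = delta / weighted_raw * 100 if weighted_raw else None
    wins = Counter(v.winner for v in verdicts)
    return {
        "weighted_amdahl": weighted_amdahl,
        "weighted_raw": weighted_raw,
        "delta": delta,
        "lift_pct": lift_pct,
        "amdahl_wins": wins["amdahl"],
        "raw_wins": wins["raw"],
        "ties": wins["tie"],
    }


# Made-up example week: three clusters covering 100 prompts.
verdicts = [
    ClusterVerdict(size=60, amdahl_score=4.5, raw_score=3.0, winner="amdahl"),
    ClusterVerdict(size=30, amdahl_score=4.0, raw_score=4.0, winner="tie"),
    ClusterVerdict(size=10, amdahl_score=3.0, raw_score=4.0, winner="raw"),
]
summary = roll_up(verdicts)
```

Because each verdict is weighted by cluster size, a win on a large cluster moves the headline score more than a loss on a small one, which is why the 60-prompt cluster dominates this example.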
In the app
Open Benchmarks from the sidebar. The top of the page is the headline for the latest completed week - Amdahl vs Raw, the delta, the quality lift, and how many clusters favored each side. Below it a trend strip shows the last few weeks at a glance, and a runs table lists every recent run. Click a run to drill into its per-cluster verdicts, down to the exemplar prompt and both answers side by side.
Benchmarks run automatically every Sunday, but you don't have to wait. Hit Run benchmark and pick a week (this week back to about a quarter ago). It defaults to the most recent completed week - an in-progress week has nothing to score yet - and re-runs are safe, so you can always refresh a number.
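To make the "most recent completed week" default concrete, here is a small sketch of how a client might compute it. It assumes weeks start on Monday (consistent with the sample weekStart of 2026-06-01, which is a Monday) and that weeks are identified by their start date; both function names are hypothetical and the real week boundaries may differ.

```python
from datetime import date, timedelta


def most_recent_completed_week_start(today: date) -> date:
    """Start date of the last week that has fully finished.

    Assumption: weeks run Monday through Sunday.
    """
    current_week_start = today - timedelta(days=today.weekday())
    return current_week_start - timedelta(days=7)


def is_completed_week(week_start: date, today: date) -> bool:
    """True once all seven days of the week starting at week_start have passed."""
    return week_start + timedelta(days=7) <= today


# Example: on Wednesday 2026-06-10 the current week began on 2026-06-08,
# so the default would be the week that began on 2026-06-01.
default_week = most_recent_completed_week_start(date(2026, 6, 10))
```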
The Benchmarks screen shows the headline for the latest week, a trend strip of recent weeks, and a runs table that drills into per-cluster verdicts.
Apps & Access gating
Benchmarks is a console segment, so it's gated the same two ways every segment is:
- Entitlement - a workspace turns the whole segment on or off in Settings -> Apps. If you don't see Benchmarks in the sidebar, it's switched off here.
- Surface access - in Settings -> Access a role or member gets Hidden / Read / Write on the Benchmarks surface. Read = view runs and drill into them; Write = also trigger a run.
As everywhere in Amdahl, the grandfathered default applies until a tenant sets a policy: the segment is enabled and every member has full access.
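The two gates and the default combine roughly as follows. This is an illustrative model only: the `Access` enum and the function names are hypothetical, and `None` is used here to stand for "the tenant has not set a policy".

```python
from enum import Enum
from typing import Optional


class Access(Enum):
    HIDDEN = "hidden"
    READ = "read"
    WRITE = "write"


def effective_access(entitled: Optional[bool], level: Optional[Access]) -> Access:
    """Combine the Settings -> Apps entitlement with the Settings -> Access level.

    None means no policy has been set, so the default applies:
    segment enabled, full access.
    """
    if entitled is None:
        entitled = True
    if level is None:
        level = Access.WRITE
    # A switched-off segment is hidden regardless of the surface level.
    return level if entitled else Access.HIDDEN


def can_view_runs(access: Access) -> bool:
    """Read and Write can both view runs and drill into them."""
    return access in (Access.READ, Access.WRITE)


def can_trigger_run(access: Access) -> bool:
    """Only Write can trigger a run."""
    return access is Access.WRITE
```

For example, `effective_access(None, None)` yields `Access.WRITE` (the default), while `effective_access(False, Access.WRITE)` yields `Access.HIDDEN` because the entitlement gate is checked first.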
Over the API
Benchmarks is a customer-facing read surface - the same one the app uses - mounted at /api/businesses/:businessId/benchmarks. You reach it with your console session (signed-in user + workspace membership), not the X-API-Key platform API. It is not on MCP: a weekly quality scorecard is a console view a human reads, not an agent action.
- GET /api/businesses/:businessId/benchmarks?limit=12 - recent weekly runs, newest first.
- GET /api/businesses/:businessId/benchmarks/:runId - one run with its per-cluster summary.
- GET /api/businesses/:businessId/benchmarks/:runId/clusters/:clusterId - drill into one cluster: full verdict, the exemplar prompt, and both answers.
- POST /api/businesses/:businessId/benchmarks/trigger - kick off a run. Optional weekStart (ISO date); it defaults to the most recent completed week, and naming a week that hasn't finished is rejected.
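A client typically needs only a few path builders for these routes. The sketch below uses the Python standard library; the helper names are hypothetical, and sending weekStart as a JSON body field is an assumption, since only the parameter name and its ISO-date format are stated here. Authentication (the console session) is left out.

```python
from typing import Optional
from urllib.parse import quote, urlencode

API_ROOT = "/api/businesses"


def _benchmarks_root(business_id: str) -> str:
    # Percent-encode the ID so unusual characters cannot break the path.
    return f"{API_ROOT}/{quote(business_id, safe='')}/benchmarks"


def list_runs_path(business_id: str, limit: int = 12) -> str:
    """Recent weekly runs, newest first."""
    return f"{_benchmarks_root(business_id)}?{urlencode({'limit': limit})}"


def run_path(business_id: str, run_id: str) -> str:
    """One run with its per-cluster summary."""
    return f"{_benchmarks_root(business_id)}/{quote(run_id, safe='')}"


def cluster_path(business_id: str, run_id: str, cluster_id: str) -> str:
    """One cluster: verdict, exemplar prompt, and both answers."""
    return f"{run_path(business_id, run_id)}/clusters/{quote(cluster_id, safe='')}"


def trigger_request(business_id: str, week_start: Optional[str] = None):
    """Return (path, payload) for the POST that kicks off a run.

    Assumption: weekStart travels in the JSON body. Omit it to let the
    server pick the most recent completed week.
    """
    payload = {} if week_start is None else {"weekStart": week_start}
    return f"{_benchmarks_root(business_id)}/trigger", payload
```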
The list response comes back camelCase through the typed client:
{
  "runs": [
    {
      "id": "f1e2d3c4-b5a6-4789-90ab-cdef01234567",
      "weekStart": "2026-06-01",
      "status": "complete",
      "weightedAmdahl": 4.31,
      "weightedRaw": 3.42,
      "weightedDelta": 0.89,
      "amdahlWins": 14,
      "rawWins": 3,
      "ties": 1,
      "totalClusters": 18,
      "analyticalPrompts": 96,
      "totalPrompts": 240
    }
  ]
}

A run starts at status: "running" with null scores and fills in when the judge finishes (typically 5-10 minutes), so a client can poll the list until status flips to complete.
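Polling the list until a run completes might look like the sketch below. The `wait_for_run` helper is hypothetical, and the HTTP call is injected as a `fetch_runs` callable so the loop stays independent of how your console session is authenticated. Only the "running" and "complete" statuses are described here, so any other status is simply treated as not finished until the timeout expires.

```python
import time
from typing import Callable


def wait_for_run(
    fetch_runs: Callable[[], dict],
    run_id: str,
    interval_s: float = 30.0,
    timeout_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll the list response until the given run reports status "complete".

    fetch_runs must return the parsed JSON of the list endpoint,
    i.e. a dict with a "runs" array.
    """
    deadline = clock() + timeout_s
    while True:
        runs = fetch_runs().get("runs", [])
        run = next((r for r in runs if r.get("id") == run_id), None)
        if run is not None and run.get("status") == "complete":
            return run
        if clock() >= deadline:
            raise TimeoutError(f"run {run_id} did not complete within {timeout_s}s")
        sleep(interval_s)


# Demo with canned responses instead of real HTTP calls.
_canned = iter([
    {"runs": [{"id": "r1", "status": "running", "weightedAmdahl": None}]},
    {"runs": [{"id": "r1", "status": "complete", "weightedAmdahl": 4.31}]},
])
finished = wait_for_run(lambda: next(_canned), "r1", sleep=lambda _s: None)
```

The 30-second interval and 15-minute timeout are arbitrary defaults chosen to sit comfortably around the typical 5-10 minute judge time; tune them to taste.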