Benchmarks

A weekly scorecard that answers whether your team gets better answers with Amdahl than without it: real production prompts are run through both paths and scored by an LLM judge, then rolled up into a weighted quality score for each path and the delta between them.

What it measures

Every week Amdahl scores the questions your team actually asked - the prompts that ran through your agents and MCP tools - not a synthetic suite. The methodology is the same each run:

  • Cluster. Group the week's analytical prompts into themes, so one verdict can stand in for a whole family of similar questions.
  • Run both paths. Send each cluster's exemplar prompt through Amdahl's normalized intelligence layer and through raw access to the same model with no Amdahl context.
  • Judge. An LLM judge picks a winner (or calls a tie) and scores each answer out of 5.
  • Extrapolate. Weight each verdict by its cluster size and roll it up into one weighted Amdahl score, one weighted Raw score, and the delta between them - your quality lift.
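The Extrapolate step can be sketched as a size-weighted mean. This is a minimal illustration, assuming each cluster's exemplar scores are weighted by the number of prompts in the cluster; the `ClusterVerdict` type, the `roll_up` function, and the sample numbers are hypothetical and not Amdahl's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClusterVerdict:
    size: int            # number of prompts in the cluster (assumed weight)
    amdahl_score: float  # judge score out of 5 for the Amdahl answer
    raw_score: float     # judge score out of 5 for the raw-model answer


def roll_up(verdicts: List[ClusterVerdict]) -> Dict[str, float]:
    """Size-weighted mean of each path's score, plus the delta between them."""
    total = sum(v.size for v in verdicts)
    if total == 0:
        raise ValueError("no prompts to weight")
    weighted_amdahl = sum(v.size * v.amdahl_score for v in verdicts) / total
    weighted_raw = sum(v.size * v.raw_score for v in verdicts) / total
    return {
        "weighted_amdahl": weighted_amdahl,
        "weighted_raw": weighted_raw,
        "weighted_delta": weighted_amdahl - weighted_raw,
    }


# Two made-up clusters: a large one Amdahl wins, a small one Raw wins.
verdicts = [
    ClusterVerdict(size=30, amdahl_score=5.0, raw_score=3.0),
    ClusterVerdict(size=10, amdahl_score=3.0, raw_score=4.0),
]
summary = roll_up(verdicts)
print(summary)
```

Because the large cluster carries three times the weight, its verdict dominates the rolled-up scores even though the small cluster went the other way.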

The result is a number you can track week over week: how much better your customer-context engine makes the answers, expressed as a score out of 5 and a percentage lift, plus the per-cluster win / loss / tie split.
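To make the percentage lift concrete, here is a small sketch. It assumes the lift is the delta expressed as a percentage of the Raw score; that definition is an assumption, not something this page specifies, and `quality_lift_pct` is a hypothetical helper. The inputs are the sample scores from the list response shown under Over the API.

```python
def quality_lift_pct(weighted_amdahl: float, weighted_raw: float) -> float:
    """Delta as a percentage of the raw baseline (assumed definition of lift)."""
    if weighted_raw <= 0:
        raise ValueError("raw score must be positive")
    return (weighted_amdahl - weighted_raw) / weighted_raw * 100


# Sample scores: Amdahl 4.31 vs Raw 3.42, a delta of 0.89.
lift = quality_lift_pct(4.31, 3.42)
print(f"{lift:.1f}%")  # prints "26.0%"
```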

In the app

Open Benchmarks from the sidebar. The top of the page is the headline for the latest completed week - Amdahl vs Raw, the delta, the quality lift, and how many clusters favored each side. Below it a trend strip shows the last few weeks at a glance, and a runs table lists every recent run. Click a run to drill into its per-cluster verdicts, down to the exemplar prompt and both answers side by side.

Benchmarks run automatically every Sunday, but you don't have to wait. Hit Run benchmark and pick a week (this week back to about a quarter ago). It defaults to the most recent completed week - an in-progress week has nothing to score yet - and re-runs are safe, so you can always refresh a number.
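The "most recent completed week" default can be sketched with plain date arithmetic. This assumes weeks start on Monday, which matches the sample weekStart of 2026-06-01 (a Monday) but is otherwise an assumption; both function names are hypothetical.

```python
from datetime import date, timedelta


def most_recent_completed_week_start(today: date) -> date:
    """Monday of the last week that fully ended before today's week began."""
    this_monday = today - timedelta(days=today.weekday())
    return this_monday - timedelta(days=7)


def is_completed_week(week_start: date, today: date) -> bool:
    """True if the week beginning at week_start has already finished."""
    return week_start <= most_recent_completed_week_start(today)


today = date(2026, 6, 10)  # a Wednesday
default_week = most_recent_completed_week_start(today)
print(default_week.isoformat())                     # 2026-06-01
print(is_completed_week(date(2026, 6, 8), today))   # False: still in progress
```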

The Benchmarks screen shows the headline for the latest week, a trend strip of recent weeks, and a runs table that drills into per-cluster verdicts.

Apps & Access gating

Benchmarks is a console segment, so it's gated the same two ways every segment is:

  • Entitlement - a workspace turns the whole segment on or off in Settings -> Apps. If you don't see Benchmarks in the sidebar, it's switched off here.
  • Surface access - in Settings -> Access a role or member gets Hidden / Read / Write on the Benchmarks surface. Read = view runs and drill into them; Write = also trigger a run.

As everywhere in Amdahl, the grandfathered default applies until a tenant sets a policy: the segment is enabled and everyone has full access.
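The two gates and the default combine roughly as follows. This is a sketch of the rules described above, not Amdahl's code: the `SurfaceAccess` enum and the helper functions are hypothetical, and `None` stands in for "no policy set yet".

```python
from enum import Enum
from typing import Optional


class SurfaceAccess(Enum):
    HIDDEN = "hidden"
    READ = "read"
    WRITE = "write"


def effective_access(entitled: Optional[bool],
                     access: Optional[SurfaceAccess]) -> SurfaceAccess:
    """Resolve both gates; None means no policy, which defaults to enabled + Write."""
    if entitled is False:
        return SurfaceAccess.HIDDEN  # segment switched off for the workspace
    return access if access is not None else SurfaceAccess.WRITE


def can_view(entitled: Optional[bool], access: Optional[SurfaceAccess]) -> bool:
    """Read or Write lets a member view runs and drill into them."""
    return effective_access(entitled, access) is not SurfaceAccess.HIDDEN


def can_trigger(entitled: Optional[bool], access: Optional[SurfaceAccess]) -> bool:
    """Only Write lets a member trigger a run."""
    return effective_access(entitled, access) is SurfaceAccess.WRITE


print(can_view(None, None), can_trigger(None, None))                     # True True
print(can_view(True, SurfaceAccess.READ), can_trigger(True, SurfaceAccess.READ))  # True False
```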

Over the API

Benchmarks is a customer-facing read surface - the same one the app uses - mounted at /api/businesses/:businessId/benchmarks. You reach it with your console session (signed-in user + workspace membership), not the X-API-Key platform API. It is not on MCP: a weekly quality scorecard is a console view a human reads, not an agent action.

  • GET /api/businesses/:businessId/benchmarks?limit=12 - recent weekly runs, newest first.
  • GET /api/businesses/:businessId/benchmarks/:runId - one run with its per-cluster summary.
  • GET /api/businesses/:businessId/benchmarks/:runId/clusters/:clusterId - drill one cluster: full verdict, the exemplar prompt, and both answers.
  • POST /api/businesses/:businessId/benchmarks/trigger - kick off a run. Optional weekStart (ISO date); it defaults to the most recent completed week, and naming a week that hasn't finished is rejected.
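A thin client for these four routes might look like the sketch below. Several details are assumptions rather than documented behavior: the host in `BASE_URL` is a placeholder, the console session is assumed to travel as a `Cookie` header, and `weekStart` is assumed to go in a JSON request body. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://console.example.com"  # placeholder host, replace with yours


def _benchmarks_base(business_id: str) -> str:
    return f"{BASE_URL}/api/businesses/{quote(business_id, safe='')}/benchmarks"


def list_runs_url(business_id: str, limit: int = 12) -> str:
    query = urlencode({"limit": limit})
    return f"{_benchmarks_base(business_id)}?{query}"


def run_url(business_id: str, run_id: str) -> str:
    return f"{_benchmarks_base(business_id)}/{quote(run_id, safe='')}"


def cluster_url(business_id: str, run_id: str, cluster_id: str) -> str:
    return f"{run_url(business_id, run_id)}/clusters/{quote(cluster_id, safe='')}"


def trigger_request(business_id: str, session_cookie: str,
                    week_start: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the POST; omitting week_start uses the server default."""
    body = {} if week_start is None else {"weekStart": week_start}  # assumed JSON body
    return urllib.request.Request(
        f"{_benchmarks_base(business_id)}/trigger",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Cookie": session_cookie},
        method="POST",
    )


def get_json(url: str, session_cookie: str) -> dict:
    """GET a benchmarks URL with the (assumed) session cookie and decode the JSON."""
    req = urllib.request.Request(url, headers={"Cookie": session_cookie})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `get_json(list_runs_url("your-business-id"), cookie)` would fetch the recent runs, and `urllib.request.urlopen(trigger_request(...))` would kick off a run for a member with Write access.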

The list response comes back camelCase through the typed client:

```json
{
  "runs": [
    {
      "id": "f1e2d3c4-b5a6-4789-90ab-cdef01234567",
      "weekStart": "2026-06-01",
      "status": "complete",
      "weightedAmdahl": 4.31,
      "weightedRaw": 3.42,
      "weightedDelta": 0.89,
      "amdahlWins": 14,
      "rawWins": 3,
      "ties": 1,
      "totalClusters": 18,
      "analyticalPrompts": 96,
      "totalPrompts": 240
    }
  ]
}
```
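If you are not using the typed client, the list response is easy to consume by hand. The sketch below parses the sample payload into a small dataclass and prints a one-line summary per run; `RunSummary`, `parse_runs`, and `describe` are hypothetical names, and score fields are treated as optional because a run that is still in progress reports them as null.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

SAMPLE = """
{"runs": [{"id": "f1e2d3c4-b5a6-4789-90ab-cdef01234567",
           "weekStart": "2026-06-01", "status": "complete",
           "weightedAmdahl": 4.31, "weightedRaw": 3.42, "weightedDelta": 0.89,
           "amdahlWins": 14, "rawWins": 3, "ties": 1, "totalClusters": 18,
           "analyticalPrompts": 96, "totalPrompts": 240}]}
"""


@dataclass
class RunSummary:
    id: str
    week_start: str
    status: str
    weighted_amdahl: Optional[float]
    weighted_raw: Optional[float]
    weighted_delta: Optional[float]
    amdahl_wins: Optional[int]
    raw_wins: Optional[int]
    ties: Optional[int]
    total_clusters: Optional[int]


def parse_runs(payload: str) -> List[RunSummary]:
    """Map the camelCase response onto snake_case fields."""
    return [
        RunSummary(
            id=r["id"], week_start=r["weekStart"], status=r["status"],
            weighted_amdahl=r.get("weightedAmdahl"),
            weighted_raw=r.get("weightedRaw"),
            weighted_delta=r.get("weightedDelta"),
            amdahl_wins=r.get("amdahlWins"), raw_wins=r.get("rawWins"),
            ties=r.get("ties"), total_clusters=r.get("totalClusters"),
        )
        for r in json.loads(payload)["runs"]
    ]


def describe(run: RunSummary) -> str:
    """One-line headline; runs without scores yet just report their status."""
    if run.status != "complete" or run.weighted_amdahl is None:
        return f"{run.week_start}: {run.status}"
    return (f"{run.week_start}: Amdahl {run.weighted_amdahl:.2f} vs "
            f"Raw {run.weighted_raw:.2f} ({run.weighted_delta:+.2f}), "
            f"{run.amdahl_wins}-{run.raw_wins}-{run.ties} "
            f"across {run.total_clusters} clusters")


runs = parse_runs(SAMPLE)
print(describe(runs[0]))
```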

A run starts at status: "running" with null scores and fills in when the judge finishes (typically 5-10 minutes), so a client can poll the list until status flips to complete.