# Metric harness
## Why a harness

AI product changes are easy to ship and hard to evaluate. The team adds a new prompt feature, declares it a win, and nobody measures whether it actually helps. The metric harness gives us a deterministic way to measure paired (control, treatment) plan outcomes and emit a qualitative verdict.
Lives at `backend/src/services/project-graph/metric-harness.ts`, with a CLI at `scripts/pg-metric-harness.ts`.
## Shape of a run

```ts
export interface RunResult {
  condition: "control" | "treatment";
  fixtureId: string;
  critiqueScore: number | null; // 0..25
  revised: boolean;
  landmarksUsed: number;
  planned: boolean;
}
```

One run = one plan decomposition of one fixture (a synthetic ticket or a real archived one).
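To make the shape concrete, here is a hypothetical example row (the fixture ID and score are invented for illustration; the interface is repeated inline so the snippet stands alone):

```typescript
// RunResult mirrored from the interface above so this snippet is self-contained.
interface RunResult {
  condition: "control" | "treatment";
  fixtureId: string;
  critiqueScore: number | null; // 0..25
  revised: boolean;
  landmarksUsed: number;
  planned: boolean;
}

// One treatment run of a (hypothetical) fixture "fx-001".
const run: RunResult = {
  condition: "treatment",
  fixtureId: "fx-001",
  critiqueScore: 21,
  revised: false,
  landmarksUsed: 4,
  planned: true,
};

console.log(run.fixtureId, run.critiqueScore);
```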
## Aggregation

`aggregateRuns(runs)` returns:

- `control` / `treatment` stats (n, mean, median, pass-rate, revise-rate).
- `meanScoreDelta` — treatment − control.
- `passRateDelta` — same, for pass rate.
- Raw rows for downstream analysis (t-test, Wilcoxon, etc.).
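The aggregation can be sketched as follows. This mirrors the documented fields, but the implementation details (median handling, the ≥20 pass threshold as a default parameter) are illustrative, not the actual `metric-harness.ts` source:

```typescript
interface RunResult {
  condition: "control" | "treatment";
  fixtureId: string;
  critiqueScore: number | null; // 0..25
  revised: boolean;
  landmarksUsed: number;
  planned: boolean;
}

interface ArmStats {
  n: number;        // scored runs only
  mean: number;
  median: number;
  passRate: number; // fraction of scored runs >= passThreshold
  reviseRate: number;
}

function armStats(runs: RunResult[], passThreshold = 20): ArmStats {
  const scored = runs
    .map((r) => r.critiqueScore)
    .filter((s): s is number => s !== null)
    .sort((a, b) => a - b);
  const n = scored.length;
  const mean = n ? scored.reduce((a, b) => a + b, 0) / n : 0;
  const median = n
    ? n % 2
      ? scored[(n - 1) / 2]
      : (scored[n / 2 - 1] + scored[n / 2]) / 2
    : 0;
  return {
    n,
    mean,
    median,
    passRate: n ? scored.filter((s) => s >= passThreshold).length / n : 0,
    reviseRate: runs.length ? runs.filter((r) => r.revised).length / runs.length : 0,
  };
}

function aggregateRuns(runs: RunResult[]) {
  const control = armStats(runs.filter((r) => r.condition === "control"));
  const treatment = armStats(runs.filter((r) => r.condition === "treatment"));
  return {
    control,
    treatment,
    meanScoreDelta: treatment.mean - control.mean,
    passRateDelta: treatment.passRate - control.passRate,
    rows: runs, // raw rows for downstream significance testing
  };
}
```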
```
$ npx tsx scripts/pg-metric-harness.ts runs.json
════════════════════════════════════════════════════════════
 PG.14 — God-nodes prompt metric validation
════════════════════════════════════════════════════════════
 fixtures: control n=3, treatment n=3

                control  treatment  delta
 ------------------------------------------------------
 planned              3          3      0
 mean score       13.67      20.33   6.66
 median score        14         20      —
 pass rate ≥20        0       0.67   0.67
 revise rate       0.67          0      —
════════════════════════════════════════════════════════════
 verdict: improved
 note: mean +6.66 / pass-rate +67%
```

## Verdict rules
Conservative thresholds (from `interpretReport`):
- improved — meanScoreDelta ≥ 1 AND passRateDelta ≥ 0.
- regressed — meanScoreDelta ≤ −1.
- flat — neither.
- insufficient-data — one arm has no scored runs.
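The rules above reduce to a small decision function. This is a sketch of the documented thresholds, not the actual `interpretReport` source; the report field names are assumptions:

```typescript
type Verdict = "improved" | "regressed" | "flat" | "insufficient-data";

// Hypothetical report shape: scored-run counts per arm plus the two deltas.
interface Report {
  controlScoredN: number;
  treatmentScoredN: number;
  meanScoreDelta: number;
  passRateDelta: number;
}

function interpretReport(r: Report): Verdict {
  // One arm with no scored runs means nothing can be compared.
  if (r.controlScoredN === 0 || r.treatmentScoredN === 0) return "insufficient-data";
  if (r.meanScoreDelta >= 1 && r.passRateDelta >= 0) return "improved";
  if (r.meanScoreDelta <= -1) return "regressed";
  return "flat";
}
```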
We don’t compute confidence intervals; sample size needed depends on your corpus. Feed the raw rows to R / numpy if you need significance testing.
## How to produce a paired run set

Option A — real archive replay:
- Export 50 recent real plans with their critique scores (attempt 1, not replans).
- For each, rerun the planner with and without the feature under test, capturing `(condition, fixtureId, critiqueScore, …)`.
- Write to `runs.json`.
Option B — synthetic fixtures:
- Hand-write 10 plausible ticket descriptions covering different scope shapes.
- Run the planner against each, with and without the feature.
- Capture the same fields.
Option A is more realistic but requires a real corpus. Option B is enough for smoke testing a prompt change.
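Either way, `runs.json` ends up as a flat array of `RunResult` rows. A minimal paired set for two fixtures might look like this (fixture IDs and scores are invented for illustration):

```json
[
  { "condition": "control",   "fixtureId": "fx-001", "critiqueScore": 14, "revised": true,  "landmarksUsed": 0, "planned": true },
  { "condition": "treatment", "fixtureId": "fx-001", "critiqueScore": 21, "revised": false, "landmarksUsed": 4, "planned": true },
  { "condition": "control",   "fixtureId": "fx-002", "critiqueScore": 12, "revised": false, "landmarksUsed": 0, "planned": true },
  { "condition": "treatment", "fixtureId": "fx-002", "critiqueScore": 19, "revised": false, "landmarksUsed": 3, "planned": true }
]
```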
## Wiring into CI

The CLI exits with code 2 on `regressed`. Put it in your PR checks:

```yaml
- name: Metric harness
  run: |
    npx tsx scripts/pg-metric-harness.ts ./eval-runs.json
```

Use the `insufficient-data` / `flat` exit code (0) for non-blocking advisory runs.
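The blocking rule boils down to a tiny exit-code mapping. A sketch of the documented contract, not the actual CLI script:

```typescript
type Verdict = "improved" | "regressed" | "flat" | "insufficient-data";

// Only a regression should fail the PR check; everything else is advisory.
function exitCodeFor(verdict: Verdict): number {
  return verdict === "regressed" ? 2 : 0;
}
```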
## What it doesn’t do

- Compare two different models head-to-head. (Possible, but the comparison knob is `condition` — add a `variant` field if you need more arms.)
- Statistical significance testing. Deliberately out of scope; plug in your stats tool of choice on the raw rows.
- Automate the planner runs. You produce `runs.json`; the harness just aggregates it.
## Future: continuous eval

Plausible shape:

- Nightly cron re-runs the planner against the last 7 days of real briefs.
- Compares to the previous nightly run’s results.
- Posts a Slack notification on `verdict: regressed`.
Not yet built. Would live as a cron-job pattern in `backend/src/services/scheduled-jobs/`.
## The reason we built this

Internally, we landed a feature (PG.6 — god-nodes in planner prompt) claiming it helped planner quality. When someone asked “how much?”, the honest answer was “we don’t know — we eyeballed it.” Shipping the harness forced us to either back the claim with numbers or walk it back. That’s the discipline we want every future prompt change to face.