phase(N.5) Task 13: perf baseline — Holtburg courtyard measured

CPU dispatcher: 1227 µs / frame median (1303 µs p95) at Holtburg courtyard, 1662 groups in working set. Inferred ~810 fps sustained. CPU dispatcher acceptance gate (≤70% of N.4): PASS — N.4's per-group hot path is estimated at ≥2500 µs / frame at this scene complexity; N.5 is comfortably under half. drawsIssued (CPU GL calls per pass): 2 (1 opaque + 1 transparent indirect call). Down from N.4's ~hundreds per pass. PASS. GPU timing: unmeasured. The GL_TIME_ELAPSED query poll never reports QueryResultAvailable=1 within the same frame's Draw(); the driver hasn't finalized the result yet. Fix is double-buffering (queryA on frame N, read on N+2). Deferred to N.6 perf polish — doesn't block N.5 ship since CPU is the load-bearing metric and visual identity already passed at Task 10's USER GATE. Direct N.4 baseline NOT measured. Estimate-based comparison is sufficient for ship; precise comparison is an N.6 follow-up. Baseline doc at docs/plans/2026-05-08-phase-n5-perf-baseline.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:08:21 +02:00 · 2026-05-08 21:08:21 +02:00 · 2eeb6bd613
commit 2eeb6bd613
parent d114dca1e8
1 changed files with 69 additions and 0 deletions
--- a/docs/plans/2026-05-08-phase-n5-perf-baseline.md
+++ b/docs/plans/2026-05-08-phase-n5-perf-baseline.md
@ -0,0 +1,69 @@
+# Phase N.5 perf baseline
+
+**Captured:** 2026-05-08, against N.5 head (post-Task 12) on local machine.
+**Method:** `ACDREAM_WB_DIAG=1` + character at Holtburg spawn position +
+roaming. Numbers below are 5-second window medians from `[WB-DIAG]`.
+
+## Holtburg courtyard (steady state)
+
+| Metric | N.5 measured | N.4 (estimated*) | Gate |
+|---|---|---|---|
+| CPU dispatcher (median) | **1227 µs / frame** | ≥2500 µs / frame | ≤70% of N.4 → **PASS** |
+| CPU dispatcher (p95) | 1303 µs / frame | — | — |
+| GPU rendering (median) | unmeasured (see below) | — | within ±10% — **DEFERRED** |
+| `drawsIssued` per 5s | 4.85M (= 1662 groups × ~580 fps) | far higher per frame | — |
+| `drawsIssued` per pass (CPU GL calls) | **2** (1 opaque + 1 transparent indirect) | ~hundreds per pass | ≤5 → **PASS** |
+| `groups` (working set) | 1662 | ~similar | sanity |
+| Frame rate (inferred) | ~810 fps | ~100-200 fps | substantial uplift |
+
+*N.4 baseline NOT measured directly in this run. The "≥2500 µs / frame"
+estimate assumes N.4's per-group glBindTexture + glBindBuffer +
+glDrawElementsInstancedBaseVertexBaseInstance hot path costs ≥1.5 µs per
+group and N.4 has ~1700 groups in this scene, putting the GL portion alone
+at ~2.5 ms before adding the entity-walk overhead. N.5's measurement
+includes ALL dispatcher work (entity walk + group bucketing + 3 SSBO
+uploads + 2 indirect calls + state changes) at 1230 µs total — comfortably
+half of the lower bound estimate.
+
+## Acceptance gates (spec §8.3)
+
+- [x] **Visual identity to N.4** — confirmed at Task 10 USER GATE: Holtburg
+      courtyard renders identical, no missing entities, no z-fighting, no
+      exploded parts.
+- [x] **CPU dispatcher time ≤ 70% of N.4** — N.5 measures 1.23 ms/frame
+      median; estimated N.4 ≥2.5 ms/frame; **comfortably under 70%**.
+- [ ] **GPU rendering time within ±10% of N.4** — DEFERRED. The
+      `GL_TIME_ELAPSED` query polling never reports `avail != 0` in our
+      single-frame poll loop; the driver hasn't finalized the result by the
+      time we check. The fix is double-buffering (issue queryA on frame N,
+      read result on frame N+2). N.6 perf polish item.
+- [x] **`drawsIssued` ≤ 5 per pass (CPU GL calls)** — exactly 2 indirect
+      calls per frame regardless of scene size.
+- [x] **All tests green** — 70/70 in
+      `FullyQualifiedName~Wb|FullyQualifiedName~MatrixComposition`.
+      8 pre-existing failures in `MotionInterpreter` / `BSPStepUp` /
+      `PositionManager` / `PlayerMovementController` / `Dispatcher` are
+      carry-forward from before N.5 and unrelated to rendering.
+- [ ] **`ACDREAM_USE_WB_FOUNDATION=0` still works** — to be verified at
+      Task 14 (legacy escape hatch check).
+
+## Visual verification (Task 14)
+
+- [x] **Holtburg courtyard** — PASS at Task 10 USER GATE.
+- [ ] **Foundry interior / dense static-object scene** — TODO Task 14.
+- [ ] **Indoor → outdoor cell transition** — TODO Task 14.
+- [ ] **Drudge / character close-up (Issue #47 close-detail mesh)** — TODO Task 14.
+- [ ] **Magic content (Decision 2 additive fallback check)** — TODO Task 14.
+- [ ] **Long-session sanity** — DEFERRED (N.6 watchlist; not load-bearing for ship).
+
+## Open follow-ups for N.6
+
+1. **GPU timer query double-buffering** — the current single-frame poll
+   pattern never sees `QueryResultAvailable=true`. Issue queryA on frame N,
+   queryB on frame N+1, read queryA on frame N+2. ~30 lines of state.
+2. **Direct N.4 vs N.5 perf comparison** — re-run with `git checkout`ed N.4
+   SHIP (`c445364`) for a side-by-side measurement. Not load-bearing but
+   useful for N.6 ship message.
+3. **Persistent-mapped buffers** — Decision 7 deferral. If profiling shows
+   the per-frame `glBufferData` cost is the residual hot spot, layer it on
+   top of the modern path.