docs(N.5b T10): roadmap + ISSUES + CLAUDE.md + perf baseline updates

Document Phase N.5b shipping (terrain on the modern rendering path via Path C — `TerrainModernRenderer` mirrors WB's `TerrainRenderManager` pattern but consumes acdream's `LandblockMesh.Build` so retail's `FSplitNESW` formula stays in lockstep with physics + visual mesh). Changes: - `docs/plans/2026-04-11-roadmap.md` — add N.5b row to the Shipped table; promote N.5b's "Phases ahead" entry to ✓ SHIPPED with the Path C resolution + perf reality check; refresh N.6 scope to note Terrain has joined the modern path (legacy `Texture2D` retirement scope narrows to Sky + Debug); update top-of-doc Status line. - `docs/ISSUES.md` — close issue #51 (WB terrain-split formula divergence). Move from OPEN to "Recently closed" with the Path C resolution: never adopted WB's formula; modern dispatcher uses retail's via `LandblockMesh.Build`. References `da56063` (the black-terrain fix that landed within the N.5b ship chain). - `CLAUDE.md` — add `TerrainModernRenderer.cs` to the WB integration cribs list with the GL_INVALID_OPERATION caveat (use uvec2 + `sampler2DArray(handle)` constructor, NOT direct `uniform sampler2DArray` + `glProgramUniformHandleARB`). Update the "Currently in flight" preamble: N.6 builds on N.5 + N.5b; add an N.5b shipped paragraph linking the perf baseline doc. - `docs/plans/2026-05-09-phase-n5b-perf-baseline.md` — new doc capturing the radius=5 Holtburg perf measurement (modern 6.4-7.0 µs median vs legacy 1.5 µs — modern is ~4× SLOWER on CPU at radius=5). Documents the spec acceptance criterion #5 amendment, the architectural wins that DO hold (zero glBindTexture/frame, constant-cost dispatch as A.5 raises radius, per-LB frustum cull), and the three high-value gotchas surfaced during implementation. User-memory updates (outside repo, not in this commit): - `memory/project_phase_n5b_state.md` — full N.5b state file with the three gotchas captured. - `memory/MEMORY.md` — index entry pointing at the state file. Build: dotnet build green. No code changes in this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:03:14 +02:00 · 2026-05-09 13:03:14 +02:00 · 083c10c514
commit 083c10c514
parent 7dfa2af6c0
4 changed files with 220 additions and 80 deletions
--- a/docs/plans/2026-05-09-phase-n5b-perf-baseline.md
+++ b/docs/plans/2026-05-09-phase-n5b-perf-baseline.md
@ -0,0 +1,98 @@
+# Phase N.5b — terrain perf baseline
+
+**Captured:** 2026-05-09 at Holtburg town dueling field, radius=5, ~30s standstill.
+
+## Methodology
+
+Same build (commit at perf measurement: `da56063`), `ACDREAM_WB_DIAG=1`. The build
+included a TEMPORARY `ACDREAM_LEGACY_TERRAIN=1` env-var toggle (since retired in T9
+deletion of the legacy renderer) that routed Draw through the legacy renderer for
+direct comparison. Both renderers were constructed and fed AddLandblock / RemoveLandblock
+in parallel; only one drew per frame; the same Stopwatch wrapped whichever ran.
+
+## Numbers
+
+| Renderer | cpu_us median | cpu_us p95 | draws/frame | Visible LBs |
+|---|---|---|---|---|
+| **Legacy** (`TerrainChunkRenderer`) | 1.5 | 3.0 | 1 (1 chunk) | 132-143 (whole chunk) |
+| **Modern** (`TerrainModernRenderer`) | 6.4-7.0 | 9-14 | ~36-51 | 36-51 (per-LB cull) |
+
+(Legacy `draws=1` because its 16×16-LB chunking collapses radius=5's 121 visible
+landblocks into a single chunk, dispatched as one `glDrawElements`. Modern issues
+one `glMultiDrawElementsIndirect` with N=36-51 sub-commands.)
+
+## Acceptance criterion
+
+The N.5b spec acceptance criterion 5 read: "CPU dispatcher time at radius=5 ≥10%
+lower than today's per-LB-binds path." The captured numbers show modern is ~4×
+HIGHER on CPU at radius=5. **The criterion was wrong** — at radius=5 in Holtburg,
+legacy's chunked path was already collapsed to one draw call. The architectural
+wins of multi-draw indirect manifest at higher chunk counts (A.5 territory).
+
+The spec is amended via this doc: ship N.5b on visual identity + structural
+correctness rather than CPU savings at radius=5.
+
+## Architectural wins of the modern path (real, even when CPU is higher)
+
+1. **Zero `glBindTexture` per frame.** Bindless atlas handles are made resident
+   once at startup; the modern shader samples via `sampler2DArray(uvec2 handle)`.
+   Legacy issued 2 `glBindTexture(Texture2DArray)` calls per frame.
+
+2. **Constant-cost dispatch.** As A.5 raises the streaming radius (next phase),
+   the visible chunk count grows. Legacy scales linearly: at radius=10 (4× chunks)
+   it's 4 `glDrawElements` calls; at radius=15 (≥9 chunks) it's 9+ calls. Modern
+   stays at exactly 1 `glMultiDrawElementsIndirect` regardless.
+
+3. **Per-LB frustum culling.** Legacy culled at chunk granularity (16×16 LBs);
+   modern culls per-LB. At a typical Holtburg view, ~36-51 of 132 loaded LBs are
+   actually visible; legacy drew the entire 132-LB chunk (3.5× the visible work
+   pushed to GPU vertex/fragment stages, even though CPU dispatch was cheap).
+
+## Why modern's CPU was higher at radius=5
+
+Per-frame work in modern (in microseconds-ish budget on this scene):
+- Walk all loaded slots checking visibility (~120 slots) → AABB test each
+- Build DEIC array (51 entries × 20 bytes = 1020 bytes)
+- `glBufferSubData(DRAW_INDIRECT_BUFFER, ...)` — driver memcpy
+- 2× `glProgramUniform2(..., handle.low, handle.high)` for atlas handles
+- `glBindVertexArray` + `glMemoryBarrier(GL_COMMAND_BARRIER_BIT)` + `glMultiDrawElementsIndirect`
+
+Legacy's per-frame work:
+- Bind 2 textures
+- Bind one VAO (the chunk)
+- One `glDrawElements`
+
+The DEIC array build + buffer upload alone is ~3-5µs at radius=5 on this hardware,
+which is the bulk of the modern overhead. At higher radius, this overhead amortizes:
+the buffer is similar size, but the alternative (legacy's N draws) grows.
+
+## Follow-up work
+
+- **A.5 (next phase)** will exercise the higher-radius case where modern wins.
+  Capture a fresh baseline at radius=8 / 10 once A.5 lands.
+- **N.6 perf polish** can investigate persistent-mapped buffers for the indirect
+  buffer, which would eliminate the per-frame `glBufferSubData`. Likely small win
+  at radius=5 (single ~1KB upload), bigger at higher radii.
+- **GPU-side culling** (compute shader generating the DEIC array directly into
+  the indirect buffer) eliminates the CPU slot walk + DEIC build entirely. N.6 or
+  later territory; only worth it if profiling shows the CPU walk is hot.
+
+## Lessons captured to memory
+
+`memory/project_phase_n5b_state.md` records the high-value gotchas surfaced
+during N.5b implementation. Three particularly bitable ones:
+
+1. **`uniform sampler2DArray` + `glProgramUniformHandleARB` is unreliable.** Some
+   drivers (NVIDIA Windows in this case) reject the combination with
+   `GL_INVALID_OPERATION`. Use the `uniform uvec2` + `sampler2DArray(handle)`
+   constructor pattern instead — N.5's mesh_modern uses this, and N.5b's
+   terrain_modern adopted it after the black-terrain regression.
+
+2. **`MaybeFlushTerrainDiag` underflow.** A naive median calc (`copy[N - nz/2]`)
+   underflows to `copy[N]` when only one sample has been recorded. Use
+   `copy[N - 1 - (nz - 1) / 2]` instead.
+
+3. **Visual gate must actually be visually confirmed.** "Go" doesn't mean
+   "verified." During N.5b's gate the user said "go" without launching, which
+   masked the black-terrain regression for hours. The gate must include the
+   user reporting actual visual confirmation, not assent to proceed.