acdream/docs/plans/2026-05-09-phase-n5b-perf-baseline.md
Erik 083c10c514 docs(N.5b T10): roadmap + ISSUES + CLAUDE.md + perf baseline updates
Document Phase N.5b shipping (terrain on the modern rendering path via
Path C — `TerrainModernRenderer` mirrors WB's `TerrainRenderManager`
pattern but consumes acdream's `LandblockMesh.Build` so retail's
`FSplitNESW` formula stays in lockstep with physics + visual mesh).

Changes:

- `docs/plans/2026-04-11-roadmap.md` — add N.5b row to the Shipped
  table; promote N.5b's "Phases ahead" entry to ✓ SHIPPED with the
  Path C resolution + perf reality check; refresh N.6 scope to note
  Terrain has joined the modern path (legacy `Texture2D` retirement
  scope narrows to Sky + Debug); update top-of-doc Status line.

- `docs/ISSUES.md` — close issue #51 (WB terrain-split formula
  divergence). Move from OPEN to "Recently closed" with the Path C
  resolution: never adopted WB's formula; modern dispatcher uses
  retail's via `LandblockMesh.Build`. References `da56063` (the
  black-terrain fix that landed within the N.5b ship chain).

- `CLAUDE.md` — add `TerrainModernRenderer.cs` to the WB integration
  cribs list with the GL_INVALID_OPERATION caveat (use uvec2 +
  `sampler2DArray(handle)` constructor, NOT direct
  `uniform sampler2DArray` + `glProgramUniformHandleARB`). Update
  the "Currently in flight" preamble: N.6 builds on N.5 + N.5b;
  add an N.5b shipped paragraph linking the perf baseline doc.

- `docs/plans/2026-05-09-phase-n5b-perf-baseline.md` — new doc
  capturing the radius=5 Holtburg perf measurement (modern 6.4-7.0
  µs median vs legacy 1.5 µs — modern is ~4× SLOWER on CPU at
  radius=5). Documents the spec acceptance criterion #5 amendment,
  the architectural wins that DO hold (zero glBindTexture/frame,
  constant-cost dispatch as A.5 raises radius, per-LB frustum cull),
  and the three high-value gotchas surfaced during implementation.

User-memory updates (outside repo, not in this commit):
- `memory/project_phase_n5b_state.md` — full N.5b state file with
  the three gotchas captured.
- `memory/MEMORY.md` — index entry pointing at the state file.

Build: dotnet build green. No code changes in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:03:14 +02:00

4.9 KiB
Raw Blame History

Phase N.5b — terrain perf baseline

Captured: 2026-05-09 at Holtburg town dueling field, radius=5, ~30s standstill.

Methodology

Same build (commit at perf measurement: da56063), ACDREAM_WB_DIAG=1. The build included a TEMPORARY ACDREAM_LEGACY_TERRAIN=1 env-var toggle (since retired in T9 deletion of the legacy renderer) that routed Draw through the legacy renderer for direct comparison. Both renderers were constructed and fed AddLandblock / RemoveLandblock in parallel; only one drew per frame; the same Stopwatch wrapped whichever ran.

Numbers

Renderer cpu_us median cpu_us p95 draws/frame Visible LBs
Legacy (TerrainChunkRenderer) 1.5 3.0 1 (1 chunk) 132-143 (whole chunk)
Modern (TerrainModernRenderer) 6.4-7.0 9-14 ~36-51 36-51 (per-LB cull)

(Legacy draws=1 because its 16×16-LB chunking collapses radius=5's 121 visible landblocks into a single chunk, dispatched as one glDrawElements. Modern issues one glMultiDrawElementsIndirect with N=36-51 sub-commands.)

Acceptance criterion

The N.5b spec acceptance criterion 5 read: "CPU dispatcher time at radius=5 ≥10% lower than today's per-LB-binds path." The captured numbers show modern is ~4× HIGHER on CPU at radius=5. The criterion was wrong — at radius=5 in Holtburg, legacy's chunked path was already collapsed to one draw call. The architectural wins of multi-draw indirect manifest at higher chunk counts (A.5 territory).

The spec is amended via this doc: ship N.5b on visual identity + structural correctness rather than CPU savings at radius=5.

Architectural wins of the modern path (real, even when CPU is higher)

  1. Zero glBindTexture per frame. Bindless atlas handles are made resident once at startup; the modern shader samples via sampler2DArray(uvec2 handle). Legacy issued 2 glBindTexture(Texture2DArray) calls per frame.

  2. Constant-cost dispatch. As A.5 raises the streaming radius (next phase), the visible chunk count grows. Legacy scales linearly: at radius=10 (4× chunks) it's 4 glDrawElements calls; at radius=15 (≥9 chunks) it's 9+ calls. Modern stays at exactly 1 glMultiDrawElementsIndirect regardless.

  3. Per-LB frustum culling. Legacy culled at chunk granularity (16×16 LBs); modern culls per-LB. At a typical Holtburg view, ~36-51 of 132 loaded LBs are actually visible; legacy drew the entire 132-LB chunk (3.5× the visible work pushed to GPU vertex/fragment stages, even though CPU dispatch was cheap).

Why modern's CPU was higher at radius=5

Per-frame work in modern (in microseconds-ish budget on this scene):

  • Walk all loaded slots checking visibility (~120 slots) → AABB test each
  • Build DEIC array (51 entries × 20 bytes = 1020 bytes)
  • glBufferSubData(DRAW_INDIRECT_BUFFER, ...) — driver memcpy
  • 2× glProgramUniform2(..., handle.low, handle.high) for atlas handles
  • glBindVertexArray + glMemoryBarrier(GL_COMMAND_BARRIER_BIT) + glMultiDrawElementsIndirect

Legacy's per-frame work:

  • Bind 2 textures
  • Bind one VAO (the chunk)
  • One glDrawElements

The DEIC array build + buffer upload alone is ~3-5µs at radius=5 on this hardware, which is the bulk of the modern overhead. At higher radius, this overhead amortizes: the buffer is similar size, but the alternative (legacy's N draws) grows.

Follow-up work

  • A.5 (next phase) will exercise the higher-radius case where modern wins. Capture a fresh baseline at radius=8 / 10 once A.5 lands.
  • N.6 perf polish can investigate persistent-mapped buffers for the indirect buffer, which would eliminate the per-frame glBufferSubData. Likely small win at radius=5 (single ~1KB upload), bigger at higher radii.
  • GPU-side culling (compute shader generating the DEIC array directly into the indirect buffer) eliminates the CPU slot walk + DEIC build entirely. N.6 or later territory; only worth it if profiling shows the CPU walk is hot.

Lessons captured to memory

memory/project_phase_n5b_state.md records the high-value gotchas surfaced during N.5b implementation. Three particularly bitable ones:

  1. uniform sampler2DArray + glProgramUniformHandleARB is unreliable. Some drivers (NVIDIA Windows in this case) reject the combination with GL_INVALID_OPERATION. Use the uniform uvec2 + sampler2DArray(handle) constructor pattern instead — N.5's mesh_modern uses this, and N.5b's terrain_modern adopted it after the black-terrain regression.

  2. MaybeFlushTerrainDiag underflow. A naive median calc (copy[N - nz/2]) underflows to copy[N] when only one sample has been recorded. Use copy[N - 1 - (nz - 1) / 2] instead.

  3. Visual gate must actually be visually confirmed. "Go" doesn't mean "verified." During N.5b's gate the user said "go" without launching, which masked the black-terrain regression for hours. The gate must include the user reporting actual visual confirmation, not assent to proceed.