acdream/docs/plans/2026-05-09-phase-n5b-perf-baseline.md
Erik 083c10c514 docs(N.5b T10): roadmap + ISSUES + CLAUDE.md + perf baseline updates
Document Phase N.5b shipping (terrain on the modern rendering path via
Path C — `TerrainModernRenderer` mirrors WB's `TerrainRenderManager`
pattern but consumes acdream's `LandblockMesh.Build` so retail's
`FSplitNESW` formula stays in lockstep with physics + visual mesh).

Changes:

- `docs/plans/2026-04-11-roadmap.md` — add N.5b row to the Shipped
  table; promote N.5b's "Phases ahead" entry to ✓ SHIPPED with the
  Path C resolution + perf reality check; refresh N.6 scope to note
  Terrain has joined the modern path (legacy `Texture2D` retirement
  scope narrows to Sky + Debug); update top-of-doc Status line.

- `docs/ISSUES.md` — close issue #51 (WB terrain-split formula
  divergence). Move from OPEN to "Recently closed" with the Path C
  resolution: never adopted WB's formula; modern dispatcher uses
  retail's via `LandblockMesh.Build`. References `da56063` (the
  black-terrain fix that landed within the N.5b ship chain).

- `CLAUDE.md` — add `TerrainModernRenderer.cs` to the WB integration
  cribs list with the GL_INVALID_OPERATION caveat (use uvec2 +
  `sampler2DArray(handle)` constructor, NOT direct
  `uniform sampler2DArray` + `glProgramUniformHandleARB`). Update
  the "Currently in flight" preamble: N.6 builds on N.5 + N.5b;
  add an N.5b shipped paragraph linking the perf baseline doc.

- `docs/plans/2026-05-09-phase-n5b-perf-baseline.md` — new doc
  capturing the radius=5 Holtburg perf measurement (modern 6.4-7.0
  µs median vs legacy 1.5 µs — modern is ~4× SLOWER on CPU at
  radius=5). Documents the spec acceptance criterion #5 amendment,
  the architectural wins that DO hold (zero glBindTexture/frame,
  constant-cost dispatch as A.5 raises radius, per-LB frustum cull),
  and the three high-value gotchas surfaced during implementation.

User-memory updates (outside repo, not in this commit):
- `memory/project_phase_n5b_state.md` — full N.5b state file with
  the three gotchas captured.
- `memory/MEMORY.md` — index entry pointing at the state file.

Build: dotnet build green. No code changes in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:03:14 +02:00

98 lines
4.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase N.5b — terrain perf baseline
**Captured:** 2026-05-09 at Holtburg town dueling field, radius=5, ~30s standstill.
## Methodology
Same build (commit at perf measurement: `da56063`), `ACDREAM_WB_DIAG=1`. The build
included a TEMPORARY `ACDREAM_LEGACY_TERRAIN=1` env-var toggle (since retired in T9
deletion of the legacy renderer) that routed Draw through the legacy renderer for
direct comparison. Both renderers were constructed and fed AddLandblock / RemoveLandblock
in parallel; only one drew per frame; the same Stopwatch wrapped whichever ran.
## Numbers
| Renderer | cpu_us median | cpu_us p95 | draws/frame | Visible LBs |
|---|---|---|---|---|
| **Legacy** (`TerrainChunkRenderer`) | 1.5 | 3.0 | 1 (1 chunk) | 132-143 (whole chunk) |
| **Modern** (`TerrainModernRenderer`) | 6.4-7.0 | 9-14 | ~36-51 | 36-51 (per-LB cull) |
(Legacy `draws=1` because its 16×16-LB chunking collapses radius=5's 121 visible
landblocks into a single chunk, dispatched as one `glDrawElements`. Modern issues
one `glMultiDrawElementsIndirect` with N=36-51 sub-commands.)
## Acceptance criterion
The N.5b spec acceptance criterion 5 read: "CPU dispatcher time at radius=5 ≥10%
lower than today's per-LB-binds path." The captured numbers show modern is ~4×
HIGHER on CPU at radius=5. **The criterion was wrong** — at radius=5 in Holtburg,
legacy's chunked path was already collapsed to one draw call. The architectural
wins of multi-draw indirect manifest at higher chunk counts (A.5 territory).
The spec is amended via this doc: ship N.5b on visual identity + structural
correctness rather than CPU savings at radius=5.
## Architectural wins of the modern path (real, even when CPU is higher)
1. **Zero `glBindTexture` per frame.** Bindless atlas handles are made resident
once at startup; the modern shader samples via `sampler2DArray(uvec2 handle)`.
Legacy issued 2 `glBindTexture(Texture2DArray)` calls per frame.
2. **Constant-cost dispatch.** As A.5 raises the streaming radius (next phase),
the visible chunk count grows. Legacy scales linearly: at radius=10 (4× chunks)
it's 4 `glDrawElements` calls; at radius=15 (≥9 chunks) it's 9+ calls. Modern
stays at exactly 1 `glMultiDrawElementsIndirect` regardless.
3. **Per-LB frustum culling.** Legacy culled at chunk granularity (16×16 LBs);
modern culls per-LB. At a typical Holtburg view, ~36-51 of 132 loaded LBs are
actually visible; legacy drew the entire 132-LB chunk (3.5× the visible work
pushed to GPU vertex/fragment stages, even though CPU dispatch was cheap).
## Why modern's CPU was higher at radius=5
Per-frame work in modern (in microseconds-ish budget on this scene):
- Walk all loaded slots checking visibility (~120 slots) → AABB test each
- Build DEIC array (51 entries × 20 bytes = 1020 bytes)
- `glBufferSubData(DRAW_INDIRECT_BUFFER, ...)` — driver memcpy
- 2× `glProgramUniform2(..., handle.low, handle.high)` for atlas handles
- `glBindVertexArray` + `glMemoryBarrier(GL_COMMAND_BARRIER_BIT)` + `glMultiDrawElementsIndirect`
Legacy's per-frame work:
- Bind 2 textures
- Bind one VAO (the chunk)
- One `glDrawElements`
The DEIC array build + buffer upload alone is ~3-5µs at radius=5 on this hardware,
which is the bulk of the modern overhead. At higher radius, this overhead amortizes:
the buffer is similar size, but the alternative (legacy's N draws) grows.
## Follow-up work
- **A.5 (next phase)** will exercise the higher-radius case where modern wins.
Capture a fresh baseline at radius=8 / 10 once A.5 lands.
- **N.6 perf polish** can investigate persistent-mapped buffers for the indirect
buffer, which would eliminate the per-frame `glBufferSubData`. Likely small win
at radius=5 (single ~1KB upload), bigger at higher radii.
- **GPU-side culling** (compute shader generating the DEIC array directly into
the indirect buffer) eliminates the CPU slot walk + DEIC build entirely. N.6 or
later territory; only worth it if profiling shows the CPU walk is hot.
## Lessons captured to memory
`memory/project_phase_n5b_state.md` records the high-value gotchas surfaced
during N.5b implementation. Three particularly bitable ones:
1. **`uniform sampler2DArray` + `glProgramUniformHandleARB` is unreliable.** Some
drivers (NVIDIA Windows in this case) reject the combination with
`GL_INVALID_OPERATION`. Use the `uniform uvec2` + `sampler2DArray(handle)`
constructor pattern instead — N.5's mesh_modern uses this, and N.5b's
terrain_modern adopted it after the black-terrain regression.
2. **`MaybeFlushTerrainDiag` underflow.** A naive median calc (`copy[N - nz/2]`)
underflows to `copy[N]` when only one sample has been recorded. Use
`copy[N - 1 - (nz - 1) / 2]` instead.
3. **Visual gate must actually be visually confirmed.** "Go" doesn't mean
"verified." During N.5b's gate the user said "go" without launching, which
masked the black-terrain regression for hours. The gate must include the
user reporting actual visual confirmation, not assent to proceed.