# Phase N.5b — terrain perf baseline **Captured:** 2026-05-09 at Holtburg town dueling field, radius=5, ~30s standstill. ## Methodology Same build (commit at perf measurement: `da56063`), `ACDREAM_WB_DIAG=1`. The build included a TEMPORARY `ACDREAM_LEGACY_TERRAIN=1` env-var toggle (since retired in T9 deletion of the legacy renderer) that routed Draw through the legacy renderer for direct comparison. Both renderers were constructed and fed AddLandblock / RemoveLandblock in parallel; only one drew per frame; the same Stopwatch wrapped whichever ran. ## Numbers | Renderer | cpu_us median | cpu_us p95 | draws/frame | Visible LBs | |---|---|---|---|---| | **Legacy** (`TerrainChunkRenderer`) | 1.5 | 3.0 | 1 (1 chunk) | 132-143 (whole chunk) | | **Modern** (`TerrainModernRenderer`) | 6.4-7.0 | 9-14 | ~36-51 | 36-51 (per-LB cull) | (Legacy `draws=1` because its 16×16-LB chunking collapses radius=5's 121 visible landblocks into a single chunk, dispatched as one `glDrawElements`. Modern issues one `glMultiDrawElementsIndirect` with N=36-51 sub-commands.) ## Acceptance criterion The N.5b spec acceptance criterion 5 read: "CPU dispatcher time at radius=5 ≥10% lower than today's per-LB-binds path." The captured numbers show modern is ~4× HIGHER on CPU at radius=5. **The criterion was wrong** — at radius=5 in Holtburg, legacy's chunked path was already collapsed to one draw call. The architectural wins of multi-draw indirect manifest at higher chunk counts (A.5 territory). The spec is amended via this doc: ship N.5b on visual identity + structural correctness rather than CPU savings at radius=5. ## Architectural wins of the modern path (real, even when CPU is higher) 1. **Zero `glBindTexture` per frame.** Bindless atlas handles are made resident once at startup; the modern shader samples via `sampler2DArray(uvec2 handle)`. Legacy issued 2 `glBindTexture(Texture2DArray)` calls per frame. 2. **Constant-cost dispatch.** As A.5 raises the streaming radius (next phase), the visible chunk count grows. Legacy scales linearly: at radius=10 (4× chunks) it's 4 `glDrawElements` calls; at radius=15 (≥9 chunks) it's 9+ calls. Modern stays at exactly 1 `glMultiDrawElementsIndirect` regardless. 3. **Per-LB frustum culling.** Legacy culled at chunk granularity (16×16 LBs); modern culls per-LB. At a typical Holtburg view, ~36-51 of 132 loaded LBs are actually visible; legacy drew the entire 132-LB chunk (3.5× the visible work pushed to GPU vertex/fragment stages, even though CPU dispatch was cheap). ## Why modern's CPU was higher at radius=5 Per-frame work in modern (in microseconds-ish budget on this scene): - Walk all loaded slots checking visibility (~120 slots) → AABB test each - Build DEIC array (51 entries × 20 bytes = 1020 bytes) - `glBufferSubData(DRAW_INDIRECT_BUFFER, ...)` — driver memcpy - 2× `glProgramUniform2(..., handle.low, handle.high)` for atlas handles - `glBindVertexArray` + `glMemoryBarrier(GL_COMMAND_BARRIER_BIT)` + `glMultiDrawElementsIndirect` Legacy's per-frame work: - Bind 2 textures - Bind one VAO (the chunk) - One `glDrawElements` The DEIC array build + buffer upload alone is ~3-5µs at radius=5 on this hardware, which is the bulk of the modern overhead. At higher radius, this overhead amortizes: the buffer is similar size, but the alternative (legacy's N draws) grows. ## Follow-up work - **A.5 (next phase)** will exercise the higher-radius case where modern wins. Capture a fresh baseline at radius=8 / 10 once A.5 lands. - **N.6 perf polish** can investigate persistent-mapped buffers for the indirect buffer, which would eliminate the per-frame `glBufferSubData`. Likely small win at radius=5 (single ~1KB upload), bigger at higher radii. - **GPU-side culling** (compute shader generating the DEIC array directly into the indirect buffer) eliminates the CPU slot walk + DEIC build entirely. N.6 or later territory; only worth it if profiling shows the CPU walk is hot. ## Lessons captured to memory `memory/project_phase_n5b_state.md` records the high-value gotchas surfaced during N.5b implementation. Three particularly bitable ones: 1. **`uniform sampler2DArray` + `glProgramUniformHandleARB` is unreliable.** Some drivers (NVIDIA Windows in this case) reject the combination with `GL_INVALID_OPERATION`. Use the `uniform uvec2` + `sampler2DArray(handle)` constructor pattern instead — N.5's mesh_modern uses this, and N.5b's terrain_modern adopted it after the black-terrain regression. 2. **`MaybeFlushTerrainDiag` underflow.** A naive median calc (`copy[N - nz/2]`) underflows to `copy[N]` when only one sample has been recorded. Use `copy[N - 1 - (nz - 1) / 2]` instead. 3. **Visual gate must actually be visually confirmed.** "Go" doesn't mean "verified." During N.5b's gate the user said "go" without launching, which masked the black-terrain regression for hours. The gate must include the user reporting actual visual confirmation, not assent to proceed.