From 76ca3ffca88bebce365acc9b59e897123f6537d8 Mon Sep 17 00:00:00 2001
From: Erik
Date: Mon, 11 May 2026 12:43:35 +0200
Subject: [PATCH] docs(perf #N6.1): apply code-quality review fixes to baseline doc
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Code-quality review on commit 13abf96 flagged 3 Important issues in the
baseline document plus 2 minor roadmap consistency gaps. Applied all of
them:

1. The "CPU scales superlinearly with N₁" claim was imprecise because
   CPU growth (4.0×) is actually sublinear vs near-LB count (7.7×).
   Clarified: CPU grows more than linearly with radius N₁ but
   sublinearly with visible-LB count; frustum cull discards most far
   LBs early. The outer per-LB walk still scales with N₁, which is
   what Tier 2's persistent groups address.

2. The "40-50% memory footprint reduction from atlas packing" estimate
   was asserted without derivation and likely too optimistic given all
   surfaces are already power-of-two and same-format (RGBA8). Replaced
   with a more honest bound: "low-MB to ~10 MB absolute saving" with
   explicit per-array metadata overhead reasoning. Conclusion is
   unchanged — atlas adoption still isn't justified given GPU
   under-utilization.

3. The "spec §6 threshold for atlas is >30%" citation pointed at text
   that doesn't exist in the spec. Replaced with "A conventional
   rule-of-thumb" so a future reader doesn't chase a phantom citation.

Plus roadmap consistency:

M1: The N.6 slice 1 bullet now uses the canonical "✓ SHIPPED — Title.
    Shipped YYYY-MM-DD." prefix that every other shipped phase uses.

M2: Added N.6.1 row to the shipped table at the top of the roadmap
    (lines ~55-66) so the at-a-glance shipped list is complete.

None of these change the conclusion or the next-phase recommendation
(C.1.5 first, then reduced N.6 slice 2). The fixes improve doc accuracy
and future-readability.
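The sublinearity arithmetic behind fix 1 can be checked directly from the
numbers the baseline doc quotes (an illustrative Python snippet, not part
of the patched docs; the cpu_us medians are the doc's own):

```python
# Sanity check of fix 1: CPU growth vs visible-LB growth as the
# near-tier radius N1 goes 4 -> 8 -> 12. Numbers from the baseline doc.
cpu_ms = {4: 3.2, 8: 6.7, 12: 12.9}  # median cpu_us per radius, in ms

for n in (4, 8, 12):
    visible_lbs = (2 * n + 1) ** 2            # near-tier LB count: 81, 289, 625
    cpu_ratio = cpu_ms[n] / cpu_ms[4]         # CPU growth vs the r4 baseline
    lb_ratio = visible_lbs / (2 * 4 + 1) ** 2 # LB-count growth vs r4
    print(n, visible_lbs, round(cpu_ratio, 1), round(lb_ratio, 1))
# At n=12 this gives a 4.0x CPU ratio against a 7.7x LB-count ratio:
# more than linear in N1 (4.0 > 12/4), sublinear in LB count (4.0 < 7.7).
```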
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/plans/2026-04-11-roadmap.md                 |  3 +-
 .../2026-05-11-phase-n6-perf-baseline.md         | 43 +++++++++++--------
 2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/docs/plans/2026-04-11-roadmap.md b/docs/plans/2026-04-11-roadmap.md
index 7f535f3..cab23c6 100644
--- a/docs/plans/2026-04-11-roadmap.md
+++ b/docs/plans/2026-04-11-roadmap.md
@@ -63,6 +63,7 @@
 | N.4 | Rendering pipeline foundation — adopted WB's `ObjectMeshManager` as the production mesh pipeline behind `ACDREAM_USE_WB_FOUNDATION` (default-on). `WbMeshAdapter` is the single seam (owns `ObjectMeshManager`, drains the staged-upload queue per frame, populates `AcSurfaceMetadataTable` with per-batch translucency / luminosity / fog metadata). `WbDrawDispatcher` is the production draw path: groups all visible (entity, batch) pairs, single-uploads the matrix buffer, fires one `glDrawElementsInstancedBaseVertexBaseInstance` per group with `BaseInstance` slicing into the shared instance VBO. `LandblockSpawnAdapter` + `EntitySpawnAdapter` bridge spawn lifecycle to WB ref-counts (atlas tier vs per-instance). Perf wins shipped as part of N.4: per-entity frustum cull, opaque front-to-back sort, palette-hash memoization (compute once per entity, reuse across batches). Visual verification at Holtburg passed: scenery + connected characters with full close-detail geometry (Issue #47 regression resolved). Legacy `InstancedMeshRenderer` retained as `ACDREAM_USE_WB_FOUNDATION=0` escape hatch until N.6 (retired early in N.5 ship amendment). | Live ✓ |
 | N.5 | Modern rendering path — lifted `WbDrawDispatcher` onto bindless textures (`GL_ARB_bindless_texture`) + `glMultiDrawElementsIndirect`. Per-frame entity rendering: 3 SSBO uploads (instance matrices @ binding=0, batch data @ binding=1, indirect commands) + 2 indirect draw calls (opaque + transparent). ~12-15 GL calls per frame regardless of group count, down from hundreds-of-per-group in N.4. CPU dispatcher: 1.23 ms/frame median at Holtburg courtyard (1662 groups, ~810 fps sustained). All textures on the WB modern path use 1-layer `Texture2DArray` + `sampler2DArray`. Legacy callers keep `Texture2D` / `sampler2D` via the parallel `TextureCache` path until N.6 retires them. Three gotchas captured in memory: texture target lock-in, bindless Dispose order (two-phase non-resident before delete), GL_TIME_ELAPSED double-buffering. **Ship amendment 2026-05-08:** legacy renderers (`InstancedMeshRenderer`, `StaticMeshRenderer`, `WbFoundationFlag`) retired within N.5 — modern path is mandatory; missing bindless throws `NotSupportedException` at startup. N.6 scope narrowed accordingly. Plan archived at `docs/superpowers/plans/2026-05-08-phase-n5-modern-rendering.md`. | Live ✓ |
 | N.5b | Terrain on the modern rendering path — `TerrainModernRenderer` replaces `TerrainChunkRenderer` (the latter plus `TerrainRenderer` + `terrain.vert/.frag` deleted). Single global VBO/EBO with slot allocator (one slot per landblock), per-frame `DrawElementsIndirectCommand[]` upload + `glMultiDrawElementsIndirect`, bindless atlas handles passed as `uvec2` uniforms reconstructed via `sampler2DArray(handle)`. **Path C** chosen: mirrors WB's `TerrainRenderManager` pattern but consumes `LandblockMesh.Build` so retail's `FSplitNESW` formula is preserved (closes ISSUE #51). Path A killed by 49.98% measured divergence between WB's `CalculateSplitDirection` and retail's at addr `00531d10`; Path B (fork-patch WB) rejected for permanent maintenance burden. Perf at Holtburg radius=5 (commit `da56063`): modern 6.4-7.0 µs / 9-14 µs p95 vs legacy 1.5 µs / 3.0 µs — **modern is ~4× SLOWER on CPU at radius=5** because legacy's 16×16-LB chunking collapsed visible LBs to one `glDrawElements`. Architectural wins (zero `glBindTexture`/frame, constant-cost dispatch, per-LB frustum cull) manifest at higher radius (A.5 territory). Spec acceptance criterion 5 ("≥10% lower CPU at radius=5") amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Three gotchas captured in memory: `uniform sampler2DArray` + `glProgramUniformHandleARB` GL_INVALID_OPERATIONs on at least one driver (use `uniform uvec2` + `sampler2DArray(handle)` constructor instead — N.5's mesh_modern pattern); `MaybeFlushTerrainDiag` median-calc underflow on first sample; visual gates need actual visual confirmation, not assent. Plan archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`. | Live ✓ |
+| N.6.1 | Phase N.6 slice 1 — GPU timing fix + radius=12 perf baseline. Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated `ACDREAM_DUMP_SURFACES=1` one-shot surface-format histogram dump in `TextureCache` for the atlas-opportunity audit. Captured authoritative baseline at Holtburg radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` diagnostic; baseline doc concludes CPU dominates GPU by 30–50× at every radius and recommends C.1.5 next then reduced-scope slice 2 (atlas + persistent-mapped buffers dropped). Baseline numbers at [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). Plan archived at `docs/superpowers/plans/2026-05-11-phase-n6-slice1.md`. | Live ✓ |
 
 Plus polish that doesn't get its own phase number:
 - FlyCamera default speed lowered + Shift-to-boost
@@ -687,7 +688,7 @@ for our deletions/additions; merge upstream `master` periodically.
 manifest at higher radius. Spec acceptance criterion #5 was wrong; amended via
 `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Plan archived at
 `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`.
-- **N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** **SHIPPED 2026-05-11.**
+- **✓ SHIPPED — N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** Shipped 2026-05-11.
   Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots,
   read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL).
   Added env-gated surface-format histogram dump in `TextureCache`
diff --git a/docs/plans/2026-05-11-phase-n6-perf-baseline.md b/docs/plans/2026-05-11-phase-n6-perf-baseline.md
index 75f6a8e..ba5f0a1 100644
--- a/docs/plans/2026-05-11-phase-n6-perf-baseline.md
+++ b/docs/plans/2026-05-11-phase-n6-perf-baseline.md
@@ -87,9 +87,9 @@ Same as the dimension buckets above since there is only one format. The top-3 triples
 (128×128, 64×64, 256×256) cover 449 of 760 surfaces = **59%**.
 
 **Atlas-opportunity score: 59%** of surfaces fall into the top-3 (W, H, format) triples.
-The spec §6 threshold for "atlas work is justified for memory savings" is >30%; this
-measurement is well above it. However, see §4 for why atlas is not the right next step
-despite the high score.
+A conventional rule-of-thumb is that >30% concentration into the top buckets makes atlas
+packing worth the implementation cost for memory savings; this measurement is well above
+that. However, see §4 for why atlas is not the right next step despite the high score.
 
 ## §4. Conclusion + next-phase recommendation
 
@@ -105,13 +105,17 @@ walking — against a 16,600 µs frame budget at 60 FPS. The GPU is working at r
 particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU
 comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged.
 
-**CPU scales superlinearly with N₁ (near-tier radius).** As N₁ grows from 4 → 8 → 12,
-median cpu_us grows from 3.2 ms → 6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the
-r4 baseline. The Tier 1 entity-classification cache (`EntityClassificationCache`, shipped
-as #53) wins on the inner loop (per-entity classification avoided on cache hits) but the
-outer per-LB walk still scales with N₁. This is exactly what the Tier 2 plan (persistent
-groups) at `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating
-the per-frame LB scan entirely.
+**CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with
+visible-LB count.** As N₁ grows from 4 → 8 → 12, median cpu_us grows from 3.2 ms →
+6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the r4 baseline. The visible-LB count
+scales as `(2N+1)²`: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0×
+vs 7.7× expected if every LB cost the same). Frustum culling discards most far LBs
+early, but the outer per-LB walk still has to touch each one. The Tier 1 entity-
+classification cache (`EntityClassificationCache`, shipped as #53) wins on the inner
+loop (per-entity classification avoided on cache hits) but the outer walk dominates
+as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at
+`docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating the
+per-frame LB scan entirely.
 
 **Radius=12 is not the production scenario.** `ACDREAM_STREAM_RADIUS=12` forces N₁=12
 (625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81
@@ -119,13 +123,18 @@ full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes as
 comfortable 200–400 FPS at the default preset. The numbers above characterize the
 scaling curve for headroom analysis, not the experience a typical player sees.
 
-**Atlas opportunity is high (59%) but the win is memory-only.** With 96 MB of textures
-and 59% in the top-3 dimension buckets, atlas consolidation would reduce sampler-switch
-count (currently near-zero already, since bindless textures are made resident once) and
-shrink the texture memory footprint by roughly 40–50% through packing. But GPU is not
-bottlenecked on sampler switches or memory bandwidth — the 0.6 ms gpu_us p95 at radius=12
-walking demonstrates this directly. Atlas adoption would cost 1–2 weeks of implementation
-risk for a memory saving the process doesn't currently need at 96 MB.
+**Atlas opportunity is high (59%) but the win is memory-only — and modest.** With 96 MB
+of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the
+top buckets share single `Texture2DArray` objects rather than each surface owning its
+own 1-layer array. The primary wins of atlas — fewer sampler switches, fewer texture
+binds — are already near-zero because bindless textures are made resident once at upload
+and never bound per draw. The remaining win is the per-array metadata overhead × N
+surfaces, which is bounded but not dramatic given all surfaces are already power-of-two
+and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on
+the order of low-MB to ~10 MB, not a 40–50% halving. GPU is not bottlenecked on sampler
+switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this
+directly), so atlas adoption would cost 1–2 weeks of implementation risk for a memory
+saving the process doesn't currently need at 96 MB.
 
 ### Recommendation