docs(perf #N6.1): apply code-quality review fixes to baseline doc

Code-quality review on commit 13abf96 flagged 3 Important issues in
the baseline document plus 2 minor roadmap consistency gaps. Applied
all of them:

1. The "CPU scales superlinearly with N₁" claim was imprecise because
   CPU growth (4.0×) is actually sublinear vs near-LB count (7.7×).
   Clarified: CPU grows more than linearly with radius N₁ but
   sublinearly with visible-LB count; frustum cull discards most far
   LBs early. The outer per-LB walk still scales with N₁, which is
   what Tier 2's persistent groups address.

2. The "40-50% memory footprint reduction from atlas packing" estimate
   was asserted without derivation and likely too optimistic given all
   surfaces are already power-of-two and same-format (RGBA8). Replaced
   with a more honest bound: "low-MB to ~10 MB absolute saving" with
   explicit per-array metadata overhead reasoning. Conclusion is
   unchanged — atlas adoption still isn't justified given GPU
   under-utilization.

3. The "spec §6 threshold for atlas is >30%" citation pointed at text
   that doesn't exist in the spec. Replaced with "A conventional
   rule-of-thumb" so a future reader doesn't chase a phantom citation.

Plus roadmap consistency:

M1: The N.6 slice 1 bullet now uses the canonical "✓ SHIPPED — Title.
   Shipped YYYY-MM-DD." prefix that every other shipped phase uses.
M2: Added N.6.1 row to the shipped table at the top of the roadmap
   (lines ~55-66) so the at-a-glance shipped list is complete.

None of these change the conclusion or the next-phase recommendation
(C.1.5 first, then reduced N.6 slice 2). The fixes improve doc accuracy
and future-readability.
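
The arithmetic behind fix 1 can be reproduced with a short sketch. The radii,
median cpu_us values and the `(2N+1)²` landblock formula are taken from the
baseline doc; the helper names (`near_lb_count`, `growth_table`) are
hypothetical and exist only for this illustration, not in the codebase.

```python
# Figures quoted in the baseline doc; names below are illustrative assumptions.
RADII = (4, 8, 12)
MEDIAN_CPU_MS = {4: 3.2, 8: 6.7, 12: 12.9}


def near_lb_count(radius: int) -> int:
    """Landblocks in a (2N+1) x (2N+1) square of near-tier radius N."""
    side = 2 * radius + 1
    return side * side


def growth_table(base: int = 4):
    """Return (radius, radius growth, LB-count growth, CPU growth) vs the base radius."""
    rows = []
    for r in RADII:
        rows.append((
            r,
            r / base,
            near_lb_count(r) / near_lb_count(base),
            MEDIAN_CPU_MS[r] / MEDIAN_CPU_MS[base],
        ))
    return rows


if __name__ == "__main__":
    for radius, radius_x, lb_x, cpu_x in growth_table():
        print(f"N1={radius:2d}  radius x{radius_x:.1f}  LBs x{lb_x:.1f}  cpu x{cpu_x:.1f}")
```

At N₁=12 the CPU growth factor lands between the radius factor and the LB-count
factor, which is the "more than linear in radius, sublinear in LB count"
wording the doc now uses.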

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Erik 2026-05-11 12:43:35 +02:00
parent 13abf96a5e
commit 76ca3ffca8
2 changed files with 28 additions and 18 deletions

@@ -63,6 +63,7 @@
 | N.4 | Rendering pipeline foundation — adopted WB's `ObjectMeshManager` as the production mesh pipeline behind `ACDREAM_USE_WB_FOUNDATION` (default-on). `WbMeshAdapter` is the single seam (owns `ObjectMeshManager`, drains the staged-upload queue per frame, populates `AcSurfaceMetadataTable` with per-batch translucency / luminosity / fog metadata). `WbDrawDispatcher` is the production draw path: groups all visible (entity, batch) pairs, single-uploads the matrix buffer, fires one `glDrawElementsInstancedBaseVertexBaseInstance` per group with `BaseInstance` slicing into the shared instance VBO. `LandblockSpawnAdapter` + `EntitySpawnAdapter` bridge spawn lifecycle to WB ref-counts (atlas tier vs per-instance). Perf wins shipped as part of N.4: per-entity frustum cull, opaque front-to-back sort, palette-hash memoization (compute once per entity, reuse across batches). Visual verification at Holtburg passed: scenery + connected characters with full close-detail geometry (Issue #47 regression resolved). Legacy `InstancedMeshRenderer` retained as `ACDREAM_USE_WB_FOUNDATION=0` escape hatch until N.6 (retired early in N.5 ship amendment). | Live ✓ |
 | N.5 | Modern rendering path — lifted `WbDrawDispatcher` onto bindless textures (`GL_ARB_bindless_texture`) + `glMultiDrawElementsIndirect`. Per-frame entity rendering: 3 SSBO uploads (instance matrices @ binding=0, batch data @ binding=1, indirect commands) + 2 indirect draw calls (opaque + transparent). ~12-15 GL calls per frame regardless of group count, down from hundreds-of-per-group in N.4. CPU dispatcher: 1.23 ms/frame median at Holtburg courtyard (1662 groups, ~810 fps sustained). All textures on the WB modern path use 1-layer `Texture2DArray` + `sampler2DArray`. Legacy callers keep `Texture2D` / `sampler2D` via the parallel `TextureCache` path until N.6 retires them. Three gotchas captured in memory: texture target lock-in, bindless Dispose order (two-phase non-resident before delete), GL_TIME_ELAPSED double-buffering. **Ship amendment 2026-05-08:** legacy renderers (`InstancedMeshRenderer`, `StaticMeshRenderer`, `WbFoundationFlag`) retired within N.5 — modern path is mandatory; missing bindless throws `NotSupportedException` at startup. N.6 scope narrowed accordingly. Plan archived at `docs/superpowers/plans/2026-05-08-phase-n5-modern-rendering.md`. | Live ✓ |
 | N.5b | Terrain on the modern rendering path — `TerrainModernRenderer` replaces `TerrainChunkRenderer` (the latter plus `TerrainRenderer` + `terrain.vert/.frag` deleted). Single global VBO/EBO with slot allocator (one slot per landblock), per-frame `DrawElementsIndirectCommand[]` upload + `glMultiDrawElementsIndirect`, bindless atlas handles passed as `uvec2` uniforms reconstructed via `sampler2DArray(handle)`. **Path C** chosen: mirrors WB's `TerrainRenderManager` pattern but consumes `LandblockMesh.Build` so retail's `FSplitNESW` formula is preserved (closes ISSUE #51). Path A killed by 49.98% measured divergence between WB's `CalculateSplitDirection` and retail's at addr `00531d10`; Path B (fork-patch WB) rejected for permanent maintenance burden. Perf at Holtburg radius=5 (commit `da56063`): modern 6.4-7.0 µs / 9-14 µs p95 vs legacy 1.5 µs / 3.0 µs — **modern is ~4× SLOWER on CPU at radius=5** because legacy's 16×16-LB chunking collapsed visible LBs to one `glDrawElements`. Architectural wins (zero `glBindTexture`/frame, constant-cost dispatch, per-LB frustum cull) manifest at higher radius (A.5 territory). Spec acceptance criterion 5 ("≥10% lower CPU at radius=5") amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Three gotchas captured in memory: `uniform sampler2DArray` + `glProgramUniformHandleARB` GL_INVALID_OPERATIONs on at least one driver (use `uniform uvec2` + `sampler2DArray(handle)` constructor instead — N.5's mesh_modern pattern); `MaybeFlushTerrainDiag` median-calc underflow on first sample; visual gates need actual visual confirmation, not assent. Plan archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`. | Live ✓ |
+| N.6.1 | Phase N.6 slice 1 — GPU timing fix + radius=12 perf baseline. Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated `ACDREAM_DUMP_SURFACES=1` one-shot surface-format histogram dump in `TextureCache` for the atlas-opportunity audit. Captured authoritative baseline at Holtburg radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` diagnostic; baseline doc concludes CPU dominates GPU by 30-50× at every radius and recommends C.1.5 next then reduced-scope slice 2 (atlas + persistent-mapped buffers dropped). Baseline numbers at [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). Plan archived at `docs/superpowers/plans/2026-05-11-phase-n6-slice1.md`. | Live ✓ |
 Plus polish that doesn't get its own phase number:
 - FlyCamera default speed lowered + Shift-to-boost
@@ -687,7 +688,7 @@ for our deletions/additions; merge upstream `master` periodically.
 manifest at higher radius. Spec acceptance criterion #5 was wrong;
 amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Plan
 archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`.
-- **N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** **SHIPPED 2026-05-11.**
+- **✓ SHIPPED — N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** Shipped 2026-05-11.
 Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3
 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel
 desktop GL). Added env-gated surface-format histogram dump in `TextureCache`

@@ -87,9 +87,9 @@ Same as the dimension buckets above since there is only one format. The top-3 tr
 (128×128, 64×64, 256×256) cover 449 of 760 surfaces = **59%**.
 **Atlas-opportunity score: 59%** of surfaces fall into the top-3 (W, H, format) triples.
-The spec §6 threshold for "atlas work is justified for memory savings" is >30%; this
-measurement is well above it. However, see §4 for why atlas is not the right next step
-despite the high score.
+A conventional rule-of-thumb is that >30% concentration into the top buckets makes atlas
+packing worth the implementation cost for memory savings; this measurement is well above
+that. However, see §4 for why atlas is not the right next step despite the high score.
 ## §4. Conclusion + next-phase recommendation
@@ -105,13 +105,17 @@ walking — against a 16,600 µs frame budget at 60 FPS. The GPU is working at r
 particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU
 comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged.
-**CPU scales superlinearly with N₁ (near-tier radius).** As N₁ grows from 4 → 8 → 12,
-median cpu_us grows from 3.2 ms → 6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the
-r4 baseline. The Tier 1 entity-classification cache (`EntityClassificationCache`, shipped
-as #53) wins on the inner loop (per-entity classification avoided on cache hits) but the
-outer per-LB walk still scales with N₁. This is exactly what the Tier 2 plan (persistent
-groups) at `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating
-the per-frame LB scan entirely.
+**CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with
+visible-LB count.** As N₁ grows from 4 → 8 → 12, median cpu_us grows from 3.2 ms →
+6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the r4 baseline. The visible-LB count
+scales as `(2N+1)²`: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0×
+vs 7.7× expected if every LB cost the same). Frustum culling discards most far LBs
+early, but the outer per-LB walk still has to touch each one. The Tier 1 entity-
+classification cache (`EntityClassificationCache`, shipped as #53) wins on the inner
+loop (per-entity classification avoided on cache hits) but the outer walk dominates
+as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at
+`docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating the
+per-frame LB scan entirely.
 **Radius=12 is not the production scenario.** `ACDREAM_STREAM_RADIUS=12` forces N₁=12
 (625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81
@@ -119,13 +123,18 @@ full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes
 comfortable 200-400 FPS at the default preset. The numbers above characterize the scaling
 curve for headroom analysis, not the experience a typical player sees.
-**Atlas opportunity is high (59%) but the win is memory-only.** With 96 MB of textures
-and 59% in the top-3 dimension buckets, atlas consolidation would reduce sampler-switch
-count (currently near-zero already, since bindless textures are made resident once) and
-shrink the texture memory footprint by roughly 40-50% through packing. But GPU is not
-bottlenecked on sampler switches or memory bandwidth — the 0.6 ms gpu_us p95 at radius=12
-walking demonstrates this directly. Atlas adoption would cost 1-2 weeks of implementation
-risk for a memory saving the process doesn't currently need at 96 MB.
+**Atlas opportunity is high (59%) but the win is memory-only — and modest.** With 96 MB
+of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the
+top buckets share single `Texture2DArray` objects rather than each surface owning its
+own 1-layer array. The primary wins of atlas — fewer sampler switches, fewer texture
+binds — are already near-zero because bindless textures are made resident once at upload
+and never bound per draw. The remaining win is the per-array metadata overhead × N
+surfaces, which is bounded but not dramatic given all surfaces are already power-of-two
+and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on
+the order of low-MB to ~10 MB, not a 40-50% halving. GPU is not bottlenecked on sampler
+switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this
+directly), so atlas adoption would cost 1-2 weeks of implementation risk for a memory
+saving the process doesn't currently need at 96 MB.
 ### Recommendation