From 76ca3ffca88bebce365acc9b59e897123f6537d8 Mon Sep 17 00:00:00 2001
From: Erik
Date: Mon, 11 May 2026 12:43:35 +0200
Subject: [PATCH] docs(perf #N6.1): apply code-quality review fixes to baseline doc
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Code-quality review on commit 13abf96 flagged 3 Important issues in the
baseline document plus 2 minor roadmap consistency gaps. Applied all of
them:

1. The "CPU scales superlinearly with N₁" claim was imprecise because
   CPU growth (4.0×) is actually sublinear vs near-LB count (7.7×).
   Clarified: CPU grows more than linearly with radius N₁ but
   sublinearly with visible-LB count; frustum cull discards most far
   LBs early. The outer per-LB walk still scales with N₁, which is
   what Tier 2's persistent groups address.

2. The "40-50% memory footprint reduction from atlas packing" estimate
   was asserted without derivation and likely too optimistic given all
   surfaces are already power-of-two and same-format (RGBA8). Replaced
   with a more honest bound: "low-MB to ~10 MB absolute saving" with
   explicit per-array metadata overhead reasoning. Conclusion is
   unchanged — atlas adoption still isn't justified given GPU
   under-utilization.

3. The "spec §6 threshold for atlas is >30%" citation pointed at text
   that doesn't exist in the spec. Replaced with "A conventional
   rule-of-thumb" so a future reader doesn't chase a phantom citation.

Plus roadmap consistency:

M1: The N.6 slice 1 bullet now uses the canonical "✓ SHIPPED — Title.
    Shipped YYYY-MM-DD." prefix that every other shipped phase uses.

M2: Added N.6.1 row to the shipped table at the top of the roadmap
    (lines ~55-66) so the at-a-glance shipped list is complete.

None of these change the conclusion or the next-phase recommendation
(C.1.5 first, then reduced N.6 slice 2). The fixes improve doc accuracy
and future-readability.
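The sublinearity arithmetic behind fix 1 can be checked directly from the
numbers the baseline doc quotes (an illustrative Python snippet, not part
of the patched docs; the cpu_us medians are the doc's own):

```python
# Sanity check of fix 1: CPU growth vs visible-LB growth as the
# near-tier radius N1 goes 4 -> 8 -> 12. Numbers from the baseline doc.
cpu_ms = {4: 3.2, 8: 6.7, 12: 12.9}  # median cpu_us per radius, in ms

for n in (4, 8, 12):
    visible_lbs = (2 * n + 1) ** 2            # near-tier LB count: 81, 289, 625
    cpu_ratio = cpu_ms[n] / cpu_ms[4]         # CPU growth vs the r4 baseline
    lb_ratio = visible_lbs / (2 * 4 + 1) ** 2 # LB-count growth vs r4
    print(n, visible_lbs, round(cpu_ratio, 1), round(lb_ratio, 1))
# At n=12 this gives a 4.0x CPU ratio against a 7.7x LB-count ratio:
# more than linear in N1 (4.0 > 12/4), sublinear in LB count (4.0 < 7.7).
```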
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/plans/2026-04-11-roadmap.md                 |  3 +-
 .../2026-05-11-phase-n6-perf-baseline.md         | 43 +++++++++++--------
 2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/docs/plans/2026-04-11-roadmap.md b/docs/plans/2026-04-11-roadmap.md
index 7f535f3..cab23c6 100644
--- a/docs/plans/2026-04-11-roadmap.md
+++ b/docs/plans/2026-04-11-roadmap.md
@@ -63,6 +63,7 @@
 | N.4 | Rendering pipeline foundation — adopted WB's `ObjectMeshManager` as the production mesh pipeline behind `ACDREAM_USE_WB_FOUNDATION` (default-on). `WbMeshAdapter` is the single seam (owns `ObjectMeshManager`, drains the staged-upload queue per frame, populates `AcSurfaceMetadataTable` with per-batch translucency / luminosity / fog metadata). `WbDrawDispatcher` is the production draw path: groups all visible (entity, batch) pairs, single-uploads the matrix buffer, fires one `glDrawElementsInstancedBaseVertexBaseInstance` per group with `BaseInstance` slicing into the shared instance VBO. `LandblockSpawnAdapter` + `EntitySpawnAdapter` bridge spawn lifecycle to WB ref-counts (atlas tier vs per-instance). Perf wins shipped as part of N.4: per-entity frustum cull, opaque front-to-back sort, palette-hash memoization (compute once per entity, reuse across batches). Visual verification at Holtburg passed: scenery + connected characters with full close-detail geometry (Issue #47 regression resolved). Legacy `InstancedMeshRenderer` retained as `ACDREAM_USE_WB_FOUNDATION=0` escape hatch until N.6 (retired early in N.5 ship amendment). | Live ✓ |
 | N.5 | Modern rendering path — lifted `WbDrawDispatcher` onto bindless textures (`GL_ARB_bindless_texture`) + `glMultiDrawElementsIndirect`. Per-frame entity rendering: 3 SSBO uploads (instance matrices @ binding=0, batch data @ binding=1, indirect commands) + 2 indirect draw calls (opaque + transparent). ~12-15 GL calls per frame regardless of group count, down from hundreds-of-per-group in N.4. CPU dispatcher: 1.23 ms/frame median at Holtburg courtyard (1662 groups, ~810 fps sustained). All textures on the WB modern path use 1-layer `Texture2DArray` + `sampler2DArray`. Legacy callers keep `Texture2D` / `sampler2D` via the parallel `TextureCache` path until N.6 retires them. Three gotchas captured in memory: texture target lock-in, bindless Dispose order (two-phase non-resident before delete), GL_TIME_ELAPSED double-buffering. **Ship amendment 2026-05-08:** legacy renderers (`InstancedMeshRenderer`, `StaticMeshRenderer`, `WbFoundationFlag`) retired within N.5 — modern path is mandatory; missing bindless throws `NotSupportedException` at startup. N.6 scope narrowed accordingly. Plan archived at `docs/superpowers/plans/2026-05-08-phase-n5-modern-rendering.md`. | Live ✓ |
 | N.5b | Terrain on the modern rendering path — `TerrainModernRenderer` replaces `TerrainChunkRenderer` (the latter plus `TerrainRenderer` + `terrain.vert/.frag` deleted). Single global VBO/EBO with slot allocator (one slot per landblock), per-frame `DrawElementsIndirectCommand[]` upload + `glMultiDrawElementsIndirect`, bindless atlas handles passed as `uvec2` uniforms reconstructed via `sampler2DArray(handle)`. **Path C** chosen: mirrors WB's `TerrainRenderManager` pattern but consumes `LandblockMesh.Build` so retail's `FSplitNESW` formula is preserved (closes ISSUE #51). Path A killed by 49.98% measured divergence between WB's `CalculateSplitDirection` and retail's at addr `00531d10`; Path B (fork-patch WB) rejected for permanent maintenance burden. Perf at Holtburg radius=5 (commit `da56063`): modern 6.4-7.0 µs / 9-14 µs p95 vs legacy 1.5 µs / 3.0 µs — **modern is ~4× SLOWER on CPU at radius=5** because legacy's 16×16-LB chunking collapsed visible LBs to one `glDrawElements`. Architectural wins (zero `glBindTexture`/frame, constant-cost dispatch, per-LB frustum cull) manifest at higher radius (A.5 territory). Spec acceptance criterion 5 ("≥10% lower CPU at radius=5") amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Three gotchas captured in memory: `uniform sampler2DArray` + `glProgramUniformHandleARB` GL_INVALID_OPERATIONs on at least one driver (use `uniform uvec2` + `sampler2DArray(handle)` constructor instead — N.5's mesh_modern pattern); `MaybeFlushTerrainDiag` median-calc underflow on first sample; visual gates need actual visual confirmation, not assent. Plan archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`. | Live ✓ |
+| N.6.1 | Phase N.6 slice 1 — GPU timing fix + radius=12 perf baseline. Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated `ACDREAM_DUMP_SURFACES=1` one-shot surface-format histogram dump in `TextureCache` for the atlas-opportunity audit. Captured authoritative baseline at Holtburg radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` diagnostic; baseline doc concludes CPU dominates GPU by 30–50× at every radius and recommends C.1.5 next then reduced-scope slice 2 (atlas + persistent-mapped buffers dropped). Baseline numbers at [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). Plan archived at `docs/superpowers/plans/2026-05-11-phase-n6-slice1.md`. | Live ✓ |
 
 Plus polish that doesn't get its own phase number:
 - FlyCamera default speed lowered + Shift-to-boost
@@ -687,7 +688,7 @@ for our deletions/additions; merge upstream `master` periodically.
 manifest at higher radius. Spec acceptance criterion #5 was wrong; amended via
 `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Plan archived at
 `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`.
-- **N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** **SHIPPED 2026-05-11.**
+- **✓ SHIPPED — N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** Shipped 2026-05-11.
   Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots,
   read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL).
   Added env-gated surface-format histogram dump in `TextureCache`
diff --git a/docs/plans/2026-05-11-phase-n6-perf-baseline.md b/docs/plans/2026-05-11-phase-n6-perf-baseline.md
index 75f6a8e..ba5f0a1 100644
--- a/docs/plans/2026-05-11-phase-n6-perf-baseline.md
+++ b/docs/plans/2026-05-11-phase-n6-perf-baseline.md
@@ -87,9 +87,9 @@ Same as the dimension buckets above since there is only one format. The top-3 triples
 (128×128, 64×64, 256×256) cover 449 of 760 surfaces = **59%**.
 
 **Atlas-opportunity score: 59%** of surfaces fall into the top-3 (W, H, format) triples.
-The spec §6 threshold for "atlas work is justified for memory savings" is >30%; this
-measurement is well above it. However, see §4 for why atlas is not the right next step
-despite the high score.
+A conventional rule-of-thumb is that >30% concentration into the top buckets makes atlas
+packing worth the implementation cost for memory savings; this measurement is well above
+that. However, see §4 for why atlas is not the right next step despite the high score.
 
 ## §4. Conclusion + next-phase recommendation
 
@@ -105,13 +105,17 @@ walking — against a 16,600 µs frame budget at 60 FPS. The GPU is working at r
 particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU
 comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged.
 
-**CPU scales superlinearly with N₁ (near-tier radius).** As N₁ grows from 4 → 8 → 12,
-median cpu_us grows from 3.2 ms → 6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the
-r4 baseline. The Tier 1 entity-classification cache (`EntityClassificationCache`, shipped
-as #53) wins on the inner loop (per-entity classification avoided on cache hits) but the
-outer per-LB walk still scales with N₁. This is exactly what the Tier 2 plan (persistent
-groups) at `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating
-the per-frame LB scan entirely.
+**CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with
+visible-LB count.** As N₁ grows from 4 → 8 → 12, median cpu_us grows from 3.2 ms →
+6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the r4 baseline. The visible-LB count
+scales as `(2N+1)²`: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0×
+vs 7.7× expected if every LB cost the same). Frustum culling discards most far LBs
+early, but the outer per-LB walk still has to touch each one. The Tier 1 entity-
+classification cache (`EntityClassificationCache`, shipped as #53) wins on the inner
+loop (per-entity classification avoided on cache hits) but the outer walk dominates
+as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at
+`docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating the
+per-frame LB scan entirely.
 
 **Radius=12 is not the production scenario.** `ACDREAM_STREAM_RADIUS=12` forces N₁=12
 (625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81
@@ -119,13 +123,18 @@ full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes as
 comfortable 200–400 FPS at the default preset. The numbers above characterize the
 scaling curve for headroom analysis, not the experience a typical player sees.
 
-**Atlas opportunity is high (59%) but the win is memory-only.** With 96 MB of textures
-and 59% in the top-3 dimension buckets, atlas consolidation would reduce sampler-switch
-count (currently near-zero already, since bindless textures are made resident once) and
-shrink the texture memory footprint by roughly 40–50% through packing. But GPU is not
-bottlenecked on sampler switches or memory bandwidth — the 0.6 ms gpu_us p95 at radius=12
-walking demonstrates this directly. Atlas adoption would cost 1–2 weeks of implementation
-risk for a memory saving the process doesn't currently need at 96 MB.
+**Atlas opportunity is high (59%) but the win is memory-only — and modest.** With 96 MB
+of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the
+top buckets share single `Texture2DArray` objects rather than each surface owning its
+own 1-layer array. The primary wins of atlas — fewer sampler switches, fewer texture
+binds — are already near-zero because bindless textures are made resident once at upload
+and never bound per draw. The remaining win is the per-array metadata overhead × N
+surfaces, which is bounded but not dramatic given all surfaces are already power-of-two
+and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on
+the order of low-MB to ~10 MB, not a 40–50% halving. GPU is not bottlenecked on sampler
+switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this
+directly), so atlas adoption would cost 1–2 weeks of implementation risk for a memory
+saving the process doesn't currently need at 96 MB.
 
 ### Recommendation