docs(perf #N6.1): apply code-quality review fixes to baseline doc
Code-quality review on commit 13abf96 flagged 3 Important issues in
the baseline document plus 2 minor roadmap-consistency gaps. All five
fixes are applied here:
1. The "CPU scales superlinearly with N₁" claim was imprecise because
CPU growth (4.0×) is actually sublinear vs near-LB count (7.7×).
Clarified: CPU grows more than linearly with radius N₁ but
sublinearly with visible-LB count; frustum cull discards most far
LBs early. The outer per-LB walk still scales with N₁, which is
what Tier 2's persistent groups address.
2. The "40-50% memory footprint reduction from atlas packing" estimate
was asserted without derivation and likely too optimistic given all
surfaces are already power-of-two and same-format (RGBA8). Replaced
with a more honest bound: "low-MB to ~10 MB absolute saving" with
explicit per-array metadata overhead reasoning. Conclusion is
unchanged — atlas adoption still isn't justified given GPU
under-utilization.
3. The "spec §6 threshold for atlas is >30%" citation pointed at text
that doesn't exist in the spec. Replaced with "A conventional
rule-of-thumb" so a future reader doesn't chase a phantom citation.
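The scaling claim in item 1 is plain arithmetic and can be sanity-checked standalone. A minimal sketch (the numbers are the baseline figures quoted above; `visible_lb_count` is a hypothetical helper written for this sketch, not project code):

```python
# Sanity-check for item 1: CPU growth vs radius vs visible-LB count.
# Measured median cpu_us from the baseline doc, in ms, keyed by radius N1.
measured_cpu_ms = {4: 3.2, 8: 6.7, 12: 12.9}

def visible_lb_count(n1: int) -> int:
    """A square near tier of radius N1 spans (2*N1 + 1)^2 landblocks."""
    return (2 * n1 + 1) ** 2

for n1, cpu in measured_cpu_ms.items():
    cpu_x = cpu / measured_cpu_ms[4]
    lb_x = visible_lb_count(n1) / visible_lb_count(4)
    print(f"N1={n1:2d}: {visible_lb_count(n1):3d} LBs, "
          f"CPU x{cpu_x:.1f}, LB count x{lb_x:.1f}")
# At N1=12 (625 LBs): CPU x4.0 for a 3x radius increase, i.e. more than
# linear in N1, but well under the x7.7 LB-count growth, i.e. sublinear
# in visible-LB count.
```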
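The revised memory bound in item 2 can likewise be sketched. The per-array overhead range below is an explicit assumption for illustration (driver-dependent and unmeasured), not a figure from the audit:

```python
# Rough bound for item 2: what atlas packing could actually save.
# All surfaces are power-of-two RGBA8, so repacking recovers no padding;
# the recoverable cost is per-array object/metadata overhead.
surfaces = 760  # surface count from the format-histogram dump

# ASSUMPTION: 4-16 KB of driver metadata per 1-layer texture array is a
# guessed range, used here for illustration only.
for overhead_kb in (4, 16):
    saved_mb = surfaces * overhead_kb / 1024
    print(f"{overhead_kb} KB/array -> ~{saved_mb:.1f} MB recoverable")
# A few MB up to ~12 MB: consistent with "low-MB to ~10 MB", versus the
# retracted 40-50% claim, which would have implied ~38-48 MB of 96 MB.
```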
Plus roadmap consistency:
M1: The N.6 slice 1 bullet now uses the canonical "✓ SHIPPED — Title.
Shipped YYYY-MM-DD." prefix that every other shipped phase uses.
M2: Added N.6.1 row to the shipped table at the top of the roadmap
(lines ~55-66) so the at-a-glance shipped list is complete.
None of these change the conclusion or the next-phase recommendation
(C.1.5 first, then reduced N.6 slice 2). The fixes improve doc accuracy
and future-readability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent 13abf96a5e
commit 76ca3ffca8

2 changed files with 28 additions and 18 deletions
@@ -63,6 +63,7 @@
| N.4 | Rendering pipeline foundation — adopted WB's `ObjectMeshManager` as the production mesh pipeline behind `ACDREAM_USE_WB_FOUNDATION` (default-on). `WbMeshAdapter` is the single seam (owns `ObjectMeshManager`, drains the staged-upload queue per frame, populates `AcSurfaceMetadataTable` with per-batch translucency / luminosity / fog metadata). `WbDrawDispatcher` is the production draw path: groups all visible (entity, batch) pairs, single-uploads the matrix buffer, fires one `glDrawElementsInstancedBaseVertexBaseInstance` per group with `BaseInstance` slicing into the shared instance VBO. `LandblockSpawnAdapter` + `EntitySpawnAdapter` bridge spawn lifecycle to WB ref-counts (atlas tier vs per-instance). Perf wins shipped as part of N.4: per-entity frustum cull, opaque front-to-back sort, palette-hash memoization (compute once per entity, reuse across batches). Visual verification at Holtburg passed: scenery + connected characters with full close-detail geometry (Issue #47 regression resolved). Legacy `InstancedMeshRenderer` retained as `ACDREAM_USE_WB_FOUNDATION=0` escape hatch until N.6 (retired early in N.5 ship amendment). | Live ✓ |
| N.5 | Modern rendering path — lifted `WbDrawDispatcher` onto bindless textures (`GL_ARB_bindless_texture`) + `glMultiDrawElementsIndirect`. Per-frame entity rendering: 3 SSBO uploads (instance matrices @ binding=0, batch data @ binding=1, indirect commands) + 2 indirect draw calls (opaque + transparent). ~12-15 GL calls per frame regardless of group count, down from hundreds-of-per-group in N.4. CPU dispatcher: 1.23 ms/frame median at Holtburg courtyard (1662 groups, ~810 fps sustained). All textures on the WB modern path use 1-layer `Texture2DArray` + `sampler2DArray`. Legacy callers keep `Texture2D` / `sampler2D` via the parallel `TextureCache` path until N.6 retires them. Three gotchas captured in memory: texture target lock-in, bindless Dispose order (two-phase non-resident before delete), GL_TIME_ELAPSED double-buffering. **Ship amendment 2026-05-08:** legacy renderers (`InstancedMeshRenderer`, `StaticMeshRenderer`, `WbFoundationFlag`) retired within N.5 — modern path is mandatory; missing bindless throws `NotSupportedException` at startup. N.6 scope narrowed accordingly. Plan archived at `docs/superpowers/plans/2026-05-08-phase-n5-modern-rendering.md`. | Live ✓ |
| N.5b | Terrain on the modern rendering path — `TerrainModernRenderer` replaces `TerrainChunkRenderer` (the latter plus `TerrainRenderer` + `terrain.vert/.frag` deleted). Single global VBO/EBO with slot allocator (one slot per landblock), per-frame `DrawElementsIndirectCommand[]` upload + `glMultiDrawElementsIndirect`, bindless atlas handles passed as `uvec2` uniforms reconstructed via `sampler2DArray(handle)`. **Path C** chosen: mirrors WB's `TerrainRenderManager` pattern but consumes `LandblockMesh.Build` so retail's `FSplitNESW` formula is preserved (closes ISSUE #51). Path A killed by 49.98% measured divergence between WB's `CalculateSplitDirection` and retail's at addr `00531d10`; Path B (fork-patch WB) rejected for permanent maintenance burden. Perf at Holtburg radius=5 (commit `da56063`): modern 6.4-7.0 µs / 9-14 µs p95 vs legacy 1.5 µs / 3.0 µs — **modern is ~4× SLOWER on CPU at radius=5** because legacy's 16×16-LB chunking collapsed visible LBs to one `glDrawElements`. Architectural wins (zero `glBindTexture`/frame, constant-cost dispatch, per-LB frustum cull) manifest at higher radius (A.5 territory). Spec acceptance criterion 5 ("≥10% lower CPU at radius=5") amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Three gotchas captured in memory: `uniform sampler2DArray` + `glProgramUniformHandleARB` GL_INVALID_OPERATIONs on at least one driver (use `uniform uvec2` + `sampler2DArray(handle)` constructor instead — N.5's mesh_modern pattern); `MaybeFlushTerrainDiag` median-calc underflow on first sample; visual gates need actual visual confirmation, not assent. Plan archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`. | Live ✓ |
+| N.6.1 | Phase N.6 slice 1 — GPU timing fix + radius=12 perf baseline. Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated `ACDREAM_DUMP_SURFACES=1` one-shot surface-format histogram dump in `TextureCache` for the atlas-opportunity audit. Captured authoritative baseline at Holtburg radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` diagnostic; baseline doc concludes CPU dominates GPU by 30–50× at every radius and recommends C.1.5 next then reduced-scope slice 2 (atlas + persistent-mapped buffers dropped). Baseline numbers at [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). Plan archived at `docs/superpowers/plans/2026-05-11-phase-n6-slice1.md`. | Live ✓ |
Plus polish that doesn't get its own phase number:
- FlyCamera default speed lowered + Shift-to-boost
@@ -687,7 +688,7 @@ for our deletions/additions; merge upstream `master` periodically.
manifest at higher radius. Spec acceptance criterion #5 was wrong;
amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Plan
archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`.
-- **N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** **SHIPPED 2026-05-11.**
+- **✓ SHIPPED — N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** Shipped 2026-05-11.
Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3
query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel
desktop GL). Added env-gated surface-format histogram dump in `TextureCache`
@@ -87,9 +87,9 @@ Same as the dimension buckets above since there is only one format. The top-3 tr
(128×128, 64×64, 256×256) cover 449 of 760 surfaces = **59%**.
**Atlas-opportunity score: 59%** of surfaces fall into the top-3 (W, H, format) triples.
-The spec §6 threshold for "atlas work is justified for memory savings" is >30%; this
-measurement is well above it. However, see §4 for why atlas is not the right next step
-despite the high score.
+A conventional rule-of-thumb is that >30% concentration into the top buckets makes atlas
+packing worth the implementation cost for memory savings; this measurement is well above
+that. However, see §4 for why atlas is not the right next step despite the high score.
|
||||
## §4. Conclusion + next-phase recommendation
@ -105,13 +105,17 @@ walking — against a 16,600 µs frame budget at 60 FPS. The GPU is working at r
particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU
comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged.
-**CPU scales superlinearly with N₁ (near-tier radius).** As N₁ grows from 4 → 8 → 12,
-median cpu_us grows from 3.2 ms → 6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the
-r4 baseline. The Tier 1 entity-classification cache (`EntityClassificationCache`, shipped
-as #53) wins on the inner loop (per-entity classification avoided on cache hits) but the
-outer per-LB walk still scales with N₁. This is exactly what the Tier 2 plan (persistent
-groups) at `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating
-the per-frame LB scan entirely.
+**CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with
+visible-LB count.** As N₁ grows from 4 → 8 → 12, median cpu_us grows from 3.2 ms →
+6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the r4 baseline. The visible-LB count
+scales as `(2N+1)²`: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0×
+vs 7.7× expected if every LB cost the same). Frustum culling discards most far LBs
+early, but the outer per-LB walk still has to touch each one. The Tier 1 entity-
+classification cache (`EntityClassificationCache`, shipped as #53) wins on the inner
+loop (per-entity classification avoided on cache hits) but the outer walk dominates
+as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at
+`docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating the
+per-frame LB scan entirely.
**Radius=12 is not the production scenario.** `ACDREAM_STREAM_RADIUS=12` forces N₁=12
(625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81
@@ -119,13 +123,18 @@ full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes
comfortable 200–400 FPS at the default preset. The numbers above characterize the scaling
curve for headroom analysis, not the experience a typical player sees.
-**Atlas opportunity is high (59%) but the win is memory-only.** With 96 MB of textures
-and 59% in the top-3 dimension buckets, atlas consolidation would reduce sampler-switch
-count (currently near-zero already, since bindless textures are made resident once) and
-shrink the texture memory footprint by roughly 40–50% through packing. But GPU is not
-bottlenecked on sampler switches or memory bandwidth — the 0.6 ms gpu_us p95 at radius=12
-walking demonstrates this directly. Atlas adoption would cost 1–2 weeks of implementation
-risk for a memory saving the process doesn't currently need at 96 MB.
+**Atlas opportunity is high (59%) but the win is memory-only — and modest.** With 96 MB
+of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the
+top buckets share single `Texture2DArray` objects rather than each surface owning its
+own 1-layer array. The primary wins of atlas — fewer sampler switches, fewer texture
+binds — are already near-zero because bindless textures are made resident once at upload
+and never bound per draw. The remaining win is the per-array metadata overhead × N
+surfaces, which is bounded but not dramatic given all surfaces are already power-of-two
+and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on
+the order of low-MB to ~10 MB, not a 40–50% halving. GPU is not bottlenecked on sampler
+switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this
+directly), so atlas adoption would cost 1–2 weeks of implementation risk for a memory
+saving the process doesn't currently need at 96 MB.
### Recommendation