docs(perf #N6.1): apply final-review fixes — spec, baseline doc, issue #55

Final code review of slice 1 flagged one Important issue (the spec's
"zero cost when off" claim for the surface-dump path is technically
violated — _uploadMetadata always writes one dict entry per upload
regardless of env var) plus minor doc/consistency gaps. Applied:

1. Spec §5 "Cost when off": dropped the "Zero" claim; replaced with
   "Negligible: one Dictionary (hash-table) write per upload, ~30-50 KB
   of process memory in total at Holtburg. Expensive work (file I/O,
   histogram construction) is still env-gated." This matches reality.

2. Baseline doc §5: rewrote from "Raw logs (scratch, can be deleted)"
   referencing files that were never preserved in this worktree, to
   "Reproducing the measurements" with the actual PowerShell launch
   commands. Honest about the raw logs not being kept; the captured
   medians in section 2 are the canonical record.

3. New issue #55 filed in docs/ISSUES.md — static-entity slow path
   reports ~1.45M meshMissing/5s at r4 standstill, drops to ~0 when
   walking. LOW severity (no visible regression), hypothesis points
   at a "permanently-missing entity gets re-classified every frame"
   pattern that Tier 1 cache doesn't cover.

4. Roadmap shipped table: renamed "N.6.1" row to "N.6 slice 1" to
   match every other artifact's naming. Search-discoverability fix.

None of these change the slice's conclusion or next-phase
recommendation (C.1.5 first, then reduced-scope slice 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Erik 2026-05-11 12:51:10 +02:00
parent 76ca3ffca8
commit 41981c4d74
4 changed files with 60 additions and 11 deletions


@@ -46,6 +46,40 @@ Copy this block when adding a new issue:
# Active issues
## #55 — Static-entity slow path reports ~1.45M `meshMissing` per 5s at r4 standstill
**Status:** OPEN
**Severity:** LOW (no visible regression — affects a diagnostic counter, not rendered output)
**Filed:** 2026-05-11
**Component:** rendering / `WbDrawDispatcher` static-entity classification path
**Description:** During the Phase N.6 slice 1 baseline measurement (`docs/plans/2026-05-11-phase-n6-perf-baseline.md` §2),
the radius=4 standstill scenario reported `meshMissing ≈ 1,450,000` per 5-second
`[WB-DIAG]` window. The same scenario while walking drops to near-zero (`meshMissing = 0`
in the steady state) as new landblocks stream in and previously-missing meshes resolve.
This suggests the static-entity slow path's mesh-load lifecycle has some delay before
populating for newly-streamed content but eventually catches up; the standstill case
keeps re-counting the same set of entities-with-unresolved-meshes for the duration of
the run. The counter is per-frame, so the absolute number scales with FPS: at the
measured ~150 FPS that is ~290K reports/s, or ~1,900 entities re-reported every frame.
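The per-frame figure follows from simple division. A minimal worked check, using only the numbers quoted in the description (the variable names are illustrative, not taken from the codebase):

```python
# Figures quoted in the issue description.
reports_per_window = 1_450_000   # meshMissing per [WB-DIAG] window
window_seconds = 5               # length of one [WB-DIAG] window
fps = 150                        # approximate measured frame rate

reports_per_second = reports_per_window / window_seconds
entities_per_frame = reports_per_second / fps

print(f"{reports_per_second:,.0f} reports/s")
print(f"{entities_per_frame:,.0f} entities reported per frame")
```

Because the counter scales linearly with frame rate, the same set of unresolved entities would produce a different absolute `meshMissing` at a different FPS; the per-frame entity count is the more stable number to compare across runs.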
**Root cause / status:** Not investigated. Hypothesis: an entity classification path
counts mesh-missing on every frame for static entities whose `MeshRef` resolution races
the streaming loader. The Tier 1 cache (#53) populates only for entities whose
classification succeeded, so persistently-failing entities run the slow path every frame
forever and bump `meshMissing` every time. If true, the fix is either to (a) cache the
"this entity's mesh genuinely doesn't exist" result so we stop re-checking, or (b)
defer classification of the entity until its `MeshRef` resolves.
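The two candidate fixes can be illustrated with a toy model. This is a sketch under stated assumptions, not the project's code: `ClassificationCache`, `classify`, `on_mesh_loaded`, and the dict-based `mesh_store` are all hypothetical stand-ins, and the real Tier 1 cache may be shaped quite differently.

```python
from dataclasses import dataclass, field

@dataclass
class ClassificationCache:
    """Toy cache that remembers failures as well as successes (hypothetical)."""
    resolved: dict = field(default_factory=dict)      # entity id -> mesh
    known_missing: set = field(default_factory=set)   # negative entries
    mesh_missing: int = 0                              # diagnostic counter

    def classify(self, entity_id, mesh_store):
        if entity_id in self.resolved:
            return self.resolved[entity_id]
        if entity_id in self.known_missing:
            return None                      # option (a): skip the slow path
        mesh = mesh_store.get(entity_id)     # stands in for the slow path
        if mesh is None:
            self.mesh_missing += 1           # counted once, not every frame
            self.known_missing.add(entity_id)
            return None
        self.resolved[entity_id] = mesh
        return mesh

    def on_mesh_loaded(self, entity_id):
        """Option (b): drop the negative entry once the loader delivers."""
        self.known_missing.discard(entity_id)

cache = ClassificationCache()
store = {1: "mesh-a"}                        # entity 2 has no mesh yet
for _frame in range(100):
    cache.classify(1, store)
    cache.classify(2, store)
print(cache.mesh_missing)
```

In this model the counter grows with the number of distinct unresolved entities rather than with the frame count. The invalidation hook matters: without it, a negative entry would hide a mesh that arrives late from the streaming loader.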
**Files:** `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs` (the slow path that
increments `_meshesMissing`), `src/AcDream.App/Rendering/Wb/EntityClassificationCache.cs`
(the Tier 1 cache — likely needs to learn about "permanently missing" entries).
**Acceptance:** `meshMissing` should drop to near-zero within ~5 seconds of streaming
settling at any radius/motion combination, not stay at ~1.45M/5s indefinitely at standstill.
---
## #50 — Road-edge tree at 0xA9B1 visible in acdream but not retail
**Status:** OPEN


@@ -63,7 +63,7 @@
| N.4 | Rendering pipeline foundation — adopted WB's `ObjectMeshManager` as the production mesh pipeline behind `ACDREAM_USE_WB_FOUNDATION` (default-on). `WbMeshAdapter` is the single seam (owns `ObjectMeshManager`, drains the staged-upload queue per frame, populates `AcSurfaceMetadataTable` with per-batch translucency / luminosity / fog metadata). `WbDrawDispatcher` is the production draw path: groups all visible (entity, batch) pairs, single-uploads the matrix buffer, fires one `glDrawElementsInstancedBaseVertexBaseInstance` per group with `BaseInstance` slicing into the shared instance VBO. `LandblockSpawnAdapter` + `EntitySpawnAdapter` bridge spawn lifecycle to WB ref-counts (atlas tier vs per-instance). Perf wins shipped as part of N.4: per-entity frustum cull, opaque front-to-back sort, palette-hash memoization (compute once per entity, reuse across batches). Visual verification at Holtburg passed: scenery + connected characters with full close-detail geometry (Issue #47 regression resolved). Legacy `InstancedMeshRenderer` retained as `ACDREAM_USE_WB_FOUNDATION=0` escape hatch until N.6 (retired early in N.5 ship amendment). | Live ✓ |
| N.5 | Modern rendering path — lifted `WbDrawDispatcher` onto bindless textures (`GL_ARB_bindless_texture`) + `glMultiDrawElementsIndirect`. Per-frame entity rendering: 3 SSBO uploads (instance matrices @ binding=0, batch data @ binding=1, indirect commands) + 2 indirect draw calls (opaque + transparent). ~12-15 GL calls per frame regardless of group count, down from hundreds-of-per-group in N.4. CPU dispatcher: 1.23 ms/frame median at Holtburg courtyard (1662 groups, ~810 fps sustained). All textures on the WB modern path use 1-layer `Texture2DArray` + `sampler2DArray`. Legacy callers keep `Texture2D` / `sampler2D` via the parallel `TextureCache` path until N.6 retires them. Three gotchas captured in memory: texture target lock-in, bindless Dispose order (two-phase non-resident before delete), GL_TIME_ELAPSED double-buffering. **Ship amendment 2026-05-08:** legacy renderers (`InstancedMeshRenderer`, `StaticMeshRenderer`, `WbFoundationFlag`) retired within N.5 — modern path is mandatory; missing bindless throws `NotSupportedException` at startup. N.6 scope narrowed accordingly. Plan archived at `docs/superpowers/plans/2026-05-08-phase-n5-modern-rendering.md`. | Live ✓ |
| N.5b | Terrain on the modern rendering path — `TerrainModernRenderer` replaces `TerrainChunkRenderer` (the latter plus `TerrainRenderer` + `terrain.vert/.frag` deleted). Single global VBO/EBO with slot allocator (one slot per landblock), per-frame `DrawElementsIndirectCommand[]` upload + `glMultiDrawElementsIndirect`, bindless atlas handles passed as `uvec2` uniforms reconstructed via `sampler2DArray(handle)`. **Path C** chosen: mirrors WB's `TerrainRenderManager` pattern but consumes `LandblockMesh.Build` so retail's `FSplitNESW` formula is preserved (closes ISSUE #51). Path A killed by 49.98% measured divergence between WB's `CalculateSplitDirection` and retail's at addr `00531d10`; Path B (fork-patch WB) rejected for permanent maintenance burden. Perf at Holtburg radius=5 (commit `da56063`): modern 6.4-7.0 µs / 9-14 µs p95 vs legacy 1.5 µs / 3.0 µs — **modern is ~4× SLOWER on CPU at radius=5** because legacy's 16×16-LB chunking collapsed visible LBs to one `glDrawElements`. Architectural wins (zero `glBindTexture`/frame, constant-cost dispatch, per-LB frustum cull) manifest at higher radius (A.5 territory). Spec acceptance criterion 5 ("≥10% lower CPU at radius=5") amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Three gotchas captured in memory: `uniform sampler2DArray` + `glProgramUniformHandleARB` GL_INVALID_OPERATIONs on at least one driver (use `uniform uvec2` + `sampler2DArray(handle)` constructor instead — N.5's mesh_modern pattern); `MaybeFlushTerrainDiag` median-calc underflow on first sample; visual gates need actual visual confirmation, not assent. Plan archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`. | Live ✓ |
| N.6.1 | Phase N.6 slice 1: GPU timing fix + radius=12 perf baseline. Fixed the `gpu_us` double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated `ACDREAM_DUMP_SURFACES=1` one-shot surface-format histogram dump in `TextureCache` for the atlas-opportunity audit. Captured authoritative baseline at Holtburg radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` diagnostic; baseline doc concludes CPU dominates GPU by 30-50× at every radius and recommends C.1.5 next then reduced-scope slice 2 (atlas + persistent-mapped buffers dropped). Baseline numbers at [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). Plan archived at `docs/superpowers/plans/2026-05-11-phase-n6-slice1.md`. | Live ✓ |
| N.6 slice 1 | GPU timing fix + radius=12 perf baseline. Fixed the `gpu_us` double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated `ACDREAM_DUMP_SURFACES=1` one-shot surface-format histogram dump in `TextureCache` for the atlas-opportunity audit. Captured authoritative baseline at Holtburg radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` diagnostic; baseline doc concludes CPU dominates GPU by 30-50× at every radius and recommends C.1.5 next then reduced-scope slice 2 (atlas + persistent-mapped buffers dropped). Baseline numbers at [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). Plan archived at `docs/superpowers/plans/2026-05-11-phase-n6-slice1.md`. | Live ✓ |
Plus polish that doesn't get its own phase number:
- FlyCamera default speed lowered + Shift-to-boost


@@ -165,14 +165,29 @@ default N₁=4; worth revisiting if a future quality preset wants N₁=8 as defa
cpu_us median > 5,000 µs or gpu_us p95 > 8,000 µs, re-open the escalation question.
Otherwise, hold the C.1.5 → reduced-slice-2 sequence.
## §5. Raw logs
## §5. Reproducing the measurements
Scratch logs from this measurement run (not committed; can be deleted once the doc is
reviewed):
Raw `[WB-DIAG]` output from each run was inspected live during measurement and the
median of the last three steady-state lines from each scenario was transcribed into §2.
The raw launch logs were not preserved — the captured medians in §2 are the canonical
record. To reproduce on the same hardware:
- `baseline-r4-stand.log`, `baseline-r4-walk.log`
- `baseline-r8-stand.log`, `baseline-r8-walk.log`
- `baseline-r12-stand.log`, `baseline-r12-walk.log`
- `baseline-surfaces.log` (launch log for `ACDREAM_DUMP_SURFACES=1` run)
- `baseline-surfaces.txt` (copy of `%LOCALAPPDATA%\acdream\n6-surfaces.txt`)
- `task1-verify.log` (Task 1 manual verification log)
```powershell
$env:ACDREAM_DAT_DIR = "$env:USERPROFILE\Documents\Asheron's Call"
$env:ACDREAM_LIVE = "1"
$env:ACDREAM_TEST_HOST = "127.0.0.1"
$env:ACDREAM_TEST_PORT = "9000"
$env:ACDREAM_TEST_USER = "testaccount"
$env:ACDREAM_TEST_PASS = "testpassword"
$env:ACDREAM_WB_DIAG = "1"
$env:ACDREAM_STREAM_RADIUS = "4" # or 8, 12
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline.log"
```
Stand still for ~30 s at the target radius (60 s at radius 12 to let streaming settle),
or walk N→E→S→W across one landblock. Then `Select-String -Path baseline.log -Pattern
"\[WB-DIAG\]" | Select-Object -Last 3` captures the steady-state numbers.
For the surface histogram, also set `$env:ACDREAM_DUMP_SURFACES = "1"`, stay in-world
~30 s after streaming has loaded ≥100 textures (the cache-size gate), then read
`$env:LOCALAPPDATA\acdream\n6-surfaces.txt`.
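The "median of the last three steady-state lines" step can also be done mechanically rather than by eye. The sketch below assumes each `[WB-DIAG]` line carries `key=value` pairs such as `cpu_us=1200`; that line shape is an assumption, since the exact log format is not shown here, and the sample numbers are made up for illustration.

```python
import re
from statistics import median

# Assumed line shape (hypothetical): "[WB-DIAG] cpu_us=1200 gpu_us=40 meshMissing=0"
PAIR = re.compile(r"(\w+)\s*=\s*(-?\d+(?:\.\d+)?)")

def steady_state_medians(log_text, last_n=3):
    """Median of each numeric field over the last `last_n` [WB-DIAG] lines."""
    diag_lines = [ln for ln in log_text.splitlines() if "[WB-DIAG]" in ln]
    samples = {}
    for line in diag_lines[-last_n:]:
        for key, value in PAIR.findall(line):
            samples.setdefault(key, []).append(float(value))
    return {key: median(values) for key, values in samples.items()}

# Made-up numbers for illustration only; these are not measurements.
sample_log = """\
startup noise
[WB-DIAG] cpu_us=9000 gpu_us=300 meshMissing=500
[WB-DIAG] cpu_us=1300 gpu_us=40 meshMissing=0
[WB-DIAG] cpu_us=1200 gpu_us=42 meshMissing=0
[WB-DIAG] cpu_us=1100 gpu_us=38 meshMissing=0
"""
medians = steady_state_medians(sample_log)
print(medians)
```

Taking only the tail of the log keeps warm-up windows (streaming still in progress) out of the median, which is the same intent as reading the last three lines manually.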


@@ -196,7 +196,7 @@ Plus rollups at the end:
### Cost when off
Zero — gated by the env-var check. The dump method is only called from a guarded `if` in `GameWindow.cs`.
Negligible: one `Dictionary<uint, …>` write per `UploadRgba8`/`UploadRgba8AsLayer1Array` call (the `_uploadMetadata` insertion is unconditional so the dump path doesn't have to query GL state when it does fire). At Holtburg with 760 textures that's ~30-50 KB of process memory and one hash-table write per upload: invisible at runtime, no GC pressure. The expensive work (file I/O, histogram construction) is gated by the env-var check inside `TickSurfaceHistogramDumpIfEnabled` and only runs when `ACDREAM_DUMP_SURFACES=1`.
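The split between the always-on bookkeeping and the gated dump can be pictured with a small model. This is an illustrative sketch only: `record_upload`, `dump_histogram_if_enabled`, and the `"RGBA8"` format label are hypothetical names, and only the `ACDREAM_DUMP_SURFACES` variable comes from the spec.

```python
import os
from collections import Counter

# Hypothetical stand-in for the per-upload metadata table.
upload_metadata = {}  # texture id -> (width, height, format name)

def record_upload(texture_id, width, height, fmt):
    """Cheap and unconditional: one dict write per upload."""
    upload_metadata[texture_id] = (width, height, fmt)

def dump_histogram_if_enabled(environ=os.environ):
    """Expensive part; runs only when the env var is set to "1"."""
    if environ.get("ACDREAM_DUMP_SURFACES") != "1":
        return None
    return Counter(fmt for _, _, fmt in upload_metadata.values())

record_upload(1, 64, 64, "RGBA8")
record_upload(2, 128, 128, "RGBA8")
print(dump_histogram_if_enabled({}))                              # gate closed
print(dump_histogram_if_enabled({"ACDREAM_DUMP_SURFACES": "1"}))  # gate open
```

The trade-off is a small, bounded memory cost on every run in exchange for a dump path that never has to query driver state after the fact.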
---