From 05d590cd5486bb92fb31e67e832ef0061de06bdc Mon Sep 17 00:00:00 2001 From: Erik Date: Mon, 11 May 2026 11:03:44 +0200 Subject: [PATCH 1/7] =?UTF-8?q?docs(perf):=20Phase=20N.6=20slice=201=20?= =?UTF-8?q?=E2=80=94=20spec=20for=20gpu=5Fus=20fix=20+=20radius=3D12=20bas?= =?UTF-8?q?eline?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Brainstormed design for the first slice of Phase N.6 (perf polish). Slice 1 ships two commits: (1) fix the GPU timing query double-buffering in WbDrawDispatcher (cross-vendor ring of 3, read-before-overwrite), (2) add an env-gated surface-format histogram dump + capture the radius=12 perf baseline at Holtburg. Slice 2 (TextureCache cleanup + shader migration + optional persistent-mapped buffers) is deferred until after C.1.5 (PES emitter wiring), with the next-phase decision to be made on the baseline numbers slice 1 produces. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../2026-05-11-phase-n6-slice1-design.md | 335 ++++++++++++++++++ 1 file changed, 335 insertions(+) create mode 100644 docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md diff --git a/docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md b/docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md new file mode 100644 index 0000000..3c35307 --- /dev/null +++ b/docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md @@ -0,0 +1,335 @@ +# Phase N.6 slice 1 — GPU timing fix + radius=12 perf baseline (design) + +**Created:** 2026-05-11. +**Status:** approved design, ready for implementation plan. +**Phase context:** Phase N.6 (perf polish) split into two slices on 2026-05-11 — this is slice 1. Slice 2 (legacy `TextureCache` cleanup + shader migration + optional persistent-mapped buffers) is deferred until after C.1.5 (PES emitter wiring), and gets its own spec then. 
+**Roadmap entry:** [docs/plans/2026-04-11-roadmap.md](../../plans/2026-04-11-roadmap.md) lines 690-705 (to be amended in commit 2 to reflect the slice split). + +--- + +## §1. Problem + +`WbDrawDispatcher` runs `glBeginQuery(GL_TIME_ELAPSED, …) … glEndQuery` around the opaque and transparent indirect draws, then immediately polls `glGetQueryObject(…, ResultAvailable, …)` **on the same frame** to read the result. The GPU has not finished executing the draw by the time the polling call runs, so `avail` is always 0, the sample is dropped, and the `_gpuSamples` ring stays all-zero forever. The user sees `gpu_us=0m/0p95` in every `[WB-DIAG]` line under `ACDREAM_WB_DIAG=1`. + +Verified at [src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs:849-859](../../../src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs#L849). + +Without this fix: +- Every future perf decision (Tier 2 vs Tier 3 vs slice 2 vs do-nothing) is made on CPU-only data. +- We cannot tell whether the dispatcher is CPU-bound or GPU-bound at radius=12. +- We cannot validate that N.5/N.5b/Tier 1 changes actually moved GPU time. + +This slice ships the GPU-timing fix and uses the now-working diagnostic to produce one authoritative perf baseline document so the next phase decision (slice 2 vs C.1.5 vs Tier 2/3) is data-driven. + +--- + +## §2. Goals and non-goals + +### Goals + +1. `[WB-DIAG]` reports non-zero `gpu_us` for the entity dispatcher's opaque+transparent passes at Holtburg radius=12 with `ACDREAM_WB_DIAG=1`. +2. The fix works on AMD, NVIDIA, and Intel desktop OpenGL drivers without vendor-specific code paths. +3. Produce a baseline document at `docs/plans/2026-05-11-phase-n6-perf-baseline.md` with CPU and GPU numbers across radii 4 / 8 / 12 (standstill + walking), a surface-format histogram, and a memory snapshot. +4. The baseline document closes with a recommendation paragraph: should the next phase be N.6 slice 2 (perf cleanup), C.1.5 (PES wiring), or escalation to Tier 2 (static/dynamic split). 
Rationale grounded in the captured numbers. +5. `dotnet build` and `dotnet test` green; no functional regression in the rendering path. + +### Non-goals + +- Persistent-mapped buffers (`BufferSubData` → `GL_MAP_PERSISTENT_BIT`). Deferred to slice 2 unless the baseline shows it's a hot spot. +- Legacy `TextureCache` cleanup, `mesh.frag` orphan deletion, sky/UI text shader migration to bindless. All deferred to slice 2. +- WB atlas adoption / texture-array consolidation. Deferred to slice 2 pending the surface histogram from goal 3. +- Adding GPU queries to terrain / sky / particle / debug-line passes. Slice 1 keeps query scope to the existing two queries inside `WbDrawDispatcher` (opaque-pass + transparent-pass). +- GPU compute culling. That's Tier 3 of [docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md](../../plans/2026-05-10-perf-tiers-2-3-roadmap.md), separate roadmap. + +--- + +## §3. Design decisions (from brainstorming, 2026-05-11) + +| # | Decision | Rationale | +|---|---|---| +| Q1 | **Ring of 3 query-pair slots** (not ring of 2) | Vendor-neutral. NVIDIA drivers with triple-buffering + vsync can queue ~3 frames ahead; AMD typically 1–2; Intel iGPUs vary. Ring of 2 plus `ResultAvailable` guard works everywhere but drops more samples on deeper queues. Ring of 3 collects samples reliably across all desktop drivers. Cost: one extra `GLuint` query pair (~12 bytes of GPU state) plus one frame of latency on the printed value, which is invisible because the diagnostic is a 256-frame moving-window median. | +| Q2 | **Read-before-issue, same-slot pattern** | On frame N, attempt to read slot `N%3` (which contains frame N-3's result — the *oldest* unread data, ~50 ms ago at 60 fps) *before* overwriting it with frame N's queries. Reading the oldest data maximizes the chance that `ResultAvailable=1` across all desktop drivers. Use `ResultAvailable` as a guard — if not ready, skip the sample. 
`MedianMicros` already computes over the non-zero subset, so dropped samples don't poison the result. | +| Q3 | **Keep query scope unchanged** — just the two existing queries (opaque-pass + transparent-pass for the WB dispatcher) | Slice 1 is "fix what's broken," not "expand instrumentation." Adding terrain / sky / particle queries is slice-2-or-later work and would inflate this slice past the half-day budget. | +| Q4 | **Surface-format histogram via env-gated one-shot dump** (`ACDREAM_DUMP_SURFACES=1`) | The atlas-adoption decision in slice 2 needs to know whether enough surfaces share dimensions/format to make consolidation worthwhile. A one-time dump on first frame to a fixed file path is cheap to implement, zero cost when off, and lets the user re-run cheaply when needed. Output goes to `%LOCALAPPDATA%\acdream\n6-surfaces.txt` (not stdout) to avoid spamming the launch log. | +| Q5 | **Two commits, not one** | Commit 1 is the GPU-timing fix (code change, regression-bisectable). Commit 2 is the surface-dump path + baseline document (docs + env-gated diag). Keeping them separate means a future bisect for a GPU-timing regression doesn't land on a doc commit. | +| Q6 | **Baseline measurement is Holtburg + High preset only** (per the user's hardware) | Slice 1 doesn't pretend to be a cross-hardware perf survey. It's one canonical measurement on the dev machine. The document template captures setup explicitly so a NVIDIA / lower-end run can be added later without re-architecting the doc. | + +--- + +## §4. Change 1 — GPU query double-buffering + +### Files touched + +- `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs` — single-file change, ~30 LOC delta. 
+ +### Current state (verified) + +```csharp +// Field declarations near line 155: +private uint _gpuQueryOpaque; +private uint _gpuQueryTransparent; +private readonly long[] _gpuSamples = new long[256]; +private bool _gpuQueriesInitialized; + +// Init at line ~347: +if (diag && !_gpuQueriesInitialized) { + _gpuQueryOpaque = _gl.GenQuery(); + _gpuQueryTransparent = _gl.GenQuery(); + _gpuQueriesInitialized = true; +} + +// Around the opaque draw at line ~774: +if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque); +… opaque indirect draw … +if (diag && _gpuQueriesInitialized) _gl.EndQuery(QueryTarget.TimeElapsed); + +// Same pattern around transparent draw at line ~823. + +// Read at line ~849 — BUG: same frame, never ready: +if (_gpuQueriesInitialized) { + _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.ResultAvailable, out int avail); + if (avail != 0) { + _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.Result, out ulong opaqueNs); + _gl.GetQueryObject(_gpuQueryTransparent, QueryObjectParameterName.Result, out ulong transNs); + long gpuUs = (long)((opaqueNs + transNs) / 1000UL); + _gpuSamples[_gpuSampleCursor] = gpuUs; + _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length; + } +} + +// Dispose at line ~1140: +if (_gpuQueriesInitialized) { + _gl.DeleteQuery(_gpuQueryOpaque); + _gl.DeleteQuery(_gpuQueryTransparent); +} +``` + +### Target state + +```csharp +private const int GpuQueryRingDepth = 3; +private readonly uint[] _gpuQueryOpaque = new uint[GpuQueryRingDepth]; +private readonly uint[] _gpuQueryTransparent = new uint[GpuQueryRingDepth]; +private int _gpuQueryFrameIndex; // increments every frame we issue queries +private bool _gpuQueriesInitialized; + +// Init: +if (diag && !_gpuQueriesInitialized) { + for (int i = 0; i < GpuQueryRingDepth; i++) { + _gpuQueryOpaque[i] = _gl.GenQuery(); + _gpuQueryTransparent[i] = _gl.GenQuery(); + } + _gpuQueriesInitialized = true; +} + +// Compute 
the slot index for this frame. We read this slot's previous
// contents (frame N-3's queries — the oldest data in the ring) and then
// overwrite it with this frame's queries.
int slot = _gpuQueryFrameIndex % GpuQueryRingDepth;

// Read frame N-3's result BEFORE overwriting. Gated on "we've completed
// at least one full ring of writes" so we don't read uninitialized slots
// during warm-up.
if (_gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth) {
    _gl.GetQueryObject(_gpuQueryOpaque[slot], QueryObjectParameterName.ResultAvailable, out int avail);
    if (avail != 0) {
        _gl.GetQueryObject(_gpuQueryOpaque[slot], QueryObjectParameterName.Result, out ulong opaqueNs);
        _gl.GetQueryObject(_gpuQueryTransparent[slot], QueryObjectParameterName.Result, out ulong transNs);
        long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
        _gpuSamples[_gpuSampleCursor] = gpuUs;
        _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
    }
    // If avail==0 the sample is dropped silently. MedianMicros already
    // computes over the non-zero subset, so dropped samples don't poison
    // the median.
}

// Issue this frame's queries into the same slot — overwriting the data
// we just (attempted to) read.
if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque[slot]);
… opaque indirect draw …
if (diag && _gpuQueriesInitialized) _gl.EndQuery(QueryTarget.TimeElapsed);

… same for transparent with _gpuQueryTransparent[slot] …

_gpuQueryFrameIndex++;

// Dispose: loop over the ring.
```

### Behavior

- Frames 0, 1, 2 issue queries but no reads happen (the `>= RingDepth` gate skips them).
- Frame 3 reads frame 0's queries (oldest in ring) and writes new queries into slot 0. Frame 4 reads frame 1's, etc.
- Steady-state: each frame's queries are read exactly once, three frames after they were issued. During the warm-up frames 0/1/2 no reads happen, so the diagnostic simply starts ~50 ms late (a startup artifact; no queries are actually lost).
+- The diagnostic prints over a 256-frame moving window — at 200 fps that's ~1.3 s of history, so the first valid `gpu_us` median appears within ~2 s of moving. + +### Diag interaction + +`MaybeFlushDiag` already prints every 5 s; no change there. + +`MedianMicros` already filters non-zero samples; no change there. + +The user-visible behavior change: `gpu_us=Xm/Yp95` numbers in `[WB-DIAG]` reflect real GPU draw time for the entity dispatcher's two indirect calls. + +--- + +## §5. Change 2 — Surface-format histogram one-shot dump + +### Files touched + +- `src/AcDream.App/Rendering/TextureCache.cs` — add an env-gated dump method, ~40 LOC. +- One caller in `GameWindow.cs` (first-frame hook) — ~5 LOC. + +### Trigger + +Env var `ACDREAM_DUMP_SURFACES=1`. When set, on **frame index 600** of the session (~10 s at 60 fps, ~3 s at 200 fps — both well past streaming settle at radius≤12), iterate all entries in the bindless caches (`_bindlessBySurfaceId`, `_bindlessByOverridden`, `_bindlessByPalette`) and emit a histogram to `%LOCALAPPDATA%\acdream\n6-surfaces.txt`. One-shot — fires once per session at the exact frame, no repeats. The user can re-launch to capture a fresh snapshot. + +### Output schema + +Per entry, one line: `surfaceId(uint32 hex), width(uint16), height(uint16), format(string), byteCount(uint32)`. + +Plus rollups at the end: +- Count by `(width × height)` bucket — answers "how many distinct dimension pairs?". +- Count by source `SurfaceFormat` (INDEX16, BGRA, DXT1, etc.). +- Total bytes (sum of `width × height × 4` for RGBA8 uploads). +- Top 10 most-shared `(width, height, format)` triples by count — this is the atlas-opportunity input. + +### Cost when off + +Zero — gated by the env-var check. The dump method is only called from a guarded `if` in `GameWindow.cs`. + +--- + +## §6. Change 3 — Baseline document + +### File + +`docs/plans/2026-05-11-phase-n6-perf-baseline.md`. + +### Setup section + +- Hardware: Radeon RX 9070 XT (the user's machine). 
+- Resolution: 1440p. +- Quality preset: High (default). +- Connection: live ACE at `127.0.0.1:9000`, character `+Acdream` at Holtburg. +- Sky: clear midday, controlled via `F7` to remove weather noise. +- Build: Debug (matches the user's normal launch). +- Date measured: 2026-05-11. + +### Measurements + +Three radii: 4, 8, 12. Two motion modes per radius: standstill (camera anchored 30 s) and walking (`+Acdream` walks N→E→S→W across one landblock, 30 s). + +Per radius/mode, capture from `[WB-DIAG]` and the window title: +- CPU dispatcher: `cpu_us` median, p95. +- GPU dispatcher: `gpu_us` median, p95 (now real). +- FPS. +- Entities seen / drawn. +- Groups. +- Frame time (window title). + +### Memory snapshot + +One-time output from the `ACDREAM_DUMP_SURFACES=1` run, summarized: +- Total surfaces in cache. +- Total GPU texture bytes. +- Dimension distribution (top 10 by count). +- Format distribution. +- Atlas-opportunity score: percentage of surfaces in the top-3 dimension buckets. + +### Conclusion section + +A recommendation paragraph addressing: +1. Is the entity dispatcher CPU-bound or GPU-bound at radius=12? +2. Does `gpu_us` p95 leave headroom or is the GPU saturated? +3. Does the atlas-opportunity score justify slice-2 atlas work? +4. Given (1)–(3), what should the next phase be? Slice 2 (perf cleanup), C.1.5 (PES emitter wiring), or escalation to Tier 2 (static/dynamic split)? + +The paragraph is opinionated — the next phase decision should be obvious from the numbers, not require a separate debate. + +--- + +## §7. Test plan + +### Automated tests (none new) + +This slice is intentionally test-light: +- The GPU-timing fix has no observable behavior in tests — it only changes a diagnostic readout. No new unit tests. +- The surface-dump path is env-gated diag; no need to lock its output format in tests. +- Existing 1688 tests must remain green. `WbDrawDispatcher` tests (bucketing, indirect-command construction, classification cache) must not be perturbed. 
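One invariant worth spelling out, since the slice leans on it instead of new tests: dropped GPU samples must not skew the printed median. Below is a small Python model of the `MedianMicros` semantics the spec describes (median over the non-zero subset of the sample ring); the even-count convention is an assumption for the model, not a transcription of the C# helper.

```python
# Model of the MedianMicros behavior the spec relies on: the median is
# taken over the non-zero subset of the sample ring, so samples dropped
# because ResultAvailable==0 (left as 0) cannot drag gpu_us toward zero.
def median_micros(samples):
    live = sorted(s for s in samples if s != 0)
    if not live:
        return 0  # all-zero ring: the pre-fix symptom, printed as 0m
    mid = len(live) // 2
    if len(live) % 2 == 1:
        return live[mid]
    return (live[mid - 1] + live[mid]) // 2  # even count: average (assumed convention)

ring = [0] * 256
assert median_micros(ring) == 0         # gpu_us=0m/0p95: the reported bug

ring[:6] = [120, 0, 130, 125, 0, 128]   # two dropped samples among four real ones
assert median_micros(ring) == 126       # zeros excluded; median stays sane
```

This is why "if avail==0, skip the sample" needs no compensating logic anywhere else in the dispatcher.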
+ +### Manual verification + +1. Launch live with `ACDREAM_WB_DIAG=1`. Walk Holtburg for ~30 s. Confirm `[WB-DIAG]` prints `gpu_us=Xm/Yp95` with X > 0 within ~5 s. +2. Launch live with `ACDREAM_DUMP_SURFACES=1 ACDREAM_WB_DIAG=1`. Wait ~10 s for streaming to settle. Open `%LOCALAPPDATA%\acdream\n6-surfaces.txt`. Confirm it contains a non-empty histogram. +3. Run the baseline measurement procedure end-to-end. Confirm the document populates with real numbers, not placeholders. + +--- + +## §8. Sequencing / ship gates + +### Commit 1 — GPU query fix + +**Message:** `feat(perf): Phase N.6 slice 1 — fix gpu_us double-buffering in WbDrawDispatcher` + +**Scope:** `WbDrawDispatcher.cs` changes only. Build green, tests green, manual verification step 1 from §7 passes. + +**Gate:** if `gpu_us` still reports 0 after ~10 s of movement, do NOT proceed to commit 2. Bump ring depth to 4 or investigate driver behavior before continuing. + +### Commit 2 — Baseline doc + surface dump + +**Message:** `docs(perf): Phase N.6 slice 1 — radius=12 baseline + surface dump path` + +**Scope:** `TextureCache.cs` dump method, `GameWindow.cs` hook, `docs/plans/2026-05-11-phase-n6-perf-baseline.md`, and the roadmap amendment at `docs/plans/2026-04-11-roadmap.md` lines 690-705 (split N.6 into slice 1 / slice 2 in the bullet list). + +**Gate:** manual verification steps 2 and 3 from §7 pass; baseline document's conclusion paragraph is filled in (not "TBD"); roadmap update lands in the same commit. + +--- + +## §9. Acceptance criteria + +1. `[WB-DIAG]` reports non-zero `gpu_us` for the entity dispatcher's opaque+transparent passes at Holtburg radius=12 with `ACDREAM_WB_DIAG=1`. +2. The fix uses only core OpenGL 3.3+ features (`GL_TIME_ELAPSED`, `glGetQueryObject`, `GL_QUERY_RESULT_AVAILABLE`). No vendor-specific extensions. +3. 
`docs/plans/2026-05-11-phase-n6-perf-baseline.md` exists, contains numbers (not placeholders) for the 3 radii × 2 motion modes, contains the surface histogram summary, and closes with a recommendation paragraph. +4. The roadmap entry at `docs/plans/2026-04-11-roadmap.md:690-705` is amended to reflect the slice split. +5. `dotnet build` succeeds with no new warnings. +6. `dotnet test` succeeds with the existing pass/fail baseline (1688 passing, ~8 pre-existing physics/input failures unchanged). +7. No visible regression in the rendering path — Holtburg outdoor, day/night cycle, entity rendering, transparent surfaces all look the same as before the change. + +--- + +## §10. Risks + +| Risk | Likelihood | Mitigation | +|---|---|---| +| `ResultAvailable` is 0 even for frame N-3 (driver queues 4+ frames ahead) | Low — would be unusual on desktop GL | Sample is dropped silently; diagnostic prints zeros; user reports it. Fix: bump `GpuQueryRingDepth` to 4. No regression in the render path itself. | +| Query-pair allocation leaks across init/Dispose cycles | Low | Dispose loop deletes the full ring; existing pattern just gains an array index. | +| Surface-dump path fires before streaming settles, gets a sparse picture | Medium | Document the procedure as "wait ~10 s after entering world before reading the file." The dump path itself can also be re-runnable if needed (deferred unless slice 1 hits this in practice). | +| Conclusion paragraph in the baseline document is hard to write because the numbers don't clearly favor one direction | Medium — this is the slice's whole purpose | Acknowledge the ambiguity in the document and propose a "slice 1 conclusion plus a short re-brainstorm with the user" flow. The slice still ships if the numbers force a re-brainstorm; the value is in having the numbers, not in pre-deciding the answer. 
| +| Hidden vendor-specific behavior in `GL_TIME_ELAPSED` produces non-comparable numbers across hardware | Low — `GL_TIME_ELAPSED` is nanosecond-accurate per spec | Document the measurement hardware explicitly in the baseline doc setup section so future runs on different GPUs can be tagged appropriately. | + +--- + +## §11. Out of scope / future work + +These are explicitly NOT in slice 1, listed here so the next phase has a clean shopping list: + +- **Slice 2 — `TextureCache` cleanup.** Delete orphan `mesh.frag` (verify zero callers post-N.5 amendment). Delete dead entity-style legacy caches (`_handlesByOverridden`, `_handlesByPalette`) that no live renderer reads. Decide on bindless-everywhere vs legacy-island for the remaining `sampler2D` consumers (sky, UI text, particles). +- **Slice 2 — Particle shader migration.** Tied to C.1.5 outcome; particles migrate after C.1.5 lands more visible content to regression-test against. +- **Slice 2 — Persistent-mapped buffers.** Conditional on slice 1's baseline showing `BufferSubData` as a hot spot. +- **Slice 2 — WB atlas adoption.** Conditional on slice 1's surface histogram showing a real opportunity. +- **C.1.5 — PES emitter wiring.** Portals, chimneys, fireplaces. Separate phase; gets its own brainstorm/spec. +- **Tier 2 — static/dynamic split with persistent groups.** Separate roadmap at [docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md](../../plans/2026-05-10-perf-tiers-2-3-roadmap.md). +- **Tier 3 — GPU compute culling.** Depends on Tier 2 first. Same roadmap. +- **Cross-vendor perf comparison.** Slice 1 is one machine. A NVIDIA companion run is a backlog item, not in scope. + +--- + +## §12. References + +- Existing dispatcher code: [src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs](../../../src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs). +- Existing texture cache: [src/AcDream.App/Rendering/TextureCache.cs](../../../src/AcDream.App/Rendering/TextureCache.cs). 
+- Prior perf baseline (style template): [docs/plans/2026-05-09-phase-n5b-perf-baseline.md](../../plans/2026-05-09-phase-n5b-perf-baseline.md). +- Roadmap N.6 entry: [docs/plans/2026-04-11-roadmap.md:690-705](../../plans/2026-04-11-roadmap.md). +- Perf tiers 2/3 alternative path: [docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md](../../plans/2026-05-10-perf-tiers-2-3-roadmap.md). +- Phase C.1 plan with C.1.5 scope: [docs/plans/2026-04-27-phase-c1-pes-particles.md:285-295](../../plans/2026-04-27-phase-c1-pes-particles.md). From a4931eeaa2a2a3ea65b0a340883c30d6a83c6024 Mon Sep 17 00:00:00 2001 From: Erik Date: Mon, 11 May 2026 11:12:26 +0200 Subject: [PATCH 2/7] =?UTF-8?q?docs(perf):=20Phase=20N.6=20slice=201=20?= =?UTF-8?q?=E2=80=94=20implementation=20plan?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Step-by-step plan for the two-commit slice: fix WbDrawDispatcher's gpu_us double-buffering bug (ring-of-3 query slots, read-before-overwrite, vendor-neutral) then capture the radius=12 baseline at Holtburg with the now-working diagnostic. Includes exact old_string/new_string Edit patterns for every code change, PowerShell launch + measurement procedure for the manual baseline, baseline doc template with explicit fill-in slots, and a per-criterion acceptance checklist. Output companion to docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (commit 05d590c). 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../plans/2026-05-11-phase-n6-slice1.md | 912 ++++++++++++++++++ 1 file changed, 912 insertions(+) create mode 100644 docs/superpowers/plans/2026-05-11-phase-n6-slice1.md diff --git a/docs/superpowers/plans/2026-05-11-phase-n6-slice1.md b/docs/superpowers/plans/2026-05-11-phase-n6-slice1.md new file mode 100644 index 0000000..0270b2f --- /dev/null +++ b/docs/superpowers/plans/2026-05-11-phase-n6-slice1.md @@ -0,0 +1,912 @@ +# Phase N.6 slice 1 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Fix the broken `gpu_us` diagnostic in `WbDrawDispatcher` (vendor-neutral OpenGL query ring) and produce one authoritative perf baseline document at Holtburg radius=12 so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) is grounded in real numbers. + +**Architecture:** Two commits. Commit 1 changes only `WbDrawDispatcher.cs` — replaces the two `uint` GL query handles with ring-of-3 arrays and moves the result read to *before* the next frame overwrites the slot (read frame N-3's queries, then overwrite). Commit 2 adds an env-gated surface-format histogram dump in `TextureCache.cs`, captures the actual measurement, writes the baseline doc, and amends the roadmap entry. No new automated tests — the GPU-timing fix has no observable behavior in tests, and the dump path is env-gated diagnostic only; verification is manual launch-and-look. + +**Tech Stack:** C# / .NET 10, Silk.NET (OpenGL 4.3+), `dotnet build` / `dotnet test` from PowerShell, live ACE on `127.0.0.1:9000` for in-world verification. + +**Spec:** [docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md](../specs/2026-05-11-phase-n6-slice1-design.md) (committed at `05d590c`). 
+ +--- + +## File Structure + +| File | Action | Responsibility | +|---|---|---| +| [`src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`](../../../src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs) | Modify | Replace 2 `uint` query handles with ring-of-3 arrays; move query result read to before next-frame overwrite. | +| [`src/AcDream.App/Rendering/TextureCache.cs`](../../../src/AcDream.App/Rendering/TextureCache.cs) | Modify | Add upload-time dimension/format tracking + env-gated `TickSurfaceHistogramDumpIfEnabled()` method that fires once at frame 600. | +| [`src/AcDream.App/Rendering/GameWindow.cs`](../../../src/AcDream.App/Rendering/GameWindow.cs) | Modify | Call `_textureCache.TickSurfaceHistogramDumpIfEnabled()` once per frame in `OnRender`. | +| `docs/plans/2026-05-11-phase-n6-perf-baseline.md` | Create | Baseline measurement doc: setup, numbers at radii 4/8/12 (standstill + walking), surface histogram summary, conclusion paragraph recommending next phase. | +| [`docs/plans/2026-04-11-roadmap.md`](../../plans/2026-04-11-roadmap.md) lines 690-705 | Modify | Amend N.6 entry to reflect the slice 1 / slice 2 split. | + +--- + +## Task 1: GPU query ring buffering (commit 1) + +**Files:** +- Modify: `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs` + +The five edit zones are well-isolated by exact strings. Apply them in order — do NOT reorder; the build won't fail mid-way but the resulting code is easier to review if applied as documented. 
+ +- [ ] **Step 1.1: Replace the field declarations (~line 155)** + +Use Edit to replace the existing field block: + +**old_string:** +```csharp + private uint _gpuQueryOpaque; + private uint _gpuQueryTransparent; + private readonly long[] _gpuSamples = new long[256]; // microseconds + private int _gpuSampleCursor; + private bool _gpuQueriesInitialized; +``` + +**new_string:** +```csharp + // GPU timing uses a ring of 3 query-pair slots so the read of frame N-3's + // result lands when the GPU has finished (~50ms after issue on a typical + // 60fps frame). Ring of 3 is the vendor-neutral choice: NVIDIA drivers with + // triple-buffering+vsync can queue ~3 frames ahead, AMD typically 1-2, + // Intel iGPUs vary. ResultAvailable is the safety guard if the GPU is + // still working when we try to read. + private const int GpuQueryRingDepth = 3; + private readonly uint[] _gpuQueryOpaque = new uint[GpuQueryRingDepth]; + private readonly uint[] _gpuQueryTransparent = new uint[GpuQueryRingDepth]; + private int _gpuQueryFrameIndex; + private readonly long[] _gpuSamples = new long[256]; // microseconds + private int _gpuSampleCursor; + private bool _gpuQueriesInitialized; +``` + +- [ ] **Step 1.2: Replace the init block (~line 347)** + +**old_string:** +```csharp + if (diag && !_gpuQueriesInitialized) + { + _gpuQueryOpaque = _gl.GenQuery(); + _gpuQueryTransparent = _gl.GenQuery(); + _gpuQueriesInitialized = true; + } +``` + +**new_string:** +```csharp + if (diag && !_gpuQueriesInitialized) + { + for (int i = 0; i < GpuQueryRingDepth; i++) + { + _gpuQueryOpaque[i] = _gl.GenQuery(); + _gpuQueryTransparent[i] = _gl.GenQuery(); + } + _gpuQueriesInitialized = true; + } +``` + +- [ ] **Step 1.3: Insert the read-before-overwrite block + compute slot just before the opaque query begin (~line 774)** + +This step replaces the existing single-line `BeginQuery` for opaque with a block that first computes the slot, reads the slot's frame N-3 result (gated on having completed one ring), 
then issues the new query into the same slot. + +**old_string:** +```csharp + _gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer); + if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque); +``` + +**new_string:** +```csharp + _gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer); + + // GPU timing: compute this frame's ring slot. We read frame N-3's + // result (the oldest data in the ring) before overwriting it with + // frame N's queries. See spec §3 Q1/Q2 + §4 in + // docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md. + int gpuQuerySlot = _gpuQueryFrameIndex % GpuQueryRingDepth; + if (_gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth) + { + _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot], QueryObjectParameterName.ResultAvailable, out int avail); + if (avail != 0) + { + _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot], QueryObjectParameterName.Result, out ulong opaqueNs); + _gl.GetQueryObject(_gpuQueryTransparent[gpuQuerySlot], QueryObjectParameterName.Result, out ulong transNs); + long gpuUs = (long)((opaqueNs + transNs) / 1000UL); + _gpuSamples[_gpuSampleCursor] = gpuUs; + _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length; + } + // If avail==0 the sample is dropped silently. MedianMicros + // computes over the non-zero subset, so dropped samples don't + // poison the median. 
+ } + + if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque[gpuQuerySlot]); +``` + +- [ ] **Step 1.4: Update the transparent query begin to use the same slot (~line 823)** + +**old_string:** +```csharp + if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent); +``` + +**new_string:** +```csharp + if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent[gpuQuerySlot]); +``` + +- [ ] **Step 1.5: Replace the buggy in-frame read block + increment frame counter (~line 849)** + +**old_string:** +```csharp + // Read GPU samples non-blocking; the result for the previous frame's + // queries should be ready by now. If not, drop the sample (don't stall + // the CPU waiting for the GPU). + if (_gpuQueriesInitialized) + { + _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.ResultAvailable, out int avail); + if (avail != 0) + { + _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.Result, out ulong opaqueNs); + _gl.GetQueryObject(_gpuQueryTransparent, QueryObjectParameterName.Result, out ulong transNs); + long gpuUs = (long)((opaqueNs + transNs) / 1000UL); + _gpuSamples[_gpuSampleCursor] = gpuUs; + _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length; + } + } + + _drawsIssued += _opaqueDrawCount + _transparentDrawCount; +``` + +**new_string:** +```csharp + // GPU sample read happens BEFORE issuing the next frame's queries + // (see step 1.3 above). Increment the frame counter here so the + // next call computes a fresh slot. 
+ if (_gpuQueriesInitialized) _gpuQueryFrameIndex++; + + _drawsIssued += _opaqueDrawCount + _transparentDrawCount; +``` + +- [ ] **Step 1.6: Update Dispose to delete the full ring (~line 1140)** + +**old_string:** +```csharp + if (_gpuQueriesInitialized) + { + _gl.DeleteQuery(_gpuQueryOpaque); + _gl.DeleteQuery(_gpuQueryTransparent); + } +``` + +**new_string:** +```csharp + if (_gpuQueriesInitialized) + { + for (int i = 0; i < GpuQueryRingDepth; i++) + { + _gl.DeleteQuery(_gpuQueryOpaque[i]); + _gl.DeleteQuery(_gpuQueryTransparent[i]); + } + } +``` + +- [ ] **Step 1.7: Build** + +Run from the worktree root: + +```powershell +dotnet build +``` + +Expected: build succeeds with no new warnings or errors. If the build fails, the most likely cause is a missed string in one of the steps above — re-grep `_gpuQueryOpaque` and `_gpuQueryTransparent` in `WbDrawDispatcher.cs` and confirm every reference uses the array-indexed form `[gpuQuerySlot]` or `[i]`. + +- [ ] **Step 1.8: Run the test suite** + +```powershell +dotnet test --no-build +``` + +Expected: same pass/fail baseline as before the change (~1688 passing, ~8 pre-existing physics/input failures unchanged). No new failures. + +- [ ] **Step 1.9: Manual verification — launch live and confirm `gpu_us` reports non-zero** + +```powershell +$env:ACDREAM_DAT_DIR = "$env:USERPROFILE\Documents\Asheron's Call" +$env:ACDREAM_LIVE = "1" +$env:ACDREAM_TEST_HOST = "127.0.0.1" +$env:ACDREAM_TEST_PORT = "9000" +$env:ACDREAM_TEST_USER = "testaccount" +$env:ACDREAM_TEST_PASS = "testpassword" +$env:ACDREAM_WB_DIAG = "1" +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "task1-verify.log" +``` + +In-world: walk Holtburg for ~30 seconds. Close the window when done. 
+ +Verification check on `task1-verify.log`: + +```powershell +Select-String -Path task1-verify.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 5 +``` + +Expected output: at least one `[WB-DIAG]` line where `gpu_us=Xm/Yp95` has X > 0 (typically tens to low-hundreds of microseconds at radius=4-12 on a modern GPU). If `gpu_us=0m/0p95` persists for the entire run, the fix didn't take — check whether the build actually rebuilt (try `dotnet build -c Debug` then re-launch). + +Also confirm: no visible regression in the client. Entities render, animations play, sky cycles. Close the client cleanly. + +- [ ] **Step 1.10: Commit** + +```powershell +git add src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs +git commit -m @' +feat(perf): Phase N.6 slice 1 — fix gpu_us double-buffering in WbDrawDispatcher + +The dispatcher's GPU TimeElapsed queries were polled in the same frame +as the indirect draw, so glGetQueryObject(ResultAvailable) always +returned 0 and gpu_us in [WB-DIAG] was stuck at 0m/0p95. + +Replace the 2 single-handle queries with ring-of-3 arrays and move the +result read to BEFORE issuing the next frame's queries into the same +slot — at frame N we read slot N%3 which holds frame N-3's queries +(oldest in the ring, ~50ms old at 60fps and definitely done across all +desktop GL drivers). Vendor-neutral: AMD/NVIDIA/Intel desktop GL all +work without driver-specific code. + +No new tests — the change is purely a diagnostic readout fix, no +observable behavior in the rendering path. Manual verification: +[WB-DIAG] now reports non-zero gpu_us at Holtburg radius=12. + +Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (§4). + +Co-Authored-By: Claude Opus 4.7 (1M context) +'@ +git status +``` + +Expected: clean working tree after commit. Note the new commit SHA — needed for the baseline doc's "measured against" reference. 
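Before starting Task 2, it can help to convince yourself of the slot arithmetic the Task 1 fix relies on. A minimal model of the read-before-overwrite ring (Python, illustration only; `frame_trace` and `RING_DEPTH` are our names, not identifiers from `WbDrawDispatcher.cs`):

```python
# Model of the ring-of-3 TimeElapsed query slots from Task 1.
# Illustrative only: frame_trace is our name; RING_DEPTH mirrors
# GpuQueryRingDepth in the C# change.
RING_DEPTH = 3

def frame_trace(n_frames):
    """Return (frame, measured_frame) pairs: which frame's queries each
    per-frame read observes under read-before-overwrite."""
    slot_contents = [None] * RING_DEPTH  # frame whose queries sit in each slot
    reads = []
    for frame in range(n_frames):
        slot = frame % RING_DEPTH
        # Read BEFORE overwriting, and only after the warm-up frames
        # (the C# guard: _gpuQueryFrameIndex >= GpuQueryRingDepth).
        if frame >= RING_DEPTH:
            reads.append((frame, slot_contents[slot]))
        slot_contents[slot] = frame  # issue frame N's queries into the slot
    return reads

if __name__ == "__main__":
    for frame, measured in frame_trace(8):
        assert measured == frame - RING_DEPTH  # always the oldest slot
        print(f"frame {frame}: slot {frame % RING_DEPTH} holds frame {measured}'s timing")
```

Every read lands on the queries issued `RING_DEPTH` frames earlier (~50 ms at 60 fps), which is why `ResultAvailable` is almost always true and the diagnostic stops dropping samples.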
+
+---
+
+## Task 2: Surface-format histogram dump path (part of commit 2 setup)
+
+**Files:**
+- Modify: `src/AcDream.App/Rendering/TextureCache.cs`
+- Modify: `src/AcDream.App/Rendering/GameWindow.cs`
+
+This task adds the env-gated one-shot dump infrastructure. It does NOT commit — the commit happens in Task 4 after the baseline document is also ready.
+
+- [ ] **Step 2.1: Add upload-time metadata tracking in `TextureCache.cs`**
+
+Add a new private dictionary that records `(width, height, formatLabel)` keyed by GL texture name. This lets `DumpSurfaceHistogram` emit dimension/format data without re-querying GL.
+
+Use Edit to insert the field right after the existing bindless cache fields (~line 41, just after `_bindlessByPalette`):
+
+**old_string:**
+```csharp
+    private readonly Dictionary<(uint surfaceId, uint origTexOverride, ulong paletteHash), (uint Name, ulong Handle)> _bindlessByPalette = new();
+
+    public TextureCache(GL gl, DatCollection dats, Wb.BindlessSupport? bindless = null)
+```
+
+**new_string:**
+```csharp
+    private readonly Dictionary<(uint surfaceId, uint origTexOverride, ulong paletteHash), (uint Name, ulong Handle)> _bindlessByPalette = new();
+
+    // Phase N.6 slice 1 (2026-05-11): per-upload metadata for the
+    // ACDREAM_DUMP_SURFACES=1 histogram dump path. Populated at upload
+    // time so the dump method doesn't have to query GL state. Keyed by
+    // GL texture name (same key used in cache value tuples). Format
+    // label is "RGBA8_DECODED" for the post-decode upload (all uploads
+    // currently land as RGBA8 regardless of source format).
+    private readonly Dictionary<uint, (int Width, int Height, string Format)> _uploadMetadata = new();
+
+    // Frame counter for the one-shot ACDREAM_DUMP_SURFACES=1 trigger.
+    // Increments per Tick call; fires the dump once at frame index 600
+    // and never again for the session. See spec §5.
+    private int _dumpFrameCounter;
+    private bool _surfaceHistogramAlreadyDumped;
+
+    public TextureCache(GL gl, DatCollection dats, Wb.BindlessSupport? bindless = null)
+```
+
+- [ ] **Step 2.2: Find the `UploadRgba8AsLayer1Array` method and record metadata there**
+
+Locate the method using Grep:
+
+```
+pattern: "UploadRgba8AsLayer1Array"
+path: src/AcDream.App/Rendering/TextureCache.cs
+output_mode: content
+-n: true
+```
+
+Read the method body (typically ~30-50 lines) to find the exact `return name;` line. The decoded texture has `decoded.Width`, `decoded.Height`, and `decoded.Rgba8` available.
+
+For each `return name;` in `UploadRgba8AsLayer1Array(DecodedTexture decoded)`, insert this line immediately before it:
+
+```csharp
+    _uploadMetadata[name] = (decoded.Width, decoded.Height, "RGBA8_DECODED");
+```
+
+If the method has only one `return name;` near its end, that's a single Edit. Use the surrounding 2-3 lines of context in `old_string` to make the Edit unique.
+
+- [ ] **Step 2.3: Also record metadata in the legacy `UploadRgba8` (non-bindless) path**
+
+Locate the method:
+
+```
+pattern: "private uint UploadRgba8\b"
+path: src/AcDream.App/Rendering/TextureCache.cs
+output_mode: content
+-n: true
+```
+
+Apply the same `_uploadMetadata[name] = (decoded.Width, decoded.Height, "RGBA8_DECODED");` insertion before each `return name;` in `UploadRgba8(DecodedTexture decoded)`. This ensures the dump captures both legacy and modern uploads.
+
+- [ ] **Step 2.4: Add the `TickSurfaceHistogramDumpIfEnabled` public method to `TextureCache.cs`**
+
+Locate `HashPaletteOverride` using Grep:
+
+```
+pattern: "internal static ulong HashPaletteOverride"
+path: src/AcDream.App/Rendering/TextureCache.cs
+output_mode: content
+-n: true
+-A: 20
+```
+
+Identify its closing brace. Use Edit with surrounding context to insert the new methods immediately after.
+
+**old_string:** (the last few lines of `HashPaletteOverride`):
+```csharp
+        foreach (var sp in p.SubPalettes)
+        {
+            h = (h ^ sp.SubPaletteId) * prime;
+            h = (h ^ sp.Offset) * prime;
+            h = (h ^ sp.Length) * prime;
+        }
+        return h;
+    }
+```
+
+**new_string:**
+```csharp
+        foreach (var sp in p.SubPalettes)
+        {
+            h = (h ^ sp.SubPaletteId) * prime;
+            h = (h ^ sp.Offset) * prime;
+            h = (h ^ sp.Length) * prime;
+        }
+        return h;
+    }
+
+    /// <summary>
+    /// Phase N.6 slice 1: one-shot surface-format histogram dump for the
+    /// atlas-opportunity audit. Activated by ACDREAM_DUMP_SURFACES=1; fires
+    /// once at frame 600 of the session (~10s at 60fps, ~3s at 200fps —
+    /// both well past streaming settle at radius≤12). Output goes to
+    /// %LOCALAPPDATA%\acdream\n6-surfaces.txt. Zero cost when off.
+    /// See spec §5 in docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md.
+    /// </summary>
+    public void TickSurfaceHistogramDumpIfEnabled()
+    {
+        if (_surfaceHistogramAlreadyDumped) return;
+        if (!string.Equals(Environment.GetEnvironmentVariable("ACDREAM_DUMP_SURFACES"), "1", StringComparison.Ordinal)) return;
+        _dumpFrameCounter++;
+        if (_dumpFrameCounter < 600) return;
+
+        DumpSurfaceHistogram();
+        _surfaceHistogramAlreadyDumped = true;
+    }
+
+    private void DumpSurfaceHistogram()
+    {
+        var localAppData = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData);
+        var outDir = System.IO.Path.Combine(localAppData, "acdream");
+        System.IO.Directory.CreateDirectory(outDir);
+        var outPath = System.IO.Path.Combine(outDir, "n6-surfaces.txt");
+
+        var sb = new System.Text.StringBuilder();
+        sb.AppendLine($"# acdream surface-format histogram — generated {DateTime.UtcNow:yyyy-MM-ddTHH:mm:ssZ}");
+        sb.AppendLine("# Per-entry: surfaceId(hex), width, height, format, byteCount");
+        sb.AppendLine();
+
+        // Walk every cached entry across the 6 caches, dedupe by GL name.
+        var seen = new HashSet<uint>();
+        long totalBytes = 0;
+        var bucketsByDim = new Dictionary<(int W, int H), int>();
+        var bucketsByFormat = new Dictionary<string, int>();
+        var bucketsByTriple = new Dictionary<(int W, int H, string F), int>();
+
+        void Emit(uint surfaceId, uint name)
+        {
+            if (!seen.Add(name)) return;
+            if (!_uploadMetadata.TryGetValue(name, out var meta)) return;
+            int bytes = meta.Width * meta.Height * 4;
+            totalBytes += bytes;
+            sb.AppendLine($"0x{surfaceId:X8}, {meta.Width}, {meta.Height}, {meta.Format}, {bytes}");
+
+            var dimKey = (meta.Width, meta.Height);
+            bucketsByDim[dimKey] = bucketsByDim.GetValueOrDefault(dimKey) + 1;
+            bucketsByFormat[meta.Format] = bucketsByFormat.GetValueOrDefault(meta.Format) + 1;
+            var tripleKey = (meta.Width, meta.Height, meta.Format);
+            bucketsByTriple[tripleKey] = bucketsByTriple.GetValueOrDefault(tripleKey) + 1;
+        }
+
+        foreach (var kv in _handlesBySurfaceId) Emit(kv.Key, kv.Value);
+        foreach (var kv in _handlesByOverridden) Emit(kv.Key.surfaceId, kv.Value);
+        foreach (var kv in _handlesByPalette) Emit(kv.Key.surfaceId, kv.Value);
+        foreach (var kv in _bindlessBySurfaceId) Emit(kv.Key, kv.Value.Name);
+        foreach (var kv in _bindlessByOverridden) Emit(kv.Key.surfaceId, kv.Value.Name);
+        foreach (var kv in _bindlessByPalette) Emit(kv.Key.surfaceId, kv.Value.Name);
+
+        sb.AppendLine();
+        sb.AppendLine("# Rollups");
+        sb.AppendLine($"# Total unique GL textures: {seen.Count}");
+        sb.AppendLine($"# Total bytes (sum of W*H*4): {totalBytes}");
+
+        sb.AppendLine("# Top 10 (W,H) dimension buckets:");
+        foreach (var kv in bucketsByDim.OrderByDescending(kv => kv.Value).Take(10))
+            sb.AppendLine($"# {kv.Key.W}x{kv.Key.H}: {kv.Value}");
+
+        sb.AppendLine("# Format buckets:");
+        foreach (var kv in bucketsByFormat.OrderByDescending(kv => kv.Value))
+            sb.AppendLine($"# {kv.Key}: {kv.Value}");
+
+        sb.AppendLine("# Top 10 (W,H,format) triples — atlas-opportunity input:");
+        foreach (var kv in bucketsByTriple.OrderByDescending(kv => kv.Value).Take(10))
+            sb.AppendLine($"# {kv.Key.W}x{kv.Key.H} {kv.Key.F}: {kv.Value}");
+
+        System.IO.File.WriteAllText(outPath, sb.ToString());
+        Console.WriteLine($"[N6-DUMP] Surface histogram written to {outPath} ({seen.Count} textures, {totalBytes} bytes)");
+    }
+```
+
+- [ ] **Step 2.5: Confirm `using System.Linq;` is present in `TextureCache.cs`**
+
+Read the file's `using` section (top of file). If `using System.Linq;` is NOT present, add it. The `OrderByDescending` and `Take` calls in `DumpSurfaceHistogram` need it.
+
+Pattern:
+```
+pattern: "^using System\.Linq"
+path: src/AcDream.App/Rendering/TextureCache.cs
+output_mode: count
+```
+
+If count is 0, add `using System.Linq;` in alphabetical order with the other usings at the top of the file.
+
+- [ ] **Step 2.6: Add the per-frame call site in `GameWindow.cs`**
+
+Find a stable insertion point near the top of `OnRender` (starts at line 6288). Use Grep:
+
+```
+pattern: "_gl!\.Clear\("
+path: src/AcDream.App/Rendering/GameWindow.cs
+output_mode: content
+-n: true
+-A: 3
+```
+
+This finds the `Clear` call(s) in or near `OnRender`. The first one after line 6288 is where you want to insert. Read 5 lines of context around it, then Edit to insert the dump tick on the line immediately after the `Clear` call returns:
+
+The insertion (one Edit):
+
+**old_string:** (find the `Clear` call in `OnRender` and capture 1-2 lines of its context — varies; common pattern is `_gl!.Clear(ClearBufferMask.ColorBufferBit | ClearBufferMask.DepthBufferBit);` followed by the next line of `OnRender` work).
+
+**new_string:** the same `Clear` call followed by:
+```csharp
+
+        // Phase N.6 slice 1: one-shot surface-format histogram dump under
+        // ACDREAM_DUMP_SURFACES=1. Zero cost when off.
+        _textureCache?.TickSurfaceHistogramDumpIfEnabled();
+```
+
+If `OnRender` has multiple `Clear` calls, place the tick after the first one inside the method body. 
The call must run exactly once per frame, before any rendering work — placing it right after `Clear` accomplishes both. + +- [ ] **Step 2.7: Build** + +```powershell +dotnet build +``` + +Expected: build succeeds with no new warnings. If a "name 'OrderByDescending' does not exist in current context" error appears, Step 2.5 was missed — add the `using System.Linq;` and rebuild. + +- [ ] **Step 2.8: Run the test suite** + +```powershell +dotnet test --no-build +``` + +Expected: same pass/fail baseline (~1688 passing, ~8 pre-existing failures). No new failures. + +- [ ] **Step 2.9: Manual verification — confirm the dump file appears** + +Launch with the dump env var on: + +```powershell +$env:ACDREAM_DUMP_SURFACES = "1" +$env:ACDREAM_WB_DIAG = "1" +# Other env vars same as Task 1 Step 1.9 +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "task2-verify.log" +``` + +Wait ~15 seconds after the window appears, then close it. Check the file: + +```powershell +Get-Content "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" | Select-Object -First 30 +``` + +Expected: a non-empty file with the header, per-entry rows, and rollup sections. Also confirm one `[N6-DUMP] Surface histogram written to ...` line in `task2-verify.log` (just before window close). + +If the file is empty or missing: +- Check the launch log for the `[N6-DUMP]` line. +- If it's not there, `_dumpFrameCounter` didn't reach 600 — the user closed too early. Re-run and wait longer. +- If it's there but the file lookup fails, the path output in the log should show what was actually written; investigate that path. + +**Do not commit yet.** Continue to Task 3. + +--- + +## Task 3: Capture baseline measurements + +**Files:** +- Create: `docs/plans/2026-05-11-phase-n6-perf-baseline.md` (final content lands in Task 4 — this task just collects the numbers). + +This is the manual measurement task. 
Each step launches the client, runs a specific scenario, and captures the diagnostic output. Save each log separately for the final write-up. Total expected time: ~30-45 min. + +Setup once per session: +```powershell +$env:ACDREAM_DAT_DIR = "$env:USERPROFILE\Documents\Asheron's Call" +$env:ACDREAM_LIVE = "1" +$env:ACDREAM_TEST_HOST = "127.0.0.1" +$env:ACDREAM_TEST_PORT = "9000" +$env:ACDREAM_TEST_USER = "testaccount" +$env:ACDREAM_TEST_PASS = "testpassword" +$env:ACDREAM_WB_DIAG = "1" +``` + +For each measurement run, set `ACDREAM_STREAM_RADIUS` before launch. Use the `QualityPreset=High` default (no overrides). All runs at Holtburg with `+Acdream` at clear midday (cycle weather with F10 → Clear, time with F7 → Noon). + +Per run, after ~30 seconds at the target condition, close the window and grep the log for the last 3 `[WB-DIAG]` lines — those have the steady-state numbers. + +- [ ] **Step 3.1: Capture radius=4 standstill** + +```powershell +$env:ACDREAM_STREAM_RADIUS = "4" +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r4-stand.log" +``` + +In-world: enter world, do not move, hold position for 30 seconds. Close. + +```powershell +Select-String -Path baseline-r4-stand.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 3 +``` + +Record from the median of the last 3 lines: `cpu_us`, `gpu_us`, `entSeen`, `entDrawn`, `groups`. Also note the window-title FPS shown during the test. + +- [ ] **Step 3.2: Capture radius=4 walking** + +```powershell +$env:ACDREAM_STREAM_RADIUS = "4" +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r4-walk.log" +``` + +In-world: enter world, Tab to player mode, walk N→E→S→W across one landblock over ~30 seconds. Close. + +Capture same numbers as 3.1. 
+ +- [ ] **Step 3.3: Capture radius=8 standstill** + +```powershell +$env:ACDREAM_STREAM_RADIUS = "8" +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r8-stand.log" +``` + +Same procedure as 3.1. Wait ~40 seconds before recording (streaming takes longer to settle). + +- [ ] **Step 3.4: Capture radius=8 walking** + +```powershell +$env:ACDREAM_STREAM_RADIUS = "8" +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r8-walk.log" +``` + +Same procedure as 3.2. + +- [ ] **Step 3.5: Capture radius=12 standstill** + +```powershell +$env:ACDREAM_STREAM_RADIUS = "12" +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r12-stand.log" +``` + +Same procedure as 3.1. Wait ~60 seconds before recording. This is the headline measurement — pay attention to whether `gpu_us` p95 is well below 16.6 ms (60 fps target) or pushing it. + +- [ ] **Step 3.6: Capture radius=12 walking** + +```powershell +$env:ACDREAM_STREAM_RADIUS = "12" +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r12-walk.log" +``` + +Same procedure as 3.2 (walking across one landblock, ~30 seconds of motion within the 60s+ window). + +- [ ] **Step 3.7: Capture the surface histogram** + +```powershell +$env:ACDREAM_STREAM_RADIUS = "12" +$env:ACDREAM_DUMP_SURFACES = "1" +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-surfaces.log" +``` + +In-world: enter world at Holtburg, do nothing for ~30 seconds (let the dump fire at frame 600). Close. 
Copy the file: + +```powershell +Copy-Item "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" -Destination "baseline-surfaces.txt" +``` + +Inspect: +```powershell +Get-Content baseline-surfaces.txt | Select-Object -Last 40 +``` + +Record the rollup section (total textures, total bytes, top 10 dimension buckets, format distribution, top 10 (W,H,format) triples). + +- [ ] **Step 3.8: Clean up the env vars and the local app data dump** + +```powershell +Remove-Item Env:\ACDREAM_DUMP_SURFACES -ErrorAction SilentlyContinue +Remove-Item Env:\ACDREAM_STREAM_RADIUS -ErrorAction SilentlyContinue +# Optional: clean up the source file so a future re-measurement isn't confused by stale data +Remove-Item "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" -ErrorAction SilentlyContinue +``` + +All log files (`baseline-r*-*.log`, `baseline-surfaces.log`, `baseline-surfaces.txt`) remain in the worktree root for Task 4. They will NOT be committed — they're scratch. + +--- + +## Task 4: Write baseline doc + amend roadmap + ship commit 2 + +**Files:** +- Create: `docs/plans/2026-05-11-phase-n6-perf-baseline.md` +- Modify: `docs/plans/2026-04-11-roadmap.md` lines 690-705 + +- [ ] **Step 4.1: Write the baseline document** + +Use Write to create `docs/plans/2026-05-11-phase-n6-perf-baseline.md` with this content (substitute real numbers from Task 3 captures into every `` and `` placeholder; do NOT leave any unfilled): + +```markdown +# Phase N.6 slice 1 — perf baseline at Holtburg + +**Created:** 2026-05-11. +**Spec:** [docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md](../superpowers/specs/2026-05-11-phase-n6-slice1-design.md) +**Measured against commit:** +**Purpose:** Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data. + +--- + +## §1. 
Setup + +- **Hardware:** Radeon RX 9070 XT +- **Resolution:** 1440p (2560×1440) +- **Quality preset:** High (default) +- **Connection:** live ACE at `127.0.0.1:9000` +- **Character:** `+Acdream` at Holtburg +- **Sky / time:** clear midday (F7 → Noon, F10 → Clear) +- **Build:** Debug +- **Date measured:** 2026-05-11 +- **Environment overrides:** `ACDREAM_WB_DIAG=1`, `ACDREAM_STREAM_RADIUS=` + +## §2. Dispatch CPU / GPU numbers + +Each cell records the median of the last 3 `[WB-DIAG]` lines from a ~30s stable window. `entSeen / entDrawn / groups` are also from those lines. FPS read from the window title. + +| Radius | Motion | cpu_us median | cpu_us p95 | gpu_us median | gpu_us p95 | FPS | entSeen | entDrawn | groups | +|---|---|---|---|---|---|---|---|---|---| +| 4 | standstill | | | | | | | | | +| 4 | walking | | | | | | | | | +| 8 | standstill | | | | | | | | | +| 8 | walking | | | | | | | | | +| 12| standstill | | | | | | | | | +| 12| walking | | | | | | | | | + +## §3. Surface-format histogram + +From `ACDREAM_DUMP_SURFACES=1` at radius=12, ~30s after enter-world. + +- **Total unique GL textures:** +- **Total bytes (sum of W*H*4):** +- **Top 10 (W, H) dimension buckets:** + - `x`: + - ... (paste from baseline-surfaces.txt rollup) +- **Format distribution:** + - ``: +- **Top 10 (W, H, format) triples — atlas-opportunity input:** + - `x `: + - ... + +**Atlas-opportunity score:** % of surfaces fall into the top-3 (W, H, format) triples. (A score >30% means atlas consolidation could meaningfully reduce sampler switches + memory overhead; <15% means scattered content and atlas is not worth the slice-2 effort.) + +## §4. Conclusion + next-phase recommendation + += 14000 µs: GPU-saturated, persistent-mapped buffers and compute cull help. + 3. Does the atlas score justify slice-2 atlas work? + 4. Given (1)-(3), which is the right next phase? + - CPU-bound + low atlas score: pivot to C.1.5 (visible content, perf already comfortable). 
+  - GPU-bound + high atlas score: do N.6 slice 2 (atlas + persistent buffers).
+  - Either-bound + headroom + low atlas score: do C.1.5 first.
+  - GPU saturated + need for more headroom: escalate to Tier 2.>
+
+## §5. Raw logs
+
+Scratch logs from this measurement run (not committed):
+- `baseline-r4-stand.log`, `baseline-r4-walk.log`
+- `baseline-r8-stand.log`, `baseline-r8-walk.log`
+- `baseline-r12-stand.log`, `baseline-r12-walk.log`
+- `baseline-surfaces.log`, `baseline-surfaces.txt`
+```
+
+Fill in every `<fill>` and `<sha>` and the conclusion paragraph with the real values from Task 3. **Do NOT leave any `<fill>` placeholders.** If a measurement is missing, re-run that step from Task 3 before continuing.
+
+- [ ] **Step 4.2: Read the current roadmap N.6 entry**
+
+```
+Read offset 685, limit 25 from docs/plans/2026-04-11-roadmap.md
+```
+
+Confirm the bullet starts with `- **N.6 — Perf polish.** **Planned (post-A.5 polish takes priority).**` and ends with `Plan + spec written when work begins. **Estimate: 1-2 weeks.**`. Capture the exact text verbatim for Step 4.3's `old_string`.
+
+- [ ] **Step 4.3: Amend the roadmap entry**
+
+Use Edit. The change splits N.6 into slice 1 (shipping with this commit) and slice 2 (deferred until after C.1.5).
+
+**old_string:** the exact N.6 bullet copied from the Read in Step 4.2.
+
+**new_string:**
+```markdown
+- **N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** **SHIPPED 2026-05-11.**
+  Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3
+  query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel
+  desktop GL). Added env-gated surface-format histogram dump in `TextureCache`
+  for atlas-opportunity audit. Captured authoritative baseline at Holtburg
+  radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us`
+  diagnostic. Plan + spec at `docs/superpowers/{specs,plans}/2026-05-11-phase-n6-slice1-*.md`. 
+  Baseline numbers + next-phase recommendation at
+  [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md).
+- **N.6 slice 2 — Perf polish cleanup.** **Planned — deferred until after C.1.5
+  (PES emitter wiring) per the baseline doc's recommendation.** Builds on
+  slice 1's measurement. Scope: retire the legacy `Texture2D`/`sampler2D` path
+  in `TextureCache` (currently kept for Sky + Debug + particle paths now that
+  Terrain has migrated); delete orphan `mesh.frag` (verify zero callers post-N.5
+  amendment); decide bindless-everywhere vs legacy-island for the remaining
+  `sampler2D` consumers; conditionally adopt WB atlas if the slice-1 histogram
+  shows a real opportunity; conditionally adopt persistent-mapped buffers if
+  the slice-1 baseline shows `BufferSubData` as a hot spot; GPU compute culling
+  remains out-of-scope (that's Tier 3 of the perf-tiers roadmap, gated on
+  Tier 2 first). Plan + spec written when work begins. **Estimate: 1-2 weeks
+  once C.1.5 lands.**
+```
+
+- [ ] **Step 4.4: Build (sanity check — only docs touched, but be safe)**
+
+```powershell
+dotnet build
+```
+
+Expected: build succeeds. (No code touched in Task 4; this just confirms nothing was accidentally edited in src/.)
+
+- [ ] **Step 4.5: Commit 2**
+
+```powershell
+git add src/AcDream.App/Rendering/TextureCache.cs `
+    src/AcDream.App/Rendering/GameWindow.cs `
+    docs/plans/2026-05-11-phase-n6-perf-baseline.md `
+    docs/plans/2026-04-11-roadmap.md
+git commit -m @'
+docs(perf): Phase N.6 slice 1 — radius=12 baseline + surface dump path
+
+Capture authoritative CPU+GPU dispatch numbers at Holtburg with the
+gpu_us diagnostic now working (commit <sha of commit 1>). Three
+radii (4/8/12) × two motion modes (standstill/walking) + a surface-format
+histogram from ACDREAM_DUMP_SURFACES=1. 
+ +Adds env-gated one-shot dump path (TextureCache.TickSurfaceHistogramDumpIfEnabled, +called from GameWindow.OnRender) that fires once at frame 600 of the +session — zero cost when off, writes to %LOCALAPPDATA%\acdream\n6-surfaces.txt. + +Baseline document at docs/plans/2026-05-11-phase-n6-perf-baseline.md +closes with a recommendation paragraph for the next phase. Roadmap entry +amended to reflect the slice 1 / slice 2 split. + +Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (§5, §6). + +Co-Authored-By: Claude Opus 4.7 (1M context) +'@ +git status +``` + +Expected: clean working tree. + +- [ ] **Step 4.6: Final sanity sweep** + +```powershell +git log -3 --oneline +``` + +Expected: two new commits from this slice (the GPU timing fix from Task 1.10, then this docs/perf commit), under the spec commit `05d590c`. + +Also confirm the scratch baseline-r*.log and baseline-surfaces.* files are still NOT in the commit (they were not staged): + +```powershell +git status +``` + +Expected: clean working tree. If the scratch logs show as untracked but uncommitted, that's fine — they can be deleted manually: + +```powershell +Remove-Item baseline-r*.log, baseline-surfaces.log, baseline-surfaces.txt, task1-verify.log, task2-verify.log -ErrorAction SilentlyContinue +``` + +--- + +## Acceptance check (spec §9) + +After Task 4 commits, walk through the spec's acceptance criteria and confirm each one. This is a paper-walk, not a re-run — the steps above produce the conditions. + +- [ ] **A1: `[WB-DIAG]` reports non-zero `gpu_us` at radius=12.** + Verified in Task 1.9 (initial check) and Task 3.5-3.6 (full baseline run). Confirm by re-grepping `baseline-r12-stand.log`: + ```powershell + Select-String -Path baseline-r12-stand.log -Pattern "gpu_us=[1-9]" + ``` + Should return at least one line. + +- [ ] **A2: Vendor-neutral.** No `GL_*_NV` or `GL_*_AMD` or `GL_*_INTEL` extension references in the change. 
Re-grep:
+  ```powershell
+  Select-String -Path src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs -Pattern "NV_|AMD_|INTEL_|GL_NV|GL_AMD|GL_INTEL"
+  ```
+  Expected: no matches in the new code (matches elsewhere in the file from unrelated existing code don't count).
+
+- [ ] **A3: Baseline doc has real numbers + conclusion.**
+  Open `docs/plans/2026-05-11-phase-n6-perf-baseline.md` and visually confirm no `<fill>`, `<sha>`, `TBD`, or empty conclusion section.
+
+- [ ] **A4: Roadmap split shipped.**
+  ```powershell
+  Select-String -Path docs/plans/2026-04-11-roadmap.md -Pattern "N\.6 slice"
+  ```
+  Expected: two matches (slice 1 + slice 2 bullets).
+
+- [ ] **A5: `dotnet build` green, no new warnings.**
+  ```powershell
+  dotnet build
+  ```
+  Expected: succeeds. Note any new warnings vs the build output before the slice started.
+
+- [ ] **A6: `dotnet test` green at baseline (~1688 passing, ~8 pre-existing failures).**
+  ```powershell
+  dotnet test --no-build
+  ```
+  Expected: pass count unchanged from before the slice started; failure list unchanged.
+
+- [ ] **A7: No visible regression.**
+  Confirmed during Task 1.9 and Task 3 measurements — the user was in-world repeatedly and didn't observe any rendering issue. If anything looked off during measurement, file it as an issue and decide whether it blocks slice 1 acceptance.
+
+If any acceptance criterion fails, return to the relevant task and re-do it. Do not declare slice 1 complete with failing acceptance.
+
+---
+
+## After slice 1 lands
+
+The baseline document's conclusion paragraph (§4) determines the next phase:
+
+- **If conclusion recommends C.1.5:** brainstorm C.1.5 spec next, using [docs/plans/2026-04-27-phase-c1-pes-particles.md:285-295](../../plans/2026-04-27-phase-c1-pes-particles.md) as the starting scope.
+- **If conclusion recommends N.6 slice 2:** brainstorm slice 2 spec next, addressing legacy `TextureCache` cleanup + atlas + persistent-mapped buffers based on the histogram data. 
+- **If conclusion recommends Tier 2:** consult [docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md](../../plans/2026-05-10-perf-tiers-2-3-roadmap.md) and brainstorm a Tier 2 spec. + +The choice is data-driven; the recommendation paragraph is the contract. Don't re-litigate the decision once the numbers are in. From a7c98004bbb371fbdd3b145cc375690a143690f3 Mon Sep 17 00:00:00 2001 From: Erik Date: Mon, 11 May 2026 11:24:26 +0200 Subject: [PATCH 3/7] =?UTF-8?q?feat(perf):=20Phase=20N.6=20slice=201=20?= =?UTF-8?q?=E2=80=94=20fix=20gpu=5Fus=20double-buffering=20in=20WbDrawDisp?= =?UTF-8?q?atcher?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The dispatcher's GPU TimeElapsed queries were polled in the same frame as the indirect draw, so glGetQueryObject(ResultAvailable) always returned 0 and gpu_us in [WB-DIAG] was stuck at 0m/0p95. Replace the 2 single-handle queries with ring-of-3 arrays and move the result read to BEFORE issuing the next frame's queries into the same slot — at frame N we read slot N%3 which holds frame N-3's queries (oldest in the ring, ~50ms old at 60fps and definitely done across all desktop GL drivers). Vendor-neutral: AMD/NVIDIA/Intel desktop GL all work without driver-specific code. The gpuQuerySlot variable is hoisted to function scope (just before Phase 7 opaque pass) so both the opaque and transparent passes reference the same slot — the plan placed it inside the opaque-pass if-block, which would have been out of scope for the transparent BeginQuery; corrected in the implementation. No new tests — the change is purely a diagnostic readout fix, no observable behavior in the rendering path. Build green; tests at baseline (1711 passing, 8 pre-existing physics/MotionInterpreter failures unchanged). Manual gpu_us verification still pending in-world. Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (§4). Plan: docs/superpowers/plans/2026-05-11-phase-n6-slice1.md (Task 1). 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../Rendering/Wb/WbDrawDispatcher.cs | 72 +++++++++++++------ 1 file changed, 49 insertions(+), 23 deletions(-) diff --git a/src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs b/src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs index d0dbd82..605b1e6 100644 --- a/src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs +++ b/src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs @@ -152,8 +152,16 @@ public sealed unsafe class WbDrawDispatcher : IDisposable private readonly System.Diagnostics.Stopwatch _cpuStopwatch = new(); private readonly long[] _cpuSamples = new long[256]; // microseconds private int _cpuSampleCursor; - private uint _gpuQueryOpaque; - private uint _gpuQueryTransparent; + // GPU timing uses a ring of 3 query-pair slots so the read of frame N-3's + // result lands when the GPU has finished (~50ms after issue on a typical + // 60fps frame). Ring of 3 is the vendor-neutral choice: NVIDIA drivers with + // triple-buffering+vsync can queue ~3 frames ahead, AMD typically 1-2, + // Intel iGPUs vary. ResultAvailable is the safety guard if the GPU is + // still working when we try to read. 
+ private const int GpuQueryRingDepth = 3; + private readonly uint[] _gpuQueryOpaque = new uint[GpuQueryRingDepth]; + private readonly uint[] _gpuQueryTransparent = new uint[GpuQueryRingDepth]; + private int _gpuQueryFrameIndex; private readonly long[] _gpuSamples = new long[256]; // microseconds private int _gpuSampleCursor; private bool _gpuQueriesInitialized; @@ -346,8 +354,11 @@ public sealed unsafe class WbDrawDispatcher : IDisposable if (diag && !_gpuQueriesInitialized) { - _gpuQueryOpaque = _gl.GenQuery(); - _gpuQueryTransparent = _gl.GenQuery(); + for (int i = 0; i < GpuQueryRingDepth; i++) + { + _gpuQueryOpaque[i] = _gl.GenQuery(); + _gpuQueryTransparent[i] = _gl.GenQuery(); + } _gpuQueriesInitialized = true; } @@ -754,6 +765,29 @@ public sealed unsafe class WbDrawDispatcher : IDisposable if (string.Equals(Environment.GetEnvironmentVariable("ACDREAM_NO_CULL"), "1", StringComparison.Ordinal)) _gl.Disable(EnableCap.CullFace); + // GPU timing: compute this frame's ring slot. We read frame N-3's + // result (the oldest data in the ring) before overwriting it with + // frame N's queries. Hoisted to function scope so both the opaque + // and transparent passes below can reference gpuQuerySlot. See spec + // §3 Q1/Q2 + §4 in + // docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md. 
+ int gpuQuerySlot = _gpuQueryFrameIndex % GpuQueryRingDepth; + if (_gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth) + { + _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot], QueryObjectParameterName.ResultAvailable, out int avail); + if (avail != 0) + { + _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot], QueryObjectParameterName.Result, out ulong opaqueNs); + _gl.GetQueryObject(_gpuQueryTransparent[gpuQuerySlot], QueryObjectParameterName.Result, out ulong transNs); + long gpuUs = (long)((opaqueNs + transNs) / 1000UL); + _gpuSamples[_gpuSampleCursor] = gpuUs; + _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length; + } + // If avail==0 the sample is dropped silently. MedianMicros + // computes over the non-zero subset, so dropped samples don't + // poison the median. + } + // ── Phase 7: opaque pass ───────────────────────────────────────────── if (_opaqueDrawCount > 0) { @@ -771,7 +805,7 @@ public sealed unsafe class WbDrawDispatcher : IDisposable // mesh_modern.vert for why this is needed. 
_shader.SetInt("uDrawIDOffset", 0); _gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer); - if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque); + if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque[gpuQuerySlot]); _gl.MultiDrawElementsIndirect( PrimitiveType.Triangles, DrawElementsType.UnsignedShort, @@ -820,7 +854,7 @@ public sealed unsafe class WbDrawDispatcher : IDisposable _gl.CullFace(TriangleFace.Back); _gl.FrontFace(FrontFaceDirection.Ccw); _shader.SetInt("uRenderPass", 1); - if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent); + if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent[gpuQuerySlot]); _gl.MultiDrawElementsIndirect( PrimitiveType.Triangles, DrawElementsType.UnsignedShort, @@ -843,21 +877,10 @@ public sealed unsafe class WbDrawDispatcher : IDisposable _cpuSamples[_cpuSampleCursor] = cpuUs; _cpuSampleCursor = (_cpuSampleCursor + 1) % _cpuSamples.Length; - // Read GPU samples non-blocking; the result for the previous frame's - // queries should be ready by now. If not, drop the sample (don't stall - // the CPU waiting for the GPU). - if (_gpuQueriesInitialized) - { - _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.ResultAvailable, out int avail); - if (avail != 0) - { - _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.Result, out ulong opaqueNs); - _gl.GetQueryObject(_gpuQueryTransparent, QueryObjectParameterName.Result, out ulong transNs); - long gpuUs = (long)((opaqueNs + transNs) / 1000UL); - _gpuSamples[_gpuSampleCursor] = gpuUs; - _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length; - } - } + // GPU sample read happens BEFORE issuing the next frame's queries + // (see step 1.3 above). Increment the frame counter here so the + // next call computes a fresh slot. 
+ if (_gpuQueriesInitialized) _gpuQueryFrameIndex++; _drawsIssued += _opaqueDrawCount + _transparentDrawCount; _instancesIssued += totalInstances; @@ -1139,8 +1162,11 @@ public sealed unsafe class WbDrawDispatcher : IDisposable _gl.DeleteBuffer(_indirectBuffer); if (_gpuQueriesInitialized) { - _gl.DeleteQuery(_gpuQueryOpaque); - _gl.DeleteQuery(_gpuQueryTransparent); + for (int i = 0; i < GpuQueryRingDepth; i++) + { + _gl.DeleteQuery(_gpuQueryOpaque[i]); + _gl.DeleteQuery(_gpuQueryTransparent[i]); + } } } From 25cb147d972fec907589e693520b0d2ef3933c49 Mon Sep 17 00:00:00 2001 From: Erik Date: Mon, 11 May 2026 11:28:22 +0200 Subject: [PATCH 4/7] fix(perf #N6.1): gate gpu_us read on diag for symmetric toggle behavior MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Code-quality review on Task 1 (commit a7c9800) flagged an asymmetric diag gate: the read-before-overwrite block at the top of the dispatcher was not gated on diag, but the frame-counter increment and BeginQuery calls were. If a maintainer toggled ACDREAM_WB_DIAG from "1" to "" mid-session, _gpuQueryFrameIndex would freeze (gated inside if(diag)) while the read kept firing every frame at the same slot — producing duplicate stale samples. Add diag to the read block's outer condition so the read/issue/increment trio is symmetric. One-line change; behavior under the normal usage pattern (env var set at launch, never toggled) is unchanged.
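Reviewer note: the ring's read-before-overwrite invariant is pure index arithmetic and can be sanity-checked outside GL. A minimal Python sketch (the `read_schedule` helper is illustrative, not code from this repo): frame N reads slot N % 3, which was last written at frame N-3, then reuses that slot for its own queries.

```python
RING_DEPTH = 3  # mirrors GpuQueryRingDepth in WbDrawDispatcher

def read_schedule(n_frames):
    """Return (frame, slot, frame_that_wrote_slot) for every read performed."""
    reads = []
    for frame in range(n_frames):
        slot = frame % RING_DEPTH
        if frame >= RING_DEPTH:
            # The slot about to be overwritten was written RING_DEPTH frames
            # ago, because (frame - RING_DEPTH) % RING_DEPTH == slot.
            reads.append((frame, slot, frame - RING_DEPTH))
        # frame's own begin/end queries are then issued into `slot`.
    return reads

# Frames 0-2 only write; frame 3 performs the first read, of frame 0's result.
assert read_schedule(6) == [(3, 0, 0), (4, 1, 1), (5, 2, 2)]
```

At 60 fps the read therefore lands roughly 50 ms after issue, matching the comment in the patch; the ResultAvailable poll remains the safety net if the GPU is still behind at that point.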
Co-Authored-By: Claude Opus 4.7 (1M context) --- src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs b/src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs index 605b1e6..36ebdc9 100644 --- a/src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs +++ b/src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs @@ -772,7 +772,11 @@ public sealed unsafe class WbDrawDispatcher : IDisposable // §3 Q1/Q2 + §4 in // docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md. int gpuQuerySlot = _gpuQueryFrameIndex % GpuQueryRingDepth; - if (_gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth) + // diag is part of the gate so the read/issue/increment trio stays + // symmetric — without it, toggling ACDREAM_WB_DIAG mid-session would + // freeze the frame counter (gated by diag below) while the read kept + // re-reading the same slot, producing duplicate stale samples. + if (diag && _gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth) { _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot], QueryObjectParameterName.ResultAvailable, out int avail); if (avail != 0) From 13abf96a5ece97016fac23f12af67794b6690255 Mon Sep 17 00:00:00 2001 From: Erik Date: Mon, 11 May 2026 12:34:10 +0200 Subject: [PATCH 5/7] =?UTF-8?q?docs(perf):=20Phase=20N.6=20slice=201=20?= =?UTF-8?q?=E2=80=94=20radius=3D12=20baseline=20+=20surface=20dump=20path?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Capture authoritative CPU+GPU dispatch numbers at Holtburg with the gpu_us diagnostic now working (commit 25cb147). Three radii (4/8/12) x two motion modes (standstill/walking) + a surface-format histogram from ACDREAM_DUMP_SURFACES=1. 
Adds env-gated one-shot dump path (TextureCache.TickSurfaceHistogramDumpIfEnabled, called from GameWindow.OnRender) that fires once after both (a) frame 600 of the session AND (b) the upload-metadata dict reaches 100 entries -- the cache-size gate prevents the dump from firing during pre-world GUI ticks where OnRender spins at high rates but no scenery has streamed. Output writes to %LOCALAPPDATA%\acdream\n6-surfaces.txt with a try/catch around the I/O so disk-full / permission errors don't crash mid-measurement. Baseline document at docs/plans/2026-05-11-phase-n6-perf-baseline.md documents: - CPU dominates GPU by 30-50x at every radius (strongly CPU-bound) - GPU wildly under-utilized (max gpu_us p95 ~600us vs 16,600us frame budget) - CPU scales superlinearly with N1 (Tier 1 cache wins on inner loop but not outer LB walk) - Surface atlas opportunity high (59% of textures in top-3 triples) but win is memory-only since GPU isn't bottlenecked Recommendation: C.1.5 (PES emitter wiring) next, then a reduced-scope N.6 slice 2 (drop atlas + persistent-mapped buffers -- not justified by the GPU under-utilization observed). Roadmap entry amended to split N.6 into slice 1 (shipped) and slice 2 (planned, reduced scope, deferred until after C.1.5). Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md. Plan: docs/superpowers/plans/2026-05-11-phase-n6-slice1.md (Task 4). 
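The two-gate trigger described above reduces to a tiny one-shot state machine. A hedged Python sketch (thresholds copied from this message; `make_dump_gate` is a hypothetical name, the real logic lives in TextureCache.TickSurfaceHistogramDumpIfEnabled):

```python
def make_dump_gate(frame_threshold=600, cache_threshold=100):
    """One-shot gate: fires once when BOTH the frame and cache gates pass."""
    state = {"frames": 0, "fired": False}

    def tick(env_enabled, cache_count):
        if state["fired"] or not env_enabled:
            return False          # zero work when off or already fired
        state["frames"] += 1      # counts only enabled ticks, like the C# path
        if state["frames"] < frame_threshold or cache_count < cache_threshold:
            return False
        state["fired"] = True     # never fires again this session
        return True

    return tick
```

With the cache stuck below 100 entries (pre-world GUI ticks) the gate never fires no matter how many frames elapse; once both gates pass it fires exactly once.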
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/plans/2026-04-11-roadmap.md | 36 ++-- .../2026-05-11-phase-n6-perf-baseline.md | 169 ++++++++++++++++++ src/AcDream.App/Rendering/GameWindow.cs | 4 + src/AcDream.App/Rendering/TextureCache.cs | 125 +++++++++++++ 4 files changed, 318 insertions(+), 16 deletions(-) create mode 100644 docs/plans/2026-05-11-phase-n6-perf-baseline.md diff --git a/docs/plans/2026-04-11-roadmap.md b/docs/plans/2026-04-11-roadmap.md index 2478aa4..7f535f3 100644 --- a/docs/plans/2026-04-11-roadmap.md +++ b/docs/plans/2026-04-11-roadmap.md @@ -687,22 +687,26 @@ for our deletions/additions; merge upstream `master` periodically. manifest at higher radius. Spec acceptance criterion #5 was wrong; amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Plan archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`. -- **N.6 — Perf polish.** **Planned (post-A.5 polish takes priority).** - Builds on N.5 + N.5b. Legacy renderer retirement was pulled forward - into N.5 ship amendment — `InstancedMeshRenderer`, `StaticMeshRenderer`, - `WbFoundationFlag` are gone — and the terrain legacy renderer - (`TerrainChunkRenderer` + `TerrainRenderer` + `terrain.vert/.frag`) - retired in N.5b. N.6 scope: WB atlas adoption for memory savings - on shared content, persistent-mapped buffers if `glBufferData` shows - up in profiling (the modern terrain path's per-frame DEIC `BufferSubData` - is a candidate), GPU-side culling via compute pre-pass (eliminates - the per-frame slot walk + DEIC build entirely), GL_TIME_ELAPSED query - double-buffering (deferred from N.5 — diagnostic shows `gpu_us=0/0` - under `ACDREAM_WB_DIAG=1`), direct higher-radius perf comparison (A.5 - has now landed — modern's architectural wins are measurable), retire the - legacy `Texture2D`/`sampler2D` path in `TextureCache` (currently kept - for Sky + Debug + particle paths now that Terrain has migrated). - Plan + spec written when work begins. 
**Estimate: 1-2 weeks.** +- **N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** **SHIPPED 2026-05-11.** + Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 + query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel + desktop GL). Added env-gated surface-format histogram dump in `TextureCache` + for atlas-opportunity audit. Captured authoritative baseline at Holtburg + radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` + diagnostic. Plan + spec at `docs/superpowers/{specs,plans}/2026-05-11-phase-n6-slice1-*.md`. + Baseline numbers + next-phase recommendation at + [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). +- **N.6 slice 2 — Perf polish cleanup.** **Planned — deferred until after C.1.5 + (PES emitter wiring) per the baseline doc's recommendation.** Builds on + slice 1's measurement. Scope: retire the legacy `Texture2D`/`sampler2D` path + in `TextureCache` (currently kept for Sky + Debug + particle paths now that + Terrain has migrated); delete orphan `mesh.frag` (verify zero callers post-N.5 + amendment); decide bindless-everywhere vs legacy-island for the remaining + `sampler2D` consumers. **Dropped from slice 2 scope per baseline data**: + WB atlas adoption and persistent-mapped buffers — both target GPU/sampler + throughput but the baseline shows GPU is wildly under-utilized (max gpu_us + p95 ~600 µs vs 16,600 µs frame budget). Slice 2 reduces to a ~1-day cleanup. + Plan + spec written when work begins. **Estimate: ~1 day once C.1.5 lands.** - **N.7 — EnvCells / dungeons.** Replace EnvCell rendering with WB's `EnvCellRenderManager` + `PortalRenderManager` on top of N.4's foundation. 
**Estimate: 1-2 weeks** (was 2-3 — naturally smaller now diff --git a/docs/plans/2026-05-11-phase-n6-perf-baseline.md b/docs/plans/2026-05-11-phase-n6-perf-baseline.md new file mode 100644 index 0000000..75f6a8e --- /dev/null +++ b/docs/plans/2026-05-11-phase-n6-perf-baseline.md @@ -0,0 +1,169 @@ +# Phase N.6 slice 1 — perf baseline at Holtburg + +**Created:** 2026-05-11. +**Spec:** [docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md](../superpowers/specs/2026-05-11-phase-n6-slice1-design.md) +**Measured against commit:** `25cb147` (Task 1 final — gpu_us fix + diag-gate symmetry follow-up) +**Purpose:** Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data. + +--- + +## §1. Setup + +- **Hardware:** Radeon RX 9070 XT +- **Resolution:** 1440p (2560×1440) +- **Quality preset:** High (default) +- **Connection:** live ACE at `127.0.0.1:9000` +- **Character:** `+Acdream` at Holtburg +- **Sky / time:** clear midday (F7 → Noon, F10 → Clear) +- **Build:** Debug +- **Date measured:** 2026-05-11 +- **Environment overrides:** `ACDREAM_WB_DIAG=1`, `ACDREAM_STREAM_RADIUS=N` (N = 4 / 8 / 12 per run) + +Note: `ACDREAM_STREAM_RADIUS=N` forces N₁=N (all near-tier landblocks out to radius N at full detail). This is NOT the production A.5 default (N₁=4 / N₂=12), which was characterized in CLAUDE.md as comfortable 200–400 FPS at the default preset. These measurements characterize the scaling curve — what happens as near-tier radius grows — not current production behavior. FPS was not captured directly (no window-title screenshot per run); it can be derived from `(1e6 / total_frame_time_us)`, but the dispatcher's `cpu_us` is only part of the frame (terrain, sky, particles, UI, GL submission overhead, and swap-buffer wait are not included). + +## §2. Dispatch CPU / GPU numbers + +Each cell records the median of the last 3 `[WB-DIAG]` lines from a ~30s stable window.
+`entSeen / entDrawn / groups / drawsIssued` are also from those lines (values per 5s bucket). +FPS column omitted — not captured per the note above. + +| Radius | Motion | cpu_us median | cpu_us p95 | gpu_us median | gpu_us p95 | entSeen (per 5s) | entDrawn (per 5s) | groups | drawsIssued (per 5s) | +|--------|------------|---------------|------------|---------------|------------|------------------|-------------------|--------|----------------------| +| 4 | standstill | 3,208 | 3,313 | 93 | 95 | 16.9M | 15.5M | 1,216 | 1.65M | +| 4 | walking | 2,967 | 3,112 | 95 | 120 | 13.9M | 13.9M | 1,850 | 1.45M | +| 8 | standstill | 6,732 | 7,199 | 126 | 130 | 19.8M | 19.8M | 333 | 218K | +| 8 | walking | 6,572 | 6,927 | 96 | 113 | 18.1M | 18.0M | 534 | 245K | +| 12 | standstill | 12,853 | 13,525 | 344 | 507 | 19.6M | 19.6M | 541 | 184K | +| 12 | walking | 16,320 | 17,241 | 553 | 603 | 17.8M | 17.8M | 898 | 200K | + +**Notable:** `meshMissing` counts at r4 standstill (~1.45M per 5s) drop to near-zero while +walking. This suggests the static-entity slow path's mesh-load lifecycle has some delay +before populating for newly-streamed content. Not fatal — doesn't affect rendered output — +but worth a follow-up issue in `docs/ISSUES.md` if it persists in normal play. + +## §3. Surface-format histogram + +From `ACDREAM_DUMP_SURFACES=1` at radius=12, ~30s after enter-world. +Output written to `%LOCALAPPDATA%\acdream\n6-surfaces.txt`. 
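The rollups below are easy to reproduce from the raw per-entry lines. A minimal Python sketch of the same dedupe-and-bucket aggregation (entry tuples here are illustrative, not the dump file's actual column format):

```python
def rollup(entries):
    # entries: (gl_name, width, height, fmt) tuples; several cache keys can
    # point at the same GL texture, so dedupe by GL name first.
    seen, total_bytes = set(), 0
    by_dim, by_fmt, by_triple = {}, {}, {}
    for name, w, h, fmt in entries:
        if name in seen:
            continue
        seen.add(name)
        total_bytes += w * h * 4  # RGBA8: 4 bytes per texel
        by_dim[(w, h)] = by_dim.get((w, h), 0) + 1
        by_fmt[fmt] = by_fmt.get(fmt, 0) + 1
        by_triple[(w, h, fmt)] = by_triple.get((w, h, fmt), 0) + 1
    return len(seen), total_bytes, by_dim, by_fmt, by_triple
```

Duplicate GL names (the same texture reachable through several cache keys) are counted once, which is why the totals are reported as "unique GL textures".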
+ +- **Total unique GL textures:** 760 +- **Total bytes (sum of W×H×4):** 96,387,584 (~96.4 MB) + +**Top 10 (W, H) dimension buckets:** + +| Dimensions | Count | Share | +|------------|-------|-------| +| 128×128 | 236 | 31% | +| 64×64 | 111 | 15% | +| 256×256 | 102 | 13% | +| 128×256 | 71 | 9% | +| 64×128 | 69 | 9% | +| 256×128 | 48 | 6% | +| 128×64 | 39 | 5% | +| 512×512 | 30 | 4% | +| 8×8 | 18 | 2% | +| 32×32 | 14 | 2% | + +**Format distribution:** + +| Format | Count | Share | +|---------------|-------|-------| +| RGBA8_DECODED | 760 | 100% | + +All uploads land as RGBA8 regardless of source format (INDEX16, P8, DXT, BGRA, etc. +all decode through `TextureHelpers` before upload). The source-format diversity is real +but invisible to GL after the decode step. + +**Top 10 (W, H, format) triples — atlas-opportunity input:** + +Same as the dimension buckets above since there is only one format. The top-3 triples +(128×128, 64×64, 256×256) cover 449 of 760 surfaces = **59%**. + +**Atlas-opportunity score: 59%** of surfaces fall into the top-3 (W, H, format) triples. +The spec §6 threshold for "atlas work is justified for memory savings" is >30%; this +measurement is well above it. However, see §4 for why atlas is not the right next step +despite the high score. + +## §4. Conclusion + next-phase recommendation + +### What the data shows + +**The entity dispatcher is strongly CPU-bound.** At every radius, CPU dominates GPU by +30–50×. At radius=12 standstill: 12.9 ms CPU vs 0.34 ms GPU. At radius=12 walking the +ratio is 16.3 ms CPU vs 0.55 ms GPU. There is no GPU bottleneck. + +**GPU is wildly under-utilized.** The highest gpu_us p95 observed is 603 µs at radius=12 +walking — against a 16,600 µs frame budget at 60 FPS. The GPU is working at roughly +3.6% of its 60fps capacity for entity rendering alone. Even accounting for terrain, sky, +particles, UI, and swap-buffer overhead, there is substantial headroom. 
The "GPU +comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged. + +**CPU scales superlinearly with N₁ (near-tier radius).** As N₁ grows from 4 → 8 → 12, +median cpu_us grows from 3.2 ms → 6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the +r4 baseline. The Tier 1 entity-classification cache (`EntityClassificationCache`, shipped +as #53) wins on the inner loop (per-entity classification avoided on cache hits) but the +outer per-LB walk still scales with N₁. This is exactly what the Tier 2 plan (persistent +groups) at `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating +the per-frame LB scan entirely. + +**Radius=12 is not the production scenario.** `ACDREAM_STREAM_RADIUS=12` forces N₁=12 +(625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81 +full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes as +comfortable 200–400 FPS at the default preset. The numbers above characterize the scaling +curve for headroom analysis, not the experience a typical player sees. + +**Atlas opportunity is high (59%) but the win is memory-only.** With 96 MB of textures +and 59% in the top-3 dimension buckets, atlas consolidation would reduce sampler-switch +count (currently near-zero already, since bindless textures are made resident once) and +shrink the texture memory footprint by roughly 40–50% through packing. But GPU is not +bottlenecked on sampler switches or memory bandwidth — the 0.6 ms gpu_us p95 at radius=12 +walking demonstrates this directly. Atlas adoption would cost 1–2 weeks of implementation +risk for a memory saving the process doesn't currently need at 96 MB. 
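The headline claims in this section can be re-derived from the §2 and §3 tables. A quick Python cross-check (numbers copied from the tables above; the `(2N+1)²` landblock-count formula is assumed from the near-tier radius definition):

```python
# medians from the §2 table (microseconds)
cpu = {4: 3_208, 8: 6_732, 12: 12_853}
gpu = {4: 93, 8: 126, 12: 344}

cpu_over_gpu = {r: cpu[r] / gpu[r] for r in cpu}   # roughly 34x / 53x / 37x

# scaling vs the r4 baseline
cpu_growth = cpu[12] / cpu[4]                      # about 4.0x
near_lbs = lambda n: (2 * n + 1) ** 2              # near-tier landblock count
lb_growth = near_lbs(12) / near_lbs(4)             # 625 / 81, about 7.7x

# worst observed GPU load vs the 60fps frame budget
gpu_utilization = 603 / 16_600                     # about 3.6%

# atlas-opportunity share: top-3 dimension buckets from §3
atlas_share = (236 + 111 + 102) / 760              # about 59%
```

The 4.0x CPU growth against a 7.7x landblock growth is the superlinear-in-radius but sublinear-in-LB-count nuance that the follow-up review commit (76ca3ff) folds into this section.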
+ +### Recommendation + +**Primary: do C.1.5 next (PES emitter wiring — portals, chimneys, fireplaces).** Four +reasons: (a) the production dispatcher is already comfortable at the default N₁=4 preset +per the CLAUDE.md notes; (b) the two slice-2 items that were conditional on baseline +data (atlas adoption and persistent-mapped buffers) are not justified — GPU is not +bottlenecked; (c) C.1.5 fills a visible content gap that has been open since C.1 shipped +and is in the roadmap queue ahead of N.6 slice 2; (d) C.1.5 stabilizes the particle path +before any future shader migration work in slice 2 touches `particle.frag`. Starting +point for C.1.5 scoping: `docs/plans/2026-04-27-phase-c1-pes-particles.md` lines 285–295. + +**Secondary (after C.1.5 lands): N.6 slice 2 with reduced scope.** The baseline data +justifies dropping atlas adoption and persistent-mapped buffers from slice 2 entirely. +What remains is a ~1-day cleanup: retire orphan `mesh.frag` (verify zero callers post-N.5 +amendment), collapse dead `_handlesByOverridden` / `_handlesByPalette` legacy caches once +their callers are confirmed gone, migrate `particle.frag` to bindless sampling after C.1.5 +stabilizes the path. Slice 2 is a cleanup sprint, not a performance phase. + +**Tertiary option (if perf escalation becomes pressing): Tier 2 first.** The scaling +curve (3.2 → 6.7 → 12.9 ms as N₁ grows 4 → 8 → 12) confirms the per-LB walk is the +bottleneck — exactly what Tier 2's persistent-group structure at +`docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses. Not urgent at the current +default N₁=4; worth revisiting if a future quality preset wants N₁=8 as default or if the +200–400 FPS range at N₁=4 shrinks after more content is streamed. + +**Decision rule for revisiting:** if future measurement at the default preset shows +cpu_us median > 5,000 µs or gpu_us p95 > 8,000 µs, re-open the escalation question. +Otherwise, hold the C.1.5 → reduced-slice-2 sequence. + +## §5.
Raw logs + +Scratch logs from this measurement run (not committed; can be deleted once the doc is +reviewed): + +- `baseline-r4-stand.log`, `baseline-r4-walk.log` +- `baseline-r8-stand.log`, `baseline-r8-walk.log` +- `baseline-r12-stand.log`, `baseline-r12-walk.log` +- `baseline-surfaces.log` (launch log for `ACDREAM_DUMP_SURFACES=1` run) +- `baseline-surfaces.txt` (copy of `%LOCALAPPDATA%\acdream\n6-surfaces.txt`) +- `task1-verify.log` (Task 1 manual verification log) diff --git a/src/AcDream.App/Rendering/GameWindow.cs b/src/AcDream.App/Rendering/GameWindow.cs index c3bba03..b81d484 100644 --- a/src/AcDream.App/Rendering/GameWindow.cs +++ b/src/AcDream.App/Rendering/GameWindow.cs @@ -6310,6 +6310,10 @@ public sealed class GameWindow : IDisposable _gl!.Clear(ClearBufferMask.ColorBufferBit | ClearBufferMask.DepthBufferBit); + // Phase N.6 slice 1: one-shot surface-format histogram dump under + // ACDREAM_DUMP_SURFACES=1. Zero cost when off. + _textureCache?.TickSurfaceHistogramDumpIfEnabled(); + // Phase N.4: drain WB pipeline queues (staged mesh data + // GL thread queue). Must happen before any draw work so that // resources uploaded this frame are available immediately. 
diff --git a/src/AcDream.App/Rendering/TextureCache.cs b/src/AcDream.App/Rendering/TextureCache.cs index 78eef29..5aea075 100644 --- a/src/AcDream.App/Rendering/TextureCache.cs +++ b/src/AcDream.App/Rendering/TextureCache.cs @@ -4,6 +4,7 @@ using AcDream.Core.World; using DatReaderWriter; using DatReaderWriter.DBObjs; using Silk.NET.OpenGL; +using System.Linq; using SurfaceType = DatReaderWriter.Enums.SurfaceType; namespace AcDream.App.Rendering; @@ -40,6 +41,20 @@ public sealed unsafe class TextureCache : Wb.ITextureCachePerInstance, IDisposab private readonly Dictionary<(uint surfaceId, uint origTexOverride), (uint Name, ulong Handle)> _bindlessByOverridden = new(); private readonly Dictionary<(uint surfaceId, uint origTexOverride, ulong paletteHash), (uint Name, ulong Handle)> _bindlessByPalette = new(); + // Phase N.6 slice 1 (2026-05-11): per-upload metadata for the + // ACDREAM_DUMP_SURFACES=1 histogram dump path. Populated at upload + // time so the dump method doesn't have to query GL state. Keyed by + // GL texture name (same key used in cache value tuples). Format + // label is "RGBA8_DECODED" for the post-decode upload (all uploads + // currently land as RGBA8 regardless of source format). + private readonly Dictionary<uint, (int Width, int Height, string Format)> _uploadMetadata = new(); + + // Frame counter for the one-shot ACDREAM_DUMP_SURFACES=1 trigger. + // Increments per Tick call; fires the dump once at frame index 600 + // and never again for the session. See spec §5. + private int _dumpFrameCounter; + private bool _surfaceHistogramAlreadyDumped; + public TextureCache(GL gl, DatCollection dats, Wb.BindlessSupport? bindless = null) { _gl = gl; @@ -258,6 +273,114 @@ public sealed unsafe class TextureCache : Wb.ITextureCachePerInstance, IDisposab return h; } + /// + /// Phase N.6 slice 1: one-shot surface-format histogram dump for the + /// atlas-opportunity audit. Activated by ACDREAM_DUMP_SURFACES=1; fires + /// once after BOTH gates pass: + /// 1.
_dumpFrameCounter >= 600 — at least 600 OnRender ticks + /// have elapsed (catches the "we're already past startup boilerplate" + /// bound; ~10s at 60fps, ~3s at 200fps). + /// 2. _uploadMetadata.Count >= 100 — the cache contains at + /// least 100 uploaded textures, indicating streaming has actually + /// pulled in world content (not just sky/UI/font). The original + /// frame-only gate fired during the login/handshake phase where + /// OnRender ticks at GUI rates but no world has streamed in. + /// Output goes to %LOCALAPPDATA%\acdream\n6-surfaces.txt. Zero cost + /// when off. See spec §5 in + /// docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md. + /// + public void TickSurfaceHistogramDumpIfEnabled() + { + if (_surfaceHistogramAlreadyDumped) return; + if (!string.Equals(System.Environment.GetEnvironmentVariable("ACDREAM_DUMP_SURFACES"), "1", StringComparison.Ordinal)) return; + _dumpFrameCounter++; + if (_dumpFrameCounter < 600) return; + if (_uploadMetadata.Count < 100) return; + + DumpSurfaceHistogram(); + _surfaceHistogramAlreadyDumped = true; + } + + private void DumpSurfaceHistogram() + { + try + { + DumpSurfaceHistogramCore(); + } + catch (Exception ex) + { + // Diagnostic-only path. If the dump file can't be written + // (disk full, permission denied, antivirus lock, path too + // long) we must NOT crash OnRender — that would invalidate + // the very measurement pass this diagnostic is meant to + // support. Log to stderr and let the caller mark the dump + // as "already done" so it doesn't retry every frame. 
+ Console.Error.WriteLine($"[N6-DUMP] Failed to write surface histogram: {ex.Message}"); + } + } + + private void DumpSurfaceHistogramCore() + { + var localAppData = System.Environment.GetFolderPath(System.Environment.SpecialFolder.LocalApplicationData); + var outDir = System.IO.Path.Combine(localAppData, "acdream"); + System.IO.Directory.CreateDirectory(outDir); + var outPath = System.IO.Path.Combine(outDir, "n6-surfaces.txt"); + + var sb = new System.Text.StringBuilder(); + sb.AppendLine($"# acdream surface-format histogram — generated {DateTime.UtcNow:yyyy-MM-ddTHH:mm:ssZ}"); + sb.AppendLine("# Per-entry: surfaceId(hex), width, height, format, byteCount"); + sb.AppendLine(); + + // Walk every cached entry across the 6 caches, dedupe by GL name. + var seen = new HashSet<uint>(); + long totalBytes = 0; + var bucketsByDim = new Dictionary<(int W, int H), int>(); + var bucketsByFormat = new Dictionary<string, int>(); + var bucketsByTriple = new Dictionary<(int W, int H, string F), int>(); + + void Emit(uint surfaceId, uint name) + { + if (!seen.Add(name)) return; + if (!_uploadMetadata.TryGetValue(name, out var meta)) return; + int bytes = meta.Width * meta.Height * 4; + totalBytes += bytes; + sb.AppendLine($"0x{surfaceId:X8}, {meta.Width}, {meta.Height}, {meta.Format}, {bytes}"); + + var dimKey = (meta.Width, meta.Height); + bucketsByDim[dimKey] = bucketsByDim.GetValueOrDefault(dimKey) + 1; + bucketsByFormat[meta.Format] = bucketsByFormat.GetValueOrDefault(meta.Format) + 1; + var tripleKey = (meta.Width, meta.Height, meta.Format); + bucketsByTriple[tripleKey] = bucketsByTriple.GetValueOrDefault(tripleKey) + 1; + } + + foreach (var kv in _handlesBySurfaceId) Emit(kv.Key, kv.Value); + foreach (var kv in _handlesByOverridden) Emit(kv.Key.surfaceId, kv.Value); + foreach (var kv in _handlesByPalette) Emit(kv.Key.surfaceId, kv.Value); + foreach (var kv in _bindlessBySurfaceId) Emit(kv.Key, kv.Value.Name); + foreach (var kv in _bindlessByOverridden) Emit(kv.Key.surfaceId, kv.Value.Name); +
foreach (var kv in _bindlessByPalette) Emit(kv.Key.surfaceId, kv.Value.Name); + + sb.AppendLine(); + sb.AppendLine("# Rollups"); + sb.AppendLine($"# Total unique GL textures: {seen.Count}"); + sb.AppendLine($"# Total bytes (sum of W*H*4): {totalBytes}"); + + sb.AppendLine("# Top 10 (W,H) dimension buckets:"); + foreach (var kv in bucketsByDim.OrderByDescending(kv => kv.Value).Take(10)) + sb.AppendLine($"# {kv.Key.W}x{kv.Key.H}: {kv.Value}"); + + sb.AppendLine("# Format buckets:"); + foreach (var kv in bucketsByFormat.OrderByDescending(kv => kv.Value)) + sb.AppendLine($"# {kv.Key}: {kv.Value}"); + + sb.AppendLine("# Top 10 (W,H,format) triples — atlas-opportunity input:"); + foreach (var kv in bucketsByTriple.OrderByDescending(kv => kv.Value).Take(10)) + sb.AppendLine($"# {kv.Key.W}x{kv.Key.H} {kv.Key.F}: {kv.Value}"); + + System.IO.File.WriteAllText(outPath, sb.ToString()); + Console.WriteLine($"[N6-DUMP] Surface histogram written to {outPath} ({seen.Count} textures, {totalBytes} bytes)"); + } + private DecodedTexture DecodeFromDats(uint surfaceId, uint? origTextureOverride, PaletteOverride? 
paletteOverride) { var surface = _dats.Get(surfaceId); @@ -364,6 +487,7 @@ public sealed unsafe class TextureCache : Wb.ITextureCachePerInstance, IDisposab _gl.TexParameter(TextureTarget.Texture2D, TextureParameterName.TextureWrapT, (int)TextureWrapMode.Repeat); _gl.BindTexture(TextureTarget.Texture2D, 0); + _uploadMetadata[tex] = (decoded.Width, decoded.Height, "RGBA8_DECODED"); return tex; } @@ -396,6 +520,7 @@ public sealed unsafe class TextureCache : Wb.ITextureCachePerInstance, IDisposab _gl.TexParameter(TextureTarget.Texture2DArray, TextureParameterName.TextureWrapT, (int)TextureWrapMode.Repeat); _gl.BindTexture(TextureTarget.Texture2DArray, 0); + _uploadMetadata[tex] = (decoded.Width, decoded.Height, "RGBA8_DECODED"); return tex; } From 76ca3ffca88bebce365acc9b59e897123f6537d8 Mon Sep 17 00:00:00 2001 From: Erik Date: Mon, 11 May 2026 12:43:35 +0200 Subject: [PATCH 6/7] docs(perf #N6.1): apply code-quality review fixes to baseline doc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Code-quality review on commit 13abf96 flagged 3 Important issues in the baseline document plus 2 minor roadmap consistency gaps. Applied all of them: 1. The "CPU scales superlinearly with N₁" claim was imprecise because CPU growth (4.0×) is actually sublinear vs near-LB count (7.7×). Clarified: CPU grows more than linearly with radius N₁ but sublinearly with visible-LB count; frustum cull discards most far LBs early. The outer per-LB walk still scales with N₁, which is what Tier 2's persistent groups address. 2. The "40-50% memory footprint reduction from atlas packing" estimate was asserted without derivation and likely too optimistic given all surfaces are already power-of-two and same-format (RGBA8). Replaced with a more honest bound: "low-MB to ~10 MB absolute saving" with explicit per-array metadata overhead reasoning. Conclusion is unchanged — atlas adoption still isn't justified given GPU under-utilization. 3. 
The "spec §6 threshold for atlas is >30%" citation pointed at text that doesn't exist in the spec. Replaced with "A conventional rule-of-thumb" so a future reader doesn't chase a phantom citation. Plus roadmap consistency: M1: The N.6 slice 1 bullet now uses the canonical "✓ SHIPPED — Title. Shipped YYYY-MM-DD." prefix that every other shipped phase uses. M2: Added N.6.1 row to the shipped table at the top of the roadmap (lines ~55-66) so the at-a-glance shipped list is complete. None of these change the conclusion or the next-phase recommendation (C.1.5 first, then reduced N.6 slice 2). The fixes improve doc accuracy and future-readability. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/plans/2026-04-11-roadmap.md | 3 +- .../2026-05-11-phase-n6-perf-baseline.md | 43 +++++++++++-------- 2 files changed, 28 insertions(+), 18 deletions(-) diff --git a/docs/plans/2026-04-11-roadmap.md b/docs/plans/2026-04-11-roadmap.md index 7f535f3..cab23c6 100644 --- a/docs/plans/2026-04-11-roadmap.md +++ b/docs/plans/2026-04-11-roadmap.md @@ -63,6 +63,7 @@ | N.4 | Rendering pipeline foundation — adopted WB's `ObjectMeshManager` as the production mesh pipeline behind `ACDREAM_USE_WB_FOUNDATION` (default-on). `WbMeshAdapter` is the single seam (owns `ObjectMeshManager`, drains the staged-upload queue per frame, populates `AcSurfaceMetadataTable` with per-batch translucency / luminosity / fog metadata). `WbDrawDispatcher` is the production draw path: groups all visible (entity, batch) pairs, single-uploads the matrix buffer, fires one `glDrawElementsInstancedBaseVertexBaseInstance` per group with `BaseInstance` slicing into the shared instance VBO. `LandblockSpawnAdapter` + `EntitySpawnAdapter` bridge spawn lifecycle to WB ref-counts (atlas tier vs per-instance). Perf wins shipped as part of N.4: per-entity frustum cull, opaque front-to-back sort, palette-hash memoization (compute once per entity, reuse across batches). 
Visual verification at Holtburg passed: scenery + connected characters with full close-detail geometry (Issue #47 regression resolved). Legacy `InstancedMeshRenderer` retained as `ACDREAM_USE_WB_FOUNDATION=0` escape hatch until N.6 (retired early in N.5 ship amendment). | Live ✓ | | N.5 | Modern rendering path — lifted `WbDrawDispatcher` onto bindless textures (`GL_ARB_bindless_texture`) + `glMultiDrawElementsIndirect`. Per-frame entity rendering: 3 SSBO uploads (instance matrices @ binding=0, batch data @ binding=1, indirect commands) + 2 indirect draw calls (opaque + transparent). ~12-15 GL calls per frame regardless of group count, down from hundreds-of-per-group in N.4. CPU dispatcher: 1.23 ms/frame median at Holtburg courtyard (1662 groups, ~810 fps sustained). All textures on the WB modern path use 1-layer `Texture2DArray` + `sampler2DArray`. Legacy callers keep `Texture2D` / `sampler2D` via the parallel `TextureCache` path until N.6 retires them. Three gotchas captured in memory: texture target lock-in, bindless Dispose order (two-phase non-resident before delete), GL_TIME_ELAPSED double-buffering. **Ship amendment 2026-05-08:** legacy renderers (`InstancedMeshRenderer`, `StaticMeshRenderer`, `WbFoundationFlag`) retired within N.5 — modern path is mandatory; missing bindless throws `NotSupportedException` at startup. N.6 scope narrowed accordingly. Plan archived at `docs/superpowers/plans/2026-05-08-phase-n5-modern-rendering.md`. | Live ✓ | | N.5b | Terrain on the modern rendering path — `TerrainModernRenderer` replaces `TerrainChunkRenderer` (the latter plus `TerrainRenderer` + `terrain.vert/.frag` deleted). Single global VBO/EBO with slot allocator (one slot per landblock), per-frame `DrawElementsIndirectCommand[]` upload + `glMultiDrawElementsIndirect`, bindless atlas handles passed as `uvec2` uniforms reconstructed via `sampler2DArray(handle)`. 
**Path C** chosen: mirrors WB's `TerrainRenderManager` pattern but consumes `LandblockMesh.Build` so retail's `FSplitNESW` formula is preserved (closes ISSUE #51). Path A killed by 49.98% measured divergence between WB's `CalculateSplitDirection` and retail's at addr `00531d10`; Path B (fork-patch WB) rejected for permanent maintenance burden. Perf at Holtburg radius=5 (commit `da56063`): modern 6.4-7.0 µs / 9-14 µs p95 vs legacy 1.5 µs / 3.0 µs — **modern is ~4× SLOWER on CPU at radius=5** because legacy's 16×16-LB chunking collapsed visible LBs to one `glDrawElements`. Architectural wins (zero `glBindTexture`/frame, constant-cost dispatch, per-LB frustum cull) manifest at higher radius (A.5 territory). Spec acceptance criterion 5 ("≥10% lower CPU at radius=5") amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Three gotchas captured in memory: `uniform sampler2DArray` + `glProgramUniformHandleARB` GL_INVALID_OPERATIONs on at least one driver (use `uniform uvec2` + `sampler2DArray(handle)` constructor instead — N.5's mesh_modern pattern); `MaybeFlushTerrainDiag` median-calc underflow on first sample; visual gates need actual visual confirmation, not assent. Plan archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`. | Live ✓ | +| N.6.1 | Phase N.6 slice 1 — GPU timing fix + radius=12 perf baseline. Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated `ACDREAM_DUMP_SURFACES=1` one-shot surface-format histogram dump in `TextureCache` for the atlas-opportunity audit. Captured authoritative baseline at Holtburg radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` diagnostic; baseline doc concludes CPU dominates GPU by 30–50× at every radius and recommends C.1.5 next then reduced-scope slice 2 (atlas + persistent-mapped buffers dropped). 
Baseline numbers at [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). Plan archived at `docs/superpowers/plans/2026-05-11-phase-n6-slice1.md`. | Live ✓ | Plus polish that doesn't get its own phase number: - FlyCamera default speed lowered + Shift-to-boost @@ -687,7 +688,7 @@ for our deletions/additions; merge upstream `master` periodically. manifest at higher radius. Spec acceptance criterion #5 was wrong; amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Plan archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`. -- **N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** **SHIPPED 2026-05-11.** +- **✓ SHIPPED — N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** Shipped 2026-05-11. Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated surface-format histogram dump in `TextureCache` diff --git a/docs/plans/2026-05-11-phase-n6-perf-baseline.md b/docs/plans/2026-05-11-phase-n6-perf-baseline.md index 75f6a8e..ba5f0a1 100644 --- a/docs/plans/2026-05-11-phase-n6-perf-baseline.md +++ b/docs/plans/2026-05-11-phase-n6-perf-baseline.md @@ -87,9 +87,9 @@ Same as the dimension buckets above since there is only one format. The top-3 tr (128×128, 64×64, 256×256) cover 449 of 760 surfaces = **59%**. **Atlas-opportunity score: 59%** of surfaces fall into the top-3 (W, H, format) triples. -The spec §6 threshold for "atlas work is justified for memory savings" is >30%; this -measurement is well above it. However, see §4 for why atlas is not the right next step -despite the high score. +A conventional rule-of-thumb is that >30% concentration into the top buckets makes atlas +packing worth the implementation cost for memory savings; this measurement is well above +that. However, see §4 for why atlas is not the right next step despite the high score. ## §4. 
Conclusion + next-phase recommendation @@ -105,13 +105,17 @@ walking — against a 16,600 µs frame budget at 60 FPS. The GPU is working at r particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged. -**CPU scales superlinearly with N₁ (near-tier radius).** As N₁ grows from 4 → 8 → 12, -median cpu_us grows from 3.2 ms → 6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the -r4 baseline. The Tier 1 entity-classification cache (`EntityClassificationCache`, shipped -as #53) wins on the inner loop (per-entity classification avoided on cache hits) but the -outer per-LB walk still scales with N₁. This is exactly what the Tier 2 plan (persistent -groups) at `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating -the per-frame LB scan entirely. +**CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with +visible-LB count.** As N₁ grows from 4 → 8 → 12, median cpu_us grows from 3.2 ms → +6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the r4 baseline. The visible-LB count +scales as `(2N+1)²`: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0× +vs 7.7× expected if every LB cost the same). Frustum culling discards most far LBs +early, but the outer per-LB walk still has to touch each one. The Tier 1 entity- +classification cache (`EntityClassificationCache`, shipped as #53) wins on the inner +loop (per-entity classification avoided on cache hits) but the outer walk dominates +as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at +`docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating the +per-frame LB scan entirely. **Radius=12 is not the production scenario.** `ACDREAM_STREAM_RADIUS=12` forces N₁=12 (625 near LBs at full detail). 
The production A.5 default preset is N₁=4 / N₂=12 (81 @@ -119,13 +123,18 @@ full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes comfortable 200–400 FPS at the default preset. The numbers above characterize the scaling curve for headroom analysis, not the experience a typical player sees. -**Atlas opportunity is high (59%) but the win is memory-only.** With 96 MB of textures -and 59% in the top-3 dimension buckets, atlas consolidation would reduce sampler-switch -count (currently near-zero already, since bindless textures are made resident once) and -shrink the texture memory footprint by roughly 40–50% through packing. But GPU is not -bottlenecked on sampler switches or memory bandwidth — the 0.6 ms gpu_us p95 at radius=12 -walking demonstrates this directly. Atlas adoption would cost 1–2 weeks of implementation -risk for a memory saving the process doesn't currently need at 96 MB. +**Atlas opportunity is high (59%) but the win is memory-only — and modest.** With 96 MB +of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the +top buckets share single `Texture2DArray` objects rather than each surface owning its +own 1-layer array. The primary wins of atlas — fewer sampler switches, fewer texture +binds — are already near-zero because bindless textures are made resident once at upload +and never bound per draw. The remaining win is the per-array metadata overhead × N +surfaces, which is bounded but not dramatic given all surfaces are already power-of-two +and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on +the order of low-MB to ~10 MB, not a 40–50% halving. GPU is not bottlenecked on sampler +switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this +directly), so atlas adoption would cost 1–2 weeks of implementation risk for a memory +saving the process doesn't currently need at 96 MB. 
### Recommendation From 41981c4d74f9c621b43984d0ce7dd1975f951472 Mon Sep 17 00:00:00 2001 From: Erik Date: Mon, 11 May 2026 12:51:10 +0200 Subject: [PATCH 7/7] =?UTF-8?q?docs(perf=20#N6.1):=20apply=20final-review?= =?UTF-8?q?=20fixes=20=E2=80=94=20spec,=20baseline=20doc,=20issue=20#55?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Final code review of slice 1 flagged one Important issue (the spec's "zero cost when off" claim for the surface-dump path is technically violated — _uploadMetadata always writes one dict entry per upload regardless of env var) plus minor doc/consistency gaps. Applied: 1. Spec §5 "Cost when off": dropped the "Zero" claim; replaced with "Negligible — one Dictionary write per upload (~30-50 KB at Holtburg) plus a hash-table write per upload. Expensive work (file I/O, histogram construction) is still env-gated." This matches reality. 2. Baseline doc §5: rewrote from "Raw logs (scratch, can be deleted)" referencing files that were never preserved in this worktree, to "Reproducing the measurements" with the actual PowerShell launch commands. Honest about the raw logs not being kept; the captured medians in section 2 are the canonical record. 3. New issue #55 filed in docs/ISSUES.md — static-entity slow path reports ~1.45M meshMissing/5s at r4 standstill, drops to ~0 when walking. LOW severity (no visible regression), hypothesis points at a "permanently-missing entity gets re-classified every frame" pattern that Tier 1 cache doesn't cover. 4. Roadmap shipped table: renamed "N.6.1" row to "N.6 slice 1" to match every other artifact's naming. Search-discoverability fix. None of these change the slice's conclusion or next-phase recommendation (C.1.5 first, then reduced-scope slice 2). 
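One way the "permanently-missing entity gets re-classified every frame" pattern described in point 3 could be broken is negative caching: remember the failed classification so the slow path stops re-running per frame. A minimal sketch, assuming the hypothesis holds — every name here (`ClassificationCache`, `classify`, `on_mesh_resolved`) is a hypothetical illustration, not the real `EntityClassificationCache` API, and it is plain Python rather than the project's C#:

```python
# Hypothetical sketch only — NOT the real EntityClassificationCache API.
class ClassificationCache:
    def __init__(self):
        self.hits = {}        # entity_id -> classification result
        self.missing = set()  # entity_ids whose mesh is known-unresolvable

    def classify(self, entity_id, resolve):
        if entity_id in self.hits:
            return self.hits[entity_id]
        if entity_id in self.missing:
            return None  # cached miss: no slow path, no meshMissing bump
        result = resolve(entity_id)  # the expensive slow path
        if result is None:
            # Remember the failure instead of re-checking every frame.
            self.missing.add(entity_id)
        else:
            self.hits[entity_id] = result
        return result

    def on_mesh_resolved(self, entity_id):
        # Invalidation hook: when streaming resolves the mesh, allow
        # the next classify() call to run the slow path once more.
        self.missing.discard(entity_id)
```

A real fix would also need the streaming loader to actually fire that invalidation signal, so a cached miss doesn't outlive the mesh arriving; `on_mesh_resolved` stands in for that hook.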
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/ISSUES.md | 34 +++++++++++++++++++ docs/plans/2026-04-11-roadmap.md | 2 +- .../2026-05-11-phase-n6-perf-baseline.md | 33 +++++++++++++----- .../2026-05-11-phase-n6-slice1-design.md | 2 +- 4 files changed, 60 insertions(+), 11 deletions(-) diff --git a/docs/ISSUES.md b/docs/ISSUES.md index 3565c30..e6d52f4 100644 --- a/docs/ISSUES.md +++ b/docs/ISSUES.md @@ -46,6 +46,40 @@ Copy this block when adding a new issue: # Active issues +## #55 — Static-entity slow path reports ~1.45M `meshMissing` per 5s at r4 standstill + +**Status:** OPEN +**Severity:** LOW (no visible regression — affects a diagnostic counter, not rendered output) +**Filed:** 2026-05-11 +**Component:** rendering / `WbDrawDispatcher` static-entity classification path + +**Description:** During the Phase N.6 slice 1 baseline measurement (`docs/plans/2026-05-11-phase-n6-perf-baseline.md` §2), +the radius=4 standstill scenario reported `meshMissing ≈ 1,450,000` per 5-second +`[WB-DIAG]` window. The same scenario while walking drops to near-zero (`meshMissing = 0` +in the steady state) as new landblocks stream in and previously-missing meshes resolve. +This suggests the static-entity slow path's mesh-load lifecycle has some delay before +populating for newly-streamed content but eventually catches up; the standstill case +keeps re-counting the same set of entities-with-unresolved-meshes for the duration of +the run. The counter is per-frame, so the absolute number scales with FPS — at the +measured ~150 FPS that's ~290K reports/s, or ~1,900 entities, each reported once per frame. + +**Root cause / status:** Not investigated. Hypothesis: an entity classification path +counts mesh-missing on every frame for static entities whose `MeshRef` resolution races +the streaming loader.
The Tier 1 cache (#53) populates only for entities whose +classification succeeded, so persistently-failing entities run the slow path every frame +forever and bump `meshMissing` every time. If true, the fix is either (a) cache the +"this entity's mesh genuinely doesn't exist" result so we stop re-checking, or (b) +deferred-classify the entity once its `MeshRef` resolves. + +**Files:** `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs` (the slow path that +increments `_meshesMissing`), `src/AcDream.App/Rendering/Wb/EntityClassificationCache.cs` +(the Tier 1 cache — likely needs to learn about "permanently missing" entries). + +**Acceptance:** `meshMissing` should drop to near-zero within ~5 seconds of streaming +settle at any radius/motion combination, not stay at ~1.45M/5s indefinitely at standstill. + +--- + ## #50 — Road-edge tree at 0xA9B1 visible in acdream but not retail **Status:** OPEN diff --git a/docs/plans/2026-04-11-roadmap.md b/docs/plans/2026-04-11-roadmap.md index cab23c6..91b674a 100644 --- a/docs/plans/2026-04-11-roadmap.md +++ b/docs/plans/2026-04-11-roadmap.md @@ -63,7 +63,7 @@ | N.4 | Rendering pipeline foundation — adopted WB's `ObjectMeshManager` as the production mesh pipeline behind `ACDREAM_USE_WB_FOUNDATION` (default-on). `WbMeshAdapter` is the single seam (owns `ObjectMeshManager`, drains the staged-upload queue per frame, populates `AcSurfaceMetadataTable` with per-batch translucency / luminosity / fog metadata). `WbDrawDispatcher` is the production draw path: groups all visible (entity, batch) pairs, single-uploads the matrix buffer, fires one `glDrawElementsInstancedBaseVertexBaseInstance` per group with `BaseInstance` slicing into the shared instance VBO. `LandblockSpawnAdapter` + `EntitySpawnAdapter` bridge spawn lifecycle to WB ref-counts (atlas tier vs per-instance). Perf wins shipped as part of N.4: per-entity frustum cull, opaque front-to-back sort, palette-hash memoization (compute once per entity, reuse across batches). 
Visual verification at Holtburg passed: scenery + connected characters with full close-detail geometry (Issue #47 regression resolved). Legacy `InstancedMeshRenderer` retained as `ACDREAM_USE_WB_FOUNDATION=0` escape hatch until N.6 (retired early in N.5 ship amendment). | Live ✓ | | N.5 | Modern rendering path — lifted `WbDrawDispatcher` onto bindless textures (`GL_ARB_bindless_texture`) + `glMultiDrawElementsIndirect`. Per-frame entity rendering: 3 SSBO uploads (instance matrices @ binding=0, batch data @ binding=1, indirect commands) + 2 indirect draw calls (opaque + transparent). ~12-15 GL calls per frame regardless of group count, down from hundreds-of-per-group in N.4. CPU dispatcher: 1.23 ms/frame median at Holtburg courtyard (1662 groups, ~810 fps sustained). All textures on the WB modern path use 1-layer `Texture2DArray` + `sampler2DArray`. Legacy callers keep `Texture2D` / `sampler2D` via the parallel `TextureCache` path until N.6 retires them. Three gotchas captured in memory: texture target lock-in, bindless Dispose order (two-phase non-resident before delete), GL_TIME_ELAPSED double-buffering. **Ship amendment 2026-05-08:** legacy renderers (`InstancedMeshRenderer`, `StaticMeshRenderer`, `WbFoundationFlag`) retired within N.5 — modern path is mandatory; missing bindless throws `NotSupportedException` at startup. N.6 scope narrowed accordingly. Plan archived at `docs/superpowers/plans/2026-05-08-phase-n5-modern-rendering.md`. | Live ✓ | | N.5b | Terrain on the modern rendering path — `TerrainModernRenderer` replaces `TerrainChunkRenderer` (the latter plus `TerrainRenderer` + `terrain.vert/.frag` deleted). Single global VBO/EBO with slot allocator (one slot per landblock), per-frame `DrawElementsIndirectCommand[]` upload + `glMultiDrawElementsIndirect`, bindless atlas handles passed as `uvec2` uniforms reconstructed via `sampler2DArray(handle)`. 
**Path C** chosen: mirrors WB's `TerrainRenderManager` pattern but consumes `LandblockMesh.Build` so retail's `FSplitNESW` formula is preserved (closes ISSUE #51). Path A killed by 49.98% measured divergence between WB's `CalculateSplitDirection` and retail's at addr `00531d10`; Path B (fork-patch WB) rejected for permanent maintenance burden. Perf at Holtburg radius=5 (commit `da56063`): modern 6.4-7.0 µs / 9-14 µs p95 vs legacy 1.5 µs / 3.0 µs — **modern is ~4× SLOWER on CPU at radius=5** because legacy's 16×16-LB chunking collapsed visible LBs to one `glDrawElements`. Architectural wins (zero `glBindTexture`/frame, constant-cost dispatch, per-LB frustum cull) manifest at higher radius (A.5 territory). Spec acceptance criterion 5 ("≥10% lower CPU at radius=5") amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Three gotchas captured in memory: `uniform sampler2DArray` + `glProgramUniformHandleARB` GL_INVALID_OPERATIONs on at least one driver (use `uniform uvec2` + `sampler2DArray(handle)` constructor instead — N.5's mesh_modern pattern); `MaybeFlushTerrainDiag` median-calc underflow on first sample; visual gates need actual visual confirmation, not assent. Plan archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`. | Live ✓ | -| N.6.1 | Phase N.6 slice 1 — GPU timing fix + radius=12 perf baseline. Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated `ACDREAM_DUMP_SURFACES=1` one-shot surface-format histogram dump in `TextureCache` for the atlas-opportunity audit. Captured authoritative baseline at Holtburg radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` diagnostic; baseline doc concludes CPU dominates GPU by 30–50× at every radius and recommends C.1.5 next then reduced-scope slice 2 (atlas + persistent-mapped buffers dropped). 
Baseline numbers at [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). Plan archived at `docs/superpowers/plans/2026-05-11-phase-n6-slice1.md`. | Live ✓ | +| N.6 slice 1 | GPU timing fix + radius=12 perf baseline. Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated `ACDREAM_DUMP_SURFACES=1` one-shot surface-format histogram dump in `TextureCache` for the atlas-opportunity audit. Captured authoritative baseline at Holtburg radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` diagnostic; baseline doc concludes CPU dominates GPU by 30–50× at every radius and recommends C.1.5 next then reduced-scope slice 2 (atlas + persistent-mapped buffers dropped). Baseline numbers at [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). Plan archived at `docs/superpowers/plans/2026-05-11-phase-n6-slice1.md`. | Live ✓ | Plus polish that doesn't get its own phase number: - FlyCamera default speed lowered + Shift-to-boost diff --git a/docs/plans/2026-05-11-phase-n6-perf-baseline.md b/docs/plans/2026-05-11-phase-n6-perf-baseline.md index ba5f0a1..93870e7 100644 --- a/docs/plans/2026-05-11-phase-n6-perf-baseline.md +++ b/docs/plans/2026-05-11-phase-n6-perf-baseline.md @@ -165,14 +165,29 @@ default N₁=4; worth revisiting if a future quality preset wants N₁=8 as defa cpu_us median > 5,000 µs or gpu_us p95 > 8,000 µs, re-open the escalation question. Otherwise, hold the C.1.5 → reduced-slice-2 sequence. -## §5. Raw logs +## §5. Reproducing the measurements -Scratch logs from this measurement run (not committed; can be deleted once the doc is -reviewed): +Raw `[WB-DIAG]` output from each run was inspected live during measurement and the +median of the last three steady-state lines from each scenario was transcribed into §2. 
+The raw launch logs were not preserved — the captured medians in §2 are the canonical +record. To reproduce on the same hardware: -- `baseline-r4-stand.log`, `baseline-r4-walk.log` -- `baseline-r8-stand.log`, `baseline-r8-walk.log` -- `baseline-r12-stand.log`, `baseline-r12-walk.log` -- `baseline-surfaces.log` (launch log for `ACDREAM_DUMP_SURFACES=1` run) -- `baseline-surfaces.txt` (copy of `%LOCALAPPDATA%\acdream\n6-surfaces.txt`) -- `task1-verify.log` (Task 1 manual verification log) +```powershell +$env:ACDREAM_DAT_DIR = "$env:USERPROFILE\Documents\Asheron's Call" +$env:ACDREAM_LIVE = "1" +$env:ACDREAM_TEST_HOST = "127.0.0.1" +$env:ACDREAM_TEST_PORT = "9000" +$env:ACDREAM_TEST_USER = "testaccount" +$env:ACDREAM_TEST_PASS = "testpassword" +$env:ACDREAM_WB_DIAG = "1" +$env:ACDREAM_STREAM_RADIUS = "4" # or 8, 12 +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline.log" +``` + +Stand still for ~30 s at the target radius (60 s at radius 12 to let streaming settle), +or walk N→E→S→W across one landblock. Then `Select-String -Path baseline.log -Pattern +"\[WB-DIAG\]" | Select-Object -Last 3` captures the steady-state numbers. + +For the surface histogram, also set `$env:ACDREAM_DUMP_SURFACES = "1"`, stay in-world +~30 s after streaming has loaded ≥100 textures (the cache-size gate), then read +`$env:LOCALAPPDATA\acdream\n6-surfaces.txt`. diff --git a/docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md b/docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md index 3c35307..ec80af2 100644 --- a/docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md +++ b/docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md @@ -196,7 +196,7 @@ Plus rollups at the end: ### Cost when off -Zero — gated by the env-var check. The dump method is only called from a guarded `if` in `GameWindow.cs`. 
+Negligible — one `Dictionary` write per `UploadRgba8`/`UploadRgba8AsLayer1Array` call (the `_uploadMetadata` insertion is unconditional, so the dump path doesn't have to query GL state when it does fire). At Holtburg with 760 textures that's ~30–50 KB of process memory — invisible at runtime, no GC pressure. The expensive work (file I/O, histogram construction) is gated by the env-var check inside `TickSurfaceHistogramDumpIfEnabled` and only runs when `ACDREAM_DUMP_SURFACES=1`.

---
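Appendix — the ring-of-3, read-before-overwrite scheme this series keeps referencing can be illustrated without GL at all. This is a hedged sketch in plain Python rather than the project's C#: `FakeGpu` and its two-frame result latency are invented stand-ins for driver behaviour, but the slot arithmetic mirrors the fix — each frame first drains the slot it is about to reuse, which was issued `RING` frames earlier, so the GPU has had time to finish it.

```python
RING = 3  # three query slots: write into one, read the one issued RING frames ago

class FakeGpu:
    """Invented stand-in for the driver: a GL_TIME_ELAPSED result becomes
    available LATENCY frames after the query is issued."""
    LATENCY = 2

    def __init__(self):
        self.pending = {}  # slot -> (frame issued, elapsed_us)
        self.frame = 0

    def issue(self, slot, elapsed_us):
        self.pending[slot] = (self.frame, elapsed_us)

    def try_read(self, slot):
        # Mirrors polling ResultAvailable: None == "avail is 0, drop sample".
        if slot in self.pending:
            issued, us = self.pending[slot]
            if self.frame - issued >= self.LATENCY:
                del self.pending[slot]
                return us
        return None

gpu, samples, write = FakeGpu(), [], 0
for frame in range(10):
    gpu.frame = frame
    # Read-before-overwrite: drain the slot we are about to reuse.
    us = gpu.try_read(write)
    if us is not None:
        samples.append(us)
    gpu.issue(write, elapsed_us=100 + frame)  # this frame's timing query
    write = (write + 1) % RING
# samples fills from frame 3 onward: each read targets a query issued
# 3 frames earlier, comfortably past the 2-frame latency.
```

With the old same-frame poll (equivalent to `RING = 1` here), `try_read` would always return `None` and `samples` would stay empty — the exact `gpu_us=0m/0p95` symptom the fix removes.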