acdream/docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md
Phase N.6 slice 1 — GPU timing fix + radius=12 perf baseline (design)

Created: 2026-05-11.
Status: approved design, ready for implementation plan.
Phase context: Phase N.6 (perf polish) was split into two slices on 2026-05-11; this is slice 1. Slice 2 (legacy TextureCache cleanup + shader migration + optional persistent-mapped buffers) is deferred until after C.1.5 (PES emitter wiring) and gets its own spec then.
Roadmap entry: docs/plans/2026-04-11-roadmap.md lines 690-705 (to be amended in commit 2 to reflect the slice split).


§1. Problem

WbDrawDispatcher runs glBeginQuery(GL_TIME_ELAPSED, …) … glEndQuery around the opaque and transparent indirect draws, then immediately polls glGetQueryObject(…, ResultAvailable, …) on the same frame to read the result. The GPU has not finished executing the draw by the time the polling call runs, so avail is always 0, the sample is dropped, and the _gpuSamples ring stays all-zero forever. The user sees gpu_us=0m/0p95 in every [WB-DIAG] line under ACDREAM_WB_DIAG=1.

Verified at src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs:849-859.

Without this fix:

  • Every future perf decision (Tier 2 vs Tier 3 vs slice 2 vs do-nothing) is made on CPU-only data.
  • We cannot tell whether the dispatcher is CPU-bound or GPU-bound at radius=12.
  • We cannot validate that N.5/N.5b/Tier 1 changes actually moved GPU time.

This slice ships the GPU-timing fix and uses the now-working diagnostic to produce one authoritative perf baseline document so the next phase decision (slice 2 vs C.1.5 vs Tier 2/3) is data-driven.


§2. Goals and non-goals

Goals

  1. [WB-DIAG] reports non-zero gpu_us for the entity dispatcher's opaque+transparent passes at Holtburg radius=12 with ACDREAM_WB_DIAG=1.
  2. The fix works on AMD, NVIDIA, and Intel desktop OpenGL drivers without vendor-specific code paths.
  3. Produce a baseline document at docs/plans/2026-05-11-phase-n6-perf-baseline.md with CPU and GPU numbers across radii 4 / 8 / 12 (standstill + walking), a surface-format histogram, and a memory snapshot.
  4. The baseline document closes with a recommendation paragraph: should the next phase be N.6 slice 2 (perf cleanup), C.1.5 (PES wiring), or escalation to Tier 2 (static/dynamic split). Rationale grounded in the captured numbers.
  5. dotnet build and dotnet test green; no functional regression in the rendering path.

Non-goals

  • Persistent-mapped buffers (replacing BufferSubData with GL_MAP_PERSISTENT_BIT mappings). Deferred to slice 2 unless the baseline shows it's a hot spot.
  • Legacy TextureCache cleanup, mesh.frag orphan deletion, sky/UI text shader migration to bindless. All deferred to slice 2.
  • WB atlas adoption / texture-array consolidation. Deferred to slice 2 pending the surface histogram from goal 3.
  • Adding GPU queries to terrain / sky / particle / debug-line passes. Slice 1 keeps query scope to the existing two queries inside WbDrawDispatcher (opaque-pass + transparent-pass).
  • GPU compute culling. That's Tier 3 of docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md, separate roadmap.

§3. Design decisions (from brainstorming, 2026-05-11)

| # | Decision | Rationale |
|---|---|---|
| Q1 | Ring of 3 query-pair slots (not ring of 2) | Vendor-neutral. NVIDIA drivers with triple-buffering + vsync can queue ~3 frames ahead; AMD typically 1-2; Intel iGPUs vary. Ring of 2 plus ResultAvailable guard works everywhere but drops more samples on deeper queues. Ring of 3 collects samples reliably across all desktop drivers. Cost: one extra GLuint query pair (~12 bytes of GPU state) plus one frame of latency on the printed value, which is invisible because the diagnostic is a 256-frame moving-window median. |
| Q2 | Read-before-issue, same-slot pattern | On frame N, attempt to read slot N%3 (which contains frame N-3's result, the oldest unread data, ~50 ms old at 60 fps) before overwriting it with frame N's queries. Reading the oldest data maximizes the chance that ResultAvailable=1 across all desktop drivers. Use ResultAvailable as a guard: if not ready, skip the sample. MedianMicros already computes over the non-zero subset, so dropped samples don't poison the result. |
| Q3 | Keep query scope unchanged: just the two existing queries (opaque-pass + transparent-pass for the WB dispatcher) | Slice 1 is "fix what's broken," not "expand instrumentation." Adding terrain / sky / particle queries is slice-2-or-later work and would inflate this slice past the half-day budget. |
| Q4 | Surface-format histogram via env-gated one-shot dump (ACDREAM_DUMP_SURFACES=1) | The atlas-adoption decision in slice 2 needs to know whether enough surfaces share dimensions/format to make consolidation worthwhile. A one-time dump to a fixed file path (fired once per session, see §5) is cheap to implement, negligible in cost when off, and lets the user re-run cheaply when needed. Output goes to %LOCALAPPDATA%\acdream\n6-surfaces.txt (not stdout) to avoid spamming the launch log. |
| Q5 | Two commits, not one | Commit 1 is the GPU-timing fix (code change, regression-bisectable). Commit 2 is the surface-dump path + baseline document (docs + env-gated diag). Keeping them separate means a future bisect for a GPU-timing regression doesn't land on a doc commit. |
| Q6 | Baseline measurement is Holtburg + High preset only (per the user's hardware) | Slice 1 doesn't pretend to be a cross-hardware perf survey. It's one canonical measurement on the dev machine. The document template captures setup explicitly so an NVIDIA / lower-end run can be added later without re-architecting the doc. |
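
The Q1/Q2 trade-off can be illustrated with a toy model. This is a sketch under stated assumptions, not dispatcher code: it assumes each frame's query result becomes available a known number of frames after it was issued (the hypothetical `latencyFrames` input), which real drivers do not guarantee. `RingDepthModel` and `CountCapturedSamples` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Toy model only. latencyFrames[i] is an ASSUMED number of frames after
// which the queries issued on frame i have a result available.
public static class RingDepthModel
{
    // Counts the samples a read-before-issue ring of `ringDepth` slots
    // captures. A slot is read exactly once, `ringDepth` frames after it
    // was written, and is then overwritten whether or not the read succeeded.
    public static int CountCapturedSamples(int ringDepth, IReadOnlyList<int> latencyFrames)
    {
        int captured = 0;
        // Frames before `ringDepth` are warm-up: their slot was never written.
        for (int frame = ringDepth; frame < latencyFrames.Count; frame++)
        {
            int issuedAt = frame - ringDepth;
            // Mirrors the ResultAvailable guard: ready means captured,
            // not ready means the sample is dropped.
            if (latencyFrames[issuedAt] <= ringDepth) captured++;
        }
        return captured;
    }
}
```

With a made-up latency pattern of 1, 2, 3 repeating over 12 frames, this model captures 7 samples at depth 2 and 9 at depth 3. Depth 3 loses only the three frames at the end whose slots are never read back, while depth 2 additionally drops every frame whose result needed three frames.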

§4. Change 1 — GPU query double-buffering

Files touched

  • src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs — single-file change, ~30 LOC delta.

Current state (verified)

// Field declarations near line 155:
private uint _gpuQueryOpaque;
private uint _gpuQueryTransparent;
private readonly long[] _gpuSamples = new long[256];
private bool _gpuQueriesInitialized;

// Init at line ~347:
if (diag && !_gpuQueriesInitialized) {
    _gpuQueryOpaque      = _gl.GenQuery();
    _gpuQueryTransparent = _gl.GenQuery();
    _gpuQueriesInitialized = true;
}

// Around the opaque draw at line ~774:
if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque);
// ... opaque indirect draw ...
if (diag && _gpuQueriesInitialized) _gl.EndQuery(QueryTarget.TimeElapsed);

// Same pattern around transparent draw at line ~823.

// Read at line ~849 — BUG: same frame, never ready:
if (_gpuQueriesInitialized) {
    _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.ResultAvailable, out int avail);
    if (avail != 0) {
        _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.Result, out ulong opaqueNs);
        _gl.GetQueryObject(_gpuQueryTransparent, QueryObjectParameterName.Result, out ulong transNs);
        long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
        _gpuSamples[_gpuSampleCursor] = gpuUs;
        _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
    }
}

// Dispose at line ~1140:
if (_gpuQueriesInitialized) {
    _gl.DeleteQuery(_gpuQueryOpaque);
    _gl.DeleteQuery(_gpuQueryTransparent);
}

Target state

private const int GpuQueryRingDepth = 3;
private readonly uint[] _gpuQueryOpaque      = new uint[GpuQueryRingDepth];
private readonly uint[] _gpuQueryTransparent = new uint[GpuQueryRingDepth];
private int _gpuQueryFrameIndex;  // increments every frame we issue queries
private bool _gpuQueriesInitialized;

// Init:
if (diag && !_gpuQueriesInitialized) {
    for (int i = 0; i < GpuQueryRingDepth; i++) {
        _gpuQueryOpaque[i]      = _gl.GenQuery();
        _gpuQueryTransparent[i] = _gl.GenQuery();
    }
    _gpuQueriesInitialized = true;
}

// Compute the slot index for this frame. We read this slot's previous
// contents (frame N-3's queries — the oldest data in the ring) and then
// overwrite it with this frame's queries.
int slot = _gpuQueryFrameIndex % GpuQueryRingDepth;

// Read frame N-3's result BEFORE overwriting. Gated on "we've completed
// at least one full ring of writes" so we don't read uninitialized slots
// during warm-up.
if (_gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth) {
    _gl.GetQueryObject(_gpuQueryOpaque[slot], QueryObjectParameterName.ResultAvailable, out int avail);
    if (avail != 0) {
        _gl.GetQueryObject(_gpuQueryOpaque[slot],      QueryObjectParameterName.Result, out ulong opaqueNs);
        _gl.GetQueryObject(_gpuQueryTransparent[slot], QueryObjectParameterName.Result, out ulong transNs);
        long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
        _gpuSamples[_gpuSampleCursor] = gpuUs;
        _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
    }
    // If avail==0 the sample is dropped silently. MedianMicros already
    // computes over the non-zero subset, so dropped samples don't poison
    // the median.
}

// Issue this frame's queries into the same slot — overwriting the data
// we just (attempted to) read.
if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque[slot]);
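// ... opaque indirect draw ...
if (diag && _gpuQueriesInitialized) _gl.EndQuery(QueryTarget.TimeElapsed);

// ... same for transparent with _gpuQueryTransparent[slot] ...

// Advance only on frames that actually issued queries, so the warm-up gate
// above never reads a slot that was not written.
if (diag && _gpuQueriesInitialized) _gpuQueryFrameIndex++;

// Dispose: loop over the ring.
if (_gpuQueriesInitialized) {
    for (int i = 0; i < GpuQueryRingDepth; i++) {
        _gl.DeleteQuery(_gpuQueryOpaque[i]);
        _gl.DeleteQuery(_gpuQueryTransparent[i]);
    }
}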

Behavior

  • Frames 0, 1, 2 issue queries but no reads happen (the >= RingDepth gate skips them).
  • Frame 3 reads frame 0's queries (oldest in ring) and writes new queries into slot 0. Frame 4 reads frame 1's, etc.
  • Steady-state: each frame's queries are read exactly once, three frames after they were issued. Frames 0/1/2 produce no samples (startup artifact, ~50 ms at 60 fps); their own queries are read on frames 3/4/5. The last three frames' queries before shutdown are never read.
  • The diagnostic prints over a 256-frame moving window — at 200 fps that's ~1.3 s of history, so the first valid gpu_us median appears within ~2 s of moving.
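
The slot schedule and the timing figures above can be checked with a few lines of arithmetic. This is an illustrative sketch only; `QueryRingSchedule`, `ForFrame`, and `WindowSeconds` are hypothetical helpers, not part of WbDrawDispatcher.

```csharp
using System;

public static class QueryRingSchedule
{
    public const int RingDepth = 3;

    // For a frame index, returns the slot it uses and the frame whose
    // queries are read from that slot before it is overwritten.
    // FrameRead is null during warm-up, when no read happens.
    public static (int Slot, int? FrameRead) ForFrame(int frameIndex)
    {
        int slot = frameIndex % RingDepth;
        int? frameRead = frameIndex >= RingDepth ? frameIndex - RingDepth : (int?)null;
        return (slot, frameRead);
    }

    // Seconds of history covered by a moving window of `windowFrames` samples.
    public static double WindowSeconds(int windowFrames, double fps) => windowFrames / fps;
}
```

`ForFrame(3)` gives slot 0 reading frame 0, and `ForFrame(4)` gives slot 1 reading frame 1. `WindowSeconds(256, 200)` is 1.28, the ~1.3 s quoted above, and `WindowSeconds(RingDepth, 60)` is 0.05, the ~50 ms read-back delay at 60 fps.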

Diag interaction

MaybeFlushDiag already prints every 5 s; no change there.

MedianMicros already filters non-zero samples; no change there.
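
For readers unfamiliar with the non-zero filtering, the sketch below shows one way such a summary could be computed. It is not the actual MedianMicros implementation, which this spec does not reproduce: `DiagStatsSketch` and `Summarize` are hypothetical names, and the nearest-rank percentile convention is an assumption.

```csharp
using System;
using System.Linq;

public static class DiagStatsSketch
{
    // Returns (median, p95) over the non-zero entries of the sample ring,
    // or (0, 0) when nothing has been recorded yet. Zero entries are slots
    // that were never written, so they are excluded rather than averaged in.
    public static (long Median, long P95) Summarize(long[] ring)
    {
        long[] valid = ring.Where(v => v != 0).OrderBy(v => v).ToArray();
        if (valid.Length == 0) return (0, 0);
        return (NearestRank(valid, 0.50), NearestRank(valid, 0.95));
    }

    // Nearest-rank percentile (ASSUMED convention): the smallest value with
    // at least `percentile` of the sorted data at or below it.
    private static long NearestRank(long[] sorted, double percentile)
    {
        int rank = (int)Math.Ceiling(percentile * sorted.Length);
        return sorted[Math.Max(rank, 1) - 1];
    }
}
```

For a ring containing 0, 300, 0, 100, 200, 0 the sketch returns a median of 200 and a p95 of 300; an all-zero ring returns (0, 0), which corresponds to the gpu_us=0m/0p95 readout described in §1.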

The user-visible behavior change: gpu_us=Xm/Yp95 numbers in [WB-DIAG] reflect real GPU draw time for the entity dispatcher's two indirect calls.


§5. Change 2 — Surface-format histogram one-shot dump

Files touched

  • src/AcDream.App/Rendering/TextureCache.cs — add an env-gated dump method, ~40 LOC.
  • One caller in GameWindow.cs (per-frame hook, since the dump fires at a fixed frame index rather than on the first frame), ~5 LOC.

Trigger

Env var ACDREAM_DUMP_SURFACES=1. When set, on frame index 600 of the session (~10 s at 60 fps, ~3 s at 200 fps — both well past streaming settle at radius≤12), iterate all entries in the bindless caches (_bindlessBySurfaceId, _bindlessByOverridden, _bindlessByPalette) and emit a histogram to %LOCALAPPDATA%\acdream\n6-surfaces.txt. One-shot — fires once per session at the exact frame, no repeats. The user can re-launch to capture a fresh snapshot.
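
A minimal sketch of the one-shot gating described above, assuming the env var is read once at construction. `OneShotFrameTrigger` is a hypothetical helper for illustration; the real check lives inside TickSurfaceHistogramDumpIfEnabled and may be structured differently.

```csharp
using System;
using System.IO;

public sealed class OneShotFrameTrigger
{
    private readonly bool _enabled;
    private readonly int _fireAtFrame;
    private int _frameIndex;
    private bool _fired;

    public OneShotFrameTrigger(string envVarName, int fireAtFrame)
    {
        // Read the environment once so the per-frame path is a bool check.
        _enabled = Environment.GetEnvironmentVariable(envVarName) == "1";
        _fireAtFrame = fireAtFrame;
    }

    // Call once per frame. Returns true on exactly one call: the one whose
    // zero-based frame index equals fireAtFrame, and only when enabled.
    public bool Tick()
    {
        if (!_enabled || _fired) return false;
        if (_frameIndex++ < _fireAtFrame) return false;
        _fired = true;
        return true;
    }

    // Resolves %LOCALAPPDATA%\acdream\n6-surfaces.txt on Windows.
    public static string DefaultOutputPath() =>
        Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "acdream", "n6-surfaces.txt");
}
```

Constructed as `new OneShotFrameTrigger("ACDREAM_DUMP_SURFACES", 600)`, the trigger returns true only on frame index 600, so the caller can run the dump inside `if (trigger.Tick()) { ... }` and pay a single branch on every other frame.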

Output schema

Per entry, one line: surfaceId(uint32 hex), width(uint16), height(uint16), format(string), byteCount(uint32).

Plus rollups at the end:

  • Count by (width × height) bucket — answers "how many distinct dimension pairs?".
  • Count by source SurfaceFormat (INDEX16, BGRA, DXT1, etc.).
  • Total bytes (sum of width × height × 4 for RGBA8 uploads).
  • Top 10 most-shared (width, height, format) triples by count — this is the atlas-opportunity input.
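
The rollups listed above map naturally onto LINQ grouping. The sketch below is illustrative only: `SurfaceEntry` is a hypothetical record mirroring the per-entry line, and the real dump method in TextureCache.cs may store and format its data differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one dump entry (byteCount is derived below).
public sealed record SurfaceEntry(uint SurfaceId, ushort Width, ushort Height, string Format);

public static class SurfaceHistogramSketch
{
    // Total bytes, assuming every upload is RGBA8 (4 bytes per texel).
    public static long TotalBytes(IEnumerable<SurfaceEntry> entries) =>
        entries.Sum(e => (long)e.Width * e.Height * 4);

    // Count per dimension bucket, keyed as "WIDTHxHEIGHT".
    public static Dictionary<string, int> CountByDimension(IEnumerable<SurfaceEntry> entries) =>
        entries.GroupBy(e => $"{e.Width}x{e.Height}")
               .ToDictionary(g => g.Key, g => g.Count());

    // Count per source format, e.g. "DXT1".
    public static Dictionary<string, int> CountByFormat(IEnumerable<SurfaceEntry> entries) =>
        entries.GroupBy(e => e.Format)
               .ToDictionary(g => g.Key, g => g.Count());

    // Top `n` most-shared (width, height, format) triples, most frequent
    // first; ties are broken by name so the output is stable across runs.
    public static List<(string Triple, int Count)> TopTriples(IEnumerable<SurfaceEntry> entries, int n = 10) =>
        entries.GroupBy(e => $"{e.Width}x{e.Height} {e.Format}")
               .Select(g => (Triple: g.Key, Count: g.Count()))
               .OrderByDescending(t => t.Count)
               .ThenBy(t => t.Triple, StringComparer.Ordinal)
               .Take(n)
               .ToList();
}
```

Keying buckets by a formatted string keeps the sketch short; a tuple key would avoid the string allocations, which only matters if the dump ever runs more than once per session.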

Cost when off

Negligible: one Dictionary<uint, …> write per UploadRgba8/UploadRgba8AsLayer1Array call (the _uploadMetadata insertion is unconditional so the dump path doesn't have to query GL state when it does fire). At Holtburg with 760 textures that's ~30-50 KB of process memory and one hash-table write per upload: invisible at runtime, no GC pressure. The expensive work (file I/O, histogram construction) is gated by the env-var check inside TickSurfaceHistogramDumpIfEnabled and only runs when ACDREAM_DUMP_SURFACES=1.


§6. Change 3 — Baseline document

File

docs/plans/2026-05-11-phase-n6-perf-baseline.md.

Setup section

  • Hardware: Radeon RX 9070 XT (the user's machine).
  • Resolution: 1440p.
  • Quality preset: High (default).
  • Connection: live ACE at 127.0.0.1:9000, character +Acdream at Holtburg.
  • Sky: clear midday, controlled via F7 to remove weather noise.
  • Build: Debug (matches the user's normal launch).
  • Date measured: 2026-05-11.

Measurements

Three radii: 4, 8, 12. Two motion modes per radius: standstill (camera anchored 30 s) and walking (+Acdream walks N→E→S→W across one landblock, 30 s).

Per radius/mode, capture from [WB-DIAG] and the window title:

  • CPU dispatcher: cpu_us median, p95.
  • GPU dispatcher: gpu_us median, p95 (now real).
  • FPS.
  • Entities seen / drawn.
  • Groups.
  • Frame time (window title).

Memory snapshot

One-time output from the ACDREAM_DUMP_SURFACES=1 run, summarized:

  • Total surfaces in cache.
  • Total GPU texture bytes.
  • Dimension distribution (top 10 by count).
  • Format distribution.
  • Atlas-opportunity score: percentage of surfaces in the top-3 dimension buckets.
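
The atlas-opportunity score is a single ratio, sketched below so the definition is unambiguous. `AtlasOpportunitySketch` is a hypothetical helper; the baseline document may compute the figure by hand from the dump file instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AtlasOpportunitySketch
{
    // Percentage (0..100) of surfaces whose (width, height) pair falls in
    // the `topBuckets` most populated dimension buckets.
    public static double Score(IEnumerable<(int Width, int Height)> surfaces, int topBuckets = 3)
    {
        var dims = surfaces.ToList();
        if (dims.Count == 0) return 0.0;

        int inTop = dims.GroupBy(d => d)              // one group per distinct size
                        .Select(g => g.Count())
                        .OrderByDescending(c => c)
                        .Take(topBuckets)
                        .Sum();
        return 100.0 * inTop / dims.Count;
    }
}
```

For ten surfaces split 4 / 3 / 2 / 1 across four distinct sizes, the top three buckets hold nine of them and the score is 90. A high score means a few texture arrays could hold most surfaces; a low score suggests consolidation would need per-size special cases.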

Conclusion section

A recommendation paragraph addressing:

  1. Is the entity dispatcher CPU-bound or GPU-bound at radius=12?
  2. Does gpu_us p95 leave headroom or is the GPU saturated?
  3. Does the atlas-opportunity score justify slice-2 atlas work?
  4. Given (1)-(3), what should the next phase be? Slice 2 (perf cleanup), C.1.5 (PES emitter wiring), or escalation to Tier 2 (static/dynamic split)?

The paragraph is opinionated — the next phase decision should be obvious from the numbers, not require a separate debate.


§7. Test plan

Automated tests (none new)

This slice is intentionally test-light:

  • The GPU-timing fix has no observable behavior in tests — it only changes a diagnostic readout. No new unit tests.
  • The surface-dump path is env-gated diag; no need to lock its output format in tests.
  • Existing 1688 tests must remain green. WbDrawDispatcher tests (bucketing, indirect-command construction, classification cache) must not be perturbed.

Manual verification

  1. Launch live with ACDREAM_WB_DIAG=1. Walk Holtburg for ~30 s. Confirm [WB-DIAG] prints gpu_us=Xm/Yp95 with X > 0 within ~5 s.
  2. Launch live with ACDREAM_DUMP_SURFACES=1 ACDREAM_WB_DIAG=1. Wait ~10 s for streaming to settle. Open %LOCALAPPDATA%\acdream\n6-surfaces.txt. Confirm it contains a non-empty histogram.
  3. Run the baseline measurement procedure end-to-end. Confirm the document populates with real numbers, not placeholders.

§8. Sequencing / ship gates

Commit 1 — GPU query fix

Message: feat(perf): Phase N.6 slice 1 — fix gpu_us double-buffering in WbDrawDispatcher

Scope: WbDrawDispatcher.cs changes only. Build green, tests green, manual verification step 1 from §7 passes.

Gate: if gpu_us still reports 0 after ~10 s of movement, do NOT proceed to commit 2. Bump ring depth to 4 or investigate driver behavior before continuing.

Commit 2 — Baseline doc + surface dump

Message: docs(perf): Phase N.6 slice 1 — radius=12 baseline + surface dump path

Scope: TextureCache.cs dump method, GameWindow.cs hook, docs/plans/2026-05-11-phase-n6-perf-baseline.md, and the roadmap amendment at docs/plans/2026-04-11-roadmap.md lines 690-705 (split N.6 into slice 1 / slice 2 in the bullet list).

Gate: manual verification steps 2 and 3 from §7 pass; baseline document's conclusion paragraph is filled in (not "TBD"); roadmap update lands in the same commit.


§9. Acceptance criteria

  1. [WB-DIAG] reports non-zero gpu_us for the entity dispatcher's opaque+transparent passes at Holtburg radius=12 with ACDREAM_WB_DIAG=1.
  2. The fix uses only core OpenGL 3.3+ features (GL_TIME_ELAPSED, glGetQueryObject, GL_QUERY_RESULT_AVAILABLE). No vendor-specific extensions.
  3. docs/plans/2026-05-11-phase-n6-perf-baseline.md exists, contains numbers (not placeholders) for the 3 radii × 2 motion modes, contains the surface histogram summary, and closes with a recommendation paragraph.
  4. The roadmap entry at docs/plans/2026-04-11-roadmap.md:690-705 is amended to reflect the slice split.
  5. dotnet build succeeds with no new warnings.
  6. dotnet test succeeds with the existing pass/fail baseline (1688 passing, ~8 pre-existing physics/input failures unchanged).
  7. No visible regression in the rendering path — Holtburg outdoor, day/night cycle, entity rendering, transparent surfaces all look the same as before the change.

§10. Risks

| Risk | Likelihood | Mitigation |
|---|---|---|
| ResultAvailable is 0 even for frame N-3 (driver queues 4+ frames ahead) | Low (would be unusual on desktop GL) | Sample is dropped silently; diagnostic prints zeros; user reports it. Fix: bump GpuQueryRingDepth to 4. No regression in the render path itself. |
| Query-pair allocation leaks across init/Dispose cycles | Low | Dispose loop deletes the full ring; existing pattern just gains an array index. |
| Surface-dump path fires before streaming settles, gets a sparse picture | Medium | Document the procedure as "wait ~10 s after entering world before reading the file." The dump path itself can also be made re-runnable if needed (deferred unless slice 1 hits this in practice). |
| Conclusion paragraph in the baseline document is hard to write because the numbers don't clearly favor one direction | Medium (this is the slice's whole purpose) | Acknowledge the ambiguity in the document and propose a "slice 1 conclusion plus a short re-brainstorm with the user" flow. The slice still ships if the numbers force a re-brainstorm; the value is in having the numbers, not in pre-deciding the answer. |
| Hidden vendor-specific behavior in GL_TIME_ELAPSED produces non-comparable numbers across hardware | Low (GL_TIME_ELAPSED results are specified in nanoseconds, though actual timer resolution is implementation-dependent) | Document the measurement hardware explicitly in the baseline doc setup section so future runs on different GPUs can be tagged appropriately. |

§11. Out of scope / future work

These are explicitly NOT in slice 1, listed here so the next phase has a clean shopping list:

  • Slice 2 — TextureCache cleanup. Delete orphan mesh.frag (verify zero callers post-N.5 amendment). Delete dead entity-style legacy caches (_handlesByOverridden, _handlesByPalette) that no live renderer reads. Decide on bindless-everywhere vs legacy-island for the remaining sampler2D consumers (sky, UI text, particles).
  • Slice 2 — Particle shader migration. Tied to C.1.5 outcome; particles migrate after C.1.5 lands more visible content to regression-test against.
  • Slice 2 — Persistent-mapped buffers. Conditional on slice 1's baseline showing BufferSubData as a hot spot.
  • Slice 2 — WB atlas adoption. Conditional on slice 1's surface histogram showing a real opportunity.
  • C.1.5 — PES emitter wiring. Portals, chimneys, fireplaces. Separate phase; gets its own brainstorm/spec.
  • Tier 2 — static/dynamic split with persistent groups. Separate roadmap at docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md.
  • Tier 3 — GPU compute culling. Depends on Tier 2 first. Same roadmap.
  • Cross-vendor perf comparison. Slice 1 is one machine. An NVIDIA companion run is a backlog item, not in scope.

§12. References