acdream/docs/superpowers/plans/2026-05-11-phase-n6-slice1.md
Erik a4931eeaa2 docs(perf): Phase N.6 slice 1 — implementation plan
Step-by-step plan for the two-commit slice: fix WbDrawDispatcher's
gpu_us double-buffering bug (ring-of-3 query slots, read-before-overwrite,
vendor-neutral) then capture the radius=12 baseline at Holtburg with
the now-working diagnostic. Includes exact old_string/new_string Edit
patterns for every code change, PowerShell launch + measurement
procedure for the manual baseline, baseline doc template with explicit
fill-in slots, and a per-criterion acceptance checklist.

Output companion to docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md
(commit 05d590c).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 11:12:26 +02:00


Phase N.6 slice 1 Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Fix the broken gpu_us diagnostic in WbDrawDispatcher (vendor-neutral OpenGL query ring) and produce one authoritative perf baseline document at Holtburg radius=12 so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) is grounded in real numbers.

Architecture: Two commits. Commit 1 changes only WbDrawDispatcher.cs — replaces the two uint GL query handles with ring-of-3 arrays and moves the result read to before the next frame overwrites the slot (read frame N-3's queries, then overwrite). Commit 2 adds an env-gated surface-format histogram dump in TextureCache.cs, captures the actual measurement, writes the baseline doc, and amends the roadmap entry. No new automated tests — the GPU-timing fix has no observable behavior in tests, and the dump path is env-gated diagnostic only; verification is manual launch-and-look.

Tech Stack: C# / .NET 10, Silk.NET (OpenGL 4.3+), dotnet build / dotnet test from PowerShell, live ACE on 127.0.0.1:9000 for in-world verification.

Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (committed at 05d590c).


File Structure

| File | Action | Responsibility |
|---|---|---|
| src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs | Modify | Replace 2 uint query handles with ring-of-3 arrays; move query result read to before next-frame overwrite. |
| src/AcDream.App/Rendering/TextureCache.cs | Modify | Add upload-time dimension/format tracking + env-gated TickSurfaceHistogramDumpIfEnabled() method that fires once at frame 600. |
| src/AcDream.App/Rendering/GameWindow.cs | Modify | Call _textureCache.TickSurfaceHistogramDumpIfEnabled() once per frame in OnRender. |
| docs/plans/2026-05-11-phase-n6-perf-baseline.md | Create | Baseline measurement doc: setup, numbers at radii 4/8/12 (standstill + walking), surface histogram summary, conclusion paragraph recommending next phase. |
| docs/plans/2026-04-11-roadmap.md lines 690-705 | Modify | Amend N.6 entry to reflect the slice 1 / slice 2 split. |

Task 1: GPU query ring buffering (commit 1)

Files:

  • Modify: src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs

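The six edit zones (Steps 1.1 through 1.6) are well-isolated by exact strings. Apply them in the order given. Intermediate states do not compile (Step 1.1 turns the two handles into arrays while the later call sites still treat them as scalars), so build only at Step 1.7; following the documented order also keeps the resulting diff easy to review.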

  • Step 1.1: Replace the field declarations (~line 155)

Use Edit to replace the existing field block:

old_string:

    private uint _gpuQueryOpaque;
    private uint _gpuQueryTransparent;
    private readonly long[] _gpuSamples = new long[256];   // microseconds
    private int _gpuSampleCursor;
    private bool _gpuQueriesInitialized;

new_string:

    // GPU timing uses a ring of 3 query-pair slots so the read of frame N-3's
    // result lands when the GPU has finished (~50ms after issue on a typical
    // 60fps frame). Ring of 3 is the vendor-neutral choice: NVIDIA drivers with
    // triple-buffering+vsync can queue ~3 frames ahead, AMD typically 1-2,
    // Intel iGPUs vary. ResultAvailable is the safety guard if the GPU is
    // still working when we try to read.
    private const int GpuQueryRingDepth = 3;
    private readonly uint[] _gpuQueryOpaque      = new uint[GpuQueryRingDepth];
    private readonly uint[] _gpuQueryTransparent = new uint[GpuQueryRingDepth];
    private int _gpuQueryFrameIndex;
    private readonly long[] _gpuSamples = new long[256];   // microseconds
    private int _gpuSampleCursor;
    private bool _gpuQueriesInitialized;
  • Step 1.2: Replace the init block (~line 347)

old_string:

        if (diag && !_gpuQueriesInitialized)
        {
            _gpuQueryOpaque      = _gl.GenQuery();
            _gpuQueryTransparent = _gl.GenQuery();
            _gpuQueriesInitialized = true;
        }

new_string:

        if (diag && !_gpuQueriesInitialized)
        {
            for (int i = 0; i < GpuQueryRingDepth; i++)
            {
                _gpuQueryOpaque[i]      = _gl.GenQuery();
                _gpuQueryTransparent[i] = _gl.GenQuery();
            }
            _gpuQueriesInitialized = true;
        }
  • Step 1.3: Insert the read-before-overwrite block + compute slot just before the opaque query begin (~line 774)

This step replaces the existing single-line BeginQuery for opaque with a block that first computes the slot, reads the slot's frame N-3 result (gated on having completed one ring), then issues the new query into the same slot.

old_string:

            _gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer);
            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque);

new_string:

            _gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer);

            // GPU timing: compute this frame's ring slot. We read frame N-3's
            // result (the oldest data in the ring) before overwriting it with
            // frame N's queries. See spec §3 Q1/Q2 + §4 in
            // docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md.
            int gpuQuerySlot = _gpuQueryFrameIndex % GpuQueryRingDepth;
            if (_gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth)
            {
                _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot], QueryObjectParameterName.ResultAvailable, out int avail);
                if (avail != 0)
                {
                    _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot],      QueryObjectParameterName.Result, out ulong opaqueNs);
                    _gl.GetQueryObject(_gpuQueryTransparent[gpuQuerySlot], QueryObjectParameterName.Result, out ulong transNs);
                    long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
                    _gpuSamples[_gpuSampleCursor] = gpuUs;
                    _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
                }
                // If avail==0 the sample is dropped silently. MedianMicros
                // computes over the non-zero subset, so dropped samples don't
                // poison the median.
            }

            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque[gpuQuerySlot]);
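The slot arithmetic in Step 1.3 can be checked without a GL context. The following is a minimal sketch, not the dispatcher's code: `slotIssuedAtFrame` and `readLog` are hypothetical stand-ins for the query handles and the sample buffer, and the "query" is reduced to remembering which frame last wrote each slot.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the GL query ring: each slot remembers which
// frame last issued a query into it (-1 means never written).
const int GpuQueryRingDepth = 3;
int[] slotIssuedAtFrame = { -1, -1, -1 };
var readLog = new List<(int Frame, int Slot, int ReadsFrame)>();

for (int frameIndex = 0; frameIndex < 8; frameIndex++)
{
    int slot = frameIndex % GpuQueryRingDepth;

    // Read-before-overwrite: only once the ring has wrapped at least once.
    if (frameIndex >= GpuQueryRingDepth)
        readLog.Add((frameIndex, slot, slotIssuedAtFrame[slot]));

    // "Issue" this frame's query into the same slot.
    slotIssuedAtFrame[slot] = frameIndex;
}

foreach (var (frame, slot, readsFrame) in readLog)
    Console.WriteLine($"frame {frame}: slot {slot} holds frame {readsFrame} (age {frame - readsFrame})");
```

Every logged read has an age of exactly 3 frames, which is the property the `_gpuQueryFrameIndex >= GpuQueryRingDepth` gate is meant to guarantee before the slot is reused.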
  • Step 1.4: Update the transparent query begin to use the same slot (~line 823)

old_string:

            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent);

new_string:

            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent[gpuQuerySlot]);
  • Step 1.5: Replace the buggy in-frame read block + increment frame counter (~line 849)

old_string:

            // Read GPU samples non-blocking; the result for the previous frame's
            // queries should be ready by now. If not, drop the sample (don't stall
            // the CPU waiting for the GPU).
            if (_gpuQueriesInitialized)
            {
                _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.ResultAvailable, out int avail);
                if (avail != 0)
                {
                    _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.Result, out ulong opaqueNs);
                    _gl.GetQueryObject(_gpuQueryTransparent, QueryObjectParameterName.Result, out ulong transNs);
                    long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
                    _gpuSamples[_gpuSampleCursor] = gpuUs;
                    _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
                }
            }

            _drawsIssued     += _opaqueDrawCount + _transparentDrawCount;

new_string:

            // GPU sample read happens BEFORE issuing the next frame's queries
            // (see step 1.3 above). Increment the frame counter here so the
            // next call computes a fresh slot.
            if (_gpuQueriesInitialized) _gpuQueryFrameIndex++;

            _drawsIssued     += _opaqueDrawCount + _transparentDrawCount;
  • Step 1.6: Update Dispose to delete the full ring (~line 1140)

old_string:

        if (_gpuQueriesInitialized)
        {
            _gl.DeleteQuery(_gpuQueryOpaque);
            _gl.DeleteQuery(_gpuQueryTransparent);
        }

new_string:

        if (_gpuQueriesInitialized)
        {
            for (int i = 0; i < GpuQueryRingDepth; i++)
            {
                _gl.DeleteQuery(_gpuQueryOpaque[i]);
                _gl.DeleteQuery(_gpuQueryTransparent[i]);
            }
        }
  • Step 1.7: Build

Run from the worktree root:

dotnet build

Expected: build succeeds with no new warnings or errors. If the build fails, the most likely cause is a missed string in one of the steps above — re-grep _gpuQueryOpaque and _gpuQueryTransparent in WbDrawDispatcher.cs and confirm every reference uses the array-indexed form [gpuQuerySlot] or [i].

  • Step 1.8: Run the test suite
dotnet test --no-build

Expected: same pass/fail baseline as before the change (~1688 passing, ~8 pre-existing physics/input failures unchanged). No new failures.

  • Step 1.9: Manual verification — launch live and confirm gpu_us reports non-zero
$env:ACDREAM_DAT_DIR   = "$env:USERPROFILE\Documents\Asheron's Call"
$env:ACDREAM_LIVE      = "1"
$env:ACDREAM_TEST_HOST = "127.0.0.1"
$env:ACDREAM_TEST_PORT = "9000"
$env:ACDREAM_TEST_USER = "testaccount"
$env:ACDREAM_TEST_PASS = "testpassword"
$env:ACDREAM_WB_DIAG   = "1"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "task1-verify.log"

In-world: walk Holtburg for ~30 seconds. Close the window when done.

Verification check on task1-verify.log:

Select-String -Path task1-verify.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 5

Expected output: at least one [WB-DIAG] line where gpu_us=Xm/Yp95 has X > 0 (typically tens to low-hundreds of microseconds at radius=4-12 on a modern GPU). If gpu_us=0m/0p95 persists for the entire run, the fix didn't take — check whether the build actually rebuilt (try dotnet build -c Debug then re-launch).
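To make the `Xm/Yp95` reading concrete, here is a minimal sketch of a median and p95 taken over only the non-zero entries of a sample buffer. It is an assumption about the general technique, not the actual `MedianMicros` implementation (which this plan does not show); `PercentileMicros` and the sample values are hypothetical, and nearest-rank is just one reasonable percentile definition.

```csharp
using System;
using System.Linq;

// Hypothetical helper: nearest-rank percentile over the non-zero samples.
// Zero entries are treated as "slot never written" and skipped.
static long PercentileMicros(long[] samples, double percentile)
{
    long[] valid = samples.Where(s => s > 0).OrderBy(s => s).ToArray();
    if (valid.Length == 0) return 0;
    int rank = (int)Math.Ceiling(percentile / 100.0 * valid.Length);
    return valid[Math.Clamp(rank, 1, valid.Length) - 1];
}

// A 256-entry buffer where only 5 samples have landed so far.
long[] gpuSamples = new long[256];
long[] landed = { 120, 80, 100, 400, 90 };
Array.Copy(landed, gpuSamples, landed.Length);

long median = PercentileMicros(gpuSamples, 50);
long p95 = PercentileMicros(gpuSamples, 95);
Console.WriteLine($"gpu_us={median}m/{p95}p95");
```

With a scheme like this, a buffer that never receives a sample reports 0 for both values, which matches the stuck `gpu_us=0m/0p95` symptom the fix targets.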

Also confirm: no visible regression in the client. Entities render, animations play, sky cycles. Close the client cleanly.

  • Step 1.10: Commit
git add src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs
git commit -m @'
feat(perf): Phase N.6 slice 1 — fix gpu_us double-buffering in WbDrawDispatcher

The dispatcher's GPU TimeElapsed queries were polled in the same frame
as the indirect draw, so glGetQueryObject(ResultAvailable) always
returned 0 and gpu_us in [WB-DIAG] was stuck at 0m/0p95.

Replace the 2 single-handle queries with ring-of-3 arrays and move the
result read to BEFORE issuing the next frame's queries into the same
slot — at frame N we read slot N%3 which holds frame N-3's queries
(oldest in the ring, ~50ms old at 60fps, so normally complete on desktop
GL drivers; the ResultAvailable check drops the sample otherwise).
Vendor-neutral: AMD/NVIDIA/Intel desktop GL all work without
driver-specific code.

No new tests — the change is purely a diagnostic readout fix, no
observable behavior in the rendering path. Manual verification:
[WB-DIAG] now reports non-zero gpu_us at Holtburg radius=12.

Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (§4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
'@
git status

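Expected: no modified tracked files after the commit. The scratch task1-verify.log from Step 1.9 may still show as untracked unless it is ignored; leave it uncommitted. Note the new commit SHA; the baseline doc's "Measured against commit" field needs it.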


Task 2: Surface-format histogram dump path (part of commit 2 setup)

Files:

  • Modify: src/AcDream.App/Rendering/TextureCache.cs
  • Modify: src/AcDream.App/Rendering/GameWindow.cs

This task adds the env-gated one-shot dump infrastructure. It does NOT commit — the commit happens in Task 4 after the baseline document is also ready.

  • Step 2.1: Add upload-time metadata tracking in TextureCache.cs

Add a new private dictionary that records (width, height, formatLabel) keyed by GL texture name. This lets DumpSurfaceHistogram emit dimension/format data without re-querying GL.

Use Edit to insert the field right after the existing bindless cache fields (~line 41, just after _bindlessByPalette):

old_string:

    private readonly Dictionary<(uint surfaceId, uint origTexOverride, ulong paletteHash), (uint Name, ulong Handle)> _bindlessByPalette = new();

    public TextureCache(GL gl, DatCollection dats, Wb.BindlessSupport? bindless = null)

new_string:

    private readonly Dictionary<(uint surfaceId, uint origTexOverride, ulong paletteHash), (uint Name, ulong Handle)> _bindlessByPalette = new();

    // Phase N.6 slice 1 (2026-05-11): per-upload metadata for the
    // ACDREAM_DUMP_SURFACES=1 histogram dump path. Populated at upload
    // time so the dump method doesn't have to query GL state. Keyed by
    // GL texture name (same key used in cache value tuples). Format
    // label is "RGBA8_DECODED" for the post-decode upload (all uploads
    // currently land as RGBA8 regardless of source format).
    private readonly Dictionary<uint, (int Width, int Height, string Format)> _uploadMetadata = new();

    // Frame counter for the one-shot ACDREAM_DUMP_SURFACES=1 trigger.
    // Increments per Tick call; fires the dump once at frame index 600
    // and never again for the session. See spec §5.
    private int _dumpFrameCounter;
    private bool _surfaceHistogramAlreadyDumped;

    public TextureCache(GL gl, DatCollection dats, Wb.BindlessSupport? bindless = null)
  • Step 2.2: Find the UploadRgba8AsLayer1Array method and record metadata there

Locate the method using Grep:

pattern: "UploadRgba8AsLayer1Array"
path: src/AcDream.App/Rendering/TextureCache.cs
output_mode: content
-n: true

Read the method body (typically ~30-50 lines) to find each `return name;` statement. The decoded texture exposes decoded.Width, decoded.Height, and decoded.Rgba8.

For each `return name;` statement in UploadRgba8AsLayer1Array(DecodedTexture decoded), insert this line immediately before it:

        _uploadMetadata[name] = (decoded.Width, decoded.Height, "RGBA8_DECODED");

If the method has only one return name; near its end, that's a single Edit. Use the surrounding 2-3 lines of context in old_string to make the Edit unique.

  • Step 2.3: Also record metadata in the legacy UploadRgba8 (non-bindless) path

Locate the method:

pattern: "private uint UploadRgba8\b"
path: src/AcDream.App/Rendering/TextureCache.cs
output_mode: content
-n: true

Apply the same _uploadMetadata[name] = (decoded.Width, decoded.Height, "RGBA8_DECODED"); insertion before each return name; in UploadRgba8(DecodedTexture decoded). This ensures the dump captures both legacy and modern uploads.

  • Step 2.4: Add the TickSurfaceHistogramDumpIfEnabled public method to TextureCache.cs

Locate HashPaletteOverride using Grep:

pattern: "internal static ulong HashPaletteOverride"
path: src/AcDream.App/Rendering/TextureCache.cs
output_mode: content
-n: true
-A: 20

Identify its closing brace. Use Edit with surrounding context to insert the new methods immediately after.

old_string: (the last few lines of HashPaletteOverride):

        foreach (var sp in p.SubPalettes)
        {
            h = (h ^ sp.SubPaletteId) * prime;
            h = (h ^ sp.Offset) * prime;
            h = (h ^ sp.Length) * prime;
        }
        return h;
    }

new_string:

        foreach (var sp in p.SubPalettes)
        {
            h = (h ^ sp.SubPaletteId) * prime;
            h = (h ^ sp.Offset) * prime;
            h = (h ^ sp.Length) * prime;
        }
        return h;
    }

    /// <summary>
    /// Phase N.6 slice 1: one-shot surface-format histogram dump for the
    /// atlas-opportunity audit. Activated by ACDREAM_DUMP_SURFACES=1; fires
    /// once at frame 600 of the session (~10s at 60fps, ~3s at 200fps —
    /// both well past streaming settle at radius≤12). Output goes to
    /// %LOCALAPPDATA%\acdream\n6-surfaces.txt. When off, the only cost is
    /// one environment-variable lookup per call.
    /// See spec §5 in docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md.
    /// </summary>
    public void TickSurfaceHistogramDumpIfEnabled()
    {
        if (_surfaceHistogramAlreadyDumped) return;
        if (!string.Equals(Environment.GetEnvironmentVariable("ACDREAM_DUMP_SURFACES"), "1", StringComparison.Ordinal)) return;
        _dumpFrameCounter++;
        if (_dumpFrameCounter < 600) return;

        DumpSurfaceHistogram();
        _surfaceHistogramAlreadyDumped = true;
    }

    private void DumpSurfaceHistogram()
    {
        var localAppData = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData);
        var outDir = System.IO.Path.Combine(localAppData, "acdream");
        System.IO.Directory.CreateDirectory(outDir);
        var outPath = System.IO.Path.Combine(outDir, "n6-surfaces.txt");

        var sb = new System.Text.StringBuilder();
        sb.AppendLine($"# acdream surface-format histogram — generated {DateTime.UtcNow:yyyy-MM-ddTHH:mm:ssZ}");
        sb.AppendLine("# Per-entry: surfaceId(hex), width, height, format, byteCount");
        sb.AppendLine();

        // Walk every cached entry across the 6 caches, dedupe by GL name.
        var seen = new HashSet<uint>();
        long totalBytes = 0;
        var bucketsByDim = new Dictionary<(int W, int H), int>();
        var bucketsByFormat = new Dictionary<string, int>();
        var bucketsByTriple = new Dictionary<(int W, int H, string F), int>();

        void Emit(uint surfaceId, uint name)
        {
            if (!seen.Add(name)) return;
            if (!_uploadMetadata.TryGetValue(name, out var meta)) return;
            int bytes = meta.Width * meta.Height * 4;
            totalBytes += bytes;
            sb.AppendLine($"0x{surfaceId:X8}, {meta.Width}, {meta.Height}, {meta.Format}, {bytes}");

            var dimKey = (meta.Width, meta.Height);
            bucketsByDim[dimKey] = bucketsByDim.GetValueOrDefault(dimKey) + 1;
            bucketsByFormat[meta.Format] = bucketsByFormat.GetValueOrDefault(meta.Format) + 1;
            var tripleKey = (meta.Width, meta.Height, meta.Format);
            bucketsByTriple[tripleKey] = bucketsByTriple.GetValueOrDefault(tripleKey) + 1;
        }

        foreach (var kv in _handlesBySurfaceId)         Emit(kv.Key, kv.Value);
        foreach (var kv in _handlesByOverridden)        Emit(kv.Key.surfaceId, kv.Value);
        foreach (var kv in _handlesByPalette)           Emit(kv.Key.surfaceId, kv.Value);
        foreach (var kv in _bindlessBySurfaceId)        Emit(kv.Key, kv.Value.Name);
        foreach (var kv in _bindlessByOverridden)       Emit(kv.Key.surfaceId, kv.Value.Name);
        foreach (var kv in _bindlessByPalette)          Emit(kv.Key.surfaceId, kv.Value.Name);

        sb.AppendLine();
        sb.AppendLine("# Rollups");
        sb.AppendLine($"# Total unique GL textures: {seen.Count}");
        sb.AppendLine($"# Total bytes (sum of W*H*4): {totalBytes}");

        sb.AppendLine("# Top 10 (W,H) dimension buckets:");
        foreach (var kv in bucketsByDim.OrderByDescending(kv => kv.Value).Take(10))
            sb.AppendLine($"#   {kv.Key.W}x{kv.Key.H}: {kv.Value}");

        sb.AppendLine("# Format buckets:");
        foreach (var kv in bucketsByFormat.OrderByDescending(kv => kv.Value))
            sb.AppendLine($"#   {kv.Key}: {kv.Value}");

        sb.AppendLine("# Top 10 (W,H,format) triples — atlas-opportunity input:");
        foreach (var kv in bucketsByTriple.OrderByDescending(kv => kv.Value).Take(10))
            sb.AppendLine($"#   {kv.Key.W}x{kv.Key.H} {kv.Key.F}: {kv.Value}");

        System.IO.File.WriteAllText(outPath, sb.ToString());
        Console.WriteLine($"[N6-DUMP] Surface histogram written to {outPath} ({seen.Count} textures, {totalBytes} bytes)");
    }
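The dedupe-by-GL-name step in `Emit` matters because several cache dictionaries can point at the same texture. The sketch below illustrates that behaviour in isolation under stated assumptions: `cacheA`, `cacheB`, the surface IDs, and the GL names are all made-up sample data, not values from the client.

```csharp
using System;
using System.Collections.Generic;

// Made-up data: two cache views that both reference GL texture name 7.
var uploadMetadata = new Dictionary<uint, (int Width, int Height, string Format)>
{
    [7] = (256, 256, "RGBA8_DECODED"),
    [9] = (64, 64, "RGBA8_DECODED"),
};
var cacheA = new Dictionary<uint, uint> { [0x06001000] = 7, [0x06001001] = 9 };
var cacheB = new Dictionary<uint, uint> { [0x06001000] = 7 };

var seen = new HashSet<uint>();
long totalBytes = 0;

void Emit(uint name)
{
    if (!seen.Add(name)) return;                                  // already counted
    if (!uploadMetadata.TryGetValue(name, out var meta)) return;  // no metadata recorded
    totalBytes += (long)meta.Width * meta.Height * 4;             // RGBA8 is 4 bytes per texel
}

foreach (var kv in cacheA) Emit(kv.Value);
foreach (var kv in cacheB) Emit(kv.Value);

Console.WriteLine($"{seen.Count} unique textures, {totalBytes} bytes");
```

Texture 7 appears in both caches but contributes to the byte total once, so the rollup reflects distinct GL textures rather than cache entries.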
  • Step 2.5: Confirm using System.Linq; is present in TextureCache.cs

Read the file's using section (top of file). If using System.Linq; is NOT present, add it. The OrderByDescending and Take calls in DumpSurfaceHistogram need it.

Pattern:

pattern: "^using System\.Linq"
path: src/AcDream.App/Rendering/TextureCache.cs
output_mode: count

If count is 0, add using System.Linq; in alphabetical order with the other usings at the top of the file.

  • Step 2.6: Add the per-frame call site in GameWindow.cs

Find a stable insertion point near the top of OnRender (starts at line 6288). Use Grep:

pattern: "_gl!\.Clear\("
path: src/AcDream.App/Rendering/GameWindow.cs
output_mode: content
-n: true
-A: 3

This finds the Clear call(s) in or near OnRender. The first one after line 6288 is where you want to insert. Read 5 lines of context around it, then Edit to insert the dump tick on the line immediately after the Clear call returns:

The insertion (one Edit):

old_string: (find the Clear call in OnRender and capture 1-2 lines of its context — varies; common pattern is _gl!.Clear(ClearBufferMask.ColorBufferBit | ClearBufferMask.DepthBufferBit); followed by the next line of OnRender work).

new_string: the same Clear call followed by:


        // Phase N.6 slice 1: one-shot surface-format histogram dump under
        // ACDREAM_DUMP_SURFACES=1. When off, costs one env-var lookup per frame.
        _textureCache?.TickSurfaceHistogramDumpIfEnabled();

If OnRender has multiple Clear calls, place the tick after the first one inside the method body. The call must run exactly once per frame, before any rendering work — placing it right after Clear accomplishes both.
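Because the tick is called every frame, the one-shot guard is what keeps the dump from firing repeatedly. A minimal sketch of that gating pattern follows; the names are hypothetical stand-ins for the `TextureCache` fields, and the `enabled` flag is passed in here where the plan's method reads `ACDREAM_DUMP_SURFACES` instead.

```csharp
using System;

// Hypothetical stand-ins for the TextureCache fields and the dump itself.
const int DumpAtFrame = 600;
int dumpFrameCounter = 0;
bool alreadyDumped = false;
int dumpsFired = 0;
int firedOnCall = -1;

void Tick(bool enabled, int callNumber)
{
    if (alreadyDumped) return;
    if (!enabled) return;
    dumpFrameCounter++;
    if (dumpFrameCounter < DumpAtFrame) return;

    // Stand-in for DumpSurfaceHistogram().
    dumpsFired++;
    firedOnCall = callNumber;
    alreadyDumped = true;
}

// Simulate 2000 frames with the gate enabled.
for (int call = 1; call <= 2000; call++)
    Tick(enabled: true, callNumber: call);

Console.WriteLine($"dumps fired: {dumpsFired}, on call #{firedOnCall}");
```

The counter only advances while the gate is enabled, so the dump lands on the 600th enabled call and never again in the session.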

  • Step 2.7: Build
dotnet build

Expected: build succeeds with no new warnings. If the compiler reports CS1061 (the dictionary type "does not contain a definition for 'OrderByDescending'"), Step 2.5 was missed: add using System.Linq; and rebuild.

  • Step 2.8: Run the test suite
dotnet test --no-build

Expected: same pass/fail baseline (~1688 passing, ~8 pre-existing failures). No new failures.

  • Step 2.9: Manual verification — confirm the dump file appears

Launch with the dump env var on:

$env:ACDREAM_DUMP_SURFACES = "1"
$env:ACDREAM_WB_DIAG = "1"
# Other env vars same as Task 1 Step 1.9
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "task2-verify.log"

Wait ~15 seconds after the window appears, then close it. Check the file:

Get-Content "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" | Select-Object -First 30

Expected: a non-empty file with the header, per-entry rows, and rollup sections. Also confirm exactly one [N6-DUMP] Surface histogram written to ... line in task2-verify.log. It is printed when the dump fires at frame 600, so after a ~15 second wait it sits near the end of the log.

If the file is empty or missing:

  • Check the launch log for the [N6-DUMP] line.
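  • If it's not there, either _dumpFrameCounter didn't reach 600 (the window was closed too early) or ACDREAM_DUMP_SURFACES was not exactly "1" in the launching shell. Check the variable, re-run, and wait longer.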
  • If it's there but the file lookup fails, the path output in the log should show what was actually written; investigate that path.

Do not commit yet. Continue to Task 3.


Task 3: Capture baseline measurements

Files:

  • Create: docs/plans/2026-05-11-phase-n6-perf-baseline.md (final content lands in Task 4 — this task just collects the numbers).

This is the manual measurement task. Each step launches the client, runs a specific scenario, and captures the diagnostic output. Save each log separately for the final write-up. Total expected time: ~30-45 min.

Setup once per session:

$env:ACDREAM_DAT_DIR   = "$env:USERPROFILE\Documents\Asheron's Call"
$env:ACDREAM_LIVE      = "1"
$env:ACDREAM_TEST_HOST = "127.0.0.1"
$env:ACDREAM_TEST_PORT = "9000"
$env:ACDREAM_TEST_USER = "testaccount"
$env:ACDREAM_TEST_PASS = "testpassword"
$env:ACDREAM_WB_DIAG   = "1"

For each measurement run, set ACDREAM_STREAM_RADIUS before launch. Use the QualityPreset=High default (no overrides). All runs at Holtburg with +Acdream at clear midday (cycle weather with F10 → Clear, time with F7 → Noon).

Per run, after ~30 seconds at the target condition, close the window and grep the log for the last 3 [WB-DIAG] lines — those have the steady-state numbers.

  • Step 3.1: Capture radius=4 standstill
$env:ACDREAM_STREAM_RADIUS = "4"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r4-stand.log"

In-world: enter world, do not move, hold position for 30 seconds. Close.

Select-String -Path baseline-r4-stand.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 3

For each of cpu_us, gpu_us, entSeen, entDrawn, and groups, record the median of the values on the last 3 lines. Also note the window-title FPS shown during the test.
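If reading the numbers by eye gets tedious, the "median of the last 3 lines" step could be scripted. This is a rough sketch only: the sample lines are invented, and the only parts taken from this plan are the `[WB-DIAG]` tag and the `gpu_us=Xm/Yp95` token shape. Real lines carry more fields, so the regex would need extending for cpu_us and the entity counts.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up sample lines; real [WB-DIAG] lines carry more fields.
string[] logLines =
{
    "[WB-DIAG] gpu_us=0m/0p95",
    "unrelated log output",
    "[WB-DIAG] gpu_us=150m/320p95",
    "[WB-DIAG] gpu_us=140m/300p95",
    "[WB-DIAG] gpu_us=160m/410p95",
};

var pattern = new Regex(@"gpu_us=(\d+)m/(\d+)p95");

// Keep the last 3 matching [WB-DIAG] lines, as the measurement steps do.
var lastThree = logLines
    .Where(l => l.Contains("[WB-DIAG]"))
    .Select(l => pattern.Match(l))
    .Where(m => m.Success)
    .TakeLast(3)
    .Select(m => (Median: long.Parse(m.Groups[1].Value), P95: long.Parse(m.Groups[2].Value)))
    .ToArray();

// Median of a small odd-sized set: sort and take the middle element.
static long MiddleOf(long[] values)
{
    long[] sorted = values.OrderBy(v => v).ToArray();
    return sorted[sorted.Length / 2];
}

long gpuMedian = MiddleOf(lastThree.Select(s => s.Median).ToArray());
long gpuP95 = MiddleOf(lastThree.Select(s => s.P95).ToArray());
Console.WriteLine($"gpu_us median={gpuMedian} p95={gpuP95}");
```

Taking the middle of three values discards a single outlier line in either direction, which is the point of recording a median rather than the final line alone.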

  • Step 3.2: Capture radius=4 walking
$env:ACDREAM_STREAM_RADIUS = "4"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r4-walk.log"

In-world: enter world, Tab to player mode, walk N→E→S→W across one landblock over ~30 seconds. Close.

Capture same numbers as 3.1.

  • Step 3.3: Capture radius=8 standstill
$env:ACDREAM_STREAM_RADIUS = "8"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r8-stand.log"

Same procedure as 3.1. Wait ~40 seconds before recording (streaming takes longer to settle).

  • Step 3.4: Capture radius=8 walking
$env:ACDREAM_STREAM_RADIUS = "8"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r8-walk.log"

Same procedure as 3.2.

  • Step 3.5: Capture radius=12 standstill
$env:ACDREAM_STREAM_RADIUS = "12"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r12-stand.log"

Same procedure as 3.1. Wait ~60 seconds before recording. This is the headline measurement: note whether gpu_us p95 sits well below the 60 fps frame budget (1000/60 ≈ 16.7 ms) or is pushing it.
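As a quick way to label a radius=12 reading, the headroom bands from the baseline doc's conclusion checklist can be written as a small function. This is a sketch under assumptions: the 8000 µs and 14000 µs thresholds come from that checklist, while `ClassifyGpuHeadroom`, the band labels, and the sample values are hypothetical.

```csharp
using System;

// Thresholds copied from the baseline doc's conclusion checklist
// (8000 and 14000 microseconds); the labels are shorthand for this sketch.
static string ClassifyGpuHeadroom(long gpuP95Micros)
{
    if (gpuP95Micros < 8000) return "comfortable";
    if (gpuP95Micros < 14000) return "tight";
    return "saturated";
}

// Invented readings, one per band.
foreach (long sample in new long[] { 350, 9500, 15200 })
    Console.WriteLine($"{sample} us -> {ClassifyGpuHeadroom(sample)}");
```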

  • Step 3.6: Capture radius=12 walking
$env:ACDREAM_STREAM_RADIUS = "12"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r12-walk.log"

Same procedure as 3.2 (walking across one landblock, ~30 seconds of motion within the 60s+ window).

  • Step 3.7: Capture the surface histogram
$env:ACDREAM_STREAM_RADIUS = "12"
$env:ACDREAM_DUMP_SURFACES = "1"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-surfaces.log"

In-world: enter world at Holtburg, do nothing for ~30 seconds (let the dump fire at frame 600). Close. Copy the file:

Copy-Item "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" -Destination "baseline-surfaces.txt"

Inspect:

Get-Content baseline-surfaces.txt | Select-Object -Last 40

Record the rollup section (total textures, total bytes, top 10 dimension buckets, format distribution, top 10 (W,H,format) triples).

  • Step 3.8: Clean up the env vars and the local app data dump
Remove-Item Env:\ACDREAM_DUMP_SURFACES -ErrorAction SilentlyContinue
Remove-Item Env:\ACDREAM_STREAM_RADIUS -ErrorAction SilentlyContinue
# Optional: clean up the source file so a future re-measurement isn't confused by stale data
Remove-Item "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" -ErrorAction SilentlyContinue

All log files (baseline-r*-*.log, baseline-surfaces.log, baseline-surfaces.txt) remain in the worktree root for Task 4. They will NOT be committed — they're scratch.


Task 4: Write baseline doc + amend roadmap + ship commit 2

Files:

  • Create: docs/plans/2026-05-11-phase-n6-perf-baseline.md

  • Modify: docs/plans/2026-04-11-roadmap.md lines 690-705

  • Step 4.1: Write the baseline document

Use Write to create docs/plans/2026-05-11-phase-n6-perf-baseline.md with this content (substitute real values from the Task 3 captures into every angle-bracket placeholder: <n>, <pct>, <W>x<H>, <format>, <count>, and the commit SHA; do NOT leave any unfilled):

# Phase N.6 slice 1 — perf baseline at Holtburg

**Created:** 2026-05-11.
**Spec:** [docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md](../superpowers/specs/2026-05-11-phase-n6-slice1-design.md)
**Measured against commit:** <commit SHA from Task 1.10>
**Purpose:** Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data.

---

## §1. Setup

- **Hardware:** Radeon RX 9070 XT
- **Resolution:** 1440p (2560×1440)
- **Quality preset:** High (default)
- **Connection:** live ACE at `127.0.0.1:9000`
- **Character:** `+Acdream` at Holtburg
- **Sky / time:** clear midday (F7 → Noon, F10 → Clear)
- **Build:** Debug
- **Date measured:** 2026-05-11
- **Environment overrides:** `ACDREAM_WB_DIAG=1`, `ACDREAM_STREAM_RADIUS=<per-run>`

## §2. Dispatch CPU / GPU numbers

Each cell records the median of the last 3 `[WB-DIAG]` lines from a ~30s stable window. `entSeen / entDrawn / groups` are also from those lines. FPS read from the window title.

| Radius | Motion | cpu_us median | cpu_us p95 | gpu_us median | gpu_us p95 | FPS | entSeen | entDrawn | groups |
|---|---|---|---|---|---|---|---|---|---|
| 4 | standstill | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
| 4 | walking    | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
| 8 | standstill | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
| 8 | walking    | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
| 12| standstill | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
| 12| walking    | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |

## §3. Surface-format histogram

From `ACDREAM_DUMP_SURFACES=1` at radius=12, ~30s after enter-world.

- **Total unique GL textures:** <n>
- **Total bytes (sum of W*H*4):** <n>
- **Top 10 (W, H) dimension buckets:**
  - `<W>x<H>`: <count>
  - ... (paste from baseline-surfaces.txt rollup)
- **Format distribution:**
  - `<format>`: <count>
- **Top 10 (W, H, format) triples — atlas-opportunity input:**
  - `<W>x<H> <format>`: <count>
  - ...

**Atlas-opportunity score:** <pct>% of surfaces fall into the top-3 (W, H, format) triples. (A score >30% means atlas consolidation could meaningfully reduce sampler switches + memory overhead; <15% means scattered content and atlas is not worth the slice-2 effort.)

## §4. Conclusion + next-phase recommendation

<Opinionated paragraph addressing:
 1. Is the entity dispatcher CPU-bound or GPU-bound at radius=12?
    - Compare cpu_us p95 vs gpu_us p95. The larger one is the bottleneck.
 2. Does gpu_us p95 leave headroom at 60 fps target (16.7 ms / 16667 µs)?
    - If gpu_us p95 < 8000 µs: comfortable headroom.
    - If gpu_us p95 < 14000 µs: tight but OK.
    - If gpu_us p95 >= 14000 µs: GPU-saturated, persistent-mapped buffers and compute cull help.
 3. Does the atlas score justify slice-2 atlas work?
 4. Given (1)-(3), which is the right next phase?
    - CPU-bound + low atlas score: pivot to C.1.5 (visible content, perf already comfortable).
    - GPU-bound + high atlas score: do N.6 slice 2 (atlas + persistent buffers).
    - Either-bound + headroom + low atlas score: do C.1.5 first.
    - GPU saturated + need for more headroom: escalate to Tier 2.>

## §5. Raw logs

Scratch logs from this measurement run (not committed):
- `baseline-r4-stand.log`, `baseline-r4-walk.log`
- `baseline-r8-stand.log`, `baseline-r8-walk.log`
- `baseline-r12-stand.log`, `baseline-r12-walk.log`
- `baseline-surfaces.log`, `baseline-surfaces.txt`

Fill in every <n> and <pct> and the conclusion paragraph with the real values from Task 3. Do NOT leave any <n> placeholders. If a measurement is missing, re-run that step from Task 3 before continuing.
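The atlas-opportunity score in §3 is the share of surfaces that fall into the three most common (W, H, format) triples. A minimal sketch of that arithmetic follows, assuming counts shaped like the `bucketsByTriple` dictionary in `DumpSurfaceHistogram`; the bucket counts here are invented for illustration and are not measurements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up bucket counts keyed by (W, H, format).
var bucketsByTriple = new Dictionary<(int W, int H, string F), int>
{
    [(256, 256, "RGBA8_DECODED")] = 40,
    [(128, 128, "RGBA8_DECODED")] = 25,
    [(64, 64, "RGBA8_DECODED")] = 15,
    [(512, 512, "RGBA8_DECODED")] = 12,
    [(32, 32, "RGBA8_DECODED")] = 8,
};

int total = bucketsByTriple.Values.Sum();
int topThree = bucketsByTriple.Values.OrderByDescending(c => c).Take(3).Sum();

// Guard against an empty dump so the score is 0 rather than NaN.
double atlasScorePct = total == 0 ? 0.0 : 100.0 * topThree / total;

Console.WriteLine($"Atlas-opportunity score: {atlasScorePct:F1}% ({topThree}/{total})");
```

Note that the dump's rollup lists only the top 10 triples, so `total` should come from the "Total unique GL textures" line rather than from summing the listed buckets.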

  • Step 4.2: Read the current roadmap N.6 entry
Read offset 685, limit 25 from docs/plans/2026-04-11-roadmap.md

Confirm the bullet starts with `- **N.6 — Perf polish.** **Planned (post-A.5 polish takes priority).**` and ends with `Plan + spec written when work begins. **Estimate: 1-2 weeks.**` Capture the exact text verbatim for Step 4.3's old_string.

  • Step 4.3: Amend the roadmap entry

Use Edit. The change splits N.6 into slice 1 (shipping with this commit) and slice 2 (deferred until after C.1.5).

old_string: the exact N.6 bullet copied from the Read in Step 4.2.

new_string:

- **N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** **SHIPPED 2026-05-11.**
  Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3
  query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel
  desktop GL). Added env-gated surface-format histogram dump in `TextureCache`
  for atlas-opportunity audit. Captured authoritative baseline at Holtburg
  radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us`
  diagnostic. Plan + spec at `docs/superpowers/{specs,plans}/2026-05-11-phase-n6-slice1-*.md`.
  Baseline numbers + next-phase recommendation at
  [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md).
- **N.6 slice 2 — Perf polish cleanup.** **Planned — deferred until after C.1.5
  (PES emitter wiring) per the baseline doc's recommendation.** Builds on
  slice 1's measurement. Scope: retire the legacy `Texture2D`/`sampler2D` path
  in `TextureCache` (currently kept for Sky + Debug + particle paths now that
  Terrain has migrated); delete orphan `mesh.frag` (verify zero callers post-N.5
  amendment); decide bindless-everywhere vs legacy-island for the remaining
  `sampler2D` consumers; conditionally adopt WB atlas if the slice-1 histogram
  shows a real opportunity; conditionally adopt persistent-mapped buffers if
  the slice-1 baseline shows `BufferSubData` as a hot spot; GPU compute culling
  remains out-of-scope (that's Tier 3 of the perf-tiers roadmap, gated on
  Tier 2 first). Plan + spec written when work begins. **Estimate: 1-2 weeks
  once C.1.5 lands.**
  • Step 4.4: Build (sanity check — only docs touched, but be safe)
dotnet build

Expected: build succeeds. (No code touched in Task 4; this just confirms nothing was accidentally edited in src/.)

  • Step 4.5: Commit 2
git add src/AcDream.App/Rendering/TextureCache.cs `
        src/AcDream.App/Rendering/GameWindow.cs `
        docs/plans/2026-05-11-phase-n6-perf-baseline.md `
        docs/plans/2026-04-11-roadmap.md
git commit -m @'
docs(perf): Phase N.6 slice 1 — radius=12 baseline + surface dump path

Capture authoritative CPU+GPU dispatch numbers at Holtburg with the
gpu_us diagnostic now working (commit <prev SHA from Task 1.10>). Three
radii (4/8/12) × two motion modes (standstill/walking) + a surface-format
histogram from ACDREAM_DUMP_SURFACES=1.

Adds env-gated one-shot dump path (TextureCache.TickSurfaceHistogramDumpIfEnabled,
called from GameWindow.OnRender) that fires once at frame 600 of the
session — zero cost when off, writes to %LOCALAPPDATA%\acdream\n6-surfaces.txt.

Baseline document at docs/plans/2026-05-11-phase-n6-perf-baseline.md
closes with a recommendation paragraph for the next phase. Roadmap entry
amended to reflect the slice 1 / slice 2 split.

Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (§5, §6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
'@
git status

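Expected: nothing staged and no modified tracked files. The scratch logs from Task 3 may still be listed as untracked; that is fine.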

  • Step 4.6: Final sanity sweep
git log -3 --oneline

Expected: the two new commits from this slice listed first, newest on top (this docs/perf commit, then the GPU timing fix from Task 1.10), sitting on top of the pre-slice history that contains spec commit 05d590c.

Also confirm the scratch baseline-r*.log and baseline-surfaces.* files are still NOT in the commit (they were not staged):

git status

Expected: nothing staged and no modified tracked files. If the scratch logs show up as untracked, that's fine; they can be deleted manually:

Remove-Item baseline-r*.log, baseline-surfaces.log, baseline-surfaces.txt, task1-verify.log, task2-verify.log -ErrorAction SilentlyContinue

Acceptance check (spec §9)

After Task 4 commits, walk through the spec's acceptance criteria and confirm each one. This is a paper-walk, not a re-run — the steps above produce the conditions.

  • A1: [WB-DIAG] reports non-zero gpu_us at radius=12. Verified in Task 1.9 (initial check) and Task 3.5-3.6 (full baseline run). Confirm by re-grepping baseline-r12-stand.log:

    Select-String -Path baseline-r12-stand.log -Pattern "gpu_us=[1-9]"
    

    Should return at least one line.

  • A2: Vendor-neutral. No vendor-specific GL extension references (GL_NV_*, GL_AMD_*, GL_INTEL_*) in the change. Re-grep:

    Select-String -Path src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs -Pattern "NV_|AMD_|INTEL_|GL_NV|GL_AMD|GL_INTEL"
    

    Expected: no matches in the new code. Select-String matches case-insensitively by default, so incidental hits elsewhere in the file from unrelated existing code don't count.

  • A3: Baseline doc has real numbers + conclusion. Open docs/plans/2026-05-11-phase-n6-perf-baseline.md and visually confirm no <n>, <pct>, TBD, or empty conclusion section.

  • A4: Roadmap split shipped.

    Select-String -Path docs/plans/2026-04-11-roadmap.md -Pattern "N\.6 slice"
    

    Expected: two matches (slice 1 + slice 2 bullets).

  • A5: dotnet build green, no new warnings.

    dotnet build
    

    Expected: succeeds. Note any new warnings vs the build output before the slice started.

  • A6: dotnet test green at baseline (~1688 passing, ~8 pre-existing failures).

    dotnet test --no-build
    

    Expected: pass count unchanged from before the slice started; failure list unchanged.

  • A7: No visible regression. Confirmed during Task 1.9 and Task 3 measurements — the user was in-world repeatedly and didn't observe any rendering issue. If anything looked off during measurement, file it as an issue and decide whether it blocks slice 1 acceptance.

If any acceptance criterion fails, return to the relevant task and redo it. Do not declare slice 1 complete while any criterion is failing.
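
The visual check in A3 can be backed by a mechanical one. The sketch below is a hedged helper rather than anything the spec requires: the function name `Find-Placeholder` is hypothetical, and the token list goes slightly beyond A3's wording (`<n>`, `<pct>`, TBD) by also covering the other angle-bracket slots that appear in the template (`<count>`, `<W>`, `<H>`, `<format>`).

```powershell
function Find-Placeholder {
    # Returns every input line that still contains a template placeholder.
    # The token list is an assumption based on the baseline template.
    param([string[]]$Lines)

    $Lines |
        Select-String -CaseSensitive -Pattern '<(n|pct|count|W|H|format)>|\bTBD\b' |
        ForEach-Object { $_.Line }
}

# Usage against the real file (path taken from criterion A3):
# Find-Placeholder -Lines (Get-Content docs/plans/2026-05-11-phase-n6-perf-baseline.md)
```

An empty result means no known placeholder survived. It does not prove the conclusion paragraph is meaningful, so the visual read of §4 is still needed.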


After slice 1 lands

The baseline document's conclusion paragraph (§4) determines the next phase: N.6 slice 2, C.1.5, or Tier 2.

The choice is data-driven; the recommendation paragraph is the contract. Don't re-litigate the decision once the numbers are in.