acdream/docs/superpowers/plans/2026-05-11-phase-n6-slice1.md
Erik a4931eeaa2 docs(perf): Phase N.6 slice 1 — implementation plan
Step-by-step plan for the two-commit slice: fix WbDrawDispatcher's
gpu_us double-buffering bug (ring-of-3 query slots, read-before-overwrite,
vendor-neutral) then capture the radius=12 baseline at Holtburg with
the now-working diagnostic. Includes exact old_string/new_string Edit
patterns for every code change, PowerShell launch + measurement
procedure for the manual baseline, baseline doc template with explicit
fill-in slots, and a per-criterion acceptance checklist.

Output companion to docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md
(commit 05d590c).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 11:12:26 +02:00

# Phase N.6 slice 1 Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Fix the broken `gpu_us` diagnostic in `WbDrawDispatcher` (vendor-neutral OpenGL query ring) and produce one authoritative perf baseline document at Holtburg radius=12 so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) is grounded in real numbers.

**Architecture:** Two commits. Commit 1 changes only `WbDrawDispatcher.cs`: it replaces the two `uint` GL query handles with ring-of-3 arrays and moves the result read to *before* the next frame overwrites the slot (read frame N-3's queries, then overwrite). Commit 2 adds an env-gated surface-format histogram dump in `TextureCache.cs` (ticked once per frame from `GameWindow.cs`), captures the actual measurement, writes the baseline doc, and amends the roadmap entry. There are no new automated tests: the GPU-timing fix has no observable behavior in tests, and the dump path is an env-gated diagnostic only; verification is manual launch-and-look.

**Tech Stack:** C# / .NET 10, Silk.NET (OpenGL 4.3+), `dotnet build` / `dotnet test` from PowerShell, live ACE on `127.0.0.1:9000` for in-world verification.

**Spec:** [docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md](../specs/2026-05-11-phase-n6-slice1-design.md) (committed at `05d590c`).

---

## File Structure
| File | Action | Responsibility |
|---|---|---|
| [`src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`](../../../src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs) | Modify | Replace 2 `uint` query handles with ring-of-3 arrays; move query result read to before next-frame overwrite. |
| [`src/AcDream.App/Rendering/TextureCache.cs`](../../../src/AcDream.App/Rendering/TextureCache.cs) | Modify | Add upload-time dimension/format tracking + env-gated `TickSurfaceHistogramDumpIfEnabled()` method that fires once at frame 600. |
| [`src/AcDream.App/Rendering/GameWindow.cs`](../../../src/AcDream.App/Rendering/GameWindow.cs) | Modify | Call `_textureCache.TickSurfaceHistogramDumpIfEnabled()` once per frame in `OnRender`. |
| `docs/plans/2026-05-11-phase-n6-perf-baseline.md` | Create | Baseline measurement doc: setup, numbers at radii 4/8/12 (standstill + walking), surface histogram summary, conclusion paragraph recommending next phase. |
| [`docs/plans/2026-04-11-roadmap.md`](../../plans/2026-04-11-roadmap.md) lines 690-705 | Modify | Amend N.6 entry to reflect the slice 1 / slice 2 split. |
---
## Task 1: GPU query ring buffering (commit 1)
**Files:**
- Modify: `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`
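The six edits below (Steps 1.1 to 1.6) are each isolated by an exact string. Apply them in the documented order. The file will not compile until all six are in place (Step 1.4 uses the `gpuQuerySlot` local that Step 1.3 introduces), so build only at Step 1.7; keeping the documented order also makes the resulting diff easier to review.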
- [ ] **Step 1.1: Replace the field declarations (~line 155)**
Use Edit to replace the existing field block:
**old_string:**
```csharp
private uint _gpuQueryOpaque;
private uint _gpuQueryTransparent;
private readonly long[] _gpuSamples = new long[256]; // microseconds
private int _gpuSampleCursor;
private bool _gpuQueriesInitialized;
```
**new_string:**
```csharp
// GPU timing uses a ring of 3 query-pair slots so the read of frame N-3's
// result lands when the GPU has finished (~50ms after issue on a typical
// 60fps frame). Ring of 3 is the vendor-neutral choice: NVIDIA drivers with
// triple-buffering+vsync can queue ~3 frames ahead, AMD typically 1-2,
// Intel iGPUs vary. ResultAvailable is the safety guard if the GPU is
// still working when we try to read.
private const int GpuQueryRingDepth = 3;
private readonly uint[] _gpuQueryOpaque = new uint[GpuQueryRingDepth];
private readonly uint[] _gpuQueryTransparent = new uint[GpuQueryRingDepth];
private int _gpuQueryFrameIndex;
private readonly long[] _gpuSamples = new long[256]; // microseconds
private int _gpuSampleCursor;
private bool _gpuQueriesInitialized;
```
- [ ] **Step 1.2: Replace the init block (~line 347)**
**old_string:**
```csharp
if (diag && !_gpuQueriesInitialized)
{
    _gpuQueryOpaque = _gl.GenQuery();
    _gpuQueryTransparent = _gl.GenQuery();
    _gpuQueriesInitialized = true;
}
```
**new_string:**
```csharp
if (diag && !_gpuQueriesInitialized)
{
    for (int i = 0; i < GpuQueryRingDepth; i++)
    {
        _gpuQueryOpaque[i] = _gl.GenQuery();
        _gpuQueryTransparent[i] = _gl.GenQuery();
    }
    _gpuQueriesInitialized = true;
}
```
- [ ] **Step 1.3: Insert the read-before-overwrite block + compute slot just before the opaque query begin (~line 774)**
This step replaces the existing single-line `BeginQuery` for opaque with a block that first computes the slot, reads the slot's frame N-3 result (gated on having completed one ring), then issues the new query into the same slot.
**old_string:**
```csharp
_gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer);
if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque);
```
**new_string:**
```csharp
_gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer);
// GPU timing: compute this frame's ring slot. We read frame N-3's
// result (the oldest data in the ring) before overwriting it with
// frame N's queries. See spec §3 Q1/Q2 + §4 in
// docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md.
int gpuQuerySlot = _gpuQueryFrameIndex % GpuQueryRingDepth;
if (_gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth)
{
    _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot], QueryObjectParameterName.ResultAvailable, out int avail);
    if (avail != 0)
    {
        _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot], QueryObjectParameterName.Result, out ulong opaqueNs);
        _gl.GetQueryObject(_gpuQueryTransparent[gpuQuerySlot], QueryObjectParameterName.Result, out ulong transNs);
        long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
        _gpuSamples[_gpuSampleCursor] = gpuUs;
        _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
    }
    // If avail==0 the sample is dropped silently. MedianMicros
    // computes over the non-zero subset, so dropped samples don't
    // poison the median.
}
if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque[gpuQuerySlot]);
```
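For intuition only (this is not part of the Edit), the slot arithmetic in Step 1.3 can be modeled without any GL calls. The sketch below is a minimal model under stated assumptions: `FakeQueryRing` and `Tick` are hypothetical names, and each slot simply remembers which frame last wrote it, standing in for a query object.

```csharp
// Illustration only: models ring-of-3 slot reuse. FakeQueryRing is a
// hypothetical stand-in and does not exist in WbDrawDispatcher.
public sealed class FakeQueryRing
{
    private const int Depth = 3;

    // Which frame last issued a query into each slot (-1 = never used).
    private readonly int[] _slotFrame = { -1, -1, -1 };
    private int _frameIndex;

    // Runs one frame. Returns the frame whose result is read this frame,
    // or -1 while the ring is still warming up (first Depth frames).
    public int Tick()
    {
        int slot = _frameIndex % Depth;

        // Read BEFORE overwrite: the slot still holds frame N-3's query.
        int readFrame = _frameIndex >= Depth ? _slotFrame[slot] : -1;

        // Then reuse the slot for this frame's query.
        _slotFrame[slot] = _frameIndex;
        _frameIndex++;
        return readFrame;
    }
}
```

Calling `Tick()` six times on a fresh instance yields no read for frames 0 to 2, then reads of frames 0, 1, and 2 at frames 3, 4, and 5, which is the N-3 lag the step describes.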
- [ ] **Step 1.4: Update the transparent query begin to use the same slot (~line 823)**
**old_string:**
```csharp
if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent);
```
**new_string:**
```csharp
if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent[gpuQuerySlot]);
```
- [ ] **Step 1.5: Replace the buggy in-frame read block + increment frame counter (~line 849)**
**old_string:**
```csharp
// Read GPU samples non-blocking; the result for the previous frame's
// queries should be ready by now. If not, drop the sample (don't stall
// the CPU waiting for the GPU).
if (_gpuQueriesInitialized)
{
    _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.ResultAvailable, out int avail);
    if (avail != 0)
    {
        _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.Result, out ulong opaqueNs);
        _gl.GetQueryObject(_gpuQueryTransparent, QueryObjectParameterName.Result, out ulong transNs);
        long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
        _gpuSamples[_gpuSampleCursor] = gpuUs;
        _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
    }
}
_drawsIssued += _opaqueDrawCount + _transparentDrawCount;
```
**new_string:**
```csharp
// GPU sample read happens BEFORE issuing the next frame's queries
// (see step 1.3 above). Increment the frame counter here so the
// next call computes a fresh slot.
if (_gpuQueriesInitialized) _gpuQueryFrameIndex++;
_drawsIssued += _opaqueDrawCount + _transparentDrawCount;
```
- [ ] **Step 1.6: Update Dispose to delete the full ring (~line 1140)**
**old_string:**
```csharp
if (_gpuQueriesInitialized)
{
    _gl.DeleteQuery(_gpuQueryOpaque);
    _gl.DeleteQuery(_gpuQueryTransparent);
}
```
**new_string:**
```csharp
if (_gpuQueriesInitialized)
{
    for (int i = 0; i < GpuQueryRingDepth; i++)
    {
        _gl.DeleteQuery(_gpuQueryOpaque[i]);
        _gl.DeleteQuery(_gpuQueryTransparent[i]);
    }
}
```
- [ ] **Step 1.7: Build**
Run from the worktree root:
```powershell
dotnet build
```
Expected: build succeeds with no new warnings or errors. If the build fails, the most likely cause is a missed string in one of the steps above — re-grep `_gpuQueryOpaque` and `_gpuQueryTransparent` in `WbDrawDispatcher.cs` and confirm every reference uses the array-indexed form `[gpuQuerySlot]` or `[i]`.
- [ ] **Step 1.8: Run the test suite**
```powershell
dotnet test --no-build
```
Expected: same pass/fail baseline as before the change (~1688 passing, ~8 pre-existing physics/input failures unchanged). No new failures.
- [ ] **Step 1.9: Manual verification — launch live and confirm `gpu_us` reports non-zero**
```powershell
$env:ACDREAM_DAT_DIR = "$env:USERPROFILE\Documents\Asheron's Call"
$env:ACDREAM_LIVE = "1"
$env:ACDREAM_TEST_HOST = "127.0.0.1"
$env:ACDREAM_TEST_PORT = "9000"
$env:ACDREAM_TEST_USER = "testaccount"
$env:ACDREAM_TEST_PASS = "testpassword"
$env:ACDREAM_WB_DIAG = "1"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "task1-verify.log"
```
In-world: walk Holtburg for ~30 seconds. Close the window when done.
Verification check on `task1-verify.log`:
```powershell
Select-String -Path task1-verify.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 5
```
Expected output: at least one `[WB-DIAG]` line where `gpu_us=Xm/Yp95` has X > 0 (typically tens to low-hundreds of microseconds at radius=4-12 on a modern GPU). If `gpu_us=0m/0p95` persists for the entire run, the fix didn't take — check whether the build actually rebuilt (try `dotnet build -c Debug` then re-launch).
Also confirm: no visible regression in the client. Entities render, animations play, sky cycles. Close the client cleanly.
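To make the `Xm/Yp95` readout concrete, here is a minimal sketch of how a median and 95th percentile could be derived from a fixed-size sample buffer like `_gpuSamples`. The comment in Step 1.3 states that `MedianMicros` works over the non-zero subset; everything else here is an assumption for illustration: `SampleStats` and `Summarize` are hypothetical names, and the nearest-rank percentile rule may differ from what the dispatcher actually uses.

```csharp
using System.Linq;

// Illustration only: SampleStats is a hypothetical helper, not dispatcher code.
public static class SampleStats
{
    // Median and p95 over the non-zero entries of a sample buffer.
    // Returns (0, 0) when nothing has been recorded yet, which matches
    // a readout stuck at 0m/0p95.
    public static (long Median, long P95) Summarize(long[] samples)
    {
        long[] sorted = samples.Where(s => s > 0).OrderBy(s => s).ToArray();
        if (sorted.Length == 0) return (0, 0);

        // Upper-middle element for even counts.
        long median = sorted[sorted.Length / 2];

        // Nearest-rank p95 in integer math: ceil(0.95 * n) - 1.
        int rank = (95 * sorted.Length + 99) / 100 - 1;
        return (median, sorted[rank]);
    }
}
```

Because unwritten slots stay at zero and are filtered out, a buffer that has only received a handful of samples still produces a meaningful median rather than one dragged toward zero.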
- [ ] **Step 1.10: Commit**
```powershell
git add src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs
git commit -m @'
feat(perf): Phase N.6 slice 1 — fix gpu_us double-buffering in WbDrawDispatcher

The dispatcher's GPU TimeElapsed queries were polled in the same frame
as the indirect draw, so glGetQueryObject(ResultAvailable) always
returned 0 and gpu_us in [WB-DIAG] was stuck at 0m/0p95.

Replace the 2 single-handle queries with ring-of-3 arrays and move the
result read to BEFORE issuing the next frame's queries into the same
slot: at frame N we read slot N%3, which holds frame N-3's queries
(oldest in the ring, ~50ms old at 60fps, so the result is normally
ready; ResultAvailable still guards the read and a late sample is
dropped). Vendor-neutral: AMD/NVIDIA/Intel desktop GL all work without
driver-specific code.

No new tests: the change is purely a diagnostic readout fix with no
observable behavior in the rendering path. Manual verification:
[WB-DIAG] now reports non-zero gpu_us at Holtburg radius=12.

Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (§4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
'@
git status
```
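Expected: no modified or staged files remain after the commit. The scratch `task1-verify.log` from Step 1.9 may still show as untracked unless it is ignored. Note the new commit SHA; it is needed for the baseline doc's "measured against" reference.

---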
## Task 2: Surface-format histogram dump path (part of commit 2 setup)
**Files:**
- Modify: `src/AcDream.App/Rendering/TextureCache.cs`
- Modify: `src/AcDream.App/Rendering/GameWindow.cs`
This task adds the env-gated one-shot dump infrastructure. It does NOT commit — the commit happens in Task 4 after the baseline document is also ready.
- [ ] **Step 2.1: Add upload-time metadata tracking in `TextureCache.cs`**
Add a new private dictionary that records `(width, height, formatLabel)` keyed by GL texture name. This lets `DumpSurfaceHistogram` emit dimension/format data without re-querying GL.
Use Edit to insert the field right after the existing bindless cache fields (~line 41, just after `_bindlessByPalette`):
**old_string:**
```csharp
private readonly Dictionary<(uint surfaceId, uint origTexOverride, ulong paletteHash), (uint Name, ulong Handle)> _bindlessByPalette = new();
public TextureCache(GL gl, DatCollection dats, Wb.BindlessSupport? bindless = null)
```
**new_string:**
```csharp
private readonly Dictionary<(uint surfaceId, uint origTexOverride, ulong paletteHash), (uint Name, ulong Handle)> _bindlessByPalette = new();
// Phase N.6 slice 1 (2026-05-11): per-upload metadata for the
// ACDREAM_DUMP_SURFACES=1 histogram dump path. Populated at upload
// time so the dump method doesn't have to query GL state. Keyed by
// GL texture name (same key used in cache value tuples). Format
// label is "RGBA8_DECODED" for the post-decode upload (all uploads
// currently land as RGBA8 regardless of source format).
private readonly Dictionary<uint, (int Width, int Height, string Format)> _uploadMetadata = new();
// Frame counter for the one-shot ACDREAM_DUMP_SURFACES=1 trigger.
// Increments per Tick call; fires the dump once at frame index 600
// and never again for the session. See spec §5.
private int _dumpFrameCounter;
private bool _surfaceHistogramAlreadyDumped;
public TextureCache(GL gl, DatCollection dats, Wb.BindlessSupport? bindless = null)
```
- [ ] **Step 2.2: Find the `UploadRgba8AsLayer1Array` method and record metadata there**
Locate the method using Grep:
```
pattern: "UploadRgba8AsLayer1Array"
path: src/AcDream.App/Rendering/TextureCache.cs
output_mode: content
-n: true
```
Read the method body (typically ~30-50 lines) to find the exact `return name;` line. The decoded texture has `decoded.Width`, `decoded.Height`, and `decoded.Rgba8` available.
For each `return name;` in `UploadRgba8AsLayer1Array(DecodedTexture decoded)`, insert this line immediately before it:
```csharp
_uploadMetadata[name] = (decoded.Width, decoded.Height, "RGBA8_DECODED");
```
If the method has only one `return name;` near its end, that's a single Edit. Use the surrounding 2-3 lines of context in `old_string` to make the Edit unique.
- [ ] **Step 2.3: Also record metadata in the legacy `UploadRgba8` (non-bindless) path**
Locate the method:
```
pattern: "private uint UploadRgba8\b"
path: src/AcDream.App/Rendering/TextureCache.cs
output_mode: content
-n: true
```
Apply the same `_uploadMetadata[name] = (decoded.Width, decoded.Height, "RGBA8_DECODED");` insertion before each `return name;` in `UploadRgba8(DecodedTexture decoded)`. This ensures the dump captures both legacy and modern uploads.
- [ ] **Step 2.4: Add the `TickSurfaceHistogramDumpIfEnabled` public method to `TextureCache.cs`**
Locate `HashPaletteOverride` using Grep:
```
pattern: "internal static ulong HashPaletteOverride"
path: src/AcDream.App/Rendering/TextureCache.cs
output_mode: content
-n: true
-A: 20
```
Identify its closing brace. Use Edit with surrounding context to insert the new methods immediately after.
**old_string:** (the last few lines of `HashPaletteOverride`):
```csharp
    foreach (var sp in p.SubPalettes)
    {
        h = (h ^ sp.SubPaletteId) * prime;
        h = (h ^ sp.Offset) * prime;
        h = (h ^ sp.Length) * prime;
    }
    return h;
}
```
**new_string:**
```csharp
    foreach (var sp in p.SubPalettes)
    {
        h = (h ^ sp.SubPaletteId) * prime;
        h = (h ^ sp.Offset) * prime;
        h = (h ^ sp.Length) * prime;
    }
    return h;
}

/// <summary>
/// Phase N.6 slice 1: one-shot surface-format histogram dump for the
/// atlas-opportunity audit. Activated by ACDREAM_DUMP_SURFACES=1; fires
/// once at frame 600 of the session (~10s at 60fps, ~3s at 200fps), so
/// only textures uploaded by that frame are counted. Output goes to
/// %LOCALAPPDATA%\acdream\n6-surfaces.txt. When off, the only cost is
/// one environment-variable lookup per frame.
/// See spec §5 in docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md.
/// </summary>
public void TickSurfaceHistogramDumpIfEnabled()
{
    if (_surfaceHistogramAlreadyDumped) return;
    if (!string.Equals(Environment.GetEnvironmentVariable("ACDREAM_DUMP_SURFACES"), "1", StringComparison.Ordinal)) return;
    _dumpFrameCounter++;
    if (_dumpFrameCounter < 600) return;
    DumpSurfaceHistogram();
    _surfaceHistogramAlreadyDumped = true;
}

private void DumpSurfaceHistogram()
{
    var localAppData = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData);
    var outDir = System.IO.Path.Combine(localAppData, "acdream");
    System.IO.Directory.CreateDirectory(outDir);
    var outPath = System.IO.Path.Combine(outDir, "n6-surfaces.txt");
    var sb = new System.Text.StringBuilder();
    sb.AppendLine($"# acdream surface-format histogram — generated {DateTime.UtcNow:yyyy-MM-ddTHH:mm:ssZ}");
    sb.AppendLine("# Per-entry: surfaceId(hex), width, height, format, byteCount");
    sb.AppendLine();

    // Walk every cached entry across the 6 caches, dedupe by GL name.
    var seen = new HashSet<uint>();
    long totalBytes = 0;
    var bucketsByDim = new Dictionary<(int W, int H), int>();
    var bucketsByFormat = new Dictionary<string, int>();
    var bucketsByTriple = new Dictionary<(int W, int H, string F), int>();

    void Emit(uint surfaceId, uint name)
    {
        if (!seen.Add(name)) return;
        if (!_uploadMetadata.TryGetValue(name, out var meta)) return;
        int bytes = meta.Width * meta.Height * 4;
        totalBytes += bytes;
        sb.AppendLine($"0x{surfaceId:X8}, {meta.Width}, {meta.Height}, {meta.Format}, {bytes}");
        var dimKey = (meta.Width, meta.Height);
        bucketsByDim[dimKey] = bucketsByDim.GetValueOrDefault(dimKey) + 1;
        bucketsByFormat[meta.Format] = bucketsByFormat.GetValueOrDefault(meta.Format) + 1;
        var tripleKey = (meta.Width, meta.Height, meta.Format);
        bucketsByTriple[tripleKey] = bucketsByTriple.GetValueOrDefault(tripleKey) + 1;
    }

    foreach (var kv in _handlesBySurfaceId) Emit(kv.Key, kv.Value);
    foreach (var kv in _handlesByOverridden) Emit(kv.Key.surfaceId, kv.Value);
    foreach (var kv in _handlesByPalette) Emit(kv.Key.surfaceId, kv.Value);
    foreach (var kv in _bindlessBySurfaceId) Emit(kv.Key, kv.Value.Name);
    foreach (var kv in _bindlessByOverridden) Emit(kv.Key.surfaceId, kv.Value.Name);
    foreach (var kv in _bindlessByPalette) Emit(kv.Key.surfaceId, kv.Value.Name);

    sb.AppendLine();
    sb.AppendLine("# Rollups");
    sb.AppendLine($"# Total unique GL textures: {seen.Count}");
    sb.AppendLine($"# Total bytes (sum of W*H*4): {totalBytes}");
    sb.AppendLine("# Top 10 (W,H) dimension buckets:");
    foreach (var kv in bucketsByDim.OrderByDescending(kv => kv.Value).Take(10))
        sb.AppendLine($"# {kv.Key.W}x{kv.Key.H}: {kv.Value}");
    sb.AppendLine("# Format buckets:");
    foreach (var kv in bucketsByFormat.OrderByDescending(kv => kv.Value))
        sb.AppendLine($"# {kv.Key}: {kv.Value}");
    sb.AppendLine("# Top 10 (W,H,format) triples — atlas-opportunity input:");
    foreach (var kv in bucketsByTriple.OrderByDescending(kv => kv.Value).Take(10))
        sb.AppendLine($"# {kv.Key.W}x{kv.Key.H} {kv.Key.F}: {kv.Value}");

    System.IO.File.WriteAllText(outPath, sb.ToString());
    Console.WriteLine($"[N6-DUMP] Surface histogram written to {outPath} ({seen.Count} textures, {totalBytes} bytes)");
}
```
- [ ] **Step 2.5: Confirm `using System.Linq;` is present in `TextureCache.cs`**
Read the file's `using` section (top of file). If `using System.Linq;` is NOT present, add it. The `OrderByDescending` and `Take` calls in `DumpSurfaceHistogram` need it.
Pattern:
```
pattern: "^using System\.Linq"
path: src/AcDream.App/Rendering/TextureCache.cs
output_mode: count
```
If count is 0, add `using System.Linq;` in alphabetical order with the other usings at the top of the file.
- [ ] **Step 2.6: Add the per-frame call site in `GameWindow.cs`**
Find a stable insertion point near the top of `OnRender` (starts at line 6288). Use Grep:
```
pattern: "_gl!\.Clear\("
path: src/AcDream.App/Rendering/GameWindow.cs
output_mode: content
-n: true
-A: 3
```
This finds the `Clear` call(s) in or near `OnRender`. The first one after line 6288 is the insertion point. Read 5 lines of context around it, then use one Edit to insert the dump tick on the line immediately after the `Clear` call:
**old_string:** (find the `Clear` call in `OnRender` and capture 1-2 lines of its context — varies; common pattern is `_gl!.Clear(ClearBufferMask.ColorBufferBit | ClearBufferMask.DepthBufferBit);` followed by the next line of `OnRender` work).
**new_string:** the same `Clear` call followed by:
```csharp
// Phase N.6 slice 1: one-shot surface-format histogram dump under
// ACDREAM_DUMP_SURFACES=1. Zero cost when off.
_textureCache?.TickSurfaceHistogramDumpIfEnabled();
```
If `OnRender` has multiple `Clear` calls, place the tick after the first one inside the method body. The call must run exactly once per frame, before any rendering work — placing it right after `Clear` accomplishes both.
- [ ] **Step 2.7: Build**
```powershell
dotnet build
```
Expected: build succeeds with no new warnings. If the compiler reports CS1061 (no definition or accessible extension method for `OrderByDescending` or `Take`), Step 2.5 was missed: add `using System.Linq;` and rebuild.
- [ ] **Step 2.8: Run the test suite**
```powershell
dotnet test --no-build
```
Expected: same pass/fail baseline (~1688 passing, ~8 pre-existing failures). No new failures.
- [ ] **Step 2.9: Manual verification — confirm the dump file appears**
Launch with the dump env var on:
```powershell
$env:ACDREAM_DUMP_SURFACES = "1"
$env:ACDREAM_WB_DIAG = "1"
# Other env vars same as Task 1 Step 1.9
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "task2-verify.log"
```
Wait ~15 seconds after the window appears, then close it. Check the file:
```powershell
Get-Content "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" | Select-Object -First 30
```
Expected: a non-empty file with the header, per-entry rows, and rollup sections. Also confirm one `[N6-DUMP] Surface histogram written to ...` line in `task2-verify.log`; it is printed when the dump fires at frame 600, not at window close.
If the file is empty or missing:
- Check the launch log for the `[N6-DUMP]` line.
- If it's not there, `_dumpFrameCounter` didn't reach 600 because the window was closed too early. Re-run and wait longer.
- If it's there but the file lookup fails, the path printed in the log shows where the file was actually written; investigate that path.

**Do not commit yet.** Continue to Task 3.

---
## Task 3: Capture baseline measurements
**Files:**
- Create: `docs/plans/2026-05-11-phase-n6-perf-baseline.md` (final content lands in Task 4 — this task just collects the numbers).
This is the manual measurement task. Each step launches the client, runs a specific scenario, and captures the diagnostic output. Save each log separately for the final write-up. Total expected time: ~30-45 min.
Setup once per session:
```powershell
$env:ACDREAM_DAT_DIR = "$env:USERPROFILE\Documents\Asheron's Call"
$env:ACDREAM_LIVE = "1"
$env:ACDREAM_TEST_HOST = "127.0.0.1"
$env:ACDREAM_TEST_PORT = "9000"
$env:ACDREAM_TEST_USER = "testaccount"
$env:ACDREAM_TEST_PASS = "testpassword"
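$env:ACDREAM_WB_DIAG = "1"
# Step 2.9 may have left this set in the same session; clear it so the
# one-shot dump does not fire in the middle of a timing run.
Remove-Item Env:\ACDREAM_DUMP_SURFACES -ErrorAction SilentlyContinue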
```
For each measurement run, set `ACDREAM_STREAM_RADIUS` before launch. Use the `QualityPreset=High` default (no overrides). All runs at Holtburg with `+Acdream` at clear midday (cycle weather with F10 → Clear, time with F7 → Noon).
Per run, after ~30 seconds at the target condition, close the window and grep the log for the last 3 `[WB-DIAG]` lines — those have the steady-state numbers.
- [ ] **Step 3.1: Capture radius=4 standstill**
```powershell
$env:ACDREAM_STREAM_RADIUS = "4"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r4-stand.log"
```
In-world: enter world, do not move, hold position for 30 seconds. Close.
```powershell
Select-String -Path baseline-r4-stand.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 3
```
Record the median across the last 3 lines for each of: `cpu_us` (median and p95), `gpu_us` (median and p95), `entSeen`, `entDrawn`, `groups`. Also note the window-title FPS shown during the test.
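Taking the middle value of three lines by eye is easy to get wrong across six runs, so the extraction can be sketched as a small helper. This is a minimal sketch, not project code: `WbDiagParse` and its methods are hypothetical, and it assumes each timing field is printed as `key=<median>m/<p95>p95`, the shape Step 1.9 shows for `gpu_us`. Whether `cpu_us` uses the same shape is an assumption to check against a real log.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Illustration only: WbDiagParse is a hypothetical helper.
public static class WbDiagParse
{
    // Extracts the "<key>=<median>m/<p95>p95" pair from one [WB-DIAG] line.
    // Returns null when the key is absent or printed in a different shape.
    public static (long Median, long P95)? Read(string line, string key)
    {
        Match m = Regex.Match(line, Regex.Escape(key) + @"=(\d+)m/(\d+)p95");
        if (!m.Success) return null;
        return (long.Parse(m.Groups[1].Value), long.Parse(m.Groups[2].Value));
    }

    // Middle value of the per-line medians (upper-middle for even counts).
    // Lines that do not contain the key are ignored; returns 0 if none match.
    public static long MedianOfMedians(string[] lines, string key)
    {
        long[] values = lines.Select(l => Read(l, key))
                             .Where(v => v.HasValue)
                             .Select(v => v.GetValueOrDefault().Median)
                             .OrderBy(v => v)
                             .ToArray();
        return values.Length == 0 ? 0 : values[values.Length / 2];
    }
}
```

Feeding it the three lines returned by the `Select-String` command above, with `"gpu_us"` as the key, gives the value for the `gpu_us median` column; the p95 column can be handled the same way by selecting `P95` instead.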
- [ ] **Step 3.2: Capture radius=4 walking**
```powershell
$env:ACDREAM_STREAM_RADIUS = "4"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r4-walk.log"
```
In-world: enter world, Tab to player mode, walk N→E→S→W across one landblock over ~30 seconds. Close.
Capture same numbers as 3.1.
- [ ] **Step 3.3: Capture radius=8 standstill**
```powershell
$env:ACDREAM_STREAM_RADIUS = "8"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r8-stand.log"
```
Same procedure as 3.1. Wait ~40 seconds before recording (streaming takes longer to settle).
- [ ] **Step 3.4: Capture radius=8 walking**
```powershell
$env:ACDREAM_STREAM_RADIUS = "8"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r8-walk.log"
```
Same procedure as 3.2.
- [ ] **Step 3.5: Capture radius=12 standstill**
```powershell
$env:ACDREAM_STREAM_RADIUS = "12"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r12-stand.log"
```
Same procedure as 3.1. Wait ~60 seconds before recording. This is the headline measurement: check whether `gpu_us` p95 sits well below the 60 fps frame budget (16.7 ms, about 16,667 µs) or is pushing against it.
- [ ] **Step 3.6: Capture radius=12 walking**
```powershell
$env:ACDREAM_STREAM_RADIUS = "12"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r12-walk.log"
```
Same procedure as 3.2 (walking across one landblock, ~30 seconds of motion within the 60s+ window).
- [ ] **Step 3.7: Capture the surface histogram**
```powershell
$env:ACDREAM_STREAM_RADIUS = "12"
$env:ACDREAM_DUMP_SURFACES = "1"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-surfaces.log"
```
In-world: enter world at Holtburg, do nothing for ~30 seconds (let the dump fire at frame 600). Close. Copy the file:
```powershell
Copy-Item "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" -Destination "baseline-surfaces.txt"
```
Inspect:
```powershell
Get-Content baseline-surfaces.txt | Select-Object -Last 40
```
Record the rollup section (total textures, total bytes, top 10 dimension buckets, format distribution, top 10 (W,H,format) triples).
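The baseline template in Task 4 also asks for an atlas-opportunity score: the share of surfaces that fall into the top-3 (W, H, format) triples. The rollup lists counts but not that percentage, so it has to be derived from the per-entry rows. A minimal sketch, assuming the row layout that `DumpSurfaceHistogram` writes (`surfaceId, width, height, format, byteCount`, with `#` comment lines); `AtlasScore` is a hypothetical helper name, not part of the codebase.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustration only: AtlasScore is a hypothetical helper.
public static class AtlasScore
{
    // Percentage of per-entry rows that fall into the three most common
    // (width, height, format) triples. Blank lines and lines starting
    // with '#' are skipped.
    public static double TopThreeSharePercent(IEnumerable<string> lines)
    {
        var counts = new Dictionary<(string W, string H, string F), int>();
        int total = 0;
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0 || line[0] == '#') continue;

            string[] parts = line.Split(',').Select(p => p.Trim()).ToArray();
            if (parts.Length != 5) continue; // not a per-entry row

            var key = (parts[1], parts[2], parts[3]);
            counts[key] = counts.GetValueOrDefault(key) + 1;
            total++;
        }
        if (total == 0) return 0.0;

        int topThree = counts.Values.OrderByDescending(c => c).Take(3).Sum();
        return 100.0 * topThree / total;
    }
}
```

A call such as `AtlasScore.TopThreeSharePercent(System.IO.File.ReadLines("baseline-surfaces.txt"))` would produce the number for the `<pct>` slot. Since the dump dedupes by GL texture name, the percentage is over unique uploaded textures.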
- [ ] **Step 3.8: Clean up the env vars and the local app data dump**
```powershell
Remove-Item Env:\ACDREAM_DUMP_SURFACES -ErrorAction SilentlyContinue
Remove-Item Env:\ACDREAM_STREAM_RADIUS -ErrorAction SilentlyContinue
# Optional: clean up the source file so a future re-measurement isn't confused by stale data
Remove-Item "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" -ErrorAction SilentlyContinue
```
All log files (`baseline-r*-*.log`, `baseline-surfaces.log`, `baseline-surfaces.txt`) remain in the worktree root for Task 4. They will NOT be committed; they are scratch files.

---
## Task 4: Write baseline doc + amend roadmap + ship commit 2
**Files:**
- Create: `docs/plans/2026-05-11-phase-n6-perf-baseline.md`
- Modify: `docs/plans/2026-04-11-roadmap.md` lines 690-705
- [ ] **Step 4.1: Write the baseline document**
Use Write to create `docs/plans/2026-05-11-phase-n6-perf-baseline.md` with this content (substitute real numbers from Task 3 captures into every `<n>` and `<pct>` placeholder; do NOT leave any unfilled):
```markdown
# Phase N.6 slice 1 — perf baseline at Holtburg
**Created:** 2026-05-11.
**Spec:** [docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md](../superpowers/specs/2026-05-11-phase-n6-slice1-design.md)
**Measured against commit:** <commit SHA from Task 1.10>
**Purpose:** Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data.
---
## §1. Setup
- **Hardware:** Radeon RX 9070 XT
- **Resolution:** 1440p (2560×1440)
- **Quality preset:** High (default)
- **Connection:** live ACE at `127.0.0.1:9000`
- **Character:** `+Acdream` at Holtburg
- **Sky / time:** clear midday (F7 → Noon, F10 → Clear)
- **Build:** Debug
- **Date measured:** 2026-05-11
- **Environment overrides:** `ACDREAM_WB_DIAG=1`, `ACDREAM_STREAM_RADIUS=<per-run>`
## §2. Dispatch CPU / GPU numbers
Each cell records the median of the last 3 `[WB-DIAG]` lines from a ~30s stable window. `entSeen / entDrawn / groups` are also from those lines. FPS read from the window title.
| Radius | Motion | cpu_us median | cpu_us p95 | gpu_us median | gpu_us p95 | FPS | entSeen | entDrawn | groups |
|---|---|---|---|---|---|---|---|---|---|
| 4 | standstill | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
| 4 | walking | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
| 8 | standstill | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
| 8 | walking | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
| 12 | standstill | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
| 12 | walking | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
## §3. Surface-format histogram
From `ACDREAM_DUMP_SURFACES=1` at radius=12. The dump fires once at render frame 600 (about 10 s at 60 fps), so it covers the textures uploaded by that point.
- **Total unique GL textures:** <n>
- **Total bytes (sum of W*H*4):** <n>
- **Top 10 (W, H) dimension buckets:**
- `<W>x<H>`: <count>
- ... (paste from baseline-surfaces.txt rollup)
- **Format distribution:**
- `<format>`: <count>
- **Top 10 (W, H, format) triples — atlas-opportunity input:**
- `<W>x<H> <format>`: <count>
- ...
**Atlas-opportunity score:** <pct>% of surfaces fall into the top-3 (W, H, format) triples. (A score >30% means atlas consolidation could meaningfully reduce sampler switches + memory overhead; <15% means scattered content and atlas is not worth the slice-2 effort.)
## §4. Conclusion + next-phase recommendation
<Opinionated paragraph addressing:
1. Is the entity dispatcher CPU-bound or GPU-bound at radius=12?
- Compare cpu_us p95 vs gpu_us p95. The larger one is the bottleneck.
2. Does gpu_us p95 leave headroom at 60 fps target (16.7 ms / 16667 µs)?
- If gpu_us p95 < 8000 µs: comfortable headroom.
- If gpu_us p95 < 14000 µs: tight but OK.
- If gpu_us p95 >= 14000 µs: GPU-saturated, persistent-mapped buffers and compute cull help.
3. Does the atlas score justify slice-2 atlas work?
4. Given (1)-(3), which is the right next phase?
- CPU-bound + low atlas score: pivot to C.1.5 (visible content, perf already comfortable).
- GPU-bound + high atlas score: do N.6 slice 2 (atlas + persistent buffers).
- Either-bound + headroom + low atlas score: do C.1.5 first.
- GPU saturated + need for more headroom: escalate to Tier 2.>
## §5. Raw logs
Scratch logs from this measurement run (not committed):
- `baseline-r4-stand.log`, `baseline-r4-walk.log`
- `baseline-r8-stand.log`, `baseline-r8-walk.log`
- `baseline-r12-stand.log`, `baseline-r12-walk.log`
- `baseline-surfaces.log`, `baseline-surfaces.txt`
```
Fill in every placeholder (`<n>`, `<pct>`, `<W>x<H>`, `<format>`, `<count>`, the commit SHA, and the §4 conclusion paragraph) with real values from Task 3. **Do NOT leave any placeholder unfilled.** If a measurement is missing, re-run that step from Task 3 before continuing.
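When writing the §4 conclusion, the headroom bands from the template can be applied mechanically to the measured `gpu_us` p95. The sketch below simply encodes those three bands; `GpuHeadroom` and `HeadroomBands` are hypothetical names used for illustration, and the thresholds are the ones listed in the template (in microseconds).

```csharp
// Illustration only: these types are hypothetical, not project code.
public enum GpuHeadroom { Comfortable, Tight, Saturated }

public static class HeadroomBands
{
    // Bands from the baseline template's conclusion checklist:
    //   p95 <  8000 µs  -> comfortable headroom
    //   p95 < 14000 µs  -> tight but OK
    //   otherwise       -> GPU-saturated
    public static GpuHeadroom Classify(long gpuUsP95)
    {
        if (gpuUsP95 < 8000) return GpuHeadroom.Comfortable;
        if (gpuUsP95 < 14000) return GpuHeadroom.Tight;
        return GpuHeadroom.Saturated;
    }
}
```

The band is only one input to the recommendation; the CPU-versus-GPU comparison and the atlas score still need to be weighed as the template describes.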
- [ ] **Step 4.2: Read the current roadmap N.6 entry**
```
Read offset 685, limit 25 from docs/plans/2026-04-11-roadmap.md
```
Confirm the bullet starts with `- **N.6 — Perf polish.** **Planned (post-A.5 polish takes priority).**` and ends with `Plan + spec written when work begins. **Estimate: 1-2 weeks.**`. Capture the exact text verbatim for Step 4.3's `old_string`.
- [ ] **Step 4.3: Amend the roadmap entry**
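Use Edit. The change splits N.6 into slice 1 (shipping with this commit) and slice 2 (deferred until after C.1.5). The slice 2 wording below assumes the baseline doc's §4 conclusion recommends C.1.5 first; if it recommends a different next phase, adjust the deferral clause in the slice 2 bullet to match before committing.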
**old_string:** the exact N.6 bullet copied from the Read in Step 4.2.
**new_string:**
```markdown
- **N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** **SHIPPED 2026-05-11.**
Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3
query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel
desktop GL). Added env-gated surface-format histogram dump in `TextureCache`
for atlas-opportunity audit. Captured authoritative baseline at Holtburg
radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us`
diagnostic. Plan + spec at `docs/superpowers/{specs,plans}/2026-05-11-phase-n6-slice1-*.md`.
Baseline numbers + next-phase recommendation at
[docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md).
- **N.6 slice 2 — Perf polish cleanup.** **Planned — deferred until after C.1.5
(PES emitter wiring) per the baseline doc's recommendation.** Builds on
slice 1's measurement. Scope: retire the legacy `Texture2D`/`sampler2D` path
in `TextureCache` (currently kept for Sky + Debug + particle paths now that
Terrain has migrated); delete orphan `mesh.frag` (verify zero callers post-N.5
amendment); decide bindless-everywhere vs legacy-island for the remaining
`sampler2D` consumers; conditionally adopt WB atlas if the slice-1 histogram
shows a real opportunity; conditionally adopt persistent-mapped buffers if
the slice-1 baseline shows `BufferSubData` as a hot spot; GPU compute culling
remains out-of-scope (that's Tier 3 of the perf-tiers roadmap, gated on
Tier 2 first). Plan + spec written when work begins. **Estimate: 1-2 weeks
once C.1.5 lands.**
```
- [ ] **Step 4.4: Build (sanity check — only docs touched, but be safe)**
```powershell
dotnet build
```
Expected: build succeeds. (No code touched in Task 4; this just confirms nothing was accidentally edited in src/.)
- [ ] **Step 4.5: Commit 2**
```powershell
git add src/AcDream.App/Rendering/TextureCache.cs `
src/AcDream.App/Rendering/GameWindow.cs `
docs/plans/2026-05-11-phase-n6-perf-baseline.md `
docs/plans/2026-04-11-roadmap.md
git commit -m @'
docs(perf): Phase N.6 slice 1 — radius=12 baseline + surface dump path

Capture authoritative CPU+GPU dispatch numbers at Holtburg with the
gpu_us diagnostic now working (commit <prev SHA from Task 1.10>). Three
radii (4/8/12) × two motion modes (standstill/walking) + a surface-format
histogram from ACDREAM_DUMP_SURFACES=1.

Adds env-gated one-shot dump path (TextureCache.TickSurfaceHistogramDumpIfEnabled,
called from GameWindow.OnRender) that fires once at frame 600 of the
session — zero cost when off, writes to %LOCALAPPDATA%\acdream\n6-surfaces.txt.

Baseline document at docs/plans/2026-05-11-phase-n6-perf-baseline.md
closes with a recommendation paragraph for the next phase. Roadmap entry
amended to reflect the slice 1 / slice 2 split.

Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (§5, §6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
'@
git status
```
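Expected: nothing left modified or staged. Untracked scratch logs (e.g. `baseline-r*.log`) may still be listed; Step 4.6 covers them.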
- [ ] **Step 4.6: Final sanity sweep**
```powershell
git log -3 --oneline
```
Expected: the two new commits from this slice at the top of the log, newest first (this docs/perf commit, then the GPU timing fix from Task 1.10), both landing after the spec commit `05d590c`.
Also confirm the scratch `baseline-r*.log` and `baseline-surfaces.*` files are NOT in the commit (they were never staged):
```powershell
git status
```
Expected: nothing modified or staged. If the scratch logs show as untracked, that's fine; they can be deleted manually:
```powershell
Remove-Item baseline-r*.log, baseline-surfaces.log, baseline-surfaces.txt, task1-verify.log, task2-verify.log -ErrorAction SilentlyContinue
```
---
## Acceptance check (spec §9)
After Task 4 commits, walk through the spec's acceptance criteria and confirm each one. This is a paper-walk, not a re-run — the steps above produce the conditions.
- [ ] **A1: `[WB-DIAG]` reports non-zero `gpu_us` at radius=12.**
Verified in Task 1.9 (initial check) and Task 3.5-3.6 (full baseline run). Confirm by re-grepping `baseline-r12-stand.log`:
```powershell
Select-String -Path baseline-r12-stand.log -Pattern "gpu_us=[1-9]"
```
Should return at least one line.
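The same check can be expressed outside PowerShell, for example to count the non-zero samples rather than only confirm that one exists. A minimal Python sketch, assuming each `[WB-DIAG]` line carries an integer `gpu_us=<digits>` field (inferred from the grep pattern above; the sample lines and the function name are hypothetical):

```python
import re

# Assumes an integer field of the form gpu_us=<digits>.
GPU_US = re.compile(r"gpu_us=(\d+)")


def nonzero_gpu_samples(lines):
    """Return every gpu_us value greater than zero found in the given log lines."""
    values = [int(m.group(1)) for line in lines for m in GPU_US.finditer(line)]
    return [v for v in values if v > 0]


# Hypothetical log lines; the real [WB-DIAG] format may carry other fields.
sample = [
    "[WB-DIAG] cpu_us=410 gpu_us=0",
    "[WB-DIAG] cpu_us=395 gpu_us=2310",
    "unrelated line",
]
samples = nonzero_gpu_samples(sample)
print(len(samples) > 0)  # A1 passes when at least one non-zero sample exists
```

To run it against the real log, pass an open file handle for `baseline-r12-stand.log` instead of `sample`. If the log was produced by Windows PowerShell redirection it may be UTF-16 encoded, so choose the `encoding` argument to `open` accordingly.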
- [ ] **A2: Vendor-neutral.** No `GL_*_NV` or `GL_*_AMD` or `GL_*_INTEL` extension references in the change. Re-grep:
```powershell
Select-String -Path src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs -Pattern "NV_|AMD_|INTEL_|GL_NV|GL_AMD|GL_INTEL|_NV\b|_AMD\b|_INTEL\b"
```
Expected: no matches in the new code (matches elsewhere in the file from unrelated existing code don't count).
- [ ] **A3: Baseline doc has real numbers + conclusion.**
Open `docs/plans/2026-05-11-phase-n6-perf-baseline.md` and visually confirm no `<n>`, `<pct>`, `TBD`, or empty conclusion section.
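The visual check can be backed by a quick scan for leftover markers. A minimal Python sketch, assuming the only tokens to look for are the `<n>`, `<pct>`, and `TBD` markers named above (the function name and the sample text are hypothetical, and an empty conclusion section still needs a human look):

```python
import re

# Placeholder tokens named in criterion A3.
PLACEHOLDER = re.compile(r"<n>|<pct>|\bTBD\b")


def find_placeholders(text):
    """Return (line_number, token) for every leftover placeholder in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Hypothetical document fragment, for illustration only.
doc = "cpu_us p95: 412\ngpu_us p95: <n>\natlas score: <pct>% (TBD)"
for lineno, token in find_placeholders(doc):
    print(f"line {lineno}: {token}")
```

An empty result from `find_placeholders` on the real baseline doc is a necessary condition for A3, not a sufficient one.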
- [ ] **A4: Roadmap split shipped.**
```powershell
Select-String -Path docs/plans/2026-04-11-roadmap.md -Pattern "N\.6 slice"
```
Expected: two matches (slice 1 + slice 2 bullets).
- [ ] **A5: `dotnet build` green, no new warnings.**
```powershell
dotnet build
```
Expected: succeeds. Note any new warnings vs the build output before the slice started.
- [ ] **A6: `dotnet test` green at baseline (~1688 passing, ~8 pre-existing failures).**
```powershell
dotnet test --no-build
```
Expected: pass count unchanged from before the slice started; failure list unchanged.
- [ ] **A7: No visible regression.**
Confirmed during Task 1.9 and Task 3 measurements — the user was in-world repeatedly and didn't observe any rendering issue. If anything looked off during measurement, file it as an issue and decide whether it blocks slice 1 acceptance.
If any acceptance criterion fails, return to the relevant task and re-do it. Do not declare slice 1 complete with failing acceptance.
---
## After slice 1 lands
The baseline document's conclusion paragraph (§4) determines the next phase:
- **If conclusion recommends C.1.5:** brainstorm C.1.5 spec next, using [docs/plans/2026-04-27-phase-c1-pes-particles.md:285-295](../../plans/2026-04-27-phase-c1-pes-particles.md) as the starting scope.
- **If conclusion recommends N.6 slice 2:** brainstorm slice 2 spec next, addressing legacy `TextureCache` cleanup + atlas + persistent-mapped buffers based on the histogram data.
- **If conclusion recommends Tier 2:** consult [docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md](../../plans/2026-05-10-perf-tiers-2-3-roadmap.md) and brainstorm a Tier 2 spec.
The choice is data-driven; the recommendation paragraph is the contract. Don't re-litigate the decision once the numbers are in.
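The §4 decision rules in the baseline template can be written down as a small function, which is mainly useful for seeing which combinations the rules leave open. A minimal Python sketch, assuming the rules are tried in the order the template lists them, that "headroom" means anything below the 14000 µs saturation threshold, and that a cpu/gpu tie counts as GPU-bound (all three are assumptions, as are the function names; the thresholds are the template's):

```python
def headroom(gpu_p95_us):
    """Band gpu_us p95 against the template's 8000 / 14000 µs thresholds."""
    if gpu_p95_us < 8000:
        return "comfortable"
    if gpu_p95_us < 14000:
        return "tight"
    return "saturated"


def atlas_band(atlas_pct):
    """>30% is high, <15% is low; the template does not define the middle band."""
    if atlas_pct > 30:
        return "high"
    if atlas_pct < 15:
        return "low"
    return "middle"


def recommend(cpu_p95_us, gpu_p95_us, atlas_pct):
    """Apply the four template rules in listed order (ordering is an assumption)."""
    bound = "cpu" if cpu_p95_us > gpu_p95_us else "gpu"  # tie -> gpu (assumption)
    room = headroom(gpu_p95_us)
    atlas = atlas_band(atlas_pct)

    if bound == "cpu" and atlas == "low":
        return "C.1.5"
    if bound == "gpu" and atlas == "high":
        return "N.6 slice 2"
    if room != "saturated" and atlas == "low":
        return "C.1.5"
    if room == "saturated":
        return "Tier 2"
    return "undetermined"  # combination not covered by the template's rules


# Hypothetical numbers, for illustration only.
print(recommend(cpu_p95_us=9000, gpu_p95_us=3000, atlas_pct=10))  # C.1.5
print(recommend(cpu_p95_us=3000, gpu_p95_us=9000, atlas_pct=40))  # N.6 slice 2
print(recommend(cpu_p95_us=9000, gpu_p95_us=3000, atlas_pct=40))  # undetermined
```

The third call shows a case the rules do not cover (CPU-bound with a high atlas score). The function is an aid for drafting the recommendation paragraph, which remains the contract.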