Merge branch 'claude/objective-brown-86e645' — Phase N.6 slice 1 (gpu_us fix + perf baseline)

Slice 1 of Phase N.6: unblocked the gpu_us diagnostic in WbDrawDispatcher (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL) and captured the radius=12 perf baseline at Holtburg with the now-working diagnostic.

Headline data: CPU dominates GPU by 30-50× at every measured radius; GPU dispatch p95 maxes at 603µs (3.6% of 16.6ms frame budget); CPU grows more than linearly with N₁ (3.2 → 6.7 → 12.9 ms as N₁ goes 4 → 8 → 12). The per-LB walk in the dispatcher is the next bottleneck if perf ever becomes tight.

Recommendation in the baseline doc: C.1.5 (PES emitter wiring: portals, chimneys, fireplaces) next; reduced-scope N.6 slice 2 after that (atlas adoption + persistent-mapped buffers dropped from scope per the GPU-underutilized finding); Tier 2 only if perf escalation becomes pressing.

New issue #55 filed: static-entity slow path reports ~1.45M meshMissing per 5s at r4 standstill (LOW severity, no visible regression).

Plan: docs/superpowers/plans/2026-05-11-phase-n6-slice1.md
Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md
Baseline: docs/plans/2026-05-11-phase-n6-perf-baseline.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 12:56:26 +02:00
commit 9b447d4ca8
parent 175ad14f8b 41981c4d74
8 changed files with 1677 additions and 39 deletions
--- a/docs/ISSUES.md
+++ b/docs/ISSUES.md
@@ -46,6 +46,40 @@ Copy this block when adding a new issue:

 # Active issues

+## #55 — Static-entity slow path reports ~1.45M `meshMissing` per 5s at r4 standstill
+
+**Status:** OPEN
+**Severity:** LOW (no visible regression — affects a diagnostic counter, not rendered output)
+**Filed:** 2026-05-11
+**Component:** rendering / `WbDrawDispatcher` static-entity classification path
+
+**Description:** During the Phase N.6 slice 1 baseline measurement (`docs/plans/2026-05-11-phase-n6-perf-baseline.md` §2),
+the radius=4 standstill scenario reported `meshMissing ≈ 1,450,000` per 5-second
+`[WB-DIAG]` window. The same scenario while walking drops to near-zero (`meshMissing = 0`
+in the steady state) as new landblocks stream in and previously-missing meshes resolve.
+This suggests the static-entity slow path's mesh-load lifecycle has some delay before
+populating for newly-streamed content but eventually catches up; the standstill case
+keeps re-counting the same set of entities-with-unresolved-meshes for the duration of
+the run. The counter is per-frame so the absolute number scales with FPS — at the
+measured ~150 FPS that's ~290K reports/s, or ~1900 entities each reported each frame.
+
+**Root cause / status:** Not investigated. Hypothesis: an entity classification path
+counts mesh-missing on every frame for static entities whose `MeshRef` resolution races
+the streaming loader. The Tier 1 cache (#53) populates only for entities whose
+classification succeeded, so persistently-failing entities run the slow path every frame
+forever and bump `meshMissing` every time. If true, the fix is either (a) cache the
+"this entity's mesh genuinely doesn't exist" result so we stop re-checking, or (b)
+deferred-classify the entity once its `MeshRef` resolves.
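+
+A minimal sketch of option (a), under loud assumptions: `MissingMeshCache`, its members,
+and the `tryResolve` delegate are hypothetical names invented for illustration and do not
+exist in the codebase; the real `EntityClassificationCache` API may look quite different.
+The sketch only shows the shape of the idea, which is to remember a failed lookup so it is
+counted once rather than on every frame:
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Hypothetical negative cache: remembers mesh ids whose resolution failed so
+// the slow path does not re-check (and re-count) them on every frame.
+public sealed class MissingMeshCache
+{
+    private readonly HashSet<uint> _knownMissing = new();
+
+    public long MeshMissingCount { get; private set; }
+
+    // Returns true if the mesh is usable this frame. `tryResolve` stands in
+    // for whatever lookup the real slow path performs (an assumption).
+    public bool TryClassify(uint meshId, Func<uint, bool> tryResolve)
+    {
+        if (_knownMissing.Contains(meshId))
+            return false;             // already counted once; skip the slow path
+
+        if (tryResolve(meshId))
+            return true;
+
+        _knownMissing.Add(meshId);
+        MeshMissingCount++;           // counted once per mesh, not once per frame
+        return false;
+    }
+
+    // Call when the streaming loader reports that a mesh became available,
+    // so a late-arriving mesh is not treated as permanently missing.
+    public void MarkResolved(uint meshId) => _knownMissing.Remove(meshId);
+}
+```
+
+`MarkResolved` is what keeps this compatible with option (b): an entry that later resolves
+is evicted and classified normally on its next visit.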
+
+**Files:** `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs` (the slow path that
+increments `_meshesMissing`), `src/AcDream.App/Rendering/Wb/EntityClassificationCache.cs`
+(the Tier 1 cache — likely needs to learn about "permanently missing" entries).
+
+**Acceptance:** `meshMissing` should drop to near-zero within ~5 seconds of streaming
+settle at any radius/motion combination, not stay at ~1.45M/5s indefinitely at standstill.
+
+---
+
 ## #50 — Road-edge tree at 0xA9B1 visible in acdream but not retail

 **Status:** OPEN
--- a/docs/plans/2026-04-11-roadmap.md
+++ b/docs/plans/2026-04-11-roadmap.md
@@ -63,6 +63,7 @@
 | N.4 | Rendering pipeline foundation — adopted WB's `ObjectMeshManager` as the production mesh pipeline behind `ACDREAM_USE_WB_FOUNDATION` (default-on). `WbMeshAdapter` is the single seam (owns `ObjectMeshManager`, drains the staged-upload queue per frame, populates `AcSurfaceMetadataTable` with per-batch translucency / luminosity / fog metadata). `WbDrawDispatcher` is the production draw path: groups all visible (entity, batch) pairs, single-uploads the matrix buffer, fires one `glDrawElementsInstancedBaseVertexBaseInstance` per group with `BaseInstance` slicing into the shared instance VBO. `LandblockSpawnAdapter` + `EntitySpawnAdapter` bridge spawn lifecycle to WB ref-counts (atlas tier vs per-instance). Perf wins shipped as part of N.4: per-entity frustum cull, opaque front-to-back sort, palette-hash memoization (compute once per entity, reuse across batches). Visual verification at Holtburg passed: scenery + connected characters with full close-detail geometry (Issue #47 regression resolved). Legacy `InstancedMeshRenderer` retained as `ACDREAM_USE_WB_FOUNDATION=0` escape hatch until N.6 (retired early in N.5 ship amendment). | Live ✓ |
 | N.5 | Modern rendering path — lifted `WbDrawDispatcher` onto bindless textures (`GL_ARB_bindless_texture`) + `glMultiDrawElementsIndirect`. Per-frame entity rendering: 3 SSBO uploads (instance matrices @ binding=0, batch data @ binding=1, indirect commands) + 2 indirect draw calls (opaque + transparent). ~12-15 GL calls per frame regardless of group count, down from hundreds-of-per-group in N.4. CPU dispatcher: 1.23 ms/frame median at Holtburg courtyard (1662 groups, ~810 fps sustained). All textures on the WB modern path use 1-layer `Texture2DArray` + `sampler2DArray`. Legacy callers keep `Texture2D` / `sampler2D` via the parallel `TextureCache` path until N.6 retires them. Three gotchas captured in memory: texture target lock-in, bindless Dispose order (two-phase non-resident before delete), GL_TIME_ELAPSED double-buffering. **Ship amendment 2026-05-08:** legacy renderers (`InstancedMeshRenderer`, `StaticMeshRenderer`, `WbFoundationFlag`) retired within N.5 — modern path is mandatory; missing bindless throws `NotSupportedException` at startup. N.6 scope narrowed accordingly. Plan archived at `docs/superpowers/plans/2026-05-08-phase-n5-modern-rendering.md`. | Live ✓ |
 | N.5b | Terrain on the modern rendering path — `TerrainModernRenderer` replaces `TerrainChunkRenderer` (the latter plus `TerrainRenderer` + `terrain.vert/.frag` deleted). Single global VBO/EBO with slot allocator (one slot per landblock), per-frame `DrawElementsIndirectCommand[]` upload + `glMultiDrawElementsIndirect`, bindless atlas handles passed as `uvec2` uniforms reconstructed via `sampler2DArray(handle)`. **Path C** chosen: mirrors WB's `TerrainRenderManager` pattern but consumes `LandblockMesh.Build` so retail's `FSplitNESW` formula is preserved (closes ISSUE #51). Path A killed by 49.98% measured divergence between WB's `CalculateSplitDirection` and retail's at addr `00531d10`; Path B (fork-patch WB) rejected for permanent maintenance burden. Perf at Holtburg radius=5 (commit `da56063`): modern 6.4-7.0 µs / 9-14 µs p95 vs legacy 1.5 µs / 3.0 µs — **modern is ~4× SLOWER on CPU at radius=5** because legacy's 16×16-LB chunking collapsed visible LBs to one `glDrawElements`. Architectural wins (zero `glBindTexture`/frame, constant-cost dispatch, per-LB frustum cull) manifest at higher radius (A.5 territory). Spec acceptance criterion 5 ("≥10% lower CPU at radius=5") amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Three gotchas captured in memory: `uniform sampler2DArray` + `glProgramUniformHandleARB` GL_INVALID_OPERATIONs on at least one driver (use `uniform uvec2` + `sampler2DArray(handle)` constructor instead — N.5's mesh_modern pattern); `MaybeFlushTerrainDiag` median-calc underflow on first sample; visual gates need actual visual confirmation, not assent. Plan archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`. | Live ✓ |
+| N.6 slice 1 | GPU timing fix + radius=12 perf baseline. Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3 query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel desktop GL). Added env-gated `ACDREAM_DUMP_SURFACES=1` one-shot surface-format histogram dump in `TextureCache` for the atlas-opportunity audit. Captured authoritative baseline at Holtburg radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us` diagnostic; baseline doc concludes CPU dominates GPU by 30–50× at every radius and recommends C.1.5 next then reduced-scope slice 2 (atlas + persistent-mapped buffers dropped). Baseline numbers at [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md). Plan archived at `docs/superpowers/plans/2026-05-11-phase-n6-slice1.md`. | Live ✓ |

 Plus polish that doesn't get its own phase number:
 - FlyCamera default speed lowered + Shift-to-boost
@@ -687,22 +688,26 @@ for our deletions/additions; merge upstream `master` periodically.
  manifest at higher radius. Spec acceptance criterion #5 was wrong;
  amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Plan
  archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`.
-- **N.6 — Perf polish.** **Planned (post-A.5 polish takes priority).**
-  Builds on N.5 + N.5b. Legacy renderer retirement was pulled forward
-  into N.5 ship amendment — `InstancedMeshRenderer`, `StaticMeshRenderer`,
-  `WbFoundationFlag` are gone — and the terrain legacy renderer
-  (`TerrainChunkRenderer` + `TerrainRenderer` + `terrain.vert/.frag`)
-  retired in N.5b. N.6 scope: WB atlas adoption for memory savings
-  on shared content, persistent-mapped buffers if `glBufferData` shows
-  up in profiling (the modern terrain path's per-frame DEIC `BufferSubData`
-  is a candidate), GPU-side culling via compute pre-pass (eliminates
-  the per-frame slot walk + DEIC build entirely), GL_TIME_ELAPSED query
-  double-buffering (deferred from N.5 — diagnostic shows `gpu_us=0/0`
-  under `ACDREAM_WB_DIAG=1`), direct higher-radius perf comparison (A.5
-  has now landed — modern's architectural wins are measurable), retire the
-  legacy `Texture2D`/`sampler2D` path in `TextureCache` (currently kept
-  for Sky + Debug + particle paths now that Terrain has migrated).
-  Plan + spec written when work begins. **Estimate: 1-2 weeks.**
+- **✓ SHIPPED — N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** Shipped 2026-05-11.
+  Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3
+  query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel
+  desktop GL). Added env-gated surface-format histogram dump in `TextureCache`
+  for atlas-opportunity audit. Captured authoritative baseline at Holtburg
+  radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us`
+  diagnostic. Plan + spec at `docs/superpowers/{specs,plans}/2026-05-11-phase-n6-slice1*.md`.
+  Baseline numbers + next-phase recommendation at
+  [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md).
+- **N.6 slice 2 — Perf polish cleanup.** **Planned — deferred until after C.1.5
+  (PES emitter wiring) per the baseline doc's recommendation.** Builds on
+  slice 1's measurement. Scope: retire the legacy `Texture2D`/`sampler2D` path
+  in `TextureCache` (currently kept for Sky + Debug + particle paths now that
+  Terrain has migrated); delete orphan `mesh.frag` (verify zero callers post-N.5
+  amendment); decide bindless-everywhere vs legacy-island for the remaining
+  `sampler2D` consumers. **Dropped from slice 2 scope per baseline data**:
+  WB atlas adoption and persistent-mapped buffers — both target GPU/sampler
+  throughput but the baseline shows GPU is wildly under-utilized (max gpu_us
+  p95 ~600 µs vs 16,600 µs frame budget). Slice 2 reduces to a ~1-day cleanup.
+  Plan + spec written when work begins. **Estimate: ~1 day once C.1.5 lands.**
 - **N.7 — EnvCells / dungeons.** Replace EnvCell rendering with WB's
  `EnvCellRenderManager` + `PortalRenderManager` on top of N.4's
  foundation. **Estimate: 1-2 weeks** (was 2-3 — naturally smaller now
--- a/docs/plans/2026-05-11-phase-n6-perf-baseline.md
+++ b/docs/plans/2026-05-11-phase-n6-perf-baseline.md
@@ -0,0 +1,193 @@
+# Phase N.6 slice 1 — perf baseline at Holtburg
+
+**Created:** 2026-05-11.
+**Spec:** [docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md](../superpowers/specs/2026-05-11-phase-n6-slice1-design.md)
+**Measured against commit:** `25cb147` (Task 1 final — gpu_us fix + diag-gate symmetry follow-up)
+**Purpose:** Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data.
+
+---
+
+## §1. Setup
+
+- **Hardware:** Radeon RX 9070 XT
+- **Resolution:** 1440p (2560×1440)
+- **Quality preset:** High (default)
+- **Connection:** live ACE at `127.0.0.1:9000`
+- **Character:** `+Acdream` at Holtburg
+- **Sky / time:** clear midday (F7 → Noon, F10 → Clear)
+- **Build:** Debug
+- **Date measured:** 2026-05-11
+- **Environment overrides:** `ACDREAM_WB_DIAG=1`, `ACDREAM_STREAM_RADIUS=<per-run>`
+
+Note: `ACDREAM_STREAM_RADIUS=N` forces N₁=N (every landblock within radius N, i.e. (2N+1)² landblocks, is streamed at full near-tier detail).
+This is NOT the production A.5 default (N₁=4 / N₂=12), which was characterized in
+CLAUDE.md as comfortable 200–400 FPS at the default preset. These measurements
+characterize the scaling curve — what happens as near-tier radius grows — not current
+production behavior. FPS was not captured directly (no window-title screenshot per run);
+it can be derived from `(1e6 / total_frame_time_us)` but the dispatcher's `cpu_us` is
+only part of the frame (terrain, sky, particles, UI, GL submission overhead, and
+swap-buffer wait are not included).
+
+## §2. Dispatch CPU / GPU numbers
+
+Each cell records the median of the last 3 `[WB-DIAG]` lines from a ~30s stable window.
+`entSeen / entDrawn / groups / drawsIssued` are also from those lines (values per 5s bucket).
+FPS column omitted — not captured per the note above.
+
+| Radius | Motion     | cpu_us median | cpu_us p95 | gpu_us median | gpu_us p95 | entSeen (per 5s) | entDrawn (per 5s) | groups | drawsIssued (per 5s) |
+|--------|------------|---------------|------------|---------------|------------|------------------|-------------------|--------|----------------------|
+| 4      | standstill | 3,208         | 3,313      | 93            | 95         | 16.9M            | 15.5M             | 1,216  | 1.65M                |
+| 4      | walking    | 2,967         | 3,112      | 95            | 120        | 13.9M            | 13.9M             | 1,850  | 1.45M                |
+| 8      | standstill | 6,732         | 7,199      | 126           | 130        | 19.8M            | 19.8M             | 333    | 218K                 |
+| 8      | walking    | 6,572         | 6,927      | 96            | 113        | 18.1M            | 18.0M             | 534    | 245K                 |
+| 12     | standstill | 12,853        | 13,525     | 344           | 507        | 19.6M            | 19.6M             | 541    | 184K                 |
+| 12     | walking    | 16,320        | 17,241     | 553           | 603        | 17.8M            | 17.8M             | 898    | 200K                 |
+
+**Notable:** `meshMissing` counts at r4 standstill (~1.45M per 5s) drop to near-zero while
+walking. This suggests the static-entity slow path's mesh-load lifecycle has some delay
+before populating for newly-streamed content. Not fatal, since it doesn't affect rendered
+output; tracked as issue #55 in `docs/ISSUES.md`.
+
+## §3. Surface-format histogram
+
+From `ACDREAM_DUMP_SURFACES=1` at radius=12, ~30s after enter-world.
+Output written to `%LOCALAPPDATA%\acdream\n6-surfaces.txt`.
+
+- **Total unique GL textures:** 760
+- **Total bytes (sum of W×H×4):** 96,387,584 (~96.4 MB)
+
+**Top 10 (W, H) dimension buckets:**
+
+| Dimensions | Count | Share |
+|------------|-------|-------|
+| 128×128    | 236   | 31%   |
+| 64×64      | 111   | 15%   |
+| 256×256    | 102   | 13%   |
+| 128×256    | 71    | 9%    |
+| 64×128     | 69    | 9%    |
+| 256×128    | 48    | 6%    |
+| 128×64     | 39    | 5%    |
+| 512×512    | 30    | 4%    |
+| 8×8        | 18    | 2%    |
+| 32×32      | 14    | 2%    |
+
+**Format distribution:**
+
+| Format        | Count | Share |
+|---------------|-------|-------|
+| RGBA8_DECODED | 760   | 100%  |
+
+All uploads land as RGBA8 regardless of source format (INDEX16, P8, DXT, BGRA, etc.
+all decode through `TextureHelpers` before upload). The source-format diversity is real
+but invisible to GL after the decode step.
+
+**Top 10 (W, H, format) triples — atlas-opportunity input:**
+
+Same as the dimension buckets above since there is only one format. The top-3 triples
+(128×128, 64×64, 256×256) cover 449 of 760 surfaces = **59%**.
+
+**Atlas-opportunity score: 59%** of surfaces fall into the top-3 (W, H, format) triples.
+A conventional rule-of-thumb is that >30% concentration into the top buckets makes atlas
+packing worth the implementation cost for memory savings; this measurement is well above
+that. However, see §4 for why atlas is not the right next step despite the high score.
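+
+The score above is plain arithmetic over the dimension table. A small sketch, assuming a
+hypothetical `AtlasScore` helper (not part of the codebase) fed with the top-10 bucket
+counts and the 760-texture total reported in this section:
+
+```csharp
+using System;
+using System.Globalization;
+using System.Linq;
+
+// Hypothetical helper, for illustration only.
+public static class AtlasScore
+{
+    // Share of all surfaces that fall into the `topN` most populated buckets.
+    public static double TopShare(int[] bucketCounts, int totalSurfaces, int topN) =>
+        bucketCounts.OrderByDescending(c => c).Take(topN).Sum() / (double)totalSurfaces;
+
+    public static void Main()
+    {
+        // Top-10 bucket counts from the dimension table; the long tail that
+        // makes up the rest of the 760 textures is not listed there.
+        int[] top10 = { 236, 111, 102, 71, 69, 48, 39, 30, 18, 14 };
+        double share = TopShare(top10, totalSurfaces: 760, topN: 3);
+        Console.WriteLine(share.ToString("F2", CultureInfo.InvariantCulture));
+    }
+}
+```
+
+With these inputs the top-3 sum is 236 + 111 + 102 = 449, so the share is 449 / 760, which
+rounds to the 59% quoted above.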
+
+## §4. Conclusion + next-phase recommendation
+
+### What the data shows
+
+**The entity dispatcher is strongly CPU-bound.** In most scenarios CPU dominates GPU by
+30–50× (the median cpu_us / gpu_us ratio in §2 ranges from about 30× at radius=12 walking
+to about 68× at radius=8 walking). At radius=12 standstill: 12.9 ms CPU vs 0.34 ms GPU; at
+radius=12 walking: 16.3 ms CPU vs 0.55 ms GPU. There is no GPU bottleneck.
+
+**GPU is wildly under-utilized.** The highest gpu_us p95 observed is 603 µs at radius=12
+walking — against a 16,600 µs frame budget at 60 FPS. The GPU is working at roughly
+3.6% of its 60fps capacity for entity rendering alone. Even accounting for terrain, sky,
+particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU
+comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged.
+
+**CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with
+visible-LB count.** As N₁ grows from 4 → 8 → 12, median cpu_us grows from 3.2 ms →
+6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the r4 baseline. The visible-LB count
+scales as `(2N+1)²`: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0×
+vs 7.7× expected if every LB cost the same). Frustum culling discards most far LBs
+early, but the outer per-LB walk still has to touch each one. The Tier 1 entity-
+classification cache (`EntityClassificationCache`, shipped as #53) wins on the inner
+loop (per-entity classification avoided on cache hits) but the outer walk dominates
+as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at
+`docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating the
+per-frame LB scan entirely.
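+
+The scaling comparison in this paragraph can be re-derived from the §2 standstill medians.
+The sketch below is illustrative only: `LandblockScaling` is a hypothetical name, and the
+CPU figures are copied by hand from the table rather than read from any log.
+
+```csharp
+using System;
+
+// Hypothetical helper, for illustration only.
+public static class LandblockScaling
+{
+    // Landblocks in a square of radius n around the player: (2n + 1)^2.
+    public static int VisibleCount(int n) => (2 * n + 1) * (2 * n + 1);
+
+    public static void Main()
+    {
+        int[] radii = { 4, 8, 12 };
+        double[] cpuMedianUs = { 3208, 6732, 12853 };   // standstill rows of §2
+
+        for (int i = 0; i < radii.Length; i++)
+        {
+            // Both ratios are relative to the radius=4 baseline.
+            double lbRatio  = VisibleCount(radii[i]) / (double)VisibleCount(radii[0]);
+            double cpuRatio = cpuMedianUs[i] / cpuMedianUs[0];
+            Console.WriteLine(
+                $"N1={radii[i]}: LBs={VisibleCount(radii[i])} ({lbRatio:F1}x), cpu {cpuRatio:F1}x");
+        }
+    }
+}
+```
+
+The landblock counts come out as 81, 289, and 625, so the last row compares a roughly 7.7×
+growth in landblocks against a roughly 4.0× growth in median CPU time.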
+
+**Radius=12 is not the production scenario.** `ACDREAM_STREAM_RADIUS=12` forces N₁=12
+(625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81
+full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes as
+comfortable 200–400 FPS at the default preset. The numbers above characterize the scaling
+curve for headroom analysis, not the experience a typical player sees.
+
+**Atlas opportunity is high (59%) but the win is memory-only — and modest.** With 96 MB
+of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the
+top buckets share single `Texture2DArray` objects rather than each surface owning its
+own 1-layer array. The primary wins of atlas — fewer sampler switches, fewer texture
+binds — are already near-zero because bindless textures are made resident once at upload
+and never bound per draw. The remaining win is the per-array metadata overhead × N
+surfaces, which is bounded but not dramatic given all surfaces are already power-of-two
+and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on
+the order of a few MB up to ~10 MB, not a 40–50% reduction. GPU is not bottlenecked on sampler
+switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this
+directly), so atlas adoption would cost 1–2 weeks of implementation risk for a memory
+saving the process doesn't currently need at 96 MB.
+
+### Recommendation
+
+**Primary: do C.1.5 next (PES emitter wiring — portals, chimneys, fireplaces).** Four
+reasons: (a) the production dispatcher is already comfortable at the default N₁=4 preset
+per the CLAUDE.md notes; (b) the two slice-2 items that were "conditional on baseline"
+data (atlas adoption and persistent-mapped buffers) are not justified — GPU is not
+bottlenecked; (c) C.1.5 fills a visible content gap that has been open since C.1 shipped
+and is in the roadmap queue ahead of N.6 slice 2; (d) C.1.5 stabilizes the particle path
+before any future shader migration work in slice 2 touches `particle.frag`. Starting
+point for C.1.5 scoping: `docs/plans/2026-04-27-phase-c1-pes-particles.md` lines 285–295.
+
+**Secondary (after C.1.5 lands): N.6 slice 2 with reduced scope.** The baseline data
+justifies dropping atlas adoption and persistent-mapped buffers from slice 2 entirely.
+What remains is a ~1-day cleanup: retire orphan `mesh.frag` (verify zero callers post-N.5
+amendment), collapse dead `_handlesByOverridden` / `_handlesByPalette` legacy caches once
+their callers are confirmed gone, migrate `particle.frag` to bindless sampling after C.1.5
+stabilizes the path. Slice 2 is a cleanup sprint, not a performance phase.
+
+**Tertiary option (if perf escalation becomes pressing): Tier 2 first.** The scaling
+curve (3.2 → 6.7 → 12.9 ms as N₁ grows 4 → 8 → 12) confirms the per-LB walk is the
+bottleneck — exactly what Tier 2's persistent-group structure at
+`docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses. Not urgent at the current
+default N₁=4; worth revisiting if a future quality preset wants N₁=8 as default or if the
+200–400 FPS range at N₁=4 shrinks after more content is streamed.
+
+**Decision rule for revisiting:** if future measurement at the default preset shows
+cpu_us median > 5,000 µs or gpu_us p95 > 8,000 µs, re-open the escalation question.
+Otherwise, hold the C.1.5 → reduced-slice-2 sequence.
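+
+The decision rule is simple enough to encode as a predicate. This is a sketch, not an
+existing check in the codebase: `EscalationRule` and its members are hypothetical names,
+and only the two thresholds come from the rule above.
+
+```csharp
+// Hypothetical helper, for illustration only.
+public static class EscalationRule
+{
+    public const long CpuMedianLimitUs = 5_000;   // cpu_us median threshold
+    public const long GpuP95LimitUs    = 8_000;   // gpu_us p95 threshold
+
+    // True when either threshold is exceeded; intended for measurements
+    // taken at the default preset, as the rule specifies.
+    public static bool ShouldReopenEscalation(long cpuMedianUs, long gpuP95Us) =>
+        cpuMedianUs > CpuMedianLimitUs || gpuP95Us > GpuP95LimitUs;
+}
+```
+
+For the radius=4 standstill row in §2 (cpu_us median 3,208, gpu_us p95 95) the predicate is
+false. The forced radius=12 rows would exceed the CPU threshold, but they are not
+default-preset measurements, so the rule does not apply to them.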
+
+## §5. Reproducing the measurements
+
+Raw `[WB-DIAG]` output from each run was inspected live during measurement and the
+median of the last three steady-state lines from each scenario was transcribed into §2.
+The raw launch logs were not preserved — the captured medians in §2 are the canonical
+record. To reproduce on the same hardware:
+
+```powershell
+$env:ACDREAM_DAT_DIR   = "$env:USERPROFILE\Documents\Asheron's Call"
+$env:ACDREAM_LIVE      = "1"
+$env:ACDREAM_TEST_HOST = "127.0.0.1"
+$env:ACDREAM_TEST_PORT = "9000"
+$env:ACDREAM_TEST_USER = "testaccount"
+$env:ACDREAM_TEST_PASS = "testpassword"
+$env:ACDREAM_WB_DIAG   = "1"
+$env:ACDREAM_STREAM_RADIUS = "4"  # or 8, 12
+dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline.log"
+```
+
+Stand still for ~30 s at the target radius (60 s at radius 12 to let streaming settle),
+or walk N→E→S→W across one landblock. Then `Select-String -Path baseline.log -Pattern
+"\[WB-DIAG\]" | Select-Object -Last 3` captures the steady-state numbers.
+
+For the surface histogram, also set `$env:ACDREAM_DUMP_SURFACES = "1"`, stay in-world
+~30 s after streaming has loaded ≥100 textures (the cache-size gate), then read
+`$env:LOCALAPPDATA\acdream\n6-surfaces.txt`.
--- a/docs/superpowers/plans/2026-05-11-phase-n6-slice1.md
+++ b/docs/superpowers/plans/2026-05-11-phase-n6-slice1.md
@@ -0,0 +1,912 @@
+# Phase N.6 slice 1 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Fix the broken `gpu_us` diagnostic in `WbDrawDispatcher` (vendor-neutral OpenGL query ring) and produce one authoritative perf baseline document at Holtburg radius=12 so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) is grounded in real numbers.
+
+**Architecture:** Two commits. Commit 1 changes only `WbDrawDispatcher.cs` — replaces the two `uint` GL query handles with ring-of-3 arrays and moves the result read to *before* the next frame overwrites the slot (read frame N-3's queries, then overwrite). Commit 2 adds an env-gated surface-format histogram dump in `TextureCache.cs`, captures the actual measurement, writes the baseline doc, and amends the roadmap entry. No new automated tests — the GPU-timing fix has no observable behavior in tests, and the dump path is env-gated diagnostic only; verification is manual launch-and-look.
+
+**Tech Stack:** C# / .NET 10, Silk.NET (OpenGL 4.3+), `dotnet build` / `dotnet test` from PowerShell, live ACE on `127.0.0.1:9000` for in-world verification.
+
+**Spec:** [docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md](../specs/2026-05-11-phase-n6-slice1-design.md) (committed at `05d590c`).
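+
+The read-before-overwrite ordering described under **Architecture** can be modelled
+without any GL calls. The sketch below is an illustration under stated assumptions:
+`QueryRingModel` is a hypothetical class, and storing a frame number in each slot stands
+in for issuing a timer query. It is not the dispatcher code.
+
+```csharp
+using System;
+
+// GL-free model of the ring: each slot stores the frame number whose query
+// was last issued into it. Reading a slot before overwriting it therefore
+// yields the result of frame N - RingDepth.
+public sealed class QueryRingModel
+{
+    public const int RingDepth = 3;
+    private readonly int[] _issuedFrame = new int[RingDepth];
+    private int _frameIndex;
+
+    // Runs one frame; returns the frame whose result was read, or -1 while
+    // the ring is still filling.
+    public int Frame()
+    {
+        int slot = _frameIndex % RingDepth;
+        int readFrame = _frameIndex >= RingDepth ? _issuedFrame[slot] : -1;
+        _issuedFrame[slot] = _frameIndex;   // stands in for issuing this frame's query
+        _frameIndex++;
+        return readFrame;
+    }
+}
+```
+
+Calling `Frame()` seven times returns -1, -1, -1, 0, 1, 2, 3: nothing is read until the
+ring has wrapped once, and from then on each read lags the current frame by three.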
+
+---
+
+## File Structure
+
+| File | Action | Responsibility |
+|---|---|---|
+| [`src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`](../../../src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs) | Modify | Replace 2 `uint` query handles with ring-of-3 arrays; move query result read to before next-frame overwrite. |
+| [`src/AcDream.App/Rendering/TextureCache.cs`](../../../src/AcDream.App/Rendering/TextureCache.cs) | Modify | Add upload-time dimension/format tracking + env-gated `TickSurfaceHistogramDumpIfEnabled()` method that fires once at frame 600. |
+| [`src/AcDream.App/Rendering/GameWindow.cs`](../../../src/AcDream.App/Rendering/GameWindow.cs) | Modify | Call `_textureCache.TickSurfaceHistogramDumpIfEnabled()` once per frame in `OnRender`. |
+| `docs/plans/2026-05-11-phase-n6-perf-baseline.md` | Create | Baseline measurement doc: setup, numbers at radii 4/8/12 (standstill + walking), surface histogram summary, conclusion paragraph recommending next phase. |
+| [`docs/plans/2026-04-11-roadmap.md`](../../plans/2026-04-11-roadmap.md) lines 690-705 | Modify | Amend N.6 entry to reflect the slice 1 / slice 2 split. |
+
+---
+
+## Task 1: GPU query ring buffering (commit 1)
+
+**Files:**
+- Modify: `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`
+
+The six edit zones (Steps 1.1–1.6) are well-isolated by exact strings. Apply them in order. The order does not affect the final build (intermediate states are not expected to compile until every step has been applied), but the resulting code is easier to review if applied as documented.
+
+- [ ] **Step 1.1: Replace the field declarations (~line 155)**
+
+Use Edit to replace the existing field block:
+
+**old_string:**
+```csharp
+    private uint _gpuQueryOpaque;
+    private uint _gpuQueryTransparent;
+    private readonly long[] _gpuSamples = new long[256];   // microseconds
+    private int _gpuSampleCursor;
+    private bool _gpuQueriesInitialized;
+```
+
+**new_string:**
+```csharp
+    // GPU timing uses a ring of 3 query-pair slots so the read of frame N-3's
+    // result lands when the GPU has finished (~50ms after issue on a typical
+    // 60fps frame). Ring of 3 is the vendor-neutral choice: NVIDIA drivers with
+    // triple-buffering+vsync can queue ~3 frames ahead, AMD typically 1-2,
+    // Intel iGPUs vary. ResultAvailable is the safety guard if the GPU is
+    // still working when we try to read.
+    private const int GpuQueryRingDepth = 3;
+    private readonly uint[] _gpuQueryOpaque      = new uint[GpuQueryRingDepth];
+    private readonly uint[] _gpuQueryTransparent = new uint[GpuQueryRingDepth];
+    private int _gpuQueryFrameIndex;
+    private readonly long[] _gpuSamples = new long[256];   // microseconds
+    private int _gpuSampleCursor;
+    private bool _gpuQueriesInitialized;
+```
+
+- [ ] **Step 1.2: Replace the init block (~line 347)**
+
+**old_string:**
+```csharp
+        if (diag && !_gpuQueriesInitialized)
+        {
+            _gpuQueryOpaque      = _gl.GenQuery();
+            _gpuQueryTransparent = _gl.GenQuery();
+            _gpuQueriesInitialized = true;
+        }
+```
+
+**new_string:**
+```csharp
+        if (diag && !_gpuQueriesInitialized)
+        {
+            for (int i = 0; i < GpuQueryRingDepth; i++)
+            {
+                _gpuQueryOpaque[i]      = _gl.GenQuery();
+                _gpuQueryTransparent[i] = _gl.GenQuery();
+            }
+            _gpuQueriesInitialized = true;
+        }
+```
+
+- [ ] **Step 1.3: Insert the read-before-overwrite block + compute slot just before the opaque query begin (~line 774)**
+
+This step replaces the existing single-line `BeginQuery` for opaque with a block that first computes the slot, reads the slot's frame N-3 result (gated on having completed one ring), then issues the new query into the same slot.
+
+**old_string:**
+```csharp
+            _gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer);
+            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque);
+```
+
+**new_string:**
+```csharp
+            _gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer);
+
+            // GPU timing: compute this frame's ring slot. We read frame N-3's
+            // result (the oldest data in the ring) before overwriting it with
+            // frame N's queries. See spec §3 Q1/Q2 + §4 in
+            // docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md.
+            int gpuQuerySlot = _gpuQueryFrameIndex % GpuQueryRingDepth;
+            if (diag && _gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth)
+            {
+                _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot], QueryObjectParameterName.ResultAvailable, out int avail);
+                if (avail != 0)
+                {
+                    _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot],      QueryObjectParameterName.Result, out ulong opaqueNs);
+                    _gl.GetQueryObject(_gpuQueryTransparent[gpuQuerySlot], QueryObjectParameterName.Result, out ulong transNs);
+                    long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
+                    _gpuSamples[_gpuSampleCursor] = gpuUs;
+                    _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
+                }
+                // If avail==0 the sample is dropped silently. MedianMicros
+                // computes over the non-zero subset, so dropped samples don't
+                // poison the median.
+            }
+
+            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque[gpuQuerySlot]);
+```
+
+- [ ] **Step 1.4: Update the transparent query begin to use the same slot (~line 823)**
+
+**old_string:**
+```csharp
+            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent);
+```
+
+**new_string:**
+```csharp
+            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent[gpuQuerySlot]);
+```
+
+- [ ] **Step 1.5: Replace the buggy in-frame read block + increment frame counter (~line 849)**
+
+**old_string:**
+```csharp
+            // Read GPU samples non-blocking; the result for the previous frame's
+            // queries should be ready by now. If not, drop the sample (don't stall
+            // the CPU waiting for the GPU).
+            if (_gpuQueriesInitialized)
+            {
+                _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.ResultAvailable, out int avail);
+                if (avail != 0)
+                {
+                    _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.Result, out ulong opaqueNs);
+                    _gl.GetQueryObject(_gpuQueryTransparent, QueryObjectParameterName.Result, out ulong transNs);
+                    long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
+                    _gpuSamples[_gpuSampleCursor] = gpuUs;
+                    _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
+                }
+            }
+
+            _drawsIssued     += _opaqueDrawCount + _transparentDrawCount;
+```
+
+**new_string:**
+```csharp
+            // GPU sample read happens BEFORE issuing the next frame's queries
+            // (see step 1.3 above). Increment the frame counter here so the
+            // next call computes a fresh slot.
+            if (_gpuQueriesInitialized) _gpuQueryFrameIndex++;
+
+            _drawsIssued     += _opaqueDrawCount + _transparentDrawCount;
+```
+
+- [ ] **Step 1.6: Update Dispose to delete the full ring (~line 1140)**
+
+**old_string:**
+```csharp
+        if (_gpuQueriesInitialized)
+        {
+            _gl.DeleteQuery(_gpuQueryOpaque);
+            _gl.DeleteQuery(_gpuQueryTransparent);
+        }
+```
+
+**new_string:**
+```csharp
+        if (_gpuQueriesInitialized)
+        {
+            for (int i = 0; i < GpuQueryRingDepth; i++)
+            {
+                _gl.DeleteQuery(_gpuQueryOpaque[i]);
+                _gl.DeleteQuery(_gpuQueryTransparent[i]);
+            }
+        }
+```
+
+- [ ] **Step 1.7: Build**
+
+Run from the worktree root:
+
+```powershell
+dotnet build
+```
+
+Expected: build succeeds with no new warnings or errors. If the build fails, the most likely cause is a missed string in one of the steps above — re-grep `_gpuQueryOpaque` and `_gpuQueryTransparent` in `WbDrawDispatcher.cs` and confirm every reference uses the array-indexed form `[gpuQuerySlot]` or `[i]`.
+
+- [ ] **Step 1.8: Run the test suite**
+
+```powershell
+dotnet test --no-build
+```
+
+Expected: same pass/fail baseline as before the change (~1688 passing, ~8 pre-existing physics/input failures unchanged). No new failures.
+
+- [ ] **Step 1.9: Manual verification — launch live and confirm `gpu_us` reports non-zero**
+
+```powershell
+$env:ACDREAM_DAT_DIR   = "$env:USERPROFILE\Documents\Asheron's Call"
+$env:ACDREAM_LIVE      = "1"
+$env:ACDREAM_TEST_HOST = "127.0.0.1"
+$env:ACDREAM_TEST_PORT = "9000"
+$env:ACDREAM_TEST_USER = "testaccount"
+$env:ACDREAM_TEST_PASS = "testpassword"
+$env:ACDREAM_WB_DIAG   = "1"
+dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "task1-verify.log"
+```
+
+In-world: walk Holtburg for ~30 seconds. Close the window when done.
+
+Verification check on `task1-verify.log`:
+
+```powershell
+Select-String -Path task1-verify.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 5
+```
+
+Expected output: at least one `[WB-DIAG]` line where `gpu_us=Xm/Yp95` has X > 0 (typically tens to low-hundreds of microseconds at radius=4-12 on a modern GPU). If `gpu_us=0m/0p95` persists for the entire run, the fix didn't take — check whether the build actually rebuilt (try `dotnet build -c Debug` then re-launch).
+
+Also confirm: no visible regression in the client. Entities render, animations play, sky cycles. Close the client cleanly.
+
+- [ ] **Step 1.10: Commit**
+
+```powershell
+git add src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs
+git commit -m @'
+feat(perf): Phase N.6 slice 1 — fix gpu_us double-buffering in WbDrawDispatcher
+
+The dispatcher's GPU TimeElapsed queries were polled in the same frame
+as the indirect draw, so glGetQueryObject(ResultAvailable) always
+returned 0 and gpu_us in [WB-DIAG] was stuck at 0m/0p95.
+
+Replace the 2 single-handle queries with ring-of-3 arrays and move the
+result read to BEFORE issuing the next frame's queries into the same
+slot — at frame N we read slot N%3 which holds frame N-3's queries
+(oldest in the ring, ~50ms old at 60fps and definitely done across all
+desktop GL drivers). Vendor-neutral: AMD/NVIDIA/Intel desktop GL all
+work without driver-specific code.
+
+No new tests — the change is purely a diagnostic readout fix, no
+observable behavior in the rendering path. Manual verification:
+[WB-DIAG] now reports non-zero gpu_us at Holtburg radius=12.
+
+Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (§4).
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+'@
+git status
+```
+
+Expected: clean working tree after commit. Note the new commit SHA — needed for the baseline doc's "measured against" reference.
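+
+The read-before-overwrite ordering described in the commit message can be sketched without any GL calls. This is an illustrative simulation only: `issuedAtFrame` and `SlotFor` are hypothetical names, not members of `WbDrawDispatcher`; only the ring depth of 3 and the `frame % 3` slot choice come from the plan.
+
+```csharp
+using System;
+
+// Hypothetical stand-in for the dispatcher's query ring; no GL involved.
+const int RingDepth = 3;                  // mirrors GpuQueryRingDepth in the plan
+var issuedAtFrame = new long[RingDepth];  // which frame last wrote each slot
+Array.Fill(issuedAtFrame, -1L);           // -1 means the slot was never used
+
+int SlotFor(long frame) => (int)(frame % RingDepth);
+
+for (long frame = 0; frame < 7; frame++)
+{
+    int slot = SlotFor(frame);
+
+    // 1. Read the slot BEFORE overwriting it: it holds the oldest queries.
+    long readFrame = issuedAtFrame[slot];
+    string read = readFrame < 0 ? "nothing yet" : $"frame {readFrame}";
+
+    // 2. Reuse the same slot for this frame's queries.
+    issuedAtFrame[slot] = frame;
+
+    Console.WriteLine($"frame {frame}: slot {slot}, read {read}");
+}
+```
+
+The first three frames have nothing to read; from frame 3 onward each read returns the queries issued three frames earlier, which is where the ~50 ms figure at 60 fps comes from.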
+
+---
+
+## Task 2: Surface-format histogram dump path (part of commit 2 setup)
+
+**Files:**
+- Modify: `src/AcDream.App/Rendering/TextureCache.cs`
+- Modify: `src/AcDream.App/Rendering/GameWindow.cs`
+
+This task adds the env-gated one-shot dump infrastructure. It does NOT commit — the commit happens in Task 4 after the baseline document is also ready.
+
+- [ ] **Step 2.1: Add upload-time metadata tracking in `TextureCache.cs`**
+
+Add a new private dictionary that records `(width, height, formatLabel)` keyed by GL texture name. This lets `DumpSurfaceHistogram` emit dimension/format data without re-querying GL.
+
+Use Edit to insert the field right after the existing bindless cache fields (~line 41, just after `_bindlessByPalette`):
+
+**old_string:**
+```csharp
+    private readonly Dictionary<(uint surfaceId, uint origTexOverride, ulong paletteHash), (uint Name, ulong Handle)> _bindlessByPalette = new();
+
+    public TextureCache(GL gl, DatCollection dats, Wb.BindlessSupport? bindless = null)
+```
+
+**new_string:**
+```csharp
+    private readonly Dictionary<(uint surfaceId, uint origTexOverride, ulong paletteHash), (uint Name, ulong Handle)> _bindlessByPalette = new();
+
+    // Phase N.6 slice 1 (2026-05-11): per-upload metadata for the
+    // ACDREAM_DUMP_SURFACES=1 histogram dump path. Populated at upload
+    // time so the dump method doesn't have to query GL state. Keyed by
+    // GL texture name (same key used in cache value tuples). Format
+    // label is "RGBA8_DECODED" for the post-decode upload (all uploads
+    // currently land as RGBA8 regardless of source format).
+    private readonly Dictionary<uint, (int Width, int Height, string Format)> _uploadMetadata = new();
+
+    // Frame counter for the one-shot ACDREAM_DUMP_SURFACES=1 trigger.
+    // Increments per Tick call; fires the dump once at frame index 600
+    // and never again for the session. See spec §5.
+    private int _dumpFrameCounter;
+    private bool _surfaceHistogramAlreadyDumped;
+
+    public TextureCache(GL gl, DatCollection dats, Wb.BindlessSupport? bindless = null)
+```
+
+- [ ] **Step 2.2: Find the `UploadRgba8AsLayer1Array` method and record metadata there**
+
+Locate the method using Grep:
+
+```
+pattern: "UploadRgba8AsLayer1Array"
+path: src/AcDream.App/Rendering/TextureCache.cs
+output_mode: content
+-n: true
+```
+
+Read the method body (typically ~30-50 lines) to find the exact `return name;` line. The decoded texture has `decoded.Width`, `decoded.Height`, and `decoded.Rgba8` available.
+
+For each `return name;` in `UploadRgba8AsLayer1Array(DecodedTexture decoded)`, insert this line immediately before it:
+
+```csharp
+        _uploadMetadata[name] = (decoded.Width, decoded.Height, "RGBA8_DECODED");
+```
+
+If the method has only one `return name;` near its end, that's a single Edit. Use the surrounding 2-3 lines of context in `old_string` to make the Edit unique.
+
+- [ ] **Step 2.3: Also record metadata in the legacy `UploadRgba8` (non-bindless) path**
+
+Locate the method:
+
+```
+pattern: "private uint UploadRgba8\b"
+path: src/AcDream.App/Rendering/TextureCache.cs
+output_mode: content
+-n: true
+```
+
+Apply the same `_uploadMetadata[name] = (decoded.Width, decoded.Height, "RGBA8_DECODED");` insertion before each `return name;` in `UploadRgba8(DecodedTexture decoded)`. This ensures the dump captures both legacy and modern uploads.
+
+- [ ] **Step 2.4: Add the `TickSurfaceHistogramDumpIfEnabled` public method to `TextureCache.cs`**
+
+Locate `HashPaletteOverride` using Grep:
+
+```
+pattern: "internal static ulong HashPaletteOverride"
+path: src/AcDream.App/Rendering/TextureCache.cs
+output_mode: content
+-n: true
+-A: 20
+```
+
+Identify its closing brace. Use Edit with surrounding context to insert the new methods immediately after.
+
+**old_string:** (the last few lines of `HashPaletteOverride`):
+```csharp
+        foreach (var sp in p.SubPalettes)
+        {
+            h = (h ^ sp.SubPaletteId) * prime;
+            h = (h ^ sp.Offset) * prime;
+            h = (h ^ sp.Length) * prime;
+        }
+        return h;
+    }
+```
+
+**new_string:**
+```csharp
+        foreach (var sp in p.SubPalettes)
+        {
+            h = (h ^ sp.SubPaletteId) * prime;
+            h = (h ^ sp.Offset) * prime;
+            h = (h ^ sp.Length) * prime;
+        }
+        return h;
+    }
+
+    /// <summary>
+    /// Phase N.6 slice 1: one-shot surface-format histogram dump for the
+    /// atlas-opportunity audit. Activated by ACDREAM_DUMP_SURFACES=1; fires
+    /// once at frame 600 of the session (~10s at 60fps, ~3s at 200fps —
+    /// both well past streaming settle at radius≤12). Output goes to
+    /// %LOCALAPPDATA%\acdream\n6-surfaces.txt. Zero cost when off.
+    /// See spec §5 in docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md.
+    /// </summary>
+    public void TickSurfaceHistogramDumpIfEnabled()
+    {
+        if (_surfaceHistogramAlreadyDumped) return;
+        if (!string.Equals(Environment.GetEnvironmentVariable("ACDREAM_DUMP_SURFACES"), "1", StringComparison.Ordinal)) return;
+        _dumpFrameCounter++;
+        if (_dumpFrameCounter < 600) return;
+
+        DumpSurfaceHistogram();
+        _surfaceHistogramAlreadyDumped = true;
+    }
+
+    private void DumpSurfaceHistogram()
+    {
+        var localAppData = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData);
+        var outDir = System.IO.Path.Combine(localAppData, "acdream");
+        System.IO.Directory.CreateDirectory(outDir);
+        var outPath = System.IO.Path.Combine(outDir, "n6-surfaces.txt");
+
+        var sb = new System.Text.StringBuilder();
+        sb.AppendLine($"# acdream surface-format histogram — generated {DateTime.UtcNow:yyyy-MM-ddTHH:mm:ssZ}");
+        sb.AppendLine("# Per-entry: surfaceId(hex), width, height, format, byteCount");
+        sb.AppendLine();
+
+        // Walk every cached entry across the 6 caches, dedupe by GL name.
+        var seen = new HashSet<uint>();
+        long totalBytes = 0;
+        var bucketsByDim = new Dictionary<(int W, int H), int>();
+        var bucketsByFormat = new Dictionary<string, int>();
+        var bucketsByTriple = new Dictionary<(int W, int H, string F), int>();
+
+        void Emit(uint surfaceId, uint name)
+        {
+            if (!seen.Add(name)) return;
+            if (!_uploadMetadata.TryGetValue(name, out var meta)) return;
+            int bytes = meta.Width * meta.Height * 4;
+            totalBytes += bytes;
+            sb.AppendLine($"0x{surfaceId:X8}, {meta.Width}, {meta.Height}, {meta.Format}, {bytes}");
+
+            var dimKey = (meta.Width, meta.Height);
+            bucketsByDim[dimKey] = bucketsByDim.GetValueOrDefault(dimKey) + 1;
+            bucketsByFormat[meta.Format] = bucketsByFormat.GetValueOrDefault(meta.Format) + 1;
+            var tripleKey = (meta.Width, meta.Height, meta.Format);
+            bucketsByTriple[tripleKey] = bucketsByTriple.GetValueOrDefault(tripleKey) + 1;
+        }
+
+        foreach (var kv in _handlesBySurfaceId)         Emit(kv.Key, kv.Value);
+        foreach (var kv in _handlesByOverridden)        Emit(kv.Key.surfaceId, kv.Value);
+        foreach (var kv in _handlesByPalette)           Emit(kv.Key.surfaceId, kv.Value);
+        foreach (var kv in _bindlessBySurfaceId)        Emit(kv.Key, kv.Value.Name);
+        foreach (var kv in _bindlessByOverridden)       Emit(kv.Key.surfaceId, kv.Value.Name);
+        foreach (var kv in _bindlessByPalette)          Emit(kv.Key.surfaceId, kv.Value.Name);
+
+        sb.AppendLine();
+        sb.AppendLine("# Rollups");
+        sb.AppendLine($"# Total unique GL textures: {seen.Count}");
+        sb.AppendLine($"# Total bytes (sum of W*H*4): {totalBytes}");
+
+        sb.AppendLine("# Top 10 (W,H) dimension buckets:");
+        foreach (var kv in bucketsByDim.OrderByDescending(kv => kv.Value).Take(10))
+            sb.AppendLine($"#   {kv.Key.W}x{kv.Key.H}: {kv.Value}");
+
+        sb.AppendLine("# Format buckets:");
+        foreach (var kv in bucketsByFormat.OrderByDescending(kv => kv.Value))
+            sb.AppendLine($"#   {kv.Key}: {kv.Value}");
+
+        sb.AppendLine("# Top 10 (W,H,format) triples — atlas-opportunity input:");
+        foreach (var kv in bucketsByTriple.OrderByDescending(kv => kv.Value).Take(10))
+            sb.AppendLine($"#   {kv.Key.W}x{kv.Key.H} {kv.Key.F}: {kv.Value}");
+
+        System.IO.File.WriteAllText(outPath, sb.ToString());
+        Console.WriteLine($"[N6-DUMP] Surface histogram written to {outPath} ({seen.Count} textures, {totalBytes} bytes)");
+    }
+```
+
+- [ ] **Step 2.5: Confirm `using System.Linq;` is present in `TextureCache.cs`**
+
+Read the file's `using` section (top of file). If `using System.Linq;` is NOT present, add it. The `OrderByDescending` and `Take` calls in `DumpSurfaceHistogram` need it.
+
+Pattern:
+```
+pattern: "^using System\.Linq"
+path: src/AcDream.App/Rendering/TextureCache.cs
+output_mode: count
+```
+
+If count is 0, add `using System.Linq;` in alphabetical order with the other usings at the top of the file.
+
+- [ ] **Step 2.6: Add the per-frame call site in `GameWindow.cs`**
+
+Find a stable insertion point near the top of `OnRender` (starts at line 6288). Use Grep:
+
+```
+pattern: "_gl!\.Clear\("
+path: src/AcDream.App/Rendering/GameWindow.cs
+output_mode: content
+-n: true
+-A: 3
+```
+
+This finds the `Clear` call(s) in or near `OnRender`. The first one after line 6288 is where you want to insert. Read 5 lines of context around it, then Edit to insert the dump tick on the line immediately after the `Clear` call returns:
+
+The insertion (one Edit):
+
+**old_string:** (find the `Clear` call in `OnRender` and capture 1-2 lines of its context — varies; common pattern is `_gl!.Clear(ClearBufferMask.ColorBufferBit | ClearBufferMask.DepthBufferBit);` followed by the next line of `OnRender` work).
+
+**new_string:** the same `Clear` call followed by:
+```csharp
+
+        // Phase N.6 slice 1: one-shot surface-format histogram dump under
+        // ACDREAM_DUMP_SURFACES=1. Zero cost when off.
+        _textureCache?.TickSurfaceHistogramDumpIfEnabled();
+```
+
+If `OnRender` has multiple `Clear` calls, place the tick after the first one inside the method body. The call must run exactly once per frame, before any rendering work — placing it right after `Clear` accomplishes both.
+
+- [ ] **Step 2.7: Build**
+
+```powershell
+dotnet build
+```
+
+Expected: build succeeds with no new warnings. If the compiler reports CS1061 (a `Dictionary` "does not contain a definition for 'OrderByDescending'"), Step 2.5 was missed: add `using System.Linq;` and rebuild.
+
+- [ ] **Step 2.8: Run the test suite**
+
+```powershell
+dotnet test --no-build
+```
+
+Expected: same pass/fail baseline (~1688 passing, ~8 pre-existing failures). No new failures.
+
+- [ ] **Step 2.9: Manual verification — confirm the dump file appears**
+
+Launch with the dump env var on:
+
+```powershell
+$env:ACDREAM_DUMP_SURFACES = "1"
+$env:ACDREAM_WB_DIAG = "1"
+# Other env vars same as Task 1 Step 1.9
+dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "task2-verify.log"
+```
+
+Wait ~15 seconds after the window appears, then close it. Check the file:
+
+```powershell
+Get-Content "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" | Select-Object -First 30
+```
+
+Expected: a non-empty file with the header, per-entry rows, and rollup sections. Also confirm one `[N6-DUMP] Surface histogram written to ...` line in `task2-verify.log` (just before window close).
+
+If the file is empty or missing:
+- Check the launch log for the `[N6-DUMP]` line.
+- If it's not there, `_dumpFrameCounter` didn't reach 600 — the user closed too early. Re-run and wait longer.
+- If it's there but the file lookup fails, the path output in the log should show what was actually written; investigate that path.
+
+**Do not commit yet.** Continue to Task 3.
+
+---
+
+## Task 3: Capture baseline measurements
+
+**Files:**
+- Create: `docs/plans/2026-05-11-phase-n6-perf-baseline.md` (final content lands in Task 4 — this task just collects the numbers).
+
+This is the manual measurement task. Each step launches the client, runs a specific scenario, and captures the diagnostic output. Save each log separately for the final write-up. Total expected time: ~30-45 min.
+
+Setup once per session:
+```powershell
+$env:ACDREAM_DAT_DIR   = "$env:USERPROFILE\Documents\Asheron's Call"
+$env:ACDREAM_LIVE      = "1"
+$env:ACDREAM_TEST_HOST = "127.0.0.1"
+$env:ACDREAM_TEST_PORT = "9000"
+$env:ACDREAM_TEST_USER = "testaccount"
+$env:ACDREAM_TEST_PASS = "testpassword"
+$env:ACDREAM_WB_DIAG   = "1"
+```
+
+For each measurement run, set `ACDREAM_STREAM_RADIUS` before launch. Use the `QualityPreset=High` default (no overrides). All runs at Holtburg with `+Acdream` at clear midday (cycle weather with F10 → Clear, time with F7 → Noon).
+
+Per run, after ~30 seconds at the target condition, close the window and grep the log for the last 3 `[WB-DIAG]` lines — those have the steady-state numbers.
+
+- [ ] **Step 3.1: Capture radius=4 standstill**
+
+```powershell
+$env:ACDREAM_STREAM_RADIUS = "4"
+dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r4-stand.log"
+```
+
+In-world: enter world, do not move, hold position for 30 seconds. Close.
+
+```powershell
+Select-String -Path baseline-r4-stand.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 3
+```
+
+Record from the median of the last 3 lines: `cpu_us`, `gpu_us`, `entSeen`, `entDrawn`, `groups`. Also note the window-title FPS shown during the test.
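+
+Taking the median by eye is fine for three lines, but the extraction can also be scripted. The sketch below is a minimal example, assuming every timing field follows the `name=Xm/Yp95` shape shown for `gpu_us` in Step 1.9; the sample lines and the `MedianOf` helper are hypothetical and the real `[WB-DIAG]` lines carry more fields.
+
+```csharp
+using System;
+using System.Linq;
+using System.Text.RegularExpressions;
+
+// Hypothetical sample lines; only the gpu_us=Xm/Yp95 shape comes from the plan.
+string[] lastThree =
+{
+    "[WB-DIAG] gpu_us=120m/310p95",
+    "[WB-DIAG] gpu_us=135m/298p95",
+    "[WB-DIAG] gpu_us=128m/305p95",
+};
+
+// group 1 = the per-line median (X), group 2 = the per-line p95 (Y).
+static long MedianOf(string[] lines, string field, int group)
+{
+    var rx = new Regex(Regex.Escape(field) + @"=(\d+)m/(\d+)p95");
+    long[] values = lines
+        .Select(l => rx.Match(l))
+        .Where(m => m.Success)
+        .Select(m => long.Parse(m.Groups[group].Value))
+        .OrderBy(v => v)
+        .ToArray();
+    if (values.Length == 0)
+        throw new InvalidOperationException($"no {field} samples found");
+    return values[values.Length / 2]; // middle element; exact median for odd counts
+}
+
+Console.WriteLine($"gpu_us median: {MedianOf(lastThree, "gpu_us", 1)}");
+Console.WriteLine($"gpu_us p95:    {MedianOf(lastThree, "gpu_us", 2)}");
+```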
+
+- [ ] **Step 3.2: Capture radius=4 walking**
+
+```powershell
+$env:ACDREAM_STREAM_RADIUS = "4"
+dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r4-walk.log"
+```
+
+In-world: enter world, Tab to player mode, walk N→E→S→W across one landblock over ~30 seconds. Close.
+
+Capture same numbers as 3.1.
+
+- [ ] **Step 3.3: Capture radius=8 standstill**
+
+```powershell
+$env:ACDREAM_STREAM_RADIUS = "8"
+dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r8-stand.log"
+```
+
+Same procedure as 3.1. Wait ~40 seconds before recording (streaming takes longer to settle).
+
+- [ ] **Step 3.4: Capture radius=8 walking**
+
+```powershell
+$env:ACDREAM_STREAM_RADIUS = "8"
+dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r8-walk.log"
+```
+
+Same procedure as 3.2.
+
+- [ ] **Step 3.5: Capture radius=12 standstill**
+
+```powershell
+$env:ACDREAM_STREAM_RADIUS = "12"
+dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r12-stand.log"
+```
+
+Same procedure as 3.1. Wait ~60 seconds before recording. This is the headline measurement — pay attention to whether `gpu_us` p95 is well below 16.6 ms (60 fps target) or pushing it.
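+
+To make "well below or pushing it" concrete, the p95 reading can be compared against the thresholds used in the baseline template's §4 (8000 µs and 14000 µs against a 16600 µs budget). This is a sketch, not project code: `Headroom` is a hypothetical helper and the input value is a placeholder to be replaced with the measured `gpu_us` p95.
+
+```csharp
+using System;
+
+const double FrameBudgetUs = 16600; // 16.6 ms at the 60 fps target
+
+// Thresholds copied from the baseline template's conclusion checklist.
+static string Headroom(double gpuP95Us) =>
+    gpuP95Us < 8000  ? "comfortable headroom" :
+    gpuP95Us < 14000 ? "tight but OK" :
+                       "GPU-saturated";
+
+double p95 = 603; // placeholder; substitute gpu_us p95 from baseline-r12-stand.log
+double share = 100 * p95 / FrameBudgetUs;
+Console.WriteLine($"{p95} us is {share:F1}% of the frame budget: {Headroom(p95)}");
+```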
+
+- [ ] **Step 3.6: Capture radius=12 walking**
+
+```powershell
+$env:ACDREAM_STREAM_RADIUS = "12"
+dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-r12-walk.log"
+```
+
+Same procedure as 3.2 (walking across one landblock, ~30 seconds of motion within the 60s+ window).
+
+- [ ] **Step 3.7: Capture the surface histogram**
+
+```powershell
+$env:ACDREAM_STREAM_RADIUS = "12"
+$env:ACDREAM_DUMP_SURFACES = "1"
+dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline-surfaces.log"
+```
+
+In-world: enter world at Holtburg, do nothing for ~30 seconds (let the dump fire at frame 600). Close. Copy the file:
+
+```powershell
+Copy-Item "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" -Destination "baseline-surfaces.txt"
+```
+
+Inspect:
+```powershell
+Get-Content baseline-surfaces.txt | Select-Object -Last 40
+```
+
+Record the rollup section (total textures, total bytes, top 10 dimension buckets, format distribution, top 10 (W,H,format) triples).
+
+- [ ] **Step 3.8: Clean up the env vars and the local app data dump**
+
+```powershell
+Remove-Item Env:\ACDREAM_DUMP_SURFACES -ErrorAction SilentlyContinue
+Remove-Item Env:\ACDREAM_STREAM_RADIUS -ErrorAction SilentlyContinue
+# Optional: clean up the source file so a future re-measurement isn't confused by stale data
+Remove-Item "$env:LOCALAPPDATA\acdream\n6-surfaces.txt" -ErrorAction SilentlyContinue
+```
+
+All log files (`baseline-r*-*.log`, `baseline-surfaces.log`, `baseline-surfaces.txt`) remain in the worktree root for Task 4. They will NOT be committed — they're scratch.
+
+---
+
+## Task 4: Write baseline doc + amend roadmap + ship commit 2
+
+**Files:**
+- Create: `docs/plans/2026-05-11-phase-n6-perf-baseline.md`
+- Modify: `docs/plans/2026-04-11-roadmap.md` lines 690-705
+
+- [ ] **Step 4.1: Write the baseline document**
+
+Use Write to create `docs/plans/2026-05-11-phase-n6-perf-baseline.md` with this content (substitute real numbers from Task 3 captures into every `<n>` and `<pct>` placeholder; do NOT leave any unfilled):
+
+```markdown
+# Phase N.6 slice 1 — perf baseline at Holtburg
+
+**Created:** 2026-05-11.
+**Spec:** [docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md](../superpowers/specs/2026-05-11-phase-n6-slice1-design.md)
+**Measured against commit:** <commit SHA from Task 1.10>
+**Purpose:** Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data.
+
+---
+
+## §1. Setup
+
+- **Hardware:** Radeon RX 9070 XT
+- **Resolution:** 1440p (2560×1440)
+- **Quality preset:** High (default)
+- **Connection:** live ACE at `127.0.0.1:9000`
+- **Character:** `+Acdream` at Holtburg
+- **Sky / time:** clear midday (F7 → Noon, F10 → Clear)
+- **Build:** Debug
+- **Date measured:** 2026-05-11
+- **Environment overrides:** `ACDREAM_WB_DIAG=1`, `ACDREAM_STREAM_RADIUS=<per-run>`
+
+## §2. Dispatch CPU / GPU numbers
+
+Each cell records the median of the last 3 `[WB-DIAG]` lines from a ~30s stable window. `entSeen / entDrawn / groups` also come from those lines. FPS is read from the window title.
+
+| Radius | Motion | cpu_us median | cpu_us p95 | gpu_us median | gpu_us p95 | FPS | entSeen | entDrawn | groups |
+|---|---|---|---|---|---|---|---|---|---|
+| 4 | standstill | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
+| 4 | walking    | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
+| 8 | standstill | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
+| 8 | walking    | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
+| 12| standstill | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
+| 12| walking    | <n> | <n> | <n> | <n> | <n> | <n> | <n> | <n> |
+
+## §3. Surface-format histogram
+
+From `ACDREAM_DUMP_SURFACES=1` at radius=12, ~30s after enter-world.
+
+- **Total unique GL textures:** <n>
+- **Total bytes (sum of W*H*4):** <n>
+- **Top 10 (W, H) dimension buckets:**
+  - `<W>x<H>`: <count>
+  - ... (paste from baseline-surfaces.txt rollup)
+- **Format distribution:**
+  - `<format>`: <count>
+- **Top 10 (W, H, format) triples — atlas-opportunity input:**
+  - `<W>x<H> <format>`: <count>
+  - ...
+
+**Atlas-opportunity score:** <pct>% of surfaces fall into the top-3 (W, H, format) triples. (A score >30% means atlas consolidation could meaningfully reduce sampler switches + memory overhead; <15% means scattered content and atlas is not worth the slice-2 effort.)
+
+## §4. Conclusion + next-phase recommendation
+
+<Opinionated paragraph addressing:
+ 1. Is the entity dispatcher CPU-bound or GPU-bound at radius=12?
+    - Compare cpu_us p95 vs gpu_us p95. The larger one is the bottleneck.
+ 2. Does gpu_us p95 leave headroom at 60 fps target (16.6 ms / 16600 µs)?
+    - If gpu_us p95 < 8000 µs: comfortable headroom.
+    - If 8000 µs <= gpu_us p95 < 14000 µs: tight but OK.
+    - If gpu_us p95 >= 14000 µs: GPU-saturated, persistent-mapped buffers and compute cull help.
+ 3. Does the atlas score justify slice-2 atlas work?
+ 4. Given (1)-(3), which is the right next phase?
+    - CPU-bound + low atlas score: pivot to C.1.5 (visible content, perf already comfortable).
+    - GPU-bound + high atlas score: do N.6 slice 2 (atlas + persistent buffers).
+    - Either-bound + headroom + low atlas score: do C.1.5 first.
+    - GPU saturated + need for more headroom: escalate to Tier 2.>
+
+## §5. Raw logs
+
+Scratch logs from this measurement run (not committed):
+- `baseline-r4-stand.log`, `baseline-r4-walk.log`
+- `baseline-r8-stand.log`, `baseline-r8-walk.log`
+- `baseline-r12-stand.log`, `baseline-r12-walk.log`
+- `baseline-surfaces.log`, `baseline-surfaces.txt`
+```
+
+Fill in every `<n>` and `<pct>` and the conclusion paragraph with the real values from Task 3. **Do NOT leave any `<n>` placeholders.** If a measurement is missing, re-run that step from Task 3 before continuing.
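+
+The atlas-opportunity score in §3 of the template is a simple ratio: the summed counts of the top-3 (W, H, format) triples over the total texture count. A minimal sketch of that arithmetic follows; the input numbers are hypothetical placeholders and `AtlasScorePct` is not part of the codebase. The 30% and 15% cut-offs are the ones stated in the template.
+
+```csharp
+using System;
+using System.Linq;
+
+// Hypothetical rollup values; substitute the counts from baseline-surfaces.txt.
+int totalTextures = 400;
+int[] tripleCounts = { 120, 60, 20, 15, 10 };
+
+static double AtlasScorePct(int total, int[] counts)
+{
+    if (total <= 0) throw new ArgumentOutOfRangeException(nameof(total));
+    int top3 = counts.OrderByDescending(c => c).Take(3).Sum();
+    return 100.0 * top3 / total;
+}
+
+double pct = AtlasScorePct(totalTextures, tripleCounts);
+string verdict = pct > 30 ? "atlas consolidation worth considering"
+               : pct < 15 ? "atlas not worth the slice-2 effort"
+               : "inconclusive (between the two thresholds)";
+Console.WriteLine($"Atlas-opportunity score: {pct:F1}% -> {verdict}");
+```
+
+The template leaves the 15-30% band undefined, so the sketch reports it as inconclusive rather than guessing a verdict.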
+
+- [ ] **Step 4.2: Read the current roadmap N.6 entry**
+
+```
+Read offset 685, limit 25 from docs/plans/2026-04-11-roadmap.md
+```
+
+Confirm the bullet starts with `- **N.6 — Perf polish.** **Planned (post-A.5 polish takes priority).**` and ends with `Plan + spec written when work begins. **Estimate: 1-2 weeks.**`. Capture the exact text verbatim for Step 4.3's `old_string`.
+
+- [ ] **Step 4.3: Amend the roadmap entry**
+
+Use Edit. The change splits N.6 into slice 1 (shipping with this commit) and slice 2 (deferred until after C.1.5).
+
+**old_string:** the exact N.6 bullet copied from the Read in Step 4.2.
+
+**new_string:**
+```markdown
+- **N.6 slice 1 — GPU timing fix + radius=12 perf baseline.** **SHIPPED 2026-05-11.**
+  Fixed the gpu_us double-buffering bug in `WbDrawDispatcher` (ring-of-3
+  query slots, read-before-overwrite, vendor-neutral across AMD/NVIDIA/Intel
+  desktop GL). Added env-gated surface-format histogram dump in `TextureCache`
+  for atlas-opportunity audit. Captured authoritative baseline at Holtburg
+  radii 4 / 8 / 12 (standstill + walking) with the now-working `gpu_us`
+  diagnostic. Plan + spec at `docs/superpowers/{specs,plans}/2026-05-11-phase-n6-slice1-*.md`.
+  Baseline numbers + next-phase recommendation at
+  [docs/plans/2026-05-11-phase-n6-perf-baseline.md](2026-05-11-phase-n6-perf-baseline.md).
+- **N.6 slice 2 — Perf polish cleanup.** **Planned — deferred until after C.1.5
+  (PES emitter wiring) per the baseline doc's recommendation.** Builds on
+  slice 1's measurement. Scope: retire the legacy `Texture2D`/`sampler2D` path
+  in `TextureCache` (currently kept for Sky + Debug + particle paths now that
+  Terrain has migrated); delete orphan `mesh.frag` (verify zero callers post-N.5
+  amendment); decide bindless-everywhere vs legacy-island for the remaining
+  `sampler2D` consumers; conditionally adopt WB atlas if the slice-1 histogram
+  shows a real opportunity; conditionally adopt persistent-mapped buffers if
+  the slice-1 baseline shows `BufferSubData` as a hot spot; GPU compute culling
+  remains out-of-scope (that's Tier 3 of the perf-tiers roadmap, gated on
+  Tier 2 first). Plan + spec written when work begins. **Estimate: 1-2 weeks
+  once C.1.5 lands.**
+```
+
+- [ ] **Step 4.4: Build (sanity check — only docs touched, but be safe)**
+
+```powershell
+dotnet build
+```
+
+Expected: build succeeds. (No code touched in Task 4; this just confirms nothing was accidentally edited in src/.)
+
+- [ ] **Step 4.5: Commit 2**
+
+```powershell
+git add src/AcDream.App/Rendering/TextureCache.cs `
+        src/AcDream.App/Rendering/GameWindow.cs `
+        docs/plans/2026-05-11-phase-n6-perf-baseline.md `
+        docs/plans/2026-04-11-roadmap.md
+git commit -m @'
+docs(perf): Phase N.6 slice 1 — radius=12 baseline + surface dump path
+
+Capture authoritative CPU+GPU dispatch numbers at Holtburg with the
+gpu_us diagnostic now working (commit <prev SHA from Task 1.10>). Three
+radii (4/8/12) × two motion modes (standstill/walking) + a surface-format
+histogram from ACDREAM_DUMP_SURFACES=1.
+
+Adds env-gated one-shot dump path (TextureCache.TickSurfaceHistogramDumpIfEnabled,
+called from GameWindow.OnRender) that fires once at frame 600 of the
+session — zero cost when off, writes to %LOCALAPPDATA%\acdream\n6-surfaces.txt.
+
+Baseline document at docs/plans/2026-05-11-phase-n6-perf-baseline.md
+closes with a recommendation paragraph for the next phase. Roadmap entry
+amended to reflect the slice 1 / slice 2 split.
+
+Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md (§5, §6).
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+'@
+git status
+```
+
+Expected: clean working tree.
+
+- [ ] **Step 4.6: Final sanity sweep**
+
+```powershell
+git log -3 --oneline
+```
+
+Expected: two new commits from this slice (the GPU timing fix from Task 1.10, then this docs/perf commit) listed above the spec commit `05d590c`, since `git log` prints newest first.
+
+Also confirm the scratch baseline-r*.log and baseline-surfaces.* files are still NOT in the commit (they were not staged):
+
+```powershell
+git status
+```
+
+Expected: nothing staged or modified. The scratch logs may still show as untracked; that's fine, and they can be deleted manually:
+
+```powershell
+Remove-Item baseline-r*.log, baseline-surfaces.log, baseline-surfaces.txt, task1-verify.log, task2-verify.log -ErrorAction SilentlyContinue
+```
+
+---
+
+## Acceptance check (spec §9)
+
+After Task 4 commits, walk through the spec's acceptance criteria and confirm each one. This is a paper-walk, not a re-run — the steps above produce the conditions.
+
+- [ ] **A1: `[WB-DIAG]` reports non-zero `gpu_us` at radius=12.**
+  Verified in Task 1.9 (initial check) and Task 3.5-3.6 (full baseline run). Confirm by re-grepping `baseline-r12-stand.log`:
+  ```powershell
+  Select-String -Path baseline-r12-stand.log -Pattern "gpu_us=[1-9]"
+  ```
+  Should return at least one line.
+
+- [ ] **A2: Vendor-neutral.** No `GL_*_NV` or `GL_*_AMD` or `GL_*_INTEL` extension references in the change. Re-grep:
+  ```powershell
+  Select-String -Path src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs -Pattern "NV_|AMD_|INTEL_|GL_NV|GL_AMD|GL_INTEL"
+  ```
+  Expected: no matches in the new code (matches elsewhere in the file from unrelated existing code don't count).
+
+- [ ] **A3: Baseline doc has real numbers + conclusion.**
+  Open `docs/plans/2026-05-11-phase-n6-perf-baseline.md` and visually confirm no `<n>`, `<pct>`, `TBD`, or empty conclusion section.
+
+- [ ] **A4: Roadmap split shipped.**
+  ```powershell
+  Select-String -Path docs/plans/2026-04-11-roadmap.md -Pattern "N\.6 slice"
+  ```
+  Expected: two matches (slice 1 + slice 2 bullets).
+
+- [ ] **A5: `dotnet build` green, no new warnings.**
+  ```powershell
+  dotnet build
+  ```
+  Expected: succeeds. Note any new warnings vs the build output before the slice started.
+
+- [ ] **A6: `dotnet test` green at baseline (~1688 passing, ~8 pre-existing failures).**
+  ```powershell
+  dotnet test --no-build
+  ```
+  Expected: pass count unchanged from before the slice started; failure list unchanged.
+
+- [ ] **A7: No visible regression.**
+  Confirmed during Task 1.9 and Task 3 measurements — the user was in-world repeatedly and didn't observe any rendering issue. If anything looked off during measurement, file it as an issue and decide whether it blocks slice 1 acceptance.
+
+If any acceptance criterion fails, return to the relevant task and re-do it. Do not declare slice 1 complete with failing acceptance.
+
+---
+
+## After slice 1 lands
+
+The baseline document's conclusion paragraph (§4) determines the next phase:
+
+- **If conclusion recommends C.1.5:** brainstorm C.1.5 spec next, using [docs/plans/2026-04-27-phase-c1-pes-particles.md:285-295](../../plans/2026-04-27-phase-c1-pes-particles.md) as the starting scope.
+- **If conclusion recommends N.6 slice 2:** brainstorm slice 2 spec next, addressing legacy `TextureCache` cleanup + atlas + persistent-mapped buffers based on the histogram data.
+- **If conclusion recommends Tier 2:** consult [docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md](../../plans/2026-05-10-perf-tiers-2-3-roadmap.md) and brainstorm a Tier 2 spec.
+
+The choice is data-driven; the recommendation paragraph is the contract. Don't re-litigate the decision once the numbers are in.
--- a/docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md
+++ b/docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md
@ -0,0 +1,335 @@
+# Phase N.6 slice 1 — GPU timing fix + radius=12 perf baseline (design)
+
+**Created:** 2026-05-11.
+**Status:** approved design, ready for implementation plan.
+**Phase context:** Phase N.6 (perf polish) split into two slices on 2026-05-11 — this is slice 1. Slice 2 (legacy `TextureCache` cleanup + shader migration + optional persistent-mapped buffers) is deferred until after C.1.5 (PES emitter wiring), and gets its own spec then.
+**Roadmap entry:** [docs/plans/2026-04-11-roadmap.md](../../plans/2026-04-11-roadmap.md) lines 690-705 (to be amended in commit 2 to reflect the slice split).
+
+---
+
+## §1. Problem
+
+`WbDrawDispatcher` runs `glBeginQuery(GL_TIME_ELAPSED, …) … glEndQuery` around the opaque and transparent indirect draws, then immediately polls `glGetQueryObject(…, ResultAvailable, …)` **on the same frame** to read the result. The GPU has not finished executing the draw by the time the polling call runs, so `avail` is always 0, the sample is dropped, and the `_gpuSamples` ring stays all-zero forever. The user sees `gpu_us=0m/0p95` in every `[WB-DIAG]` line under `ACDREAM_WB_DIAG=1`.
+
+Verified at [src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs:849-859](../../../src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs#L849).
+
+Without this fix:
+- Every future perf decision (Tier 2 vs Tier 3 vs slice 2 vs do-nothing) is made on CPU-only data.
+- We cannot tell whether the dispatcher is CPU-bound or GPU-bound at radius=12.
+- We cannot validate that N.5/N.5b/Tier 1 changes actually moved GPU time.
+
+This slice ships the GPU-timing fix and uses the now-working diagnostic to produce one authoritative perf baseline document so the next phase decision (slice 2 vs C.1.5 vs Tier 2/3) is data-driven.
+
+---
+
+## §2. Goals and non-goals
+
+### Goals
+
+1. `[WB-DIAG]` reports non-zero `gpu_us` for the entity dispatcher's opaque+transparent passes at Holtburg radius=12 with `ACDREAM_WB_DIAG=1`.
+2. The fix works on AMD, NVIDIA, and Intel desktop OpenGL drivers without vendor-specific code paths.
+3. Produce a baseline document at `docs/plans/2026-05-11-phase-n6-perf-baseline.md` with CPU and GPU numbers across radii 4 / 8 / 12 (standstill + walking), a surface-format histogram, and a memory snapshot.
+4. The baseline document closes with a recommendation paragraph: should the next phase be N.6 slice 2 (perf cleanup), C.1.5 (PES wiring), or escalation to Tier 2 (static/dynamic split). Rationale grounded in the captured numbers.
+5. `dotnet build` and `dotnet test` green; no functional regression in the rendering path.
+
+### Non-goals
+
+- Persistent-mapped buffers (`BufferSubData` → `GL_MAP_PERSISTENT_BIT`). Deferred to slice 2 unless the baseline shows it's a hot spot.
+- Legacy `TextureCache` cleanup, `mesh.frag` orphan deletion, sky/UI text shader migration to bindless. All deferred to slice 2.
+- WB atlas adoption / texture-array consolidation. Deferred to slice 2 pending the surface histogram from goal 3.
+- Adding GPU queries to terrain / sky / particle / debug-line passes. Slice 1 keeps query scope to the existing two queries inside `WbDrawDispatcher` (opaque-pass + transparent-pass).
+- GPU compute culling. That's Tier 3 of [docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md](../../plans/2026-05-10-perf-tiers-2-3-roadmap.md), separate roadmap.
+
+---
+
+## §3. Design decisions (from brainstorming, 2026-05-11)
+
+| # | Decision | Rationale |
+|---|---|---|
+| Q1 | **Ring of 3 query-pair slots** (not ring of 2) | Vendor-neutral. NVIDIA drivers with triple-buffering + vsync can queue ~3 frames ahead; AMD typically 1–2; Intel iGPUs vary. Ring of 2 plus `ResultAvailable` guard works everywhere but drops more samples on deeper queues. Ring of 3 collects samples reliably across all desktop drivers. Cost: one extra `GLuint` query pair (~12 bytes of GPU state) plus one frame of latency on the printed value, which is invisible because the diagnostic is a 256-frame moving-window median. |
+| Q2 | **Read-before-issue, same-slot pattern** | On frame N, attempt to read slot `N%3` (which contains frame N-3's result — the *oldest* unread data, ~50 ms ago at 60 fps) *before* overwriting it with frame N's queries. Reading the oldest data maximizes the chance that `ResultAvailable=1` across all desktop drivers. Use `ResultAvailable` as a guard — if not ready, skip the sample. `MedianMicros` already computes over the non-zero subset, so dropped samples don't poison the result. |
+| Q3 | **Keep query scope unchanged** — just the two existing queries (opaque-pass + transparent-pass for the WB dispatcher) | Slice 1 is "fix what's broken," not "expand instrumentation." Adding terrain / sky / particle queries is slice-2-or-later work and would inflate this slice past the half-day budget. |
+| Q4 | **Surface-format histogram via env-gated one-shot dump** (`ACDREAM_DUMP_SURFACES=1`) | The atlas-adoption decision in slice 2 needs to know whether enough surfaces share dimensions/format to make consolidation worthwhile. A one-time dump on first frame to a fixed file path is cheap to implement, zero cost when off, and lets the user re-run cheaply when needed. Output goes to `%LOCALAPPDATA%\acdream\n6-surfaces.txt` (not stdout) to avoid spamming the launch log. |
+| Q5 | **Two commits, not one** | Commit 1 is the GPU-timing fix (code change, regression-bisectable). Commit 2 is the surface-dump path + baseline document (docs + env-gated diag). Keeping them separate means a future bisect for a GPU-timing regression doesn't land on a doc commit. |
+| Q6 | **Baseline measurement is Holtburg + High preset only** (per the user's hardware) | Slice 1 doesn't pretend to be a cross-hardware perf survey. It's one canonical measurement on the dev machine. The document template captures setup explicitly so a NVIDIA / lower-end run can be added later without re-architecting the doc. |
+
+---
+
+## §4. Change 1 — GPU query double-buffering
+
+### Files touched
+
+- `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs` — single-file change, ~30 LOC delta.
+
+### Current state (verified)
+
+```csharp
+// Field declarations near line 155:
+private uint _gpuQueryOpaque;
+private uint _gpuQueryTransparent;
+private readonly long[] _gpuSamples = new long[256];
+private bool _gpuQueriesInitialized;
+
+// Init at line ~347:
+if (diag && !_gpuQueriesInitialized) {
+    _gpuQueryOpaque      = _gl.GenQuery();
+    _gpuQueryTransparent = _gl.GenQuery();
+    _gpuQueriesInitialized = true;
+}
+
+// Around the opaque draw at line ~774:
+if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque);
+… opaque indirect draw …
+if (diag && _gpuQueriesInitialized) _gl.EndQuery(QueryTarget.TimeElapsed);
+
+// Same pattern around transparent draw at line ~823.
+
+// Read at line ~849 — BUG: same frame, never ready:
+if (_gpuQueriesInitialized) {
+    _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.ResultAvailable, out int avail);
+    if (avail != 0) {
+        _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.Result, out ulong opaqueNs);
+        _gl.GetQueryObject(_gpuQueryTransparent, QueryObjectParameterName.Result, out ulong transNs);
+        long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
+        _gpuSamples[_gpuSampleCursor] = gpuUs;
+        _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
+    }
+}
+
+// Dispose at line ~1140:
+if (_gpuQueriesInitialized) {
+    _gl.DeleteQuery(_gpuQueryOpaque);
+    _gl.DeleteQuery(_gpuQueryTransparent);
+}
+```
+
+### Target state
+
+```csharp
+private const int GpuQueryRingDepth = 3;
+private readonly uint[] _gpuQueryOpaque      = new uint[GpuQueryRingDepth];
+private readonly uint[] _gpuQueryTransparent = new uint[GpuQueryRingDepth];
+private int _gpuQueryFrameIndex;  // increments every frame we issue queries
+private bool _gpuQueriesInitialized;
+
+// Init:
+if (diag && !_gpuQueriesInitialized) {
+    for (int i = 0; i < GpuQueryRingDepth; i++) {
+        _gpuQueryOpaque[i]      = _gl.GenQuery();
+        _gpuQueryTransparent[i] = _gl.GenQuery();
+    }
+    _gpuQueriesInitialized = true;
+}
+
+// Compute the slot index for this frame. We read this slot's previous
+// contents (frame N-3's queries — the oldest data in the ring) and then
+// overwrite it with this frame's queries.
+int slot = _gpuQueryFrameIndex % GpuQueryRingDepth;
+
+// Read frame N-3's result BEFORE overwriting. Gated on "we've completed
+// at least one full ring of writes" so we don't read uninitialized slots
+// during warm-up.
+if (diag && _gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth) {
+    _gl.GetQueryObject(_gpuQueryOpaque[slot], QueryObjectParameterName.ResultAvailable, out int avail);
+    if (avail != 0) {
+        _gl.GetQueryObject(_gpuQueryOpaque[slot],      QueryObjectParameterName.Result, out ulong opaqueNs);
+        _gl.GetQueryObject(_gpuQueryTransparent[slot], QueryObjectParameterName.Result, out ulong transNs);
+        long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
+        _gpuSamples[_gpuSampleCursor] = gpuUs;
+        _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
+    }
+    // If avail==0 the sample is dropped silently. MedianMicros already
+    // computes over the non-zero subset, so dropped samples don't poison
+    // the median.
+}
+
+// Issue this frame's queries into the same slot, overwriting the data
+// we just (attempted to) read.
+if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque[slot]);
+… opaque indirect draw …
+if (diag && _gpuQueriesInitialized) _gl.EndQuery(QueryTarget.TimeElapsed);
+
+… same for transparent with _gpuQueryTransparent[slot] …
+
+// Advance only on frames that actually issued queries. The read, issue,
+// and increment steps share the same diag gate so that toggling
+// ACDREAM_WB_DIAG mid-session cannot re-read the same slot repeatedly.
+if (diag && _gpuQueriesInitialized) _gpuQueryFrameIndex++;
+
+// Dispose: loop over the ring.
+```
+
+### Behavior
+
+- Frames 0, 1, 2 issue queries but no reads happen (the `>= RingDepth` gate skips them).
+- Frame 3 reads frame 0's queries (oldest in ring) and writes new queries into slot 0. Frame 4 reads frame 1's, etc.
+- Steady-state: each frame's queries are read exactly once, three frames after they were issued. No queries are lost at startup: the results from frames 0/1/2 are read on frames 3/4/5, so the warm-up gate only delays the first sample by three frames (~50 ms at 60 fps).
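+- The diagnostic prints over a 256-frame moving window, which at 200 fps is ~1.3 s of history. Because `MedianMicros` only considers non-zero samples, a valid `gpu_us` median exists as soon as the first sample lands, so the first `[WB-DIAG]` flush after queries start (flushes happen every 5 s) should already show a non-zero value.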
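+
+The slot schedule described in the bullets above can be checked with a small stand-alone model. This is a minimal sketch, not the dispatcher code: `QueryRingModel` and `Schedule` are hypothetical names, no GL calls are involved, and it assumes every result is already available when it is read.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Hypothetical model of the read-before-overwrite ring; illustration only.
+public static class QueryRingModel
+{
+    // Returns (Reader, Issued) pairs: the frame doing the read and the
+    // frame whose queries it reads back from the shared slot.
+    public static List<(int Reader, int Issued)> Schedule(int ringDepth, int frames)
+    {
+        var slotOwner = new int[ringDepth];  // frame that last wrote each slot
+        var reads = new List<(int Reader, int Issued)>();
+        for (int frame = 0; frame < frames; frame++)
+        {
+            int slot = frame % ringDepth;
+            if (frame >= ringDepth)                  // warm-up gate from the spec
+                reads.Add((frame, slotOwner[slot])); // read before overwrite
+            slotOwner[slot] = frame;                 // issue this frame's queries
+        }
+        return reads;
+    }
+
+    public static void Main()
+    {
+        foreach (var (reader, issued) in Schedule(ringDepth: 3, frames: 7))
+            Console.WriteLine($"frame {reader} reads frame {issued}");
+    }
+}
+```
+
+With a ring depth of 3, every read in this model lands exactly three frames after the matching issue, which is the latency the spec relies on for `ResultAvailable` to be set.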
+
+### Diag interaction
+
+`MaybeFlushDiag` already prints every 5 s; no change there.
+
+`MedianMicros` already filters non-zero samples; no change there.
+
+The user-visible behavior change: `gpu_us=Xm/Yp95` numbers in `[WB-DIAG]` reflect real GPU draw time for the entity dispatcher's two indirect calls.
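+
+To make the "non-zero subset" behavior concrete, here is a hedged sketch of how a median and p95 could be computed over a sample ring that still contains zero (unfilled or dropped) slots. `SampleStats.PercentileNonZero` is a hypothetical helper using the nearest-rank method; the real `MedianMicros` is not shown in this spec and may compute its result differently.
+
+```csharp
+using System;
+using System.Linq;
+
+// Hypothetical helper; not the dispatcher's actual MedianMicros.
+public static class SampleStats
+{
+    // Nearest-rank percentile over the non-zero samples.
+    // Returns 0 when the ring holds no valid samples yet.
+    public static long PercentileNonZero(long[] samples, double percentile)
+    {
+        long[] valid = samples.Where(s => s > 0).OrderBy(s => s).ToArray();
+        if (valid.Length == 0) return 0;
+        int rank = (int)Math.Ceiling(percentile / 100.0 * valid.Length);
+        return valid[Math.Clamp(rank, 1, valid.Length) - 1];
+    }
+}
+```
+
+Under this scheme a dropped sample simply leaves its slot untouched, so it shrinks the valid subset rather than pulling the median toward zero.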
+
+---
+
+## §5. Change 2 — Surface-format histogram one-shot dump
+
+### Files touched
+
+- `src/AcDream.App/Rendering/TextureCache.cs` — add an env-gated dump method, ~40 LOC.
+- One caller in `GameWindow.cs` (a per-frame tick near the top of the render path; the method itself decides when to fire), ~5 LOC.
+
+### Trigger
+
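+Env var `ACDREAM_DUMP_SURFACES=1`. When set, the dump fires on the first frame at which both gates pass: at least 600 render ticks have elapsed (~10 s at 60 fps, ~3 s at 200 fps), and at least 100 textures have been uploaded, which indicates that world content has actually streamed in rather than just sky/UI/font textures. A frame-only gate can fire during the login/handshake phase, before any world has streamed. The dump then iterates all entries in the legacy and bindless caches (`_handlesBySurfaceId`, `_handlesByOverridden`, `_handlesByPalette`, `_bindlessBySurfaceId`, `_bindlessByOverridden`, `_bindlessByPalette`), dedupes by GL texture name, and emits a histogram to `%LOCALAPPDATA%\acdream\n6-surfaces.txt`. One-shot: it fires once per session with no repeats. The user can re-launch to capture a fresh snapshot.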
+
+### Output schema
+
+Per entry, one line: `surfaceId(uint32 hex), width(uint16), height(uint16), format(string), byteCount(uint32)`.
+
+Plus rollups at the end:
+- Count by `(width × height)` bucket — answers "how many distinct dimension pairs?".
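+- Count by format label. All uploads currently land as RGBA8 regardless of source format, so the dump records `RGBA8_DECODED` for every entry rather than the source `SurfaceFormat` (INDEX16, BGRA, DXT1, etc.).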
+- Total bytes (sum of `width × height × 4` for RGBA8 uploads).
+- Top 10 most-shared `(width, height, format)` triples by count — this is the atlas-opportunity input.
+
+### Cost when off
+
+Negligible — one `Dictionary<uint, …>` write per `UploadRgba8`/`UploadRgba8AsLayer1Array` call (the `_uploadMetadata` insertion is unconditional so the dump path doesn't have to query GL state when it does fire). At Holtburg with 760 textures that's ~30–50 KB of process memory and one hash-table write per upload — invisible at runtime, no GC pressure. The expensive work (file I/O, histogram construction) is gated by the env-var check inside `TickSurfaceHistogramDumpIfEnabled` and only runs when `ACDREAM_DUMP_SURFACES=1`.
+
+---
+
+## §6. Change 3 — Baseline document
+
+### File
+
+`docs/plans/2026-05-11-phase-n6-perf-baseline.md`.
+
+### Setup section
+
+- Hardware: Radeon RX 9070 XT (the user's machine).
+- Resolution: 1440p.
+- Quality preset: High (default).
+- Connection: live ACE at `127.0.0.1:9000`, character `+Acdream` at Holtburg.
+- Sky: clear midday, controlled via `F7` to remove weather noise.
+- Build: Debug (matches the user's normal launch).
+- Date measured: 2026-05-11.
+
+### Measurements
+
+Three radii: 4, 8, 12. Two motion modes per radius: standstill (camera anchored 30 s) and walking (`+Acdream` walks N→E→S→W across one landblock, 30 s).
+
+Per radius/mode, capture from `[WB-DIAG]` and the window title:
+- CPU dispatcher: `cpu_us` median, p95.
+- GPU dispatcher: `gpu_us` median, p95 (now real).
+- FPS.
+- Entities seen / drawn.
+- Groups.
+- Frame time (window title).
+
+### Memory snapshot
+
+One-time output from the `ACDREAM_DUMP_SURFACES=1` run, summarized:
+- Total surfaces in cache.
+- Total GPU texture bytes.
+- Dimension distribution (top 10 by count).
+- Format distribution.
+- Atlas-opportunity score: percentage of surfaces in the top-3 dimension buckets.
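+
+The atlas-opportunity score in the last bullet is a simple ratio, sketched below under stated assumptions. `AtlasOpportunity.ScorePercent` is a hypothetical helper, not part of `TextureCache`, and it takes plain `(Width, Height)` pairs rather than reading the dump file.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical helper for illustration; not part of the codebase.
+public static class AtlasOpportunity
+{
+    // Percentage of surfaces whose (Width, Height) pair falls into one of
+    // the topN most populated dimension buckets. Returns 0 for no input.
+    public static double ScorePercent(
+        IEnumerable<(int Width, int Height)> surfaces, int topN = 3)
+    {
+        var list = surfaces.ToList();
+        if (list.Count == 0) return 0.0;
+        int inTop = list.GroupBy(s => s)           // bucket by dimension pair
+                        .Select(g => g.Count())
+                        .OrderByDescending(c => c)
+                        .Take(topN)
+                        .Sum();
+        return 100.0 * inTop / list.Count;
+    }
+}
+```
+
+As a made-up example, ten surfaces split 4/3/2/1 across four dimension pairs would score 90%, since nine of the ten fall in the three largest buckets. A high score suggests that a few texture arrays could cover most surfaces; what threshold justifies the atlas work is left to the baseline document's conclusion.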
+
+### Conclusion section
+
+A recommendation paragraph addressing:
+1. Is the entity dispatcher CPU-bound or GPU-bound at radius=12?
+2. Does `gpu_us` p95 leave headroom or is the GPU saturated?
+3. Does the atlas-opportunity score justify slice-2 atlas work?
+4. Given (1)–(3), what should the next phase be? Slice 2 (perf cleanup), C.1.5 (PES emitter wiring), or escalation to Tier 2 (static/dynamic split)?
+
+The paragraph is opinionated — the next phase decision should be obvious from the numbers, not require a separate debate.
+
+---
+
+## §7. Test plan
+
+### Automated tests (none new)
+
+This slice is intentionally test-light:
+- The GPU-timing fix has no observable behavior in tests — it only changes a diagnostic readout. No new unit tests.
+- The surface-dump path is env-gated diag; no need to lock its output format in tests.
+- Existing 1688 tests must remain green. `WbDrawDispatcher` tests (bucketing, indirect-command construction, classification cache) must not be perturbed.
+
+### Manual verification
+
+1. Launch live with `ACDREAM_WB_DIAG=1`. Walk Holtburg for ~30 s. Confirm `[WB-DIAG]` prints `gpu_us=Xm/Yp95` with X > 0 within ~5 s.
+2. Launch live with `ACDREAM_DUMP_SURFACES=1 ACDREAM_WB_DIAG=1`. Wait ~10 s for streaming to settle. Open `%LOCALAPPDATA%\acdream\n6-surfaces.txt`. Confirm it contains a non-empty histogram.
+3. Run the baseline measurement procedure end-to-end. Confirm the document populates with real numbers, not placeholders.
+
+---
+
+## §8. Sequencing / ship gates
+
+### Commit 1 — GPU query fix
+
+**Message:** `feat(perf): Phase N.6 slice 1 — fix gpu_us double-buffering in WbDrawDispatcher`
+
+**Scope:** `WbDrawDispatcher.cs` changes only. Build green, tests green, manual verification step 1 from §7 passes.
+
+**Gate:** if `gpu_us` still reports 0 after ~10 s of movement, do NOT proceed to commit 2. Bump ring depth to 4 or investigate driver behavior before continuing.
+
+### Commit 2 — Baseline doc + surface dump
+
+**Message:** `docs(perf): Phase N.6 slice 1 — radius=12 baseline + surface dump path`
+
+**Scope:** `TextureCache.cs` dump method, `GameWindow.cs` hook, `docs/plans/2026-05-11-phase-n6-perf-baseline.md`, and the roadmap amendment at `docs/plans/2026-04-11-roadmap.md` lines 690-705 (split N.6 into slice 1 / slice 2 in the bullet list).
+
+**Gate:** manual verification steps 2 and 3 from §7 pass; baseline document's conclusion paragraph is filled in (not "TBD"); roadmap update lands in the same commit.
+
+---
+
+## §9. Acceptance criteria
+
+1. `[WB-DIAG]` reports non-zero `gpu_us` for the entity dispatcher's opaque+transparent passes at Holtburg radius=12 with `ACDREAM_WB_DIAG=1`.
+2. The fix uses only core OpenGL 3.3+ features (`GL_TIME_ELAPSED`, `glGetQueryObject`, `GL_QUERY_RESULT_AVAILABLE`). No vendor-specific extensions.
+3. `docs/plans/2026-05-11-phase-n6-perf-baseline.md` exists, contains numbers (not placeholders) for the 3 radii × 2 motion modes, contains the surface histogram summary, and closes with a recommendation paragraph.
+4. The roadmap entry at `docs/plans/2026-04-11-roadmap.md:690-705` is amended to reflect the slice split.
+5. `dotnet build` succeeds with no new warnings.
+6. `dotnet test` succeeds with the existing pass/fail baseline (1688 passing, ~8 pre-existing physics/input failures unchanged).
+7. No visible regression in the rendering path — Holtburg outdoor, day/night cycle, entity rendering, transparent surfaces all look the same as before the change.
+
+---
+
+## §10. Risks
+
+| Risk | Likelihood | Mitigation |
+|---|---|---|
+| `ResultAvailable` is 0 even for frame N-3 (driver queues 4+ frames ahead) | Low — would be unusual on desktop GL | Sample is dropped silently; diagnostic prints zeros; user reports it. Fix: bump `GpuQueryRingDepth` to 4. No regression in the render path itself. |
+| Query-pair allocation leaks across init/Dispose cycles | Low | Dispose loop deletes the full ring; existing pattern just gains an array index. |
+| Surface-dump path fires before streaming settles, gets a sparse picture | Medium | Document the procedure as "wait ~10 s after entering world before reading the file." The dump path could also be made re-runnable if needed (deferred unless slice 1 hits this in practice). |
+| Conclusion paragraph in the baseline document is hard to write because the numbers don't clearly favor one direction | Medium — this is the slice's whole purpose | Acknowledge the ambiguity in the document and propose a "slice 1 conclusion plus a short re-brainstorm with the user" flow. The slice still ships if the numbers force a re-brainstorm; the value is in having the numbers, not in pre-deciding the answer. |
+| Hidden vendor-specific behavior in `GL_TIME_ELAPSED` produces non-comparable numbers across hardware | Low: `GL_TIME_ELAPSED` results are reported in nanoseconds per spec, although the timer's actual resolution is implementation-dependent | Document the measurement hardware explicitly in the baseline doc setup section so future runs on different GPUs can be tagged appropriately. |
+
+---
+
+## §11. Out of scope / future work
+
+These are explicitly NOT in slice 1, listed here so the next phase has a clean shopping list:
+
+- **Slice 2 — `TextureCache` cleanup.** Delete orphan `mesh.frag` (verify zero callers post-N.5 amendment). Delete dead entity-style legacy caches (`_handlesByOverridden`, `_handlesByPalette`) that no live renderer reads. Decide on bindless-everywhere vs legacy-island for the remaining `sampler2D` consumers (sky, UI text, particles).
+- **Slice 2 — Particle shader migration.** Tied to C.1.5 outcome; particles migrate after C.1.5 lands more visible content to regression-test against.
+- **Slice 2 — Persistent-mapped buffers.** Conditional on slice 1's baseline showing `BufferSubData` as a hot spot.
+- **Slice 2 — WB atlas adoption.** Conditional on slice 1's surface histogram showing a real opportunity.
+- **C.1.5 — PES emitter wiring.** Portals, chimneys, fireplaces. Separate phase; gets its own brainstorm/spec.
+- **Tier 2 — static/dynamic split with persistent groups.** Separate roadmap at [docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md](../../plans/2026-05-10-perf-tiers-2-3-roadmap.md).
+- **Tier 3 — GPU compute culling.** Depends on Tier 2 first. Same roadmap.
+- **Cross-vendor perf comparison.** Slice 1 is one machine. A NVIDIA companion run is a backlog item, not in scope.
+
+---
+
+## §12. References
+
+- Existing dispatcher code: [src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs](../../../src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs).
+- Existing texture cache: [src/AcDream.App/Rendering/TextureCache.cs](../../../src/AcDream.App/Rendering/TextureCache.cs).
+- Prior perf baseline (style template): [docs/plans/2026-05-09-phase-n5b-perf-baseline.md](../../plans/2026-05-09-phase-n5b-perf-baseline.md).
+- Roadmap N.6 entry: [docs/plans/2026-04-11-roadmap.md:690-705](../../plans/2026-04-11-roadmap.md).
+- Perf tiers 2/3 alternative path: [docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md](../../plans/2026-05-10-perf-tiers-2-3-roadmap.md).
+- Phase C.1 plan with C.1.5 scope: [docs/plans/2026-04-27-phase-c1-pes-particles.md:285-295](../../plans/2026-04-27-phase-c1-pes-particles.md).
--- a/src/AcDream.App/Rendering/GameWindow.cs
+++ b/src/AcDream.App/Rendering/GameWindow.cs
@ -6310,6 +6310,10 @@ public sealed class GameWindow : IDisposable

        _gl!.Clear(ClearBufferMask.ColorBufferBit | ClearBufferMask.DepthBufferBit);

+        // Phase N.6 slice 1: one-shot surface-format histogram dump under
+        // ACDREAM_DUMP_SURFACES=1. Near-zero cost when off (one env-var
+        // check per frame).
+        _textureCache?.TickSurfaceHistogramDumpIfEnabled();
+
        // Phase N.4: drain WB pipeline queues (staged mesh data +
        // GL thread queue). Must happen before any draw work so that
        // resources uploaded this frame are available immediately.
--- a/src/AcDream.App/Rendering/TextureCache.cs
+++ b/src/AcDream.App/Rendering/TextureCache.cs
@ -4,6 +4,7 @@ using AcDream.Core.World;
 using DatReaderWriter;
 using DatReaderWriter.DBObjs;
 using Silk.NET.OpenGL;
+using System.Linq;
 using SurfaceType = DatReaderWriter.Enums.SurfaceType;

 namespace AcDream.App.Rendering;
@ -40,6 +41,20 @@ public sealed unsafe class TextureCache : Wb.ITextureCachePerInstance, IDisposab
    private readonly Dictionary<(uint surfaceId, uint origTexOverride), (uint Name, ulong Handle)> _bindlessByOverridden = new();
    private readonly Dictionary<(uint surfaceId, uint origTexOverride, ulong paletteHash), (uint Name, ulong Handle)> _bindlessByPalette = new();

+    // Phase N.6 slice 1 (2026-05-11): per-upload metadata for the
+    // ACDREAM_DUMP_SURFACES=1 histogram dump path. Populated at upload
+    // time so the dump method doesn't have to query GL state. Keyed by
+    // GL texture name (same key used in cache value tuples). Format
+    // label is "RGBA8_DECODED" for the post-decode upload (all uploads
+    // currently land as RGBA8 regardless of source format).
+    private readonly Dictionary<uint, (int Width, int Height, string Format)> _uploadMetadata = new();
+
+    // Frame counter for the one-shot ACDREAM_DUMP_SURFACES=1 trigger.
+    // Increments per Tick call; fires the dump once at frame index 600
+    // and never again for the session. See spec §5.
+    private int _dumpFrameCounter;
+    private bool _surfaceHistogramAlreadyDumped;
+
    public TextureCache(GL gl, DatCollection dats, Wb.BindlessSupport? bindless = null)
    {
        _gl = gl;
@ -258,6 +273,114 @@ public sealed unsafe class TextureCache : Wb.ITextureCachePerInstance, IDisposab
        return h;
    }

+    /// <summary>
+    /// Phase N.6 slice 1: one-shot surface-format histogram dump for the
+    /// atlas-opportunity audit. Activated by ACDREAM_DUMP_SURFACES=1; fires
+    /// once after BOTH gates pass:
+    /// 1. <c>_dumpFrameCounter &gt;= 600</c> — at least 600 OnRender ticks
+    ///    have elapsed (catches the "we're already past startup boilerplate"
+    ///    bound; ~10s at 60fps, ~3s at 200fps).
+    /// 2. <c>_uploadMetadata.Count &gt;= 100</c> — the cache contains at
+    ///    least 100 uploaded textures, indicating streaming has actually
+    ///    pulled in world content (not just sky/UI/font). The original
+    ///    frame-only gate fired during the login/handshake phase where
+    ///    OnRender ticks at GUI rates but no world has streamed in.
+    /// Output goes to %LOCALAPPDATA%\acdream\n6-surfaces.txt. Near-zero
+    /// cost when off (one env-var check per call). See spec §5 in
+    /// docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md.
+    /// </summary>
+    public void TickSurfaceHistogramDumpIfEnabled()
+    {
+        if (_surfaceHistogramAlreadyDumped) return;
+        if (!string.Equals(System.Environment.GetEnvironmentVariable("ACDREAM_DUMP_SURFACES"), "1", StringComparison.Ordinal)) return;
+        _dumpFrameCounter++;
+        if (_dumpFrameCounter < 600) return;
+        if (_uploadMetadata.Count < 100) return;
+
+        DumpSurfaceHistogram();
+        _surfaceHistogramAlreadyDumped = true;
+    }
+
+    private void DumpSurfaceHistogram()
+    {
+        try
+        {
+            DumpSurfaceHistogramCore();
+        }
+        catch (Exception ex)
+        {
+            // Diagnostic-only path. If the dump file can't be written
+            // (disk full, permission denied, antivirus lock, path too
+            // long) we must NOT crash OnRender — that would invalidate
+            // the very measurement pass this diagnostic is meant to
+            // support. Log to stderr and let the caller mark the dump
+            // as "already done" so it doesn't retry every frame.
+            Console.Error.WriteLine($"[N6-DUMP] Failed to write surface histogram: {ex.Message}");
+        }
+    }
+
+    private void DumpSurfaceHistogramCore()
+    {
+        var localAppData = System.Environment.GetFolderPath(System.Environment.SpecialFolder.LocalApplicationData);
+        var outDir = System.IO.Path.Combine(localAppData, "acdream");
+        System.IO.Directory.CreateDirectory(outDir);
+        var outPath = System.IO.Path.Combine(outDir, "n6-surfaces.txt");
+
+        var sb = new System.Text.StringBuilder();
+        sb.AppendLine($"# acdream surface-format histogram — generated {DateTime.UtcNow:yyyy-MM-ddTHH:mm:ssZ}");
+        sb.AppendLine("# Per-entry: surfaceId(hex), width, height, format, byteCount");
+        sb.AppendLine();
+
+        // Walk every cached entry across the 6 caches, dedupe by GL name.
+        var seen = new HashSet<uint>();
+        long totalBytes = 0;
+        var bucketsByDim = new Dictionary<(int W, int H), int>();
+        var bucketsByFormat = new Dictionary<string, int>();
+        var bucketsByTriple = new Dictionary<(int W, int H, string F), int>();
+
+        void Emit(uint surfaceId, uint name)
+        {
+            if (!seen.Add(name)) return;
+            if (!_uploadMetadata.TryGetValue(name, out var meta)) return;
+            int bytes = meta.Width * meta.Height * 4;
+            totalBytes += bytes;
+            sb.AppendLine($"0x{surfaceId:X8}, {meta.Width}, {meta.Height}, {meta.Format}, {bytes}");
+
+            var dimKey = (meta.Width, meta.Height);
+            bucketsByDim[dimKey] = bucketsByDim.GetValueOrDefault(dimKey) + 1;
+            bucketsByFormat[meta.Format] = bucketsByFormat.GetValueOrDefault(meta.Format) + 1;
+            var tripleKey = (meta.Width, meta.Height, meta.Format);
+            bucketsByTriple[tripleKey] = bucketsByTriple.GetValueOrDefault(tripleKey) + 1;
+        }
+
+        foreach (var kv in _handlesBySurfaceId)         Emit(kv.Key, kv.Value);
+        foreach (var kv in _handlesByOverridden)        Emit(kv.Key.surfaceId, kv.Value);
+        foreach (var kv in _handlesByPalette)           Emit(kv.Key.surfaceId, kv.Value);
+        foreach (var kv in _bindlessBySurfaceId)        Emit(kv.Key, kv.Value.Name);
+        foreach (var kv in _bindlessByOverridden)       Emit(kv.Key.surfaceId, kv.Value.Name);
+        foreach (var kv in _bindlessByPalette)          Emit(kv.Key.surfaceId, kv.Value.Name);
+
+        sb.AppendLine();
+        sb.AppendLine("# Rollups");
+        sb.AppendLine($"# Total unique GL textures: {seen.Count}");
+        sb.AppendLine($"# Total bytes (sum of W*H*4): {totalBytes}");
+
+        sb.AppendLine("# Top 10 (W,H) dimension buckets:");
+        foreach (var kv in bucketsByDim.OrderByDescending(kv => kv.Value).Take(10))
+            sb.AppendLine($"#   {kv.Key.W}x{kv.Key.H}: {kv.Value}");
+
+        sb.AppendLine("# Format buckets:");
+        foreach (var kv in bucketsByFormat.OrderByDescending(kv => kv.Value))
+            sb.AppendLine($"#   {kv.Key}: {kv.Value}");
+
+        sb.AppendLine("# Top 10 (W,H,format) triples — atlas-opportunity input:");
+        foreach (var kv in bucketsByTriple.OrderByDescending(kv => kv.Value).Take(10))
+            sb.AppendLine($"#   {kv.Key.W}x{kv.Key.H} {kv.Key.F}: {kv.Value}");
+
+        System.IO.File.WriteAllText(outPath, sb.ToString());
+        Console.WriteLine($"[N6-DUMP] Surface histogram written to {outPath} ({seen.Count} textures, {totalBytes} bytes)");
+    }
+
    private DecodedTexture DecodeFromDats(uint surfaceId, uint? origTextureOverride, PaletteOverride? paletteOverride)
    {
        var surface = _dats.Get<Surface>(surfaceId);
@ -364,6 +487,7 @@ public sealed unsafe class TextureCache : Wb.ITextureCachePerInstance, IDisposab
        _gl.TexParameter(TextureTarget.Texture2D, TextureParameterName.TextureWrapT, (int)TextureWrapMode.Repeat);

        _gl.BindTexture(TextureTarget.Texture2D, 0);
+        _uploadMetadata[tex] = (decoded.Width, decoded.Height, "RGBA8_DECODED");
        return tex;
    }

@ -396,6 +520,7 @@ public sealed unsafe class TextureCache : Wb.ITextureCachePerInstance, IDisposab
        _gl.TexParameter(TextureTarget.Texture2DArray, TextureParameterName.TextureWrapT,     (int)TextureWrapMode.Repeat);

        _gl.BindTexture(TextureTarget.Texture2DArray, 0);
+        _uploadMetadata[tex] = (decoded.Width, decoded.Height, "RGBA8_DECODED");
        return tex;
    }

--- a/src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs
+++ b/src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs
@ -152,8 +152,16 @@ public sealed unsafe class WbDrawDispatcher : IDisposable
    private readonly System.Diagnostics.Stopwatch _cpuStopwatch = new();
    private readonly long[] _cpuSamples = new long[256];   // microseconds
    private int _cpuSampleCursor;
-    private uint _gpuQueryOpaque;
-    private uint _gpuQueryTransparent;
+    // GPU timing uses a ring of 3 query-pair slots so the read of frame N-3's
+    // result lands when the GPU has finished (~50ms after issue on a typical
+    // 60fps frame). Ring of 3 is the vendor-neutral choice: NVIDIA drivers with
+    // triple-buffering+vsync can queue ~3 frames ahead, AMD typically 1-2,
+    // Intel iGPUs vary. ResultAvailable is the safety guard if the GPU is
+    // still working when we try to read.
+    private const int GpuQueryRingDepth = 3;
+    private readonly uint[] _gpuQueryOpaque      = new uint[GpuQueryRingDepth];
+    private readonly uint[] _gpuQueryTransparent = new uint[GpuQueryRingDepth];
+    private int _gpuQueryFrameIndex;
    private readonly long[] _gpuSamples = new long[256];   // microseconds
    private int _gpuSampleCursor;
    private bool _gpuQueriesInitialized;
@ -346,8 +354,11 @@ public sealed unsafe class WbDrawDispatcher : IDisposable

        if (diag && !_gpuQueriesInitialized)
        {
-            _gpuQueryOpaque      = _gl.GenQuery();
-            _gpuQueryTransparent = _gl.GenQuery();
+            for (int i = 0; i < GpuQueryRingDepth; i++)
+            {
+                _gpuQueryOpaque[i]      = _gl.GenQuery();
+                _gpuQueryTransparent[i] = _gl.GenQuery();
+            }
            _gpuQueriesInitialized = true;
        }

@ -754,6 +765,33 @@ public sealed unsafe class WbDrawDispatcher : IDisposable
        if (string.Equals(Environment.GetEnvironmentVariable("ACDREAM_NO_CULL"), "1", StringComparison.Ordinal))
            _gl.Disable(EnableCap.CullFace);

+        // GPU timing: compute this frame's ring slot. We read frame N-3's
+        // result (the oldest data in the ring) before overwriting it with
+        // frame N's queries. Hoisted to function scope so both the opaque
+        // and transparent passes below can reference gpuQuerySlot. See spec
+        // §3 Q1/Q2 + §4 in
+        // docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md.
+        int gpuQuerySlot = _gpuQueryFrameIndex % GpuQueryRingDepth;
+        // diag is part of the gate so the read/issue/increment trio stays
+        // symmetric — without it, toggling ACDREAM_WB_DIAG mid-session would
+        // freeze the frame counter (gated by diag below) while the read kept
+        // re-reading the same slot, producing duplicate stale samples.
+        if (diag && _gpuQueriesInitialized && _gpuQueryFrameIndex >= GpuQueryRingDepth)
+        {
+            _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot], QueryObjectParameterName.ResultAvailable, out int avail);
+            if (avail != 0)
+            {
+                _gl.GetQueryObject(_gpuQueryOpaque[gpuQuerySlot],      QueryObjectParameterName.Result, out ulong opaqueNs);
+                _gl.GetQueryObject(_gpuQueryTransparent[gpuQuerySlot], QueryObjectParameterName.Result, out ulong transNs);
+                long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
+                _gpuSamples[_gpuSampleCursor] = gpuUs;
+                _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
+            }
+            // If avail==0 the sample is dropped silently. MedianMicros
+            // computes over the non-zero subset, so dropped samples don't
+            // poison the median.
+        }
+
        // ── Phase 7: opaque pass ─────────────────────────────────────────────
        if (_opaqueDrawCount > 0)
        {
@ -771,7 +809,7 @@ public sealed unsafe class WbDrawDispatcher : IDisposable
            // mesh_modern.vert for why this is needed.
            _shader.SetInt("uDrawIDOffset", 0);
            _gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer);
-            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque);
+            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryOpaque[gpuQuerySlot]);
            _gl.MultiDrawElementsIndirect(
                PrimitiveType.Triangles,
                DrawElementsType.UnsignedShort,
@ -820,7 +858,7 @@ public sealed unsafe class WbDrawDispatcher : IDisposable
            _gl.CullFace(TriangleFace.Back);
            _gl.FrontFace(FrontFaceDirection.Ccw);
            _shader.SetInt("uRenderPass", 1);
-            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent);
+            if (diag && _gpuQueriesInitialized) _gl.BeginQuery(QueryTarget.TimeElapsed, _gpuQueryTransparent[gpuQuerySlot]);
            _gl.MultiDrawElementsIndirect(
                PrimitiveType.Triangles,
                DrawElementsType.UnsignedShort,
@ -843,21 +881,10 @@ public sealed unsafe class WbDrawDispatcher : IDisposable
            _cpuSamples[_cpuSampleCursor] = cpuUs;
            _cpuSampleCursor = (_cpuSampleCursor + 1) % _cpuSamples.Length;

-            // Read GPU samples non-blocking; the result for the previous frame's
-            // queries should be ready by now. If not, drop the sample (don't stall
-            // the CPU waiting for the GPU).
-            if (_gpuQueriesInitialized)
-            {
-                _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.ResultAvailable, out int avail);
-                if (avail != 0)
-                {
-                    _gl.GetQueryObject(_gpuQueryOpaque, QueryObjectParameterName.Result, out ulong opaqueNs);
-                    _gl.GetQueryObject(_gpuQueryTransparent, QueryObjectParameterName.Result, out ulong transNs);
-                    long gpuUs = (long)((opaqueNs + transNs) / 1000UL);
-                    _gpuSamples[_gpuSampleCursor] = gpuUs;
-                    _gpuSampleCursor = (_gpuSampleCursor + 1) % _gpuSamples.Length;
-                }
-            }
+            // GPU sample read happens BEFORE issuing the next frame's queries
+            // (see the ring-slot read ahead of the opaque pass). Increment the
+            // frame counter here so the next call computes a fresh slot.
+            if (_gpuQueriesInitialized) _gpuQueryFrameIndex++;

            _drawsIssued     += _opaqueDrawCount + _transparentDrawCount;
            _instancesIssued += totalInstances;
@ -1139,8 +1166,11 @@ public sealed unsafe class WbDrawDispatcher : IDisposable
        _gl.DeleteBuffer(_indirectBuffer);
        if (_gpuQueriesInitialized)
        {
-            _gl.DeleteQuery(_gpuQueryOpaque);
-            _gl.DeleteQuery(_gpuQueryTransparent);
+            for (int i = 0; i < GpuQueryRingDepth; i++)
+            {
+                _gl.DeleteQuery(_gpuQueryOpaque[i]);
+                _gl.DeleteQuery(_gpuQueryTransparent[i]);
+            }
        }
    }