# Phase N.6 slice 1: perf baseline at Holtburg

**Created:** 2026-05-11.

**Spec:** [docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md](../superpowers/specs/2026-05-11-phase-n6-slice1-design.md)

**Measured against commit:** `25cb147` (Task 1 final: gpu_us fix + diag-gate symmetry follow-up)

**Purpose:** Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data.

---

## §1. Setup

- **Hardware:** Radeon RX 9070 XT
- **Resolution:** 1440p (2560×1440)
- **Quality preset:** High (default)
- **Connection:** live ACE at `127.0.0.1:9000`
- **Character:** `+Acdream` at Holtburg
- **Sky / time:** clear midday (F7 → Noon, F10 → Clear)
- **Build:** Debug
- **Date measured:** 2026-05-11
- **Environment overrides:** `ACDREAM_WB_DIAG=1`, `ACDREAM_STREAM_RADIUS=`

Note: `ACDREAM_STREAM_RADIUS=N` forces N₁=N (every landblock within near-tier radius N at full detail). This is NOT the production A.5 default (N₁=4 / N₂=12), which CLAUDE.md characterizes as a comfortable 200–400 FPS at the default preset. These measurements characterize the scaling curve (what happens as the near-tier radius grows), not current production behavior.

FPS was not captured directly (no window-title screenshot per run). It can be derived as `1e6 / total_frame_time_us`, but the dispatcher's `cpu_us` is only part of the frame: terrain, sky, particles, UI, GL submission overhead, and swap-buffer wait are not included.

## §2. Dispatch CPU / GPU numbers

Each cell records the median of the last 3 `[WB-DIAG]` lines from a ~30s stable window. `entSeen / entDrawn / groups / drawsIssued` are also from those lines (values per 5s bucket). The FPS column is omitted because FPS was not captured, per the note above.
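
Although FPS was not recorded, the dispatcher's `cpu_us` still puts a ceiling on it, because a frame cannot finish faster than the dispatcher alone. The sketch below is illustrative Python rather than project code. It assumes the dispatcher runs once per frame on the render thread, and `fps_ceiling` is a hypothetical helper name. The inputs are `cpu_us` medians from the §2 table.

```python
def fps_ceiling(dispatch_cpu_us: float) -> float:
    """Upper bound on FPS if the dispatcher were the only per-frame cost.

    Real FPS is lower: terrain, sky, particles, UI, GL submission and
    swap-buffer wait are not part of cpu_us.
    """
    if dispatch_cpu_us <= 0:
        raise ValueError("dispatch_cpu_us must be positive")
    return 1e6 / dispatch_cpu_us


if __name__ == "__main__":
    # (radius, motion, cpu_us median) taken from the §2 table.
    for radius, motion, cpu_us in [
        (4, "standstill", 3208),
        (8, "standstill", 6732),
        (12, "standstill", 12853),
        (12, "walking", 16320),
    ]:
        print(f"r{radius} {motion}: at most {fps_ceiling(cpu_us):.0f} FPS")
```

These are ceilings only. They say nothing about what a player would actually see, which depends on the rest of the frame.
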

| Radius | Motion     | cpu_us median | cpu_us p95 | gpu_us median | gpu_us p95 | entSeen (per 5s) | entDrawn (per 5s) | groups | drawsIssued (per 5s) |
|--------|------------|---------------|------------|---------------|------------|------------------|-------------------|--------|----------------------|
| 4      | standstill | 3,208         | 3,313      | 93            | 95         | 16.9M            | 15.5M             | 1,216  | 1.65M                |
| 4      | walking    | 2,967         | 3,112      | 95            | 120        | 13.9M            | 13.9M             | 1,850  | 1.45M                |
| 8      | standstill | 6,732         | 7,199      | 126           | 130        | 19.8M            | 19.8M             | 333    | 218K                 |
| 8      | walking    | 6,572         | 6,927      | 96            | 113        | 18.1M            | 18.0M             | 534    | 245K                 |
| 12     | standstill | 12,853        | 13,525     | 344           | 507        | 19.6M            | 19.6M             | 541    | 184K                 |
| 12     | walking    | 16,320        | 17,241     | 553           | 603        | 17.8M            | 17.8M             | 898    | 200K                 |

**Notable:** `meshMissing` counts at r4 standstill (~1.45M per 5s) drop to near-zero while walking. This suggests the static-entity slow path's mesh-load lifecycle has some delay before populating for newly-streamed content. Not fatal (it doesn't affect rendered output), but worth a follow-up issue in `docs/ISSUES.md` if it persists in normal play.

## §3. Surface-format histogram

From `ACDREAM_DUMP_SURFACES=1` at radius=12, ~30s after enter-world. Output written to `%LOCALAPPDATA%\acdream\n6-surfaces.txt`.

- **Total unique GL textures:** 760
- **Total bytes (sum of W×H×4):** 96,387,584 (~96.4 MB)

**Top 10 (W, H) dimension buckets:**

| Dimensions | Count | Share |
|------------|-------|-------|
| 128×128    | 236   | 31%   |
| 64×64      | 111   | 15%   |
| 256×256    | 102   | 13%   |
| 128×256    | 71    | 9%    |
| 64×128     | 69    | 9%    |
| 256×128    | 48    | 6%    |
| 128×64     | 39    | 5%    |
| 512×512    | 30    | 4%    |
| 8×8        | 18    | 2%    |
| 32×32      | 14    | 2%    |

**Format distribution:**

| Format        | Count | Share |
|---------------|-------|-------|
| RGBA8_DECODED | 760   | 100%  |

All uploads land as RGBA8 regardless of source format (INDEX16, P8, DXT, BGRA, etc. all decode through `TextureHelpers` before upload).
The source-format diversity is real but invisible to GL after the decode step.

**Top 10 (W, H, format) triples (atlas-opportunity input):**

Same as the dimension buckets above, since there is only one format. The top-3 triples (128×128, 64×64, 256×256) cover 449 of 760 surfaces = **59%**.

**Atlas-opportunity score: 59%** of surfaces fall into the top-3 (W, H, format) triples. A conventional rule of thumb is that >30% concentration into the top buckets makes atlas packing worth the implementation cost for memory savings; this measurement is well above that. However, see §4 for why atlas is not the right next step despite the high score.

## §4. Conclusion + next-phase recommendation

### What the data shows

**The entity dispatcher is strongly CPU-bound.** At every radius, median CPU time exceeds median GPU time by roughly 30–70× (lowest at radius=12 walking, highest at radius=8 walking). At radius=12 standstill: 12.9 ms CPU vs 0.34 ms GPU. At radius=12 walking: 16.3 ms CPU vs 0.55 ms GPU. There is no GPU bottleneck.

**GPU is wildly under-utilized.** The highest gpu_us p95 observed is 603 µs at radius=12 walking, against a frame budget of about 16,667 µs at 60 FPS. The GPU is working at roughly 3.6% of its 60 FPS capacity for entity rendering alone. Even accounting for terrain, sky, particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged.

**CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with visible-LB count.** As N₁ grows from 4 → 8 → 12, median cpu_us grows from 3.2 ms → 6.7 ms → 12.9 ms, roughly 1.0× → 2.1× → 4.0× the r4 baseline. The visible-LB count scales as `(2N+1)²`: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0× vs 7.7× expected if every LB cost the same). Frustum culling discards most far LBs early, but the outer per-LB walk still has to touch each one.
The Tier 1 entity-classification cache (`EntityClassificationCache`, shipped as #53) wins on the inner loop (per-entity classification avoided on cache hits) but the outer walk dominates as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses by eliminating the per-frame LB scan entirely.

**Radius=12 is not the production scenario.** `ACDREAM_STREAM_RADIUS=12` forces N₁=12 (625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81 full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes as comfortable 200–400 FPS at the default preset. The numbers above characterize the scaling curve for headroom analysis, not the experience a typical player sees.

**Atlas opportunity is high (59%) but the win is memory-only, and modest.** With 96 MB of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the top buckets share single `Texture2DArray` objects rather than each surface owning its own 1-layer array. The primary wins of atlas (fewer sampler switches, fewer texture binds) are already near-zero because bindless textures are made resident once at upload and never bound per draw. The remaining win is the per-array metadata overhead × N surfaces, which is bounded but not dramatic given all surfaces are already power-of-two and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on the order of low-MB to ~10 MB, not a 40–50% halving. GPU is not bottlenecked on sampler switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this directly), so atlas adoption would cost 1–2 weeks of implementation risk for a memory saving the process doesn't currently need at 96 MB.
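
The arithmetic behind the scaling claims in this subsection can be checked with a short script. This is a sketch in Python, not part of the acdream codebase. The inputs are the standstill medians and the worst gpu_us p95 transcribed from §2, the `(2N+1)²` landblock count is the one stated above, and the helper names (`visible_landblocks`, `scaling_rows`) are made up for illustration.

```python
# Standstill medians from the §2 table, in microseconds.
STANDSTILL = {
    4: {"cpu_us": 3208, "gpu_us": 93},
    8: {"cpu_us": 6732, "gpu_us": 126},
    12: {"cpu_us": 12853, "gpu_us": 344},
}
WORST_GPU_P95_US = 603       # radius 12, walking
FRAME_BUDGET_US = 1e6 / 60   # one frame at 60 FPS


def visible_landblocks(radius: int) -> int:
    """Landblocks in a (2N+1) x (2N+1) square centred on the player."""
    return (2 * radius + 1) ** 2


def scaling_rows(data=STANDSTILL, base_radius=4):
    """Per-radius growth factors relative to the base radius."""
    base_cpu = data[base_radius]["cpu_us"]
    base_lbs = visible_landblocks(base_radius)
    rows = []
    for radius in sorted(data):
        cpu = data[radius]["cpu_us"]
        gpu = data[radius]["gpu_us"]
        lbs = visible_landblocks(radius)
        rows.append({
            "radius": radius,
            "lbs": lbs,
            "cpu_x": cpu / base_cpu,      # CPU growth vs. base radius
            "lb_x": lbs / base_lbs,       # landblock growth vs. base radius
            "cpu_per_lb_us": cpu / lbs,   # average dispatcher cost per LB
            "cpu_over_gpu": cpu / gpu,    # how far CPU exceeds GPU
        })
    return rows


if __name__ == "__main__":
    for r in scaling_rows():
        print(
            f"r{r['radius']:>2}: {r['lbs']:>3} LBs, "
            f"cpu {r['cpu_x']:.1f}x, LBs {r['lb_x']:.1f}x, "
            f"{r['cpu_per_lb_us']:.1f} us/LB, CPU/GPU {r['cpu_over_gpu']:.0f}x"
        )
    share = WORST_GPU_P95_US / FRAME_BUDGET_US
    print(f"worst gpu_us p95 uses {share:.1%} of a 60 FPS frame")
```

The falling `cpu_per_lb_us` column is the sublinearity in another form: the average cost per landblock drops as the radius grows, while the total still climbs because every landblock is visited.
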

### Recommendation

**Primary: do C.1.5 next (PES emitter wiring: portals, chimneys, fireplaces).** Four reasons: (a) the production dispatcher is already comfortable at the default N₁=4 preset per the CLAUDE.md notes; (b) the two slice-2 items that were "conditional on baseline" data (atlas adoption and persistent-mapped buffers) are not justified, because the GPU is not bottlenecked; (c) C.1.5 fills a visible content gap that has been open since C.1 shipped and is in the roadmap queue ahead of N.6 slice 2; (d) C.1.5 stabilizes the particle path before any future shader migration work in slice 2 touches `particle.frag`. Starting point for C.1.5 scoping: `docs/plans/2026-04-27-phase-c1-pes-particles.md` lines 285–295.

**Secondary (after C.1.5 lands): N.6 slice 2 with reduced scope.** The baseline data justifies dropping atlas adoption and persistent-mapped buffers from slice 2 entirely. What remains is a ~1-day cleanup: retire orphan `mesh.frag` (verify zero callers post-N.5 amendment), collapse dead `_handlesByOverridden` / `_handlesByPalette` legacy caches once their callers are confirmed gone, and migrate `particle.frag` to bindless sampling after C.1.5 stabilizes the path. Slice 2 is a cleanup sprint, not a performance phase.

**Tertiary option (if perf escalation becomes pressing): Tier 2 first.** The scaling curve (3.2 → 6.7 → 12.9 ms as N₁ grows 4 → 8 → 12) confirms the per-LB walk is the bottleneck, which is exactly what Tier 2's persistent-group structure at `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` addresses. Not urgent at the current default N₁=4; worth revisiting if a future quality preset wants N₁=8 as default or if the 200–400 FPS range at N₁=4 shrinks after more content is streamed.

**Decision rule for revisiting:** if future measurement at the default preset shows cpu_us median > 5,000 µs or gpu_us p95 > 8,000 µs, re-open the escalation question. Otherwise, hold the C.1.5 → reduced-slice-2 sequence.

## §5. Reproducing the measurements

Raw `[WB-DIAG]` output from each run was inspected live during measurement, and the median of the last three steady-state lines from each scenario was transcribed into §2. The raw launch logs were not preserved; the captured medians in §2 are the canonical record.

To reproduce on the same hardware:

```powershell
# Client data directory and live-server connection (local ACE).
$env:ACDREAM_DAT_DIR = "$env:USERPROFILE\Documents\Asheron's Call"
$env:ACDREAM_LIVE = "1"
$env:ACDREAM_TEST_HOST = "127.0.0.1"
$env:ACDREAM_TEST_PORT = "9000"
$env:ACDREAM_TEST_USER = "testaccount"
$env:ACDREAM_TEST_PASS = "testpassword"

# Diagnostics on, near-tier radius pinned for this run.
$env:ACDREAM_WB_DIAG = "1"
$env:ACDREAM_STREAM_RADIUS = "4"   # or 8, 12

# Run the existing Debug build and keep a copy of the output.
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 |
    Tee-Object -FilePath "baseline.log"
```

Stand still for ~30 s at the target radius (60 s at radius 12 to let streaming settle), or walk N→E→S→W across one landblock. Then `Select-String -Path baseline.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 3` captures the steady-state numbers.

For the surface histogram, also set `$env:ACDREAM_DUMP_SURFACES = "1"`, stay in-world ~30 s after streaming has loaded ≥100 textures (the cache-size gate), then read `$env:LOCALAPPDATA\acdream\n6-surfaces.txt`.
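
Once the last three readings of a metric have been read off the log, reducing them to a table cell and checking the §4 decision rule is a few lines of arithmetic. The sketch below is illustrative Python, not project tooling. It takes numbers that have already been extracted by hand, since the exact `[WB-DIAG]` line layout is not documented here. The function names are hypothetical, and the thresholds are the ones stated in the decision rule.

```python
from statistics import median

# Thresholds from the §4 decision rule, in microseconds.
CPU_MEDIAN_LIMIT_US = 5_000
GPU_P95_LIMIT_US = 8_000


def reduce_last_three(readings):
    """Median of the last three readings, mirroring `Select-Object -Last 3`."""
    tail = list(readings)[-3:]
    if len(tail) < 3:
        raise ValueError("need at least three steady-state readings")
    return median(tail)


def needs_escalation(cpu_us_median, gpu_us_p95):
    """True when either §4 threshold is exceeded."""
    return cpu_us_median > CPU_MEDIAN_LIMIT_US or gpu_us_p95 > GPU_P95_LIMIT_US


if __name__ == "__main__":
    # Placeholder readings for illustration only, not measured values.
    print(reduce_last_three([100, 400, 300, 200]))

    # §2 table, radius 4 standstill: cpu_us median 3,208, gpu_us p95 95.
    print(needs_escalation(3208, 95))
    # §2 table, radius 8 standstill: cpu_us median 6,732, gpu_us p95 130.
    print(needs_escalation(6732, 130))
```

The rule is defined for the default preset, so the radius 8 call only shows how the check behaves when the CPU threshold is crossed.
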