Final code review of slice 1 flagged one Important issue (the spec's "zero cost when off" claim for the surface-dump path is technically violated — _uploadMetadata always writes one dict entry per upload regardless of env var) plus minor doc/consistency gaps. Applied: 1. Spec §5 "Cost when off": dropped the "Zero" claim; replaced with "Negligible — one Dictionary write per upload (~30-50 KB at Holtburg) plus a hash-table write per upload. Expensive work (file I/O, histogram construction) is still env-gated." This matches reality. 2. Baseline doc §5: rewrote from "Raw logs (scratch, can be deleted)" referencing files that were never preserved in this worktree, to "Reproducing the measurements" with the actual PowerShell launch commands. Honest about the raw logs not being kept; the captured medians in section 2 are the canonical record. 3. New issue #55 filed in docs/ISSUES.md — static-entity slow path reports ~1.45M meshMissing/5s at r4 standstill, drops to ~0 when walking. LOW severity (no visible regression), hypothesis points at a "permanently-missing entity gets re-classified every frame" pattern that Tier 1 cache doesn't cover. 4. Roadmap shipped table: renamed "N.6.1" row to "N.6 slice 1" to match every other artifact's naming. Search-discoverability fix. None of these change the slice's conclusion or next-phase recommendation (C.1.5 first, then reduced-scope slice 2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase N.6 slice 1 — perf baseline at Holtburg
Created: 2026-05-11.
Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md
Measured against commit: 25cb147 (Task 1 final — gpu_us fix + diag-gate symmetry follow-up)
Purpose: Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data.
§1. Setup
- Hardware: Radeon RX 9070 XT
- Resolution: 1440p (2560×1440)
- Quality preset: High (default)
- Connection: live ACE at 127.0.0.1:9000
- Character: +Acdream at Holtburg
- Sky / time: clear midday (F7 → Noon, F10 → Clear)
- Build: Debug
- Date measured: 2026-05-11
- Environment overrides: ACDREAM_WB_DIAG=1, ACDREAM_STREAM_RADIUS=<per-run>
Note: ACDREAM_STREAM_RADIUS=N forces N₁=N (all N near-tier landblocks at full detail).
This is NOT the production A.5 default (N₁=4 / N₂=12), which was characterized in
CLAUDE.md as comfortable 200–400 FPS at the default preset. These measurements
characterize the scaling curve — what happens as near-tier radius grows — not current
production behavior. FPS was not captured directly (no window-title screenshot per run);
it can be derived from (1e6 / total_frame_time_us) but the dispatcher's cpu_us is
only part of the frame (terrain, sky, particles, UI, GL submission overhead, and
swap-buffer wait are not included).
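As a rough illustration of the derivation mentioned in the note above, the sketch below converts a frame time in microseconds to FPS. The function name is hypothetical. Feeding it the dispatcher's cpu_us alone gives only a ceiling on FPS, because the rest of the frame is not included in that number.

```python
def fps_from_frame_time_us(frame_time_us: float) -> float:
    """Convert a total frame time in microseconds to frames per second."""
    if frame_time_us <= 0:
        raise ValueError("frame time must be positive")
    return 1e6 / frame_time_us


# Dispatcher-only cpu_us median at radius=4 standstill (from the table in §2).
# This is a ceiling, not the real FPS: terrain, sky, particles, UI,
# GL submission overhead and swap-buffer wait all add to the true frame time.
dispatcher_cpu_us = 3208
fps_ceiling = fps_from_frame_time_us(dispatcher_cpu_us)
print(f"FPS ceiling from dispatcher alone: {fps_ceiling:.0f}")
```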
§2. Dispatch CPU / GPU numbers
Each cell records the median of the last 3 [WB-DIAG] lines from a ~30s stable window.
entSeen / entDrawn / groups / drawsIssued are also from those lines (values per 5s bucket).
FPS column omitted — not captured per the note above.
| Radius | Motion | cpu_us median | cpu_us p95 | gpu_us median | gpu_us p95 | entSeen (per 5s) | entDrawn (per 5s) | groups | drawsIssued (per 5s) |
|---|---|---|---|---|---|---|---|---|---|
| 4 | standstill | 3,208 | 3,313 | 93 | 95 | 16.9M | 15.5M | 1,216 | 1.65M |
| 4 | walking | 2,967 | 3,112 | 95 | 120 | 13.9M | 13.9M | 1,850 | 1.45M |
| 8 | standstill | 6,732 | 7,199 | 126 | 130 | 19.8M | 19.8M | 333 | 218K |
| 8 | walking | 6,572 | 6,927 | 96 | 113 | 18.1M | 18.0M | 534 | 245K |
| 12 | standstill | 12,853 | 13,525 | 344 | 507 | 19.6M | 19.6M | 541 | 184K |
| 12 | walking | 16,320 | 17,241 | 553 | 603 | 17.8M | 17.8M | 898 | 200K |
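The per-cell aggregation described above (median of the last 3 lines) can be sketched as follows. The helper name is hypothetical and the sample values are made up purely for illustration; they are not measurements from any run.

```python
from statistics import median


def steady_state_median(samples, window=3):
    """Median of the last `window` samples, mirroring how each cell was filled."""
    if len(samples) < window:
        raise ValueError(f"need at least {window} samples, got {len(samples)}")
    return median(samples[-window:])


# Made-up readings: early values are still settling, the last three are stable.
illustrative_samples = [900, 700, 510, 500, 520]
cell_value = steady_state_median(illustrative_samples)
print(cell_value)
```

Taking only the tail of the window keeps warm-up readings from the start of a run out of the reported figure.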
Notable: meshMissing counts at r4 standstill (~1.45M per 5s) drop to near-zero while
walking. This suggests the static-entity slow path's mesh-load lifecycle has some delay
before populating for newly-streamed content. Not fatal — doesn't affect rendered output —
but worth a follow-up issue in docs/ISSUES.md if it persists in normal play.
§3. Surface-format histogram
From ACDREAM_DUMP_SURFACES=1 at radius=12, ~30s after enter-world.
Output written to %LOCALAPPDATA%\acdream\n6-surfaces.txt.
- Total unique GL textures: 760
- Total bytes (sum of W×H×4): 96,387,584 (~96.4 MB)
Top 10 (W, H) dimension buckets:
| Dimensions | Count | Share |
|---|---|---|
| 128×128 | 236 | 31% |
| 64×64 | 111 | 15% |
| 256×256 | 102 | 13% |
| 128×256 | 71 | 9% |
| 64×128 | 69 | 9% |
| 256×128 | 48 | 6% |
| 128×64 | 39 | 5% |
| 512×512 | 30 | 4% |
| 8×8 | 18 | 2% |
| 32×32 | 14 | 2% |
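The shares in the table, and the bytes each bucket contributes, can be recomputed with a short sketch. It assumes 4 bytes per texel with no mip levels, matching the W×H×4 total quoted above; the constant and function names are illustrative.

```python
# (width, height, count) for the top-10 dimension buckets in the table above.
TOP_BUCKETS = [
    (128, 128, 236), (64, 64, 111), (256, 256, 102), (128, 256, 71),
    (64, 128, 69), (256, 128, 48), (128, 64, 39), (512, 512, 30),
    (8, 8, 18), (32, 32, 14),
]
TOTAL_TEXTURES = 760
TOTAL_BYTES = 96_387_584
BYTES_PER_TEXEL = 4  # assumption: 4 bytes per texel, base level only


def bucket_bytes(width, height, count):
    """Bytes held by `count` textures of the given dimensions."""
    return width * height * BYTES_PER_TEXEL * count


top10_count = sum(count for _, _, count in TOP_BUCKETS)
top10_bytes = sum(bucket_bytes(w, h, c) for w, h, c in TOP_BUCKETS)

for w, h, c in TOP_BUCKETS:
    share = 100 * c / TOTAL_TEXTURES
    megabytes = bucket_bytes(w, h, c) / 1e6
    print(f"{w}x{h}: {c} textures, {share:.0f}% of count, {megabytes:.1f} MB")

print(f"top 10 buckets: {top10_count}/{TOTAL_TEXTURES} textures, "
      f"{top10_bytes / 1e6:.1f} of {TOTAL_BYTES / 1e6:.1f} MB")
```

Under these assumptions the ten buckets cover 738 of the 760 textures and about 94.7 MB of the 96.4 MB total. Count share and byte share diverge: the 30 textures at 512×512 are 4% of the count but roughly 31 MB.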
Format distribution:
| Format | Count | Share |
|---|---|---|
| RGBA8_DECODED | 760 | 100% |
All uploads land as RGBA8 regardless of source format (INDEX16, P8, DXT, BGRA, etc.
all decode through TextureHelpers before upload). The source-format diversity is real
but invisible to GL after the decode step.
Top 10 (W, H, format) triples — atlas-opportunity input:
Same as the dimension buckets above since there is only one format. The top-3 triples (128×128, 64×64, 256×256) cover 449 of 760 surfaces = 59%.
Atlas-opportunity score: 59% of surfaces fall into the top-3 (W, H, format) triples. A conventional rule-of-thumb is that >30% concentration into the top buckets makes atlas packing worth the implementation cost for memory savings; this measurement is well above that. However, see §4 for why atlas is not the right next step despite the high score.
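A minimal sketch of the score as described here: the fraction of all surfaces that fall into the k largest buckets, compared against the 30% rule of thumb. The function and constant names are hypothetical.

```python
ATLAS_RULE_OF_THUMB = 0.30  # concentration threshold quoted in the text


def top_k_concentration(bucket_counts, k, total):
    """Fraction of `total` items that fall in the k largest buckets."""
    largest = sorted(bucket_counts, reverse=True)[:k]
    return sum(largest) / total


# Counts from the top-10 dimension buckets; 760 unique textures overall.
counts = [236, 111, 102, 71, 69, 48, 39, 30, 18, 14]
score = top_k_concentration(counts, k=3, total=760)
print(f"top-3 concentration: {score:.0%}, "
      f"above rule of thumb: {score > ATLAS_RULE_OF_THUMB}")
```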
§4. Conclusion + next-phase recommendation
What the data shows
The entity dispatcher is strongly CPU-bound. At every radius, median CPU time exceeds median GPU time by roughly 30–70× (lowest at radius=12 walking, highest at radius=8 walking). At radius=12 standstill: 12.9 ms CPU vs 0.34 ms GPU. At radius=12 walking: 16.3 ms CPU vs 0.55 ms GPU. There is no GPU bottleneck.
GPU is wildly under-utilized. The highest gpu_us p95 observed is 603 µs at radius=12 walking — against a 16,600 µs frame budget at 60 FPS. The GPU is working at roughly 3.6% of its 60fps capacity for entity rendering alone. Even accounting for terrain, sky, particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged.
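Both observations can be checked directly against the §2 table. The sketch below recomputes the CPU/GPU ratio per scenario and the share of a 60 FPS frame budget taken by gpu_us p95; variable names are illustrative.

```python
# (radius, motion): (cpu_us median, gpu_us median, gpu_us p95), copied from §2.
ROWS = {
    (4, "standstill"): (3208, 93, 95),
    (4, "walking"): (2967, 95, 120),
    (8, "standstill"): (6732, 126, 130),
    (8, "walking"): (6572, 96, 113),
    (12, "standstill"): (12853, 344, 507),
    (12, "walking"): (16320, 553, 603),
}
FRAME_BUDGET_US = 1e6 / 60  # about 16,667 us per frame at 60 FPS

ratios = {}
gpu_budget_share = {}
for key, (cpu_med, gpu_med, gpu_p95) in ROWS.items():
    ratios[key] = cpu_med / gpu_med
    gpu_budget_share[key] = 100 * gpu_p95 / FRAME_BUDGET_US
    radius, motion = key
    print(f"r{radius} {motion}: CPU/GPU = {ratios[key]:.1f}x, "
          f"gpu p95 = {gpu_budget_share[key]:.1f}% of 60 FPS budget")

worst_gpu_share = max(gpu_budget_share.values())
```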
CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with
visible-LB count. As N₁ grows from 4 → 8 → 12, median cpu_us grows from 3.2 ms →
6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the r4 baseline. The visible-LB count
scales as (2N+1)²: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0×
vs 7.7× expected if every LB cost the same). Frustum culling discards most far LBs
early, but the outer per-LB walk still has to touch each one. The Tier 1 entity-
classification cache (EntityClassificationCache, shipped as #53) wins on the inner
loop (per-entity classification avoided on cache hits) but the outer walk dominates
as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at
docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md addresses by eliminating the
per-frame LB scan entirely.
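The scaling argument above reduces to a few lines of arithmetic. The sketch uses the standstill cpu_us medians from §2 and the (2N+1)² landblock count; the per-landblock cost it prints is a derived average for illustration, not something measured directly.

```python
def visible_landblocks(radius: int) -> int:
    """Landblocks in a square window of the given radius: (2N+1)^2."""
    return (2 * radius + 1) ** 2


# Standstill cpu_us medians from the §2 table, keyed by radius.
CPU_US_MEDIAN = {4: 3208, 8: 6732, 12: 12853}

base_radius = 4
base_cpu = CPU_US_MEDIAN[base_radius]
base_lbs = visible_landblocks(base_radius)

for radius, cpu_us in CPU_US_MEDIAN.items():
    lbs = visible_landblocks(radius)
    cpu_growth = cpu_us / base_cpu   # growth relative to the r4 baseline
    lb_growth = lbs / base_lbs       # growth if every LB cost the same
    print(f"r{radius}: {lbs} LBs, cpu {cpu_growth:.1f}x, "
          f"LB count {lb_growth:.1f}x, {cpu_us / lbs:.1f} us per LB")
```

The average cost per landblock falls as the radius grows, which is the sublinear-in-LB-count behaviour described above, while total cost still rises faster than the radius itself.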
Radius=12 is not the production scenario. ACDREAM_STREAM_RADIUS=12 forces N₁=12
(625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81
full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes as
comfortable 200–400 FPS at the default preset. The numbers above characterize the scaling
curve for headroom analysis, not the experience a typical player sees.
Atlas opportunity is high (59%) but the win is memory-only — and modest. With 96 MB
of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the
top buckets share single Texture2DArray objects rather than each surface owning its
own 1-layer array. The primary wins of atlas — fewer sampler switches, fewer texture
binds — are already near-zero because bindless textures are made resident once at upload
and never bound per draw. The remaining win is the per-array metadata overhead × N
surfaces, which is bounded but not dramatic given all surfaces are already power-of-two
and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on
the order of a few MB up to ~10 MB, not a 40–50% reduction. GPU is not bottlenecked on sampler
switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this
directly), so atlas adoption would cost 1–2 weeks of implementation risk for a memory
saving the process doesn't currently need at 96 MB.
Recommendation
Primary: do C.1.5 next (PES emitter wiring — portals, chimneys, fireplaces). Four
reasons: (a) the production dispatcher is already comfortable at the default N₁=4 preset
per the CLAUDE.md notes; (b) the two slice-2 items that were "conditional on baseline"
data (atlas adoption and persistent-mapped buffers) are not justified — GPU is not
bottlenecked; (c) C.1.5 fills a visible content gap that has been open since C.1 shipped
and is in the roadmap queue ahead of N.6 slice 2; (d) C.1.5 stabilizes the particle path
before any future shader migration work in slice 2 touches particle.frag. Starting
point for C.1.5 scoping: docs/plans/2026-04-27-phase-c1-pes-particles.md lines 285–295.
Secondary (after C.1.5 lands): N.6 slice 2 with reduced scope. The baseline data
justifies dropping atlas adoption and persistent-mapped buffers from slice 2 entirely.
What remains is a ~1-day cleanup: retire the orphan mesh.frag (after verifying zero callers post-N.5
amendment), collapse the dead _handlesByOverridden / _handlesByPalette legacy caches once
their callers are confirmed gone, and migrate particle.frag to bindless sampling after C.1.5
stabilizes the path. Slice 2 is a cleanup sprint, not a performance phase.
Tertiary option (if perf escalation becomes pressing): Tier 2 first. The scaling
curve (3.2 → 6.7 → 12.9 ms as N₁ grows 4 → 8 → 12) confirms the per-LB walk is the
bottleneck — exactly what Tier 2's persistent-group structure at
docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md addresses. Not urgent at the current
default N₁=4; worth revisiting if a future quality preset wants N₁=8 as default or if the
200–400 FPS range at N₁=4 shrinks after more content is streamed.
Decision rule for revisiting: if future measurement at the default preset shows cpu_us median > 5,000 µs or gpu_us p95 > 8,000 µs, re-open the escalation question. Otherwise, hold the C.1.5 → reduced-slice-2 sequence.
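The decision rule can be written down as a small predicate. This is a sketch with hypothetical names; the thresholds are the two stated in the rule, and the sample inputs are §2 figures used only to exercise it.

```python
CPU_US_MEDIAN_LIMIT = 5_000  # us, from the decision rule
GPU_US_P95_LIMIT = 8_000     # us, from the decision rule


def should_reopen_escalation(cpu_us_median: float, gpu_us_p95: float) -> bool:
    """True when either threshold in the decision rule is exceeded."""
    return cpu_us_median > CPU_US_MEDIAN_LIMIT or gpu_us_p95 > GPU_US_P95_LIMIT


# Sample inputs: radius=4 and radius=8 standstill rows from §2.
print(should_reopen_escalation(3208, 95))   # False: both under the limits
print(should_reopen_escalation(6732, 130))  # True: cpu_us median over 5,000
```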
§5. Reproducing the measurements
Raw [WB-DIAG] output from each run was inspected live during measurement and the
median of the last three steady-state lines from each scenario was transcribed into §2.
The raw launch logs were not preserved — the captured medians in §2 are the canonical
record. To reproduce on the same hardware:
```powershell
$env:ACDREAM_DAT_DIR = "$env:USERPROFILE\Documents\Asheron's Call"
$env:ACDREAM_LIVE = "1"
$env:ACDREAM_TEST_HOST = "127.0.0.1"
$env:ACDREAM_TEST_PORT = "9000"
$env:ACDREAM_TEST_USER = "testaccount"
$env:ACDREAM_TEST_PASS = "testpassword"
$env:ACDREAM_WB_DIAG = "1"
$env:ACDREAM_STREAM_RADIUS = "4"   # or 8, 12
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline.log"
```
Stand still for ~30 s at the target radius (60 s at radius 12 to let streaming settle),
or walk N→E→S→W across one landblock. Then Select-String -Path baseline.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 3 captures the steady-state numbers.
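For post-processing outside PowerShell, the same "last three tagged lines" filter can be sketched in Python. The function name is hypothetical, and the demo writes a throwaway file whose payload text is a placeholder rather than the real [WB-DIAG] line format. The encoding default is an assumption: Windows PowerShell 5.1 writes Tee-Object output as UTF-16, so pass encoding="utf-16" for logs captured there.

```python
import tempfile
from collections import deque
from pathlib import Path


def last_diag_lines(log_path, tag="[WB-DIAG]", count=3, encoding="utf-8"):
    """Return the last `count` lines in the log that contain `tag`."""
    tail = deque(maxlen=count)  # keeps only the most recent matches
    with open(log_path, encoding=encoding, errors="replace") as log:
        for line in log:
            if tag in line:
                tail.append(line.rstrip("\r\n"))
    return list(tail)


# Demo on a temporary file; "sample N" stands in for the real payload.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = Path(tmp) / "baseline.log"
    demo_log.write_text(
        "startup\n[WB-DIAG] sample 1\nother output\n"
        "[WB-DIAG] sample 2\n[WB-DIAG] sample 3\n[WB-DIAG] sample 4\n",
        encoding="utf-8",
    )
    demo_tail = last_diag_lines(demo_log)

print(demo_tail)
```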
For the surface histogram, also set $env:ACDREAM_DUMP_SURFACES = "1", stay in-world
~30 s after streaming has loaded ≥100 textures (the cache-size gate), then read
$env:LOCALAPPDATA\acdream\n6-surfaces.txt.