Erik 41981c4d74 docs(perf #N6.1): apply final-review fixes — spec, baseline doc, issue #55

Final code review of slice 1 flagged one Important issue (the spec's
"zero cost when off" claim for the surface-dump path is technically
violated — _uploadMetadata always writes one dict entry per upload
regardless of env var) plus minor doc/consistency gaps. Applied:

1. Spec §5 "Cost when off": dropped the "Zero" claim; replaced with
   "Negligible — one Dictionary write per upload (~30-50 KB at Holtburg)
   plus a hash-table write per upload. Expensive work (file I/O,
   histogram construction) is still env-gated." This matches reality.

2. Baseline doc §5: rewrote from "Raw logs (scratch, can be deleted)"
   referencing files that were never preserved in this worktree, to
   "Reproducing the measurements" with the actual PowerShell launch
   commands. Honest about the raw logs not being kept; the captured
   medians in section 2 are the canonical record.

3. New issue #55 filed in docs/ISSUES.md — static-entity slow path
   reports ~1.45M meshMissing/5s at r4 standstill, drops to ~0 when
   walking. LOW severity (no visible regression), hypothesis points
   at a "permanently-missing entity gets re-classified every frame"
   pattern that Tier 1 cache doesn't cover.

4. Roadmap shipped table: renamed "N.6.1" row to "N.6 slice 1" to
   match every other artifact's naming. Search-discoverability fix.

None of these change the slice's conclusion or next-phase
recommendation (C.1.5 first, then reduced-scope slice 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 12:51:10 +02:00


Phase N.6 slice 1 — perf baseline at Holtburg

Created: 2026-05-11.
Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md
Measured against commit: 25cb147 (Task 1 final: gpu_us fix + diag-gate symmetry follow-up)
Purpose: Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data.

§1. Setup

Hardware: Radeon RX 9070 XT
Resolution: 1440p (2560×1440)
Quality preset: High (default)
Connection: live ACE at 127.0.0.1:9000
Character: +Acdream at Holtburg
Sky / time: clear midday (F7 → Noon, F10 → Clear)
Build: Debug
Date measured: 2026-05-11
Environment overrides: ACDREAM_WB_DIAG=1, ACDREAM_STREAM_RADIUS=<per-run>

Note: ACDREAM_STREAM_RADIUS=N forces N₁=N, so every landblock within radius N is streamed as a near-tier landblock at full detail. This is NOT the production A.5 default (N₁=4 / N₂=12), which CLAUDE.md characterizes as a comfortable 200–400 FPS at the default preset. These measurements characterize the scaling curve (what happens as the near-tier radius grows), not current production behavior. FPS was not captured directly (no window-title screenshot per run). It could be derived as 1e6 / total_frame_time_us, but not from the dispatcher's cpu_us alone, because cpu_us covers only part of the frame: terrain, sky, particles, UI, GL submission overhead, and swap-buffer wait are not included.
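The FPS relationship in the note can be sketched as below. This is a minimal illustration, not part of the measurement tooling: the function names are hypothetical, and the ceiling calculation assumes the dispatcher runs on the frame-critical thread, so the real frame can never be shorter than cpu_us. The cpu_us values are the standstill medians from the table in §2.

```python
def fps_from_frame_time_us(total_frame_time_us: float) -> float:
    """Convert a whole-frame time in microseconds to frames per second."""
    if total_frame_time_us <= 0:
        raise ValueError("frame time must be positive")
    return 1e6 / total_frame_time_us


def dispatch_fps_ceiling(cpu_us: float) -> float:
    """FPS ceiling implied by the dispatcher's cpu_us alone.

    Assumption: the dispatcher is on the frame-critical thread. Terrain,
    sky, particles, UI, GL submission, and swap wait are excluded from
    cpu_us, so the real FPS can only be lower than this value.
    """
    return fps_from_frame_time_us(cpu_us)


if __name__ == "__main__":
    # Standstill cpu_us medians from the table in §2.
    for radius, cpu_us in [(4, 3208), (8, 6732), (12, 12853)]:
        ceiling = dispatch_fps_ceiling(cpu_us)
        print(f"radius {radius}: at most {ceiling:.0f} FPS from dispatch alone")
```

The result is an upper bound only; it says nothing about how far below that bound the real frame rate sits.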

§2. Dispatch CPU / GPU numbers

Each cell records the median of the last 3 [WB-DIAG] lines from a ~30s stable window. entSeen / entDrawn / groups / drawsIssued are also from those lines (values per 5s bucket). FPS column omitted — not captured per the note above.

Radius	Motion	cpu_us median	cpu_us p95	gpu_us median	gpu_us p95	entSeen (per 5s)	entDrawn (per 5s)	groups	drawsIssued (per 5s)
4	standstill	3,208	3,313	93	95	16.9M	15.5M	1,216	1.65M
4	walking	2,967	3,112	95	120	13.9M	13.9M	1,850	1.45M
8	standstill	6,732	7,199	126	130	19.8M	19.8M	333	218K
8	walking	6,572	6,927	96	113	18.1M	18.0M	534	245K
12	standstill	12,853	13,525	344	507	19.6M	19.6M	541	184K
12	walking	16,320	17,241	553	603	17.8M	17.8M	898	200K

Notable: meshMissing counts at r4 standstill (~1.45M per 5s) drop to near-zero while walking. This suggests the static-entity slow path's mesh-load lifecycle has some delay before populating for newly-streamed content. Not fatal — doesn't affect rendered output — but worth a follow-up issue in docs/ISSUES.md if it persists in normal play.

§3. Surface-format histogram

From ACDREAM_DUMP_SURFACES=1 at radius=12, ~30s after enter-world. Output written to %LOCALAPPDATA%\acdream\n6-surfaces.txt.

Total unique GL textures: 760
Total bytes (sum of W×H×4): 96,387,584 (~96.4 MB)

Top 10 (W, H) dimension buckets:

Dimensions	Count	Share
128×128	236	31%
64×64	111	15%
256×256	102	13%
128×256	71	9%
64×128	69	9%
256×128	48	6%
128×64	39	5%
512×512	30	4%
8×8	18	2%
32×32	14	2%
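The count shares above can be complemented by a byte-weighted view of the same ten buckets. The sketch below assumes each texture costs exactly W×H×4 bytes (the same accounting as the "Total bytes" line, with no mip chain or per-object overhead); the helper names are illustrative and not taken from the dump tool.

```python
# (width, height, count) rows copied from the top-10 dimension table.
TOP10 = [
    (128, 128, 236), (64, 64, 111), (256, 256, 102), (128, 256, 71),
    (64, 128, 69), (256, 128, 48), (128, 64, 39), (512, 512, 30),
    (8, 8, 18), (32, 32, 14),
]
TOTAL_BYTES = 96_387_584   # "Total bytes" line above
BYTES_PER_TEXEL = 4        # RGBA8, matching the W×H×4 accounting


def bucket_bytes(width: int, height: int, count: int) -> int:
    """Bytes held by one dimension bucket under the W×H×4 assumption."""
    return width * height * BYTES_PER_TEXEL * count


def byte_shares(rows, total_bytes):
    """Return (width, height, bytes, share) tuples, largest bucket first."""
    sized = [(w, h, bucket_bytes(w, h, c)) for w, h, c in rows]
    sized.sort(key=lambda row: row[2], reverse=True)
    return [(w, h, b, b / total_bytes) for w, h, b in sized]


if __name__ == "__main__":
    for w, h, b, share in byte_shares(TOP10, TOTAL_BYTES):
        print(f"{w}x{h}: {b:,} bytes ({share:.1%})")
```

Under that assumption the 512×512 bucket holds about a third of all texture bytes despite being 4% of the count, so count share and byte share rank the buckets differently.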

Format distribution:

Format	Count	Share
RGBA8_DECODED	760	100%

All uploads land as RGBA8 regardless of source format (INDEX16, P8, DXT, BGRA, etc. all decode through TextureHelpers before upload). The source-format diversity is real but invisible to GL after the decode step.

Top 10 (W, H, format) triples — atlas-opportunity input:

Same as the dimension buckets above since there is only one format. The top-3 triples (128×128, 64×64, 256×256) cover 449 of 760 surfaces = 59%.

Atlas-opportunity score: 59% of surfaces fall into the top-3 (W, H, format) triples. A conventional rule-of-thumb is that >30% concentration into the top buckets makes atlas packing worth the implementation cost for memory savings; this measurement is well above that. However, see §4 for why atlas is not the right next step despite the high score.
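The score is plain arithmetic on the bucket counts and can be recomputed as follows. This is a sketch with hypothetical names; the 0.30 threshold is the rule-of-thumb quoted above, not a value read from the codebase.

```python
BUCKET_COUNTS = [236, 111, 102, 71, 69, 48, 39, 30, 18, 14]  # top-10 table
TOTAL_SURFACES = 760
RULE_OF_THUMB = 0.30  # concentration above which atlas packing is usually considered


def top_k_share(counts, total, k=3):
    """Fraction of all surfaces that fall into the k most populated buckets."""
    top = sorted(counts, reverse=True)[:k]
    return sum(top) / total


score = top_k_share(BUCKET_COUNTS, TOTAL_SURFACES)

if __name__ == "__main__":
    print(f"top-3 share: {score:.0%}, above rule of thumb: {score > RULE_OF_THUMB}")
```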

§4. Conclusion + next-phase recommendation

What the data shows

The entity dispatcher is strongly CPU-bound. At every radius, median CPU time exceeds median GPU time by roughly 30–70× (the lowest ratio is about 30× at radius=12 walking, the highest about 68× at radius=8 walking). At radius=12 standstill: 12.9 ms CPU vs 0.34 ms GPU. At radius=12 walking: 16.3 ms CPU vs 0.55 ms GPU. There is no GPU bottleneck.
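The CPU-to-GPU ratio for each scenario can be recomputed directly from the §2 medians. The sketch below only does that division; the row layout and function name are illustrative.

```python
# (radius, motion, cpu_us median, gpu_us median) from the table in §2.
ROWS = [
    (4, "standstill", 3208, 93),
    (4, "walking", 2967, 95),
    (8, "standstill", 6732, 126),
    (8, "walking", 6572, 96),
    (12, "standstill", 12853, 344),
    (12, "walking", 16320, 553),
]


def cpu_gpu_ratios(rows):
    """Map (radius, motion) to median cpu_us divided by median gpu_us."""
    return {(radius, motion): cpu / gpu for radius, motion, cpu, gpu in rows}


ratios = cpu_gpu_ratios(ROWS)

if __name__ == "__main__":
    for (radius, motion), ratio in ratios.items():
        print(f"r{radius} {motion}: CPU is {ratio:.1f}x GPU")
```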

GPU is wildly under-utilized. The highest gpu_us p95 observed is 603 µs at radius=12 walking, against a frame budget of about 16,667 µs at 60 FPS. Entity rendering alone therefore uses roughly 3.6% of the GPU's 60 FPS budget. Even accounting for terrain, sky, particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged.

CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with visible-LB count. As N₁ grows from 4 → 8 → 12, median cpu_us at standstill grows from 3.2 ms → 6.7 ms → 12.9 ms, roughly 1.0× → 2.1× → 4.0× the r4 baseline, against a 1× → 2× → 3× growth in radius. The visible-LB count scales as (2N+1)²: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0× vs the 7.7× expected if every LB cost the same). Frustum culling discards most far LBs early, but the outer per-LB walk still has to touch each one. The Tier 1 entity-classification cache (EntityClassificationCache, shipped as #53) wins on the inner loop (per-entity classification is avoided on cache hits), but the outer walk dominates as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md addresses by eliminating the per-frame LB scan entirely.
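The scaling comparison can be checked with a few lines of arithmetic. The sketch below uses the standstill cpu_us medians from §2 and the (2N+1)² landblock count; the per-landblock figure divides the whole dispatcher time by the landblock count, which is a simplification, since not all of cpu_us is per-LB work.

```python
CPU_US_MEDIAN = {4: 3208, 8: 6732, 12: 12853}  # standstill medians from §2


def visible_lb_count(radius: int) -> int:
    """Landblocks in a square of the given radius: (2N+1)^2."""
    side = 2 * radius + 1
    return side * side


def scaling_table(cpu_by_radius, base_radius=4):
    """Growth factors relative to the base radius, one dict per radius."""
    base_cpu = cpu_by_radius[base_radius]
    base_lbs = visible_lb_count(base_radius)
    rows = []
    for radius in sorted(cpu_by_radius):
        lbs = visible_lb_count(radius)
        cpu = cpu_by_radius[radius]
        rows.append({
            "radius": radius,
            "lbs": lbs,
            "radius_x": radius / base_radius,  # linear-in-radius reference
            "lb_x": lbs / base_lbs,            # linear-in-LB-count reference
            "cpu_x": cpu / base_cpu,           # measured growth
            "cpu_us_per_lb": cpu / lbs,        # simplification, see lead-in
        })
    return rows


if __name__ == "__main__":
    for row in scaling_table(CPU_US_MEDIAN):
        print(
            f"r{row['radius']}: {row['lbs']} LBs, "
            f"radius {row['radius_x']:.1f}x, LBs {row['lb_x']:.1f}x, "
            f"cpu {row['cpu_x']:.1f}x, {row['cpu_us_per_lb']:.1f} us/LB"
        )
```

Measured growth that sits between the radius factor and the LB-count factor is what "more than linear in N₁, sublinear in LB count" means here.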

Radius=12 is not the production scenario. ACDREAM_STREAM_RADIUS=12 forces N₁=12 (625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81 full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes as comfortable 200–400 FPS at the default preset. The numbers above characterize the scaling curve for headroom analysis, not the experience a typical player sees.

Atlas opportunity is high (59%) but the win is memory-only — and modest. With 96 MB of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the top buckets share single Texture2DArray objects rather than each surface owning its own 1-layer array. The primary wins of atlas — fewer sampler switches, fewer texture binds — are already near-zero because bindless textures are made resident once at upload and never bound per draw. The remaining win is the per-array metadata overhead × N surfaces, which is bounded but not dramatic given all surfaces are already power-of-two and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on the order of low-MB to ~10 MB, not a 40–50% halving. GPU is not bottlenecked on sampler switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this directly), so atlas adoption would cost 1–2 weeks of implementation risk for a memory saving the process doesn't currently need at 96 MB.

Recommendation

Primary: do C.1.5 next (PES emitter wiring — portals, chimneys, fireplaces). Four reasons: (a) the production dispatcher is already comfortable at the default N₁=4 preset per the CLAUDE.md notes; (b) the two slice-2 items that were "conditional on baseline" data (atlas adoption and persistent-mapped buffers) are not justified — GPU is not bottlenecked; (c) C.1.5 fills a visible content gap that has been open since C.1 shipped and is in the roadmap queue ahead of N.6 slice 2; (d) C.1.5 stabilizes the particle path before any future shader migration work in slice 2 touches particle.frag. Starting point for C.1.5 scoping: docs/plans/2026-04-27-phase-c1-pes-particles.md lines 285–295.

Secondary (after C.1.5 lands): N.6 slice 2 with reduced scope. The baseline data justifies dropping atlas adoption and persistent-mapped buffers from slice 2 entirely. What remains is a ~1-day cleanup: retire orphan mesh.frag (verify zero callers post-N.5 amendment), collapse dead _handlesByOverridden / _handlesByPalette legacy caches once their callers are confirmed gone, migrate particle.frag to bindless sampling after C.1.5 stabilizes the path. Slice 2 is a cleanup sprint, not a performance phase.

Tertiary option (if perf escalation becomes pressing): Tier 2 first. The scaling curve (3.2 → 6.7 → 12.9 ms as N₁ grows 4 → 8 → 12) confirms the per-LB walk is the bottleneck — exactly what Tier 2's persistent-group structure at docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md addresses. Not urgent at the current default N₁=4; worth revisiting if a future quality preset wants N₁=8 as default or if the 200–400 FPS range at N₁=4 shrinks after more content is streamed.

Decision rule for revisiting: if future measurement at the default preset shows cpu_us median > 5,000 µs or gpu_us p95 > 8,000 µs, re-open the escalation question. Otherwise, hold the C.1.5 → reduced-slice-2 sequence.
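The decision rule can be written as a small predicate. This is a sketch with hypothetical names; the two thresholds are the ones stated above, and the comparison is strict, as the rule's ">" implies. The example input uses the radius=4 standstill row from §2, which was measured with ACDREAM_STREAM_RADIUS=4 and so only approximates the production preset.

```python
CPU_US_MEDIAN_LIMIT = 5_000  # µs, from the decision rule
GPU_US_P95_LIMIT = 8_000     # µs, from the decision rule


def should_reopen_escalation(cpu_us_median: float, gpu_us_p95: float) -> bool:
    """True when either threshold is exceeded at the default preset."""
    return cpu_us_median > CPU_US_MEDIAN_LIMIT or gpu_us_p95 > GPU_US_P95_LIMIT


if __name__ == "__main__":
    # radius=4 standstill from §2: cpu_us median 3,208, gpu_us p95 95.
    if should_reopen_escalation(3208, 95):
        print("re-open the escalation question")
    else:
        print("hold the C.1.5 -> reduced-slice-2 sequence")
```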

§5. Reproducing the measurements

Raw [WB-DIAG] output from each run was inspected live during measurement and the median of the last three steady-state lines from each scenario was transcribed into §2. The raw launch logs were not preserved — the captured medians in §2 are the canonical record. To reproduce on the same hardware:

$env:ACDREAM_DAT_DIR   = "$env:USERPROFILE\Documents\Asheron's Call"
$env:ACDREAM_LIVE      = "1"
$env:ACDREAM_TEST_HOST = "127.0.0.1"
$env:ACDREAM_TEST_PORT = "9000"
$env:ACDREAM_TEST_USER = "testaccount"
$env:ACDREAM_TEST_PASS = "testpassword"
$env:ACDREAM_WB_DIAG   = "1"
$env:ACDREAM_STREAM_RADIUS = "4"  # or 8, 12
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath "baseline.log"

Stand still for ~30 s at the target radius (60 s at radius 12 to let streaming settle), or walk N→E→S→W across one landblock. Then Select-String -Path baseline.log -Pattern "\[WB-DIAG\]" | Select-Object -Last 3 captures the steady-state numbers.

For the surface histogram, also set $env:ACDREAM_DUMP_SURFACES = "1", stay in-world ~30 s after streaming has loaded ≥100 textures (the cache-size gate), then read $env:LOCALAPPDATA\acdream\n6-surfaces.txt.
