acdream/docs/plans/2026-05-11-phase-n6-perf-baseline.md
Erik 76ca3ffca8 docs(perf #N6.1): apply code-quality review fixes to baseline doc
Code-quality review on commit 13abf96 flagged 3 Important issues in
the baseline document plus 2 minor roadmap consistency gaps. Applied
all of them:

1. The "CPU scales superlinearly with N₁" claim was imprecise because
   CPU growth (4.0×) is actually sublinear vs near-LB count (7.7×).
   Clarified: CPU grows more than linearly with radius N₁ but
   sublinearly with visible-LB count; frustum cull discards most far
   LBs early. The outer per-LB walk still scales with N₁, which is
   what Tier 2's persistent groups address.

2. The "40-50% memory footprint reduction from atlas packing" estimate
   was asserted without derivation and likely too optimistic given all
   surfaces are already power-of-two and same-format (RGBA8). Replaced
   with a more honest bound: "low-MB to ~10 MB absolute saving" with
   explicit per-array metadata overhead reasoning. Conclusion is
   unchanged — atlas adoption still isn't justified given GPU
   under-utilization.

3. The "spec §6 threshold for atlas is >30%" citation pointed at text
   that doesn't exist in the spec. Replaced with "A conventional
   rule-of-thumb" so a future reader doesn't chase a phantom citation.

Plus roadmap consistency:

M1: The N.6 slice 1 bullet now uses the canonical "✓ SHIPPED — Title.
   Shipped YYYY-MM-DD." prefix that every other shipped phase uses.
M2: Added N.6.1 row to the shipped table at the top of the roadmap
   (lines ~55-66) so the at-a-glance shipped list is complete.

None of these change the conclusion or the next-phase recommendation
(C.1.5 first, then reduced N.6 slice 2). The fixes improve doc accuracy
and future-readability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 12:43:35 +02:00


Phase N.6 slice 1 — perf baseline at Holtburg

Created: 2026-05-11.
Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md
Measured against commit: 25cb147 (Task 1 final — gpu_us fix + diag-gate symmetry follow-up)
Purpose: Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data.


§1. Setup

  • Hardware: Radeon RX 9070 XT
  • Resolution: 1440p (2560×1440)
  • Quality preset: High (default)
  • Connection: live ACE at 127.0.0.1:9000
  • Character: +Acdream at Holtburg
  • Sky / time: clear midday (F7 → Noon, F10 → Clear)
  • Build: Debug
  • Date measured: 2026-05-11
  • Environment overrides: ACDREAM_WB_DIAG=1, ACDREAM_STREAM_RADIUS=<per-run>

Note: ACDREAM_STREAM_RADIUS=N forces N₁=N (all (2N+1)² landblocks within radius N are near-tier, at full detail). This is NOT the production A.5 default (N₁=4 / N₂=12), which CLAUDE.md characterizes as a comfortable 200-400 FPS at the default preset. These measurements characterize the scaling curve (what happens as the near-tier radius grows), not current production behavior. FPS was not captured directly (no window-title screenshot per run). It can be derived as 1e6 / total_frame_time_us, but the dispatcher's cpu_us is only part of the frame: terrain, sky, particles, UI, GL submission overhead, and swap-buffer wait are not included.
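As a minimal sketch of the derivation mentioned in the note, the conversion from a frame time in microseconds to FPS is shown below. The function name is illustrative and not part of the codebase. Feeding it cpu_us alone gives only a dispatcher-side ceiling, because the rest of the frame is not included.

```python
def fps_from_frame_time_us(total_frame_time_us: float) -> float:
    """Frames per second for a whole-frame time given in microseconds."""
    if total_frame_time_us <= 0:
        raise ValueError("frame time must be positive")
    return 1e6 / total_frame_time_us


# Median cpu_us at radius=4 standstill, taken from the table in section 2.
r4_standstill_cpu_us = 3_208

# This is an upper bound only: terrain, sky, particles, UI, GL submission
# and swap-buffer wait are NOT part of cpu_us, so real FPS is lower.
fps_ceiling = fps_from_frame_time_us(r4_standstill_cpu_us)
print(f"dispatcher-only FPS ceiling at r4 standstill: {fps_ceiling:.0f}")
```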

§2. Dispatch CPU / GPU numbers

Each cell records the median of the last 3 [WB-DIAG] lines from a ~30s stable window. entSeen / entDrawn / groups / drawsIssued are also from those lines (values per 5s bucket). FPS column omitted — not captured per the note above.

| Radius | Motion | cpu_us median | cpu_us p95 | gpu_us median | gpu_us p95 | entSeen (per 5s) | entDrawn (per 5s) | groups | drawsIssued (per 5s) |
|---|---|---|---|---|---|---|---|---|---|
| 4 | standstill | 3,208 | 3,313 | 93 | 95 | 16.9M | 15.5M | 1,216 | 1.65M |
| 4 | walking | 2,967 | 3,112 | 95 | 120 | 13.9M | 13.9M | 1,850 | 1.45M |
| 8 | standstill | 6,732 | 7,199 | 126 | 130 | 19.8M | 19.8M | 333 | 218K |
| 8 | walking | 6,572 | 6,927 | 96 | 113 | 18.1M | 18.0M | 534 | 245K |
| 12 | standstill | 12,853 | 13,525 | 344 | 507 | 19.6M | 19.6M | 541 | 184K |
| 12 | walking | 16,320 | 17,241 | 553 | 603 | 17.8M | 17.8M | 898 | 200K |
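The "median of the last 3 [WB-DIAG] lines" reduction can be sketched as follows. The actual [WB-DIAG] line layout is not shown in this document, so the `key=value` layout below is an assumption, and the sample values are synthetic placeholders rather than measured data.

```python
import re
from statistics import median

# ASSUMPTION: hypothetical "[WB-DIAG] key=value ..." layout; the real log
# format may differ. The numbers are synthetic, not taken from a real run.
SAMPLE_LOG = """\
[WB-DIAG] cpu_us=100 gpu_us=10
[WB-DIAG] cpu_us=300 gpu_us=30
some unrelated log line
[WB-DIAG] cpu_us=200 gpu_us=20
[WB-DIAG] cpu_us=500 gpu_us=50
"""

FIELD = re.compile(r"(\w+)=(\d+)")


def median_of_last(log_text: str, key: str, n: int = 3) -> float:
    """Median of `key` over the last `n` [WB-DIAG] lines that carry it."""
    values = []
    for line in log_text.splitlines():
        if not line.startswith("[WB-DIAG]"):
            continue  # skip everything that is not a diagnostic line
        fields = dict(FIELD.findall(line))
        if key in fields:
            values.append(int(fields[key]))
    if len(values) < n:
        raise ValueError(f"need at least {n} samples, found {len(values)}")
    return median(values[-n:])


cpu_median = median_of_last(SAMPLE_LOG, "cpu_us")
gpu_median = median_of_last(SAMPLE_LOG, "gpu_us")
print(cpu_median, gpu_median)
```

Taking the last three lines rather than the whole log keeps warm-up and streaming spikes out of the reported cell, which matches the "stable window" intent described above.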

Notable: meshMissing counts at r4 standstill (~1.45M per 5s) drop to near-zero while walking. This suggests the static-entity slow path's mesh-load lifecycle has some delay before populating for newly-streamed content. Not fatal — doesn't affect rendered output — but worth a follow-up issue in docs/ISSUES.md if it persists in normal play.

§3. Surface-format histogram

From ACDREAM_DUMP_SURFACES=1 at radius=12, ~30s after enter-world. Output written to %LOCALAPPDATA%\acdream\n6-surfaces.txt.

  • Total unique GL textures: 760
  • Total bytes (sum of W×H×4): 96,387,584 (~96.4 MB)

Top 10 (W, H) dimension buckets:

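| Dimensions | Count | Share |
|---|---|---|
| 128×128 | 236 | 31% |
| 64×64 | 111 | 15% |
| 256×256 | 102 | 13% |
| 128×256 | 71 | 9% |
| 64×128 | 69 | 9% |
| 256×128 | 48 | 6% |
| 128×64 | 39 | 5% |
| 512×512 | 30 | 4% |
| 8×8 | 18 | 2% |
| 32×32 | 14 | 2% |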

Format distribution:

| Format | Count | Share |
|---|---|---|
| RGBA8_DECODED | 760 | 100% |

All uploads land as RGBA8 regardless of source format (INDEX16, P8, DXT, BGRA, etc. all decode through TextureHelpers before upload). The source-format diversity is real but invisible to GL after the decode step.

Top 10 (W, H, format) triples — atlas-opportunity input:

Same as the dimension buckets above since there is only one format. The top-3 triples (128×128, 64×64, 256×256) cover 449 of 760 surfaces = 59%.

Atlas-opportunity score: 59% of surfaces fall into the top-3 (W, H, format) triples. A conventional rule-of-thumb is that >30% concentration into the top buckets makes atlas packing worth the implementation cost for memory savings; this measurement is well above that. However, see §4 for why atlas is not the right next step despite the high score.
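The score above can be reproduced from the dimension histogram. The sketch below uses the counts from the top-10 table and the 760-surface total. The helper names are illustrative, and the byte figure assumes the same W×H×4 accounting as the total in this section (no mip levels or driver padding).

```python
# (width, height) -> surface count, copied from the top-10 dimension table.
BUCKETS = {
    (128, 128): 236, (64, 64): 111, (256, 256): 102, (128, 256): 71,
    (64, 128): 69, (256, 128): 48, (128, 64): 39, (512, 512): 30,
    (8, 8): 18, (32, 32): 14,
}
TOTAL_SURFACES = 760
TOTAL_BYTES = 96_387_584


def top_k(buckets: dict, k: int = 3) -> list:
    """The k most populated (dims, count) buckets, largest count first."""
    return sorted(buckets.items(), key=lambda item: item[1], reverse=True)[:k]


top3 = top_k(BUCKETS)
top3_count = sum(count for _, count in top3)
score = top3_count / TOTAL_SURFACES

# Bytes held by those buckets, using RGBA8 = 4 bytes per texel.
top3_bytes = sum(w * h * 4 * count for (w, h), count in top3)

print(f"top-3 surfaces: {top3_count} of {TOTAL_SURFACES} ({score:.0%})")
print(f"top-3 bytes: {top3_bytes:,} ({top3_bytes / TOTAL_BYTES:.0%} of total)")
```

Counting by surfaces and counting by bytes answer different questions: the first says how many textures could share an array, the second says how much of the footprint those arrays would hold.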

§4. Conclusion + next-phase recommendation

What the data shows

The entity dispatcher is strongly CPU-bound. At every radius, CPU dominates GPU by roughly 30-70× (median cpu_us over median gpu_us in §2). At radius=12 standstill: 12.9 ms CPU vs 0.34 ms GPU. At radius=12 walking: 16.3 ms CPU vs 0.55 ms GPU. There is no GPU bottleneck.

GPU is wildly under-utilized. The highest gpu_us p95 observed is 603 µs at radius=12 walking, against a ~16,667 µs frame budget at 60 FPS. The GPU is working at roughly 3.6% of its 60 FPS capacity for entity rendering alone. Even accounting for terrain, sky, particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged.
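Both observations can be checked directly against the §2 table. The following is a small sketch with the table values copied in; the variable names are illustrative, and the frame budget is taken as 1e6 / 60 µs.

```python
# (radius, motion, cpu_us median, gpu_us median, gpu_us p95) from section 2.
ROWS = [
    (4, "standstill", 3_208, 93, 95),
    (4, "walking", 2_967, 95, 120),
    (8, "standstill", 6_732, 126, 130),
    (8, "walking", 6_572, 96, 113),
    (12, "standstill", 12_853, 344, 507),
    (12, "walking", 16_320, 553, 603),
]
FRAME_BUDGET_US = 1e6 / 60  # 60 FPS frame budget in microseconds

# How many times larger the CPU dispatch cost is than the GPU cost.
ratios = [cpu / gpu for _, _, cpu, gpu, _ in ROWS]

# Worst-case share of the 60 FPS budget spent on entity GPU work.
max_gpu_p95 = max(p95 for *_, p95 in ROWS)
max_utilization = max_gpu_p95 / FRAME_BUDGET_US

for (radius, motion, *_), ratio in zip(ROWS, ratios):
    print(f"r{radius} {motion}: CPU/GPU = {ratio:.1f}x")
print(f"peak GPU share of 60 FPS budget: {max_utilization:.1%}")
```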

CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with visible-LB count. As N₁ grows from 4 → 8 → 12, median cpu_us grows from 3.2 ms → 6.7 ms → 12.9 ms, roughly 1.0× → 2.1× → 4.0× the r4 baseline. The visible-LB count scales as (2N+1)²: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0× vs 7.7× expected if every LB cost the same). Frustum culling discards most far LBs early, but the outer per-LB walk still has to touch each one. The Tier 1 entity-classification cache (EntityClassificationCache, shipped as #53) wins on the inner loop (per-entity classification avoided on cache hits), but the outer walk dominates as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md addresses by eliminating the per-frame LB scan entirely.
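The three growth factors compared in this paragraph (radius, landblock count, CPU time) can be laid side by side with a short calculation. This is only a sketch over the standstill medians from §2; the function and variable names are illustrative.

```python
def landblocks_in_radius(n: int) -> int:
    """Landblocks in a square of radius n around the player: (2n+1)^2."""
    return (2 * n + 1) ** 2


# Median cpu_us at standstill for each measured radius (section 2).
CPU_MEDIAN_US = {4: 3_208, 8: 6_732, 12: 12_853}
BASE = 4  # compare everything against the r4 baseline

growth = {}
for n, cpu_us in CPU_MEDIAN_US.items():
    growth[n] = {
        "radius_x": n / BASE,
        "lb_x": landblocks_in_radius(n) / landblocks_in_radius(BASE),
        "cpu_x": cpu_us / CPU_MEDIAN_US[BASE],
    }

for n, g in growth.items():
    print(f"N1={n}: radius {g['radius_x']:.1f}x, "
          f"LBs {g['lb_x']:.1f}x, cpu {g['cpu_x']:.1f}x")
```

At N₁=12 the CPU factor sits between the radius factor and the landblock factor, which is the "more than linear in radius, sublinear in LB count" shape described above.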

Radius=12 is not the production scenario. ACDREAM_STREAM_RADIUS=12 forces N₁=12 (625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81 full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes as a comfortable 200-400 FPS at the default preset. The numbers above characterize the scaling curve for headroom analysis, not the experience a typical player sees.
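The 81 + 544 split follows from the same (2N+1)² count, assuming the far tier is the ring of landblocks between radius N₁ and radius N₂. A minimal sketch, with an illustrative function name:

```python
def tier_split(n1: int, n2: int) -> tuple:
    """Return (near, far) landblock counts for near radius n1, far radius n2.

    ASSUMPTION: the far tier is everything inside radius n2 that is not
    already inside radius n1, which matches the 81 + 544 figures quoted.
    """
    if n2 < n1:
        raise ValueError("far radius must be >= near radius")
    near = (2 * n1 + 1) ** 2
    total = (2 * n2 + 1) ** 2
    return near, total - near


near_lbs, far_lbs = tier_split(4, 12)
print(f"default preset: {near_lbs} near + {far_lbs} far = {near_lbs + far_lbs}")
```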

Atlas opportunity is high (59%) but the win is memory-only, and modest. With 96 MB of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the top buckets share single Texture2DArray objects rather than each surface owning its own 1-layer array. The primary wins of atlas (fewer sampler switches, fewer texture binds) are already near-zero because bindless textures are made resident once at upload and never bound per draw. The remaining win is the per-array metadata overhead × N surfaces, which is bounded but not dramatic given all surfaces are already power-of-two and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on the order of low-MB to ~10 MB, not a 40-50% halving. GPU is not bottlenecked on sampler switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this directly), so atlas adoption would cost 1-2 weeks of implementation risk for a memory saving the process doesn't currently need at 96 MB.

Recommendation

Primary: do C.1.5 next (PES emitter wiring: portals, chimneys, fireplaces). Four reasons: (a) the production dispatcher is already comfortable at the default N₁=4 preset per the CLAUDE.md notes; (b) the two slice-2 items that were conditional on baseline data (atlas adoption and persistent-mapped buffers) are not justified, because GPU is not bottlenecked; (c) C.1.5 fills a visible content gap that has been open since C.1 shipped and is in the roadmap queue ahead of N.6 slice 2; (d) C.1.5 stabilizes the particle path before any future shader migration work in slice 2 touches particle.frag. Starting point for C.1.5 scoping: docs/plans/2026-04-27-phase-c1-pes-particles.md lines 285-295.

Secondary (after C.1.5 lands): N.6 slice 2 with reduced scope. The baseline data justifies dropping atlas adoption and persistent-mapped buffers from slice 2 entirely. What remains is a ~1-day cleanup: retire orphan mesh.frag (verify zero callers post-N.5 amendment), collapse dead _handlesByOverridden / _handlesByPalette legacy caches once their callers are confirmed gone, migrate particle.frag to bindless sampling after C.1.5 stabilizes the path. Slice 2 is a cleanup sprint, not a performance phase.

Tertiary option (if perf escalation becomes pressing): Tier 2 first. The scaling curve (3.2 → 6.7 → 12.9 ms as N₁ grows 4 → 8 → 12) confirms the per-LB walk is the bottleneck, which is exactly what Tier 2's persistent-group structure at docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md addresses. Not urgent at the current default N₁=4; worth revisiting if a future quality preset wants N₁=8 as default or if the 200-400 FPS range at N₁=4 shrinks after more content is streamed.

Decision rule for revisiting: if future measurement at the default preset shows cpu_us median > 5,000 µs or gpu_us p95 > 8,000 µs, re-open the escalation question. Otherwise, hold the C.1.5 → reduced-slice-2 sequence.
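The decision rule is simple enough to encode as a check that a future measurement script could run. The sketch below is hypothetical: the function and constant names are not part of the codebase, and only the two thresholds come from the rule above.

```python
# Thresholds from the decision rule, in microseconds.
CPU_MEDIAN_LIMIT_US = 5_000
GPU_P95_LIMIT_US = 8_000


def should_reopen_escalation(cpu_us_median: float, gpu_us_p95: float) -> bool:
    """True if a default-preset measurement breaches either threshold."""
    return cpu_us_median > CPU_MEDIAN_LIMIT_US or gpu_us_p95 > GPU_P95_LIMIT_US


# The r4 standstill row from section 2 (N1=4, the closest run to the default).
reopen_at_r4 = should_reopen_escalation(cpu_us_median=3_208, gpu_us_p95=95)
print("re-open escalation question:", reopen_at_r4)
```

The rule is defined for the default preset only. Feeding it the radius=12 rows would trip the CPU threshold, but those runs are a forced scaling scenario, not the configuration the rule is about.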

§5. Raw logs

Scratch logs from this measurement run (not committed; can be deleted once the doc is reviewed):

  • baseline-r4-stand.log, baseline-r4-walk.log
  • baseline-r8-stand.log, baseline-r8-walk.log
  • baseline-r12-stand.log, baseline-r12-walk.log
  • baseline-surfaces.log (launch log for ACDREAM_DUMP_SURFACES=1 run)
  • baseline-surfaces.txt (copy of %LOCALAPPDATA%\acdream\n6-surfaces.txt)
  • task1-verify.log (Task 1 manual verification log)