Code-quality review on commit 13abf96 flagged 3 Important issues in
the baseline document plus 2 minor roadmap consistency gaps. Fixed
all of them:
1. The "CPU scales superlinearly with N₁" claim was imprecise because
CPU growth (4.0×) is actually sublinear vs near-LB count (7.7×).
Clarified: CPU grows more than linearly with radius N₁ but
sublinearly with visible-LB count; frustum cull discards most far
LBs early. The outer per-LB walk still scales with N₁, which is
what Tier 2's persistent groups address.
2. The "40-50% memory footprint reduction from atlas packing" estimate
was asserted without derivation and likely too optimistic given all
surfaces are already power-of-two and same-format (RGBA8). Replaced
with a more honest bound: "low-MB to ~10 MB absolute saving" with
explicit per-array metadata overhead reasoning. Conclusion is
unchanged — atlas adoption still isn't justified given GPU
under-utilization.
3. The "spec §6 threshold for atlas is >30%" citation pointed at text
that doesn't exist in the spec. Replaced with "A conventional
rule-of-thumb" so a future reader doesn't chase a phantom citation.
Plus roadmap consistency:
M1: The N.6 slice 1 bullet now uses the canonical "✓ SHIPPED — Title.
Shipped YYYY-MM-DD." prefix that every other shipped phase uses.
M2: Added N.6.1 row to the shipped table at the top of the roadmap
(lines ~55-66) so the at-a-glance shipped list is complete.
None of these change the conclusion or the next-phase recommendation
(C.1.5 first, then reduced N.6 slice 2). The fixes improve doc accuracy
and future-readability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase N.6 slice 1 — perf baseline at Holtburg
Created: 2026-05-11.
Spec: docs/superpowers/specs/2026-05-11-phase-n6-slice1-design.md
Measured against commit: 25cb147 (Task 1 final — gpu_us fix + diag-gate symmetry follow-up)
Purpose: Capture authoritative CPU+GPU dispatch numbers so the next-phase decision (slice 2 vs C.1.5 vs Tier 2) rests on real data.
§1. Setup
- Hardware: Radeon RX 9070 XT
- Resolution: 1440p (2560×1440)
- Quality preset: High (default)
- Connection: live ACE at 127.0.0.1:9000
- Character: +Acdream at Holtburg
- Sky / time: clear midday (F7 → Noon, F10 → Clear)
- Build: Debug
- Date measured: 2026-05-11
- Environment overrides: ACDREAM_WB_DIAG=1, ACDREAM_STREAM_RADIUS=<per-run>
Note: ACDREAM_STREAM_RADIUS=N forces N₁=N (all N near-tier landblocks at full detail).
This is NOT the production A.5 default (N₁=4 / N₂=12), which was characterized in
CLAUDE.md as comfortable 200–400 FPS at the default preset. These measurements
characterize the scaling curve — what happens as near-tier radius grows — not current
production behavior. FPS was not captured directly (no window-title screenshot per run);
it can be derived from (1e6 / total_frame_time_us) but the dispatcher's cpu_us is
only part of the frame (terrain, sky, particles, UI, GL submission overhead, and
swap-buffer wait are not included).
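As a rough illustration of the conversion mentioned in the note, the sketch below turns a frame time in microseconds into FPS. The function name is hypothetical and not part of the codebase. Feeding it the dispatcher's cpu_us alone gives only a ceiling, and only under the assumption that the dispatcher runs serially inside the frame, because the rest of the frame is not included in cpu_us.

```python
def fps_from_frame_time_us(total_frame_time_us: float) -> float:
    """Convert a whole-frame time in microseconds to frames per second."""
    if total_frame_time_us <= 0:
        raise ValueError("frame time must be positive")
    return 1e6 / total_frame_time_us


# Dispatcher-only cpu_us medians (standstill rows of the §2 table).
# Assumption: the dispatcher is serial within the frame, so these are
# ceilings on FPS, not measurements of it.
dispatcher_cpu_us = {4: 3208, 8: 6732, 12: 12853}
fps_ceiling = {r: fps_from_frame_time_us(us) for r, us in dispatcher_cpu_us.items()}

for radius, fps in fps_ceiling.items():
    print(f"radius={radius}: at most {fps:.0f} FPS from dispatcher cost alone")
```

A real FPS figure would need the full frame time, which this run did not capture.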
§2. Dispatch CPU / GPU numbers
Each cell records the median of the last 3 [WB-DIAG] lines from a ~30s stable window.
entSeen / entDrawn / groups / drawsIssued are also from those lines (values per 5s bucket).
FPS column omitted — not captured per the note above.
| Radius | Motion | cpu_us median | cpu_us p95 | gpu_us median | gpu_us p95 | entSeen (per 5s) | entDrawn (per 5s) | groups | drawsIssued (per 5s) |
|---|---|---|---|---|---|---|---|---|---|
| 4 | standstill | 3,208 | 3,313 | 93 | 95 | 16.9M | 15.5M | 1,216 | 1.65M |
| 4 | walking | 2,967 | 3,112 | 95 | 120 | 13.9M | 13.9M | 1,850 | 1.45M |
| 8 | standstill | 6,732 | 7,199 | 126 | 130 | 19.8M | 19.8M | 333 | 218K |
| 8 | walking | 6,572 | 6,927 | 96 | 113 | 18.1M | 18.0M | 534 | 245K |
| 12 | standstill | 12,853 | 13,525 | 344 | 507 | 19.6M | 19.6M | 541 | 184K |
| 12 | walking | 16,320 | 17,241 | 553 | 603 | 17.8M | 17.8M | 898 | 200K |
Notable: meshMissing counts at r4 standstill (~1.45M per 5s) drop to near-zero while
walking. This suggests the static-entity slow path's mesh-load lifecycle has some delay
before populating for newly-streamed content. Not fatal — doesn't affect rendered output —
but worth a follow-up issue in docs/ISSUES.md if it persists in normal play.
§3. Surface-format histogram
From ACDREAM_DUMP_SURFACES=1 at radius=12, ~30s after enter-world.
Output written to %LOCALAPPDATA%\acdream\n6-surfaces.txt.
- Total unique GL textures: 760
- Total bytes (sum of W×H×4): 96,387,584 (~96.4 MB)
Top 10 (W, H) dimension buckets:
| Dimensions | Count | Share |
|---|---|---|
| 128×128 | 236 | 31% |
| 64×64 | 111 | 15% |
| 256×256 | 102 | 13% |
| 128×256 | 71 | 9% |
| 64×128 | 69 | 9% |
| 256×128 | 48 | 6% |
| 128×64 | 39 | 5% |
| 512×512 | 30 | 4% |
| 8×8 | 18 | 2% |
| 32×32 | 14 | 2% |
Format distribution:
| Format | Count | Share |
|---|---|---|
| RGBA8_DECODED | 760 | 100% |
All uploads land as RGBA8 regardless of source format (INDEX16, P8, DXT, BGRA, etc.
all decode through TextureHelpers before upload). The source-format diversity is real
but invisible to GL after the decode step.
Top 10 (W, H, format) triples — atlas-opportunity input:
Same as the dimension buckets above since there is only one format. The top-3 triples (128×128, 64×64, 256×256) cover 449 of 760 surfaces = 59%.
Atlas-opportunity score: 59% of surfaces fall into the top-3 (W, H, format) triples. A conventional rule-of-thumb is that >30% concentration into the top buckets makes atlas packing worth the implementation cost for memory savings; this measurement is well above that. However, see §4 for why atlas is not the right next step despite the high score.
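The score above can be reproduced from the dimension-bucket table. The following is a minimal sketch, not the tooling that produced the dump; the helper name `top_k_share` is made up for illustration, and the counts are copied by hand from the table.

```python
# (width, height) -> surface count, copied from the dimension-bucket table in §3.
buckets = {
    (128, 128): 236,
    (64, 64): 111,
    (256, 256): 102,
    (128, 256): 71,
    (64, 128): 69,
    (256, 128): 48,
    (128, 64): 39,
    (512, 512): 30,
    (8, 8): 18,
    (32, 32): 14,
}
TOTAL_SURFACES = 760  # total unique GL textures reported in §3


def top_k_share(counts: dict, total: int, k: int = 3) -> float:
    """Fraction of all surfaces that fall into the k most populated buckets."""
    top = sorted(counts.values(), reverse=True)[:k]
    return sum(top) / total


score = top_k_share(buckets, TOTAL_SURFACES)
print(f"top-3 share: {score:.0%}")
```

Because every surface is RGBA8, the (W, H) key is equivalent to the (W, H, format) triple here; a dump with mixed formats would need the format in the key.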
§4. Conclusion + next-phase recommendation
What the data shows
The entity dispatcher is strongly CPU-bound. At every radius, median CPU time exceeds median GPU time by roughly 30–70× (the §2 medians give ratios from about 30× at radius=12 walking to about 68× at radius=8 walking). At radius=12 standstill: 12.9 ms CPU vs 0.34 ms GPU. At radius=12 walking: 16.3 ms CPU vs 0.55 ms GPU. There is no GPU bottleneck.
GPU is wildly under-utilized. The highest gpu_us p95 observed is 603 µs at radius=12 walking, against a frame budget of about 16,667 µs at 60 FPS. The GPU is working at roughly 3.6% of its 60 FPS capacity for entity rendering alone. Even accounting for terrain, sky, particles, UI, and swap-buffer overhead, there is substantial headroom. The "GPU comfortable" threshold (gpu_us p95 < 8,000 µs) is not even close to being challenged.
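The CPU-to-GPU ratios and the GPU budget share quoted in the two paragraphs above follow directly from the §2 table. A small sketch of that arithmetic, with the values copied by hand from the table and the variable names chosen only for illustration:

```python
# (radius, motion) -> (cpu_us median, gpu_us median, gpu_us p95), from the §2 table.
rows = {
    (4, "standstill"): (3208, 93, 95),
    (4, "walking"): (2967, 95, 120),
    (8, "standstill"): (6732, 126, 130),
    (8, "walking"): (6572, 96, 113),
    (12, "standstill"): (12853, 344, 507),
    (12, "walking"): (16320, 553, 603),
}

FRAME_BUDGET_US = 1e6 / 60  # about 16,667 µs per frame at 60 FPS

# Ratio of median CPU time to median GPU time for each run.
cpu_gpu_ratio = {key: cpu / gpu for key, (cpu, gpu, _) in rows.items()}

# Share of the 60 FPS frame budget consumed by the worst observed gpu_us p95.
worst_gpu_p95 = max(p95 for _, _, p95 in rows.values())
gpu_budget_share = worst_gpu_p95 / FRAME_BUDGET_US

for key, ratio in cpu_gpu_ratio.items():
    print(key, f"CPU/GPU = {ratio:.1f}x")
print(f"worst gpu_us p95 = {worst_gpu_p95} µs, "
      f"{gpu_budget_share:.1%} of the 60 FPS budget")
```

The budget share covers entity dispatch only; terrain, sky, particles, and UI are outside these counters.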
CPU grows more than linearly with N₁ (near-tier radius), but sublinearly with
visible-LB count. As N₁ grows from 4 → 8 → 12, median cpu_us grows from 3.2 ms →
6.7 ms → 12.9 ms — roughly 1.0× → 2.1× → 4.0× the r4 baseline. The visible-LB count
scales as (2N+1)²: 81 → 289 → 625, so CPU growth is sublinear in LB count (4.0×
vs 7.7× expected if every LB cost the same). Frustum culling discards most far LBs
early, but the outer per-LB walk still has to touch each one. The Tier 1 entity-
classification cache (EntityClassificationCache, shipped as #53) wins on the inner
loop (per-entity classification avoided on cache hits) but the outer walk dominates
as N₁ grows. This is exactly what the Tier 2 plan (persistent groups) at
docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md addresses by eliminating the
per-frame LB scan entirely.
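To make the scaling comparison concrete, the sketch below sets radius growth, near-LB count growth, and measured CPU growth side by side. The helper name `near_lb_count` is hypothetical; the cpu_us values are the standstill medians from the §2 table.

```python
def near_lb_count(radius: int) -> int:
    """Landblocks in a square window of the given radius: (2N+1)^2."""
    return (2 * radius + 1) ** 2


# Standstill cpu_us medians from the §2 table, keyed by near-tier radius N1.
cpu_us_median = {4: 3208, 8: 6732, 12: 12853}
BASE = 4  # the r4 run is the baseline for all growth factors

growth = {}
for n1, cpu in cpu_us_median.items():
    growth[n1] = {
        "radius_x": n1 / BASE,
        "lb_x": near_lb_count(n1) / near_lb_count(BASE),
        "cpu_x": cpu / cpu_us_median[BASE],
    }
    g = growth[n1]
    print(f"N1={n1}: radius {g['radius_x']:.1f}x, "
          f"LBs {g['lb_x']:.1f}x, CPU {g['cpu_x']:.1f}x")
```

For N₁=12 this reproduces the comparison in the paragraph above: CPU grows about 4.0× while radius grows 3.0× and LB count grows about 7.7×, so CPU cost sits between linear-in-radius and linear-in-LB-count.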
Radius=12 is not the production scenario. ACDREAM_STREAM_RADIUS=12 forces N₁=12
(625 near LBs at full detail). The production A.5 default preset is N₁=4 / N₂=12 (81
full-detail near + 544 terrain-only far), which CLAUDE.md already characterizes as
comfortable 200–400 FPS at the default preset. The numbers above characterize the scaling
curve for headroom analysis, not the experience a typical player sees.
Atlas opportunity is high (59%) but the win is memory-only — and modest. With 96 MB
of textures and 59% in the top-3 dimension buckets, atlas consolidation would let the
top buckets share single Texture2DArray objects rather than each surface owning its
own 1-layer array. The primary wins of atlas — fewer sampler switches, fewer texture
binds — are already near-zero because bindless textures are made resident once at upload
and never bound per draw. The remaining win is the per-array metadata overhead × N
surfaces, which is bounded but not dramatic given all surfaces are already power-of-two
and same-format (RGBA8). Even on the optimistic side, the absolute memory saving is on
the order of low-MB to ~10 MB, not a 40–50% halving. GPU is not bottlenecked on sampler
switches or memory bandwidth (0.6 ms gpu_us p95 at radius=12 walking demonstrates this
directly), so atlas adoption would cost 1–2 weeks of implementation risk for a memory
saving the process doesn't currently need at 96 MB.
Recommendation
Primary: do C.1.5 next (PES emitter wiring — portals, chimneys, fireplaces). Four
reasons: (a) the production dispatcher is already comfortable at the default N₁=4 preset
per the CLAUDE.md notes; (b) the two slice-2 items that were "conditional on baseline"
data (atlas adoption and persistent-mapped buffers) are not justified — GPU is not
bottlenecked; (c) C.1.5 fills a visible content gap that has been open since C.1 shipped
and is in the roadmap queue ahead of N.6 slice 2; (d) C.1.5 stabilizes the particle path
before any future shader migration work in slice 2 touches particle.frag. Starting
point for C.1.5 scoping: docs/plans/2026-04-27-phase-c1-pes-particles.md lines 285–295.
Secondary (after C.1.5 lands): N.6 slice 2 with reduced scope. The baseline data
justifies dropping atlas adoption and persistent-mapped buffers from slice 2 entirely.
What remains is a ~1-day cleanup: retire orphan mesh.frag (verify zero callers post-N.5
amendment), collapse dead _handlesByOverridden / _handlesByPalette legacy caches once
their callers are confirmed gone, and migrate particle.frag to bindless sampling after
C.1.5 stabilizes the path. Slice 2 is a cleanup sprint, not a performance phase.
Tertiary option (if perf escalation becomes pressing): Tier 2 first. The scaling
curve (3.2 → 6.7 → 12.9 ms as N₁ grows 4 → 8 → 12) confirms the per-LB walk is the
bottleneck — exactly what Tier 2's persistent-group structure at
docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md addresses. Not urgent at the current
default N₁=4; worth revisiting if a future quality preset wants N₁=8 as default or if the
200–400 FPS range at N₁=4 shrinks after more content is streamed.
Decision rule for revisiting: if future measurement at the default preset shows cpu_us median > 5,000 µs or gpu_us p95 > 8,000 µs, re-open the escalation question. Otherwise, hold the C.1.5 → reduced-slice-2 sequence.
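The decision rule can be written as a small predicate. This is an illustrative sketch only: the function and constant names are invented here, and the r4 standstill row stands in for "the default preset" on the assumption that a forced radius of 4 is close enough to the production N₁=4 configuration.

```python
CPU_MEDIAN_LIMIT_US = 5_000  # cpu_us median threshold from the decision rule
GPU_P95_LIMIT_US = 8_000     # gpu_us p95 threshold from the decision rule


def should_reopen_escalation(cpu_us_median: float, gpu_us_p95: float) -> bool:
    """True if either threshold is exceeded, meaning escalation is back on the table."""
    return cpu_us_median > CPU_MEDIAN_LIMIT_US or gpu_us_p95 > GPU_P95_LIMIT_US


# r4 standstill numbers from the §2 table (assumed stand-in for the default preset).
reopen_now = should_reopen_escalation(cpu_us_median=3208, gpu_us_p95=95)
print("re-open escalation question:", reopen_now)
```

With the current r4 numbers neither threshold is crossed, so the C.1.5 → reduced-slice-2 sequence holds.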
§5. Raw logs
Scratch logs from this measurement run (not committed; can be deleted once the doc is reviewed):
- baseline-r4-stand.log, baseline-r4-walk.log
- baseline-r8-stand.log, baseline-r8-walk.log
- baseline-r12-stand.log, baseline-r12-walk.log
- baseline-surfaces.log (launch log for ACDREAM_DUMP_SURFACES=1 run)
- baseline-surfaces.txt (copy of %LOCALAPPDATA%\acdream\n6-surfaces.txt)
- task1-verify.log (Task 1 manual verification log)