docs(perf): roadmap for Tier 2 + Tier 3 entity-dispatcher optimizations

Captured 2026-05-10 during Phase A.5 polish discussion. User asked why
the 9070 XT @ 1440p doesn't hit Unreal-level FPS for an old game like AC.
Answer: architectural. We rebuild the entire draw plan from scratch every
frame instead of caching pre-baked static-world data.

Tier 1 (entity-classification cache) lands as A.5 polish (separate
commit). Tiers 2 + 3 documented here for future scheduling:

- Tier 2: Static/dynamic split with persistent groups
  ~2-week phase. Static entities (~95% of world) get permanent
  GPU-resident matrix slots, populated at spawn, dirty-tracked for delta
  upload. Per-frame CPU cost for static = LB-cull + dirty-flag check
  only. Estimated entity dispatcher: 3.5ms → 0.5-1ms median.
  400-600 FPS at standstill, radius=12.

- Tier 3: GPU-side culling (compute pre-pass)
  ~1-month phase. Per-instance frustum cull moves to GPU compute shader.
  Compute writes draw-indirect buffer; rasterizer reads it. Estimated
  CPU dispatcher: ~0.05ms (essentially free). 600-1000+ FPS at
  standstill, radius=12.

Doc captures effort estimates, sub-decisions, risks, mitigations, and
scheduling triggers for each tier. Also notes the architectural ceiling
(~800-1500 FPS for a C# + GL client; reaching native engine performance
requires becoming a different engine).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 09:38:38 +02:00 · 462f9d6377
commit 462f9d6377
parent 0ad8c99c37
1 changed file with 195 additions and 0 deletions
--- a/docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md
+++ b/docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md
@@ -0,0 +1,195 @@
 # Performance Tiers 2 + 3 — Future Roadmap
 **Created:** 2026-05-10 during Phase A.5 polish.
 **Status:** Future planning — not for current execution.
 **Context:** A.5 shipped two-tier streaming with the entity dispatcher landing at ~3.5ms median (post-Bug-A and Bug-B fixes). Tier 1 (entity-classification cache) lands as A.5 polish and brings the dispatcher inside the 2.0ms spec budget. Tiers 2 + 3 are the "next big perf wins" beyond Tier 1.
 ---
 ## Background — why this exists
 Discussion captured 2026-05-10: user observed 200-240 FPS at radius=12 on a Radeon 9070 XT @ 1440p and asked why an "old game like AC" doesn't deliver Unreal-level (1000+ FPS) on this hardware.
 The honest answer: the bottleneck is *architectural*, not hardware. The CPU-side render path is single-threaded and rebuilds the entire draw plan from scratch every frame. Modern engines pre-bake static-world batches at content-cook time and rebuild only what changes.
 AC's design — server-spawned per-entity world streamed at runtime — doesn't naturally batch the way Unreal's pre-cooked content does. Closing the gap requires backporting modern techniques while preserving AC's data model. Tiers 2 and 3 are that backporting work.
 ---
 ## Tier 2 — Static/dynamic split with persistent groups
 **Estimated effort:** ~10-15 days (2-week phase).
 **Estimated win:** entity dispatcher ~3.5ms → **~0.5-1ms median** at radius=12.
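 **Total frame time:** ~4-5ms → **~2-3ms at standstill.** By direct conversion (FPS = 1000 / ms) that is ~330-500 FPS; the 400-600 FPS headline corresponds to a frame of ~1.7-2.5ms.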
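 The frame-time and FPS figures in this roadmap are related by FPS = 1000 / frame time in ms. The helper below is a minimal sketch for sanity-checking each tier's estimate; `FrameBudget` and its method names are hypothetical and not part of the codebase.

 ```csharp
 using System;

 // Hypothetical helper for checking the roadmap's frame-time arithmetic.
 static class FrameBudget
 {
     // FPS implied by a frame time in milliseconds.
     public static double FpsFromMs(double frameMs) => 1000.0 / frameMs;

     // Frame time (ms) after swapping the old dispatcher cost for a new estimate.
     public static double FrameMsAfter(double frameMs, double oldDispatcherMs, double newDispatcherMs)
         => frameMs - oldDispatcherMs + newDispatcherMs;
 }
 ```

 For example, a 4.5ms frame whose dispatcher drops from 3.5ms to 1.0ms lands at 2.0ms, which is 500 FPS.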
 ### The core idea
 Today, `WbDrawDispatcher._groups` (the dictionary of "(mesh + texture + blend) → list of instances to draw") is cleared and rebuilt from scratch every frame.
 For trees, rocks, buildings, and other static entities (~95% of the world), the answer is identical every frame forever. Tier 2 makes the static-group instance buffers **persistent GPU-resident data**, just like Unreal's pre-baked world. The CPU only orchestrates "which groups are visible" per frame.
 ### Architectural shift
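 ```csharp
 using System.Collections;            // BitArray
 using System.Collections.Generic;    // Dictionary
 using System.Numerics;               // Matrix4x4

 // One persistent group per draw key. GroupKey is assumed to be the
 // dispatcher's (mesh + texture + blend) key type, defined elsewhere.
 class StaticInstancedGroup
 {
    public GroupKey Key;
    public Matrix4x4[] Matrices;          // grown as entities spawn
    public BitArray ActiveSlots;          // for free-list reuse
    public bool NeedsGpuUpload;           // dirty flag for delta upload
    public Dictionary<uint, int> EntityToSlot;   // for despawn lookup
    public uint InstanceBufferOffset;     // start of group's slice in global SSBO
 }
 ```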
 **On entity spawn (atlas-tier static):** allocate a slot in each relevant group, write the matrix, mark dirty.
 **On entity despawn:** free the slot, mark dirty.
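 The spawn/despawn rules above can be sketched as a small free-list allocator. This is an illustration under assumptions, not the planned implementation: `StaticGroupSlots` and its members are hypothetical names, it tracks free slots with a stack rather than the `BitArray` shown earlier, and it leaves open how freed slots are skipped at draw time.

 ```csharp
 using System;
 using System.Collections.Generic;
 using System.Numerics;

 // Hypothetical slot allocator for one static group.
 sealed class StaticGroupSlots
 {
     private Matrix4x4[] _matrices = new Matrix4x4[16];
     private readonly Stack<int> _freeSlots = new();
     private readonly Dictionary<uint, int> _entityToSlot = new();
     private int _highWater;   // slots [0, _highWater) have been handed out at least once

     public bool NeedsGpuUpload { get; private set; }
     public int ActiveCount => _entityToSlot.Count;

     public int Spawn(uint entityId, in Matrix4x4 world)
     {
         if (_entityToSlot.ContainsKey(entityId))
             throw new InvalidOperationException($"Entity {entityId} already has a slot.");

         // Reuse a freed slot if one exists, otherwise extend the used range.
         int slot = _freeSlots.Count > 0 ? _freeSlots.Pop() : _highWater++;
         if (slot >= _matrices.Length)
             Array.Resize(ref _matrices, _matrices.Length * 2);

         _matrices[slot] = world;
         _entityToSlot[entityId] = slot;
         NeedsGpuUpload = true;
         return slot;
     }

     public bool Despawn(uint entityId)
     {
         // Unknown entity: nothing to free, which also guards against double-frees.
         if (!_entityToSlot.Remove(entityId, out int slot))
             return false;

         _freeSlots.Push(slot);
         NeedsGpuUpload = true;
         return true;
     }

     public void MarkUploaded() => NeedsGpuUpload = false;
 }
 ```

 Because `Despawn` looks the slot up by entity id and removes the mapping in one step, a second despawn of the same entity is a no-op instead of pushing the same slot onto the free list twice.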
 **Per frame:**
 - Static groups: landblock-cull (LB-cull) each group (cheap) and flag visible groups for draw. **No matrix copy. No list rebuild.**
 - Dynamic entities (~50 NPCs/players): today's per-frame walk-and-classify. Keeps the existing slow path for things that legitimately change every frame.
 - Upload only the dirty groups' matrix slices (delta upload, not full reupload).
 - Issue 2 multi-draw-indirect calls.
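 One way to keep the delta upload small is to track the span of slots touched since the last upload instead of a single boolean. The sketch below assumes one 64-byte `Matrix4x4` per slot; `DirtyRange` is a hypothetical name and this is a finer-grained variant of the `NeedsGpuUpload` flag, not something the plan commits to.

 ```csharp
 using System;

 // Hypothetical dirty-range tracker: remembers the smallest slot span
 // touched since the last upload.
 sealed class DirtyRange
 {
     private int _min = int.MaxValue;
     private int _max = -1;

     public bool IsDirty => _max >= _min;

     public void Mark(int slot)
     {
         _min = Math.Min(_min, slot);
         _max = Math.Max(_max, slot);
     }

     // Byte offset and length to upload, relative to the start of the group's
     // slice. Resets the tracker. Returns (0, 0) when nothing is dirty.
     public (int OffsetBytes, int LengthBytes) TakeUploadSpan(int bytesPerSlot = 64)
     {
         if (!IsDirty) return (0, 0);
         var span = (_min * bytesPerSlot, (_max - _min + 1) * bytesPerSlot);
         _min = int.MaxValue;
         _max = -1;
         return span;
     }
 }
 ```

 The returned span would be added to the group's `InstanceBufferOffset` and handed to a sub-range upload such as `glBufferSubData`. Clean slots that fall between two dirty ones are re-uploaded too, which trades a little bandwidth for a single call per group.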
 ### Sub-decisions
 **Frustum cull granularity at the group level:** at group level you can't reject individual instances; you draw the whole group or none of it. Two strategies:
 - **Per-LB subgroups:** split each group into per-landblock subgroups, so the LB frustum cull rejects subgroups whose LB is not visible. ~2K groups × ~5 LBs per group on average = ~10K subgroups. At ~0.3 µs per subgroup AABB test, testing all of them costs ~3 ms per frame, roughly a wash with today's per-entity cull.
 - **Per-instance GPU cull (Tier 3):** a compute pre-pass on the GPU writes the visible instances into a draw-indirect buffer. ~0.05ms CPU. The right long-term answer.
 For Tier 2 alone, per-LB subgroups are the recommended approach — keep CPU culling, just at coarser granularity than per-entity.
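 The per-subgroup test referred to above is an AABB-versus-frustum check. A minimal sketch follows, assuming the frustum is supplied as `System.Numerics.Plane` values whose normals point into the frustum; `SubgroupCull` is a hypothetical name and the real culling code may use a different plane convention.

 ```csharp
 using System.Numerics;

 // Hypothetical helper: conservative AABB-vs-frustum test for one subgroup.
 static class SubgroupCull
 {
     // Planes are assumed to have inward-pointing normals, so a point is on
     // the inside of a plane when Plane.DotCoordinate(...) >= 0.
     public static bool IsVisible(Vector3 min, Vector3 max, Plane[] frustumPlanes)
     {
         foreach (Plane p in frustumPlanes)
         {
             // Pick the box corner farthest along the plane normal.
             var corner = new Vector3(
                 p.Normal.X >= 0 ? max.X : min.X,
                 p.Normal.Y >= 0 ? max.Y : min.Y,
                 p.Normal.Z >= 0 ? max.Z : min.Z);

             // If even that corner is behind the plane, the whole box is outside.
             if (Plane.DotCoordinate(p, corner) < 0)
                 return false;
         }
         return true;   // conservative: may report boxes near frustum corners as visible
     }
 }
 ```

 At the figures quoted above, running this for all ~10K subgroups at ~0.3 µs each is 10,000 × 0.3 µs = 3 ms, so the saving has to come from rejecting whole landblocks before their subgroups are tested individually.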
 **Dynamic entities crossing LB boundaries:** when an NPC walks across a landblock boundary, it stays in the same group key but its "spatial bucket" changes. Solution: dynamic entities are tracked in a single global "dynamic group" outside the per-LB structure; they don't need spatial bucketing because there are only ~50 of them.
 **Palette override invalidation:** server event swaps an NPC's clothing color → group key changes. Treat as despawn-from-old + spawn-into-new. NPCs are dynamic so this just rebuckets them.
 **Animation overrides on static entities:** static entities don't animate. Trees don't bend (foliage wave is a vertex shader effect, not a group-key change). Buildings don't move. So the static path never invalidates.
 **EnvCell visibility:** dungeon entities are gated by per-cell visibility state. Need to track which group instances are tied to which cell, and during visibility cull, gate per-cell. Keep using existing `ParentCellId` field on WorldEntity.
 **Streaming load/unload integration:** when an LB unloads, all its static entity matrices need to be removed from their groups. Free-list management. Matches existing `LandblockSpawnAdapter` lifecycle.
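 Unloading a landblock needs a reverse index from landblock to the static entities it contributed, so their slots can be freed without scanning every group. The sketch below is an assumption about how that could look; `LandblockStaticIndex` and the `freeSlot` callback are hypothetical, and the real hook would live in the `LandblockSpawnAdapter` lifecycle.

 ```csharp
 using System;
 using System.Collections.Generic;

 // Hypothetical index: which static entities each landblock contributed.
 sealed class LandblockStaticIndex
 {
     private readonly Dictionary<uint, List<uint>> _entitiesByLandblock = new();

     public void OnStaticSpawned(uint landblockId, uint entityId)
     {
         if (!_entitiesByLandblock.TryGetValue(landblockId, out var list))
             _entitiesByLandblock[landblockId] = list = new List<uint>();
         list.Add(entityId);
     }

     // Invokes freeSlot for every static entity of the landblock and forgets
     // the landblock. Returns how many entities were released.
     public int OnLandblockUnloaded(uint landblockId, Action<uint> freeSlot)
     {
         if (!_entitiesByLandblock.Remove(landblockId, out var list))
             return 0;   // landblock had no statics, or was already unloaded

         foreach (uint entityId in list)
             freeSlot(entityId);
         return list.Count;
     }
 }
 ```

 The callback would typically despawn the entity from each group it occupies, which marks those groups dirty for the next delta upload.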
 ### Effort breakdown
 | Task | Days |
 |---|---|
 | Design + invariants document | 2 |
 | Spawn-time slot allocator + free-list | 3 |
 | Per-frame visibility + dirty-flag delta upload | 2 |
 | Dynamic entity path (NPCs, projectiles) | 2 |
 | Invalidation (palette/ObjDesc events) | 2 |
 | EnvCell visibility integration | 1 |
 | Streaming load/unload integration | 1 |
 | Conformance testing | 2-3 |
 | **Total** | **~15-16 days** (line items sum slightly above the ~10-15 day headline estimate) |
 ### Risks
 - **Slot management bugs** = double-frees or leaks (entities draw at random positions — visible).
 - **Invalidation bugs** = stale matrices (entity teleports back to spawn point when palette changes).
 - **Dynamic entity tracking** adds complexity around the static/dynamic boundary.
 ### Mitigations
 - **Conformance test:** render a fixed scene through both pipelines, compare draw output. Adds CI infrastructure.
 - **Per-frame validation in debug:** walk all groups, assert no orphan slots.
 - **Hash invariant test:** static entities should produce stable group keys frame-over-frame. Add a debug assertion that fires once per frame in Debug builds.
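 The per-frame debug validation could check that the slot bitmap and the entity-to-slot map agree. The sketch below operates on the same shapes as `ActiveSlots` and `EntityToSlot`; `SlotInvariants` is a hypothetical name, and the exact invariants are an assumption about what "no orphan slots" should mean.

 ```csharp
 using System.Collections;
 using System.Collections.Generic;

 // Hypothetical debug-only consistency check for one group.
 static class SlotInvariants
 {
     // Returns an empty string when consistent, otherwise a description of
     // the first violation found.
     public static string Check(BitArray activeSlots, IReadOnlyDictionary<uint, int> entityToSlot)
     {
         var owned = new HashSet<int>();
         foreach (var (entityId, slot) in entityToSlot)
         {
             if (slot < 0 || slot >= activeSlots.Length)
                 return $"entity {entityId}: slot {slot} out of range";
             if (!activeSlots[slot])
                 return $"entity {entityId}: slot {slot} is marked free";
             if (!owned.Add(slot))
                 return $"slot {slot} is owned by more than one entity";
         }

         int active = 0;
         for (int i = 0; i < activeSlots.Length; i++)
             if (activeSlots[i]) active++;

         // Every owned slot is active at this point, so any surplus is orphaned.
         return active == owned.Count
             ? string.Empty
             : $"{active - owned.Count} orphan slot(s) marked active with no owner";
     }
 }
 ```

 A Debug build could run this once per frame per group and assert that the result is empty.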
 ---
 ## Tier 3 — GPU-side culling (compute pre-pass)
 **Estimated effort:** ~1 month (longer phase).
 **Estimated win:** entity dispatcher ~0.5-1ms (post-Tier-2) → **~0.05ms median.**
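 **Total frame time:** ~2-3ms → **~1.5-2ms at standstill.** By direct conversion (FPS = 1000 / ms) that is ~500-670 FPS; the 600-1000+ FPS headline corresponds to a frame of ~1.0-1.7ms or less.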
 ### The core idea
 Today (and after Tier 2), the CPU does per-LB or per-subgroup frustum culling and tells the GPU which groups to draw.
 Tier 3 moves per-instance frustum cull to the GPU via a compute shader pre-pass. The CPU just uploads "here are all 1M instance matrices" once; the GPU compute shader writes the visible ones into a draw-indirect buffer; the rasterizer draws only those.
 This is the level Unreal is at. With this, per-frame CPU work for the entity dispatcher becomes essentially "tell the GPU what to do" + a tiny scratch upload.
 ### Why Tier 3 needs Tier 2 first
 Without Tier 2's persistent group structure, GPU culling has nothing stable to operate on. The compute shader needs an addressable "here are the static instances" buffer to read from; that buffer only exists after Tier 2.
 ### Sub-decisions to be made
 **Compute shader API:** compute shaders are core in OpenGL 4.3, so GL 4.3+ is sufficient. We already require GL 4.3+ for the bindless path, so there is no additional capability requirement.
 **Indirect draw command generation:** the compute shader writes a `DrawElementsIndirectCommand[]` buffer per pass. Render thread issues `glMultiDrawElementsIndirect` reading from that buffer. No CPU readback.
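 The buffer the compute shader fills has a fixed layout: five 32-bit fields per command. The sketch below mirrors that layout in C# and adds a CPU reference cull that performs the same per-instance test and compaction, which could also back the "GPU cull matches CPU cull" validation. `CpuReferenceCull`, the bounding-sphere test, and the inward-pointing plane convention are assumptions for illustration.

 ```csharp
 using System.Numerics;
 using System.Runtime.InteropServices;

 // Field order follows the OpenGL DrawElementsIndirectCommand layout (20 bytes).
 [StructLayout(LayoutKind.Sequential)]
 struct DrawElementsIndirectCommand
 {
     public uint Count;          // index count per instance
     public uint InstanceCount;  // number of visible instances, written by the cull pass
     public uint FirstIndex;
     public int BaseVertex;
     public uint BaseInstance;   // start of this group's run of visible instances
 }

 // Hypothetical CPU mirror of the compute pre-pass for one group.
 static class CpuReferenceCull
 {
     // Tests each instance's bounding sphere against the frustum (inward-pointing
     // normals assumed) and writes surviving instance indices to visibleOut.
     // Returns the visible count, i.e. the value InstanceCount would receive.
     public static uint Cull(Vector3[] centers, float radius, Plane[] frustum, uint[] visibleOut)
     {
         uint visible = 0;
         for (uint i = 0; i < centers.Length; i++)
         {
             bool inside = true;
             foreach (Plane p in frustum)
             {
                 // Sphere is fully outside when its center is more than one
                 // radius behind any plane.
                 if (Plane.DotCoordinate(p, centers[i]) < -radius) { inside = false; break; }
             }
             if (inside) visibleOut[visible++] = i;
         }
         return visible;
     }
 }
 ```

 On the GPU the compaction would normally use an atomic counter, so the order of visible indices is not deterministic; a validation pass should compare the CPU and GPU results as sets. Between the dispatch and `glMultiDrawElementsIndirect`, a `glMemoryBarrier` including `GL_COMMAND_BARRIER_BIT` is required so the draw sees the commands the compute shader wrote.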
 **LOD selection:** opportunity to add per-instance LOD selection in the compute shader (distance-based mesh detail). Not needed for A.5's scope; could be a Tier 4 follow-up.
 **Per-light shadow map culling:** if shadows ship, GPU culling extends naturally to per-light frustum cull. Significant win for shadow rendering.
 ### Effort breakdown
 | Task | Days |
 |---|---|
 | Compute shader design + GLSL implementation | 4 |
 | Buffer layout coordination with Tier 2 | 2 |
 | Silk.NET compute dispatch integration | 3 |
 | Indirect command compaction logic | 4 |
 | LOD selection (optional, ~stretch) | 4 |
 | Validation: per-instance cull matches CPU cull within epsilon | 3 |
 | Conformance + regression testing | 5 |
 | **Total** | **~21-25 days, ~1 month** |
 ### Risks
 - **GPU stalls** if the compute shader takes longer than expected (esp. on lower-end GPUs).
 - **Sync overhead** between compute pre-pass and rasterizer pass.
 - **Debugging difficulty** — GPU compute bugs are harder to diagnose than CPU bugs.
 ### Mitigations
 - **Profile-driven design:** measure compute shader runtime on target hardware before committing.
 - **Fallback path:** keep CPU cull as a runtime-toggleable option (env var) so we can A/B compare.
 - **GPU debugging tools:** RenderDoc captures + frame-by-frame compute shader inspection.
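 The runtime toggle for the fallback path could be as small as the sketch below. The variable name `ACDREAM_CULL`, the `CullMode` enum, and the choice to default to the CPU path are all assumptions made for illustration.

 ```csharp
 using System;

 enum CullMode { Cpu, Gpu }

 // Hypothetical toggle for A/B comparing the CPU and GPU cull paths.
 static class CullModeConfig
 {
     // Anything other than "gpu" (case-insensitive, surrounding whitespace
     // ignored) selects the CPU path, so a typo never enables the newer path.
     public static CullMode Parse(string? value)
         => string.Equals(value?.Trim(), "gpu", StringComparison.OrdinalIgnoreCase)
             ? CullMode.Gpu
             : CullMode.Cpu;

     // ACDREAM_CULL is a made-up variable name.
     public static CullMode FromEnvironment()
         => Parse(Environment.GetEnvironmentVariable("ACDREAM_CULL"));
 }
 ```

 Reading the value once at startup keeps the comparison clean: each run uses exactly one path, and the two runs can be profiled side by side.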
 ---
 ## When to schedule these
 **Tier 2:**
 - Best fit: dedicated 2-week phase after a SHIP cycle. Treat it like a Phase B/C/N (e.g., name it Phase A.6 or N.7).
 - Trigger: user wants to push radius beyond 12 (e.g., to 15 or 20 for true continent-scale horizon).
 - Trigger: user wants to add 100+ active NPCs in a city without dropping below 240 FPS.
 **Tier 3:**
 - Best fit: after Tier 2 has been live and stable for at least one cycle.
 - Trigger: shadow map work begins (GPU cull + shadow cull share the same compute pre-pass infrastructure).
 - Trigger: user wants 500+ FPS sustained for very-high-refresh scenarios (360Hz monitors, future hardware).
 **Both:**
 - Don't bundle with other phases. These are dedicated perf phases with their own brainstorm + spec + plan + SHIP cycles.
 ---
 ## What's "free" or smaller (out of Tier 1/2/3 scope but worth noting)
 - **Plumb `JobKind` properly through `BuildLandblockForStreaming`** (~30 min). Today's Bug A patch wastes worker-thread CPU on hydration that gets thrown away for far-tier. Cleaner code, slight CPU savings on worker.
 - **Eliminate `ToEntries` adapter allocation in `Draw`** (~15 min). Tiny win (~25 KB / frame). Could fold into Tier 1.
 - **Persistent-mapped indirect buffer** (~2 days). Today's `glBufferData` per frame becomes a pre-mapped persistent buffer. Marginal win on RDNA 4; meaningful on lower-end GPUs.
 - **Multi-thread mesh-build worker pool** (~1 day). The 2.7s first-traversal horizon-fill drops to ~0.7s with 4 workers, assuming near-linear scaling. UX win on first walk-into-region.
 These are good candidates for a "perf polish" mini-phase or to backfill into Tier 2.
 ---
 ## The architectural ceiling
 Even with all three tiers, **a faithful AC client written in C# with bindless OpenGL tops out around 800-1500 FPS at radius=12 on RDNA 4 hardware**. Beyond that requires:
 - Native C++ rendering core (eliminate .NET GC + JIT overhead)
 - DX12/Vulkan API (eliminate driver state validation)
 - Offline content cooking (eliminate runtime mesh/texture decode)
 Each of those is a several-month undertaking and represents "becoming a different engine." The realistic target for acdream is 240-500 FPS at the user's monitor refresh, comfortably ahead of the visible-stutter threshold. Tier 1 + Tier 2 alone should deliver that for radius=12-15.
 "Unreal-level FPS at full quality" is a different project.