acdream/docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md
Erik 462f9d6377 docs(perf): roadmap for Tier 2 + Tier 3 entity-dispatcher optimizations
Captured 2026-05-10 during Phase A.5 polish discussion. User asked why
the 9070 XT @ 1440p doesn't hit Unreal-level FPS for an old game like
AC. Answer: architectural — we rebuild the entire draw plan from
scratch every frame instead of caching pre-baked static-world data.

Tier 1 (entity-classification cache) lands as A.5 polish (separate
commit). Tiers 2 + 3 documented here for future scheduling:

- Tier 2 — Static/dynamic split with persistent groups
  ~2-week phase. Static entities (~95% of world) get permanent GPU-
  resident matrix slots, populated at spawn, dirty-tracked for delta
  upload. Per-frame CPU cost for static = LB-cull + dirty-flag check
  only. Estimated entity dispatcher: 3.5ms → 0.5-1ms median.
  400-600 FPS at standstill, radius=12.

- Tier 3 — GPU-side culling (compute pre-pass)
  ~1-month phase. Per-instance frustum cull moves to GPU compute
  shader. Compute writes draw-indirect buffer; rasterizer reads it.
  Estimated CPU dispatcher: ~0.05ms (essentially free).
  600-1000+ FPS at standstill, radius=12.

Doc captures effort estimates, sub-decisions, risks, mitigations, and
scheduling triggers for each tier. Also notes the architectural
ceiling (~800-1500 FPS for a C# + GL client; reaching native engine
performance requires becoming a different engine).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 09:38:38 +02:00

# Performance Tiers 2 + 3 — Future Roadmap
**Created:** 2026-05-10 during Phase A.5 polish.
**Status:** Future planning — not for current execution.
**Context:** A.5 shipped two-tier streaming with the entity dispatcher landing at ~3.5ms median (post-Bug-A and Bug-B fixes). Tier 1 (entity-classification cache) lands as A.5 polish and brings the dispatcher inside the 2.0ms spec budget. Tiers 2 + 3 are the "next big perf wins" beyond Tier 1.
---
## Background — why this exists
Discussion captured 2026-05-10: user observed 200-240 FPS at radius=12 on a Radeon 9070 XT @ 1440p and asked why an "old game like AC" doesn't deliver Unreal-level frame rates (1000+ FPS) on this hardware.
The honest answer: the bottleneck is *architectural*, not hardware. The render path runs on a single CPU thread and rebuilds the entire draw plan from scratch every frame. Modern engines pre-bake static-world batches at content-cook time and rebuild only what changes.
AC's design — server-spawned per-entity world streamed at runtime — doesn't naturally batch the way Unreal's pre-cooked content does. Closing the gap requires backporting modern techniques while preserving AC's data model. Tiers 2 and 3 are that backporting work.
---
## Tier 2 — Static/dynamic split with persistent groups
**Estimated effort:** ~10-15 days (2-week phase).
**Estimated win:** entity dispatcher ~3.5ms → **~0.5-1ms median** at radius=12.
**Total frame time:** ~4-5ms → **~2-3ms at standstill.** By 1000 / frame-ms that is ~330-500 FPS; the 400-600 FPS target corresponds to ~1.7-2.5ms.
### The core idea
Today, `WbDrawDispatcher._groups` (the dictionary of "(mesh + texture + blend) → list of instances to draw") is cleared and rebuilt from scratch every frame.
For trees, rocks, buildings, and other static entities (~95% of the world), the answer is identical every frame forever. Tier 2 makes the static-group instance buffers **persistent GPU-resident data**, just like Unreal's pre-baked world. The CPU only orchestrates "which groups are visible" per frame.
### Architectural shift
```csharp
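using System.Collections;          // BitArray
using System.Collections.Generic;  // Dictionary
using System.Numerics;             // Matrix4x4

// GroupKey is the (mesh + texture + blend) key used by _groups; it is not defined in this sketch.
class StaticInstancedGroup
{
    public GroupKey Key;
    public Matrix4x4[] Matrices;               // grown as entities spawn
    public BitArray ActiveSlots;               // for free-list reuse
    public bool NeedsGpuUpload;                // dirty flag for delta upload
    public Dictionary<uint, int> EntityToSlot; // for despawn lookup
    public uint InstanceBufferOffset;          // start of group's slice in global SSBO
}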
```
**On entity spawn (atlas-tier static):** allocate a slot in each relevant group, write the matrix, mark dirty.
**On entity despawn:** free the slot, mark dirty.
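The spawn and despawn steps can be sketched as a small slot allocator with a free list. This is a minimal sketch, not acdream code: `SlotAllocator` and every member name in it are hypothetical, and it uses a `Stack<int>` free list instead of the `BitArray` shown above purely to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

// Hypothetical sketch: none of these names come from the acdream codebase.
public sealed class SlotAllocator
{
    private Matrix4x4[] _matrices = new Matrix4x4[4];
    private readonly Stack<int> _freeSlots = new Stack<int>();
    private readonly Dictionary<uint, int> _entityToSlot = new Dictionary<uint, int>();
    private int _highWater; // index of the first slot that has never been used

    public bool NeedsGpuUpload { get; private set; }
    public int ActiveCount => _entityToSlot.Count;

    // Spawn: reuse a freed slot if one exists, otherwise take a fresh one (growing the array).
    public int Spawn(uint entityId, in Matrix4x4 world)
    {
        if (_entityToSlot.ContainsKey(entityId))
            throw new InvalidOperationException($"Entity {entityId} already has a slot.");

        int slot;
        if (_freeSlots.Count > 0)
        {
            slot = _freeSlots.Pop();
        }
        else
        {
            if (_highWater == _matrices.Length)
                Array.Resize(ref _matrices, _matrices.Length * 2);
            slot = _highWater++;
        }

        _matrices[slot] = world;
        _entityToSlot.Add(entityId, slot);
        NeedsGpuUpload = true;
        return slot;
    }

    // Despawn: returns false for an unknown id, so a repeated despawn cannot double-free a slot.
    public bool Despawn(uint entityId)
    {
        if (!_entityToSlot.Remove(entityId, out int slot))
            return false;

        _matrices[slot] = default; // clear the stale matrix
        _freeSlots.Push(slot);
        NeedsGpuUpload = true;
        return true;
    }

    public Matrix4x4 MatrixAt(int slot) => _matrices[slot];
}
```

Routing every free through the entity-to-slot map is one way to make the double-free case listed under Risks fail loudly instead of corrupting the free list.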
**Per frame:**
- Static groups: LB-cull each group (cheap). For visible groups, flag for draw. **No matrix copy. No list rebuild.**
- Dynamic entities (~50 NPCs/players): today's per-frame walk-and-classify. Keeps the existing slow path for things that legitimately change every frame.
- Upload only the dirty groups' matrix slices (delta upload, not full reupload).
- Issue 2 multi-draw-indirect calls.
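The "upload only the dirty slices" step can be sketched as a per-group dirty range that coalesces all touched slots into one contiguous byte span. This is an illustration under assumptions: `DirtyRange` is a hypothetical helper, the group's base offset is assumed to be measured in matrices, and each matrix is assumed to be a 64-byte `Matrix4x4` (16 floats).

```csharp
using System;

// Hypothetical helper: tracks one coalesced dirty slot range for a group.
public sealed class DirtyRange
{
    private int _min = int.MaxValue;
    private int _max = -1;

    public bool IsDirty => _max >= 0;

    // Record that a slot's matrix changed (spawn, despawn, or rewrite).
    public void Mark(int slot)
    {
        if (slot < 0) throw new ArgumentOutOfRangeException(nameof(slot));
        _min = Math.Min(_min, slot);
        _max = Math.Max(_max, slot);
    }

    // Returns the byte offset and size to upload within the global buffer, then resets.
    // groupBaseSlot is the assumed start of the group's slice, in matrices.
    public (long ByteOffset, long ByteSize) Flush(long groupBaseSlot, int bytesPerMatrix = 64)
    {
        if (!IsDirty) return (0, 0);

        long offset = (groupBaseSlot + _min) * bytesPerMatrix;
        long size = (long)(_max - _min + 1) * bytesPerMatrix;

        _min = int.MaxValue;
        _max = -1;
        return (offset, size);
    }
}
```

The returned pair is what a sub-range upload such as `glBufferSubData` would take as its offset and size. A single min/max range can re-upload clean slots that sit between two dirty ones; whether that beats tracking several ranges is a design choice the measurements would have to settle.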
### Sub-decisions
**Frustum cull granularity at the group level:** at group level you can't reject individual instances; you draw the whole group or none of it. Two strategies:
- **Per-LB subgroups:** split each group into per-landblock (LB) subgroups, so the LB frustum cull rejects every subgroup whose LB is invisible. ~2K groups × ~5 LBs per group on average = ~10K subgroups. At ~0.3 µs per subgroup AABB test, that is ~3 ms per frame if every subgroup is tested, roughly a wash with today's per-entity cull.
- **Per-instance GPU cull (Tier 3):** compute pre-pass on the GPU writes which instances are visible to a draw-indirect buffer. ~0.05ms CPU. The right long-term answer.
For Tier 2 alone, per-LB subgroups are the recommended approach — keep CPU culling, just at coarser granularity than per-entity.
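A per-subgroup AABB cull of the kind described above might look like the following. It is a sketch under stated assumptions: `Aabb` and `SubgroupCull` are hypothetical names, and the frustum is assumed to arrive as inward-facing planes (a point is inside when `Normal·p + D >= 0`); how acdream actually stores its frustum is not specified in this document.

```csharp
using System;
using System.Numerics;

// Hypothetical axis-aligned bounding box for one per-LB subgroup.
public readonly struct Aabb
{
    public readonly Vector3 Min;
    public readonly Vector3 Max;
    public Aabb(Vector3 min, Vector3 max) { Min = min; Max = max; }
}

public static class SubgroupCull
{
    // Conservative test: returns false only when the box lies fully outside some plane.
    // Assumes inward-facing planes, i.e. Normal·p + D >= 0 means "inside".
    public static bool IsVisible(in Aabb box, ReadOnlySpan<Plane> planes)
    {
        foreach (var p in planes)
        {
            // Pick the box corner that reaches farthest along the plane normal.
            var farthest = new Vector3(
                p.Normal.X >= 0 ? box.Max.X : box.Min.X,
                p.Normal.Y >= 0 ? box.Max.Y : box.Min.Y,
                p.Normal.Z >= 0 ? box.Max.Z : box.Min.Z);

            // If even that corner is behind the plane, the whole box is outside.
            if (Plane.DotCoordinate(p, farthest) < 0)
                return false;
        }
        return true;
    }
}
```

The test is six dot products per subgroup at most and exits early on the first rejecting plane, which is the shape of work the ~0.3 µs per-subgroup estimate refers to.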
**Dynamic entities crossing LB boundaries:** when an NPC walks across a landblock boundary, it stays in the same group key but its "spatial bucket" changes. Solution: dynamic entities are tracked in a single global "dynamic group" outside the per-LB structure; they don't need spatial bucketing because there are only ~50 of them.
**Palette override invalidation:** server event swaps an NPC's clothing color → group key changes. Treat as despawn-from-old + spawn-into-new. NPCs are dynamic so this just rebuckets them.
**Animation overrides on static entities:** static entities don't animate. Trees don't bend (foliage wave is a vertex shader effect, not a group-key change). Buildings don't move. So the static path never invalidates.
**EnvCell visibility:** dungeon entities are gated by per-cell visibility state. Tier 2 needs to track which group instances belong to which cell and apply that gate during the visibility cull. The existing `ParentCellId` field on `WorldEntity` can keep serving as the link.
**Streaming load/unload integration:** when an LB unloads, all its static entity matrices need to be removed from their groups. Free-list management. Matches existing `LandblockSpawnAdapter` lifecycle.
### Effort breakdown
| Task | Days |
|---|---|
| Design + invariants document | 2 |
| Spawn-time slot allocator + free-list | 3 |
| Per-frame visibility + dirty-flag delta upload | 2 |
| Dynamic entity path (NPCs, projectiles) | 2 |
| Invalidation (palette/ObjDesc events) | 2 |
| EnvCell visibility integration | 1 |
| Streaming load/unload integration | 1 |
| Conformance testing | 2-3 |
| **Total** | **15-16 as itemized (headline estimate: ~10-15 days)** |
### Risks
- **Slot management bugs** = double-frees or leaks (entities draw at random positions — visible).
- **Invalidation bugs** = stale matrices (entity teleports back to spawn point when palette changes).
- **Dynamic entity tracking** adds complexity around the static/dynamic boundary.
### Mitigations
- **Conformance test:** render a fixed scene through both pipelines, compare draw output. Adds CI infrastructure.
- **Per-frame validation in debug:** walk all groups, assert no orphan slots.
- **Hash invariant test:** static entities should produce stable group keys frame-over-frame. Add a debug assertion that fires once per frame in Debug builds.
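The per-frame debug validation could be sketched as a check that the active-slot bitmap and the entity-to-slot map agree. `GroupValidator` and its method are hypothetical; the sketch assumes the `ActiveSlots` / `EntityToSlot` shapes from the `StaticInstancedGroup` outline and returns a list of problems so a Debug build can assert that the list is empty.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class GroupValidator
{
    // Returns human-readable invariant violations; an empty list means the group is consistent.
    public static List<string> FindSlotProblems(
        BitArray activeSlots, IReadOnlyDictionary<uint, int> entityToSlot)
    {
        var problems = new List<string>();
        var ownerOfSlot = new Dictionary<int, uint>();

        foreach (var pair in entityToSlot)
        {
            uint entity = pair.Key;
            int slot = pair.Value;

            if (slot < 0 || slot >= activeSlots.Length)
            {
                problems.Add($"entity {entity} maps to out-of-range slot {slot}");
                continue;
            }
            if (!activeSlots[slot])
                problems.Add($"entity {entity} maps to freed slot {slot}");
            if (!ownerOfSlot.TryAdd(slot, entity))
                problems.Add($"slot {slot} is shared by entities {ownerOfSlot[slot]} and {entity}");
        }

        // An active slot that no entity owns is an orphan (a leak).
        for (int i = 0; i < activeSlots.Length; i++)
        {
            if (activeSlots[i] && !ownerOfSlot.ContainsKey(i))
                problems.Add($"orphan slot {i}");
        }
        return problems;
    }
}
```

The walk is linear in the slot count, so it would plausibly sit behind a Debug-only guard rather than run in release builds.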
---
## Tier 3 — GPU-side culling (compute pre-pass)
**Estimated effort:** ~1 month (longer phase).
**Estimated win:** entity dispatcher ~0.5-1ms (post-Tier-2) → **~0.05ms median.**
**Total frame time:** ~2-3ms → **~1.5-2ms at standstill.** By 1000 / frame-ms that is ~500-670 FPS; the 600-1000+ FPS target corresponds to ~1.0-1.7ms.
### The core idea
Today (and after Tier 2), the CPU does per-LB or per-subgroup frustum culling and tells the GPU which groups to draw.
Tier 3 moves per-instance frustum cull to the GPU via a compute shader pre-pass. The CPU just uploads "here are all 1M instance matrices" once; the GPU compute shader writes which ones are visible to a draw-indirect buffer; the rasterizer draws only those.
This is the level Unreal is at. With this, per-frame CPU work for the entity dispatcher becomes essentially "tell the GPU what to do" + a tiny scratch upload.
### Why Tier 3 needs Tier 2 first
Without Tier 2's persistent group structure, GPU culling has nothing stable to operate on. The compute shader needs an addressable "here are the static instances" buffer to read from; that buffer only exists after Tier 2.
### Sub-decisions to be made
**Compute shader API:** OpenGL 4.3+ compute shaders are sufficient. We're already at GL 4.3+ for bindless. No additional capability requirement.
**Indirect draw command generation:** the compute shader writes a `DrawElementsIndirectCommand[]` buffer per pass. Render thread issues `glMultiDrawElementsIndirect` reading from that buffer. No CPU readback.
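To make the buffer contents concrete, here is a C# mirror of the OpenGL `DrawElementsIndirectCommand` layout (five 32-bit fields, 20 bytes) together with a CPU reference of the cull-and-compact step the compute shader would perform. The struct layout follows the OpenGL specification; everything else (`CullReference`, the bounding-sphere representation, the inward-facing normalized planes) is an assumption made for illustration. A CPU reference like this is also the sort of thing the "cull matches CPU cull" validation task could compare against.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.InteropServices;

// Field order and sizes follow the OpenGL DrawElementsIndirectCommand layout.
[StructLayout(LayoutKind.Sequential)]
public struct DrawElementsIndirectCommand
{
    public uint Count;         // number of indices per instance
    public uint InstanceCount; // number of visible instances
    public uint FirstIndex;
    public int BaseVertex;
    public uint BaseInstance;
}

// Hypothetical CPU reference for one group's cull + compaction.
public static class CullReference
{
    // spheres: xyz = centre, w = radius. planes: inward-facing and normalized.
    // Visible instance indices are appended to visibleOut; BaseInstance records
    // where this group's run starts in that compacted list.
    public static DrawElementsIndirectCommand Compact(
        ReadOnlySpan<Vector4> spheres, ReadOnlySpan<Plane> planes,
        uint indexCount, uint firstIndex, int baseVertex, List<uint> visibleOut)
    {
        uint baseInstance = (uint)visibleOut.Count;
        uint visible = 0;

        for (int i = 0; i < spheres.Length; i++)
        {
            Vector4 s = spheres[i];
            var centre = new Vector3(s.X, s.Y, s.Z);

            bool inside = true;
            foreach (var p in planes)
            {
                // Fully behind a plane when the signed distance is below -radius.
                if (Plane.DotCoordinate(p, centre) < -s.W) { inside = false; break; }
            }
            if (!inside) continue;

            visibleOut.Add((uint)i);
            visible++;
        }

        return new DrawElementsIndirectCommand
        {
            Count = indexCount,
            InstanceCount = visible,
            FirstIndex = firstIndex,
            BaseVertex = baseVertex,
            BaseInstance = baseInstance,
        };
    }
}
```

On the GPU the same logic would typically run one invocation per instance and use an atomic counter in place of `visibleOut.Count`, so the order of the compacted indices would not be deterministic.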
**LOD selection:** opportunity to add per-instance LOD selection in the compute shader (distance-based mesh detail). Not needed for A.5's scope; could be a Tier 4 follow-up.
**Per-light shadow map culling:** if shadows ship, GPU culling extends naturally to per-light frustum cull. Significant win for shadow rendering.
### Effort breakdown
| Task | Days |
|---|---|
| Compute shader design + GLSL implementation | 4 |
| Buffer layout coordination with Tier 2 | 2 |
| Silk.NET compute dispatch integration | 3 |
| Indirect command compaction logic | 4 |
| LOD selection (optional, ~stretch) | 4 |
| Validation: per-instance cull matches CPU cull within epsilon | 3 |
| Conformance + regression testing | 5 |
| **Total** | **~21-25 days, ~1 month** |
### Risks
- **GPU stalls** if the compute shader takes longer than expected (esp. on lower-end GPUs).
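- **Sync overhead** between compute pre-pass and rasterizer pass: the indirect draw must not read the command buffer until the compute writes are visible, which in OpenGL means a `glMemoryBarrier` (with `GL_COMMAND_BARRIER_BIT`) between the dispatch and the draw.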
- **Debugging difficulty** — GPU compute bugs are harder to diagnose than CPU bugs.
### Mitigations
- **Profile-driven design:** measure compute shader runtime on target hardware before committing.
- **Fallback path:** keep CPU cull as a runtime-toggleable option (env var) so we can A/B compare.
- **GPU debugging tools:** RenderDoc captures + frame-by-frame compute shader inspection.
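The runtime-toggleable fallback could be as small as the following. The variable name `ACDREAM_CULL` and the `CullMode` / `CullModeConfig` types are invented for this sketch; the document only says the toggle would be an env var.

```csharp
using System;

public enum CullMode { Gpu, Cpu }

public static class CullModeConfig
{
    // Hypothetical variable name; not taken from the acdream codebase.
    public const string EnvVar = "ACDREAM_CULL";

    // "cpu" (any casing, surrounding whitespace ignored) selects the CPU path;
    // anything else, including an unset variable, keeps the GPU path.
    public static CullMode Parse(string value)
    {
        return value != null &&
               string.Equals(value.Trim(), "cpu", StringComparison.OrdinalIgnoreCase)
            ? CullMode.Cpu
            : CullMode.Gpu;
    }

    public static CullMode FromEnvironment()
    {
        return Parse(Environment.GetEnvironmentVariable(EnvVar));
    }
}
```

Reading the variable once at startup and keeping both paths behind the same interface would let an A/B run differ only in the environment, which keeps the comparison honest.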
---
## When to schedule these
**Tier 2:**
- Best fit: dedicated 2-week phase after a SHIP cycle. Treat it like a Phase B/C/N (e.g., name it Phase A.6 or N.7).
- Trigger: user wants to push radius beyond 12 (e.g., to 15 or 20 for true continent-scale horizon).
- Trigger: user wants to add 100+ active NPCs in a city without dropping below 240Hz.
**Tier 3:**
- Best fit: after Tier 2 has been live and stable for at least one cycle.
- Trigger: shadow map work begins (GPU cull + shadow cull share the same compute pre-pass infrastructure).
- Trigger: user wants 500+ FPS sustained for very-high-refresh scenarios (360Hz monitors, future hardware).
**Both:**
- Don't bundle with other phases. These are dedicated perf phases with their own brainstorm + spec + plan + SHIP cycles.
---
## What's "free" or smaller (out of Tier 1/2/3 scope but worth noting)
- **Plumb `JobKind` properly through `BuildLandblockForStreaming`** (~30 min). Today's Bug A patch wastes worker-thread CPU on hydration that gets thrown away for far-tier. Cleaner code, slight CPU savings on worker.
- **Eliminate `ToEntries` adapter allocation in `Draw`** (~15 min). Tiny win (~25 KB / frame). Could fold into Tier 1.
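- **Persistent-mapped indirect buffer** (~2 days). Today's per-frame `glBufferData` call is replaced by a buffer allocated once with `glBufferStorage` and kept mapped (`GL_MAP_PERSISTENT_BIT`); note that this needs GL 4.4 or `ARB_buffer_storage`. Marginal win on RDNA 4; meaningful on lower-end GPUs.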
- **Multi-thread mesh-build worker pool** (~1 day). 2.7s first-traversal horizon-fill drops to 0.7s with 4 workers. UX win on first walk-into-region.
These are good candidates for a "perf polish" mini-phase or to backfill into Tier 2.
---
## The architectural ceiling
Even with all three tiers, **a faithful AC client written in C# with bindless OpenGL tops out around 800-1500 FPS at radius=12 on RDNA 4 hardware**. Beyond that requires:
- Native C++ rendering core (eliminate .NET GC + JIT overhead)
- DX12/Vulkan API (eliminate driver state validation)
- Offline content cooking (eliminate runtime mesh/texture decode)
Each of those is a several-month undertaking and represents "becoming a different engine." The realistic target for acdream is 240-500 FPS at the user's monitor refresh, comfortably ahead of the visible-stutter threshold. Tier 1 + Tier 2 alone should deliver that for radius=12-15.
"Unreal-level FPS at full quality" is a different project.