Merge branch 'claude/hopeful-darwin-ae8b87' — Phase A.5 SHIP + Quality Preset system

Phase A.5 — Two-tier Streaming + Horizon LOD shipped.

Headline: 2.3 km terrain horizon (radius=4 near + 12 far) with off-thread
mesh build, fog blend at N₁, mipmaps + 16x AF, MSAA 4x + A2C foliage,
depth-write audit, BUDGET_OVER diag, Quality Preset system (Low/Medium/
High/Ultra) with env-var overrides + F11 mid-session re-apply.

~999 tests pass; 8 pre-existing physics/input failures unchanged.

Two bug fixes, both structural to A.5, shipped post-T26:
- Bug A (9217fd9): far-tier worker strips entities (T13/T16 had only
  wired the controller side; far-tier was loading full entity layers,
  ~71K entities instead of ~10K, 5x perf regression).
- Bug B (0ad8c99): WalkEntities scratch list reused across frames
  (was 480 KB / frame allocation).

Tier 1 entity-classification cache attempted as polish (3639a6f),
reverted (9b49009) — broke animation by caching mutable per-frame
state. Retry deferred to post-A.5 polish phase (ISSUE #53).

Deferred to post-A.5 polish:
- Tier 1 retry with animation-mutation audit (ISSUE #53)
- Lifestone missing visual (ISSUE #52)
- JobKind plumbing through BuildLandblockForStreaming (ISSUE #54)
- Tier 2 (static/dynamic split) + Tier 3 (GPU compute cull) —
  separate multi-week phases. Roadmap at
  docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md.

SHIP commit: 9245db5.
Erik 2026-05-10 10:09:03 +02:00
commit d3d78fa14f
37 changed files with 6001 additions and 281 deletions


@@ -1,6 +1,6 @@
# acdream — strategic roadmap
**Status:** Living document. Updated 2026-05-09 for Phase N.5b shipping (terrain on the modern rendering path via Path C — mirror WB's `TerrainRenderManager` pattern, consume `LandblockMesh.Build` for retail formula compliance; closes ISSUE #51). N.6 (perf polish) remains the in-flight phase.
**Status:** Living document. Updated 2026-05-10 for Phase A.5 shipping (two-tier streaming N₁=4/N₂=12 + QualityPreset system + Bug A/B fixes; closes the two-tier streaming spec). Post-A.5 polish (Tier 1 retry + lifestone fix + JobKind plumbing) is now the in-flight work.
**Purpose:** One source of truth for where the project is and where it's going. Every observed defect or missing feature has a named phase that owns it; when something looks wrong in-game, look here to find the phase that'll address it. Implementation details live in per-phase specs under `docs/superpowers/specs/`, not in this file.
---
@@ -31,6 +31,7 @@
| A.1 | Streaming landblock loader — runtime-configurable visible window (default 5×5, `ACDREAM_STREAM_RADIUS`), camera-centered offline / player-centered live, hysteresis-based unloads, pending-spawn list for late CreateObject events | Live ✓ |
| A.2 | Frustum culling — per-landblock AABB test (Gribb-Hartmann), terrain + static-mesh renderers skip culled landblocks, perf overlay in window title | Visual ✓ |
| A.3 | Background net receive thread — dedicated daemon thread buffers UDP into Channel, render thread drains | Visual ✓ |
| A.5 | Two-tier streaming + horizon LOD — N₁=4 (full detail, 81 LBs) + N₂=12 (terrain only, 544 LBs); fog blend at N₁; per-LB entity dispatcher walk tightened (Change #1 animated-walk fix + Change #2 cached AABB); single-worker off-thread mesh build; mipmaps + 16x anisotropic on TerrainAtlas; A2C with MSAA 4x on foliage; depth-write audit + lock-in test; **NEW T22.5: QualityPreset system** (Low/Medium/High/Ultra) with per-preset radii + MSAA + anisotropic + A2C + completions; env-var overrides per field; F11 mid-session re-apply. **Bug fixes post-T26 ship-prep**: (Bug A) far-tier worker now strips entities from far-tier loads — without this fix, far-tier LBs were loading their full entity layer (~71K entities) defeating the two-tier optimization; (Bug B) WalkEntities switched from per-frame fresh-list allocation to caller-provided scratch list (eliminated ~480 KB/frame GC pressure). **Deferred to post-A.5**: Tier 1 entity-classification cache (first attempt broke animation; revert + redo with animation-mutation audit), lifestone visual (missing in render), JobKind plumbing through BuildLandblockForStreaming (proper Bug A fix), Tier 2/3 perf optimizations (roadmap at docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md). Plan archived at docs/superpowers/plans/2026-05-09-phase-a5-two-tier-streaming.md. | Live ✓ |
| B.3 | Physics MVP resolver foundation — terrain contact, CellSurface prototype, streaming-populated collision inputs, and first `PhysicsEngine` resolver path. Not the complete retail collision system. | Tests ✓ |
| B.2 | Player movement mode — Tab-toggled WASD ground walking, walk/run/idle animations, third-person chase camera, MoveToState + AutonomousPosition outbound, portal entry. Outdoor-only MVP. | Live ✓ |
| D.1 | 2D ortho overlay + font rendering (StbTrueTypeSharp atlas + TextRenderer + DebugOverlay) | Visual ✓ |
@@ -82,7 +83,7 @@ Plus polish that doesn't get its own phase number:
- **✓ SHIPPED — A.2 — Frustum culling.** Per-landblock AABB test (Gribb-Hartmann plane extraction + positive-vertex AABB test) in both `TerrainRenderer.Draw` and `StaticMeshRenderer.Draw`. Per-entity culling deferred. LOD deferred to Phase C. Performance overlay in window title shows FPS, frame time, visible/total landblock ratio, entity count, animated count. ~160fps uncapped at 5×5 radius.
- **✓ SHIPPED — A.3 — Background net receive thread.** Dedicated daemon thread continuously pulls raw UDP datagrams from the kernel buffer into a `Channel<byte[]>`. Render thread's `Tick()` drains the channel. All decode, fragment assembly, ISAAC crypto, event dispatch, and ack-sending remain on the render thread — minimal change that prevents packet drops during frame stalls. Thread starts after `EnterWorld()` completes; `PumpOnce()` during handshake still reads the socket directly.
- **A.4 — Async dat decoding.** Folded into the streaming worker — it's the worker's read path, not a separate subsystem. Called out here because regressions in dat caching could land on this surface.
- **A.5 — Two-tier streaming + terrain horizon LOD.** Split `ACDREAM_STREAM_RADIUS` into two: `ACDREAM_TERRAIN_RADIUS` (large, 8-12 cells = 1.5-2.3km) for terrain mesh + `ACDREAM_ENTITY_RADIUS` (small, 2-3 cells, current default) for entities + scenery. Distant landblocks render terrain only — no NPCs, no procedural scenery, no static objects. Tune `SceneLightingUbo`'s `uFogParams` so the far edge fades into sky color (eliminates the hard streaming boundary visible at higher radii). Optional: terrain LOD via mesh decimation for very distant chunks (combine 2×2 landblocks into one decimated mesh; cribs from `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/TerrainRenderManager.cs`). Motivation: at radius=5 today, perf scales from ~810 fps → ~200-300 fps because everything stays full-detail; both retail and WorldBuilder render terrain way out and strip entities/scenery at distance. Enables WB-style horizon visibility. **Estimate: 3-5 days for the radius split + fog tuning; +1 week if terrain LOD is included.** Not yet brainstormed.
- **✓ SHIPPED — A.5 — Two-tier streaming + horizon LOD.** Shipped 2026-05-10. See shipped table above for full description. Plan archived at `docs/superpowers/plans/2026-05-09-phase-a5-two-tier-streaming.md`.
**Acceptance:**
- Walk across 10+ landblocks in any direction, no crashes, no empty voids.
@@ -665,7 +666,7 @@ for our deletions/additions; merge upstream `master` periodically.
manifest at higher radius. Spec acceptance criterion #5 was wrong;
amended via `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`. Plan
archived at `docs/superpowers/plans/2026-05-09-phase-n5b-terrain-modern.md`.
- **N.6 — Perf polish.** **Currently in flight.**
- **N.6 — Perf polish.** **Planned (post-A.5 polish takes priority).**
Builds on N.5 + N.5b. Legacy renderer retirement was pulled forward
into N.5 ship amendment — `InstancedMeshRenderer`, `StaticMeshRenderer`,
`WbFoundationFlag` are gone — and the terrain legacy renderer
@@ -676,8 +677,8 @@ for our deletions/additions; merge upstream `master` periodically.
is a candidate), GPU-side culling via compute pre-pass (eliminates
the per-frame slot walk + DEIC build entirely), GL_TIME_ELAPSED query
double-buffering (deferred from N.5 — diagnostic shows `gpu_us=0/0`
under `ACDREAM_WB_DIAG=1`), direct higher-radius perf comparison once
A.5 lands (where modern's architectural wins manifest), retire the
under `ACDREAM_WB_DIAG=1`), direct higher-radius perf comparison (A.5
has now landed — modern's architectural wins are measurable), retire the
legacy `Texture2D`/`sampler2D` path in `TextureCache` (currently kept
for Sky + Debug + particle paths now that Terrain has migrated).
Plan + spec written when work begins. **Estimate: 1-2 weeks.**


@@ -0,0 +1,195 @@
# Performance Tiers 2 + 3 — Future Roadmap
**Created:** 2026-05-10 during Phase A.5 polish.
**Status:** Future planning — not for current execution.
**Context:** A.5 shipped two-tier streaming with the entity dispatcher landing at ~3.5ms median (post-Bug-A and Bug-B fixes). Tier 1 (entity-classification cache) lands as A.5 polish and brings the dispatcher inside the 2.0ms spec budget. Tiers 2 + 3 are the "next big perf wins" beyond Tier 1.
---
## Background — why this exists
Discussion captured 2026-05-10: user observed 200-240 FPS at radius=12 on a Radeon 9070 XT @ 1440p and asked why an "old game like AC" doesn't deliver Unreal-level (1000+ FPS) on this hardware.
The honest answer: the bottleneck is *architectural*, not hardware. The CPU is single-threaded and rebuilds the entire draw plan from scratch every frame. Modern engines pre-bake static-world batches at content-cook time and rebuild only what changes.
AC's design — server-spawned per-entity world streamed at runtime — doesn't naturally batch the way Unreal's pre-cooked content does. Closing the gap requires backporting modern techniques while preserving AC's data model. Tiers 2 and 3 are that backporting work.
---
## Tier 2 — Static/dynamic split with persistent groups
**Estimated effort:** ~10-15 days (2-week phase).
**Estimated win:** entity dispatcher ~3.5ms → **~0.5-1ms median** at radius=12.
**Total frame time:** ~4-5ms → **~2-3ms ≈ 330-500 FPS at standstill.**
### The core idea
Today, `WbDrawDispatcher._groups` (the dictionary of "(mesh + texture + blend) → list of instances to draw") is cleared and rebuilt from scratch every frame.
For trees, rocks, buildings, and other static entities (~95% of the world), the answer is identical every frame forever. Tier 2 makes the static-group instance buffers **persistent GPU-resident data**, just like Unreal's pre-baked world. The CPU only orchestrates "which groups are visible" per frame.
### Architectural shift
```csharp
class StaticInstancedGroup
{
    public GroupKey Key;
    public Matrix4x4[] Matrices;               // grown as entities spawn
    public BitArray ActiveSlots;               // for free-list reuse
    public bool NeedsGpuUpload;                // dirty flag for delta upload
    public Dictionary<uint, int> EntityToSlot; // for despawn lookup
    public uint InstanceBufferOffset;          // start of group's slice in global SSBO
}
```
**On entity spawn (atlas-tier static):** allocate a slot in each relevant group, write the matrix, mark dirty.
**On entity despawn:** free the slot, mark dirty.
**Per frame:**
- Static groups: LB-cull each group (cheap). For visible groups, flag for draw. **No matrix copy. No list rebuild.**
- Dynamic entities (~50 NPCs/players): today's per-frame walk-and-classify. Keeps the existing slow path for things that legitimately change every frame.
- Upload only the dirty groups' matrix slices (delta upload, not full reupload).
- Issue 2 multi-draw-indirect calls.
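The spawn/despawn/dirty-flag lifecycle above can be sketched as follows — a simplified free-list that uses a `Stack<int>` in place of the `ActiveSlots` bitset; names beyond the `StaticInstancedGroup` fields are illustrative, not acdream's actual types:

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

// Simplified sketch of per-group slot bookkeeping: spawn reuses freed slots,
// despawn returns them to the free-list, both mark the group dirty so the
// per-frame pass uploads only this group's matrix slice.
class StaticGroupSlots
{
    readonly List<Matrix4x4> _matrices = new();
    readonly Stack<int> _freeSlots = new();               // free-list for reuse
    readonly Dictionary<uint, int> _entityToSlot = new(); // despawn lookup
    public bool NeedsGpuUpload { get; private set; }      // dirty flag

    // On entity spawn: reuse a freed slot if one exists, else grow the array.
    public int Spawn(uint entityId, Matrix4x4 world)
    {
        int slot = _freeSlots.Count > 0 ? _freeSlots.Pop() : _matrices.Count;
        if (slot == _matrices.Count) _matrices.Add(world);
        else _matrices[slot] = world;
        _entityToSlot[entityId] = slot;
        NeedsGpuUpload = true;                            // flag for delta upload
        return slot;
    }

    // On entity despawn: free the slot, mark dirty.
    public void Despawn(uint entityId)
    {
        int slot = _entityToSlot[entityId];
        _entityToSlot.Remove(entityId);
        _freeSlots.Push(slot);
        NeedsGpuUpload = true;
    }

    // Per frame: returns true once after any mutation, then resets.
    public bool ConsumeDirty()
    {
        bool wasDirty = NeedsGpuUpload;
        NeedsGpuUpload = false;
        return wasDirty;
    }

    public int LiveCount => _entityToSlot.Count;
}
```

The debug-build orphan-slot assertion from the Mitigations section falls out naturally here: `LiveCount + _freeSlots.Count` must always equal `_matrices.Count`.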
### Sub-decisions
**Frustum cull granularity at the group level:** at group level you can't reject individual instances; you draw the whole group or none of it. Two strategies:
- **Per-LB subgroups:** split each group into per-landblock subgroups. LB-frustum-culls reject subgroups whose LB is invisible. ~2K groups × ~5 LBs per group on average = ~10K subgroups. Each subgroup AABB cull is ~0.3 µs → ~3 ms per frame. Roughly a wash with today's per-entity cull.
- **Per-instance GPU cull (Tier 3):** compute pre-pass on the GPU writes which instances are visible to a draw-indirect buffer. ~0.05ms CPU. The right long-term answer.
For Tier 2 alone, per-LB subgroups are the recommended approach — keep CPU culling, just at coarser granularity than per-entity.
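The per-LB subgroup cull is the same positive-vertex AABB test A.2 already uses per landblock, just run per subgroup. A minimal sketch, assuming inward-facing planes as produced by Gribb-Hartmann extraction (inside half-space is `dot(n, p) + d >= 0`):

```csharp
using System.Numerics;

static class SubgroupCull
{
    // Positive-vertex test: returns false only when the AABB lies fully
    // outside some frustum plane; such a subgroup is skipped entirely.
    public static bool IsVisible(Vector3 min, Vector3 max, (Vector3 n, float d)[] planes)
    {
        foreach (var (n, d) in planes)
        {
            // Positive vertex: the AABB corner farthest along the plane normal.
            var p = new Vector3(
                n.X >= 0 ? max.X : min.X,
                n.Y >= 0 ? max.Y : min.Y,
                n.Z >= 0 ? max.Z : min.Z);
            if (Vector3.Dot(n, p) + d < 0)
                return false; // even the farthest corner is outside -> cull
        }
        return true; // conservative: may keep a box that only nearly intersects
    }
}
```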
**Dynamic entities crossing LB boundaries:** when an NPC walks across a landblock boundary, it stays in the same group key but its "spatial bucket" changes. Solution: dynamic entities are tracked in a single global "dynamic group" outside the per-LB structure; they don't need spatial bucketing because there are only ~50 of them.
**Palette override invalidation:** server event swaps an NPC's clothing color → group key changes. Treat as despawn-from-old + spawn-into-new. NPCs are dynamic so this just rebuckets them.
**Animation overrides on static entities:** static entities don't animate. Trees don't bend (foliage wave is a vertex shader effect, not a group-key change). Buildings don't move. So the static path never invalidates.
**EnvCell visibility:** dungeon entities are gated by per-cell visibility state. Need to track which group instances are tied to which cell, and during visibility cull, gate per-cell. Keep using existing `ParentCellId` field on WorldEntity.
**Streaming load/unload integration:** when an LB unloads, all its static entity matrices need to be removed from their groups. Free-list management. Matches existing `LandblockSpawnAdapter` lifecycle.
### Effort breakdown
| Task | Days |
|---|---|
| Design + invariants document | 2 |
| Spawn-time slot allocator + free-list | 3 |
| Per-frame visibility + dirty-flag delta upload | 2 |
| Dynamic entity path (NPCs, projectiles) | 2 |
| Invalidation (palette/ObjDesc events) | 2 |
| EnvCell visibility integration | 1 |
| Streaming load/unload integration | 1 |
| Conformance testing | 2-3 |
| **Total** | **~10-15 days** |
### Risks
- **Slot management bugs** = double-frees or leaks (entities draw at random positions — visible).
- **Invalidation bugs** = stale matrices (entity teleports back to spawn point when palette changes).
- **Dynamic entity tracking** adds complexity around the static/dynamic boundary.
### Mitigations
- **Conformance test:** render a fixed scene through both pipelines, compare draw output. Adds CI infrastructure.
- **Per-frame validation in debug:** walk all groups, assert no orphan slots.
- **Hash invariant test:** static entities should produce stable group keys frame-over-frame. Add a debug assertion that fires once per frame in Debug builds.
---
## Tier 3 — GPU-side culling (compute pre-pass)
**Estimated effort:** ~1 month (longer phase).
**Estimated win:** entity dispatcher ~0.5-1ms (post-Tier-2) → **~0.05ms median.**
**Total frame time:** ~2-3ms → **~1.5-2ms ≈ 500-670 FPS at standstill.**
### The core idea
Today (and after Tier 2), the CPU does per-LB or per-subgroup frustum culling and tells the GPU which groups to draw.
Tier 3 moves per-instance frustum cull to the GPU via a compute shader pre-pass. The CPU just uploads "here are all 1M instance matrices" once; the GPU compute shader writes which ones are visible to a draw-indirect buffer; the rasterizer draws only those.
This is the level Unreal is at. With this, per-frame CPU work for the entity dispatcher becomes essentially "tell the GPU what to do" + a tiny scratch upload.
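What the compute pre-pass computes can be expressed as a sequential CPU reference — a bounding-sphere test per instance, with compaction. On the GPU each instance is one compute invocation appending its index via an atomic counter; this sketch shows only the logic, not the GLSL, and all names are illustrative:

```csharp
using System.Collections.Generic;
using System.Numerics;

static class GpuCullReference
{
    // Per-instance cull + compaction: emit indices of instances whose bounding
    // sphere intersects the frustum. The resulting count is what the compute
    // shader would write into the indirect command's instanceCount.
    public static List<int> CompactVisible(
        Vector3[] centers, float radius, (Vector3 n, float d)[] planes)
    {
        var visible = new List<int>();
        for (int i = 0; i < centers.Length; i++)
        {
            bool inside = true;
            foreach (var (n, d) in planes)
            {
                // Sphere fully outside this plane -> culled.
                if (Vector3.Dot(n, centers[i]) + d < -radius) { inside = false; break; }
            }
            if (inside) visible.Add(i); // GPU: atomicAdd on instanceCount + index write
        }
        return visible;
    }
}
```

The fallback-path mitigation below amounts to running this reference on the CPU and diffing its output against the GPU buffer.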
### Why Tier 3 needs Tier 2 first
Without Tier 2's persistent group structure, GPU culling has nothing stable to operate on. The compute shader needs an addressable "here are the static instances" buffer to read from; that buffer only exists after Tier 2.
### Sub-decisions to be made
**Compute shader API:** OpenGL 4.3+ compute shaders are sufficient. We're already at GL 4.3+ for bindless. No additional capability requirement.
**Indirect draw command generation:** the compute shader writes a `DrawElementsIndirectCommand[]` buffer per pass. Render thread issues `glMultiDrawElementsIndirect` reading from that buffer. No CPU readback.
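For reference, the command layout the compute shader populates is the standard GL indirect-elements record — five 4-byte fields, 20 bytes per command, tightly packed in the indirect buffer (field names per the OpenGL spec; this is layout, not acdream code):

```csharp
using System.Runtime.InteropServices;

// Standard OpenGL DrawElementsIndirectCommand: the struct glMultiDrawElementsIndirect
// reads from the GL_DRAW_INDIRECT_BUFFER, one record per draw.
[StructLayout(LayoutKind.Sequential)]
struct DrawElementsIndirectCommand
{
    public uint Count;         // index count for this draw
    public uint InstanceCount; // written by the compute cull (0 = fully culled)
    public uint FirstIndex;    // offset into the index buffer
    public int  BaseVertex;    // added to each fetched index
    public uint BaseInstance;  // start of this group's slice in the instance SSBO
}
```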
**LOD selection:** opportunity to add per-instance LOD selection in the compute shader (distance-based mesh detail). Not needed for A.5's scope; could be a Tier 4 follow-up.
**Per-light shadow map culling:** if shadows ship, GPU culling extends naturally to per-light frustum cull. Significant win for shadow rendering.
### Effort breakdown
| Task | Days |
|---|---|
| Compute shader design + GLSL implementation | 4 |
| Buffer layout coordination with Tier 2 | 2 |
| Silk.NET compute dispatch integration | 3 |
| Indirect command compaction logic | 4 |
| LOD selection (optional, ~stretch) | 4 |
| Validation: per-instance cull matches CPU cull within epsilon | 3 |
| Conformance + regression testing | 5 |
| **Total** | **~21-25 days, ~1 month** |
### Risks
- **GPU stalls** if the compute shader takes longer than expected (esp. on lower-end GPUs).
- **Sync overhead** between compute pre-pass and rasterizer pass.
- **Debugging difficulty** — GPU compute bugs are harder to diagnose than CPU bugs.
### Mitigations
- **Profile-driven design:** measure compute shader runtime on target hardware before committing.
- **Fallback path:** keep CPU cull as a runtime-toggleable option (env var) so we can A/B compare.
- **GPU debugging tools:** RenderDoc captures + frame-by-frame compute shader inspection.
---
## When to schedule these
**Tier 2:**
- Best fit: dedicated 2-week phase after a SHIP cycle. Treat it like a Phase B/C/N (i.e., name it Phase A.6 or N.7).
- Trigger: user wants to push radius beyond 12 (e.g., to 15 or 20 for true continent-scale horizon).
- Trigger: user wants to add 100+ active NPCs in a city without dropping below 240Hz.
**Tier 3:**
- Best fit: after Tier 2 has been live and stable for at least one cycle.
- Trigger: shadow map work begins (GPU cull + shadow cull share the same compute pre-pass infrastructure).
- Trigger: user wants 500+ FPS sustained for very-high-refresh scenarios (360Hz monitors, future hardware).
**Both:**
- Don't bundle with other phases. These are dedicated perf phases with their own brainstorm + spec + plan + SHIP cycles.
---
## What's "free" or smaller (out of Tier 1/2/3 scope but worth noting)
- **Plumb `JobKind` properly through `BuildLandblockForStreaming`** (~30 min). Today's Bug A patch wastes worker-thread CPU on hydration that gets thrown away for far-tier. Cleaner code, slight CPU savings on worker.
- **Eliminate `ToEntries` adapter allocation in `Draw`** (~15 min). Tiny win (~25 KB / frame). Could fold into Tier 1.
- **Persistent-mapped indirect buffer** (~2 days). Today's `glBufferData` per frame becomes a pre-mapped persistent buffer. Marginal win on RDNA 4; meaningful on lower-end GPUs.
- **Multi-thread mesh-build worker pool** (~1 day). 2.7s first-traversal horizon-fill drops to 0.7s with 4 workers. UX win on first walk-into-region.
These are good candidates for a "perf polish" mini-phase or to backfill into Tier 2.
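The multi-worker mesh-build item could look like the following sketch using `System.Threading.Channels` — `buildOne` is a placeholder for the actual landblock mesh build, which is not shown:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

static class MeshBuildPool
{
    // Drain a job channel with N workers. With 4 workers and CPU-bound builds,
    // a 2.7s sequential horizon fill approaches 2.7/4 ≈ 0.7s.
    public static async Task<int> BuildAllAsync(
        IEnumerable<uint> landblockIds, int workers, Func<uint, Task> buildOne)
    {
        var jobs = Channel.CreateUnbounded<uint>();
        foreach (var id in landblockIds) jobs.Writer.TryWrite(id);
        jobs.Writer.Complete();

        int built = 0;
        var tasks = new Task[workers];
        for (int w = 0; w < workers; w++)
            tasks[w] = Task.Run(async () =>
            {
                await foreach (var id in jobs.Reader.ReadAllAsync())
                {
                    await buildOne(id);               // off-thread mesh build
                    Interlocked.Increment(ref built);
                }
            });
        await Task.WhenAll(tasks);
        return built;
    }
}
```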
---
## The architectural ceiling
Even with all three tiers, **a faithful AC client written in C# with bindless OpenGL tops out around 800-1500 FPS at radius=12 on RDNA 4 hardware**. Beyond that requires:
- Native C++ rendering core (eliminate .NET GC + JIT overhead)
- DX12/Vulkan API (eliminate driver state validation)
- Offline content cooking (eliminate runtime mesh/texture decode)
Each of those is a several-month undertaking and represents "becoming a different engine." The realistic target for acdream is 240-500 FPS at the user's monitor refresh, comfortably ahead of the visible-stutter threshold. Tier 1 + Tier 2 alone should deliver that for radius=12-15.
For "Unreal-level FPS at full quality," that's a different project.