diff --git a/docs/research/2026-05-10-phase-a5-handoff.md b/docs/research/2026-05-10-phase-a5-handoff.md
new file mode 100644
index 0000000..ae70602
--- /dev/null
+++ b/docs/research/2026-05-10-phase-a5-handoff.md
@@ -0,0 +1,376 @@
+# Phase A.5 — Two-tier Streaming + Horizon LOD — Cold-Start Handoff
+
+**Created:** 2026-05-10, immediately after N.5b ship.
+**Audience:** the next agent picking up streaming + horizon-LOD work.
+**Purpose:** brief you on where N.5b left things, what A.5 actually has to do
+to make the world look and feel great, and the load-bearing facts the
+brainstorm should be informed by.
+
+---
+
+## TL;DR
+
+N.5b just shipped: outdoor terrain rendering is on bindless + multi-draw
+indirect via `TerrainModernRenderer`. Dispatch cost is constant as the
+visible landblock count grows — radius=5 vs radius=15 are the same number
+of GL calls for terrain.
+
+**A.5's actual goal — verbatim from the user, 2026-05-09:**
+
+> "I just want great smooth HIGH fps visuals. Should look great. As long
+> as it scales and we get very high FPS"
+
+That reframes priorities. We are NOT optimizing the inner loop at radius=5
+(it's solved). We're scaling visual reach + scene density without the
+client falling off a perf cliff.
+
+**Concretely, A.5 ships three things:**
+
+1. **Two-tier streaming.** Near tier (≤ N₁ landblocks) loads everything as
+   today (terrain + scenery + EnvCells + collision). Far tier (N₁ < r ≤ N₂)
+   loads terrain mesh ONLY. No scenery generation, no collision, no
+   entity registration for the far tier.
+2. **Per-LB entity bucketing for the WB dispatcher.** Today the entity
+   dispatcher walks every loaded entity each frame for AABB cull —
+   ~16K entities at ≈0.27 µs each (amortized) ≈ 4.3 ms/frame, dominating
+   the frame budget. Bucket entities by landblock so the cull is
+   hierarchical: cull the LB first, then only walk entities inside
+   surviving LBs.
+3. **Off-thread mesh build.** `LandblockMesh.Build` currently runs on the
+   render thread when a new LB streams in.
+   At today's radius=5 this is
+   invisible; at A.5's higher N₂ it becomes a visible frame-time spike
+   when 4-5 LBs stream simultaneously. Move the build to a worker pool;
+   hand finished `LandblockMeshData` back via a queue.
+
+The headline win you're shooting for: **radius=15 sustains the user's
+target FPS in Holtburg with no streaming hitches.**
+
+---
+
+## Where N.5b left things
+
+### Branch state (relative to main)
+
+After N.5b ships:
+- N.5b SHIP at `08b7362` (final commit; appended SHIP record to plan)
+- Roadmap entry, issue #51 closure, perf baseline doc all in place at `083c10c`
+- Legacy `TerrainChunkRenderer` + `TerrainRenderer` + `terrain.vert/.frag`
+  deleted at `7dfa2af`. **The modern path is the only path.**
+
+### Captured perf baseline (load-bearing for A.5's "what's actually hot")
+
+From `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`, measured
+2026-05-09 at Holtburg town dueling field, radius=5, ~30s standstill:
+
+| Subsystem | cpu_us median per frame | Notes |
+|---|---|---|
+| **Entity dispatcher** (`WbDrawDispatcher`) | **~4,300** | 86% of frame budget. ~16K entities walked for AABB cull. THIS is the bottleneck. |
+| Terrain dispatcher (`TerrainModernRenderer`) | ~6.4 | <1% of frame. Constant-cost regardless of radius (proved in N.5b). |
+| Everything else (sky, particles, ImGui, swap, audio) | ~700 | Small. |
+
+**Actual FPS at radius=5 in Holtburg: ~200 fps** (frame time ≈ 5ms).
+NOT the "810 fps" inferred from the N.5 ship doc (that was 1/dispatcher_ms,
+which is only the WB dispatcher CPU cost in isolation, not real frame time).
+
+### What naive radius increase does
+
+If you simply raised `ACDREAM_STREAM_RADIUS` to 15 today without A.5:
+
+- Loaded landblocks: 121 → ~961 (11×11 → 31×31, ~8× more). Acceptable.
+- Loaded entities: ~16K → ~125K (linear scaling with LB count). **NOT
+  acceptable.** At ~1µs per AABB cull, the entity dispatcher would take
+  ~125ms/frame = 8 FPS. Slideshow. Even at the ≈0.27 µs per entity implied
+  by the measured baseline (4,300 µs ÷ ~16K entities), it would take
+  ~34ms/frame ≈ 30 FPS, still far below target.
+- Memory footprint: similar 8× explosion in scenery instance buffers.
+
+So the perf cliff is real and immediate. A.5 has to address it BEFORE
+the radius can be safely raised.
+
+### What N.5b set up that A.5 inherits
+
+- **Modern terrain dispatcher.** `TerrainModernRenderer` is O(1) GL calls
+  in radius. As you add far-tier LBs (terrain only), the terrain
+  dispatcher cost stays flat (~6µs/frame). This is the one subsystem
+  that doesn't need any A.5 work — it just scales.
+- **Slot allocator for terrain GPU buffers.** Already grows by power-of-two
+  doubling. Will absorb radius=15 (~961 slots × ~15 KB each = ~14 MB)
+  without manual tuning.
+- **`[TERRAIN-DIAG]` instrumentation.** Reports per-frame median + p95 in
+  microseconds. Use this to confirm A.5 doesn't regress terrain perf.
+- **Conformance sentinel.** `TerrainModernConformanceTests` proves visual
+  mesh Z agrees with `TerrainSurface.SampleZFromHeightmap` to 0.015 mm.
+  Don't break this — physics ↔ visual agreement must hold across both
+  tiers.
+- **Bindless atlas.** `TerrainAtlas.GetBindlessHandles()`. The far tier
+  shares the atlas (it's region-wide). Zero atlas-related per-LB cost.
+
+---
+
+## The brainstorm questions (the hard calls A.5 has to make)
+
+These are the questions to resolve in the brainstorm step. Bring them to
+the user with options + recommendation; don't prejudge.
+
+### 1. Tier radii: what are N₁ and N₂?
+
+- **N₁** = near-tier radius (everything loads). Today's default `STREAM_RADIUS`.
+  Probably stays at 5 (or maybe 4; maybe 3).
+- **N₂** = far-tier radius (terrain mesh only). Could be 8, 12, 15, 20.
+
+Tradeoffs: bigger N₂ = more world visible = looks better. But each far-tier
+LB still costs ~16 KB GPU memory + a frustum cull AABB + a slot allocation.
+At N₂=15, that's ~961 LBs × 16 KB = ~15 MB GPU mem (cheap) + ~961 cull
+tests (cheap, ~1ms total at 1µs each — and we'll do this per-LB cull
+anyway as part of question 3 below).
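The tier-radius tradeoff above is simple arithmetic, so it can be tabulated for the candidate N₂ values before the brainstorm. The following is a back-of-envelope Python sketch, not project code: it assumes the streaming window is a square of side 2r+1 landblocks (consistent with the 121 and ~961 figures quoted in this handoff), and it takes the ~16 KB per landblock and ~1 µs per cull test estimates at face value. The function names are hypothetical.

```python
def landblocks_in_radius(radius: int) -> int:
    """Landblocks in a square window of side 2*radius+1 (assumed window shape)."""
    side = 2 * radius + 1
    return side * side


def far_tier_budget(n1: int, n2: int,
                    kb_per_lb: float = 16.0,
                    cull_us_per_lb: float = 1.0) -> dict:
    """Rough terrain-side cost of streaming out to n2 with a near tier of n1.

    kb_per_lb and cull_us_per_lb are the handoff's estimates, not measurements.
    """
    total = landblocks_in_radius(n2)
    near = landblocks_in_radius(n1)
    return {
        "total_lbs": total,
        "near_lbs": near,
        "far_lbs": total - near,                    # terrain-only landblocks
        "terrain_mb": total * kb_per_lb / 1024.0,   # every loaded LB keeps a mesh
        "lb_cull_us": total * cull_us_per_lb,       # one AABB test per loaded LB
    }


if __name__ == "__main__":
    for n2 in (8, 12, 15, 20):
        b = far_tier_budget(5, n2)
        print(f"N2={n2:2d}: {b['total_lbs']:5d} LBs "
              f"({b['far_lbs']} far-tier), "
              f"{b['terrain_mb']:.1f} MB terrain, "
              f"{b['lb_cull_us']:.0f} us LB cull")
```

With N₁=5 and N₂=15 this gives 961 total landblocks, 840 of them far-tier, and about 15 MB of terrain memory, which matches the estimate above.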
+
+Verify against retail: cdb attach + check how many landblocks retail keeps
+loaded at a given vantage point. Probably around 10-12 per the AC2D
+references and the holtburger client's behavior.
+
+### 2. Far tier: terrain only? Or also impostor scenery?
+
+Two options:
+- **Terrain only** (cleanest). Beyond N₁, no trees, no rocks. Skyline is the
+  terrain mesh against the sky.
+- **Impostor scenery** (more retail-like). Beyond N₁, generate flat
+  billboards or low-poly trees instead of full meshes. Adds substantial
+  complexity (billboard pipeline, mesh-LOD generation, per-camera-angle
+  rotation).
+
+Recommendation: start with terrain-only. Add impostors only if the
+horizon looks wrong (too bare). Retail definitely has SOME distant
+scenery but the cutoff is gradual; we can match it later if needed.
+
+### 3. Entity bucketing structure
+
+Today: `WbDrawDispatcher` keeps a flat dictionary of all entities and
+walks all of them per frame. To bucket by LB, we need:
+
+- A `Dictionary` mapping each landblock ID to a collection of its entities
+- On `AddEntity(...)`, also stash it in the LB bucket (the spawn flow
+  already knows the LB context)
+- On `RemoveEntity(...)`, remove from the LB bucket too
+- Per frame: cull at LB granularity first; then cull entities only inside
+  surviving LBs
+
+LB-level AABBs are already computed (per the existing `_visibleSlots`
+logic in `TerrainModernRenderer` — the same AABB applies to entities,
+modulo a Z-range bump for trees/buildings).
+
+Open question: do entities outside a known LB exist? (Items dropped on the
+ground? Ephemeral effects? Player projectiles?) If yes, they need a
+fallback "unknown LB" bucket that's still walked every frame. Probably
+small.
+
+### 4. Where does the off-thread mesh build land?
+
+Today `LandblockMesh.Build` runs synchronously inside `OnLandblockLoaded`
+on the render thread.
+To move it off:
+
+- `StreamingLoader` worker thread (already async for dat reads) signals
+  "LB X is ready"
+- A new worker pool consumes that signal, builds the mesh on a worker
+  thread, posts the finished `LandblockMeshData` to a `ConcurrentQueue`
+- Render thread drains the queue at the start of each frame, calling
+  `_terrain.AddLandblock(...)` for each ready mesh
+
+Gotcha: the `TerrainBlendingContext` is shared. It should be read-only
+(it is built once at startup), but confirm that before sharing it across
+threads. Also `_surfaceCache` —
+currently a plain `Dictionary` populated lazily by `TerrainBlending.BuildSurface`.
+Either lock it, replace with `ConcurrentDictionary`, or pre-populate with
+all known palCodes at startup.
+
+### 5. Streaming hysteresis at the tier boundary
+
+As the player moves, LBs behind them cross the N₁ boundary outward: they
+leave the near tier and join the far tier.
+LBs that were near-tier need to:
+- Drop their scenery (unregister entities)
+- Drop their EnvCells
+- Keep the terrain mesh (still in far tier)
+
+When the player comes back, the LB needs its scenery + EnvCells re-loaded.
+Hysteresis is needed so LBs don't churn at the exact boundary.
+
+The streaming loader already has hysteresis for full LB load/unload. A.5
+extends that: a separate hysteresis radius for the scenery/entity layer.
+
+### 6. Visual quality wins to ride along
+
+A.5 is the natural place to land 2-3 nearly-free quality wins:
+
+- **Mipmapped terrain atlas + anisotropic 16x.** Today the atlas is
+  `GL_LINEAR` with no mipmaps; distant terrain shimmers. ~half-day fix.
+  Big visible improvement at far tier.
+- **Tree alpha-test → alpha-to-coverage with MSAA.** Today tree edges are
+  a binary cutoff and look pixel-edged. A2C with MSAA fixes them. ~one day.
+- **Correct depth-write for transparent foliage.** Some scenery passes
+  may be writing depth incorrectly; confirm + fix.
+
+These are not strictly required for A.5 to ship, but they amplify the
+"looks great" payoff.
+
+### 7. Acceptance metrics
+
+The user's goal is "smooth + high FPS + great-looking + scales."
+Pin this concretely:
+
+- Target FPS at the final tier radii (whatever N₁ and N₂ end up being):
+  ≥ user's monitor refresh (probably 144 or 240 Hz). Capture before/after
+  numbers in a perf baseline doc parallel to N.5b's.
+- No frame-time spikes > 5ms during streaming (record a 60-second
+  trace running through Holtburg → North Yanshi).
+- Visual horizon visible at the new N₂. Capture screenshots from the
+  same vantage point at the start of A.5 (before) and at ship (after)
+  for the SHIP record.
+
+### 8. What's NOT in A.5
+
+A.5 does not need to ship:
+- GPU-side culling (compute-shader cull). Bigger lift; N.6 territory.
+- Persistent-mapped indirect buffer. N.6 territory.
+- Sky / particles / EnvCells migration. Separate N.7+ phases.
+- Shadow mapping. Separate visual phase.
+
+Don't let scope creep pull these in.
+
+---
+
+## Files to read before brainstorming
+
+In rough order of relevance:
+
+1. **`docs/research/2026-05-09-phase-n5b-handoff.md`** — N.5b's handoff
+   (read for context on what was just shipped + the structure of these
+   handoff docs).
+2. **`docs/plans/2026-05-09-phase-n5b-perf-baseline.md`** — captured
+   perf numbers + the architectural reasoning for what A.5 inherits.
+3. **`memory/project_phase_n5b_state.md`** — three high-value gotchas
+   captured during N.5b (especially #1: bindless uniform-sampler driver
+   quirk; A.5 won't directly need this, but it's the prior art for any
+   new shader code in the phase).
+4. **`docs/plans/2026-04-11-roadmap.md`** A.5 entry — the original A.5
+   description.
+5. **The streaming loader** — `src/AcDream.Core/World/StreamingLoader.cs`
+   (or wherever it lives; grep for `OnLandblockLoaded`). Understand the
+   existing ring + hysteresis logic before extending it.
+6. **WB dispatcher entity flow** —
+   `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs` lines covering
+   `Draw` (the per-entity walk) and `EntitySpawnAdapter` (where entities
+   get registered). The bucketing change lands here.
+7. **`LandblockMesh.Build`** — `src/AcDream.Core/Terrain/LandblockMesh.cs`.
+   Its inputs (heightmap, ctx, surfaceCache) determine what the worker
+   thread needs. ~150 lines.
+8. **WB's `SceneryRenderManager`** —
+   `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/SceneryRenderManager.cs`.
+   Has a render-distance cap; informs N₁ vs N₂ defaults.
+9. **`TerrainModernRenderer`** —
+   `src/AcDream.App/Rendering/TerrainModernRenderer.cs`. Don't modify;
+   confirm the slot allocator handles radius=15 cleanly.
+
+---
+
+## Acceptance criteria for the whole phase
+
+1. Build green; existing tests stay green; N.5b's conformance sentinel
+   still passes (visual mesh Z = TerrainSurface Z within 1mm).
+2. **Far-tier LBs render terrain visibly past N₁** in user-driven visual
+   verification.
+3. **Per-frame entity-dispatcher cpu_us at radius=N₁ drops** vs today
+   (the bucketing should help even at the current radius).
+4. **Per-frame entity-dispatcher cpu_us with both tiers loaded (near tier
+   to N₁, far tier out to N₂) is bounded**
+   — does NOT scale linearly with total loaded LBs. Specifically:
+   bucketed cull should be < 1.5× today's cost despite far-tier LBs
+   loading.
+5. **No streaming hitch > 5ms** when running at run speed while LBs cross
+   both the N₁ and N₂ tier boundaries (capture a 60s trace).
+6. **`[TERRAIN-DIAG]` cpu_us stays flat** as N₂ grows — the terrain
+   dispatcher was proven O(1) in N.5b (regression check).
+7. Visual identity at near-tier (no scenery missing inside N₁; no
+   z-fighting; no cell-boundary wobble — N.5b sentinel still applies).
+8. SHIP record + perf baseline + memory entry written, mirroring N.5b's
+   pattern.
+
+---
+
+## What you'll be doing in the first 30 minutes
+
+1. Read this handoff in full.
+2. Read `docs/research/2026-05-09-phase-n5b-handoff.md` for the structural
+   pattern.
+3. Read `docs/plans/2026-05-09-phase-n5b-perf-baseline.md` for the captured
+   numbers A.5 inherits.
+4. Read `memory/project_phase_n5b_state.md` for gotchas.
+5. Verify build is green: `dotnet build`.
+6. Verify N.5b ship is intact: `dotnet test --filter "FullyQualifiedName~TerrainSlot|FullyQualifiedName~TerrainModernConformance|FullyQualifiedName~Wb|FullyQualifiedName~MatrixComposition|FullyQualifiedName~TextureCacheBindless"` (target ≥114 passing, 0 failures).
+7. Capture a baseline radius=5 frame trace yourself (one launch, 30s
+   standstill at Holtburg dueling field) so you have a "before" number
+   in your own measurement environment, not just trusting N.5b's number.
+8. Invoke `superpowers:brainstorming` with the user. Walk through the
+   8 brainstorm questions above. Present each with options + my
+   recommendation; don't prejudge.
+9. After agreement, write the spec; then the plan; then execute
+   task-by-task using `superpowers:subagent-driven-development`.
+
+Don't skip the brainstorm. The N₁/N₂ values, the bucketing structure
+trade-offs, and the worker-thread design are real decisions with
+downstream consequences that need user input — not "the agent makes a
+call and goes."
+
+---
+
+## Things to NOT do
+
+- **Don't raise `ACDREAM_STREAM_RADIUS` without A.5's tiered loading
+  in place.** The entity-cull cliff is immediate and severe (~8 FPS at
+  naive radius=15 by the ~1µs/test estimate).
+- **Don't put scenery in the far tier just to "look more retail" without
+  a billboard/impostor pipeline.** Full-detail scenery in the far tier
+  is what causes the cull cliff.
+- **Don't move `LandblockMesh.Build` to a worker thread without first
+  auditing `TerrainBlendingContext` + `_surfaceCache` for thread
+  safety.** Concurrent writes to the surfaceCache can corrupt it and
+  produce silently-wrong terrain blending.
+- **Don't break the N.5b conformance sentinel.** If A.5 changes how
+  meshes are built (e.g., for the worker thread), the conformance
+  test must still pass — it's the load-bearing physics ↔ visual Z
+  agreement guard.
+- **Don't bundle GPU-side culling, persistent-mapped buffers, or shadow
+  mapping into A.5.** Those are N.6+ territory; A.5 is "make the world
+  look big and not stutter."
+- **Don't ship without honest perf numbers.** If A.5 doesn't actually
+  hit its FPS target, document why and ship N.6 next instead of
+  papering over it. The N.5b precedent is honest reporting.
+- **Don't skip the visual verification gate.** Same lesson from N.5b's
+  black-terrain regression: "go" doesn't mean "verified." User must
+  actually launch the client at radius=N₂ and confirm the horizon
+  looks great + FPS hits target.
+
+---
+
+## Reference: where the FPS budget actually goes today
+
+For brainstorming purposes, the per-frame breakdown at radius=5 / Holtburg
+(real measurement, 2026-05-09):
+
+```
+~5,000 µs total frame time (= 200 fps)
+├── 4,300 µs  WbDrawDispatcher entity cull + dispatch  ← THE BOTTLENECK
+│             ~16K entity AABB tests / frame
+│             A.5's entity bucketing attacks this directly
+├── 6 µs      TerrainModernRenderer
+│             O(1) in radius. Won't grow with A.5. Already solved.
+├── ~700 µs   Sky, particles, ImGui, audio, swap-buffers, misc
+│             Mostly fixed cost; some VSync-related
+└── rest      GPU side (we don't measure this — query plumbing
+              deferred to N.6). Could be substantial.
+```
+
+The first action of A.5 is to recognize that the perf claim "810 fps"
+from N.5 was misleading. Don't repeat the mistake — measure the actual
+frame time, not just one subsystem.
+
+---
+
+Good luck. The phase is meaty (~2 weeks) but the structural work is
+well-shaped: tiered streaming has clear boundaries, entity bucketing is
+an isolated dispatcher change, off-thread mesh build is a well-understood
+worker pattern. The hard call is the N₁/N₂ values, and that's a
+brainstorm question — bring it to the user with data.
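As a cross-check on the budget breakdown and on the naive radius=15 projection, the CPU-side numbers can be recombined in a few lines. This is a rough Python sketch rather than project code: it assumes entity count scales with loaded landblock count in a square (2r+1)² window, that the non-entity costs stay fixed, and it ignores GPU time (which this handoff says is not measured). `naive_frame_us` and the constant names are hypothetical.

```python
# Measured CPU medians from the radius=5 Holtburg baseline, in microseconds.
BASELINE_US = {
    "entity_dispatcher": 4300.0,
    "terrain_dispatcher": 6.4,
    "everything_else": 700.0,
}
BASELINE_ENTITIES = 16_000   # "~16K" in the handoff
BASELINE_RADIUS = 5


def landblocks_in_radius(radius: int) -> int:
    """Landblocks in a square window of side 2*radius+1 (assumed window shape)."""
    return (2 * radius + 1) ** 2


def fps(frame_us: float) -> float:
    """Frames per second for a given CPU frame time in microseconds."""
    return 1_000_000.0 / frame_us


def naive_frame_us(radius: int, us_per_entity=None) -> float:
    """CPU frame time if the radius is raised with no tiering or bucketing.

    us_per_entity defaults to the amortized cost implied by the baseline
    (4,300 us / ~16K entities); pass 1.0 to use the handoff's rough estimate.
    """
    if us_per_entity is None:
        us_per_entity = BASELINE_US["entity_dispatcher"] / BASELINE_ENTITIES
    scale = landblocks_in_radius(radius) / landblocks_in_radius(BASELINE_RADIUS)
    entities = BASELINE_ENTITIES * scale
    fixed = BASELINE_US["terrain_dispatcher"] + BASELINE_US["everything_else"]
    return entities * us_per_entity + fixed


if __name__ == "__main__":
    total = sum(BASELINE_US.values())
    for name, cost in BASELINE_US.items():
        print(f"{name:20s} {cost:8.1f} us  {100.0 * cost / total:5.1f}%")
    print(f"baseline: {fps(total):.0f} fps")
    print(f"radius=15 at 1 us/entity:  {fps(naive_frame_us(15, 1.0)):.1f} fps")
    print(f"radius=15 at measured cost: {fps(naive_frame_us(15)):.1f} fps")
```

Under these assumptions the baseline comes out at about 200 fps, the naive radius=15 case at roughly 8 fps with the 1 µs estimate, and roughly 29 fps with the amortized cost derived from the measurement. Either way it falls well short of a 144 Hz target, which is the point of the tiering and bucketing work.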