acdream/docs/research/2026-05-10-phase-a5-handoff.md
Erik f7f88674e1 docs(A.5): cold-start handoff for the next session
Records what N.5b shipped, where the actual FPS bottleneck lives
(WbDrawDispatcher entity cull at ~4.3ms/frame, 86% of frame budget;
terrain dispatcher is now <1% of frame), and what A.5 has to do to
make the world look big without falling off a perf cliff.

Three concrete A.5 deliverables:
1. Two-tier streaming (near = full, far = terrain-only)
2. Per-LB entity bucketing in WbDrawDispatcher
3. Off-thread LandblockMesh.Build to avoid streaming hitches at higher
   radius

Eight brainstorm questions for the next session, plus acceptance
criteria, files-to-read list, and explicit "don't do" warnings (don't
raise STREAM_RADIUS without tiering in place; don't put scenery in
far tier without an impostor pipeline; don't break the N.5b conformance
sentinel; etc.).

User's stated goal verbatim: "great smooth HIGH fps visuals. Should
look great. As long as it scales and we get very high FPS." This
reframes priorities away from radius=5 micro-optimization toward
visual scale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 21:11:46 +02:00


Phase A.5 — Two-tier Streaming + Horizon LOD — Cold-Start Handoff

Created: 2026-05-10, immediately after N.5b ship. Audience: the next agent picking up streaming + horizon-LOD work. Purpose: brief you on where N.5b left things, what A.5 actually has to do to make the world look and feel great, and the load-bearing facts the brainstorm should be informed by.


TL;DR

N.5b just shipped: outdoor terrain rendering is on bindless + multi-draw indirect via TerrainModernRenderer. Dispatch cost stays constant as the visible landblock count grows: radius=5 and radius=15 issue the same number of GL calls for terrain.

A.5's actual goal — verbatim from the user, 2026-05-09:

"I just want great smooth HIGH fps visuals. Should look great. As long as it scales and we get very high FPS"

That reframes priorities. We are NOT optimizing the inner loop at radius=5 (it's solved). We're scaling visual reach + scene density without the client falling off a perf cliff.

Concretely, A.5 ships three things:

  1. Two-tier streaming. Near tier (≤ N₁ landblocks) loads everything as today (terrain + scenery + EnvCells + collision). Far tier (N₁ < r ≤ N₂) loads terrain mesh ONLY. No scenery generation, no collision, no entity registration for the far tier.
  2. Per-LB entity bucketing for the WB dispatcher. Today the entity dispatcher walks every loaded entity each frame for AABB cull: ~16K entities at a measured ~0.27µs each (cull + dispatch) ≈ 4.3ms/frame, dominating the frame budget. Bucket entities by landblock so the cull is hierarchical: cull the LB first, then only walk entities inside surviving LBs.
  3. Off-thread mesh build. LandblockMesh.Build currently runs on the render thread when a new LB streams in. At today's radius=5 this is invisible; at A.5's higher N₂ it becomes a visible frame-time spike when 4-5 LBs stream simultaneously. Move the build to a worker pool; hand finished LandblockMeshData back via a queue.

The headline win you're shooting for: radius=15 sustains the user's target FPS in Holtburg with no streaming hitches.


Where N.5b left things

Branch state (relative to main)

After N.5b ships:

  • N.5b SHIP at 08b7362 (final commit; appended SHIP record to plan)
  • Roadmap entry, issue #51 closure, perf baseline doc all in place at 083c10c
  • Legacy TerrainChunkRenderer + TerrainRenderer + terrain.vert/.frag deleted at 7dfa2af. The modern path is the only path.

Captured perf baseline (load-bearing for A.5's "what's actually hot")

From docs/plans/2026-05-09-phase-n5b-perf-baseline.md, measured 2026-05-09 at Holtburg town dueling field, radius=5, ~30s standstill:

| Subsystem | cpu_us median per frame | Notes |
|---|---|---|
| Entity dispatcher (WbDrawDispatcher) | ~4,300 | 86% of frame budget. ~16K entities walked for AABB cull. THIS is the bottleneck. |
| Terrain dispatcher (TerrainModernRenderer) | ~6.4 | <1% of frame. Constant-cost regardless of radius (proved in N.5b). |
| Everything else (sky, particles, ImGui, swap, audio) | ~700 | Small. |

Actual FPS at radius=5 in Holtburg: ~200 fps (frame time ≈ 5ms). NOT the "810 fps" inferred from the N.5 ship doc (that was 1/dispatcher_ms, which is only the WB dispatcher CPU cost in isolation, not real frame time).

What naive radius increase does

If you simply raised ACDREAM_STREAM_RADIUS to 15 today without A.5:

  • Loaded landblocks: 121 → ~961 (8× more). Acceptable.
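  • Loaded entities: ~16K → ~125K (linear scaling with LB count). NOT acceptable. Scaling the measured 4.3ms by the same 8× puts the entity dispatcher at ~34ms/frame (under 30 FPS from this one subsystem); at a flat ~1µs per AABB cull it would be ~125ms/frame = 8 FPS. Slideshow either way.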
  • Memory footprint: similar 8× explosion in scenery instance buffers.

So the perf cliff is real and immediate. A.5 has to address it BEFORE the radius can be safely raised.
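The scaling argument above can be reproduced with a few lines of arithmetic. This is a rough sketch, not a measurement: it assumes a square streaming window of (2r + 1)² landblocks (which matches the 121 and ~961 figures), and it assumes both entity count and dispatcher cost scale linearly with landblock count. It scales the measured 4.3 ms baseline rather than assuming a fixed per-test cost. The function and constant names are illustrative only.

```python
def landblock_count(radius: int) -> int:
    """Landblocks in a square window of the given radius: (2r + 1)^2."""
    return (2 * radius + 1) ** 2


# Figures quoted from the N.5b perf baseline (radius=5, Holtburg).
BASELINE_RADIUS = 5
BASELINE_ENTITIES = 16_000
BASELINE_DISPATCH_US = 4_300.0


def project_naive(radius: int) -> dict:
    """Project entity count and dispatcher cost, assuming linear scaling."""
    scale = landblock_count(radius) / landblock_count(BASELINE_RADIUS)
    dispatch_us = BASELINE_DISPATCH_US * scale
    return {
        "landblocks": landblock_count(radius),
        "entities": round(BASELINE_ENTITIES * scale),
        "dispatch_ms": dispatch_us / 1000.0,
        # Upper bound on FPS if the dispatcher were the only cost.
        "fps_ceiling": 1_000_000.0 / dispatch_us,
    }


projection = project_naive(15)
print(projection)
```

The `fps_ceiling` value ignores every other subsystem and the GPU, so the real frame rate would be lower still.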

What N.5b set up that A.5 inherits

  • Modern terrain dispatcher. TerrainModernRenderer is O(1) GL calls in radius. As you add far-tier LBs (terrain only), the terrain dispatcher cost stays flat (~6µs/frame). This is the one subsystem that doesn't need any A.5 work — it just scales.
  • Slot allocator for terrain GPU buffers. Already grows by power-of-two doubling. Will absorb radius=15 (~961 slots × ~15 KB each = ~14 MB) without manual tuning.
  • [TERRAIN-DIAG] instrumentation. Reports per-frame median + p95 in microseconds. Use this to confirm A.5 doesn't regress terrain perf.
  • Conformance sentinel. TerrainModernConformanceTests proves visual mesh Z agrees with TerrainSurface.SampleZFromHeightmap to 0.015 mm. Don't break this — physics ↔ visual agreement must hold across both tiers.
  • Bindless atlas. TerrainAtlas.GetBindlessHandles(). The far tier shares the atlas (it's region-wide). Zero atlas-related per-LB cost.

The brainstorm questions (the hard calls A.5 has to make)

These are the questions to resolve in the brainstorm step. Bring them to the user with options + recommendation; don't prejudge.

1. Tier radii: what are N₁ and N₂?

  • N₁ = near-tier radius (everything loads). Today's default STREAM_RADIUS. Probably stays at 5 (or maybe 4; maybe 3).
  • N₂ = far-tier radius (terrain mesh only). Could be 8, 12, 15, 20.

Tradeoffs: bigger N₂ = more world visible = looks better. But each far-tier LB still costs ~16 KB GPU memory + a frustum cull AABB + a slot allocation. At N₂=15, that's ~961 LBs × 16 KB = ~15 MB GPU mem (cheap) + ~961 cull tests (cheap, ~1ms total at 1µs each — and we'll do this per-LB cull anyway as part of #2 below).
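To compare candidate N₂ values in the brainstorm, the tradeoff above can be tabulated. The sketch below assumes a square window, the ~16 KB per landblock and ~1µs per LB cull test quoted above, and 1 MB = 1024 KB; `ring_counts` and `far_tier_budget` are hypothetical helper names, not existing code.

```python
def ring_counts(n1: int, n2: int) -> tuple:
    """Return (near-tier LB count, far-tier LB count) for a square window."""
    near = (2 * n1 + 1) ** 2
    total = (2 * n2 + 1) ** 2
    return near, total - near


def far_tier_budget(n1: int, n2: int,
                    kb_per_lb: float = 16.0,
                    cull_us_per_lb: float = 1.0) -> dict:
    """Estimate terrain memory and LB-level cull cost for both tiers."""
    near, far = ring_counts(n1, n2)
    total = near + far
    return {
        "near_lbs": near,
        "far_lbs": far,
        "terrain_mb": total * kb_per_lb / 1024.0,
        "lb_cull_us": total * cull_us_per_lb,
    }


for n2 in (8, 12, 15, 20):
    print(n2, far_tier_budget(5, n2))
```

The per-LB figures are the estimates from this handoff; swap in measured values once the first far-tier prototype exists.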

Verify against retail: cdb attach + check how many landblocks retail keeps loaded at a given vantage point. Probably around 10-12 per the AC2D references and the holtburger client's behavior.

2. Far tier: terrain only? Or also impostor scenery?

Two options:

  • Terrain only (cleanest). Beyond N₁, no trees, no rocks. Skyline is the terrain mesh against the sky.
  • Impostor scenery (more retail-like). Beyond N₁, generate flat billboards or low-poly trees instead of full meshes. Adds substantial complexity (billboard pipeline, mesh-LOD generation, per-camera-angle rotation).

Recommendation: start with terrain-only. Add impostors only if the horizon looks wrong (too bare). Retail definitely has SOME distant scenery but the cutoff is gradual; we can match it later if needed.

3. Entity bucketing structure

Today: WbDrawDispatcher keeps a flat dictionary of all entities and walks all of them per frame. To bucket by LB, we need:

  • A Dictionary<uint, List<EntityHandle>> keyed by landblock ID
  • On AddEntity(...), also stash it in the LB bucket (the spawn flow already knows the LB context)
  • On RemoveEntity(...), remove from the LB bucket too
  • Per frame: cull at LB granularity first; then cull entities only inside surviving LBs

LB-level AABBs are already computed (per the existing _visibleSlots logic in TerrainModernRenderer — the same AABB applies to entities, modulo a Z-range bump for trees/buildings).

Open question: do entities outside a known LB exist? (Items dropped on the ground? Ephemeral effects? Player projectiles?) If yes, they need a fallback "unknown LB" bucket that's still walked every frame. Probably small.
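As a language-neutral illustration of the bucketing idea (the real change would be C# inside WbDrawDispatcher), here is a minimal Python sketch. It is heavily simplified: a 2D rectangle stands in for the view frustum, and `EntityBuckets`, `Aabb`, and the `UNKNOWN_LB` fallback key are hypothetical names. Entities whose landblock has no registered bounds, including the "unknown LB" bucket, are always walked.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Aabb:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other: "Aabb") -> bool:
        return (self.min_x <= other.max_x and self.max_x >= other.min_x
                and self.min_y <= other.max_y and self.max_y >= other.min_y)


UNKNOWN_LB = None  # hypothetical bucket key for entities with no landblock


class EntityBuckets:
    def __init__(self):
        self._lb_bounds = {}               # lb_id -> Aabb of the landblock
        self._buckets = defaultdict(dict)  # lb_id -> {entity_id: Aabb}
        self._lb_of = {}                   # entity_id -> lb_id

    def set_landblock_bounds(self, lb_id, bounds):
        self._lb_bounds[lb_id] = bounds

    def add_entity(self, entity_id, bounds, lb_id=UNKNOWN_LB):
        self._buckets[lb_id][entity_id] = bounds
        self._lb_of[entity_id] = lb_id

    def remove_entity(self, entity_id):
        lb_id = self._lb_of.pop(entity_id)
        bucket = self._buckets[lb_id]
        del bucket[entity_id]
        if not bucket:
            del self._buckets[lb_id]

    def visible(self, view):
        """Return (visible entity ids, number of AABB tests performed)."""
        tests = 0
        hits = []
        for lb_id, bucket in self._buckets.items():
            lb_bounds = self._lb_bounds.get(lb_id)
            if lb_bounds is not None:
                tests += 1  # one test can reject the whole landblock
                if not lb_bounds.intersects(view):
                    continue
            for entity_id, box in bucket.items():
                tests += 1
                if box.intersects(view):
                    hits.append(entity_id)
        return hits, tests


buckets = EntityBuckets()
buckets.set_landblock_bounds(1, Aabb(0, 0, 100, 100))
buckets.set_landblock_bounds(2, Aabb(100, 0, 200, 100))
buckets.add_entity("tree-a", Aabb(10, 10, 12, 12), lb_id=1)
buckets.add_entity("tree-b", Aabb(50, 50, 52, 52), lb_id=1)
for i in range(50):
    buckets.add_entity(f"rock-{i}", Aabb(150, 50, 152, 52), lb_id=2)
buckets.add_entity("dropped-item", Aabb(20, 20, 21, 21))  # unknown-LB bucket

visible_ids, test_count = buckets.visible(Aabb(0, 0, 60, 60))
print(sorted(visible_ids), test_count)
```

In this toy scene the 50 rocks in the off-screen landblock are rejected by a single LB-level test, so the walk performs 5 tests instead of the 53 a flat walk would need. The saving in the real dispatcher depends on how many landblocks fall outside the frustum.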

4. Where does the off-thread mesh build land?

Today LandblockMesh.Build runs synchronously inside OnLandblockLoaded on the render thread. To move it off:

  • StreamingLoader worker thread (already async for dat reads) signals "LB X is ready"
  • A new worker pool consumes that signal, builds the mesh on a worker thread, posts the finished LandblockMeshData to a ConcurrentQueue
  • Render thread drains the queue at the start of each frame, calling _terrain.AddLandblock(...) for each ready mesh

Gotcha: the TerrainBlendingContext is shared. Need to confirm it's read-only (it is — built once at startup). Also _surfaceCache — currently a plain Dictionary populated lazily by TerrainBlending.BuildSurface. Either lock it, replace with ConcurrentDictionary, or pre-populate with all known palCodes at startup.
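The hand-off pattern described above can be sketched as follows. This is an illustrative Python model, not the C# implementation: `MeshBuildPipeline`, `SurfaceCache`, and `build_mesh` are hypothetical stand-ins (for the worker pool, `_surfaceCache`, and LandblockMesh.Build respectively), and the lock-guarded cache shows only one of the three thread-safety options listed above.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class SurfaceCache:
    """Lazily populated cache guarded by a lock (the 'lock it' option)."""

    def __init__(self, build):
        self._build = build
        self._lock = threading.Lock()
        self._items = {}

    def get(self, pal_code):
        with self._lock:
            if pal_code not in self._items:
                self._items[pal_code] = self._build(pal_code)
            return self._items[pal_code]


def build_mesh(lb_id, heights, cache):
    """Placeholder for the real mesh build; runs on a worker thread."""
    surface = cache.get(lb_id % 4)  # pretend a few LBs share a palCode
    return {"lb_id": lb_id, "vertex_count": len(heights), "surface": surface}


class MeshBuildPipeline:
    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._ready = queue.Queue()  # thread-safe hand-off to the render thread
        self._cache = SurfaceCache(lambda code: f"surface-{code}")

    def on_landblock_loaded(self, lb_id, heights):
        """Called when the loader signals that LB data is ready."""
        self._pool.submit(self._build_and_post, lb_id, heights)

    def _build_and_post(self, lb_id, heights):
        self._ready.put(build_mesh(lb_id, heights, self._cache))

    def drain(self):
        """Render thread: collect every finished mesh without blocking."""
        done = []
        while True:
            try:
                done.append(self._ready.get_nowait())
            except queue.Empty:
                return done

    def shutdown(self):
        self._pool.shutdown(wait=True)


pipeline = MeshBuildPipeline(workers=2)
for lb_id in range(6):
    pipeline.on_landblock_loaded(lb_id, heights=[0.0] * 81)  # placeholder data
pipeline.shutdown()        # demo only: wait for all builds to finish
meshes = pipeline.drain()  # what the render thread would do each frame
print(sorted(m["lb_id"] for m in meshes))
```

The render thread never blocks in `drain`; a frame that finds the queue empty simply uploads nothing. Whether to cap the number of uploads per frame is a design choice left to the brainstorm.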

5. Streaming hysteresis at the tier boundary

When the player moves, LBs that fall outside N₁ leave the near tier and join the far tier. Those LBs need to:

  • Drop their scenery (unregister entities)
  • Drop their EnvCells
  • Keep the terrain mesh (still in far tier)

When the player crosses back: the LB needs scenery + EnvCells re-loaded. Hysteresis (don't churn at the exact boundary) is needed.

The streaming loader already has hysteresis for full LB load/unload. A.5 extends that: a separate hysteresis radius for the scenery/entity layer.
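One way to picture the extra hysteresis band is a small per-LB state function. The sketch below is an assumption-laden illustration: it measures distance in whole landblocks with a Chebyshev metric (consistent with a square window), uses a one-landblock `margin` as a placeholder value, and the `Tier` enum and `next_tier` function are hypothetical names rather than the existing StreamingLoader API.

```python
from enum import Enum


class Tier(Enum):
    UNLOADED = 0
    FAR = 1   # terrain mesh only
    NEAR = 2  # terrain + scenery + EnvCells + collision


def lb_distance(player_lb, lb):
    """Chebyshev distance in landblocks (square streaming window)."""
    return max(abs(player_lb[0] - lb[0]), abs(player_lb[1] - lb[1]))


def next_tier(current, distance, n1, n2, margin=1):
    """Decide an LB's tier, keeping its current tier inside the margin band."""
    if distance <= n1:
        return Tier.NEAR
    if current is Tier.NEAR and distance <= n1 + margin:
        return Tier.NEAR  # keep scenery until clearly outside N1
    if distance <= n2:
        return Tier.FAR
    if current is not Tier.UNLOADED and distance <= n2 + margin:
        return Tier.FAR   # keep terrain until clearly outside N2
    return Tier.UNLOADED


tier = Tier.UNLOADED
history = []
for d in (16, 15, 6, 5, 6, 5, 6, 7, 6, 16, 17):
    tier = next_tier(tier, d, n1=5, n2=15)
    history.append(tier.name)
print(history)
```

In the walk above, the LB oscillating between distance 5 and 6 stays NEAR the whole time, so scenery is not unloaded and reloaded at the exact boundary; it only drops to FAR once the distance reaches 7.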

6. Visual quality wins to ride along

A.5 is the natural place to land 2-3 nearly-free quality wins:

  • Mipmapped terrain atlas + anisotropic 16x. Today the atlas is GL_LINEAR no mipmaps; distant terrain shimmers. ~half-day fix. Big visible improvement at far tier.
  • Tree alpha-test → alpha-to-coverage with MSAA. Today tree edges are binary cutoff and pixel-edged. A2C with MSAA fixes them. ~one day.
  • Correct depth-write for transparent foliage. Some scenery passes may be writing depth incorrectly; confirm + fix.

These are not strictly required for A.5 to ship, but they amplify the "looks great" payoff.

7. Acceptance metrics

The user's goal is "smooth + high FPS + great-looking + scales." Pin this concretely:

  • Target FPS at the final tier radii (whatever N₁ and N₂ end up being): ≥ user's monitor refresh (probably 144 or 240 Hz). Capture before/after numbers in a perf baseline doc parallel to N.5b's.
  • No frame-time spikes > 5ms during streaming (record a 60-second trace running through Holtburg → North Yanshi).
  • Visual horizon visible at the new N₂. Capture screenshots from the same vantage point at the start of A.5 (before) and at ship (after) for the SHIP record.
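A captured trace can be checked against these metrics with a short script. This is a sketch under stated assumptions: the trace is a list of per-frame times in milliseconds, p95 uses the nearest-rank method, and a "spike" is read as a frame that exceeds the trace median by more than 5 ms. That reading is one interpretation of the criterion, not a settled definition, and `summarize_trace` is a hypothetical helper.

```python
import statistics


def summarize_trace(frame_ms, spike_ms=5.0, target_hz=144):
    """Summarize a frame-time trace against the acceptance metrics."""
    ordered = sorted(frame_ms)
    n = len(ordered)
    median = statistics.median(ordered)
    # Nearest-rank p95, computed with integer arithmetic: ceil(0.95 * n).
    rank = (95 * n + 99) // 100
    p95 = ordered[rank - 1]
    spikes = [t for t in frame_ms if t - median > spike_ms]
    median_fps = 1000.0 / median
    return {
        "median_ms": median,
        "p95_ms": p95,
        "median_fps": median_fps,
        "spike_count": len(spikes),
        "meets_target": median_fps >= target_hz,
    }


# Synthetic example trace: mostly 5 ms frames with one streaming hitch.
trace = [5.0] * 95 + [5.2] * 4 + [12.0]
summary = summarize_trace(trace)
print(summary)
```

The synthetic trace is only there to exercise the function; real numbers should come from the 60-second capture.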

8. What's NOT in A.5

A.5 does not need to ship:

  • GPU-side culling (compute-shader cull). Bigger lift; N.6 territory.
  • Persistent-mapped indirect buffer. N.6 territory.
  • Sky / particles / EnvCells migration. Separate N.7+ phases.
  • Shadow mapping. Separate visual phase.

Don't let scope creep pull these in.


Files to read before brainstorming

In rough order of relevance:

  1. docs/research/2026-05-09-phase-n5b-handoff.md — N.5b's handoff (read for context on what was just shipped + the structure of these handoff docs).
  2. docs/plans/2026-05-09-phase-n5b-perf-baseline.md — captured perf numbers + the architectural reasoning for what A.5 inherits.
  3. memory/project_phase_n5b_state.md — three high-value gotchas captured during N.5b (especially #1: bindless uniform-sampler driver quirk; A.5 won't directly need this, but it's the prior art for any new shader code in the phase).
  4. docs/plans/2026-04-11-roadmap.md A.5 entry — the original A.5 description.
  5. The streaming loader: src/AcDream.Core/World/StreamingLoader.cs (or wherever it lives; grep for OnLandblockLoaded). Understand the existing ring + hysteresis logic before extending it.
  6. WB dispatcher entity flow: src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs, lines covering Draw (the per-entity walk) and EntitySpawnAdapter (where entities get registered). The bucketing change lands here.
  7. LandblockMesh.Build: src/AcDream.Core/Terrain/LandblockMesh.cs. Its inputs (heightmap, ctx, surfaceCache) determine what the worker thread needs. ~150 lines.
  8. WB's SceneryRenderManager: references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/SceneryRenderManager.cs. Has a render-distance cap; informs N₁ vs N₂ defaults.
  9. TerrainModernRenderer: src/AcDream.App/Rendering/TerrainModernRenderer.cs. Don't modify; confirm the slot allocator handles radius=15 cleanly.

Acceptance criteria for the whole phase

  1. Build green; existing tests stay green; N.5b's conformance sentinel still passes (visual mesh Z = TerrainSurface Z within 1mm).
  2. Far-tier LBs render terrain visibly past N₁ in user-driven visual verification.
  3. Per-frame entity-dispatcher cpu_us at radius=N₁ drops vs today (the bucketing should help even at the current radius).
  4. Per-frame entity-dispatcher cpu_us with both tiers loaded (near tier out to N₁, far tier out to N₂) is bounded: it does NOT scale linearly with total loaded LBs. Specifically: bucketed cull should be < 1.5× today's cost despite far-tier LBs loading.
  5. No streaming hitch > 5ms when moving at run speed while LBs are crossing both the N₁ and N₂ tier boundaries (capture a 60s trace).
  6. [TERRAIN-DIAG] cpu_us stays flat as N₂ grows: the terrain dispatcher was proven O(1) in N.5b, so this is a regression check.
  7. Visual identity at near-tier (no scenery missing inside N₁; no z-fighting; no cell-boundary wobble — N.5b sentinel still applies).
  8. SHIP record + perf baseline + memory entry written, mirroring N.5b's pattern.

What you'll be doing in the first 30 minutes

  1. Read this handoff in full.
  2. Read docs/research/2026-05-09-phase-n5b-handoff.md for the structural pattern.
  3. Read docs/plans/2026-05-09-phase-n5b-perf-baseline.md for the captured numbers A.5 inherits.
  4. Read memory/project_phase_n5b_state.md for gotchas.
  5. Verify build is green: dotnet build.
  6. Verify N.5b ship is intact: dotnet test --filter "FullyQualifiedName~TerrainSlot|FullyQualifiedName~TerrainModernConformance|FullyQualifiedName~Wb|FullyQualifiedName~MatrixComposition|FullyQualifiedName~TextureCacheBindless" (target ≥114 passing, 0 failures).
  7. Capture a baseline radius=5 frame trace yourself (one launch, 30s standstill at Holtburg dueling field) so you have a "before" number in your own measurement environment, not just trusting N.5b's number.
  8. Invoke superpowers:brainstorming with the user. Walk through the 8 brainstorm questions above. Present each with options + a recommendation; don't prejudge.
  9. After agreement, write the spec; then the plan; then execute task-by-task using superpowers:subagent-driven-development.

Don't skip the brainstorm. The N₁/N₂ values, the bucketing structure trade-offs, and the worker-thread design are real decisions with downstream consequences that need user input — not "the agent makes a call and goes."


Things to NOT do

  • Don't raise ACDREAM_STREAM_RADIUS without A.5's tiered loading in place. The entity-cull cliff is immediate and severe (roughly 8 to 30 FPS at naive radius=15, depending on the per-entity cost assumed).
  • Don't put scenery in the far tier just to "look more retail" without a billboard/impostor pipeline. Full-detail scenery in the far tier is what causes the cull cliff.
  • Don't move LandblockMesh.Build to a worker thread without first auditing TerrainBlendingContext + _surfaceCache for thread safety. Concurrent writes to the surfaceCache will produce silently-wrong terrain blending.
  • Don't break the N.5b conformance sentinel. If A.5 changes how meshes are built (e.g., for the worker thread), the conformance test must still pass — it's the load-bearing physics ↔ visual Z agreement guard.
  • Don't bundle GPU-side culling, persistent-mapped buffers, or shadow mapping into A.5. Those are N.6+ territory; A.5 is "make the world look big and not stutter."
  • Don't ship without honest perf numbers. If A.5 doesn't actually hit its FPS target, document why and ship N.6 next instead of papering over it. The N.5b precedent is honest reporting.
  • Don't skip the visual verification gate. Same lesson from N.5b's black-terrain regression: "go" doesn't mean "verified." User must actually launch the client at radius=N₂ and confirm the horizon looks great + FPS hits target.

Reference: where the FPS budget actually goes today

For brainstorming purposes, the per-frame breakdown at radius=5 / Holtburg (real measurement, 2026-05-09):

~5,000 µs total frame time (= 200 fps)
├── 4,300 µs  WbDrawDispatcher entity cull + dispatch  ← THE BOTTLENECK
│             ~16K entity AABB tests / frame
│             A.5's entity bucketing attacks this directly
├──     6 µs  TerrainModernRenderer
│             O(1) in radius. Won't grow with A.5. Already solved.
├──   ~700 µs Sky, particles, ImGui, audio, swap-buffers, misc
│             Mostly fixed cost; some VSync-related
└──    rest   GPU side (we don't measure this — query plumbing
              deferred to N.6). Could be substantial.

The first action of A.5 is to recognize that the perf claim "810 fps" from N.5 was misleading. Don't repeat the mistake — measure the actual frame time, not just one subsystem.
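The difference between a per-subsystem number and a real frame rate is easy to see with the measured values above. The sketch below uses only the three CPU figures from the baseline; GPU time is not measured, so even the summed figure is an approximation, and the dictionary and function names are illustrative.

```python
# Measured CPU medians at radius=5 / Holtburg, in microseconds per frame.
BUDGET_US = {
    "WbDrawDispatcher": 4_300.0,
    "TerrainModernRenderer": 6.4,
    "everything else": 700.0,
}


def fps_from_us(frame_us: float) -> float:
    """Convert a per-frame cost in microseconds to frames per second."""
    return 1_000_000.0 / frame_us


total_us = sum(BUDGET_US.values())
shares = {name: cost / total_us for name, cost in BUDGET_US.items()}

# FPS implied by one subsystem in isolation vs. by the whole frame.
dispatcher_only_fps = fps_from_us(BUDGET_US["WbDrawDispatcher"])
whole_frame_fps = fps_from_us(total_us)

for name, share in shares.items():
    print(f"{name:>24}: {share:6.1%}")
print(f"dispatcher-only fps: {dispatcher_only_fps:.0f}")
print(f"whole-frame fps:     {whole_frame_fps:.0f}")
```

Inverting a single subsystem's cost always overstates the frame rate, which is the same mistake behind the "810 fps" figure.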


Good luck. The phase is meaty (~2 weeks) but the structural work is well-shaped: tiered streaming has clear boundaries, entity bucketing is an isolated dispatcher change, off-thread mesh build is a well-understood worker pattern. The hard call is the N₁/N₂ values, and that's a brainstorm question — bring it to the user with data.