docs(A.5): cold-start handoff for the next session

Records what N.5b shipped, where the actual FPS bottleneck lives
(WbDrawDispatcher entity cull at ~4.3ms/frame, 86% of frame budget;
terrain dispatcher is now <1% of frame), and what A.5 has to do to
make the world look big without falling off a perf cliff.

Three concrete A.5 deliverables:
1. Two-tier streaming (near = full, far = terrain-only)
2. Per-LB entity bucketing in WbDrawDispatcher
3. Off-thread LandblockMesh.Build to avoid streaming hitches at higher
   radius

Eight brainstorm questions for the next session, plus acceptance
criteria, files-to-read list, and explicit "don't do" warnings (don't
raise STREAM_RADIUS without tiering in place; don't put scenery in
far tier without an impostor pipeline; don't break the N.5b conformance
sentinel; etc.).

User's stated goal verbatim: "great smooth HIGH fps visuals. Should
look great. As long as it scales and we get very high FPS." This
reframes priorities away from radius=5 micro-optimization toward
visual scale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Erik 2026-05-09 21:11:46 +02:00
parent 08b736207c
commit f7f88674e1

@ -0,0 +1,376 @@
# Phase A.5 — Two-tier Streaming + Horizon LOD — Cold-Start Handoff
**Created:** 2026-05-10, immediately after N.5b ship.
**Audience:** the next agent picking up streaming + horizon-LOD work.
**Purpose:** brief you on where N.5b left things, what A.5 actually has to do
to make the world look and feel great, and the load-bearing facts the
brainstorm should be informed by.
---
## TL;DR
N.5b just shipped: outdoor terrain rendering is on bindless + multi-draw
indirect via `TerrainModernRenderer`. Constant-cost dispatch as the
visible landblock count grows — radius=5 vs radius=15 are the same number
of GL calls for terrain.
**A.5's actual goal — verbatim from the user, 2026-05-09:**
> "I just want great smooth HIGH fps visuals. Should look great. As long
> as it scales and we get very high FPS"
That reframes priorities. We are NOT optimizing the inner loop at radius=5
(it's solved). We're scaling visual reach + scene density without the
client falling off a perf cliff.
**Concretely, A.5 ships three things:**
1. **Two-tier streaming.** Near tier (≤ N₁ landblocks) loads everything as
today (terrain + scenery + EnvCells + collision). Far tier (N₁ < r ≤ N₂)
loads terrain mesh ONLY. No scenery generation, no collision, no
entity registration for the far tier.
2. **Per-LB entity bucketing for the WB dispatcher.** Today the entity
dispatcher walks every loaded entity each frame for AABB cull:
~16K entities for ~4.3ms/frame measured (≈0.27µs per entity), dominating the frame budget.
Bucket entities by landblock so the cull is hierarchical: cull the LB
first, then only walk entities inside surviving LBs.
3. **Off-thread mesh build.** `LandblockMesh.Build` currently runs on the
render thread when a new LB streams in. At today's radius=5 this is
invisible; at A.5's higher N₂ it becomes a visible frame-time spike
when 4-5 LBs stream simultaneously. Move the build to a worker pool;
hand finished `LandblockMeshData` back via a queue.
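
A minimal Python sketch of the tier split in item 1, to make the boundaries concrete. It is an illustration of the logic, not the client's code: the names `Tier`, `lb_distance`, and `classify` are hypothetical, and the square (Chebyshev) distance metric is an assumption inferred from the 121-landblock count at radius=5.

```python
from enum import Enum


class Tier(Enum):
    NEAR = "near"          # terrain + scenery + EnvCells + collision
    FAR = "far"            # terrain mesh only
    UNLOADED = "unloaded"  # outside N2, nothing resident


def lb_distance(player_lb, lb):
    """Chebyshev distance in landblock units.

    Assumed metric: a radius-r window is a (2r+1) x (2r+1) square,
    which matches 121 landblocks at radius=5.
    """
    return max(abs(lb[0] - player_lb[0]), abs(lb[1] - player_lb[1]))


def classify(player_lb, lb, n1, n2):
    """Near tier is r <= N1; far tier is N1 < r <= N2."""
    r = lb_distance(player_lb, lb)
    if r <= n1:
        return Tier.NEAR
    if r <= n2:
        return Tier.FAR
    return Tier.UNLOADED
```

With N₁=5 and N₂=15, a landblock 6 steps away classifies as `Tier.FAR` and one 16 steps away as `Tier.UNLOADED`.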
The headline win you're shooting for: **radius=15 sustains the user's
target FPS in Holtburg with no streaming hitches.**
---
## Where N.5b left things
### Branch state (relative to main)
After N.5b ships:
- N.5b SHIP at `08b7362` (final commit; appended SHIP record to plan)
- Roadmap entry, issue #51 closure, perf baseline doc all in place at `083c10c`
- Legacy `TerrainChunkRenderer` + `TerrainRenderer` + `terrain.vert/.frag`
deleted at `7dfa2af`. **The modern path is the only path.**
### Captured perf baseline (load-bearing for A.5's "what's actually hot")
From `docs/plans/2026-05-09-phase-n5b-perf-baseline.md`, measured
2026-05-09 at Holtburg town dueling field, radius=5, ~30s standstill:
| Subsystem | cpu_us median per frame | Notes |
|---|---|---|
| **Entity dispatcher** (`WbDrawDispatcher`) | **~4,300** | 86% of frame budget. ~16K entities walked for AABB cull. THIS is the bottleneck. |
| Terrain dispatcher (`TerrainModernRenderer`) | ~6.4 | <1% of frame. Constant-cost regardless of radius (proved in N.5b). |
| Everything else (sky, particles, ImGui, swap, audio) | ~700 | Small. |
**Actual FPS at radius=5 in Holtburg: ~200 fps** (frame time ≈ 5ms).
NOT the "810 fps" inferred from the N.5 ship doc (that was 1/dispatcher_ms,
which is only the WB dispatcher CPU cost in isolation, not real frame time).
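
The difference between a per-subsystem figure and real frame time is plain arithmetic. The Python sketch below uses only the table's CPU numbers (GPU time is not measured); `fps_from_us` is a hypothetical helper name.

```python
def fps_from_us(frame_us):
    """Convert a per-frame time in microseconds to frames per second."""
    return 1_000_000 / frame_us


# Median CPU cost per frame, from the baseline table above (microseconds).
cpu_us = {
    "entity_dispatcher": 4300,
    "terrain_dispatcher": 6.4,
    "everything_else": 700,
}

frame_us = sum(cpu_us.values())                          # whole-frame CPU time
real_fps = fps_from_us(frame_us)                         # what the user sees
entity_share = cpu_us["entity_dispatcher"] / frame_us    # share of the budget

# Inverting a single subsystem's time overstates FPS, because it ignores
# everything else that runs in the same frame.
entity_only_fps = fps_from_us(cpu_us["entity_dispatcher"])

print(f"frame: {frame_us:.0f} us, fps: {real_fps:.0f}, "
      f"entity share: {entity_share:.0%}, entity-only 'fps': {entity_only_fps:.0f}")
```

Always derive FPS from the summed frame time; a number obtained by inverting one subsystem's cost is an upper bound at best.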
### What naive radius increase does
If you simply raised `ACDREAM_STREAM_RADIUS` to 15 today without A.5:
- Loaded landblocks: 121 → ~961 (8× more). Acceptable.
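- Loaded entities: ~16K → ~125K (linear scaling with LB count). **NOT
acceptable.** At a pessimistic ~1µs per AABB cull, the entity dispatcher
would take ~125ms/frame = 8 FPS, a slideshow. Even at the measured rate
(4.3ms for ~16K entities) it would take ~34ms/frame ≈ 30 FPS, down from ~200.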
- Memory footprint: similar 8× explosion in scenery instance buffers.
So the perf cliff is real and immediate. A.5 has to address it BEFORE
the radius can be safely raised.
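
The scaling above can be reproduced with a few lines of Python. This is a back-of-envelope sketch: `lb_count` and `project` are hypothetical names, the (2r+1)² window is inferred from the 121 and 961 figures, and linear entity scaling is the same assumption the list makes.

```python
def lb_count(radius):
    """Landblocks in a square window of the given radius."""
    return (2 * radius + 1) ** 2


def project(radius, base_radius=5, base_entities=16_000, base_cull_us=4300):
    """Project entity count and flat-walk cull cost at a larger radius."""
    scale = lb_count(radius) / lb_count(base_radius)
    entities = base_entities * scale
    return {
        "lbs": lb_count(radius),
        "scale": scale,
        "entities": entities,
        # Scaling the measured dispatcher time linearly.
        "cull_us_measured_rate": base_cull_us * scale,
        # The pessimistic ~1 us per AABB test estimate.
        "cull_us_at_1us_per_test": entities * 1.0,
    }


for r in (5, 10, 15):
    p = project(r)
    print(r, p["lbs"], round(p["entities"]),
          round(p["cull_us_measured_rate"]), round(p["cull_us_at_1us_per_test"]))
```

Both projections land far above the ~5ms frame time measured today, which is the point: a flat per-entity walk cannot survive radius=15 under either estimate.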
### What N.5b set up that A.5 inherits
- **Modern terrain dispatcher.** `TerrainModernRenderer` is O(1) GL calls
in radius. As you add far-tier LBs (terrain only), the terrain
dispatcher cost stays flat (~6µs/frame). This is the one subsystem
that doesn't need any A.5 work — it just scales.
- **Slot allocator for terrain GPU buffers.** Already grows by power-of-two
doubling. Will absorb radius=15 (~961 slots × ~15 KB each = ~14 MB)
without manual tuning.
- **`[TERRAIN-DIAG]` instrumentation.** Reports per-frame median + p95 in
microseconds. Use this to confirm A.5 doesn't regress terrain perf.
- **Conformance sentinel.** `TerrainModernConformanceTests` proves visual
mesh Z agrees with `TerrainSurface.SampleZFromHeightmap` to 0.015 mm.
Don't break this — physics ↔ visual agreement must hold across both
tiers.
- **Bindless atlas.** `TerrainAtlas.GetBindlessHandles()`. The far tier
shares the atlas (it's region-wide). Zero atlas-related per-LB cost.
---
## The brainstorm questions (the hard calls A.5 has to make)
These are the questions to resolve in the brainstorm step. Bring them to
the user with options + recommendation; don't prejudge.
### 1. Tier radii: what are N₁ and N₂?
- **N₁** = near-tier radius (everything loads). Today's default `STREAM_RADIUS`.
Probably stays at 5 (or maybe 4; maybe 3).
- **N₂** = far-tier radius (terrain mesh only). Could be 8, 12, 15, 20.
Tradeoffs: bigger N₂ = more world visible = looks better. But each far-tier
LB still costs ~16 KB GPU memory + a frustum cull AABB + a slot allocation.
At N₂=15, that's ~961 LBs × 16 KB = ~15 MB GPU mem (cheap) + ~961 cull
tests (cheap, ~1ms total at 1µs each — and we'll do this per-LB cull
anyway as part of #2 below).
Verify against retail: cdb attach + check how many landblocks retail keeps
loaded at a given vantage point. Probably around 10-12 per the AC2D
references and the holtburger client's behavior.
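
To bring data to the brainstorm, the tradeoff can be tabulated for each candidate N₂. The Python sketch below uses the ~16 KB per LB and ~1µs per LB cull figures from this section; `far_tier_budget` is a hypothetical helper and the square window is an assumption inferred from the 961-LB count at radius 15.

```python
def far_tier_budget(n2, kb_per_lb=16, cull_us_per_lb=1.0):
    """Rough GPU memory and LB-level cull cost for all LBs within radius n2."""
    lbs = (2 * n2 + 1) ** 2
    return {
        "n2": n2,
        "lbs": lbs,
        "gpu_mb": lbs * kb_per_lb / 1024,
        "lb_cull_us": lbs * cull_us_per_lb,
    }


for n2 in (8, 12, 15, 20):
    b = far_tier_budget(n2)
    print(f"N2={b['n2']:>2}  LBs={b['lbs']:>4}  "
          f"GPU~{b['gpu_mb']:.1f} MB  LB cull~{b['lb_cull_us']:.0f} us")
```

These are estimates from the per-LB figures quoted here, not measurements; replace them with captured numbers once a prototype exists.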
### 2. Far tier: terrain only? Or also impostor scenery?
Two options:
- **Terrain only** (cleanest). Beyond N₁, no trees, no rocks. Skyline is the
terrain mesh against the sky.
- **Impostor scenery** (more retail-like). Beyond N₁, generate flat
billboards or low-poly trees instead of full meshes. Adds substantial
complexity (billboard pipeline, mesh-LOD generation, per-camera-angle
rotation).
Recommendation: start with terrain-only. Add impostors only if the
horizon looks wrong (too bare). Retail definitely has SOME distant
scenery but the cutoff is gradual; we can match it later if needed.
### 3. Entity bucketing structure
Today: `WbDrawDispatcher` keeps a flat dictionary of all entities and
walks all of them per frame. To bucket by LB, we need:
- A `Dictionary<uint, List<EntityHandle>>` keyed by landblock ID
- On `AddEntity(...)`, also stash it in the LB bucket (the spawn flow
already knows the LB context)
- On `RemoveEntity(...)`, remove from the LB bucket too
- Per frame: cull at LB granularity first; then cull entities only inside
surviving LBs
LB-level AABBs are already computed (per the existing `_visibleSlots`
logic in `TerrainModernRenderer` — the same AABB applies to entities,
modulo a Z-range bump for trees/buildings).
Open question: do entities outside a known LB exist? (Items dropped on the
ground? Ephemeral effects? Player projectiles?) If yes, they need a
fallback "unknown LB" bucket that's still walked every frame. Probably
small.
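
A minimal Python sketch of the hierarchical cull described in this section, including the fallback bucket. It illustrates the data structure only and is not the `WbDrawDispatcher` code: `EntityBuckets`, `UNKNOWN_LB`, and the two visibility callbacks are hypothetical names standing in for the real AABB tests.

```python
from collections import defaultdict

UNKNOWN_LB = None  # hypothetical key for entities with no landblock context


class EntityBuckets:
    def __init__(self):
        self._by_lb = defaultdict(list)  # landblock id -> entities
        self._lb_of = {}                 # entity -> landblock id

    def add(self, entity, lb_id=UNKNOWN_LB):
        self._by_lb[lb_id].append(entity)
        self._lb_of[entity] = lb_id

    def remove(self, entity):
        lb_id = self._lb_of.pop(entity)
        bucket = self._by_lb[lb_id]
        bucket.remove(entity)
        if not bucket:
            del self._by_lb[lb_id]  # drop empty buckets

    def visible(self, lb_visible, entity_visible):
        """Return (visible entities, number of per-entity tests run)."""
        out, tests = [], 0
        for lb_id, bucket in self._by_lb.items():
            # The unknown bucket is always walked; others are culled per LB.
            if lb_id is not UNKNOWN_LB and not lb_visible(lb_id):
                continue
            for entity in bucket:
                tests += 1
                if entity_visible(entity):
                    out.append(entity)
        return out, tests


buckets = EntityBuckets()
for name in ("tree1", "tree2", "rock1"):
    buckets.add(name, lb_id=1)
for name in ("tree3", "house1"):
    buckets.add(name, lb_id=2)
buckets.add("arrow")  # no LB context, lands in the fallback bucket

shown, tests = buckets.visible(lambda lb: lb == 1, lambda e: True)
print(sorted(shown), tests)
```

The per-entity test count is the number to watch: it should track the entities inside surviving LBs plus the fallback bucket, not the total loaded entity count.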
### 4. Where does the off-thread mesh build land?
Today `LandblockMesh.Build` runs synchronously inside `OnLandblockLoaded`
on the render thread. To move it off:
- `StreamingLoader` worker thread (already async for dat reads) signals
"LB X is ready"
- A new worker pool consumes that signal, builds the mesh on a worker
thread, posts the finished `LandblockMeshData` to a `ConcurrentQueue`
- Render thread drains the queue at the start of each frame, calling
`_terrain.AddLandblock(...)` for each ready mesh
Gotcha: the `TerrainBlendingContext` is shared. Need to confirm it's
read-only (it is; it's built once at startup). Also, `_surfaceCache` is
currently a plain `Dictionary` populated lazily by `TerrainBlending.BuildSurface`.
Either lock it, replace with `ConcurrentDictionary`, or pre-populate with
all known palCodes at startup.
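
The handoff pattern can be sketched in Python as below, using the "lock the cache" option. Everything here is a stand-in: `build_mesh`, `build_surface`, `drain_ready`, and the `max_per_frame` cap are hypothetical, and the placeholder strings stand in for real surface and mesh data. `queue.Queue` plays the role of the `ConcurrentQueue`.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

ready_meshes = queue.Queue()           # worker -> render thread handoff
surface_cache = {}                     # stands in for _surfaceCache
surface_cache_lock = threading.Lock()  # guards lazy population


def build_surface(pal_code):
    """Lazily populate the shared cache; the lock makes writes safe."""
    with surface_cache_lock:
        if pal_code not in surface_cache:
            surface_cache[pal_code] = f"surface-{pal_code}"  # placeholder work
        return surface_cache[pal_code]


def build_mesh(lb_id, pal_codes):
    """Worker-thread side: build, then post the finished result."""
    surfaces = [build_surface(p) for p in pal_codes]
    ready_meshes.put((lb_id, surfaces))


def drain_ready(add_landblock, max_per_frame=4):
    """Render-thread side: consume at most max_per_frame finished meshes."""
    drained = 0
    while drained < max_per_frame:
        try:
            lb_id, mesh = ready_meshes.get_nowait()
        except queue.Empty:
            break
        add_landblock(lb_id, mesh)
        drained += 1
    return drained


# Six landblocks become ready at once; leaving the with-block waits for
# all workers to finish.
with ThreadPoolExecutor(max_workers=2) as pool:
    for lb_id in range(6):
        pool.submit(build_mesh, lb_id, [lb_id % 3, 7])
```

The per-frame cap is an addition of this sketch, not something the plan specifies: it spreads GPU uploads across frames when several LBs finish together. Whether the real drain needs one is a question for the brainstorm.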
### 5. Streaming hysteresis at the tier boundary
As the player moves, LBs on the trailing side cross the N₁ boundary and
drop from the near tier into the far tier. Those LBs need to:
- Drop their scenery (unregister entities)
- Drop their EnvCells
- Keep the terrain mesh (still in far tier)
When the player crosses back: the LB needs scenery + EnvCells re-loaded.
Hysteresis (don't churn at the exact boundary) is needed.
The streaming loader already has hysteresis for full LB load/unload. A.5
extends that: a separate hysteresis radius for the scenery/entity layer.
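
One way to express the second hysteresis radius is a promote/demote rule with a margin. This is a Python sketch under assumptions: `update_near_set` and `margin` are hypothetical, the distance metric is the same square window assumed elsewhere, and the existing loader's hysteresis may be shaped differently.

```python
def update_near_set(near, player_lb, candidates, n1, margin=1):
    """Return the new set of near-tier landblocks.

    Promote a landblock when r <= n1; demote it only once r > n1 + margin,
    so a landblock sitting on the boundary does not churn.
    """
    result = set()
    for lb in candidates:
        r = max(abs(lb[0] - player_lb[0]), abs(lb[1] - player_lb[1]))
        if r <= n1 or (lb in near and r <= n1 + margin):
            result.add(lb)
    return result


lb = (5, 0)
near = update_near_set(set(), (0, 0), [lb], n1=5)    # r=5: promoted
near = update_near_set(near, (-1, 0), [lb], n1=5)    # r=6: kept by margin
still_near = lb in near
near = update_near_set(near, (-2, 0), [lb], n1=5)    # r=7: demoted
demoted = lb not in near
near = update_near_set(near, (-1, 0), [lb], n1=5)    # r=6: stays far
stays_far = lb not in near
print(still_near, demoted, stays_far)
```

A landblock demoted here keeps its terrain mesh; only the scenery, entity, and EnvCell layers would follow this set.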
### 6. Visual quality wins to ride along
A.5 is the natural place to land 2-3 nearly-free quality wins:
- **Mipmapped terrain atlas + anisotropic 16x.** Today the atlas is
`GL_LINEAR` no mipmaps; distant terrain shimmers. ~half-day fix.
Big visible improvement at far tier.
- **Tree alpha-test → alpha-to-coverage with MSAA.** Today tree edges are
binary cutoff and pixel-edged. A2C with MSAA fixes them. ~one day.
- **Correct depth-write for transparent foliage.** Some scenery passes
may be writing depth incorrectly; confirm + fix.
These are not strictly required for A.5 to ship, but they amplify the
"looks great" payoff.
### 7. Acceptance metrics
The user's goal is "smooth + high FPS + great-looking + scales." Pin
this concretely:
- Target FPS at the final tier radii (whatever N₁ and N₂ end up being): ≥ user's monitor refresh
(probably 144 or 240 Hz). Capture before/after numbers in a perf
baseline doc parallel to N.5b's.
- No frame-time spikes > 5ms during streaming (record a 60-second
trace running through Holtburg → North Yanshi).
- Visual horizon visible at the new N₂. Capture screenshots from the
same vantage point at the start of A.5 (before) and at ship (after)
for the SHIP record.
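
A recorded trace can be checked against these metrics mechanically. The Python sketch below is one reading of the criteria: it treats a "spike" as a frame exceeding the trace median by more than 5ms, which is an assumption (swap in an absolute threshold if that is what the user means). `summarize_trace` and the 144 Hz default are hypothetical.

```python
import statistics


def summarize_trace(frame_times_ms, spike_ms=5.0, target_hz=144):
    """Summarize a frame-time trace (milliseconds per frame)."""
    ordered = sorted(frame_times_ms)
    median = statistics.median(ordered)
    # Nearest-rank style p95, clamped to the last sample.
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    spikes = [t for t in frame_times_ms if t - median > spike_ms]
    return {
        "median_ms": median,
        "p95_ms": p95,
        "worst_ms": ordered[-1],
        "spike_count": len(spikes),
        "meets_target": median <= 1000.0 / target_hz,
    }


# Synthetic example trace: mostly 4 ms frames with one streaming hitch.
trace = [4.0] * 95 + [4.5] * 4 + [12.0]
print(summarize_trace(trace))
```

Run the same summary on the before and after traces so the SHIP record compares like with like.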
### 8. What's NOT in A.5
A.5 does not need to ship:
- GPU-side culling (compute-shader cull). Bigger lift; N.6 territory.
- Persistent-mapped indirect buffer. N.6 territory.
- Sky / particles / EnvCells migration. Separate N.7+ phases.
- Shadow mapping. Separate visual phase.
Don't let scope creep pull these in.
---
## Files to read before brainstorming
In rough order of relevance:
1. **`docs/research/2026-05-09-phase-n5b-handoff.md`** — N.5b's handoff
(read for context on what was just shipped + the structure of these
handoff docs).
2. **`docs/plans/2026-05-09-phase-n5b-perf-baseline.md`** — captured
perf numbers + the architectural reasoning for what A.5 inherits.
3. **`memory/project_phase_n5b_state.md`** — three high-value gotchas
captured during N.5b (especially #1: bindless uniform-sampler driver
quirk; A.5 won't directly need this, but it's the prior art for any
new shader code in the phase).
4. **`docs/plans/2026-04-11-roadmap.md`** A.5 entry — the original A.5
description.
5. **The streaming loader**
`src/AcDream.Core/World/StreamingLoader.cs`
(or wherever it lives; grep for `OnLandblockLoaded`). Understand the
existing ring + hysteresis logic before extending it.
6. **WB dispatcher entity flow**
`src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs` lines covering
`Draw` (the per-entity walk) and `EntitySpawnAdapter` (where entities
get registered). The bucketing change lands here.
7. **`LandblockMesh.Build`** — `src/AcDream.Core/Terrain/LandblockMesh.cs`.
Its inputs (heightmap, ctx, surfaceCache) determine what the worker
thread needs. ~150 lines.
8. **WB's `SceneryRenderManager`**
`references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/SceneryRenderManager.cs`.
Has a render-distance cap; informs N₁ vs N₂ defaults.
9. **`TerrainModernRenderer`** —
`src/AcDream.App/Rendering/TerrainModernRenderer.cs`. Don't modify;
confirm the slot allocator handles radius=15 cleanly.
---
## Acceptance criteria for the whole phase
1. Build green; existing tests stay green; N.5b's conformance sentinel
still passes (visual mesh Z = TerrainSurface Z within 1mm).
2. **Far-tier LBs render terrain visibly past N₁** in user-driven visual
verification.
3. **Per-frame entity-dispatcher cpu_us at radius=N₁ drops** vs today
(the bucketing should help even at the current radius).
4. **Per-frame entity-dispatcher cpu_us with both tiers loaded (near N₁, far N₂) is bounded**
— does NOT scale linearly with total loaded LBs. Specifically:
bucketed cull should be < 1.5× today's cost despite far-tier LBs
loading.
5. **No streaming hitch > 5ms** when moving at run speed while LBs cross
both the N₁ and N₂ tier boundaries (capture a 60s trace).
6. **`[TERRAIN-DIAG]` cpu_us stays flat** as N₂ grows — the terrain
dispatcher was proven O(1) in N.5b (regression check).
7. Visual identity at near-tier (no scenery missing inside N₁; no
z-fighting; no cell-boundary wobble — N.5b sentinel still applies).
8. SHIP record + perf baseline + memory entry written, mirroring N.5b's
pattern.
---
## What you'll be doing in the first 30 minutes
1. Read this handoff in full.
2. Read `docs/research/2026-05-09-phase-n5b-handoff.md` for the structural
pattern.
3. Read `docs/plans/2026-05-09-phase-n5b-perf-baseline.md` for the captured
numbers A.5 inherits.
4. Read `memory/project_phase_n5b_state.md` for gotchas.
5. Verify build is green: `dotnet build`.
6. Verify N.5b ship is intact: `dotnet test --filter "FullyQualifiedName~TerrainSlot|FullyQualifiedName~TerrainModernConformance|FullyQualifiedName~Wb|FullyQualifiedName~MatrixComposition|FullyQualifiedName~TextureCacheBindless"` (target ≥114 passing, 0 failures).
7. Capture a baseline radius=5 frame trace yourself (one launch, 30s
standstill at Holtburg dueling field) so you have a "before" number
in your own measurement environment, not just trusting N.5b's number.
8. Invoke `superpowers:brainstorming` with the user. Walk through the
8 brainstorm questions above. Present each with options + a
recommendation; don't prejudge.
9. After agreement, write the spec; then the plan; then execute
task-by-task using `superpowers:subagent-driven-development`.
Don't skip the brainstorm. The N₁/N₂ values, the bucketing structure
trade-offs, and the worker-thread design are real decisions with
downstream consequences that need user input — not "the agent makes a
call and goes."
---
## Things to NOT do
- **Don't raise `ACDREAM_STREAM_RADIUS` without A.5's tiered loading
in place.** The entity-cull cliff is immediate and severe (8 FPS at
naive radius=15).
- **Don't put scenery in the far tier just to "look more retail" without
a billboard/impostor pipeline.** Full-detail scenery in the far tier
is what causes the cull cliff.
- **Don't move `LandblockMesh.Build` to a worker thread without first
auditing `TerrainBlendingContext` + `_surfaceCache` for thread
safety.** Concurrent writes to the surfaceCache will produce
silently-wrong terrain blending.
- **Don't break the N.5b conformance sentinel.** If A.5 changes how
meshes are built (e.g., for the worker thread), the conformance
test must still pass — it's the load-bearing physics ↔ visual Z
agreement guard.
- **Don't bundle GPU-side culling, persistent-mapped buffers, or shadow
mapping into A.5.** Those are N.6+ territory; A.5 is "make the world
look big and not stutter."
- **Don't ship without honest perf numbers.** If A.5 doesn't actually
hit its FPS target, document why and ship N.6 next instead of
papering over it. The N.5b precedent is honest reporting.
- **Don't skip the visual verification gate.** Same lesson from N.5b's
black-terrain regression: "go" doesn't mean "verified." User must
actually launch the client at radius=N₂ and confirm the horizon
looks great + FPS hits target.
---
## Reference: where the FPS budget actually goes today
For brainstorming purposes, the per-frame breakdown at radius=5 / Holtburg
(real measurement, 2026-05-09):
```
~5,000 µs total frame time (= 200 fps)
├── 4,300 µs WbDrawDispatcher entity cull + dispatch ← THE BOTTLENECK
│ ~16K entity AABB tests / frame
│ A.5's entity bucketing attacks this directly
├── 6 µs TerrainModernRenderer
│ O(1) in radius. Won't grow with A.5. Already solved.
├── ~700 µs Sky, particles, ImGui, audio, swap-buffers, misc
│ Mostly fixed cost; some VSync-related
└── rest GPU side (we don't measure this — query plumbing
deferred to N.6). Could be substantial.
```
The first action of A.5 is to recognize that the perf claim "810 fps"
from N.5 was misleading. Don't repeat the mistake — measure the actual
frame time, not just one subsystem.
---
Good luck. The phase is meaty (~2 weeks) but the structural work is
well-shaped: tiered streaming has clear boundaries, entity bucketing is
an isolated dispatcher change, off-thread mesh build is a well-understood
worker pattern. The hard call is the N₁/N₂ values, and that's a
brainstorm question — bring it to the user with data.