acdream/docs/research/2026-05-08-phase-n5-handoff.md
Erik dd5ca3d2b2 docs(N.5): cold-start handoff for next session
Detailed briefing for the next agent picking up Phase N.5 (Modern
Rendering Path: bindless textures + glMultiDrawElementsIndirect on
N.4's foundation). Covers:

- Where N.4 left things (commits, what works, gotchas inherited)
- The two-feature pairing (why bindless + indirect together)
- Files to read first (WB shaders, our dispatcher, CLAUDE.md cribs)
- 8 brainstorm questions to resolve before spec
- Spec + plan structure (matching N.4's pattern)
- Acceptance criteria
- Things to explicitly NOT do

Sized for a fresh session to pick up cold without spelunking through
months of session history.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-08 18:05:36 +02:00


# Phase N.5 — Modern Rendering Path — Cold-Start Handoff
**Created:** 2026-05-08, immediately after N.4 ship.
**Audience:** the next agent picking up rendering perf work.
**Purpose:** give you everything you need to start N.5 cold, without
spelunking through five months of session history.
---
## TL;DR
N.4 just shipped: WB's `ObjectMeshManager` is now acdream's production
mesh pipeline, and `WbDrawDispatcher` is the production draw path. It
works (Holtburg renders correctly, FPS substantially improved over the
naïve dual-pipeline state we hit during week 4 verification) but it's
still doing per-group state changes (`glBindTexture`, `glBindBuffer`
for the IBO, `glDrawElementsInstancedBaseVertexBaseInstance` per group)
and a fresh `glBufferData` upload per frame.
**N.5's job: lift the dispatcher onto WB's modern rendering primitives
that we're already paying GPU-feature-detection cost for.** Two big
wins, paired:
1. **Bindless textures** (`GL_ARB_bindless_texture`) — WB already
populates `ObjectRenderBatch.BindlessTextureHandle`. Switch our
shader to read texture handles from a per-instance attribute
(`uvec2` → `sampler2D` via the bindless extension). Eliminates
100% of `glBindTexture` calls.
2. **Multi-draw indirect** (`glMultiDrawElementsIndirect`) — build a
buffer of `DrawElementsIndirectCommand` structs (one per group),
upload once, fire ONE `glMultiDrawElementsIndirect` call per pass.
The driver pulls everything from the indirect buffer.
Together they target a 2-5× CPU win on draw-heavy scenes (Holtburg
courtyard, Foundry, dense dungeons). They're packaged together because
both are "modern path" extensions we already gate on, both require
the same shader rewrite, and they pair naturally: multi-draw indirect
yields no CPU win without bindless, because per-group `glBindTexture`
calls would still force one draw call per group.
**Estimated scope: 2-3 weeks.** Plan + spec to be written by the
brainstorm + spec steps below.
---
## Where N.4 left things
### Branch state
If this handoff is being read on `main` after merging the N.4 worktree:
N.4 commits land at the head of main. The relevant final commits:
- `c445364` — N.4 SHIP (flag default-on, plan final, roadmap, memory)
- `573526d` — perf pass 1-4 (drop dead lookup, sort, cull, hash memo)
- `7b41efc` — FirstIndex/BaseVertex + Issue #47 + grouped instanced
- `943652d` — load triggers + `batch.Key.SurfaceId` source
- `01cff41` — Tasks 22+23 (`WbDrawDispatcher` + side-table)
If the worktree branch (`claude/tender-mcclintock-a16839`) hasn't been
merged yet, that's where the work is. Verify with `git log --oneline`.
### What works in N.4
- `ACDREAM_USE_WB_FOUNDATION=1` is default-on. WB's `ObjectMeshManager`
loads, decodes, and uploads every entity mesh. Our existing
`TextureCache` decodes textures (palette-aware, per-instance overrides
via `GetOrUploadWithPaletteOverride`).
- `WbDrawDispatcher.Draw`:
  - Walks visible entities (per-landblock AABB cull + per-entity AABB
    cull + portal visibility)
  - Buckets every (entity × meshRef × batch) tuple by
    `GroupKey(Ibo, FirstIndex, BaseVertex, IndexCount, TextureHandle, Translucency)`
  - Single `glBufferData` upload of all matrices for the frame
  - Per group: `glActiveTexture(0) + glBindTexture(2D, handle) + glBindBuffer(EBO, ibo) + glDrawElementsInstancedBaseVertexBaseInstance(..., FirstInstance)`
  - Two passes: opaque (front-to-back sorted) + translucent
- 940/948 tests pass (8 pre-existing failures unrelated to rendering).
- Visual verification at Holtburg passed: scenery + characters render
correctly with full close-detail geometry (Issue #47 preserved).
### What N.5 inherits
These are levers N.5 will pull on:
- **WB's modern rendering is already active.** `OpenGLGraphicsDevice`
detected GL 4.3 + bindless on first run; WB's `_useModernRendering`
is true; every mesh lives in WB's single `GlobalMeshBuffer` (one VAO,
one VBO, one IBO).
- **Bindless handles are already populated.** `ObjectRenderBatch.BindlessTextureHandle`
is non-zero for batches WB owns the texture for. (See gotcha #2
below for entities with palette overrides — those use acdream's
`TextureCache` which doesn't expose bindless handles yet.)
- **The instance VBO is acdream-owned** (`WbDrawDispatcher._instanceVbo`)
with locations 3-6 patched onto WB's global VAO. Stride 64 bytes
(one mat4). N.5 expands this to (mat4 + uvec2 handle) = 80 bytes.
### Three load-bearing WB API gotchas N.4 surfaced
These bit us hard during Task 26 visual verification. Documented in
CLAUDE.md "WB integration cribs" + plan adjustments 7-9 +
`memory/project_phase_n4_state.md`. Re-stating here because they
reshape the design space:
1. **`ObjectMeshManager.IncrementRefCount(id)` is NOT lifecycle-aware.**
It only bumps a usage counter. Mesh loading is fired separately
via `PrepareMeshDataAsync(id, isSetup)`. The result auto-enqueues
to `_stagedMeshData` (line 510 of `ObjectMeshManager.cs`); our
existing `WbMeshAdapter.Tick()` drains it. `WbMeshAdapter.IncrementRefCount`
already calls `PrepareMeshDataAsync`. **N.5 doesn't need to change
this — just don't break it.**
2. **`ObjectRenderBatch.SurfaceId` is unset.** WB constructs batches
with `Key = batch.Key` (a `TextureAtlasManager.TextureKey` struct
that has a `SurfaceId` field) but never populates the top-level
`SurfaceId` property. Read `batch.Key.SurfaceId`. **N.5 keeps this
pattern.**
3. **WB's modern rendering packs every mesh into ONE global
VAO/VBO/IBO.** Each batch's `IBO` field points to the global IBO;
the batch's actual slice is identified by `FirstIndex` (offset into
IBO, in *indices*) and `BaseVertex` (offset into VBO, in *vertices*).
N.4's draw uses `glDrawElementsInstancedBaseVertexBaseInstance`
with those offsets. **N.5's `DrawElementsIndirectCommand` per-group
record will carry `firstIndex` + `baseVertex` for the same reason.**
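The unit difference in gotcha 3 is easy to trip over when moving from direct draws to indirect commands. A minimal sketch, assuming a hypothetical `BatchSlice` mirror of the batch fields (the names are illustrative, not WB's): direct draws take a byte offset into the element buffer, while an indirect command's `firstIndex` stays in index units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the slice fields a batch carries (names assumed). */
typedef struct {
    uint32_t first_index;  /* offset into the global IBO, in indices  */
    int32_t  base_vertex;  /* offset into the global VBO, in vertices */
    uint32_t index_count;  /* number of indices in this slice         */
} BatchSlice;

/* Direct draws (the glDrawElements*BaseVertex* family) take a BYTE offset
 * into the bound element buffer, so the index offset is scaled by the
 * size of one index. */
static size_t direct_draw_byte_offset(const BatchSlice *s, size_t index_size)
{
    assert(index_size == 1 || index_size == 2 || index_size == 4);
    return (size_t)s->first_index * index_size;
}

/* Indirect commands store firstIndex in INDICES, so no scaling applies. */
static uint32_t indirect_first_index(const BatchSlice *s)
{
    return s->first_index;
}
```

With 16-bit indices, a slice starting at index 300 sits at byte 600 for a direct draw but is still recorded as 300 in the indirect command.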
---
## What N.5 is — technical detail
### The two-feature pairing
**Bindless textures** (`GL_ARB_bindless_texture`):
- Each texture handle is a 64-bit integer (`uvec2` in GLSL).
- Shader declares `layout(bindless_sampler) uniform sampler2D ...` or
receives the handle as a `uvec2` vertex attribute (per-instance in our case).
- No `glBindTexture` needed at draw time — the handle IS the binding.
- Handle generation: `glGetTextureHandleARB(textureId)` followed by
`glMakeTextureHandleResidentARB(handle)` (the texture must be
resident on the GPU; non-resident handles produce GPU faults).
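Since the handle travels as a `uvec2`, the CPU side has to split the 64-bit value into two 32-bit words. A minimal sketch, assuming the usual convention that `.x` holds the low 32 bits and `.y` the high 32 bits (worth confirming against the extension spec before relying on it); `HandleWords` and both function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Two 32-bit words laid out like a GLSL uvec2 attribute. */
typedef struct {
    uint32_t x;  /* assumed: low 32 bits of the handle  */
    uint32_t y;  /* assumed: high 32 bits of the handle */
} HandleWords;

static_assert(sizeof(HandleWords) == 8, "must match the 8-byte uvec2 slot");

/* Split a 64-bit bindless handle into the two words written per instance. */
static HandleWords split_handle(uint64_t handle)
{
    HandleWords w;
    w.x = (uint32_t)(handle & 0xFFFFFFFFu);
    w.y = (uint32_t)(handle >> 32);
    return w;
}

/* Inverse of split_handle, handy for debugging what was uploaded. */
static uint64_t join_handle(HandleWords w)
{
    return ((uint64_t)w.y << 32) | (uint64_t)w.x;
}
```

On a little-endian host this produces the same bytes as copying the `uint64_t` directly into the buffer; the explicit split just makes the word order visible.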
**Multi-draw indirect** (`glMultiDrawElementsIndirect`):
- Indirect command struct layout (must match `DrawElementsIndirectCommand`):
```c
struct {
uint count; // index count for this draw
uint instanceCount; // number of instances
uint firstIndex; // offset into IBO, in indices
int baseVertex; // vertex offset into VBO
uint baseInstance; // first instance ID (offsets per-instance attribs)
};
```
- Build a buffer of N of these structs (one per group), upload once,
fire one GL call: `glMultiDrawElementsIndirect(mode, indexType, ptr, drawcount, stride)`.
- The driver issues all N draws in one shot. Effectively zero CPU
overhead per draw beyond uploading the indirect buffer.
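Building the command buffer is mostly bookkeeping. The sketch below assumes a hypothetical CPU-side `DrawGroup` record (not the dispatcher's actual type) and fills one command per group, with `baseInstance` as a running sum so each group reads its own slice of the shared instance buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field order and sizes follow the layout shown above. */
typedef struct {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance;
} DrawElementsIndirectCommand;

static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "five tightly packed 4-byte fields");

/* Hypothetical CPU-side group record (names assumed for illustration). */
typedef struct {
    uint32_t index_count;
    uint32_t first_index;
    int32_t  base_vertex;
    uint32_t instance_count;
} DrawGroup;

/* Writes one command per group and returns the total instance count.
 * Instances are assumed to be laid out contiguously, group after group,
 * in the instance buffer. */
static uint32_t build_indirect_commands(const DrawGroup *groups, size_t n,
                                        DrawElementsIndirectCommand *out)
{
    uint32_t next_instance = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i].count         = groups[i].index_count;
        out[i].instanceCount = groups[i].instance_count;
        out[i].firstIndex    = groups[i].first_index;
        out[i].baseVertex    = groups[i].base_vertex;
        out[i].baseInstance  = next_instance;
        next_instance += groups[i].instance_count;
    }
    return next_instance;
}
```

The resulting array can be uploaded as-is; because the struct is tightly packed, the `stride` argument can be either 0 or `sizeof(DrawElementsIndirectCommand)`.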
**Why pair them.** Multi-draw indirect doesn't let you change uniform
state between draws. So if textures are bound via `glBindTexture` per
group, you'd still need N CPU-side setup steps before each indirect
call — defeating the purpose. Bindless removes that constraint by
encoding the texture handle as per-instance data the shader reads
directly. With both, the modern render loop becomes:
```
1. Upload instance buffer (mat4 + uvec2 handle, per-instance) — once per frame
2. Upload indirect command buffer (one DEIC per group) — once per frame
3. glBindVertexArray(globalVAO) — once
4. glMultiDrawElementsIndirect(...) — ONCE per pass
```
That's it. No per-group state changes.
### Instance attribute layout
Currently (N.4): locations 3-6 = mat4 model matrix (16 floats = 64 bytes).
N.5 (proposed): locations 3-6 = mat4, plus location 7 = uvec2 bindless
handle. That is 16 floats + 2 uints = 72 bytes, rounded up to 80 bytes
(16-byte alignment, per WB's `InstanceData` precedent).
Or spell the padding out as an explicit struct (std140-style 16-byte alignment):
```c
struct InstanceData {
mat4 transform; // locations 3-6
uvec2 textureHandle; // location 7
uvec2 _pad; // padding to 80
};
```
Brainstorm should decide if we copy WB's `InstanceData` struct (Pack=16,
80 bytes including CellId/Flags fields we don't use) or define our own
minimal version. The 80-byte stride matches WB's so global VAO state
configured by WB stays compatible if the legacy WB draw path ever runs.
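To make the 80-byte stride concrete, here is a minimal CPU-side sketch of the minimal variant. `InstanceRecord` and `attribute_offset` are illustrative names, not WB's `InstanceData`; the sketch assumes each of locations 3-6 carries one `vec4` column of the matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal per-instance record (not WB's InstanceData). */
typedef struct {
    float    transform[16];      /* mat4, 64 bytes, locations 3-6 */
    uint32_t texture_handle[2];  /* uvec2, 8 bytes, location 7    */
    uint32_t pad[2];             /* pads 72 bytes up to 80        */
} InstanceRecord;

static_assert(sizeof(InstanceRecord) == 80, "stride must be 80 bytes");
static_assert(offsetof(InstanceRecord, texture_handle) == 64,
              "handle follows the matrix directly");

/* Byte offset, within one record, of the attribute bound at `location`. */
static size_t attribute_offset(unsigned location)
{
    assert(location >= 3 && location <= 7);
    if (location == 7)
        return offsetof(InstanceRecord, texture_handle);
    return (size_t)(location - 3) * 4 * sizeof(float);  /* one vec4 per slot */
}
```

These offsets are what the attribute pointer setup would consume. The handle words would likely go through `glVertexAttribIPointer` rather than `glVertexAttribPointer`, so they reach the shader as integers instead of being converted to floats.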
### Per-instance entity texture handles
Here's the wrinkle. N.4 uses `WbDrawDispatcher.ResolveTexture` to map
each (entity, batch) to a GL texture name:
- Tree (no overrides): `_textures.GetOrUpload(surfaceId)` → 2D texture name
- NPC with palette override: `_textures.GetOrUploadWithPaletteOverride(...)` → composite-cached 2D texture name
- Anything with surface override: `_textures.GetOrUploadWithOrigTextureOverride(...)` → composite-cached 2D texture name
Those are all `GLuint` 32-bit GL texture *names*, not 64-bit bindless handles.
**N.5 needs `TextureCache` to publish bindless handles for everything
it owns, not just WB-owned textures.**
Implementation sketch:
- `TextureCache` adds a parallel cache keyed identically but storing
64-bit bindless handles. On first request, generate via
`glGetTextureHandleARB(textureId)` + make resident.
- New API: `GetBindlessHandle(uint surfaceId, ...)` returns the handle.
- Or: change every `GetOrUpload*` method to return both the GL name
and the bindless handle (or just the handle; let GL name fall out
if anyone needs it later).
WB's `ObjectRenderBatch.BindlessTextureHandle` covers the atlas-tier
case. For per-instance entities, we use `TextureCache`'s handle.
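The parallel-cache idea can be sketched independently of GL. The version below is an assumption-heavy illustration: it keys on the GL texture name rather than on `TextureCache`'s composite keys, uses a small fixed-size open-addressed table, and takes handle creation as a callback. `fake_create` stands in for the real `glGetTextureHandleARB` + `glMakeTextureHandleResidentARB` pair, which need a live context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HANDLE_CACHE_CAPACITY 1024u  /* power of two, so masking works */

/* Callback that creates (and makes resident) a handle for a texture name. */
typedef uint64_t (*CreateHandleFn)(uint32_t texture_name, void *user);

typedef struct {
    uint32_t names[HANDLE_CACHE_CAPACITY];    /* 0 = empty slot; GL never hands out name 0 */
    uint64_t handles[HANDLE_CACHE_CAPACITY];
    size_t   used;
} HandleCache;

/* Returns the cached handle for `name`, creating it on first request. */
static uint64_t handle_cache_get(HandleCache *c, uint32_t name,
                                 CreateHandleFn create, void *user)
{
    assert(name != 0);
    size_t i = (name * 2654435761u) & (HANDLE_CACHE_CAPACITY - 1u);
    while (c->names[i] != 0 && c->names[i] != name)
        i = (i + 1u) & (HANDLE_CACHE_CAPACITY - 1u);  /* linear probing */
    if (c->names[i] == 0) {
        assert(c->used + 1 < HANDLE_CACHE_CAPACITY);  /* keep one slot free */
        c->names[i]   = name;
        c->handles[i] = create(name, user);
        c->used++;
    }
    return c->handles[i];
}

/* Stand-in creator for illustration: fabricates a handle and counts calls. */
static uint64_t fake_create(uint32_t name, void *user)
{
    ++*(unsigned *)user;
    return 0xABCD000000000000ull | name;
}
```

The property that matters is that creation runs once per texture and every later lookup is a pure table read. Whether that first call happens at upload time or at first draw is exactly the eager-versus-lazy question left for the brainstorm.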
### The new shader
Reuse WB's `StaticObjectModern.vert` / `StaticObjectModern.frag` as a
template. Read those files cold. They already do bindless + the
instance-data layout. Adapt to acdream's `mesh_instanced.vert/frag`
conventions:
- Keep the `uViewProjection` uniform, lighting UBO at binding=1, fog
uniforms.
- Add `#version 430 core` + `#extension GL_ARB_bindless_texture : require`.
- Replace `uniform sampler2D uDiffuse` with a `uvec2` per-instance
attribute (location 7) → reconstruct sampler in vertex shader OR
pass through to fragment via flat varying.
- Drop `uTranslucencyKind` uniform, OR keep it (still set per-pass —
multi-draw indirect doesn't break uniforms; only state that varies
per-draw is the constraint).
### Translucency
Multi-draw indirect can't change blend state mid-draw. Solution:
**still use two passes** (opaque + translucent), but within translucent
keep the per-blendfunc sub-passes (additive, alpha-blend, inv-alpha).
Three sub-passes within translucent. Each sub-pass = one
`glMultiDrawElementsIndirect` over its filtered groups.
Or: if perf allows, fold all four blend modes into the shader via
per-instance blendmode int, sort all translucent groups by blendmode
in the indirect buffer, switch blend state at sub-pass boundaries.
Brainstorm decides the cleanest pattern.
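For the sub-pass variant, the translucent groups need to be contiguous per blend mode so each sub-pass can be one indirect call over a range. A minimal sketch using a stable counting sort; the enum values, `TranslucentGroup`, and `SubPassRange` are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical blend-mode tags for the three translucent sub-passes. */
typedef enum {
    BLEND_ALPHA = 0,
    BLEND_ADDITIVE = 1,
    BLEND_INV_ALPHA = 2,
    BLEND_MODE_COUNT = 3
} BlendMode;

typedef struct {
    BlendMode blend;
    uint32_t  group_id;  /* stand-in for the group's draw data */
} TranslucentGroup;

/* Range of groups (and thus indirect commands) belonging to one sub-pass. */
typedef struct {
    size_t first;
    size_t count;
} SubPassRange;

/* Stable counting sort by blend mode: groups of one mode end up contiguous
 * in `out`, and their relative input order is preserved. */
static void partition_by_blend(const TranslucentGroup *in, size_t n,
                               TranslucentGroup *out,
                               SubPassRange ranges[BLEND_MODE_COUNT])
{
    size_t counts[BLEND_MODE_COUNT] = {0};
    for (size_t i = 0; i < n; ++i) {
        assert((unsigned)in[i].blend < (unsigned)BLEND_MODE_COUNT);
        counts[in[i].blend]++;
    }
    size_t offset = 0;
    size_t cursor[BLEND_MODE_COUNT];
    for (size_t m = 0; m < BLEND_MODE_COUNT; ++m) {
        ranges[m].first = offset;
        ranges[m].count = counts[m];
        cursor[m] = offset;
        offset += counts[m];
    }
    for (size_t i = 0; i < n; ++i)
        out[cursor[in[i].blend]++] = in[i];
}
```

Each sub-pass would then set its blend function and issue one call with `drawcount = ranges[m].count`, starting at byte offset `ranges[m].first * 20` in the indirect buffer. One trade-off to raise in the brainstorm: grouping by mode gives up strict depth ordering across modes.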
---
## Files to read before brainstorming
In rough order:
1. **N.4 plan + spec** — `docs/superpowers/plans/2026-05-08-phase-n4-rendering-foundation.md`
(status: Final). Adjustments 7-10 capture the gotchas. Spec at
`docs/superpowers/specs/2026-05-08-phase-n4-rendering-foundation-design.md`.
2. **N.4 dispatcher source** — `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`.
This is what you're modifying. Read end-to-end.
3. **WB's modern rendering shaders** — `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Shaders/StaticObjectModern.vert`
+ `StaticObjectModern.frag`. The template you're adapting from.
4. **WB's `ObjectMeshManager.UploadGfxObjMeshData`** — lines ~1654-1780
of `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectMeshManager.cs`.
Shows how WB sets up the modern path's VBO/IBO/VAO. Especially note
how it patches in instance attribute slots (locations 3-6) on the
global VAO and configures location 7+ for bindless handles.
5. **WB's `ObjectRenderBatch`** — same file, lines ~166-184. Note the
`BindlessTextureHandle` field — already populated when `_useModernRendering`
is on.
6. **Our `TextureCache`** — `src/AcDream.App/Rendering/TextureCache.cs`.
Three composite caches: by surface id, by surface+origTex, by
surface+origTex+palette. N.5 adds parallel bindless-handle caches.
7. **CLAUDE.md "WB integration cribs"** section. Lines ~28-80. The
three gotchas + the integration architecture in plain language.
8. **Memory: `project_phase_n4_state.md`** — same content from a
different angle. Reading both helps lock in the gotchas.
---
## Brainstorm questions
These are the questions to resolve in the brainstorm step. Don't
prejudge them — bring them to the user with options + recommendation:
1. **Instance attribute layout.** Match WB's `InstanceData` struct
(80 bytes including CellId/Flags fields we don't use) for global
VAO compatibility, or define a minimal acdream-specific version
(mat4 + handle = ~72 bytes padded to 80)?
2. **Bindless handle generation strategy.**
- At texture upload time? (Eager — every texture that lands in
`TextureCache` gets a handle. Memory cost ~per-texture state.)
- On first draw lookup? (Lazy — cache fills as scene exercises
content. Possible first-use stall.)
- At spawn time via the spawn adapter? (Tied to lifecycle. Cleanest
but requires touching the spawn path.)
3. **Translucent pass structure.** Three sub-indirect-draws (one per
blend mode) or a single sorted indirect buffer with per-instance
blend mode + state-flip at sub-pass boundaries? Or: just iterate
per-group like N.4 for translucent only (translucent groups are a
small fraction of total)?
4. **Persistent-mapped indirect + instance buffers.** Use
`GL_ARB_buffer_storage` + `MAP_PERSISTENT_BIT | MAP_COHERENT_BIT`?
Triple-buffered ring + sync object? Or stick with `glBufferData`
(still one upload per frame, just larger)? Persistent mapping is
~2-5% per-frame win in our context but adds buffer-management
complexity.
5. **Shader unification.** Keep `mesh_instanced` for legacy + add
`mesh_indirect` for modern, or replace `mesh_instanced` entirely?
Replacement requires the legacy `InstancedMeshRenderer` (escape
hatch under `ACDREAM_USE_WB_FOUNDATION=0`) to also use the new
shader, which... probably doesn't matter if we delete legacy in
N.6 anyway. Brainstorm.
6. **Conformance test strategy.** N.4 used visual verification at
Holtburg as the gate. N.5's gate is "no visual regression vs N.4
AND measurable CPU win." How do we measure CPU? `[WB-DIAG]`
counters give draw count + group count; we need frame-time
counters too. Add to the dispatcher? Use a profiler?
7. **Per-instance entity bindless.** `TextureCache.GetOrUpload*`
returns a GL name. The dispatcher (or `TextureCache` itself) needs
to convert that to a bindless handle. Design questions:
- Where does the conversion happen?
- When is the texture made resident? (Residency is global state;
too many resident textures hits driver limits.)
- What about palette/surface overrides — same caching key as the
name, just a parallel handle dictionary?
8. **Escape hatch.** N.4 keeps `ACDREAM_USE_WB_FOUNDATION=0` as a
fallback. N.5 needs to decide: does the new shader REPLACE the
N.4 dispatcher's draw path (so flag-on means N.5 modern path,
flag-off means legacy `InstancedMeshRenderer`)? Or do we add a
separate flag (`ACDREAM_USE_MODERN_DRAW`) so users can toggle
N.4 vs N.5 vs legacy independently? Three-way flag is more
complex but useful for A/B during rollout.
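For question 4, the "buffer-management complexity" is mostly the ring arithmetic plus a fence per section. A minimal sketch of the arithmetic only, assuming three sections and a hypothetical `FrameRing` type; the GL side (mapping once with the persistent and coherent bits, then `glFenceSync` / `glClientWaitSync` before reusing a section) is left out because it needs a live context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SECTIONS 3u  /* triple buffering, as floated in question 4 */

/* Hypothetical bookkeeping for one persistently mapped buffer. */
typedef struct {
    size_t   section_bytes;  /* capacity reserved for one frame's data */
    uint64_t frame_index;    /* incremented once per frame             */
} FrameRing;

/* Round `value` up to a multiple of `alignment` (a power of two). */
static size_t align_up(size_t value, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Byte offset of the section the CPU may write during the current frame. */
static size_t ring_section_offset(const FrameRing *r)
{
    return (size_t)(r->frame_index % RING_SECTIONS) * r->section_bytes;
}

/* Call once per frame, after the draws that read the section are submitted. */
static void ring_advance(FrameRing *r)
{
    r->frame_index++;
}
```

The total allocation is `RING_SECTIONS * section_bytes`. The CPU only touches the current section, and only after the fence recorded three frames earlier has signaled.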
---
## Spec structure
After the brainstorm, the spec doc covers:
1. **Architecture diagram** — how `WbDrawDispatcher` changes shape.
Where the indirect buffer lives. Where bindless handles flow from.
2. **Instance data layout** — exact struct, byte offsets, GL attribute
pointer setup.
3. **TextureCache changes** — new methods, new cache, residency
policy.
4. **Shader files** — name(s), version, extensions, in/out variables.
5. **Conformance tests** — what to write, what coverage to claim.
6. **Acceptance criteria** — visual identity to N.4 + measured CPU
delta.
7. **Risks** — driver bugs in bindless / indirect, residency limits,
shader compile issues on weird GPUs, the legacy escape hatch
breaking.
Spec lives at: `docs/superpowers/specs/2026-05-XX-phase-n5-modern-rendering-design.md`.
## Plan structure
After the spec, the plan doc lays out the week-by-week task list.
Match N.4's plan structure (living document, task checkboxes, commit
SHAs appended, adjustments documented inline). Plan lives at:
`docs/superpowers/plans/2026-05-XX-phase-n5-modern-rendering.md`.
Suggested initial breakdown (brainstorm + spec will refine):
- **Week 1** — Plumbing: bindless handle generation in `TextureCache`,
shader rewrite (compile + bind), instance-attrib layout updated to
mat4+handle. Dispatcher still uses per-group draws but reads
textures bindless. Validate: visual identical to N.4.
- **Week 2** — Indirect: build `DrawElementsIndirectCommand` buffer
per frame, switch to `glMultiDrawElementsIndirect`. Three-pass
translucent (or whatever brainstorm decides). Validate: visual
identical, draw-call count drops to 2-4 per frame.
- **Week 3** — Polish + ship: persistent-mapped buffers if brainstorm
voted yes, profiler/counters, visual verification, flag flip, plan
finalization.
---
## Acceptance criteria for the whole phase
- Visual output identical to N.4 (no character regressions, no
scenery missing, no z-fighting introduced)
- `[WB-DIAG]` shows `drawsIssued` ≤ ~5 per frame (down from N.4's
few hundred)
- Frame time measurably lower in dense scenes (specify what scenes
to test in the spec — probably Holtburg courtyard + Foundry
interior)
- All tests still green (940/948 + any new conformance tests)
- `ACDREAM_USE_WB_FOUNDATION=0` escape hatch still works
- Plan doc finalized, roadmap updated, memory captured if N.5
surfaces durable lessons (it almost certainly will — bindless
+ indirect both have well-known driver gotchas)
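The frame-time criterion needs a number that is stable enough to compare between N.4 and N.5. One option, sketched here purely as an illustration (the window size and all names are assumptions, and the spec may well prefer a profiler), is a rolling mean over recent CPU frame times.

```c
#include <assert.h>
#include <stddef.h>

#define FRAME_WINDOW 128u  /* number of recent frames averaged (assumed) */

/* Rolling window of CPU frame times in milliseconds. */
typedef struct {
    double samples_ms[FRAME_WINDOW];
    size_t next;    /* slot the next sample overwrites */
    size_t filled;  /* number of valid samples so far  */
    double sum_ms;  /* running sum of valid samples    */
} FrameTimeWindow;

/* Record one frame time, evicting the oldest sample once the window is full. */
static void frame_window_push(FrameTimeWindow *w, double ms)
{
    assert(ms >= 0.0);
    if (w->filled == FRAME_WINDOW)
        w->sum_ms -= w->samples_ms[w->next];
    else
        w->filled++;
    w->samples_ms[w->next] = ms;
    w->sum_ms += ms;
    w->next = (w->next + 1u) % FRAME_WINDOW;
}

/* Mean of the recorded samples, or 0 when nothing has been recorded yet. */
static double frame_window_mean(const FrameTimeWindow *w)
{
    return w->filled ? w->sum_ms / (double)w->filled : 0.0;
}
```

Logging the mean next to the existing `[WB-DIAG]` counters would allow a like-for-like comparison in the same scene with the flag on and off.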
---
## What you'll be doing in the first 30 minutes
1. Read this handoff in full.
2. Read CLAUDE.md "WB integration cribs" section.
3. Read `WbDrawDispatcher.cs` end-to-end.
4. Skim WB's `StaticObjectModern.vert/frag` + `ObjectMeshManager.UploadGfxObjMeshData`
to ground the reference.
5. Verify build is green: `dotnet build`.
6. Verify N.4 ship is intact: `dotnet test --filter "FullyQualifiedName~Wb|MatrixComposition"`
should produce 60 passing tests, 0 failures.
7. Invoke the `superpowers:brainstorming` skill with the user. Walk
through the 8 brainstorm questions above. Capture decisions in a
spec.
8. Write the spec at the path above.
9. Write the plan at the path above.
10. Begin Week 1 implementation per the plan.
Don't skip the brainstorm. Multi-draw indirect + bindless have several
real driver-compatibility / API-shape decisions that need user input,
not "the agent makes a call and goes." This phase is structurally the
same shape as N.4 — brainstorm → spec → plan → tasks-with-checkboxes →
commits-update-checkboxes → final SHIP commit.
---
## Things to NOT do
- **Don't delete the legacy `InstancedMeshRenderer`.** It's the N.4
escape hatch. N.6 retires it after N.5 is proven default-on.
- **Don't fork WB.** N.4 deliberately avoided fork patches by using
the side-table pattern (`AcSurfaceMetadataTable`). Stay on that
path. If you need data WB doesn't expose, add a side-table or
decode it yourself from dats.
- **Don't try to make per-instance entities use WB's `TextureAtlasManager`.**
That's N.6+ territory. acdream's `TextureCache` owns palette/surface
overrides because WB's atlas is keyed by `(surfaceId, paletteId,
stippling, isSolid)` and our overrides don't fit cleanly. Bindless
handles let us escape that mismatch — handles for both atlas-tier
AND per-instance-tier textures, no atlas adoption needed.
- **Don't skip visual verification.** N.4 surfaced three bugs at
visual verification that no test caught. Don't trust "build green +
tests pass" — exercise the rendering path with the local ACE server.
- **Don't extend the phase scope.** N.5 is bindless + indirect on
the existing rendering pipeline. Texture array atlas, GPU-side
culling, terrain wiring — all of those are subsequent phases. If
the brainstorm tries to expand, push back.
---
## Reference: the N.4 dispatcher flow you're modifying
```
Draw(camera, landblockEntries, frustum, ...) {
// Phase 1: walk entities, build groups
foreach (entity, meshRef, batch) {
cull, classify into _groups[GroupKey]
}
// Phase 2: lay matrices contiguously
// Phase 3: glBufferData(_instanceVbo, allMatrices)
// Phase 4: bind global VAO once
// Phase 5: opaque pass (sorted)
foreach (group in _opaqueDraws) {
glBindTexture(group.handle)
glBindBuffer(EBO, group.ibo)
glDrawElementsInstancedBaseVertexBaseInstance(...)
}
// Phase 6: translucent pass
}
```
After N.5, Phases 5 and 6 collapse to:
```
glBindBuffer(DRAW_INDIRECT_BUFFER, _opaqueIndirect)
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, 0, opaqueGroups.Count, sizeof(DEIC))
glBindBuffer(DRAW_INDIRECT_BUFFER, _translucentIndirect)
// 3 sub-calls for translucent or 1 if shader-folded
glMultiDrawElementsIndirect(...)
```
That's the destination. Get there cleanly.
Good luck. Holler at the user if any of the brainstorm questions feel
genuinely ambiguous after reading the references — they care about
this phase landing right and will engage on design questions.