From dd5ca3d2b2f5636e332fe3878712d0ae8f77b8e5 Mon Sep 17 00:00:00 2001
From: Erik
Date: Fri, 8 May 2026 18:05:36 +0200
Subject: [PATCH] docs(N.5): cold-start handoff for next session

Detailed briefing for the next agent picking up Phase N.5 (Modern
Rendering Path: bindless textures + glMultiDrawElementsIndirect on
N.4's foundation). Covers:

- Where N.4 left things (commits, what works, gotchas inherited)
- The two-feature pairing (why bindless + indirect together)
- Files to read first (WB shaders, our dispatcher, CLAUDE.md cribs)
- 8 brainstorm questions to resolve before spec
- Spec + plan structure (matching N.4's pattern)
- Acceptance criteria
- Things to explicitly NOT do

Sized for a fresh session to pick up cold without spelunking through
months of session history.

Co-Authored-By: Claude Opus 4.6
---
 docs/research/2026-05-08-phase-n5-handoff.md | 495 +++++++++++++++++++
 1 file changed, 495 insertions(+)
 create mode 100644 docs/research/2026-05-08-phase-n5-handoff.md

diff --git a/docs/research/2026-05-08-phase-n5-handoff.md b/docs/research/2026-05-08-phase-n5-handoff.md
new file mode 100644
index 0000000..1c4d7be
--- /dev/null
+++ b/docs/research/2026-05-08-phase-n5-handoff.md
@@ -0,0 +1,495 @@
+# Phase N.5 — Modern Rendering Path — Cold-Start Handoff
+
+**Created:** 2026-05-08, immediately after N.4 ship.
+**Audience:** the next agent picking up rendering perf work.
+**Purpose:** give you everything you need to start N.5 cold, without
+spelunking through five months of session history.
+
+---
+
+## TL;DR
+
+N.4 just shipped: WB's `ObjectMeshManager` is now acdream's production
+mesh pipeline, and `WbDrawDispatcher` is the production draw path.
+It works (Holtburg renders correctly, FPS substantially improved over the
+naïve dual-pipeline state we hit during week 4 verification) but it's
+still doing per-group state changes (`glBindTexture`, `glBindBuffer`
+for the IBO, `glDrawElementsInstancedBaseVertexBaseInstance` per group)
+and a fresh `glBufferData` upload per frame.
+
+**N.5's job: lift the dispatcher onto WB's modern rendering primitives
+that we're already paying GPU-feature-detection cost for.** Two big
+wins, paired:
+
+1. **Bindless textures** (`GL_ARB_bindless_texture`) — WB already
+   populates `ObjectRenderBatch.BindlessTextureHandle`. Switch our
+   shader to read texture handles from a per-instance attribute
+   (`uvec2` → `sampler2D` via the bindless extension). Eliminates
+   100% of `glBindTexture` calls.
+2. **Multi-draw indirect** (`glMultiDrawElementsIndirect`) — build a
+   buffer of `DrawElementsIndirectCommand` structs (one per group),
+   upload once, fire ONE `glMultiDrawElementsIndirect` call per pass.
+   The driver pulls everything from the indirect buffer.
+
+Together they target a 2-5× CPU win on draw-heavy scenes (Holtburg
+courtyard, Foundry, dense dungeons). They're packaged together because
+both are "modern path" extensions we already gate on, both require
+the same shader rewrite, and they pair naturally — multi-draw indirect
+buys almost no CPU win without bindless, because per-group
+`glBindTexture` calls would still force one draw call per group.
+
+**Estimated scope: 2-3 weeks.** Plan + spec to be written by the
+brainstorm + spec steps below.
+
+---
+
+## Where N.4 left things
+
+### Branch state
+
+If this handoff is being read on `main` after merging the N.4 worktree:
+N.4 commits land at the head of main.
+The relevant final commits:
+
+- `c445364` — N.4 SHIP (flag default-on, plan final, roadmap, memory)
+- `573526d` — perf pass 1-4 (drop dead lookup, sort, cull, hash memo)
+- `7b41efc` — FirstIndex/BaseVertex + Issue #47 + grouped instanced
+- `943652d` — load triggers + `batch.Key.SurfaceId` source
+- `01cff41` — Tasks 22+23 (`WbDrawDispatcher` + side-table)
+
+If the worktree branch (`claude/tender-mcclintock-a16839`) hasn't been
+merged yet, that's where the work is. Verify with `git log --oneline`.
+
+### What works in N.4
+
+- `ACDREAM_USE_WB_FOUNDATION=1` is default-on. WB's `ObjectMeshManager`
+  loads, decodes, and uploads every entity mesh. Our existing
+  `TextureCache` decodes textures (palette-aware, per-instance overrides
+  via `GetOrUploadWithPaletteOverride`).
+- `WbDrawDispatcher.Draw`:
+  - Walks visible entities (per-landblock AABB cull + per-entity AABB
+    cull + portal visibility)
+  - Buckets every (entity × meshRef × batch) tuple by
+    `GroupKey(Ibo, FirstIndex, BaseVertex, IndexCount, TextureHandle, Translucency)`
+  - Single `glBufferData` upload of all matrices for the frame
+  - Per group: `glActiveTexture(0) + glBindTexture(2D, handle) + glBindBuffer(EBO, ibo) + glDrawElementsInstancedBaseVertexBaseInstance(..., FirstInstance)`
+  - Two passes: opaque (front-to-back sorted) + translucent
+- 940/948 tests pass (8 pre-existing failures unrelated to rendering).
+- Visual verification at Holtburg passed: scenery + characters render
+  correctly with full close-detail geometry (Issue #47 preserved).
+
+### What N.5 inherits
+
+These are levers N.5 will pull on:
+
+- **WB's modern rendering is already active.** `OpenGLGraphicsDevice`
+  detected GL 4.3 + bindless on first run; WB's `_useModernRendering`
+  is true; every mesh lives in WB's single `GlobalMeshBuffer` (one VAO,
+  one VBO, one IBO).
+- **Bindless handles are already populated.** `ObjectRenderBatch.BindlessTextureHandle`
+  is non-zero for batches WB owns the texture for.
+  (See "Per-instance entity texture handles" below for entities with
+  palette overrides — those use acdream's `TextureCache`, which doesn't
+  expose bindless handles yet.)
+- **The instance VBO is acdream-owned** (`WbDrawDispatcher._instanceVbo`)
+  with locations 3-6 patched onto WB's global VAO. Stride 64 bytes
+  (one mat4). N.5 expands this to (mat4 + uvec2 handle) = 72 bytes,
+  padded to 80.
+
+### Three load-bearing WB API gotchas N.4 surfaced
+
+These bit us hard during Task 26 visual verification. Documented in
+CLAUDE.md "WB integration cribs" + plan adjustments 7-9 +
+`memory/project_phase_n4_state.md`. Re-stating here because they
+reshape the design space:
+
+1. **`ObjectMeshManager.IncrementRefCount(id)` is NOT lifecycle-aware.**
+   It only bumps a usage counter. Mesh loading is fired separately
+   via `PrepareMeshDataAsync(id, isSetup)`. The result auto-enqueues
+   to `_stagedMeshData` (line 510 of `ObjectMeshManager.cs`); our
+   existing `WbMeshAdapter.Tick()` drains it. `WbMeshAdapter.IncrementRefCount`
+   already calls `PrepareMeshDataAsync`. **N.5 doesn't need to change
+   this — just don't break it.**
+
+2. **`ObjectRenderBatch.SurfaceId` is unset.** WB constructs batches
+   with `Key = batch.Key` (a `TextureAtlasManager.TextureKey` struct
+   that has a `SurfaceId` field) but never populates the top-level
+   `SurfaceId` property. Read `batch.Key.SurfaceId`. **N.5 keeps this
+   pattern.**
+
+3. **WB's modern rendering packs every mesh into ONE global
+   VAO/VBO/IBO.** Each batch's `IBO` field points to the global IBO;
+   the batch's actual slice is identified by `FirstIndex` (offset into
+   IBO, in *indices*) and `BaseVertex` (offset into VBO, in *vertices*).
+   N.4's draw uses `glDrawElementsInstancedBaseVertexBaseInstance`
+   with those offsets.
+   **N.5's `DrawElementsIndirectCommand` per-group
+   record will carry `firstIndex` + `baseVertex` for the same reason.**
+
+---
+
+## What N.5 is — technical detail
+
+### The two-feature pairing
+
+**Bindless textures** (`GL_ARB_bindless_texture`):
+- Each texture handle is a 64-bit integer (`uvec2` in GLSL).
+- Shader declares `layout(bindless_sampler) uniform sampler2D ...` or
+  receives the handle as a per-instance vertex attribute (`uvec2`).
+- No `glBindTexture` needed at draw time — the handle IS the binding.
+- Handle generation: `glGetTextureHandleARB(textureId)` followed by
+  `glMakeTextureHandleResidentARB(handle)` (the texture must be
+  resident on the GPU; non-resident handles produce GPU faults).
+
+**Multi-draw indirect** (`glMultiDrawElementsIndirect`):
+- Indirect command struct layout (must match `DrawElementsIndirectCommand`):
+  ```c
+  struct {
+      uint count;         // index count for this draw
+      uint instanceCount; // number of instances
+      uint firstIndex;    // offset into IBO, in indices
+      int  baseVertex;    // vertex offset into VBO
+      uint baseInstance;  // first instance ID (offsets per-instance attribs)
+  };
+  ```
+- Build a buffer of N of these structs (one per group), upload once,
+  fire one GL call: `glMultiDrawElementsIndirect(mode, indexType, ptr, drawcount, stride)`.
+- The driver issues all N draws in one shot. Effectively zero CPU
+  overhead per draw beyond uploading the indirect buffer.
+
+**Why pair them.** Multi-draw indirect doesn't let you change GL state
+(uniforms or texture bindings) between the draws in one call. So if
+textures are bound via `glBindTexture` per group, you'd still need a
+CPU-side bind plus a separate draw call for every group — defeating
+the purpose. Bindless removes that constraint by
+encoding the texture handle as per-instance data the shader reads
+directly. With both, the modern render loop becomes:
+
+```
+1. Upload instance buffer (mat4 + uvec2 handle, per-instance) — once per frame
+2. Upload indirect command buffer (one DEIC per group) — once per frame
+3. 
glBindVertexArray(globalVAO) — once
+4. glMultiDrawElementsIndirect(...) — ONCE per pass
+```
+
+That's it. No per-group state changes.
+
+### Instance attribute layout
+
+Currently (N.4): locations 3-6 = mat4 model matrix (16 floats = 64 bytes).
+
+N.5 (proposed): locations 3-6 = mat4, plus location 7 = uvec2 bindless
+handle: 16 floats + 2 uints = 72 bytes (padded to a 16-byte multiple,
+80 bytes, per WB's `InstanceData` precedent).
+
+Or use an explicitly padded, 16-byte-aligned struct:
+```c
+struct InstanceData {
+    mat4 transform;       // locations 3-6
+    uvec2 textureHandle;  // location 7
+    uvec2 _pad;           // padding to 80
+};
+```
+
+Brainstorm should decide if we copy WB's `InstanceData` struct (Pack=16,
+80 bytes including CellId/Flags fields we don't use) or define our own
+minimal version. The 80-byte stride matches WB's so global VAO state
+configured by WB stays compatible if the legacy WB draw path ever runs.
+
+### Per-instance entity texture handles
+
+Here's the wrinkle. N.4 uses `WbDrawDispatcher.ResolveTexture` to map
+each (entity, batch) to a GL texture handle:
+
+- Tree (no overrides): `_textures.GetOrUpload(surfaceId)` → 2D texture handle
+- NPC with palette override: `_textures.GetOrUploadWithPaletteOverride(...)` → composite-cached 2D texture handle
+- Anything with surface override: `_textures.GetOrUploadWithOrigTextureOverride(...)` → composite-cached 2D texture handle
+
+Those are all `GLuint` 32-bit GL texture *names*, not bindless handles.
+**N.5 needs `TextureCache` to publish bindless handles for everything
+it owns, not just WB-owned textures.**
+
+Implementation sketch:
+- `TextureCache` adds a parallel cache keyed identically but storing
+  64-bit bindless handles. On first request, generate via
+  `glGetTextureHandleARB(textureId)` + make resident.
+- New API: `GetBindlessHandle(uint surfaceId, ...)` returns the handle.
+- Or: change every `GetOrUpload*` method to return both the GL name
+  and the bindless handle (or just the handle; let GL name fall out
+  if anyone needs it later).
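
The parallel handle cache from the implementation sketch can be pictured with a minimal C sketch, under loudly stated assumptions: it is keyed by GL texture name rather than by acdream's real surface/override keys, `BindlessHandleCache`, `handle_cache_init`, `handle_cache_get` and the fixed 256-slot table are hypothetical names and sizes, and the two function pointers stand in for `glGetTextureHandleARB` and `glMakeTextureHandleResidentARB` so the code runs without a GL context. The fake driver functions exist only to exercise the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for glGetTextureHandleARB / glMakeTextureHandleResidentARB,
   injected so this sketch runs without a GL context. */
typedef uint64_t (*GetHandleFn)(uint32_t textureName);
typedef void (*MakeResidentFn)(uint64_t handle);

enum { HANDLE_CACHE_SLOTS = 256 }; /* hypothetical capacity; power of two */

typedef struct {
    uint32_t names[HANDLE_CACHE_SLOTS];   /* 0 marks an empty slot */
    uint64_t handles[HANDLE_CACHE_SLOTS];
    size_t used;
    GetHandleFn getHandle;
    MakeResidentFn makeResident;
} BindlessHandleCache;

static void handle_cache_init(BindlessHandleCache *cache,
                              GetHandleFn getHandle,
                              MakeResidentFn makeResident) {
    assert(cache && getHandle && makeResident);
    memset(cache, 0, sizeof *cache);
    cache->getHandle = getHandle;
    cache->makeResident = makeResident;
}

/* Returns the cached handle for textureName, creating it and making it
   resident on first request. Returns 0 for name 0 or a full table. */
static uint64_t handle_cache_get(BindlessHandleCache *cache,
                                 uint32_t textureName) {
    if (textureName == 0) return 0;
    /* Multiplicative hash, then linear probing over the fixed table. */
    size_t slot = (textureName * 2654435761u) & (HANDLE_CACHE_SLOTS - 1);
    for (size_t probe = 0; probe < HANDLE_CACHE_SLOTS; ++probe) {
        if (cache->names[slot] == textureName) return cache->handles[slot];
        if (cache->names[slot] == 0) {
            uint64_t handle = cache->getHandle(textureName);
            cache->makeResident(handle); /* once per texture, never per draw */
            cache->names[slot] = textureName;
            cache->handles[slot] = handle;
            cache->used++;
            return handle;
        }
        slot = (slot + 1) & (HANDLE_CACHE_SLOTS - 1);
    }
    return 0;
}

/* Fake driver, present only to exercise the sketch. */
static int g_residentCalls = 0;
static uint64_t fake_get_handle(uint32_t textureName) {
    return 0x100000000ull | textureName;
}
static void fake_make_resident(uint64_t handle) {
    (void)handle;
    g_residentCalls++;
}
```

With the fake driver, asking for the same texture name twice returns the same handle and triggers a single residency call, which is the property the real cache would need: residency is paid once per texture rather than once per draw. Eviction and making handles non-resident are deliberately left out; those belong to the residency-policy question in the brainstorm.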
+
+WB's `ObjectRenderBatch.BindlessTextureHandle` covers the atlas-tier
+case. For per-instance entities, we use `TextureCache`'s handle.
+
+### The new shader
+
+Reuse WB's `StaticObjectModern.vert` / `StaticObjectModern.frag` as a
+template. Read those files cold. They already do bindless + the
+instance-data layout. Adapt to acdream's `mesh_instanced.vert/frag`
+conventions:
+
+- Keep the `uViewProjection` uniform, lighting UBO at binding=1, fog
+  uniforms.
+- Add `#version 430 core` + `#extension GL_ARB_bindless_texture : require`.
+- Replace `uniform sampler2D uDiffuse` with a `uvec2` per-instance
+  vertex attribute (location 7) → reconstruct sampler in vertex shader
+  OR pass through to fragment via flat varying.
+- Drop `uTranslucencyKind` uniform, OR keep it (still set per-pass —
+  multi-draw indirect doesn't break uniforms; only state that varies
+  per-draw is the constraint).
+
+### Translucency
+
+Multi-draw indirect can't change blend state mid-draw. Solution:
+**still use two passes** (opaque + translucent), but within translucent
+keep the per-blendfunc sub-passes (additive, alpha-blend, inv-alpha).
+Three sub-passes within translucent. Each sub-pass = one
+`glMultiDrawElementsIndirect` over its filtered groups.
+
+Or: if perf allows, fold all four blend modes into the shader via
+per-instance blendmode int, sort all translucent groups by blendmode
+in the indirect buffer, switch blend state at sub-pass boundaries.
+Brainstorm decides the cleanest pattern.
+
+---
+
+## Files to read before brainstorming
+
+In rough order:
+
+1. **N.4 plan + spec** — `docs/superpowers/plans/2026-05-08-phase-n4-rendering-foundation.md`
+   (status: Final). Adjustments 7-10 capture the gotchas. Spec at
+   `docs/superpowers/specs/2026-05-08-phase-n4-rendering-foundation-design.md`.
+
+2. **N.4 dispatcher source** — `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`.
+   This is what you're modifying. Read end-to-end.
+
+3. 
**WB's modern rendering shaders** — `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Shaders/StaticObjectModern.vert`
+   + `StaticObjectModern.frag`. The template you're adapting from.
+
+4. **WB's `ObjectMeshManager.UploadGfxObjMeshData`** — lines ~1654-1780
+   of `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectMeshManager.cs`.
+   Shows how WB sets up the modern path's VBO/IBO/VAO. Especially note
+   how it patches in instance attribute slots (locations 3-6) on the
+   global VAO and configures location 7+ for bindless handles.
+
+5. **WB's `ObjectRenderBatch`** — same file, lines ~166-184. Note the
+   `BindlessTextureHandle` field — already populated when `_useModernRendering`
+   is on.
+
+6. **Our `TextureCache`** — `src/AcDream.App/Rendering/TextureCache.cs`.
+   Three composite caches: by surface id, by surface+origTex, by
+   surface+origTex+palette. N.5 adds parallel bindless-handle caches.
+
+7. **CLAUDE.md "WB integration cribs"** section. Lines ~28-80. The
+   three gotchas + the integration architecture in plain language.
+
+8. **Memory: `project_phase_n4_state.md`** — same content from a
+   different angle. Reading both helps lock in the gotchas.
+
+---
+
+## Brainstorm questions
+
+These are the questions to resolve in the brainstorm step. Don't
+prejudge them — bring them to the user with options + recommendation:
+
+1. **Instance attribute layout.** Match WB's `InstanceData` struct
+   (80 bytes including CellId/Flags fields we don't use) for global
+   VAO compatibility, or define a minimal acdream-specific version
+   (mat4 + handle = ~72 bytes padded to 80)?
+
+2. **Bindless handle generation strategy.**
+   - At texture upload time? (Eager — every texture that lands in
+     `TextureCache` gets a handle. Memory cost ~per-texture state.)
+   - On first draw lookup? (Lazy — cache fills as scene exercises
+     content. Possible first-use stall.)
+   - At spawn time via the spawn adapter? (Tied to lifecycle. Cleanest
+     but requires touching the spawn path.)
+
+3. 
**Translucent pass structure.** Three sub-indirect-draws (one per
+   blend mode) or a single sorted indirect buffer with per-instance
+   blend mode + state-flip at sub-pass boundaries? Or: just iterate
+   per-group like N.4 for translucent only (translucent groups are a
+   small fraction of total)?
+
+4. **Persistent-mapped indirect + instance buffers.** Use
+   `GL_ARB_buffer_storage` + `GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT`?
+   Triple-buffered ring + sync object? Or stick with `glBufferData`
+   (still one upload per frame, just larger)? Persistent mapping is
+   a ~2-5% per-frame win in our context but adds buffer-management
+   complexity.
+
+5. **Shader unification.** Keep `mesh_instanced` for legacy + add
+   `mesh_indirect` for modern, or replace `mesh_instanced` entirely?
+   Replacement requires the legacy `InstancedMeshRenderer` (escape
+   hatch under `ACDREAM_USE_WB_FOUNDATION=0`) to also use the new
+   shader, which... probably doesn't matter if we delete legacy in
+   N.6 anyway. Brainstorm.
+
+6. **Conformance test strategy.** N.4 used visual verification at
+   Holtburg as the gate. N.5's gate is "no visual regression vs N.4
+   AND measurable CPU win." How do we measure CPU? `[WB-DIAG]`
+   counters give draw count + group count; we need frame-time
+   counters too. Add to the dispatcher? Use a profiler?
+
+7. **Per-instance entity bindless.** `TextureCache.GetOrUpload*`
+   returns a GL name. The dispatcher (or `TextureCache` itself) needs
+   to convert that to a bindless handle. Design questions:
+   - Where does the conversion happen?
+   - When is the texture made resident? (Residency is global state;
+     too many resident textures hit driver limits.)
+   - What about palette/surface overrides — same caching key as the
+     name, just a parallel handle dictionary?
+
+8. **Escape hatch.** N.4 keeps `ACDREAM_USE_WB_FOUNDATION=0` as a
+   fallback.
+   N.5 needs to decide: does the new shader REPLACE the
+   N.4 dispatcher's draw path (so flag-on means N.5 modern path,
+   flag-off means legacy `InstancedMeshRenderer`)? Or do we add a
+   separate flag (`ACDREAM_USE_MODERN_DRAW`) so users can toggle
+   N.4 vs N.5 vs legacy independently? Three-way flag is more
+   complex but useful for A/B during rollout.
+
+---
+
+## Spec structure
+
+After the brainstorm, the spec doc covers:
+
+1. **Architecture diagram** — how `WbDrawDispatcher` changes shape.
+   Where the indirect buffer lives. Where bindless handles flow from.
+2. **Instance data layout** — exact struct, byte offsets, GL attribute
+   pointer setup.
+3. **TextureCache changes** — new methods, new cache, residency
+   policy.
+4. **Shader files** — name(s), version, extensions, in/out variables.
+5. **Conformance tests** — what to write, what coverage to claim.
+6. **Acceptance criteria** — visual identity to N.4 + measured CPU
+   delta.
+7. **Risks** — driver bugs in bindless / indirect, residency limits,
+   shader compile issues on weird GPUs, the legacy escape hatch
+   breaking.
+
+Spec lives at: `docs/superpowers/specs/2026-05-XX-phase-n5-modern-rendering-design.md`.
+
+## Plan structure
+
+After the spec, the plan doc lays out the week-by-week task list.
+Match N.4's plan structure (living document, task checkboxes, commit
+SHAs appended, adjustments documented inline). Plan lives at:
+`docs/superpowers/plans/2026-05-XX-phase-n5-modern-rendering.md`.
+
+Suggested initial breakdown (brainstorm + spec will refine):
+
+- **Week 1** — Plumbing: bindless handle generation in `TextureCache`,
+  shader rewrite (compile + bind), instance-attrib layout updated to
+  mat4+handle. Dispatcher still uses per-group draws but reads
+  textures bindless. Validate: visual identical to N.4.
+- **Week 2** — Indirect: build `DrawElementsIndirectCommand` buffer
+  per frame, switch to `glMultiDrawElementsIndirect`. Three-pass
+  translucent (or whatever brainstorm decides).
+  Validate: visual identical, draw-call count drops to 2-4 per frame.
+- **Week 3** — Polish + ship: persistent-mapped buffers if brainstorm
+  voted yes, profiler/counters, visual verification, flag flip, plan
+  finalization.
+
+---
+
+## Acceptance criteria for the whole phase
+
+- Visual output identical to N.4 (no character regressions, no
+  scenery missing, no z-fighting introduced)
+- `[WB-DIAG]` shows `drawsIssued` ≤ ~5 per frame (down from N.4's
+  few hundred)
+- Frame time measurably lower in dense scenes (specify what scenes
+  to test in the spec — probably Holtburg courtyard + Foundry
+  interior)
+- All tests still green (940/948 + any new conformance tests)
+- `ACDREAM_USE_WB_FOUNDATION=0` escape hatch still works
+- Plan doc finalized, roadmap updated, memory captured if N.5
+  surfaces durable lessons (it almost certainly will — bindless +
+  indirect both have well-known driver gotchas)
+
+---
+
+## What you'll be doing in the first 30 minutes
+
+1. Read this handoff in full.
+2. Read CLAUDE.md "WB integration cribs" section.
+3. Read `WbDrawDispatcher.cs` end-to-end.
+4. Skim WB's `StaticObjectModern.vert/frag` + `ObjectMeshManager.UploadGfxObjMeshData`
+   to ground the reference.
+5. Verify build is green: `dotnet build`.
+6. Verify N.4 ship is intact: `dotnet test --filter "FullyQualifiedName~Wb|MatrixComposition"`
+   should produce 60 passing tests, 0 failures.
+7. Invoke the `superpowers:brainstorming` skill with the user. Walk
+   through the 8 brainstorm questions above. Capture decisions in a
+   spec.
+8. Write the spec at the path above.
+9. Write the plan at the path above.
+10. Begin Week 1 implementation per the plan.
+
+Don't skip the brainstorm. Multi-draw indirect + bindless have several
+real driver-compatibility / API-shape decisions that need user input,
+not "the agent makes a call and goes."
+This phase is structurally the same shape as N.4 — brainstorm → spec →
+plan → tasks-with-checkboxes → commits-update-checkboxes → final SHIP
+commit.
+
+---
+
+## Things to NOT do
+
+- **Don't delete the legacy `InstancedMeshRenderer`.** It's the N.4
+  escape hatch. N.6 retires it after N.5 is proven default-on.
+- **Don't fork WB.** N.4 deliberately avoided fork patches by using
+  the side-table pattern (`AcSurfaceMetadataTable`). Stay on that
+  path. If you need data WB doesn't expose, add a side-table or
+  decode it yourself from dats.
+- **Don't try to make per-instance entities use WB's `TextureAtlasManager`.**
+  That's N.6+ territory. acdream's `TextureCache` owns palette/surface
+  overrides because WB's atlas is keyed by `(surfaceId, paletteId,
+  stippling, isSolid)` and our overrides don't fit cleanly. Bindless
+  handles let us escape that mismatch — handles for both atlas-tier
+  AND per-instance-tier textures, no atlas adoption needed.
+- **Don't skip visual verification.** N.4 surfaced three bugs at
+  visual verification that no test caught. Don't trust "build green
+  + tests pass" — exercise the rendering path with the local ACE server.
+- **Don't extend the phase scope.** N.5 is bindless + indirect on
+  the existing rendering pipeline. Texture array atlas, GPU-side
+  culling, terrain wiring — all of those are subsequent phases. If
+  the brainstorm tries to expand, push back.
+
+---
+
+## Reference: the N.4 dispatcher flow you're modifying
+
+```
+Draw(camera, landblockEntries, frustum, ...) {
+    // Phase 1: walk entities, build groups
+    foreach (entity, meshRef, batch) {
+        cull, classify into _groups[GroupKey]
+    }
+
+    // Phase 2: lay matrices contiguously
+    // Phase 3: glBufferData(_instanceVbo, allMatrices)
+    // Phase 4: bind global VAO once
+    // Phase 5: opaque pass (sorted)
+    foreach (group in _opaqueDraws) {
+        glBindTexture(group.handle)
+        glBindBuffer(EBO, group.ibo)
+        glDrawElementsInstancedBaseVertexBaseInstance(...)
+    }
+    // Phase 6: translucent pass
+}
+```
+
+After N.5, Phases 5 and 6 collapse to:
+
+```
+glBindBuffer(GL_DRAW_INDIRECT_BUFFER, _opaqueIndirect)
+glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, 0, opaqueGroups.Count, sizeof(DEIC))
+glBindBuffer(GL_DRAW_INDIRECT_BUFFER, _translucentIndirect)
+// 3 sub-calls for translucent or 1 if shader-folded
+glMultiDrawElementsIndirect(...)
+```
+
+That's the destination. Get there cleanly.
+
+Good luck. Holler at the user if any of the brainstorm questions feel
+genuinely ambiguous after reading the references — they care about
+this phase landing right and will engage on design questions.
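
For reference, the indirect-command and instance-record layouts discussed in this handoff can be written down in C roughly as follows. This is a sketch under stated assumptions, not the dispatcher's code: `InstanceRecord`, `DrawGroup`, `build_indirect_commands` and `split_handle` are hypothetical names, the 80-byte record is the minimal variant rather than WB's `InstanceData`, and the low-word-first handle split assumes the usual `uvec2` convention, which should be checked against WB's shaders before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field order and types follow the DrawElementsIndirectCommand layout
   quoted earlier in this handoff: five tightly packed 32-bit fields. */
typedef struct {
    uint32_t count;         /* index count for this draw */
    uint32_t instanceCount; /* number of instances */
    uint32_t firstIndex;    /* offset into the IBO, in indices */
    int32_t  baseVertex;    /* offset into the VBO, in vertices */
    uint32_t baseInstance;  /* first slot in the instance buffer */
} DrawElementsIndirectCommand;

_Static_assert(sizeof(DrawElementsIndirectCommand) == 20,
               "indirect command must be 20 bytes, no padding");

/* Hypothetical minimal per-instance record: mat4 + handle + padding. */
typedef struct {
    float    transform[16];    /* attribute locations 3-6 */
    uint32_t textureHandle[2]; /* location 7: [0] low word, [1] high word (assumed) */
    uint32_t pad[2];           /* pads 72 bytes up to the 80-byte stride */
} InstanceRecord;

_Static_assert(sizeof(InstanceRecord) == 80, "instance stride must be 80");
_Static_assert(offsetof(InstanceRecord, textureHandle) == 64,
               "handle must start right after the mat4");

/* Hypothetical CPU-side group, one per indirect command. */
typedef struct {
    uint32_t indexCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t instanceCount;
} DrawGroup;

/* Fills out[0..n) and returns the total instance count. Each command's
   baseInstance is the running sum of the preceding instance counts. */
static uint32_t build_indirect_commands(const DrawGroup *groups, size_t n,
                                        DrawElementsIndirectCommand *out) {
    uint32_t nextInstance = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i].count = groups[i].indexCount;
        out[i].instanceCount = groups[i].instanceCount;
        out[i].firstIndex = groups[i].firstIndex;
        out[i].baseVertex = groups[i].baseVertex;
        out[i].baseInstance = nextInstance;
        nextInstance += groups[i].instanceCount;
    }
    return nextInstance;
}

/* Splits a 64-bit bindless handle into two 32-bit words for the record. */
static void split_handle(uint64_t handle, uint32_t words[2]) {
    words[0] = (uint32_t)(handle & 0xFFFFFFFFu);
    words[1] = (uint32_t)(handle >> 32);
}
```

Because `baseInstance` accumulates across groups, every group reads its own contiguous slice of the single per-frame instance buffer, which is what lets one `glMultiDrawElementsIndirect` call cover a whole pass. The size assertions are the useful part to carry into real code: a mismatch between the CPU-side struct and the stride passed to GL is a silent-corruption bug rather than a crash.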