# Phase N.5: Modern Rendering Path, Cold-Start Handoff

**Created:** 2026-05-08, immediately after N.4 ship.
**Audience:** the next agent picking up rendering perf work.
**Purpose:** give you everything you need to start N.5 cold, without spelunking through five months of session history.

---

## TL;DR

N.4 just shipped: WB's `ObjectMeshManager` is now acdream's production mesh pipeline, and `WbDrawDispatcher` is the production draw path. It works (Holtburg renders correctly, FPS substantially improved over the naïve dual-pipeline state we hit during week 4 verification), but it still issues per-group GL calls (`glBindTexture`, `glBindBuffer` for the IBO, one `glDrawElementsInstancedBaseVertexBaseInstance` per group) and a fresh `glBufferData` upload per frame.

**N.5's job: lift the dispatcher onto WB's modern rendering primitives that we're already paying GPU-feature-detection cost for.**

Two big wins, paired:

1. **Bindless textures** (`GL_ARB_bindless_texture`): WB already populates `ObjectRenderBatch.BindlessTextureHandle`. Switch our shader to read texture handles from a per-instance attribute (`uvec2` → `sampler2D` via the bindless extension). Eliminates every `glBindTexture` call in the dispatcher's draw loop.
2. **Multi-draw indirect** (`glMultiDrawElementsIndirect`): build a buffer of `DrawElementsIndirectCommand` structs (one per group), upload once, fire ONE `glMultiDrawElementsIndirect` call per pass. The driver pulls everything from the indirect buffer.

Together they target a 2-5× CPU win on draw-heavy scenes (Holtburg courtyard, Foundry, dense dungeons). They're packaged together because both are "modern path" extensions we already gate on, both require the same shader rewrite, and they pair naturally: multi-draw indirect yields no CPU win without bindless, because a `glBindTexture` between groups would force the batch to be split back into per-group draws.

**Estimated scope: 2-3 weeks.** Plan + spec to be written by the brainstorm + spec steps below.

---

## Where N.4 left things

### Branch state

If this handoff is being read on `main` after merging the N.4 worktree: N.4 commits land at the head of main. The relevant final commits:

- `c445364`: N.4 SHIP (flag default-on, plan final, roadmap, memory)
- `573526d`: perf pass 1-4 (drop dead lookup, sort, cull, hash memo)
- `7b41efc`: FirstIndex/BaseVertex + Issue #47 + grouped instanced
- `943652d`: load triggers + `batch.Key.SurfaceId` source
- `01cff41`: Tasks 22+23 (`WbDrawDispatcher` + side-table)

If the worktree branch (`claude/tender-mcclintock-a16839`) hasn't been merged yet, that's where the work is. Verify with `git log --oneline`.

### What works in N.4

- `ACDREAM_USE_WB_FOUNDATION=1` is default-on. WB's `ObjectMeshManager` loads, decodes, and uploads every entity mesh. Our existing `TextureCache` decodes textures (palette-aware, per-instance overrides via `GetOrUploadWithPaletteOverride`).
- `WbDrawDispatcher.Draw`:
  - Walks visible entities (per-landblock AABB cull + per-entity AABB cull + portal visibility)
  - Buckets every (entity × meshRef × batch) tuple by `GroupKey(Ibo, FirstIndex, BaseVertex, IndexCount, TextureHandle, Translucency)`
  - Single `glBufferData` upload of all matrices for the frame
  - Per group: `glActiveTexture(0) + glBindTexture(2D, handle) + glBindBuffer(EBO, ibo) + glDrawElementsInstancedBaseVertexBaseInstance(..., FirstInstance)`
  - Two passes: opaque (front-to-back sorted) + translucent
- 940/948 tests pass (8 pre-existing failures unrelated to rendering).
- Visual verification at Holtburg passed: scenery + characters render correctly with full close-detail geometry (Issue #47 preserved).

### What N.5 inherits

These are levers N.5 will pull on:

- **WB's modern rendering is already active.** `OpenGLGraphicsDevice` detected GL 4.3 + bindless on first run; WB's `_useModernRendering` is true; every mesh lives in WB's single `GlobalMeshBuffer` (one VAO, one VBO, one IBO).
- **Bindless handles are already populated.** `ObjectRenderBatch.BindlessTextureHandle` is non-zero for batches WB owns the texture for. (See "Per-instance entity texture handles" below for entities with palette overrides; those use acdream's `TextureCache`, which doesn't expose bindless handles yet.)
- **The instance VBO is acdream-owned** (`WbDrawDispatcher._instanceVbo`) with locations 3-6 patched onto WB's global VAO. Stride 64 bytes (one mat4). N.5 expands this to mat4 + `uvec2` handle: 72 bytes of payload, padded to an 80-byte stride.

### Three load-bearing WB API gotchas N.4 surfaced

These bit us hard during Task 26 visual verification. Documented in CLAUDE.md "WB integration cribs" + plan adjustments 7-9 + `memory/project_phase_n4_state.md`. Re-stating here because they reshape the design space:

1. **`ObjectMeshManager.IncrementRefCount(id)` is NOT lifecycle-aware.** It only bumps a usage counter. Mesh loading is fired separately via `PrepareMeshDataAsync(id, isSetup)`. The result auto-enqueues to `_stagedMeshData` (line 510 of `ObjectMeshManager.cs`); our existing `WbMeshAdapter.Tick()` drains it. `WbMeshAdapter.IncrementRefCount` already calls `PrepareMeshDataAsync`. **N.5 doesn't need to change this; just don't break it.**
2. **`ObjectRenderBatch.SurfaceId` is unset.** WB constructs batches with `Key = batch.Key` (a `TextureAtlasManager.TextureKey` struct that has a `SurfaceId` field) but never populates the top-level `SurfaceId` property. Read `batch.Key.SurfaceId`. **N.5 keeps this pattern.**
3. **WB's modern rendering packs every mesh into ONE global VAO/VBO/IBO.** Each batch's `IBO` field points to the global IBO; the batch's actual slice is identified by `FirstIndex` (offset into IBO, in *indices*) and `BaseVertex` (offset into VBO, in *vertices*). N.4's draw uses `glDrawElementsInstancedBaseVertexBaseInstance` with those offsets.
   **N.5's `DrawElementsIndirectCommand` per-group record will carry `firstIndex` + `baseVertex` for the same reason.**

---

## What N.5 is: technical detail

### The two-feature pairing

**Bindless textures** (`GL_ARB_bindless_texture`):

- Each texture handle is a 64-bit integer (`uvec2` in GLSL).
- Shader declares `layout(bindless_sampler) uniform sampler2D ...` or receives the handle as a per-instance vertex attribute (`uvec2`).
- No `glBindTexture` needed at draw time; the handle IS the binding.
- Handle generation: `glGetTextureHandleARB(textureId)` followed by `glMakeTextureHandleResidentARB(handle)`. The handle must be resident before any draw samples through it; sampling a non-resident handle is undefined behavior and can fault the GPU.

**Multi-draw indirect** (`glMultiDrawElementsIndirect`):

- Indirect command struct layout (must match `DrawElementsIndirectCommand`):

  ```c
  struct {
    uint count;          // index count for this draw
    uint instanceCount;  // number of instances
    uint firstIndex;     // offset into IBO, in indices
    int  baseVertex;     // vertex offset into VBO
    uint baseInstance;   // first instance ID (offsets per-instance attribs)
  };
  ```

- Build a buffer of N of these structs (one per group), upload once, fire one GL call: `glMultiDrawElementsIndirect(mode, indexType, ptr, drawcount, stride)`.
- The driver issues all N draws in one shot. Effectively zero CPU overhead per draw beyond uploading the indirect buffer.

**Why pair them.** Multi-draw indirect doesn't let you change uniform or binding state between the draws of a single call. So if textures are bound via `glBindTexture` per group, you'd have to split the batch back into N separate calls with a bind before each one, defeating the purpose. Bindless removes that constraint by encoding the texture handle as per-instance data the shader reads directly.

With both, the modern render loop becomes:

```
1. Upload instance buffer (mat4 + uvec2 handle, per-instance) - once per frame
2. Upload indirect command buffer (one DEIC per group)        - once per frame
3. glBindVertexArray(globalVAO)                                - once
4. glMultiDrawElementsIndirect(...)                            - ONCE per pass
```

That's it. No per-group state changes.

### Instance attribute layout

Currently (N.4): location 3-6 = mat4 model matrix (16 floats = 64 bytes).

N.5 (proposed): location 3-6 = mat4 + location 7 = uvec2 bindless handle = 16 floats + 2 uints = 72 bytes (padded to an 80-byte stride, a multiple of 16, per WB's `InstanceData` precedent). Or make the padding explicit in the struct:

```c
struct InstanceData {
    mat4  transform;      // locations 3-6
    uvec2 textureHandle;  // location 7
    uvec2 _pad;           // padding to 80
};
```

Brainstorm should decide if we copy WB's `InstanceData` struct (Pack=16, 80 bytes including CellId/Flags fields we don't use) or define our own minimal version. The 80-byte stride matches WB's, so global VAO state configured by WB stays compatible if the legacy WB draw path ever runs.

### Per-instance entity texture handles

Here's the wrinkle. N.4 uses `WbDrawDispatcher.ResolveTexture` to map each (entity, batch) to a GL texture name:

- Tree (no overrides): `_textures.GetOrUpload(surfaceId)` → 2D texture name
- NPC with palette override: `_textures.GetOrUploadWithPaletteOverride(...)` → composite-cached 2D texture name
- Anything with surface override: `_textures.GetOrUploadWithOrigTextureOverride(...)` → composite-cached 2D texture name

Those are all `GLuint` 32-bit GL texture *names*, not bindless handles. **N.5 needs `TextureCache` to publish bindless handles for everything it owns, not just WB-owned textures.**

Implementation sketch:

- `TextureCache` adds a parallel cache keyed identically but storing 64-bit bindless handles. On first request, generate via `glGetTextureHandleARB(textureId)` + make resident.
- New API: `GetBindlessHandle(uint surfaceId, ...)` returns the handle.
- Or: change every `GetOrUpload*` method to return both the GL name and the bindless handle (or just the handle; let GL name fall out if anyone needs it later).

WB's `ObjectRenderBatch.BindlessTextureHandle` covers the atlas-tier case. For per-instance entities, we use `TextureCache`'s handle.
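
As a rough illustration of the "parallel cache" idea and the 80-byte instance record, here is a minimal CPU-side sketch in C. Everything in it is hypothetical: `InstanceRecord`, `HandleCache`, `handle_cache_get`, and `instance_set_handle` are illustrative names, not acdream or WB APIs, and `demo_make_handle` is only a stand-in for the real `glGetTextureHandleARB` + `glMakeTextureHandleResidentARB` pair so the logic can run without a GL context. It also assumes the shader reads the low 32 bits of the handle from `uvec2.x` and the high 32 bits from `.y`; check that against the extension spec and WB's `StaticObjectModern` shaders before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-instance record: mat4 + 64-bit handle, padded to 80 bytes. */
typedef struct {
    float    transform[16];      /* attribute locations 3-6 */
    uint32_t texture_handle[2];  /* location 7, read as uvec2 (assumed: lo, hi) */
    uint32_t pad[2];             /* explicit padding up to the 80-byte stride */
} InstanceRecord;

static_assert(sizeof(InstanceRecord) == 80, "stride must be 80 bytes");
static_assert(offsetof(InstanceRecord, texture_handle) == 64, "handle follows the mat4");

/* Callback that turns a GL texture name into a resident bindless handle. */
typedef uint64_t (*MakeResidentHandleFn)(uint32_t gl_texture_name);

#define HANDLE_CACHE_CAPACITY 1024u  /* power of two; arbitrary for the sketch */

/* Open-addressing map from GL texture name to bindless handle. */
typedef struct {
    uint32_t names[HANDLE_CACHE_CAPACITY];    /* 0 marks an empty slot */
    uint64_t handles[HANDLE_CACHE_CAPACITY];
    size_t   count;
} HandleCache;

/* Returns the cached handle, creating it through `make` on first request. */
static uint64_t handle_cache_get(HandleCache *cache, uint32_t name,
                                 MakeResidentHandleFn make)
{
    assert(name != 0);  /* 0 is never a valid texture name */
    size_t slot = (name * 2654435761u) & (HANDLE_CACHE_CAPACITY - 1u);
    while (cache->names[slot] != 0 && cache->names[slot] != name)
        slot = (slot + 1u) & (HANDLE_CACHE_CAPACITY - 1u);
    if (cache->names[slot] == 0) {
        /* Keep one slot free so the probe loop always terminates. */
        assert(cache->count < HANDLE_CACHE_CAPACITY - 1u);
        cache->names[slot] = name;
        cache->handles[slot] = make(name);
        cache->count++;
    }
    return cache->handles[slot];
}

/* Splits the 64-bit handle into the two 32-bit lanes of the uvec2 attribute. */
static void instance_set_handle(InstanceRecord *rec, uint64_t handle)
{
    rec->texture_handle[0] = (uint32_t)(handle & 0xFFFFFFFFu);
    rec->texture_handle[1] = (uint32_t)(handle >> 32);
}

/* Stand-in for the GL handle calls; counts how often a handle is created. */
static int demo_make_calls = 0;
static uint64_t demo_make_handle(uint32_t gl_texture_name)
{
    demo_make_calls++;
    return ((uint64_t)gl_texture_name << 32) | 1u;
}
```

The point of the sketch is that handle creation happens once per texture name and every later lookup is a pure cache hit, which is the behavior the lazy option in the brainstorm would need. An eager policy would call the same creation path at upload time instead.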

### The new shader

Reuse WB's `StaticObjectModern.vert` / `StaticObjectModern.frag` as a template. Read those files cold. They already do bindless + the instance-data layout. Adapt to acdream's `mesh_instanced.vert/frag` conventions:

- Keep the `uViewProjection` uniform, lighting UBO at binding=1, fog uniforms.
- Add `#version 430 core` + `#extension GL_ARB_bindless_texture : require`.
- Replace `uniform sampler2D uDiffuse` with a `uvec2` per-instance vertex attribute (location 7) → reconstruct sampler in vertex shader OR pass through to fragment via flat varying.
- Drop `uTranslucencyKind` uniform, OR keep it (still set per-pass; multi-draw indirect doesn't break uniforms, the only constraint is on state that varies per draw).

### Translucency

Multi-draw indirect can't change blend state between the draws of a single call. Solution: **still use two passes** (opaque + translucent), but within translucent keep the per-blendfunc sub-passes (additive, alpha-blend, inv-alpha). Three sub-passes within translucent. Each sub-pass = one `glMultiDrawElementsIndirect` over its filtered groups.

Or: if perf allows, fold all four blend modes into the shader via per-instance blendmode int, sort all translucent groups by blendmode in the indirect buffer, switch blend state at sub-pass boundaries. Brainstorm decides the cleanest pattern.

---

## Files to read before brainstorming

In rough order:

1. **N.4 plan + spec**: `docs/superpowers/plans/2026-05-08-phase-n4-rendering-foundation.md` (status: Final). Adjustments 7-10 capture the gotchas. Spec at `docs/superpowers/specs/2026-05-08-phase-n4-rendering-foundation-design.md`.
2. **N.4 dispatcher source**: `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`. This is what you're modifying. Read end-to-end.
3. **WB's modern rendering shaders**: `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Shaders/StaticObjectModern.vert` + `StaticObjectModern.frag`. The template you're adapting from.
4. **WB's `ObjectMeshManager.UploadGfxObjMeshData`**: lines ~1654-1780 of `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectMeshManager.cs`. Shows how WB sets up the modern path's VBO/IBO/VAO. Especially note how it patches in instance attribute slots (locations 3-6) on the global VAO and configures location 7+ for bindless handles.
5. **WB's `ObjectRenderBatch`**: same file, lines ~166-184. Note the `BindlessTextureHandle` field, already populated when `_useModernRendering` is on.
6. **Our `TextureCache`**: `src/AcDream.App/Rendering/TextureCache.cs`. Three composite caches: by surface id, by surface+origTex, by surface+origTex+palette. N.5 adds parallel bindless-handle caches.
7. **CLAUDE.md "WB integration cribs"** section. Lines ~28-80. The three gotchas + the integration architecture in plain language.
8. **Memory: `project_phase_n4_state.md`**: same content from a different angle. Reading both helps lock in the gotchas.

---

## Brainstorm questions

These are the questions to resolve in the brainstorm step. Don't prejudge them; bring them to the user with options + recommendation:

1. **Instance attribute layout.** Match WB's `InstanceData` struct (80 bytes including CellId/Flags fields we don't use) for global VAO compatibility, or define a minimal acdream-specific version (mat4 + handle = 72 bytes padded to 80)?
2. **Bindless handle generation strategy.**
   - At texture upload time? (Eager: every texture that lands in `TextureCache` gets a handle, at the cost of per-texture handle and residency state.)
   - On first draw lookup? (Lazy: cache fills as the scene exercises content. Possible first-use stall.)
   - At spawn time via the spawn adapter? (Tied to lifecycle. Cleanest but requires touching the spawn path.)
3. **Translucent pass structure.** Three sub-indirect-draws (one per blend mode) or a single sorted indirect buffer with per-instance blend mode + state-flip at sub-pass boundaries?
   Or: just iterate per-group like N.4 for translucent only (translucent groups are a small fraction of total)?
4. **Persistent-mapped indirect + instance buffers.** Use `GL_ARB_buffer_storage` + `GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT`? Triple-buffered ring + sync object? Or stick with `glBufferData` (still one upload per frame, just larger)? Persistent mapping is an estimated ~2-5% per-frame win in our context but adds buffer-management complexity.
5. **Shader unification.** Keep `mesh_instanced` for legacy + add `mesh_indirect` for modern, or replace `mesh_instanced` entirely? Replacement requires the legacy `InstancedMeshRenderer` (escape hatch under `ACDREAM_USE_WB_FOUNDATION=0`) to also use the new shader, which probably doesn't matter if we delete legacy in N.6 anyway. Brainstorm.
6. **Conformance test strategy.** N.4 used visual verification at Holtburg as the gate. N.5's gate is "no visual regression vs N.4 AND measurable CPU win." How do we measure CPU? `[WB-DIAG]` counters give draw count + group count; we need frame-time counters too. Add to the dispatcher? Use a profiler?
7. **Per-instance entity bindless.** `TextureCache.GetOrUpload*` returns a GL name. The dispatcher (or `TextureCache` itself) needs to convert that to a bindless handle. Design questions:
   - Where does the conversion happen?
   - When is the texture made resident? (Residency is global state; too many resident textures hits driver limits.)
   - What about palette/surface overrides: same caching key as the name, just a parallel handle dictionary?
8. **Escape hatch.** N.4 keeps `ACDREAM_USE_WB_FOUNDATION=0` as a fallback. N.5 needs to decide: does the new shader REPLACE the N.4 dispatcher's draw path (so flag-on means N.5 modern path, flag-off means legacy `InstancedMeshRenderer`)? Or do we add a separate flag (`ACDREAM_USE_MODERN_DRAW`) so users can toggle N.4 vs N.5 vs legacy independently? Three-way flag is more complex but useful for A/B during rollout.
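
To make question 6 concrete, one of the options ("add to the dispatcher") could look roughly like the rolling window below. This is only a sketch of that one option, not a recommendation over a profiler. `FrameTimeWindow`, `frame_window_push`, and the window size are hypothetical names and values; the caller is assumed to measure CPU microseconds around the dispatcher's draw call with whatever clock the host code already uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_WINDOW 120u  /* frames per reporting window; arbitrary choice */

/* Ring of the most recent CPU frame-time samples, in microseconds. */
typedef struct {
    uint64_t samples_us[FRAME_WINDOW];
    uint32_t next;    /* ring write index */
    uint32_t filled;  /* number of valid samples, at most FRAME_WINDOW */
} FrameTimeWindow;

/* Records one frame's CPU time, overwriting the oldest sample when full. */
static void frame_window_push(FrameTimeWindow *w, uint64_t cpu_us)
{
    assert(w != NULL);
    w->samples_us[w->next] = cpu_us;
    w->next = (w->next + 1u) % FRAME_WINDOW;
    if (w->filled < FRAME_WINDOW)
        w->filled++;
}

/* Mean over the valid samples; 0 when nothing has been recorded yet. */
static uint64_t frame_window_mean_us(const FrameTimeWindow *w)
{
    if (w->filled == 0)
        return 0;
    uint64_t sum = 0;
    for (uint32_t i = 0; i < w->filled; i++)
        sum += w->samples_us[i];
    return sum / w->filled;
}

/* Worst frame in the window, useful for spotting first-use stalls. */
static uint64_t frame_window_max_us(const FrameTimeWindow *w)
{
    uint64_t worst = 0;
    for (uint32_t i = 0; i < w->filled; i++)
        if (w->samples_us[i] > worst)
            worst = w->samples_us[i];
    return worst;
}
```

Reporting both mean and max alongside the existing `[WB-DIAG]` counters would let an N.4 vs N.5 comparison in the same scene show the average CPU delta as well as any stalls a lazy handle policy introduces.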

---

## Spec structure

After the brainstorm, the spec doc covers:

1. **Architecture diagram**: how `WbDrawDispatcher` changes shape. Where the indirect buffer lives. Where bindless handles flow from.
2. **Instance data layout**: exact struct, byte offsets, GL attribute pointer setup.
3. **TextureCache changes**: new methods, new cache, residency policy.
4. **Shader files**: name(s), version, extensions, in/out variables.
5. **Conformance tests**: what to write, what coverage to claim.
6. **Acceptance criteria**: visual identity to N.4 + measured CPU delta.
7. **Risks**: driver bugs in bindless / indirect, residency limits, shader compile issues on weird GPUs, the legacy escape hatch breaking.

Spec lives at: `docs/superpowers/specs/2026-05-XX-phase-n5-modern-rendering-design.md`.

## Plan structure

After the spec, the plan doc lays out the week-by-week task list. Match N.4's plan structure (living document, task checkboxes, commit SHAs appended, adjustments documented inline).

Plan lives at: `docs/superpowers/plans/2026-05-XX-phase-n5-modern-rendering.md`.

Suggested initial breakdown (brainstorm + spec will refine):

- **Week 1**: Plumbing: bindless handle generation in `TextureCache`, shader rewrite (compile + bind), instance-attrib layout updated to mat4+handle. Dispatcher still uses per-group draws but reads textures bindless. Validate: visual identical to N.4.
- **Week 2**: Indirect: build `DrawElementsIndirectCommand` buffer per frame, switch to `glMultiDrawElementsIndirect`. Three-pass translucent (or whatever brainstorm decides). Validate: visual identical, draw-call count drops to 2-4 per frame.
- **Week 3**: Polish + ship: persistent-mapped buffers if brainstorm voted yes, profiler/counters, visual verification, flag flip, plan finalization.

---

## Acceptance criteria for the whole phase

- Visual output identical to N.4 (no character regressions, no scenery missing, no z-fighting introduced)
- `[WB-DIAG]` shows `drawsIssued` ≤ ~5 per frame (down from N.4's few hundred)
- Frame time measurably lower in dense scenes (specify what scenes to test in the spec; probably Holtburg courtyard + Foundry interior)
- No new test failures (still 940/948 with the same 8 pre-existing failures, plus any new conformance tests passing)
- `ACDREAM_USE_WB_FOUNDATION=0` escape hatch still works
- Plan doc finalized, roadmap updated, memory captured if N.5 surfaces durable lessons (it almost certainly will; bindless + indirect both have well-known driver gotchas)

---

## What you'll be doing in the first 30 minutes

1. Read this handoff in full.
2. Read CLAUDE.md "WB integration cribs" section.
3. Read `WbDrawDispatcher.cs` end-to-end.
4. Skim WB's `StaticObjectModern.vert/frag` + `ObjectMeshManager.UploadGfxObjMeshData` to ground the reference.
5. Verify build is green: `dotnet build`.
6. Verify N.4 ship is intact: `dotnet test --filter "FullyQualifiedName~Wb|MatrixComposition"` should produce 60 passing tests, 0 failures.
7. Invoke the `superpowers:brainstorming` skill with the user. Walk through the 8 brainstorm questions above. Capture decisions in a spec.
8. Write the spec at the path above.
9. Write the plan at the path above.
10. Begin Week 1 implementation per the plan.

Don't skip the brainstorm. Multi-draw indirect + bindless have several real driver-compatibility / API-shape decisions that need user input, not "the agent makes a call and goes." This phase is structurally the same shape as N.4: brainstorm → spec → plan → tasks-with-checkboxes → commits-update-checkboxes → final SHIP commit.

---

## Things to NOT do

- **Don't delete the legacy `InstancedMeshRenderer`.** It's the N.4 escape hatch. N.6 retires it after N.5 is proven default-on.
- **Don't fork WB.** N.4 deliberately avoided fork patches by using the side-table pattern (`AcSurfaceMetadataTable`).
  Stay on that path. If you need data WB doesn't expose, add a side-table or decode it yourself from dats.
- **Don't try to make per-instance entities use WB's `TextureAtlasManager`.** That's N.6+ territory. acdream's `TextureCache` owns palette/surface overrides because WB's atlas is keyed by `(surfaceId, paletteId, stippling, isSolid)` and our overrides don't fit cleanly. Bindless handles let us escape that mismatch: handles for both atlas-tier AND per-instance-tier textures, no atlas adoption needed.
- **Don't skip visual verification.** N.4 surfaced three bugs at visual verification that no test caught. Don't trust "build green + tests pass"; exercise the rendering path with the local ACE server.
- **Don't extend the phase scope.** N.5 is bindless + indirect on the existing rendering pipeline. Texture array atlas, GPU-side culling, terrain wiring: all of those are subsequent phases. If the brainstorm tries to expand, push back.

---

## Reference: the N.4 dispatcher flow you're modifying

```
Draw(camera, landblockEntries, frustum, ...) {
  // Phase 1: walk entities, build groups
  foreach (entity, meshRef, batch) { cull, classify into _groups[GroupKey] }

  // Phase 2: lay matrices contiguously
  // Phase 3: glBufferData(_instanceVbo, allMatrices)
  // Phase 4: bind global VAO once

  // Phase 5: opaque pass (sorted)
  foreach (group in _opaqueDraws) {
    glBindTexture(group.handle)
    glBindBuffer(EBO, group.ibo)
    glDrawElementsInstancedBaseVertexBaseInstance(...)
  }

  // Phase 6: translucent pass
}
```

After N.5, Phases 5 and 6 collapse to:

```
glBindBuffer(DRAW_INDIRECT_BUFFER, _opaqueIndirect)
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, 0, opaqueGroups.Count, sizeof(DEIC))

glBindBuffer(DRAW_INDIRECT_BUFFER, _translucentIndirect)
// 3 sub-calls for translucent or 1 if shader-folded
glMultiDrawElementsIndirect(...)
```

That's the destination. Get there cleanly.

Good luck.
Holler at the user if any of the brainstorm questions feel genuinely ambiguous after reading the references — they care about this phase landing right and will engage on design questions.
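
For reference, the CPU side of the collapsed Phases 5 and 6 (filling one indirect command per group before the single GL call) might be sketched as follows. This is a hedged sketch, not the dispatcher's design: `DrawGroup` and `build_indirect_commands` are hypothetical names, and it assumes instance records are laid out group by group in the same order as the `groups` array, across both passes, so `baseInstance` is a running total. Only the five-field `DrawElementsIndirectCommand` layout is fixed by OpenGL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field order and sizes are dictated by glMultiDrawElementsIndirect. */
typedef struct {
    uint32_t count;          /* index count for this draw */
    uint32_t instanceCount;  /* number of instances */
    uint32_t firstIndex;     /* offset into the global IBO, in indices */
    int32_t  baseVertex;     /* offset into the global VBO, in vertices */
    uint32_t baseInstance;   /* first instance record for this draw */
} DrawElementsIndirectCommand;

static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "tightly packed: 5 x 4 bytes");

/* Hypothetical CPU-side group, mirroring the slice fields named above. */
typedef struct {
    uint32_t index_count;
    uint32_t first_index;
    int32_t  base_vertex;
    uint32_t instance_count;
    int      translucent;    /* 0 = opaque pass, nonzero = translucent pass */
} DrawGroup;

/* Writes one command per group that belongs to the requested pass and
 * returns how many were written. baseInstance advances for every group,
 * including skipped ones, because the instance buffer is assumed to hold
 * all groups back to back in array order. */
static size_t build_indirect_commands(const DrawGroup *groups, size_t group_count,
                                      int translucent_pass,
                                      DrawElementsIndirectCommand *out,
                                      size_t out_capacity)
{
    size_t written = 0;
    uint32_t first_instance = 0;
    for (size_t i = 0; i < group_count; i++) {
        const DrawGroup *g = &groups[i];
        if ((g->translucent != 0) == (translucent_pass != 0)) {
            assert(written < out_capacity);
            out[written++] = (DrawElementsIndirectCommand){
                .count         = g->index_count,
                .instanceCount = g->instance_count,
                .firstIndex    = g->first_index,
                .baseVertex    = g->base_vertex,
                .baseInstance  = first_instance,
            };
        }
        first_instance += g->instance_count;
    }
    return written;
}
```

The returned count is what would be passed as `drawcount`, and `sizeof(DrawElementsIndirectCommand)` (20 bytes) as the stride. If the brainstorm picks per-blend-mode sub-passes, the same function could be called once per sub-pass with a richer filter than the single `translucent` flag used here.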