Erik dd5ca3d2b2 docs(N.5): cold-start handoff for next session

Detailed briefing for the next agent picking up Phase N.5 (Modern
Rendering Path: bindless textures + glMultiDrawElementsIndirect on
N.4's foundation). Covers:

- Where N.4 left things (commits, what works, gotchas inherited)
- The two-feature pairing (why bindless + indirect together)
- Files to read first (WB shaders, our dispatcher, CLAUDE.md cribs)
- 8 brainstorm questions to resolve before spec
- Spec + plan structure (matching N.4's pattern)
- Acceptance criteria
- Things to explicitly NOT do

Sized for a fresh session to pick up cold without spelunking through
months of session history.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-05-08 18:05:36 +02:00


Phase N.5 — Modern Rendering Path — Cold-Start Handoff

Created: 2026-05-08, immediately after N.4 ship. Audience: the next agent picking up rendering perf work. Purpose: give you everything you need to start N.5 cold, without spelunking through five months of session history.

TL;DR

N.4 just shipped: WB's ObjectMeshManager is now acdream's production mesh pipeline, and WbDrawDispatcher is the production draw path. It works: Holtburg renders correctly, and FPS is substantially better than the naïve dual-pipeline state we hit during week 4 verification. But it still makes per-group state changes (glBindTexture, glBindBuffer for the IBO, glDrawElementsInstancedBaseVertexBaseInstance per group) and does a fresh glBufferData upload per frame.

N.5's job: lift the dispatcher onto WB's modern rendering primitives that we're already paying GPU-feature-detection cost for. Two big wins, paired:

Bindless textures (GL_ARB_bindless_texture) — WB already populates ObjectRenderBatch.BindlessTextureHandle. Switch our shader to read texture handles from a per-instance attribute (uvec2 → sampler2D via the bindless extension). Eliminates 100% of glBindTexture calls.
Multi-draw indirect (glMultiDrawElementsIndirect) — build a buffer of DrawElementsIndirectCommand structs (one per group), upload once, fire ONE glMultiDrawElementsIndirect call per pass. The driver pulls everything from the indirect buffer.

Together they target a 2-5× CPU win on draw-heavy scenes (Holtburg courtyard, Foundry, dense dungeons). They're packaged together because both are "modern path" extensions we already gate on, both require the same shader rewrite, and they pair naturally: without bindless, multi-draw indirect buys almost no CPU win, because the per-group glBindTexture calls would still force one draw call per group.

Estimated scope: 2-3 weeks. Plan + spec to be written by the brainstorm + spec steps below.

Where N.4 left things

Branch state

If this handoff is being read on main after merging the N.4 worktree: N.4 commits land at the head of main. The relevant final commits:

c445364 — N.4 SHIP (flag default-on, plan final, roadmap, memory)
573526d — perf pass 1-4 (drop dead lookup, sort, cull, hash memo)
7b41efc — FirstIndex/BaseVertex + Issue #47 + grouped instanced
943652d — load triggers + batch.Key.SurfaceId source
01cff41 — Tasks 22+23 (WbDrawDispatcher + side-table)

If the worktree branch (claude/tender-mcclintock-a16839) hasn't been merged yet, that's where the work is. Verify with git log --oneline.

What works in N.4

ACDREAM_USE_WB_FOUNDATION=1 is default-on. WB's ObjectMeshManager loads, decodes, and uploads every entity mesh. Our existing TextureCache decodes textures (palette-aware, per-instance overrides via GetOrUploadWithPaletteOverride).
WbDrawDispatcher.Draw:
- Walks visible entities (per-landblock AABB cull + per-entity AABB cull + portal visibility)
- Buckets every (entity × meshRef × batch) tuple by GroupKey(Ibo, FirstIndex, BaseVertex, IndexCount, TextureHandle, Translucency)
- Single glBufferData upload of all matrices for the frame
- Per group: glActiveTexture(0) + glBindTexture(2D, handle) + glBindBuffer(EBO, ibo) + glDrawElementsInstancedBaseVertexBaseInstance(..., FirstInstance)
- Two passes: opaque (front-to-back sorted) + translucent
940/948 tests pass (8 pre-existing failures unrelated to rendering).
Visual verification at Holtburg passed: scenery + characters render correctly with full close-detail geometry (Issue #47 preserved).

What N.5 inherits

These are levers N.5 will pull on:

WB's modern rendering is already active. OpenGLGraphicsDevice detected GL 4.3 + bindless on first run; WB's _useModernRendering is true; every mesh lives in WB's single GlobalMeshBuffer (one VAO, one VBO, one IBO).
Bindless handles are already populated. ObjectRenderBatch.BindlessTextureHandle is non-zero for batches WB owns the texture for. (See "Per-instance entity texture handles" below for entities with palette overrides; those use acdream's TextureCache, which doesn't expose bindless handles yet.)
The instance VBO is acdream-owned (WbDrawDispatcher._instanceVbo) with locations 3-6 patched onto WB's global VAO. Stride 64 bytes (one mat4). N.5 expands this to (mat4 + uvec2 handle) = 80 bytes.

Three load-bearing WB API gotchas N.4 surfaced

These bit us hard during Task 26 visual verification. Documented in CLAUDE.md "WB integration cribs" + plan adjustments 7-9 + memory/project_phase_n4_state.md. Re-stating here because they reshape the design space:

ObjectMeshManager.IncrementRefCount(id) is NOT lifecycle-aware. It only bumps a usage counter. Mesh loading is fired separately via PrepareMeshDataAsync(id, isSetup). The result auto-enqueues to _stagedMeshData (line 510 of ObjectMeshManager.cs); our existing WbMeshAdapter.Tick() drains it. WbMeshAdapter.IncrementRefCount already calls PrepareMeshDataAsync. N.5 doesn't need to change this — just don't break it.
ObjectRenderBatch.SurfaceId is unset. WB constructs batches with Key = batch.Key (a TextureAtlasManager.TextureKey struct that has a SurfaceId field) but never populates the top-level SurfaceId property. Read batch.Key.SurfaceId. N.5 keeps this pattern.
WB's modern rendering packs every mesh into ONE global VAO/VBO/IBO. Each batch's IBO field points to the global IBO; the batch's actual slice is identified by FirstIndex (offset into IBO, in indices) and BaseVertex (offset into VBO, in vertices). N.4's draw uses glDrawElementsInstancedBaseVertexBaseInstance with those offsets. N.5's DrawElementsIndirectCommand per-group record will carry firstIndex + baseVertex for the same reason.
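
To make the FirstIndex / BaseVertex addressing concrete, here is a small Python sketch with made-up buffer contents. The arrays and the `resolve_slice` helper are illustrative only, not WB code. It mirrors how a base-vertex draw resolves indices: read `index_count` indices starting at `first_index`, then add `base_vertex` to each one before fetching from the global vertex buffer.

```python
def resolve_slice(global_ibo, first_index, index_count, base_vertex):
    """Return the global-VBO vertex numbers one batch's draw would fetch."""
    window = global_ibo[first_index:first_index + index_count]
    return [i + base_vertex for i in window]


# Two hypothetical meshes packed back to back in one global buffer.
# Mesh A: 3 vertices, 3 indices. Mesh B: 4 vertices, 6 indices.
# Each mesh keeps its own zero-based indices; BaseVertex relocates them.
global_ibo = [0, 1, 2,   0, 1, 2, 2, 3, 0]

mesh_a = resolve_slice(global_ibo, first_index=0, index_count=3, base_vertex=0)
mesh_b = resolve_slice(global_ibo, first_index=3, index_count=6, base_vertex=3)
print(mesh_a)  # [0, 1, 2]
print(mesh_b)  # [3, 4, 5, 5, 6, 3]
```

Dropping either offset would make mesh B silently draw mesh A's vertices, which is why both fields have to travel with every group.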

What N.5 is — technical detail

The two-feature pairing

Bindless textures (GL_ARB_bindless_texture):

Each texture handle is a 64-bit integer (uvec2 in GLSL).
Shader declares layout(bindless_sampler) uniform sampler2D ... or receives the handle as a per-instance vertex attribute (uvec2).
No glBindTexture needed at draw time — the handle IS the binding.
Handle generation: glGetTextureHandleARB(textureId) followed by glMakeTextureHandleResidentARB(handle). The handle must be made resident before any shader samples through it; sampling a non-resident handle is undefined behavior and can fault or crash the GPU.
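
As a rough illustration of the "64-bit integer as uvec2" point, the Python sketch below splits a handle into two 32-bit words and packs them the way they would sit in an attribute slot. The handle value is made up, the helper names are hypothetical, and the sketch assumes the usual convention that the first component (.x) carries the low 32 bits on a little-endian host.

```python
import struct


def split_handle(handle):
    """Split a 64-bit bindless handle into (low, high) 32-bit words."""
    return handle & 0xFFFFFFFF, (handle >> 32) & 0xFFFFFFFF


def pack_handle(handle):
    """Bytes of the handle laid out as two little-endian 32-bit uints."""
    low, high = split_handle(handle)
    return struct.pack("<2I", low, high)


handle = 0x0000000123456789  # made-up value, not a real GL handle
low, high = split_handle(handle)
print(hex(low), hex(high))  # 0x23456789 0x1
```

On a little-endian machine the packed bytes are identical to the raw 64-bit value, so the CPU side can write the handle as a single 64-bit field and let the shader view it as a uvec2.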

Multi-draw indirect (glMultiDrawElementsIndirect):

Indirect command struct layout (must match DrawElementsIndirectCommand):

struct {
    uint count;          // index count for this draw
    uint instanceCount;  // number of instances
    uint firstIndex;     // offset into IBO, in indices
    int  baseVertex;     // vertex offset into VBO
    uint baseInstance;   // first instance ID (offsets per-instance attribs)
};

Build a buffer of N of these structs (one per group), upload once, fire one GL call: glMultiDrawElementsIndirect(mode, indexType, ptr, drawcount, stride).
The driver issues all N draws in one shot. Effectively zero CPU overhead per draw beyond uploading the indirect buffer.
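
A minimal Python sketch of the byte layout described above: five tightly packed 32-bit fields, 20 bytes per command, concatenated into one buffer. The group dictionaries and the `pack_commands` helper are hypothetical stand-ins for whatever the dispatcher ends up using.

```python
import struct

# count, instanceCount, firstIndex, baseVertex (signed), baseInstance
DEIC = struct.Struct("<IIIiI")


def pack_commands(groups):
    """Concatenate one 20-byte indirect command per group."""
    return b"".join(
        DEIC.pack(g["count"], g["instance_count"], g["first_index"],
                  g["base_vertex"], g["base_instance"])
        for g in groups
    )


# Made-up groups for illustration.
groups = [
    {"count": 12, "instance_count": 3, "first_index": 0,
     "base_vertex": 0, "base_instance": 0},
    {"count": 36, "instance_count": 1, "first_index": 12,
     "base_vertex": 8, "base_instance": 3},
]
buffer = pack_commands(groups)
print(DEIC.size, len(buffer))  # 20 40
```

With this tightly packed layout the stride passed to the draw call would be 20 bytes (or 0, which GL treats as tightly packed).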

Why pair them. Multi-draw indirect doesn't let you change uniforms or texture bindings between draws. So if textures are bound via glBindTexture per group, every texture change would split the batch into its own draw call with CPU-side setup before it, defeating the purpose. Bindless removes that constraint by encoding the texture handle as per-instance data the shader reads directly. With both, the modern render loop becomes:

1. Upload instance buffer (mat4 + uvec2 handle, per-instance) — once per frame
2. Upload indirect command buffer (one DEIC per group) — once per frame
3. glBindVertexArray(globalVAO) — once
4. glMultiDrawElementsIndirect(...) — ONCE per pass

That's it. No per-group state changes.
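
Steps 1 and 2 hinge on one piece of bookkeeping: instances must be laid out contiguously per group, and each command's baseInstance must point at the start of its group's run. The Python sketch below shows that bookkeeping only. `build_frame`, the string group keys, and the string "matrices" are illustrative placeholders, not the dispatcher's real types.

```python
from collections import defaultdict


def build_frame(draws):
    """draws: iterable of (group_key, instance_payload) pairs.

    Returns (instance_rows, commands) where commands is a list of
    (group_key, instance_count, base_instance) tuples.
    """
    groups = defaultdict(list)
    for key, payload in draws:
        groups[key].append(payload)

    instance_rows = []
    commands = []
    for key, payloads in groups.items():
        # base_instance is the running total of instances emitted so far.
        commands.append((key, len(payloads), len(instance_rows)))
        instance_rows.extend(payloads)
    return instance_rows, commands


draws = [("tree", "m0"), ("rock", "m1"), ("tree", "m2")]
rows, commands = build_frame(draws)
print(rows)      # ['m0', 'm2', 'm1']
print(commands)  # [('tree', 2, 0), ('rock', 1, 2)]
```

The `rows` list corresponds to the instance buffer upload and `commands` to the indirect buffer upload; everything after that is the two bind/draw calls.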

Instance attribute layout

Currently (N.4): location 3-6 = mat4 model matrix (16 floats = 64 bytes).

N.5 (proposed): location 3-6 = mat4 + location 7 = uvec2 bindless handle = 16 floats + 2 uints = 72 bytes (16-aligned to 80 bytes per WB's InstanceData precedent).

Or as an explicitly padded, 16-byte-aligned struct:

struct InstanceData {
    mat4 transform;        // locations 3-6
    uvec2 textureHandle;   // location 7
    uvec2 _pad;            // padding to 80
};

Brainstorm should decide if we copy WB's InstanceData struct (Pack=16, 80 bytes including CellId/Flags fields we don't use) or define our own minimal version. The 80-byte stride matches WB's so global VAO state configured by WB stays compatible if the legacy WB draw path ever runs.
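
For the minimal variant, the byte offsets work out as in the Python sketch below. This is an assumption-laden illustration of the proposed 80-byte record (mat4, then handle, then padding), not WB's InstanceData, and the `ATTRIBS` table and `pack_instance` helper are hypothetical.

```python
import struct

# mat4 (16 floats) + uvec2 handle + uvec2 padding, tightly packed
INSTANCE = struct.Struct("<16f2I2I")

# (attribute location, component count, byte offset within the record)
ATTRIBS = [(3 + col, 4, 16 * col) for col in range(4)] + [(7, 2, 64)]


def pack_instance(matrix16, handle):
    """Pack one instance record: 16 floats, then the handle, then zero pad."""
    low, high = handle & 0xFFFFFFFF, (handle >> 32) & 0xFFFFFFFF
    return INSTANCE.pack(*matrix16, low, high, 0, 0)


identity = [1.0, 0.0, 0.0, 0.0,
            0.0, 1.0, 0.0, 0.0,
            0.0, 0.0, 1.0, 0.0,
            0.0, 0.0, 0.0, 1.0]
record = pack_instance(identity, 0x0000000123456789)  # made-up handle
print(INSTANCE.size)  # 80
print(ATTRIBS)
```

One detail worth carrying into the spec: location 7 is an integer attribute, so in GL it is configured with glVertexAttribIPointer rather than glVertexAttribPointer, and all five locations need a divisor of 1 to advance per instance.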

Per-instance entity texture handles

Here's the wrinkle. N.4 uses WbDrawDispatcher.ResolveTexture to map each (entity, batch) to a GL texture name:

Tree (no overrides): _textures.GetOrUpload(surfaceId) → 2D texture name
NPC with palette override: _textures.GetOrUploadWithPaletteOverride(...) → composite-cached 2D texture name
Anything with surface override: _textures.GetOrUploadWithOrigTextureOverride(...) → composite-cached 2D texture name

Those are all GLuint 32-bit GL texture names, not bindless handles. N.5 needs TextureCache to publish bindless handles for everything it owns, not just WB-owned textures.

Implementation sketch:

TextureCache adds a parallel cache keyed identically but storing 64-bit bindless handles. On first request, generate via glGetTextureHandleARB(textureId) + make resident.
New API: GetBindlessHandle(uint surfaceId, ...) returns the handle.
Or: change every GetOrUpload* method to return both the GL name and the bindless handle (or just the handle; let GL name fall out if anyone needs it later).

WB's ObjectRenderBatch.BindlessTextureHandle covers the atlas-tier case. For per-instance entities, we use TextureCache's handle.
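
The "parallel cache" option can be sketched as follows. This is a language-neutral illustration in Python, not TextureCache's actual shape: `BindlessHandleCache` is a hypothetical name, and the two callables are stand-ins for glGetTextureHandleARB and glMakeTextureHandleResidentARB so the logic can run without a GL context.

```python
class BindlessHandleCache:
    """Lazily maps a cache key to a resident bindless handle."""

    def __init__(self, get_handle, make_resident):
        self._get_handle = get_handle        # stand-in for glGetTextureHandleARB
        self._make_resident = make_resident  # stand-in for glMakeTextureHandleResidentARB
        self._handles = {}

    def get(self, key, texture_name):
        handle = self._handles.get(key)
        if handle is None:
            # First request: create the handle and make it resident exactly once.
            handle = self._get_handle(texture_name)
            self._make_resident(handle)
            self._handles[key] = handle
        return handle


# Fake GL entry points for the demo.
resident = []
cache = BindlessHandleCache(lambda name: 0x1000 + name, resident.append)

first = cache.get(("surface", 7, "palette", 3), texture_name=42)
second = cache.get(("surface", 7, "palette", 3), texture_name=42)
print(first == second, len(resident))  # True 1
```

Keying the handle dictionary with the same composite key as the texture-name cache keeps palette and surface overrides working without any new key design.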

The new shader

Reuse WB's StaticObjectModern.vert / StaticObjectModern.frag as a template. Read those files cold. They already do bindless + the instance-data layout. Adapt to acdream's mesh_instanced.vert/frag conventions:

Keep the uViewProjection uniform, lighting UBO at binding=1, fog uniforms.
Add #version 430 core + #extension GL_ARB_bindless_texture : require.
Replace uniform sampler2D uDiffuse with a per-instance uvec2 vertex attribute (location 7), then either reconstruct the sampler in the vertex shader or pass the uvec2 through to the fragment shader via a flat varying.
Drop uTranslucencyKind uniform, OR keep it (still set per-pass — multi-draw indirect doesn't break uniforms; only state that varies per-draw is the constraint).

Translucency

Multi-draw indirect can't change blend state mid-draw. Solution: still use two passes (opaque + translucent), but within translucent keep the per-blendfunc sub-passes (additive, alpha-blend, inv-alpha). Three sub-passes within translucent. Each sub-pass = one glMultiDrawElementsIndirect over its filtered groups.

Or: if perf allows, fold all four blend modes into the shader via per-instance blendmode int, sort all translucent groups by blendmode in the indirect buffer, switch blend state at sub-pass boundaries. Brainstorm decides the cleanest pattern.
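
Either variant needs the translucent commands grouped by blend mode so each sub-pass covers one contiguous range of the indirect buffer. A hedged Python sketch of that bookkeeping follows; the mode names, their order, and `subpass_ranges` are assumptions for illustration, and the sketch ignores any depth ordering the real pass may need.

```python
from itertools import groupby

# Hypothetical sub-pass order; the real order is a brainstorm decision.
BLEND_ORDER = {"additive": 0, "alpha": 1, "inv_alpha": 2}


def subpass_ranges(groups):
    """groups: list of (blend_mode, group_id).

    Returns (ordered_groups, ranges); each range is
    (blend_mode, first_command, command_count).
    """
    ordered = sorted(groups, key=lambda g: BLEND_ORDER[g[0]])  # stable sort
    ranges = []
    start = 0
    for mode, run in groupby(ordered, key=lambda g: g[0]):
        count = len(list(run))
        ranges.append((mode, start, count))
        start += count
    return ordered, ranges


groups = [("alpha", "a"), ("additive", "b"), ("alpha", "c"), ("inv_alpha", "d")]
ordered, ranges = subpass_ranges(groups)
print(ranges)  # [('additive', 0, 1), ('alpha', 1, 2), ('inv_alpha', 3, 1)]
```

For each range, the sub-pass would set its blend state and issue one multi-draw call starting at byte offset first_command × 20 with drawcount command_count.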

Files to read before brainstorming

In rough order:

N.4 plan + spec — docs/superpowers/plans/2026-05-08-phase-n4-rendering-foundation.md (status: Final). Adjustments 7-10 capture the gotchas. Spec at docs/superpowers/specs/2026-05-08-phase-n4-rendering-foundation-design.md.
N.4 dispatcher source — src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs. This is what you're modifying. Read end-to-end.
WB's modern rendering shaders: references/WorldBuilder/Chorizite.OpenGLSDLBackend/Shaders/StaticObjectModern.vert + StaticObjectModern.frag. The template you're adapting from.
WB's ObjectMeshManager.UploadGfxObjMeshData — lines ~1654-1780 of references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectMeshManager.cs. Shows how WB sets up the modern path's VBO/IBO/VAO. Especially note how it patches in instance attribute slots (locations 3-6) on the global VAO and configures location 7+ for bindless handles.
WB's ObjectRenderBatch — same file, lines ~166-184. Note the BindlessTextureHandle field — already populated when _useModernRendering is on.
Our TextureCache — src/AcDream.App/Rendering/TextureCache.cs. Three composite caches: by surface id, by surface+origTex, by surface+origTex+palette. N.5 adds parallel bindless-handle caches.
CLAUDE.md "WB integration cribs" section. Lines ~28-80. The three gotchas + the integration architecture in plain language.
Memory: project_phase_n4_state.md — same content from a different angle. Reading both helps lock in the gotchas.

Brainstorm questions

These are the questions to resolve in the brainstorm step. Don't prejudge them — bring them to the user with options + recommendation:

Instance attribute layout. Match WB's InstanceData struct (80 bytes including CellId/Flags fields we don't use) for global VAO compatibility, or define a minimal acdream-specific version (mat4 + handle = ~72 bytes padded to 80)?
Bindless handle generation strategy.
- At texture upload time? (Eager: every texture that lands in TextureCache gets a handle. Memory cost: a little driver state per texture, even for textures that are never drawn.)
- On first draw lookup? (Lazy — cache fills as scene exercises content. Possible first-use stall.)
- At spawn time via the spawn adapter? (Tied to lifecycle. Cleanest but requires touching the spawn path.)
Translucent pass structure. Three sub-indirect-draws (one per blend mode) or a single sorted indirect buffer with per-instance blend mode + state-flip at sub-pass boundaries? Or: just iterate per-group like N.4 for translucent only (translucent groups are a small fraction of total)?
Persistent-mapped indirect + instance buffers. Use GL_ARB_buffer_storage + MAP_PERSISTENT_BIT | MAP_COHERENT_BIT? Triple-buffered ring + sync object? Or stick with glBufferData (still one upload per frame, just larger)? Persistent mapping is ~2-5% per-frame win in our context but adds buffer-management complexity.
Shader unification. Keep mesh_instanced for legacy + add mesh_indirect for modern, or replace mesh_instanced entirely? Replacement requires the legacy InstancedMeshRenderer (escape hatch under ACDREAM_USE_WB_FOUNDATION=0) to also use the new shader, which... probably doesn't matter if we delete legacy in N.6 anyway. Brainstorm.
Conformance test strategy. N.4 used visual verification at Holtburg as the gate. N.5's gate is "no visual regression vs N.4 AND measurable CPU win." How do we measure CPU? [WB-DIAG] counters give draw count + group count; we need frame-time counters too. Add to the dispatcher? Use a profiler?
Per-instance entity bindless. TextureCache.GetOrUpload* returns a GL name. The dispatcher (or TextureCache itself) needs to convert that to a bindless handle. Design questions:
- Where does the conversion happen?
- When is the texture made resident? (Residency is global state; too many resident textures hit driver limits.)
- What about palette/surface overrides — same caching key as the name, just a parallel handle dictionary?
Escape hatch. N.4 keeps ACDREAM_USE_WB_FOUNDATION=0 as a fallback. N.5 needs to decide: does the new shader REPLACE the N.4 dispatcher's draw path (so flag-on means N.5 modern path, flag-off means legacy InstancedMeshRenderer)? Or do we add a separate flag (ACDREAM_USE_MODERN_DRAW) so users can toggle N.4 vs N.5 vs legacy independently? Three-way flag is more complex but useful for A/B during rollout.
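
To ground the persistent-mapping question without prejudging it, here is what the "triple-buffered ring" option amounts to on the CPU side, sketched in Python. `FrameRing` and its fields are hypothetical. The sketch shows only the offset rotation; real code would also wait on a fence (glClientWaitSync) before reusing a section and place a new one (glFenceSync) after submitting the draws that read it.

```python
class FrameRing:
    """Rotates through N equally sized sections of one mapped buffer."""

    def __init__(self, section_bytes, sections=3):
        self.section_bytes = section_bytes
        self.sections = sections
        self.frame = 0

    def begin_frame(self):
        """Return (section index, byte offset) to write into this frame."""
        index = self.frame % self.sections
        self.frame += 1
        return index, index * self.section_bytes


ring = FrameRing(section_bytes=1024)
offsets = [ring.begin_frame()[1] for _ in range(4)]
print(offsets)  # [0, 1024, 2048, 0]
```

The complexity cost mentioned in the question is mostly the fence handling and sizing each section for the worst-case frame, not the rotation itself.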

Spec structure

After the brainstorm, the spec doc covers:

Architecture diagram — how WbDrawDispatcher changes shape. Where the indirect buffer lives. Where bindless handles flow from.
Instance data layout — exact struct, byte offsets, GL attribute pointer setup.
TextureCache changes — new methods, new cache, residency policy.
Shader files — name(s), version, extensions, in/out variables.
Conformance tests — what to write, what coverage to claim.
Acceptance criteria — visual identity to N.4 + measured CPU delta.
Risks — driver bugs in bindless / indirect, residency limits, shader compile issues on weird GPUs, the legacy escape hatch breaking.

Spec lives at: docs/superpowers/specs/2026-05-XX-phase-n5-modern-rendering-design.md.

Plan structure

After the spec, the plan doc lays out the week-by-week task list. Match N.4's plan structure (living document, task checkboxes, commit SHAs appended, adjustments documented inline). Plan lives at: docs/superpowers/plans/2026-05-XX-phase-n5-modern-rendering.md.

Suggested initial breakdown (brainstorm + spec will refine):

Week 1 — Plumbing: bindless handle generation in TextureCache, shader rewrite (compile + bind), instance-attrib layout updated to mat4+handle. Dispatcher still uses per-group draws but reads textures bindless. Validate: visual identical to N.4.
Week 2 — Indirect: build DrawElementsIndirectCommand buffer per frame, switch to glMultiDrawElementsIndirect. Three-pass translucent (or whatever brainstorm decides). Validate: visual identical, draw-call count drops to 2-4 per frame.
Week 3 — Polish + ship: persistent-mapped buffers if brainstorm voted yes, profiler/counters, visual verification, flag flip, plan finalization.

Acceptance criteria for the whole phase

Visual output identical to N.4 (no character regressions, no scenery missing, no z-fighting introduced)
[WB-DIAG] shows drawsIssued ≤ ~5 per frame (down from N.4's few hundred)
Frame time measurably lower in dense scenes (specify what scenes to test in the spec — probably Holtburg courtyard + Foundry interior)
All tests still green (940/948 + any new conformance tests)
ACDREAM_USE_WB_FOUNDATION=0 escape hatch still works
Plan doc finalized, roadmap updated, memory captured if N.5 surfaces durable lessons (it almost certainly will: bindless + indirect both have well-known driver gotchas)

What you'll be doing in the first 30 minutes

Read this handoff in full.
Read CLAUDE.md "WB integration cribs" section.
Read WbDrawDispatcher.cs end-to-end.
Skim WB's StaticObjectModern.vert/frag + ObjectMeshManager.UploadGfxObjMeshData to ground the reference.
Verify build is green: dotnet build.
Verify N.4 ship is intact: dotnet test --filter "FullyQualifiedName~Wb|MatrixComposition" should produce 60 passing tests, 0 failures.
Invoke the superpowers:brainstorming skill with the user. Walk through the 8 brainstorm questions above. Capture decisions in a spec.
Write the spec at the path above.
Write the plan at the path above.
Begin Week 1 implementation per the plan.

Don't skip the brainstorm. Multi-draw indirect + bindless have several real driver-compatibility / API-shape decisions that need user input, not "the agent makes a call and goes." This phase is structurally the same shape as N.4 — brainstorm → spec → plan → tasks-with-checkboxes → commits-update-checkboxes → final SHIP commit.

Things to NOT do

Don't delete the legacy InstancedMeshRenderer. It's the N.4 escape hatch. N.6 retires it after N.5 is proven default-on.
Don't fork WB. N.4 deliberately avoided fork patches by using the side-table pattern (AcSurfaceMetadataTable). Stay on that path. If you need data WB doesn't expose, add a side-table or decode it yourself from dats.
Don't try to make per-instance entities use WB's TextureAtlasManager. That's N.6+ territory. acdream's TextureCache owns palette/surface overrides because WB's atlas is keyed by (surfaceId, paletteId, stippling, isSolid) and our overrides don't fit cleanly. Bindless handles let us escape that mismatch — handles for both atlas-tier AND per-instance-tier textures, no atlas adoption needed.
Don't skip visual verification. N.4 surfaced three bugs at visual verification that no test caught. Don't trust "build green + tests pass" — exercise the rendering path with the local ACE server.
Don't extend the phase scope. N.5 is bindless + indirect on the existing rendering pipeline. Texture array atlas, GPU-side culling, terrain wiring — all of those are subsequent phases. If the brainstorm tries to expand, push back.

Reference: the N.4 dispatcher flow you're modifying

Draw(camera, landblockEntries, frustum, ...) {
  // Phase 1: walk entities, build groups
  foreach (entity, meshRef, batch) {
    cull, classify into _groups[GroupKey]
  }

  // Phase 2: lay matrices contiguously
  // Phase 3: glBufferData(_instanceVbo, allMatrices)
  // Phase 4: bind global VAO once
  // Phase 5: opaque pass (sorted)
  foreach (group in _opaqueDraws) {
    glBindTexture(group.handle)
    glBindBuffer(EBO, group.ibo)
    glDrawElementsInstancedBaseVertexBaseInstance(...)
  }
  // Phase 6: translucent pass
}

After N.5, Phases 5 and 6 collapse to:

glBindBuffer(DRAW_INDIRECT_BUFFER, _opaqueIndirect)
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, 0, opaqueGroups.Count, sizeof(DEIC))
glBindBuffer(DRAW_INDIRECT_BUFFER, _translucentIndirect)
// 3 sub-calls for translucent or 1 if shader-folded
glMultiDrawElementsIndirect(...)

That's the destination. Get there cleanly.

Good luck. Holler at the user if any of the brainstorm questions feel genuinely ambiguous after reading the references — they care about this phase landing right and will engage on design questions.
