acdream/docs/research/2026-05-08-phase-n5-handoff.md

# Phase N.5 — Modern Rendering Path — Cold-Start Handoff

**Created:** 2026-05-08, immediately after N.4 ship.
**Audience:** the next agent picking up rendering perf work.
**Purpose:** give you everything you need to start N.5 cold, without
spelunking through five months of session history.

---

## TL;DR

N.4 just shipped: WB's `ObjectMeshManager` is now acdream's production
mesh pipeline, and `WbDrawDispatcher` is the production draw path. It
works (Holtburg renders correctly, FPS substantially improved over the
naïve dual-pipeline state we hit during week 4 verification), but it
still issues per-group GL calls (`glBindTexture`, `glBindBuffer`
for the IBO, one `glDrawElementsInstancedBaseVertexBaseInstance` per
group) plus a fresh `glBufferData` upload per frame.

**N.5's job: lift the dispatcher onto WB's modern rendering primitives
that we're already paying GPU-feature-detection cost for.** Two big
wins, paired:

1. **Bindless textures** (`GL_ARB_bindless_texture`) — WB already
   populates `ObjectRenderBatch.BindlessTextureHandle`. Switch our
   shader to read texture handles from a per-instance attribute
   (`uvec2` → `sampler2D` via the bindless extension). Eliminates
   100% of `glBindTexture` calls.
2. **Multi-draw indirect** (`glMultiDrawElementsIndirect`) — build a
   buffer of `DrawElementsIndirectCommand` structs (one per group),
   upload once, fire ONE `glMultiDrawElementsIndirect` call per pass.
   The driver pulls everything from the indirect buffer.

Together they target a 2-5× CPU win on draw-heavy scenes (Holtburg
courtyard, Foundry, dense dungeons). They're packaged together because
both are "modern path" extensions we already gate on, both require
the same shader rewrite, and they pair naturally: multi-draw indirect
yields no CPU win without bindless, because a per-group `glBindTexture`
would still force one draw call per group.

**Estimated scope: 2-3 weeks.** Plan + spec to be written by the
brainstorm + spec steps below.

---

## Where N.4 left things

### Branch state

If this handoff is being read on `main` after merging the N.4 worktree:
N.4 commits land at the head of main. The relevant final commits:

- `c445364` — N.4 SHIP (flag default-on, plan final, roadmap, memory)
- `573526d` — perf pass 1-4 (drop dead lookup, sort, cull, hash memo)
- `7b41efc` — FirstIndex/BaseVertex + Issue #47 + grouped instanced
- `943652d` — load triggers + `batch.Key.SurfaceId` source
- `01cff41` — Tasks 22+23 (`WbDrawDispatcher` + side-table)

If the worktree branch (`claude/tender-mcclintock-a16839`) hasn't been
merged yet, that's where the work is. Verify with `git log --oneline`.

### What works in N.4

- `ACDREAM_USE_WB_FOUNDATION=1` is default-on. WB's `ObjectMeshManager`
  loads, decodes, and uploads every entity mesh. Our existing
  `TextureCache` decodes textures (palette-aware, per-instance overrides
  via `GetOrUploadWithPaletteOverride`).
- `WbDrawDispatcher.Draw`:
  - Walks visible entities (per-landblock AABB cull + per-entity AABB
    cull + portal visibility)
  - Buckets every (entity × meshRef × batch) tuple by
    `GroupKey(Ibo, FirstIndex, BaseVertex, IndexCount, TextureHandle, Translucency)`
  - Single `glBufferData` upload of all matrices for the frame
  - Per group: `glActiveTexture(0) + glBindTexture(2D, handle) + glBindBuffer(EBO, ibo) + glDrawElementsInstancedBaseVertexBaseInstance(..., FirstInstance)`
  - Two passes: opaque (front-to-back sorted) + translucent
- 940/948 tests pass (8 pre-existing failures unrelated to rendering).
- Visual verification at Holtburg passed: scenery + characters render
  correctly with full close-detail geometry (Issue #47 preserved).
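
The bucketing step above can be sketched in a few lines. This is a language-neutral illustration in Python, not the C# in `WbDrawDispatcher`; the field names mirror the `GroupKey` tuple listed above, and `bucket_and_flatten` plus the string stand-ins for matrices are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class GroupKey(NamedTuple):
    # Mirrors the GroupKey fields named in the handoff; types are assumed.
    ibo: int
    first_index: int
    base_vertex: int
    index_count: int
    texture_handle: int
    translucency: int


def bucket_and_flatten(draws):
    """draws: iterable of (GroupKey, matrix) pairs that survived culling.

    Returns (flat_matrices, groups). Each group is
    (key, first_instance, instance_count), indexing into flat_matrices.
    """
    buckets = defaultdict(list)
    for key, matrix in draws:
        buckets[key].append(matrix)

    flat, groups = [], []
    for key, matrices in buckets.items():
        # first_instance is where this group's matrices start in the upload.
        groups.append((key, len(flat), len(matrices)))
        flat.extend(matrices)
    return flat, groups


tree = GroupKey(ibo=1, first_index=0, base_vertex=0, index_count=36,
                texture_handle=7, translucency=0)
rock = GroupKey(ibo=1, first_index=36, base_vertex=24, index_count=12,
                texture_handle=9, translucency=0)
flat, groups = bucket_and_flatten([(tree, "T0"), (rock, "R0"), (tree, "T1")])
```

Each group's `first_instance` is what N.4 passes as `FirstInstance`, and it is the same value N.5 would write into `baseInstance`.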

### What N.5 inherits

These are levers N.5 will pull on:

- **WB's modern rendering is already active.** `OpenGLGraphicsDevice`
  detected GL 4.3 + bindless on first run; WB's `_useModernRendering`
  is true; every mesh lives in WB's single `GlobalMeshBuffer` (one VAO,
  one VBO, one IBO).
- **Bindless handles are already populated.** `ObjectRenderBatch.BindlessTextureHandle`
  is non-zero for batches whose texture WB owns. (Entities with palette
  overrides are the exception: they go through acdream's `TextureCache`,
  which doesn't expose bindless handles yet. See "Per-instance entity
  texture handles" below.)
- **The instance VBO is acdream-owned** (`WbDrawDispatcher._instanceVbo`)
  with locations 3-6 patched onto WB's global VAO. Stride 64 bytes
  (one mat4). N.5 expands this to mat4 + uvec2 handle = 72 bytes,
  padded to an 80-byte stride.

### Three load-bearing WB API gotchas N.4 surfaced

These bit us hard during Task 26 visual verification. Documented in
CLAUDE.md "WB integration cribs" + plan adjustments 7-9 +
`memory/project_phase_n4_state.md`. Re-stating here because they
reshape the design space:

1. **`ObjectMeshManager.IncrementRefCount(id)` is NOT lifecycle-aware.**
   It only bumps a usage counter. Mesh loading is fired separately
   via `PrepareMeshDataAsync(id, isSetup)`. The result auto-enqueues
   to `_stagedMeshData` (line 510 of `ObjectMeshManager.cs`); our
   existing `WbMeshAdapter.Tick()` drains it. `WbMeshAdapter.IncrementRefCount`
   already calls `PrepareMeshDataAsync`. **N.5 doesn't need to change
   this — just don't break it.**

2. **`ObjectRenderBatch.SurfaceId` is unset.** WB constructs batches
   with `Key = batch.Key` (a `TextureAtlasManager.TextureKey` struct
   that has a `SurfaceId` field) but never populates the top-level
   `SurfaceId` property. Read `batch.Key.SurfaceId`. **N.5 keeps this
   pattern.**

3. **WB's modern rendering packs every mesh into ONE global
   VAO/VBO/IBO.** Each batch's `IBO` field points to the global IBO;
   the batch's actual slice is identified by `FirstIndex` (offset into
   IBO, in *indices*) and `BaseVertex` (offset into VBO, in *vertices*).
   N.4's draw uses `glDrawElementsInstancedBaseVertexBaseInstance`
   with those offsets. **N.5's `DrawElementsIndirectCommand` per-group
   record will carry `firstIndex` + `baseVertex` for the same reason.**
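
To make gotcha 3 concrete, here is a small Python model of a global buffer. It assumes indices are stored mesh-local (starting at 0) and shifted by `BaseVertex` at draw time, which is what a non-zero `BaseVertex` implies; how WB actually packs its buffers should be confirmed in `UploadGfxObjMeshData`. `pack_meshes` and `fetch_vertices` are hypothetical helpers.

```python
def pack_meshes(meshes):
    """meshes: list of (vertices, indices) with mesh-local indices.

    Returns (global_vbo, global_ibo, slices); each slice is
    (first_index, base_vertex, index_count).
    """
    global_vbo, global_ibo, slices = [], [], []
    for vertices, indices in meshes:
        slices.append((len(global_ibo), len(global_vbo), len(indices)))
        global_vbo.extend(vertices)
        global_ibo.extend(indices)  # stored unshifted; base_vertex shifts them
    return global_vbo, global_ibo, slices


def fetch_vertices(global_vbo, global_ibo, first_index, base_vertex, index_count):
    """Emulate the vertex fetch for one slice of the global buffers."""
    window = global_ibo[first_index:first_index + index_count]
    return [global_vbo[base_vertex + i] for i in window]


quad = (["q0", "q1", "q2", "q3"], [0, 1, 2, 2, 3, 0])
tri = (["t0", "t1", "t2"], [0, 1, 2])
vbo, ibo, slices = pack_meshes([quad, tri])
tri_fetch = fetch_vertices(vbo, ibo, *slices[1])
```

Dropping either offset draws the wrong mesh: with `base_vertex = 0` the triangle's indices would resolve to the quad's first three vertices.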

---

## What N.5 is — technical detail

### The two-feature pairing

**Bindless textures** (`GL_ARB_bindless_texture`):
- Each texture handle is a 64-bit integer (`uvec2` in GLSL).
- Shader declares `layout(bindless_sampler) uniform sampler2D ...` or
  receives the handle as a per-instance `uvec2` vertex attribute.
- No `glBindTexture` needed at draw time — the handle IS the binding.
- Handle generation: `glGetTextureHandleARB(textureId)` followed by
  `glMakeTextureHandleResidentARB(handle)` (the texture must be
  resident on the GPU; non-resident handles produce GPU faults).
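
Since the handle travels through a vertex buffer as two 32-bit words, the split is worth spelling out. A minimal Python sketch, assuming the low word lands in `.x` and the high word in `.y` (verify against the extension spec and WB's shader); the example handle value is made up.

```python
import struct


def handle_to_uvec2(handle):
    """Split a 64-bit bindless handle into (lo, hi) 32-bit words."""
    return handle & 0xFFFFFFFF, (handle >> 32) & 0xFFFFFFFF


def uvec2_to_handle(lo, hi):
    """Inverse of handle_to_uvec2."""
    return (hi << 32) | lo


example_handle = 0x0000000100000A40  # made-up value for illustration
lo, hi = handle_to_uvec2(example_handle)

# On a little-endian host, writing the raw 64-bit value produces the same
# bytes as writing lo then hi, so no swizzling is needed on upload.
raw = struct.pack("<Q", example_handle)
words = struct.pack("<II", lo, hi)
```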

**Multi-draw indirect** (`glMultiDrawElementsIndirect`):
- Indirect command struct layout (must match `DrawElementsIndirectCommand`):
  ```c
  struct {
      uint count;          // index count for this draw
      uint instanceCount;  // number of instances
      uint firstIndex;     // offset into IBO, in indices
      int  baseVertex;     // vertex offset into VBO
      uint baseInstance;   // first instance ID (offsets per-instance attribs)
  };
  ```
- Build a buffer of N of these structs (one per group), upload once,
  fire one GL call: `glMultiDrawElementsIndirect(mode, indexType, ptr, drawcount, stride)`.
- The driver issues all N draws in one shot. Effectively zero CPU
  overhead per draw beyond uploading the indirect buffer.
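
A byte-level sketch of the indirect buffer, in Python for brevity. The format string follows the struct above (four unsigned ints and one signed int, tightly packed, 20 bytes); `build_indirect_buffer` and its input tuples are hypothetical, and the running `baseInstance` assumes instances are laid out group after group.

```python
import struct

# count, instanceCount, firstIndex, baseVertex (signed), baseInstance
DEIC = struct.Struct("<IIIiI")


def build_indirect_buffer(groups):
    """groups: list of (index_count, instance_count, first_index, base_vertex).

    Returns tightly packed bytes, one 20-byte command per group.
    baseInstance is the running sum of earlier instance counts.
    """
    out = bytearray()
    base_instance = 0
    for index_count, instance_count, first_index, base_vertex in groups:
        out += DEIC.pack(index_count, instance_count, first_index,
                         base_vertex, base_instance)
        base_instance += instance_count
    return bytes(out)


buf = build_indirect_buffer([(36, 2, 0, 0), (12, 1, 36, 24)])
second = DEIC.unpack_from(buf, DEIC.size)
```

With this packing the `stride` argument can be `DEIC.size` (20), or 0, which GL interprets as tightly packed.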

**Why pair them.** Multi-draw indirect doesn't let you change any GL
state (texture bindings, uniforms) between the draws it issues. So if
textures are bound via `glBindTexture` per group, each group still needs
its own bind plus its own draw call, which defeats the purpose. Bindless
removes that constraint by encoding the texture handle as per-instance
data the shader reads directly. With both, the modern render loop becomes:

```
1. Upload instance buffer (mat4 + uvec2 handle, per-instance) — once per frame
2. Upload indirect command buffer (one DEIC per group) — once per frame
3. glBindVertexArray(globalVAO) — once
4. glMultiDrawElementsIndirect(...) — ONCE per pass
```

That's it. No per-group state changes.

### Instance attribute layout

Currently (N.4): location 3-6 = mat4 model matrix (16 floats = 64 bytes).

N.5 (proposed): location 3-6 = mat4 + location 7 = uvec2 bindless
handle = 16 floats + 2 uints = 72 bytes (16-aligned to 80 bytes per
WB's `InstanceData` precedent).

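Or spell out the padding explicitly (the struct only documents the byte
layout; std140 rules apply to uniform blocks, not vertex attributes):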
```c
struct InstanceData {
    mat4 transform;        // locations 3-6
    uvec2 textureHandle;   // location 7
    uvec2 _pad;            // padding to 80
};
```

Brainstorm should decide if we copy WB's `InstanceData` struct (Pack=16,
80 bytes including CellId/Flags fields we don't use) or define our own
minimal version. The 80-byte stride matches WB's so global VAO state
configured by WB stays compatible if the legacy WB draw path ever runs.
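
The minimal layout (mat4 + handle + pad) can be checked at the byte level. This Python sketch assumes the handle's low word comes first and that locations 3-6 are the four `vec4` columns of the matrix; `pack_instance` and `ATTRIB_OFFSETS` are hypothetical names, and WB's real `InstanceData` has other fields in the padding region.

```python
import struct

# 16 floats (mat4) + 2 uint32 (handle lo, hi) + 8 pad bytes = 80 bytes
INSTANCE = struct.Struct("<16f2I8x")

# Byte offset of each attribute location within one instance record.
ATTRIB_OFFSETS = {3: 0, 4: 16, 5: 32, 6: 48, 7: 64}


def pack_instance(matrix16, handle):
    """Pack one instance: 16 matrix floats followed by the 64-bit handle."""
    return INSTANCE.pack(*matrix16, handle & 0xFFFFFFFF, handle >> 32)


identity = [1.0 if i % 5 == 0 else 0.0 for i in range(16)]
blob = pack_instance(identity, 0x0000000100000A40)
handle_words = struct.unpack_from("<2I", blob, ATTRIB_OFFSETS[7])
```

One thing to check in the real setup: integer attributes such as location 7 need the integer variant of the attribute-pointer call (`glVertexAttribIPointer`), otherwise the values are converted to float.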

### Per-instance entity texture handles

Here's the wrinkle. N.4 uses `WbDrawDispatcher.ResolveTexture` to map
each (entity, batch) to a GL texture name:

- Tree (no overrides): `_textures.GetOrUpload(surfaceId)` → 2D texture name
- NPC with palette override: `_textures.GetOrUploadWithPaletteOverride(...)` → composite-cached 2D texture name
- Anything with surface override: `_textures.GetOrUploadWithOrigTextureOverride(...)` → composite-cached 2D texture name

Those are all `GLuint` 32-bit GL texture *names*, not bindless handles.
**N.5 needs `TextureCache` to publish bindless handles for everything
it owns, not just WB-owned textures.**

Implementation sketch:
- `TextureCache` adds a parallel cache keyed identically but storing
  64-bit bindless handles. On first request, generate via
  `glGetTextureHandleARB(textureId)` + make resident.
- New API: `GetBindlessHandle(uint surfaceId, ...)` returns the handle.
- Or: change every `GetOrUpload*` method to return both the GL name
  and the bindless handle (or just the handle; let GL name fall out
  if anyone needs it later).

WB's `ObjectRenderBatch.BindlessTextureHandle` covers the atlas-tier
case. For per-instance entities, we use `TextureCache`'s handle.
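
A sketch of the parallel-cache idea, written in Python with the two GL entry points injected as plain callables so it runs without a GL context. `BindlessHandleCache`, its method names, and the cache key shape are all hypothetical; the real version would live in C# inside `TextureCache` and call `glGetTextureHandleARB` / `glMakeTextureHandleResidentARB`.

```python
class BindlessHandleCache:
    """Lazy cache: key -> 64-bit handle, made resident on first request."""

    def __init__(self, get_handle, make_resident):
        self._get_handle = get_handle        # stands in for glGetTextureHandleARB
        self._make_resident = make_resident  # stands in for glMakeTextureHandleResidentARB
        self._handles = {}

    def get(self, key, texture_name):
        """key: same key the name cache uses; texture_name: the GL name."""
        if key not in self._handles:
            handle = self._get_handle(texture_name)
            self._make_resident(handle)
            self._handles[key] = handle
        return self._handles[key]


# Fake driver for illustration: handle = name + a fixed offset.
resident_calls = []
cache = BindlessHandleCache(
    get_handle=lambda name: 0x100000000 + name,
    make_resident=resident_calls.append,
)
first = cache.get(("surface", 0x0800_1234, "palette-A"), texture_name=42)
again = cache.get(("surface", 0x0800_1234, "palette-A"), texture_name=42)
other = cache.get(("surface", 0x0800_1234, "palette-B"), texture_name=43)
```

This is the lazy variant; the eager and spawn-time options from the brainstorm questions would move the `get` call, not change the cache shape. Eviction and making handles non-resident are left out here and need a decision in the spec.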

### The new shader

Reuse WB's `StaticObjectModern.vert` / `StaticObjectModern.frag` as a
template. Read those files cold. They already do bindless + the
instance-data layout. Adapt to acdream's `mesh_instanced.vert/frag`
conventions:

- Keep the `uViewProjection` uniform, lighting UBO at binding=1, fog
  uniforms.
- Add `#version 430 core` + `#extension GL_ARB_bindless_texture : require`.
- Replace `uniform sampler2D uDiffuse` with a per-instance `uvec2`
  attribute (location 7) → reconstruct sampler in vertex shader OR
  pass through to fragment via flat varying.
- Drop `uTranslucencyKind` uniform, OR keep it (still set per-pass —
  multi-draw indirect doesn't break uniforms; only state that varies
  per-draw is the constraint).

### Translucency

Multi-draw indirect can't change blend state mid-draw. Solution:
**still use two passes** (opaque + translucent), but within translucent
keep the per-blendfunc sub-passes (additive, alpha-blend, inv-alpha).
Three sub-passes within translucent. Each sub-pass = one
`glMultiDrawElementsIndirect` over its filtered groups.

Or: if perf allows, fold all three blend modes into the shader via
per-instance blendmode int, sort all translucent groups by blendmode
in the indirect buffer, switch blend state at sub-pass boundaries.
Brainstorm decides the cleanest pattern.
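
Either option needs the translucent commands partitioned by blend mode so that each sub-pass covers one contiguous range of the indirect buffer. A Python sketch of that bookkeeping; the blend-mode labels, their order, and `subpass_ranges` are hypothetical.

```python
from itertools import groupby

# Hypothetical labels and sub-pass order for the three translucent modes.
BLEND_ORDER = ("alpha", "additive", "inv_alpha")


def subpass_ranges(groups):
    """groups: list of (blend_mode, payload).

    Returns (ordered_groups, ranges); each range is
    (blend_mode, first_command, command_count) into ordered_groups.
    """
    # sorted() is stable, so groups keep their relative order within a mode.
    ordered = sorted(groups, key=lambda g: BLEND_ORDER.index(g[0]))
    ranges, start = [], 0
    for mode, run in groupby(ordered, key=lambda g: g[0]):
        count = len(list(run))
        ranges.append((mode, start, count))
        start += count
    return ordered, ranges


ordered, ranges = subpass_ranges([
    ("additive", "torch"), ("alpha", "window"),
    ("additive", "portal"), ("inv_alpha", "decal"),
])
```

Each range maps to one `glMultiDrawElementsIndirect` call whose indirect offset is `first_command * 20` bytes and whose `drawcount` is `command_count`, with the blend state set between calls. Note that grouping by blend mode reorders draws across modes, so any back-to-front sorting only holds within a sub-pass.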

---

## Files to read before brainstorming

In rough order:

1. **N.4 plan + spec** — `docs/superpowers/plans/2026-05-08-phase-n4-rendering-foundation.md`
   (status: Final). Adjustments 7-9 capture the gotchas. Spec at
   `docs/superpowers/specs/2026-05-08-phase-n4-rendering-foundation-design.md`.

2. **N.4 dispatcher source** — `src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`.
   This is what you're modifying. Read end-to-end.

3. **WB's modern rendering shaders** — `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Shaders/StaticObjectModern.vert`
   + `StaticObjectModern.frag`. The template you're adapting from.

4. **WB's `ObjectMeshManager.UploadGfxObjMeshData`** — lines ~1654-1780
   of `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectMeshManager.cs`.
   Shows how WB sets up the modern path's VBO/IBO/VAO. Especially note
   how it patches in instance attribute slots (locations 3-6) on the
   global VAO and configures location 7+ for bindless handles.

5. **WB's `ObjectRenderBatch`** — same file, lines ~166-184. Note the
   `BindlessTextureHandle` field — already populated when `_useModernRendering`
   is on.

6. **Our `TextureCache`** — `src/AcDream.App/Rendering/TextureCache.cs`.
   Three composite caches: by surface id, by surface+origTex, by
   surface+origTex+palette. N.5 adds parallel bindless-handle caches.

7. **CLAUDE.md "WB integration cribs"** section. Lines ~28-80. The
   three gotchas + the integration architecture in plain language.

8. **Memory: `project_phase_n4_state.md`** — same content from a
   different angle. Reading both helps lock in the gotchas.

---

## Brainstorm questions

These are the questions to resolve in the brainstorm step. Don't
prejudge them — bring them to the user with options + recommendation:

1. **Instance attribute layout.** Match WB's `InstanceData` struct
   (80 bytes including CellId/Flags fields we don't use) for global
   VAO compatibility, or define a minimal acdream-specific version
   (mat4 + handle = 72 bytes, padded to 80)?

2. **Bindless handle generation strategy.**
   - At texture upload time? (Eager — every texture that lands in
     `TextureCache` gets a handle. Cost: handle + residency state for
     every texture, whether or not it is ever drawn.)
   - On first draw lookup? (Lazy — cache fills as scene exercises
     content. Possible first-use stall.)
   - At spawn time via the spawn adapter? (Tied to lifecycle. Cleanest
     but requires touching the spawn path.)

3. **Translucent pass structure.** Three sub-indirect-draws (one per
   blend mode) or a single sorted indirect buffer with per-instance
   blend mode + state-flip at sub-pass boundaries? Or: just iterate
   per-group like N.4 for translucent only (translucent groups are a
   small fraction of total)?

4. **Persistent-mapped indirect + instance buffers.** Use
   `GL_ARB_buffer_storage` + `MAP_PERSISTENT_BIT | MAP_COHERENT_BIT`?
   Triple-buffered ring + sync object? Or stick with `glBufferData`
   (still one upload per frame, just larger)? Persistent mapping is
   ~2-5% per-frame win in our context but adds buffer-management
   complexity.

5. **Shader unification.** Keep `mesh_instanced` for legacy + add
   `mesh_indirect` for modern, or replace `mesh_instanced` entirely?
   Replacement requires the legacy `InstancedMeshRenderer` (escape
   hatch under `ACDREAM_USE_WB_FOUNDATION=0`) to also use the new
   shader, which... probably doesn't matter if we delete legacy in
   N.6 anyway. Brainstorm.

6. **Conformance test strategy.** N.4 used visual verification at
   Holtburg as the gate. N.5's gate is "no visual regression vs N.4
   AND measurable CPU win." How do we measure CPU? `[WB-DIAG]`
   counters give draw count + group count; we need frame-time
   counters too. Add to the dispatcher? Use a profiler?

7. **Per-instance entity bindless.** `TextureCache.GetOrUpload*`
   returns a GL name. The dispatcher (or `TextureCache` itself) needs
   to convert that to a bindless handle. Design questions:
   - Where does the conversion happen?
   - When is the texture made resident? (Residency is global state;
     too many resident textures hits driver limits.)
   - What about palette/surface overrides — same caching key as the
     name, just a parallel handle dictionary?

8. **Escape hatch.** N.4 keeps `ACDREAM_USE_WB_FOUNDATION=0` as a
   fallback. N.5 needs to decide: does the new shader REPLACE the
   N.4 dispatcher's draw path (so flag-on means N.5 modern path,
   flag-off means legacy `InstancedMeshRenderer`)? Or do we add a
   separate flag (`ACDREAM_USE_MODERN_DRAW`) so users can toggle
   N.4 vs N.5 vs legacy independently? Three-way flag is more
   complex but useful for A/B during rollout.
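
For question 4, the "triple-buffered ring + sync object" option amounts to the bookkeeping below. It is a Python sketch with the fence calls injected as callables, so nothing here touches GL; `FrameRing` and its method names are hypothetical, and `wait_fence` / `new_fence` stand in for `glClientWaitSync` / `glFenceSync`.

```python
class FrameRing:
    """Rotating regions inside one persistently mapped buffer."""

    def __init__(self, region_size, regions=3):
        self.region_size = region_size
        self.regions = regions
        self.frame = 0
        self.fences = [None] * regions

    def begin_frame(self, wait_fence):
        """Pick this frame's region, waiting if the GPU may still read it."""
        slot = self.frame % self.regions
        if self.fences[slot] is not None:
            wait_fence(self.fences[slot])
            self.fences[slot] = None
        return slot, slot * self.region_size

    def end_frame(self, slot, new_fence):
        """Record the fence issued after this frame's draws were submitted."""
        self.fences[slot] = new_fence
        self.frame += 1


ring = FrameRing(region_size=64)
waited, offsets = [], []
for i in range(4):
    slot, offset = ring.begin_frame(waited.append)
    offsets.append(offset)          # CPU writes instance/indirect data here
    ring.end_frame(slot, f"fence{i}")
```

The complexity the question mentions is visible even at this size: three regions to size for the worst-case frame, plus a fence per region. The `glBufferData` path has none of that.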

---

## Spec structure

After the brainstorm, the spec doc covers:

1. **Architecture diagram** — how `WbDrawDispatcher` changes shape.
   Where the indirect buffer lives. Where bindless handles flow from.
2. **Instance data layout** — exact struct, byte offsets, GL attribute
   pointer setup.
3. **TextureCache changes** — new methods, new cache, residency
   policy.
4. **Shader files** — name(s), version, extensions, in/out variables.
5. **Conformance tests** — what to write, what coverage to claim.
6. **Acceptance criteria** — visual identity to N.4 + measured CPU
   delta.
7. **Risks** — driver bugs in bindless / indirect, residency limits,
   shader compile issues on weird GPUs, the legacy escape hatch
   breaking.

Spec lives at: `docs/superpowers/specs/2026-05-XX-phase-n5-modern-rendering-design.md`.

## Plan structure

After the spec, the plan doc lays out the week-by-week task list.
Match N.4's plan structure (living document, task checkboxes, commit
SHAs appended, adjustments documented inline). Plan lives at:
`docs/superpowers/plans/2026-05-XX-phase-n5-modern-rendering.md`.

Suggested initial breakdown (brainstorm + spec will refine):

- **Week 1** — Plumbing: bindless handle generation in `TextureCache`,
  shader rewrite (compile + bind), instance-attrib layout updated to
  mat4+handle. Dispatcher still uses per-group draws but reads
  textures bindless. Validate: visual identical to N.4.
- **Week 2** — Indirect: build `DrawElementsIndirectCommand` buffer
  per frame, switch to `glMultiDrawElementsIndirect`. Three-pass
  translucent (or whatever brainstorm decides). Validate: visual
  identical, draw-call count drops to 2-4 per frame.
- **Week 3** — Polish + ship: persistent-mapped buffers if brainstorm
  voted yes, profiler/counters, visual verification, flag flip, plan
  finalization.

---

## Acceptance criteria for the whole phase

- Visual output identical to N.4 (no character regressions, no
  scenery missing, no z-fighting introduced)
- `[WB-DIAG]` shows `drawsIssued` ≤ ~5 per frame (down from N.4's
  few hundred)
- Frame time measurably lower in dense scenes (specify what scenes
  to test in the spec — probably Holtburg courtyard + Foundry
  interior)
- All tests still green (940/948 + any new conformance tests)
- `ACDREAM_USE_WB_FOUNDATION=0` escape hatch still works
- Plan doc finalized, roadmap updated, memory captured if N.5
  surfaces durable lessons (it almost certainly will — bindless
  + indirect both have well-known driver gotchas)
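
"Measurably lower" needs a definition before the spec can gate on it. One possible shape is to compare median and 95th-percentile frame times captured in the same scene on N.4 and N.5. The Python sketch below uses made-up sample numbers and hypothetical helper names, and says nothing about how the samples are captured.

```python
import statistics


def summarize(frame_ms):
    """Return median and nearest-rank p95 of a list of frame times (ms)."""
    ordered = sorted(frame_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), integer math
    return {"median": statistics.median(ordered), "p95": ordered[rank - 1]}


def median_speedup(before_ms, after_ms):
    """Ratio of median frame times; above 1.0 means 'after' is faster."""
    return summarize(before_ms)["median"] / summarize(after_ms)["median"]


# Made-up samples: 19 steady frames plus one spike each.
n4_samples = [10.0] * 19 + [30.0]
n5_samples = [4.0] * 19 + [12.0]
speedup = median_speedup(n4_samples, n5_samples)
```

Reporting the median alongside p95 keeps a single hitch (shader compile, first-use residency stall) from dominating the comparison in either direction.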

---

## What you'll be doing in the first 30 minutes

1. Read this handoff in full.
2. Read CLAUDE.md "WB integration cribs" section.
3. Read `WbDrawDispatcher.cs` end-to-end.
4. Skim WB's `StaticObjectModern.vert/frag` + `ObjectMeshManager.UploadGfxObjMeshData`
   to ground the reference.
5. Verify build is green: `dotnet build`.
6. Verify N.4 ship is intact: `dotnet test --filter "FullyQualifiedName~Wb|MatrixComposition"`
   should produce 60 passing tests, 0 failures.
7. Invoke the `superpowers:brainstorming` skill with the user. Walk
   through the 8 brainstorm questions above. Capture decisions in a
   spec.
8. Write the spec at the path above.
9. Write the plan at the path above.
10. Begin Week 1 implementation per the plan.

Don't skip the brainstorm. Multi-draw indirect + bindless have several
real driver-compatibility / API-shape decisions that need user input,
not "the agent makes a call and goes." This phase is structurally the
same shape as N.4 — brainstorm → spec → plan → tasks-with-checkboxes →
commits-update-checkboxes → final SHIP commit.

---

## Things to NOT do

- **Don't delete the legacy `InstancedMeshRenderer`.** It's the N.4
  escape hatch. N.6 retires it after N.5 is proven default-on.
- **Don't fork WB.** N.4 deliberately avoided fork patches by using
  the side-table pattern (`AcSurfaceMetadataTable`). Stay on that
  path. If you need data WB doesn't expose, add a side-table or
  decode it yourself from dats.
- **Don't try to make per-instance entities use WB's `TextureAtlasManager`.**
  That's N.6+ territory. acdream's `TextureCache` owns palette/surface
  overrides because WB's atlas is keyed by `(surfaceId, paletteId,
  stippling, isSolid)` and our overrides don't fit cleanly. Bindless
  handles let us escape that mismatch — handles for both atlas-tier
  AND per-instance-tier textures, no atlas adoption needed.
- **Don't skip visual verification.** N.4 surfaced three bugs at
  visual verification that no test caught. Don't trust "build green +
  tests pass" — exercise the rendering path with the local ACE server.
- **Don't extend the phase scope.** N.5 is bindless + indirect on
  the existing rendering pipeline. Texture array atlas, GPU-side
  culling, terrain wiring — all of those are subsequent phases. If
  the brainstorm tries to expand, push back.

---

## Reference: the N.4 dispatcher flow you're modifying

```
Draw(camera, landblockEntries, frustum, ...) {
  // Phase 1: walk entities, build groups
  foreach (entity, meshRef, batch) {
    cull, classify into _groups[GroupKey]
  }

  // Phase 2: lay matrices contiguously
  // Phase 3: glBufferData(_instanceVbo, allMatrices)
  // Phase 4: bind global VAO once
  // Phase 5: opaque pass (sorted)
  foreach (group in _opaqueDraws) {
    glBindTexture(group.handle)
    glBindBuffer(EBO, group.ibo)
    glDrawElementsInstancedBaseVertexBaseInstance(...)
  }
  // Phase 6: translucent pass
}
```

After N.5, Phases 5 and 6 collapse to:

```
glBindBuffer(DRAW_INDIRECT_BUFFER, _opaqueIndirect)
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, 0, opaqueGroups.Count, sizeof(DEIC))
glBindBuffer(DRAW_INDIRECT_BUFFER, _translucentIndirect)
// 3 sub-calls for translucent or 1 if shader-folded
glMultiDrawElementsIndirect(...)
```

That's the destination. Get there cleanly.

Good luck. Holler at the user if any of the brainstorm questions feel
genuinely ambiguous after reading the references — they care about
this phase landing right and will engage on design questions.