Erik 1834b16cd1 docs(N.5): design spec — bindless + multi-draw indirect on N.4 dispatcher

Brainstormed 2026-05-08 over 8 design questions. Captures:

- Texture model: sampler2DArray for ALL textures (1-layer wrap for
  per-instance composites). Matches WB's modern shader, future-proofs
  for atlas adoption in N.6+.
- Translucency: WB's two-pass alpha-test (no native Additive on GfxObj
  surfaces; falsifiable at visual verification).
- Data delivery: all-SSBO. Instances[] at binding=0, Batches[] at
  binding=1. Indexed by gl_BaseInstanceARB+gl_InstanceID and
  gl_DrawIDARB respectively.
- Bindless residency: resident on upload, never release. Bounded
  content; instrument under ACDREAM_WB_DIAG=1.
- Escape hatch: two-way flag preserved. N.5 replaces N.4's draw method
  in place; legacy InstancedMeshRenderer remains the safety net.
- Perf measurement: CPU stopwatch + GL_TIME_ELAPSED queries, logged
  via [WB-DIAG]. Acceptance gates pasted into SHIP commit.
- Persistent-mapped buffers: deferred to N.6.
- Per-instance highlight (selection blink): deferred; field reserved
  in InstanceData for Phase B.4 follow-up.

Spec at docs/superpowers/specs/2026-05-08-phase-n5-modern-rendering-design.md
covers architecture, components, per-frame data flow walk-through,
translucent rendering, error handling + fallback, testing + acceptance,
risks, and explicit out-of-scope list. Plan + task breakdown comes next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 19:15:30 +02:00

Phase N.5 — Modern Rendering Path — Design Spec

Status: Draft (brainstormed 2026-05-08, not yet implemented).
Author: acdream lead engineer + Claude.
Builds on: Phase N.4 (WbDrawDispatcher, shipped 2026-05-08).
Predecessor docs:

docs/research/2026-05-08-phase-n5-handoff.md (cold-start briefing).
docs/superpowers/plans/2026-05-08-phase-n4-rendering-foundation.md (N.4 plan; Adjustments 7-10 are required reading).
docs/superpowers/specs/2026-05-08-phase-n4-rendering-foundation-design.md (N.4 spec).

1. Problem statement

N.4 collapsed entity rendering from O(entities × batches) per-draw GL calls to O(unique GfxObj × surface × translucency) grouped instanced draws. The remaining hot path still does, per group:

glActiveTexture(0)
glBindTexture(2D, texHandle)
glBindBuffer(EBO, batchIbo)
glDrawElementsInstancedBaseVertexBaseInstance(...)

Across a typical Holtburg-courtyard scene that's still ~100-300 GL calls per frame for entities. Modern GPUs and our drivers (GL 4.3 + bindless, gated by WB's _useModernRendering) support patterns that eliminate ALL of those per-group calls:

Bindless textures (GL_ARB_bindless_texture) — texture handles are 64-bit tokens that don't require glBindTexture to use; the shader samples from a handle read out of buffer data.
Multi-draw indirect (glMultiDrawElementsIndirect) — one GL call dispatches N draws from a DrawElementsIndirectCommand buffer; the driver issues all of them with no CPU-side per-draw work.

N.5 lifts WbDrawDispatcher onto these primitives. Target: ≥30% reduction in CPU dispatcher time, draw call count down to ~5/frame, no visual regression vs N.4.

2. Decisions log

This section records the brainstorm outcomes that the rest of the doc relies on.

| # | Decision | Choice | Reason |
|---|----------|--------|--------|
| 1 | Texture sampler model | `sampler2DArray` for ALL textures (1-layer wrapping for per-instance composites) | Matches WB's modern shader exactly; future-proofs for atlas adoption in N.6+; avoids two shader files. ~50 lines of TextureCache change. |
| 2 | Translucent rendering | WB's two-pass alpha-test (opaque pass discards `α<0.95`, transparent pass discards `α≥0.95`) | Single blend mode per pass enables one indirect call per pass. Loses native `Additive` blend on GfxObj surfaces; sky + particles have their own renderers and aren't affected. Falsifiable at visual verification: if we see a regression, add an additive sub-pass (~30-min fix). |
| 3 | Per-instance + per-draw data delivery | All-SSBO: `Instances[]` at binding=0 (mat4 per instance), `Batches[]` at binding=1 (texture handle + layer + flags per group) | Matches WB's modern shader. SSBOs avoid the 16-vertex-attribute limit, scale to large instance counts, and give clean per-draw indexing via `gl_DrawIDARB`. |
| 4 | Bindless handle residency | Resident on upload, never release | acdream's content set is bounded (~1-5K unique textures per session). Handles persist for process lifetime; no eviction code in N.5. Diagnostic logging of handle count under `ACDREAM_WB_DIAG=1` to spot growth. |
| 5 | Escape hatch | Two-way flag (no change). `ACDREAM_USE_WB_FOUNDATION=0/1` controls `WbFoundationFlag`; flag-on is the N.5 modern path; flag-off falls back to legacy `InstancedMeshRenderer`. N.4's draw method is replaced in place. | N.4's grouped-instanced draw is not preserved as an A/B fallback; legacy `InstancedMeshRenderer` is the existing safety net for "modern rendering broken on this GPU." |
| 6 | Perf measurement | CPU stopwatch + GL timer queries logged via `[WB-DIAG]` | Captures both CPU dispatcher time and GPU rendering time. Acceptance gate compares before/after numbers in fixed Holtburg/Foundry scenes. |
| 7 | Persistent-mapped buffers | Defer to N.6 | Bindless+indirect win is 70-80% of achievable savings. Persistent-mapped + ring + sync is the last 5-10% with non-trivial sync-fence complexity; not worth the risk in N.5's 2-3 week budget. Add post-N.5 if profiling shows residual `glBufferData` cost. |
| 8 | Per-instance highlight (selection blink) | Defer to a Phase B.4 follow-up | Retail pulses click targets as visual confirmation; the right mechanism is per-instance highlight color (NOT WB's global `uHighlightColor`, which would tint everything in our single-indirect-call design). Field is reserved in design (extend `InstanceData` to include `vec4 highlightColor`); N.5 ships without the field, and a future phase plumbs it without a shader rewrite. |

3. Architecture overview

What changes

WbDrawDispatcher.Draw swaps its inner loop. Phases 1-3 (entity walk, group bucketing, matrix layout) stay intact. Phases 5-6 (per-group GL calls) are replaced by a single glMultiDrawElementsIndirect per pass, fed by SSBO-resident per-instance and per-draw data.

What's preserved from N.4

Group bucketing pipeline (entity AABB cull, palette hash memo, group key dictionary).
AcSurfaceMetadataTable for translucency classification.
EntitySpawnAdapter / LandblockSpawnAdapter (mesh lifecycle bridge).
WbMeshAdapter (the seam over WB's ObjectMeshManager).
Front-to-back sort of opaque groups (depth-test reject of overdrawn fragments).
Per-entity 5m AABB frustum cull.

What's new

TextureCache uploads as 1-layer Texture2DArray instead of Texture2D. Generates 64-bit bindless handles at upload, makes them resident.
New shader pair mesh_modern.vert/.frag modeled on WB's StaticObjectModern but adapted (see §6).
Three new GPU buffers in the dispatcher:
- _instanceSsbo — std430 layout, mat4[], all visible matrices.
- _batchSsbo — std430 layout, BatchData[], one entry per group.
- _indirectBuffer — DrawElementsIndirectCommand[], one per group.
Two diagnostic measurements in [WB-DIAG]: CPU stopwatch span around Draw(); GPU GL_TIME_ELAPSED query around the indirect dispatch.

What gets deleted

WbDrawDispatcher.DrawGroup (replaced by indirect).
WbDrawDispatcher.EnsureInstanceAttribs (no more vertex attribs at locations 3-6).
Per-blend-mode glBlendFunc switch in the translucent loop.
mesh_instanced.vert/.frag (replaced by mesh_modern.*).

What stays under the escape hatch

InstancedMeshRenderer is untouched. ACDREAM_USE_WB_FOUNDATION=0 still routes there. N.6 retires it.

4. Component changes

4.1 `TextureCache`

Texture upload path becomes Texture2DArray with depth=1:

private unsafe uint UploadRgba8AsLayer1Array(DecodedTexture decoded)   // unsafe: the fixed block below pins a raw pointer
{
    uint tex = _gl.GenTexture();
    _gl.BindTexture(TextureTarget.Texture2DArray, tex);

    fixed (byte* p = decoded.Rgba8)
        _gl.TexImage3D(
            TextureTarget.Texture2DArray, 0, InternalFormat.Rgba8,
            (uint)decoded.Width, (uint)decoded.Height, depth: 1,
            border: 0, PixelFormat.Rgba, PixelType.UnsignedByte, p);

    _gl.TexParameter(TextureTarget.Texture2DArray, TextureParameterName.TextureMinFilter, (int)TextureMinFilter.Linear);
    _gl.TexParameter(TextureTarget.Texture2DArray, TextureParameterName.TextureMagFilter, (int)TextureMagFilter.Linear);
    _gl.TexParameter(TextureTarget.Texture2DArray, TextureParameterName.TextureWrapS,     (int)TextureWrapMode.Repeat);
    _gl.TexParameter(TextureTarget.Texture2DArray, TextureParameterName.TextureWrapT,     (int)TextureWrapMode.Repeat);
    _gl.BindTexture(TextureTarget.Texture2DArray, 0);
    return tex;
}

Bindless handle generation, eager + resident-on-upload, parallel cache:

private readonly Dictionary<uint, ulong> _bindlessHandlesByGlName = new();

private ulong MakeResidentHandle(uint glTextureName)
{
    if (_bindlessHandlesByGlName.TryGetValue(glTextureName, out var h))
        return h;
    h = _bindless.GetTextureHandleARB(glTextureName);
    _bindless.MakeTextureHandleResidentARB(h);
    _bindlessHandlesByGlName[glTextureName] = h;
    return h;
}

Three new methods returning ulong bindless handles, paralleling the existing uint GL-name methods:

public ulong GetOrUploadBindless(uint surfaceId);
public ulong GetOrUploadWithOrigTextureOverrideBindless(uint surfaceId, uint overrideOrigTextureId);
public ulong GetOrUploadWithPaletteOverrideBindless(uint surfaceId, uint? overrideOrigTextureId, PaletteOverride paletteOverride, ulong precomputedPaletteHash);

Each delegates to its existing uint sibling to populate the underlying GL texture, then calls MakeResidentHandle and returns the 64-bit handle.

The uint-returning methods stay (used by SkyRenderer, TerrainAtlas, anything outside the WB modern path).

Dispose releases bindless handles BEFORE deleting their textures: iterate _bindlessHandlesByGlName.Values, call glMakeTextureHandleNonResidentARB(handle), then glDeleteTextures proceeds as today.
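
The cache-then-dispose ordering described above can be sketched without a GL context. This is a minimal illustration, not the TextureCache implementation: `IBindlessGl`, `RecordingGl` and `BindlessHandleCache` are hypothetical names introduced here so the call order can be checked in a unit test, and folding texture deletion into the same seam is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical seam over the GL entry points named in this section.
public interface IBindlessGl
{
    ulong GetTextureHandle(uint texture);
    void MakeResident(ulong handle);
    void MakeNonResident(ulong handle);
    void DeleteTexture(uint texture);
}

// Test double: records the order of calls instead of touching a driver.
public sealed class RecordingGl : IBindlessGl
{
    public readonly List<string> Calls = new List<string>();
    public ulong GetTextureHandle(uint texture) { Calls.Add("handle"); return 0x1000UL + texture; }
    public void MakeResident(ulong handle) => Calls.Add("resident");
    public void MakeNonResident(ulong handle) => Calls.Add("nonresident");
    public void DeleteTexture(uint texture) => Calls.Add("delete");
}

public sealed class BindlessHandleCache : IDisposable
{
    private readonly IBindlessGl _gl;
    private readonly Dictionary<uint, ulong> _handlesByGlName = new Dictionary<uint, ulong>();

    public BindlessHandleCache(IBindlessGl gl) => _gl = gl;

    // Number of handles currently resident (what a diagnostic log would report).
    public int ResidentCount => _handlesByGlName.Count;

    // Creates the handle once per GL texture name and makes it resident immediately.
    public ulong MakeResidentHandle(uint texture)
    {
        if (_handlesByGlName.TryGetValue(texture, out ulong h)) return h;
        h = _gl.GetTextureHandle(texture);
        _gl.MakeResident(h);
        _handlesByGlName[texture] = h;
        return h;
    }

    public void Dispose()
    {
        // Pass 1: drop residency for every handle.
        foreach (ulong h in _handlesByGlName.Values) _gl.MakeNonResident(h);
        // Pass 2: only now delete the underlying textures.
        foreach (uint tex in _handlesByGlName.Keys) _gl.DeleteTexture(tex);
        _handlesByGlName.Clear();
    }
}
```

Keeping the two passes separate makes the ordering rule visible in a test: every "nonresident" entry in `RecordingGl.Calls` precedes the first "delete".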

4.2 `WbDrawDispatcher`

Three new GPU buffers (replacing _instanceVbo):

private uint _instanceSsbo;     // binding=0, std430, mat4[]
private uint _batchSsbo;        // binding=1, std430, BatchData[]
private uint _indirectBuffer;   // GL_DRAW_INDIRECT_BUFFER, DEIC[]
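
For the batch SSBO, the CPU-side element has to match the std430 layout of the shader's BatchData (uvec2 + uint + uint, 16 bytes). A minimal sketch of such a mirror struct follows; `BatchDataCpu` is a hypothetical name, and the split relies on the GL_ARB_bindless_texture convention that the `uvec2` form of a handle carries the low 32 bits in `.x` and the high 32 bits in `.y`.

```csharp
using System.Runtime.InteropServices;

// Hypothetical CPU mirror of the shader-side BatchData (std430):
// uvec2 textureHandle (8 bytes) + uint textureLayer + uint flags = 16 bytes.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
public struct BatchDataCpu
{
    public uint HandleLo;      // low 32 bits of the handle  -> textureHandle.x
    public uint HandleHi;      // high 32 bits of the handle -> textureHandle.y
    public uint TextureLayer;  // always 0 in N.5
    public uint Flags;         // reserved

    public static BatchDataCpu From(ulong handle, uint layer, uint flags = 0) => new BatchDataCpu
    {
        HandleLo = (uint)(handle & 0xFFFFFFFFUL),
        HandleHi = (uint)(handle >> 32),
        TextureLayer = layer,
        Flags = flags,
    };

    // Reassembles the 64-bit handle, useful for round-trip checks in tests.
    public ulong Handle => ((ulong)HandleHi << 32) | HandleLo;
}
```

Two explicit `uint` halves keep the struct at 4-byte packing on the CPU while still landing on the 8-byte-aligned `uvec2` in the buffer, since every element starts on a 16-byte boundary.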

InstanceGroup becomes:

private sealed class InstanceGroup
{
    public uint Ibo;
    public uint FirstIndex;
    public int BaseVertex;
    public int IndexCount;
    public ulong BindlessTextureHandle;   // 64-bit (was uint TextureHandle in N.4)
    public uint TextureLayer;             // always 0 in N.5 (per-instance composites are 1-layer arrays)
    public TranslucencyKind Translucency;
    public int FirstInstance;
    public int InstanceCount;
    public float SortDistance;
    public readonly List<Matrix4x4> Matrices = new();
}

GroupKey adds the layer:

private readonly record struct GroupKey(
    uint Ibo, uint FirstIndex, int BaseVertex, int IndexCount,
    ulong BindlessTextureHandle, uint TextureLayer, TranslucencyKind Translucency);

Per-frame draw flow:

1. Walk entities → build _groups dict (unchanged from N.4).
2. Lay matrices contiguously, split opaque/transparent, sort opaque (unchanged).
3. Build per-group BatchData and DEIC arrays. One BatchData per group (handle, layer, flags=0), stored in the same order as the indirect commands so that gl_DrawIDARB selects the matching entry. One DEIC per group (count = IndexCount, instanceCount = InstanceCount, firstIndex = FirstIndex, baseVertex = BaseVertex, baseInstance = FirstInstance). Indirect commands are laid out contiguously: opaque section first (sorted front-to-back), transparent section second. _opaqueDrawCount and _transparentDrawCount track section sizes; _transparentByteOffset = _opaqueDrawCount * sizeof(DEIC).
4. Three glBufferData uploads to _instanceSsbo, _batchSsbo, _indirectBuffer (single buffer, both sections).
5. Bind global VAO once (preserved from N.4; modern rendering shares one VAO).
6. Bind SSBOs once via glBindBufferBase(SHADER_STORAGE_BUFFER, 0, _instanceSsbo) and glBindBufferBase(SHADER_STORAGE_BUFFER, 1, _batchSsbo).
7. Opaque pass. Set uRenderPass = 0. glBindBuffer(DRAW_INDIRECT_BUFFER, _indirectBuffer). glMultiDrawElementsIndirect(Triangles, UnsignedShort, indirect=(void*)0, drawcount=_opaqueDrawCount, stride=sizeof(DEIC)).
8. Transparent pass. Set uRenderPass = 1. glEnable(BLEND) + glBlendFunc(SrcAlpha, OneMinusSrcAlpha) + glDepthMask(false). glMultiDrawElementsIndirect(Triangles, UnsignedShort, indirect=(void*)_transparentByteOffset, drawcount=_transparentDrawCount, stride=sizeof(DEIC)). gl_DrawIDARB restarts at 0 for each glMultiDrawElementsIndirect call, so this pass must read Batches[_opaqueDrawCount + gl_DrawIDARB]; the offset has to reach the shader, either as a draw-ID offset uniform or by binding the transparent slice of _batchSsbo with glBindBufferRange (subject to GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT).
9. Restore state. glDepthMask(true) + glDisable(BLEND) + glBindVertexArray(0).
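
The command-building step of this flow is pure CPU work and can be sketched in isolation. The sketch below is an illustration under assumptions, not the dispatcher's code: `GroupSketch` is a hypothetical trimmed-down stand-in for InstanceGroup, `IndirectLayout` is a hypothetical helper, and `Deic` follows the five-field DrawElementsIndirectCommand layout that glMultiDrawElementsIndirect consumes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

// Five 32-bit fields, 20 bytes, in the order the GL indirect draw expects.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
public struct Deic
{
    public uint Count, InstanceCount, FirstIndex;
    public int BaseVertex;
    public uint BaseInstance;
}

// Hypothetical stand-in for InstanceGroup, reduced to what the builder needs.
public sealed record GroupSketch(
    uint IndexCount, uint FirstIndex, int BaseVertex,
    uint FirstInstance, uint InstanceCount,
    bool Transparent, float SortDistance);

public static class IndirectLayout
{
    public static (Deic[] Commands, int OpaqueCount, int TransparentCount, int TransparentByteOffset)
        Build(IReadOnlyList<GroupSketch> groups)
    {
        // Opaque section: front-to-back by SortDistance (OrderBy is a stable sort).
        var opaque = groups.Where(g => !g.Transparent && g.InstanceCount > 0)
                           .OrderBy(g => g.SortDistance)
                           .ToList();
        // Transparent section: classification order, no sort. Empty groups are skipped.
        var transparent = groups.Where(g => g.Transparent && g.InstanceCount > 0).ToList();

        Deic[] commands = opaque.Concat(transparent).Select(g => new Deic
        {
            Count = g.IndexCount,
            InstanceCount = g.InstanceCount,
            FirstIndex = g.FirstIndex,
            BaseVertex = g.BaseVertex,
            BaseInstance = g.FirstInstance,
        }).ToArray();

        // Byte offset of the transparent section inside the single indirect buffer.
        int transparentByteOffset = opaque.Count * Marshal.SizeOf<Deic>();
        return (commands, opaque.Count, transparent.Count, transparentByteOffset);
    }
}
```

A batch array built by walking the same `opaque.Concat(transparent)` sequence stays index-aligned with the commands, which is what per-draw indexing in the shader relies on.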

Diagnostic timing (under ACDREAM_WB_DIAG=1):

CPU: Stopwatch started at the top of Draw(), stopped at the bottom. Median + 95th-percentile flushed in the 5-second [WB-DIAG] rollup.
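GPU: glGenQueries two query objects (one for opaque, one for transparent). glBeginQuery(TIME_ELAPSED) / glEndQuery around each glMultiDrawElementsIndirect. Result polled without stalling at the start of the next frame: check GL_QUERY_RESULT_AVAILABLE and read GL_QUERY_RESULT only when it is set (GL_QUERY_RESULT_NO_WAIT comes from ARB_query_buffer_object / GL 4.4, above the 4.3 baseline); if not ready, drop the sample and try again.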
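
The median and 95th-percentile figures can be produced by a small sample window that is flushed with each rollup. The following is a sketch under assumptions: `TimingRollup` is a hypothetical name, the nearest-rank percentile definition is a choice made here rather than something the spec fixes, and the output mirrors the `cpu_us=Xmedian/Y95p` shape used in §8.3.

```csharp
using System;
using System.Collections.Generic;

public sealed class TimingRollup
{
    private readonly List<double> _samplesUs = new List<double>();

    // Record one per-frame sample, in microseconds.
    public void Add(double microseconds) => _samplesUs.Add(microseconds);

    // Nearest-rank percentile over the current window; p is in (0, 100].
    public double Percentile(double p)
    {
        if (_samplesUs.Count == 0) return 0.0;
        double[] sorted = _samplesUs.ToArray();
        Array.Sort(sorted);
        int rank = (int)Math.Ceiling(p * sorted.Length / 100.0);
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }

    // Formats one rollup fragment (e.g. "cpu_us=50median/9595p") and clears the window.
    public string Flush(string label)
    {
        string line = $"{label}={Percentile(50):F0}median/{Percentile(95):F0}95p";
        _samplesUs.Clear();
        return line;
    }
}
```

A dropped GPU sample simply never reaches `Add`, so the window for `gpu_us` may hold fewer entries than the one for `cpu_us` in the same rollup.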

4.3 New shader files

src/AcDream.App/Shaders/mesh_modern.vert:

#version 430 core
#extension GL_ARB_bindless_texture : require
#extension GL_ARB_shader_draw_parameters : require

layout(location = 0) in vec3 aPosition;
layout(location = 1) in vec3 aNormal;
layout(location = 2) in vec2 aTexCoord;

struct InstanceData {
    mat4 transform;
    // Reserved for Phase B.4 follow-up (selection-blink retail-faithful highlight):
    //   vec4 highlightColor;  // RGBA — when non-zero alpha, fragment shader mixes into output.
    // Add field here, increase stride to 80 bytes, and read at fragment via flat varying.
};

struct BatchData {
    uvec2 textureHandle;   // bindless handle for sampler2DArray
    uint  textureLayer;    // layer index (always 0 for per-instance composites)
    uint  flags;           // reserved for future use
};

layout(std430, binding = 0) readonly buffer InstanceBuffer {
    InstanceData Instances[];
};

layout(std430, binding = 1) readonly buffer BatchBuffer {
    BatchData Batches[];
};

layout(std140, binding = 1) uniform LightingUbo {
    vec4 uAmbient;
    vec4 uSunDir;
    vec4 uSunColor;
    // matches existing acdream lighting UBO; do not change layout
};

uniform mat4 uViewProjection;
uniform int  uRenderPass;     // 0=opaque, 1=transparent (consumed in fragment shader)

out vec3 vNormal;
out vec2 vTexCoord;
out flat uvec2 vTextureHandle;
out flat uint  vTextureLayer;

void main() {
    int instanceIndex = gl_BaseInstanceARB + gl_InstanceID;
    mat4 model = Instances[instanceIndex].transform;

    vec4 worldPos = model * vec4(aPosition, 1.0);
    gl_Position = uViewProjection * worldPos;

    vNormal = normalize(mat3(model) * aNormal);
    vTexCoord = aTexCoord;

    BatchData b = Batches[gl_DrawIDARB];
    vTextureHandle = b.textureHandle;
    vTextureLayer = b.textureLayer;
}

src/AcDream.App/Shaders/mesh_modern.frag:

#version 430 core
#extension GL_ARB_bindless_texture : require

in vec3 vNormal;
in vec2 vTexCoord;
in flat uvec2 vTextureHandle;
in flat uint  vTextureLayer;

layout(std140, binding = 1) uniform LightingUbo {
    vec4 uAmbient;
    vec4 uSunDir;
    vec4 uSunColor;
};

uniform int uRenderPass;

out vec4 FragColor;

void main() {
    sampler2DArray tex = sampler2DArray(vTextureHandle);
    vec4 color = texture(tex, vec3(vTexCoord, float(vTextureLayer)));

    if (uRenderPass == 0) {
        // Opaque pass: discard soft pixels (alpha cutout), write to depth
        if (color.a < 0.95) discard;
    } else {
        // Transparent pass: discard hard pixels (already drawn opaque), no depth write
        if (color.a >= 0.95) discard;
        if (color.a < 0.05) discard;  // skip totally-empty fragments — perf for large transparent overdraw
    }

    // Diffuse lighting (preserved from acdream's existing lighting model)
    vec3 N = normalize(vNormal);
    vec3 L = normalize(uSunDir.xyz);
    float diff = max(dot(N, L), 0.0);
    vec3 lit = uAmbient.rgb + uSunColor.rgb * diff;
    color.rgb *= clamp(lit, 0.0, 1.0);

    FragColor = color;
}

Differences from WB's StaticObjectModern.*:

Drops uActiveCells[] cell-filtering (acdream culls cells on CPU).
Drops uDrawIDOffset (acdream issues full passes, no pagination).
Drops uHighlightColor (deferred to Phase B.4 follow-up; reserved as per-instance highlightColor field, not a global uniform).
Adapts the lighting model to acdream's existing UBO at binding=1 instead of WB's SceneData UBO.
Uses 1-layer sampler2DArray for ALL textures (WB uses multi-layer atlases — same shader works for both shapes).

5. Per-frame data flow walk-through

A concrete trace. Visible work for frame N:

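| Group | GfxObj | Surface | Translucency | Instances |
|-------|--------|---------|--------------|-----------|
| 0 | oak tree | bark | Opaque | 12 |
| 1 | oak tree | leaves | AlphaBlend | 12 |
| 2 | drudge | skin (palette override) | Opaque | 1 |
| 3 | drudge | eyes | Opaque | 1 |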

Instance SSBO (binding=0), 26 entries (each batch contributes its own copy of the entity matrix):

[0..11]   = oak instance matrices (group 0 — bark)
[12..23]  = oak instance matrices (group 1 — leaves)
[24]      = drudge instance matrix (group 2 — skin)
[25]      = drudge instance matrix (group 3 — eyes)

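Batch SSBO (binding=1), 4 entries indexed by gl_DrawIDARB. Entries follow indirect-command order (opaque section, then transparent), not group-table order, because gl_DrawIDARB is the index of a draw within its glMultiDrawElementsIndirect call:

Batches[0] = (oak_bark_handle,   layer=0, flags=0)                 // opaque draw 0
Batches[1] = (drudge_skin_handle_with_palette, layer=0, flags=0)   // opaque draw 1
Batches[2] = (drudge_eyes_handle, layer=0, flags=0)                // opaque draw 2
Batches[3] = (oak_leaves_handle, layer=0, flags=0)                 // transparent draw 0

In the transparent call gl_DrawIDARB starts again at 0, so that pass has to read Batches[_opaqueDrawCount + gl_DrawIDARB] (here Batches[3]); the §4.3 shader needs an offset input for this.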

Indirect buffer (single buffer, two sections):

_indirectBuffer[0..2]  = opaque section (3 entries, sorted front-to-back)
  [0] = (count=oakBarkIdx, instanceCount=12, firstIndex=oakBarkFI, baseVertex=oakBV, baseInstance=0)
  [1] = (count=drudgeSkinIdx, instanceCount=1, firstIndex=drudgeSkinFI, baseVertex=drudgeBV, baseInstance=24)
  [2] = (count=drudgeEyesIdx, instanceCount=1, firstIndex=drudgeEyesFI, baseVertex=drudgeBV, baseInstance=25)

_indirectBuffer[3]     = transparent section (1 entry)
  [3] = (count=oakLeavesIdx, instanceCount=12, firstIndex=oakLeavesFI, baseVertex=oakBV, baseInstance=12)

_opaqueDrawCount = 3; _transparentDrawCount = 1; _transparentByteOffset = 3 * sizeof(DEIC) = 60.

Shader access pattern (per vertex):

int instanceIndex = gl_BaseInstanceARB + gl_InstanceID;     // unique per (group, instance) pair
mat4 model = Instances[instanceIndex].transform;
BatchData b = Batches[gl_DrawIDARB];                         // shared across all verts in this draw
sampler2DArray tex = sampler2DArray(b.textureHandle);
vec4 color = texture(tex, vec3(aTexCoord, float(b.textureLayer)));

Per-frame CPU GL calls (entity rendering, total):

3× glBufferData (instance SSBO, batch SSBO, indirect buffer).
1× glBindVertexArray(globalVAO).
2× glBindBufferBase (SSBOs at bindings 0 + 1).
1× glBindBuffer(DRAW_INDIRECT_BUFFER, _indirectBuffer).
2× glMultiDrawElementsIndirect (one opaque, one transparent).
~5 state changes (blend, depth mask, render pass uniform).

Total: ~15-20 GL calls per frame for entity rendering, regardless of group count. The N.4 baseline is ~100-300 (§1).

6. Translucent rendering detail

Per Decision 2: WB's two-pass alpha-test pattern.

Group classification. ClassifyBatches puts groups into one of two arrays:

Opaque indirect: TranslucencyKind.Opaque and TranslucencyKind.ClipMap.
Transparent indirect: TranslucencyKind.AlphaBlend, Additive, InvAlpha all merged. Per Decision 2, additive renders as alpha-blend; falsifiable at visual verification.

Opaque groups stay sorted front-to-back by SortDistance (preserved from N.4 — depth-test reject of overdrawn fragments is a meaningful win on dense scenes).
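
The classification rule and the per-pass discard thresholds are small enough to state as code. This is a sketch, not the ClassifyBatches implementation: the enum lists only the TranslucencyKind members named in this section (the real enum may have more), and `RenderPass`, `PassClassifier` and `SurvivesAlphaTest` are hypothetical names.

```csharp
// Only the members named in this section; an assumption about the real enum.
public enum TranslucencyKind { Opaque, ClipMap, AlphaBlend, Additive, InvAlpha }

// Values match the uRenderPass uniform: 0 = opaque, 1 = transparent.
public enum RenderPass { Opaque = 0, Transparent = 1 }

public static class PassClassifier
{
    // Opaque and ClipMap go to the opaque indirect section; everything else is merged
    // into the transparent section (Additive renders as alpha-blend per Decision 2).
    public static RenderPass Classify(TranslucencyKind kind) => kind switch
    {
        TranslucencyKind.Opaque or TranslucencyKind.ClipMap => RenderPass.Opaque,
        _ => RenderPass.Transparent,
    };

    // CPU mirror of the fragment shader's discard rules, handy for test fixtures.
    public static bool SurvivesAlphaTest(RenderPass pass, float alpha) =>
        pass == RenderPass.Opaque
            ? alpha >= 0.95f                    // opaque pass keeps hard pixels only
            : alpha < 0.95f && alpha >= 0.05f;  // transparent pass keeps soft, non-empty pixels
}
```

Splitting Additive into its own bucket later would only add a third `RenderPass` value and one more arm in the switch.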

Pass GL state:

// Opaque pass
_gl.Disable(EnableCap.Blend);
_gl.DepthMask(true);
_gl.Enable(EnableCap.CullFace); _gl.CullFace(TriangleFace.Back); _gl.FrontFace(FrontFaceDirection.Ccw);
_shader.SetInt("uRenderPass", 0);
_gl.BindBuffer(BufferTargetARB.DrawIndirectBuffer, _indirectBuffer);
_gl.MultiDrawElementsIndirect(PrimitiveType.Triangles, DrawElementsType.UnsignedShort,
    indirect: (void*)0, drawcount: _opaqueDrawCount, stride: (uint)sizeof(DEIC));

// Transparent pass
_gl.Enable(EnableCap.Blend);
_gl.BlendFunc(BlendingFactor.SrcAlpha, BlendingFactor.OneMinusSrcAlpha);
_gl.DepthMask(false);
_shader.SetInt("uRenderPass", 1);
_gl.MultiDrawElementsIndirect(PrimitiveType.Triangles, DrawElementsType.UnsignedShort,
    indirect: (void*)_transparentByteOffset, drawcount: _transparentDrawCount, stride: (uint)sizeof(DEIC));

// Cleanup
_gl.DepthMask(true); _gl.Disable(EnableCap.Blend); _gl.BindVertexArray(0);

Visual verification gate (additive fallback plan). During Week 2-3 visual verification, look at:

Holtburg courtyard, dungeon entrance — confirm scenery + characters identical.
Foundry interior — magic-themed content with potentially additive-flagged surfaces.
Any glowing weapon decals, magical aura effects, or self-luminous textures observed.

If a visible regression appears (faded glow, missing additive bloom): amend spec to add a third indirect call within the transparent pass with glBlendFunc(SrcAlpha, One). Group classification splits Additive into its own bucket. ~30-min change.

7. Error handling and fallback

7.1 GPU capability detection

WB's OpenGLGraphicsDevice already detects:

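HasOpenGL43 (required for SSBOs and multi-draw indirect; base-instance draws have been core since 4.2, while the gl_BaseInstanceARB shader built-in needs the separate extension check below).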
HasBindless (required for bindless texture handles).

WbDrawDispatcher is only constructed when WbFoundationFlag.Enabled is true, which gates on _useModernRendering = HasOpenGL43 && HasBindless. We inherit WB's gating.

Additional check: GL_ARB_shader_draw_parameters (for gl_BaseInstanceARB, gl_DrawIDARB). Standard on GL 4.6, available as extension on 4.3+. Add to N.5's capability check; if missing, WbDrawDispatcher constructor logs a one-time warning and the foundation flag flips off (falls back to InstancedMeshRenderer).

7.2 Shader compile failure

If mesh_modern.vert/.frag fails to compile (driver bug, GLSL version mismatch, extension issue): catch the compile exception in WbDrawDispatcher constructor, log the GLSL info log + GPU vendor/renderer string ONCE, flip WbFoundationFlag.Enabled = false for the session, fall back to InstancedMeshRenderer. Do not crash.

7.3 Non-resident handle (the bindless foot-gun)

Sampling a non-resident handle causes undefined behavior (driver-dependent: black texture, GPU fault, device-lost).

Mitigation in code: TextureCache.MakeResidentHandle is the only API that produces a handle, and it makes the handle resident in the same call. There is no API surface that produces a non-resident handle. Defense-in-depth: dispatcher asserts BindlessTextureHandle != 0 before queuing a draw (zero handles get filtered out, same as zero surfaceId does today).

7.4 Indirect command corruption

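count, firstIndex, baseVertex come from WB's ObjectRenderBatch (never user input; WB-internal correctness). instanceCount is grp.Matrices.Count (we control). baseInstance is grp.FirstInstance (we control, computed cumulatively). Bug-class is "WB-internal corruption + our cumulative-offset bug", the same surface area that N.4's BaseInstance path already trusts. Add a debug-build assertion: in matrix-layout order (before the opaque/transparent split and the opaque sort), each group's FirstInstance must equal the previous group's FirstInstance + InstanceCount, so the values are strictly increasing there. Once commands are split and sorted into the indirect buffer they are no longer monotonic (the §5 trace reads 0, 24, 25, 12), so the check cannot run over the final command array.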
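
One way to phrase that debug assertion is as a contiguity check over (FirstInstance, InstanceCount) pairs, assuming it runs in matrix-layout order, before commands are reordered into the indirect buffer. `BaseInstanceCheck` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

public static class BaseInstanceCheck
{
    // groups: (FirstInstance, InstanceCount) pairs in matrix-layout order.
    // Returns true when the ranges tile [0, totalInstances) with no gaps or overlaps.
    public static bool IsContiguous(IReadOnlyList<(int First, int Count)> groups, int totalInstances)
    {
        int expected = 0;
        foreach (var (first, count) in groups)
        {
            // Empty groups should have been skipped; each range starts where the last ended.
            if (count <= 0 || first != expected) return false;
            expected += count;
        }
        return expected == totalInstances;
    }
}
```

Contiguity is stronger than "strictly increasing": it also catches a gap that would leave stale matrices between two groups.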

7.5 Disposal order

Bindless handles must be made non-resident before their underlying textures are deleted (driver UB otherwise). The handles live in TextureCache, so TextureCache.Dispose does this:

Iterate _bindlessHandlesByGlName.Values, call glMakeTextureHandleNonResidentARB(handle).
Optionally call a project-side batch helper (_glExtensions.MakeAllNonResidentARB) if one exists; GL_ARB_bindless_texture itself defines only the per-handle call, so such a helper would wrap the same loop.
Then glDeleteTextures proceeds as today.

WbDrawDispatcher.Dispose deletes the dispatcher's own buffers (_instanceSsbo, _batchSsbo, _indirectBuffer) via glDeleteBuffers.

7.6 Persistent first-failure diagnostic

If shader compile fails OR an extension check fails OR glGetError reports GL_INVALID_OPERATION after the first frame's glMultiDrawElementsIndirect: log ONCE with GPU vendor/renderer string + GLSL info log. Don't spam. User pastes the line into a bug report; we know exactly where to look.
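
The log-once behaviour can be sketched as a small latch shared by the three failure sites. This is an illustration only: `FirstFailureLatch` is a hypothetical name, the log line format is invented for the example, and the `FoundationEnabled` property merely stands in for flipping WbFoundationFlag.Enabled.

```csharp
using System;
using System.Threading;

public sealed class FirstFailureLatch
{
    private int _reported;   // 0 = nothing logged yet, 1 = already logged

    // Stand-in for WbFoundationFlag.Enabled in this sketch.
    public bool FoundationEnabled { get; private set; } = true;

    // Disables the modern path on every failure, but logs only the first one.
    // Returns true for the single call that actually wrote the log line.
    public bool Report(string reason, string gpuInfo, string infoLog, Action<string> log)
    {
        FoundationEnabled = false;
        if (Interlocked.Exchange(ref _reported, 1) != 0) return false;
        log($"[WB-DIAG] modern path disabled: {reason}; gpu={gpuInfo}; log={infoLog}");
        return true;
    }
}
```

Routing shader-compile, extension-check and first-frame GL errors through the same latch keeps the "one line per session" guarantee even if two of them fire.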

8. Testing and acceptance

8.1 Unit / conformance tests

TextureCacheBindlessTests — for each Bindless-suffixed GetOrUpload*: returns non-zero ulong, returns same handle for same key (cache hit), distinct keys yield distinct handles, returned handle is resident per GL state query.
WbDrawDispatcherIndirectBuilderTests — pure CPU test: given a fixture of (entity, mesh, batch) tuples, verify the indirect buffer layout: count / firstIndex / baseVertex / baseInstance per group, opaque section sorted front-to-back, transparent section in classification order (no sort — back-to-front sort can be added in a follow-up if measured useful).
WbDrawDispatcherTranslucencyTests — verify groups land in correct indirect buffer (opaque vs transparent) per TranslucencyKind. Additive/InvAlpha go to transparent. ClipMap goes to opaque. Empty groups skipped.
Existing N.4 tests stay green. All 60 tests captured by FullyQualifiedName~Wb|MatrixComposition filter remain at 60/0.

8.2 Visual verification

Same gate as N.4 used. Live ACE + retail dat, in-world testing.

Holtburg courtyard — characters + scenery + buildings render identically to N.4. No missing entities, no z-fighting, no exploded parts.
Foundry interior — dense static-object scene, stress-tests indirect call count and translucency classification.
Indoor → outdoor cell transition — confirms cell visibility filtering still works (we cull on CPU; dispatcher should never see invisible-cell entities).
Drudge / character close-up — confirms Issue #47 close-detail mesh preservation.
Magic content (additive fallback check) — Foundry runes, glowing weapons if observable, boss models with luminous decals. Trigger spec amendment if regression spotted.

User-confirms each. These are visual identity checks against the running N.4 behavior (use git stash of N.5 changes + relaunch as the comparison baseline).

8.3 Perf measurement (the win gate)

[WB-DIAG] augmented:

[WB-DIAG] entSeen=N entDrawn=M ... drawsIssued=K groups=G  (existing)
[WB-DIAG] cpu_us=Xmedian/Y95p gpu_us=Zmedian/W95p          (new)

Capture before/after numbers in fixed scenes/cameras:

| Scene | Camera position | Metric |
|-------|-----------------|--------|
| Holtburg courtyard | 30m elevated, looking SW | `cpu`, `gpu`, `drawsIssued` |
| Foundry interior | character spawn, default heading | `cpu`, `gpu`, `drawsIssued` |
| Open landscape | terrain wander, no entities | `cpu`, `gpu`, `drawsIssued` (sanity) |

Acceptance gates (paste into SHIP commit message):

Visual identity to N.4 — confirmed via §8.2.
CPU dispatcher time ≤ 70% of N.4 in Holtburg courtyard (target: ≥30% reduction).
GPU rendering time within ±10% of N.4 (sanity: no regression).
drawsIssued ≤ 5 per pass (down from "few hundred per pass").
All tests green — 60+ Wb tests + new bindless/indirect tests.
ACDREAM_USE_WB_FOUNDATION=0 still works — InstancedMeshRenderer fallback runs and renders correctly.
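
The three numeric gates can be checked mechanically from the captured medians. The sketch below is an assumption-laden illustration: `SceneTiming` and `AcceptanceGates` are hypothetical names, and the GPU gate is read one-sidedly (an N.5 GPU time faster than N.4 passes), which is an interpretation of "within ±10% (sanity: no regression)".

```csharp
using System;

// Hypothetical container for one scene's captured numbers.
public sealed record SceneTiming(double CpuUs, double GpuUs, int DrawsPerPass);

public static class AcceptanceGates
{
    public static bool Passes(SceneTiming n4, SceneTiming n5) =>
        n5.CpuUs <= 0.70 * n4.CpuUs     // CPU dispatcher time at most 70% of N.4
        && n5.GpuUs <= 1.10 * n4.GpuUs  // GPU time no more than 10% slower (assumed one-sided)
        && n5.DrawsPerPass <= 5;        // drawsIssued gate
}
```

For example, an N.4 capture of 1000 µs CPU and 2000 µs GPU is satisfied by an N.5 capture of 650 µs CPU, 2100 µs GPU and 1 draw per pass, while 800 µs CPU would fail the first gate.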

8.4 Long-session sanity check

Hour-long session with ACDREAM_WB_DIAG=1. Watch resident-handle count grow. Expected: bounded plateau under 5K once content set is fully traversed. If unbounded growth, residency policy revisit required in N.6.

9. Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Driver bug in bindless residency | Low (mature in 2025+ drivers) | Crash / black textures | One-time logging on first failure; legacy fallback under flag-off |
| Driver bug in `glMultiDrawElementsIndirect` | Low | GL_INVALID_OPERATION | Capability check + first-failure logging + fallback |
| Resident handle count exceeds driver limit in long session | Low (acdream content is bounded) | Cumulative GPU memory pressure → eventual eviction surprises | `[WB-DIAG]` resident-count log; revisit eviction in N.6 if it grows unbounded |
| Shader compile fails on weird GPU | Medium-low | First-launch failure | Compile-error catch + fallback to `InstancedMeshRenderer` |
| Additive fidelity regression on rare GfxObj surfaces | Medium | Subtle visual difference | Visual verification at magic-themed content; spec amendment for additive sub-pass if found |
| `baseInstance` failing to offset per-instance vertex attribs we still use | Low (we drop attribs entirely) | Wrong matrices | All instance data via SSBO; no vertex attrib at locations 3-6 to misalign |
| SSBO indexing GPU cost worse than uniform-array | Low (well-optimized in modern drivers) | Possible GPU time regression | GL timer queries detect; if observed, fall back to uniform array of bounded size |
| Persistent-mapped buffer foot-guns (chosen NOT to use in N.5) | n/a | n/a | Decision 7 defers to N.6 |
| Per-instance highlight (selection blink) feature creep | Low | Scope grows | Decision 8 defers; field reserved in design doc |

10. Out of scope (explicitly)

The following are NOT N.5 work. They become possible follow-ons.

WB's TextureAtlasManager adoption for atlas tier. N.5 keeps acdream's TextureCache as the texture owner for everything. Atlas adoption is N.6+ if memory pressure shows up.
Persistent-mapped buffer ring with sync fences. Decision 7. N.6 candidate if profiling shows residual glBufferData cost.
GPU-side culling (compute pre-pass). Future phase.
Texture array repacking for multi-layer per-instance composites. Future, if many palette-overrides actually share dimensions and could be packed.
Selection-blink highlight color. Decision 8. Phase B.4 follow-up. Field reserved in InstanceData design (extend stride to 80 bytes when implementing).
Deletion of legacy InstancedMeshRenderer. N.6.
Terrain wiring through WB. Future.

11. Open questions

None outstanding. All 8 brainstorm questions resolved + 1 clarification on highlight semantics. Ready for plan.

End of design.
