acdream/docs/research/2026-05-28-a8-env-cell-renderer-audit-findings.md
Erik 9559726960 fix(render): Phase A8 — pool aliasing in EnvCellRenderer (visual chaos root cause)
The post-Wave-5 indoor branch chaos (flickering, missing walls, GPU 100%,
~10 FPS) is caused by two interconnected pool-management bugs in
EnvCellRenderer that line-by-line WB comparison surfaced in 30 minutes.
Neither was found by the five post-Wave-5 speculative fixes because none
of them inspected the pool path.

Bug #1 — GetPooledList missing list.Clear():
The reuse branch returned pool lists with prior-frame data still inside.
PrepareRenderBatches' merge phase pattern `gfxDict[k] = list; list.AddRange(...)`
assumes empty lists. Without Clear(), lists grow unbounded each frame, GPU
draws cumulative instance counts, and per-instance transforms become a stew
of past + present data. Mirrors WB ObjectRenderManagerBase.cs:1221-1233.

Bug #2 — Render uses snapshot.BatchedByCell.Count instead of PostPreparePoolIndex:
The snapshot author dropped WB's PostPreparePoolIndex field calling it
"scenery-only," then "compensated" in Render by setting _poolIndex to the
cell count. The cell count has no relation to the pool — Prepare may have
used 50+ pool lists for an 18-cell scene. Render's filter-path GetPooledList
then returns lists that ARE in snapshot.BatchedByCell, corrupting the snapshot
mid-Render. Restoring PostPreparePoolIndex (WB VisibilitySnapshot.cs:31)
correctly places Render's pool cursor past the snapshot's owned region.

Bug #3 (minor) — PopulateRecursive hardcoded isSetup:false for nested parts:
Setup IDs use high-byte 0x02 (per retail). WB ObjectRenderManagerBase.cs:813
checks `(partId >> 24) == 0x02` to detect nested Setups. Our port always
passed isSetup:false, silently dropping any nested Setup (its TryGetRenderData
returns IsSetup=true, Render's `!IsSetup` guard skips the draw). Probably
rare in EnvCells but fixed for completeness.

Regression coverage:
- GetPooledList_ReusedList_IsClearedBeforeReturn — would have failed pre-fix
- GetPooledList_FreshList_IsAlwaysEmpty — sanity check
- Snapshot_PostPreparePoolIndex_IsInitSettable — compile-time guarantee
- Snapshot_PostPreparePoolIndex_DefaultsToZero — defensive default

86/86 App tests pass. Build green. The fix is the audit's primary
deliverable; the GL state probe option-1 apparatus follows in a separate
commit as defense-in-depth for any unidentified residual issue.

Full audit + WB cross-reference in
docs/research/2026-05-28-a8-env-cell-renderer-audit-findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 19:08:49 +02:00

9.9 KiB
Raw Blame History

Phase A8 — EnvCellRenderer line-by-line audit findings (2026-05-28 PM)

TL;DR

The post-Wave-5 visual chaos (flickering colors, missing walls, GPU 100%, ~10 FPS) is caused by two interconnected pool-management bugs in src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs. Both were introduced by the WB port author's "simplification" — dropping PostPreparePoolIndex from the snapshot (calling it scenery-only) and dropping list.Clear() from GetPooledList. Either bug alone is enough to corrupt rendering; together they explain every reported symptom precisely.

Neither bug was found by the five post-Wave-5 speculative fixes because none of them inspected the pool-management code path. The bug is in code the subagent wrote that nobody had read line-by-line against WB. The audit pattern in the handoff doc (Process retrospective §3 "Trust-but-verify on subagent work") found it in under 30 minutes once actually applied.

A third minor bug in PopulateRecursive is also documented for completeness.

Bug #1 — GetPooledList is missing list.Clear()

Code as shipped

src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs:959-969:

private List<InstanceData> GetPooledList()
{
    lock (_listPool)
    {
        if (_poolIndex < _listPool.Count) return _listPool[_poolIndex++];  // ← MISSING Clear()
        var fresh = new List<InstanceData>();
        _listPool.Add(fresh);
        _poolIndex++;
        return fresh;
    }
}

WB canonical implementation

references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectRenderManagerBase.cs:1221-1233:

protected List<InstanceData> GetPooledList() {
    lock (_listPool) {
        if (_poolIndex < _listPool.Count) {
            var list = _listPool[_poolIndex++];
            list.Clear();   // ← CRITICAL: clears stale data from prior frames
            return list;
        }
        var newList = new List<InstanceData>();
        _listPool.Add(newList);
        _poolIndex++;
        return newList;
    }
}

Effect of the bug

PrepareRenderBatches's merge phase pattern (EnvCellRenderer.cs:494-501) is:

if (!gfxDict.TryGetValue(gfxKvp.Key, out var list))
{
    list = GetPooledList();
    gfxDict[gfxKvp.Key] = list;
}
list.AddRange(gfxKvp.Value);  // ← appends to whatever was in list

Each frame, when GetPooledList returns a previously-used pool list:

  • That list STILL contains prior frames' instance data
  • list.AddRange(gfxKvp.Value) appends the new frame's data on top
  • Lists grow unbounded across frames
  • The GPU receives buffers with N frames' worth of instance data

After ~50 frames at 60Hz, every batch's instance count is ~50× what it should be. The GPU processes 50× more vertices per draw. Hence GPU 100% + low FPS.

Worse: the per-instance transforms in the appended data are STALE — they represent where instances were in prior frames, not now. Hence "flickering colors" (transforms appear/disappear randomly) and "missing walls" (instances whose latest data has been buried under stale entries).

Bug #2 — _poolIndex = snapshot.BatchedByCell.Count instead of snapshot.PostPreparePoolIndex

Code as shipped

src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs:614:

lock (_renderLock)
{
    var snapshot = _activeSnapshot;
    _shader.Use();
    _poolIndex = snapshot.BatchedByCell.Count; // reset point (mirrors WB line 405)  ← WRONG

And EnvCellVisibilitySnapshot.cs:11-13 (the snapshot definition):

/// The scenery-side VisibleGroups / VisibleGfxObjIds / IntersectingLandblocks
/// / PostPreparePoolIndex are dropped — we render scenery through
/// WbDrawDispatcher, not through this snapshot.

WB canonical implementation

references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/EnvCellRenderManager.cs:405:

_poolIndex = snapshot.PostPreparePoolIndex;

And references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/VisibilitySnapshot.cs:31:

public int PostPreparePoolIndex { get; init; }

Stored at EnvCellRenderManager.cs:367 during the Prepare→snapshot swap:

_activeSnapshot = new VisibilitySnapshot {
    BatchedByCell = newBatchedByCell,
    ...
    PostPreparePoolIndex = _poolIndex   // ← captured before _poolIndex reset
};

Effect of the bug

The snapshot author's reasoning ("PostPreparePoolIndex is scenery-side, we don't need it") is wrong. PostPreparePoolIndex has nothing to do with scenery — it's the pool-index high-water mark from Prepare's merge phase, which tells Render's filter-path where the pool's safe region begins.

Specifically:

  • After Prepare, _listPool[0..K-1] hold the data referenced by snapshot.BatchedByCell. K = _poolIndex after merge.
  • For Render to call GetPooledList safely (without trampling snapshot data), _poolIndex must be set to K, so new pool calls return _listPool[K..] — past the snapshot's region.

Our broken code sets _poolIndex = BatchedByCell.Count, which has NO RELATION to the pool's high-water mark. For a typical indoor scene at Holtburg with ~5-20 cells visible, BatchedByCell.Count ≈ 18, but Prepare may have used 50-100 pool lists (lists per unique (cellId, gfxObjId) combo).

When Render's filter-path (EnvCellRenderer.cs:684) calls GetPooledList, it returns a list at index 18 — which IS inside snapshot.BatchedByCell. With Bug #1 fixed (Clear added), Clear() wipes that snapshot list. With Bug #1 unfixed, AddRange corrupts it. Either way: snapshot data destroyed mid-Render.

Combined effect — why both bugs together explain the symptoms

Symptom Bug #1 alone Bug #2 alone Both together
GPU 100%, ~10 FPS YES (lists grow per frame) partial YES (compounding)
Flickering colors YES (stale + new transforms mixed) YES (snapshot lists overwritten mid-render) YES (both effects)
Missing walls YES (new instances buried under stale) YES (snapshot data wiped before draw) YES (both effects)
Black diagonal sliver (final report) YES (lists hit memory pressure / driver limits) YES (snapshot corruption produces degenerate transforms) YES

The "ColorMask alpha-bit off" / "cull-cache stale" / "double terrain" / "missing IndoorPass" fixes from the five post-Wave-5 attempts may all be real (some are real bugs, just not THE bug). They didn't help visually because the pool aliasing was always the dominant cause.

Bug #3 (minor) — PopulateRecursive always passes isSetup: false

Code as shipped

src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs:358:

foreach (var (partId, partTransform) in rd.SetupParts)
{
    var combined = partTransform * transform;
    PopulateRecursive(group, partId, isSetup: false, combined, cellId);  // ← always false
}

WB canonical implementation

references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectRenderManagerBase.cs:813:

foreach (var (partId, partTransform) in renderData.SetupParts) {
    PopulateRecursive(groups, partId, (partId >> 24) == 0x02, partTransform * transform, cellId, flags);
}

WB detects nested Setups by checking (partId >> 24) == 0x02 (Setup IDs use the 0x02 high byte in retail). Our code always treats child parts as plain GfxObjs.

Effect

For most cells in Holtburg, this doesn't matter — typical cells reference simple GfxObj parts (high byte 0x01). But if any Setup contains a nested Setup, our code adds a stub InstanceData with the nested Setup's ID into the group dict, then TryGetRenderData(nestedSetupId) returns a Setup record, and Render's if (renderData != null && !renderData.IsSetup) check fails → instance silently dropped. Some renderings missing in specific cases but not the catastrophic chaos.

Fixed as a defense-in-depth measure to match WB exactly.

Fix plan

Required (closes Bugs #1 + #2):

  1. EnvCellRenderer.cs:959-969 — Add list.Clear() before return in GetPooledList reuse branch.
  2. EnvCellVisibilitySnapshot.cs — Add public int PostPreparePoolIndex { get; init; }.
  3. EnvCellRenderer.cs:523-534 — Capture _poolIndex into snapshot's PostPreparePoolIndex field during the atomic swap.
  4. EnvCellRenderer.cs:614 — Read snapshot.PostPreparePoolIndex instead of snapshot.BatchedByCell.Count.
  1. EnvCellRenderer.cs:358isSetup: (partId >> 24) == 0x02 to correctly detect nested Setups.

Apparatus (safety net for any unidentified bug):

  1. GL state assertion probe extension — extend EmitDrawOrderProbe to log full GL state at each step boundary and compare against WB-expected state. Lifts cost of finding the NEXT bug (if any) from a multi-hour speculation cycle to a one-launch evidence cycle.

Regression coverage

The pool aliasing bug is hard to unit-test without a GL context (Render is the mutation site, and Render requires a shader and global mesh buffer). A targeted test for the simpler invariant — GetPooledList returns a cleared list — is sufficient: it directly verifies Bug #1's fix and provides a guard against future regressions. Bug #2 is verified by the visual gate.

Process retrospective

The "trust-but-verify subagent work" item in the handoff doc's Process retrospective §3 names exactly this failure mode: the subagent wrote EnvCellRenderer.cs (Wave 2, ~1013 LOC), build-test-green checks passed, but the audit never happened. The five post-Wave-5 speculative fixes were chasing symptoms because the audit step was skipped.

The cost of the missed audit: ~1 hour of speculation across five fix attempts, plus the user's five failed visual gates. The cost of the audit, when actually performed: ~30 minutes to find both root cause bugs via line-by-line comparison.

Rule for subagent-written code touching production paths: always read the diff line-by-line against the cited WB / retail source before merging. Bugs in subagent code that "build clean and pass tests" can still be 100%-wrong on the algorithm — the test coverage was not designed to catch them.