fix(render): Phase A8 — pool aliasing in EnvCellRenderer (visual chaos root cause)

The post-Wave-5 indoor branch chaos (flickering, missing walls, GPU 100%, ~10 FPS) is caused by two interconnected pool-management bugs in EnvCellRenderer that line-by-line WB comparison surfaced in 30 minutes. Neither was found by the five post-Wave-5 speculative fixes because none of them inspected the pool path. Bug #1 — GetPooledList missing list.Clear(): The reuse branch returned pool lists with prior-frame data still inside. PrepareRenderBatches' merge phase pattern `gfxDict[k] = list; list.AddRange(...)` assumes empty lists. Without Clear(), lists grow unbounded each frame, GPU draws cumulative instance counts, and per-instance transforms become a stew of past + present data. Mirrors WB ObjectRenderManagerBase.cs:1221-1233. Bug #2 — Render uses snapshot.BatchedByCell.Count instead of PostPreparePoolIndex: The snapshot author dropped WB's PostPreparePoolIndex field calling it "scenery-only," then "compensated" in Render by setting _poolIndex to the cell count. The cell count has no relation to the pool — Prepare may have used 50+ pool lists for an 18-cell scene. Render's filter-path GetPooledList then returns lists that ARE in snapshot.BatchedByCell, corrupting the snapshot mid-Render. Restoring PostPreparePoolIndex (WB VisibilitySnapshot.cs:31) correctly places Render's pool cursor past the snapshot's owned region. Bug #3 (minor) — PopulateRecursive hardcoded isSetup:false for nested parts: Setup IDs use high-byte 0x02 (per retail). WB ObjectRenderManagerBase.cs:813 checks `(partId >> 24) == 0x02` to detect nested Setups. Our port always passed isSetup:false, silently dropping any nested Setup (its TryGetRenderData returns IsSetup=true, Render's `!IsSetup` guard skips the draw). Probably rare in EnvCells but fixed for completeness. Regression coverage: - GetPooledList_ReusedList_IsClearedBeforeReturn — would have failed pre-fix - GetPooledList_FreshList_IsAlwaysEmpty — sanity check - Snapshot_PostPreparePoolIndex_IsInitSettable — compile-time guarantee - Snapshot_PostPreparePoolIndex_DefaultsToZero — defensive default 86/86 App tests pass. Build green. The fix is the audit's primary deliverable; the GL state probe option-1 apparatus follows in a separate commit as defense-in-depth for any unidentified residual issue. Full audit + WB cross-reference in docs/research/2026-05-28-a8-env-cell-renderer-audit-findings.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 19:08:49 +02:00 · 2026-05-27 19:08:49 +02:00 · 9559726960
commit 9559726960
parent 3d0ffaa794
4 changed files with 408 additions and 9 deletions
--- a/docs/research/2026-05-28-a8-env-cell-renderer-audit-findings.md
+++ b/docs/research/2026-05-28-a8-env-cell-renderer-audit-findings.md
@ -0,0 +1,260 @@
+# Phase A8 — EnvCellRenderer line-by-line audit findings (2026-05-28 PM)
+
+## TL;DR
+
+The post-Wave-5 visual chaos (flickering colors, missing walls, GPU 100%, ~10 FPS)
+is caused by **two interconnected pool-management bugs** in
+`src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs`. Both were introduced by the
+WB port author's "simplification" — dropping `PostPreparePoolIndex` from the
+snapshot (calling it scenery-only) and dropping `list.Clear()` from
+`GetPooledList`. Either bug alone is enough to corrupt rendering; together they
+explain every reported symptom precisely.
+
+Neither bug was found by the five post-Wave-5 speculative fixes because none
+of them inspected the pool-management code path. The bug is in code the
+subagent wrote that nobody had read line-by-line against WB. The audit pattern
+in the handoff doc (Process retrospective §3 "Trust-but-verify on subagent
+work") found it in under 30 minutes once actually applied.
+
+A third minor bug in `PopulateRecursive` is also documented for completeness.
+
+## Bug #1 — `GetPooledList` is missing `list.Clear()`
+
+### Code as shipped
+
+`src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs:959-969`:
+
+```csharp
+private List<InstanceData> GetPooledList()
+{
+    lock (_listPool)
+    {
+        if (_poolIndex < _listPool.Count) return _listPool[_poolIndex++];  // ← MISSING Clear()
+        var fresh = new List<InstanceData>();
+        _listPool.Add(fresh);
+        _poolIndex++;
+        return fresh;
+    }
+}
+```
+
+### WB canonical implementation
+
+`references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectRenderManagerBase.cs:1221-1233`:
+
+```csharp
+protected List<InstanceData> GetPooledList() {
+    lock (_listPool) {
+        if (_poolIndex < _listPool.Count) {
+            var list = _listPool[_poolIndex++];
+            list.Clear();   // ← CRITICAL: clears stale data from prior frames
+            return list;
+        }
+        var newList = new List<InstanceData>();
+        _listPool.Add(newList);
+        _poolIndex++;
+        return newList;
+    }
+}
+```
+
+### Effect of the bug
+
+`PrepareRenderBatches`'s merge phase pattern (`EnvCellRenderer.cs:494-501`) is:
+
+```csharp
+if (!gfxDict.TryGetValue(gfxKvp.Key, out var list))
+{
+    list = GetPooledList();
+    gfxDict[gfxKvp.Key] = list;
+}
+list.AddRange(gfxKvp.Value);  // ← appends to whatever was in list
+```
+
+Each frame, when `GetPooledList` returns a previously-used pool list:
+- That list STILL contains prior frames' instance data
+- `list.AddRange(gfxKvp.Value)` appends the new frame's data on top
+- Lists grow unbounded across frames
+- The GPU receives buffers with N frames' worth of instance data
+
+After ~50 frames at 60Hz, every batch's instance count is ~50× what it should
+be. The GPU processes 50× more vertices per draw. Hence GPU 100% + low FPS.
+
+Worse: the per-instance transforms in the appended data are STALE — they
+represent where instances were in prior frames, not now. Hence "flickering
+colors" (transforms appear/disappear randomly) and "missing walls" (instances
+whose latest data has been buried under stale entries).
+
+## Bug #2 — `_poolIndex = snapshot.BatchedByCell.Count` instead of `snapshot.PostPreparePoolIndex`
+
+### Code as shipped
+
+`src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs:614`:
+
+```csharp
+lock (_renderLock)
+{
+    var snapshot = _activeSnapshot;
+    _shader.Use();
+    _poolIndex = snapshot.BatchedByCell.Count; // reset point (mirrors WB line 405)  ← WRONG
+```
+
+And `EnvCellVisibilitySnapshot.cs:11-13` (the snapshot definition):
+
+```csharp
+/// The scenery-side VisibleGroups / VisibleGfxObjIds / IntersectingLandblocks
+/// / PostPreparePoolIndex are dropped — we render scenery through
+/// WbDrawDispatcher, not through this snapshot.
+```
+
+### WB canonical implementation
+
+`references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/EnvCellRenderManager.cs:405`:
+
+```csharp
+_poolIndex = snapshot.PostPreparePoolIndex;
+```
+
+And `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/VisibilitySnapshot.cs:31`:
+
+```csharp
+public int PostPreparePoolIndex { get; init; }
+```
+
+Stored at `EnvCellRenderManager.cs:367` during the Prepare→snapshot swap:
+
+```csharp
+_activeSnapshot = new VisibilitySnapshot {
+    BatchedByCell = newBatchedByCell,
+    ...
+    PostPreparePoolIndex = _poolIndex   // ← captured before _poolIndex reset
+};
+```
+
+### Effect of the bug
+
+The snapshot author's reasoning ("PostPreparePoolIndex is scenery-side, we
+don't need it") is **wrong**. PostPreparePoolIndex has nothing to do with
+scenery — it's the pool-index high-water mark from Prepare's merge phase,
+which tells Render's filter-path where the pool's safe region begins.
+
+Specifically:
+- After Prepare, `_listPool[0..K-1]` hold the data referenced by
+  `snapshot.BatchedByCell`. K = `_poolIndex` after merge.
+- For Render to call `GetPooledList` safely (without trampling
+  snapshot data), `_poolIndex` must be set to K, so new pool calls return
+  `_listPool[K..]` — past the snapshot's region.
+
+Our broken code sets `_poolIndex = BatchedByCell.Count`, which has NO RELATION
+to the pool's high-water mark. For a typical indoor scene at Holtburg with
+~5-20 cells visible, `BatchedByCell.Count` ≈ 18, but Prepare may have used
+50-100 pool lists (lists per unique (cellId, gfxObjId) combo).
+
+When Render's filter-path (`EnvCellRenderer.cs:684`) calls `GetPooledList`,
+it returns a list at index 18 — which IS inside `snapshot.BatchedByCell`. With
+Bug #1 fixed (Clear added), `Clear()` wipes that snapshot list. With Bug #1
+unfixed, `AddRange` corrupts it. Either way: snapshot data destroyed mid-Render.
+
+## Combined effect — why both bugs together explain the symptoms
+
+| Symptom | Bug #1 alone | Bug #2 alone | Both together |
+|---|---|---|---|
+| GPU 100%, ~10 FPS | YES (lists grow per frame) | partial | YES (compounding) |
+| Flickering colors | YES (stale + new transforms mixed) | YES (snapshot lists overwritten mid-render) | YES (both effects) |
+| Missing walls | YES (new instances buried under stale) | YES (snapshot data wiped before draw) | YES (both effects) |
+| Black diagonal sliver (final report) | YES (lists hit memory pressure / driver limits) | YES (snapshot corruption produces degenerate transforms) | YES |
+
+The "ColorMask alpha-bit off" / "cull-cache stale" / "double terrain" / "missing
+IndoorPass" fixes from the five post-Wave-5 attempts may all be real (some are
+real bugs, just not THE bug). They didn't help visually because the pool
+aliasing was always the dominant cause.
+
+## Bug #3 (minor) — `PopulateRecursive` always passes `isSetup: false`
+
+### Code as shipped
+
+`src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs:358`:
+
+```csharp
+foreach (var (partId, partTransform) in rd.SetupParts)
+{
+    var combined = partTransform * transform;
+    PopulateRecursive(group, partId, isSetup: false, combined, cellId);  // ← always false
+}
+```
+
+### WB canonical implementation
+
+`references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectRenderManagerBase.cs:813`:
+
+```csharp
+foreach (var (partId, partTransform) in renderData.SetupParts) {
+    PopulateRecursive(groups, partId, (partId >> 24) == 0x02, partTransform * transform, cellId, flags);
+}
+```
+
+WB detects nested Setups by checking `(partId >> 24) == 0x02` (Setup IDs use
+the 0x02 high byte in retail). Our code always treats child parts as plain
+GfxObjs.
+
+### Effect
+
+For most cells in Holtburg, this doesn't matter — typical cells reference
+simple GfxObj parts (high byte 0x01). But if any Setup contains a nested
+Setup, our code adds a stub InstanceData with the nested Setup's ID into the
+group dict, then `TryGetRenderData(nestedSetupId)` returns a Setup record,
+and Render's `if (renderData != null && !renderData.IsSetup)` check fails →
+instance silently dropped. Some renderings missing in specific cases but not
+the catastrophic chaos.
+
+Fixed as a defense-in-depth measure to match WB exactly.
+
+## Fix plan
+
+### Required (closes Bugs #1 + #2):
+
+1. **`EnvCellRenderer.cs:959-969`** — Add `list.Clear()` before return in
+   `GetPooledList` reuse branch.
+2. **`EnvCellVisibilitySnapshot.cs`** — Add `public int PostPreparePoolIndex { get; init; }`.
+3. **`EnvCellRenderer.cs:523-534`** — Capture `_poolIndex` into snapshot's
+   `PostPreparePoolIndex` field during the atomic swap.
+4. **`EnvCellRenderer.cs:614`** — Read `snapshot.PostPreparePoolIndex`
+   instead of `snapshot.BatchedByCell.Count`.
+
+### Recommended (closes Bug #3):
+
+5. **`EnvCellRenderer.cs:358`** — `isSetup: (partId >> 24) == 0x02` to
+   correctly detect nested Setups.
+
+### Apparatus (safety net for any unidentified bug):
+
+6. **GL state assertion probe extension** — extend `EmitDrawOrderProbe` to
+   log full GL state at each step boundary and compare against WB-expected
+   state. Lifts cost of finding the NEXT bug (if any) from a multi-hour
+   speculation cycle to a one-launch evidence cycle.
+
+## Regression coverage
+
+The pool aliasing bug is hard to unit-test without a GL context (Render is the
+mutation site, and Render requires a shader and global mesh buffer). A targeted
+test for the simpler invariant — `GetPooledList` returns a cleared list — is
+sufficient: it directly verifies Bug #1's fix and provides a guard against
+future regressions. Bug #2 is verified by the visual gate.
+
+## Process retrospective
+
+The "trust-but-verify subagent work" item in the handoff doc's Process
+retrospective §3 names exactly this failure mode: the subagent wrote
+`EnvCellRenderer.cs` (Wave 2, ~1013 LOC), build-test-green checks passed, but
+the audit never happened. The five post-Wave-5 speculative fixes were chasing
+symptoms because the audit step was skipped.
+
+The cost of the missed audit: ~1 hour of speculation across five fix attempts,
+plus the user's five failed visual gates.
+The cost of the audit, when actually performed: ~30 minutes to find both root
+cause bugs via line-by-line comparison.
+
+**Rule for subagent-written code touching production paths:** always read the
+diff line-by-line against the cited WB / retail source before merging. Bugs
+in subagent code that "build clean and pass tests" can still be 100%-wrong on
+the algorithm — the test coverage was not designed to catch them.