acdream/docs/research/2026-05-28-a8-env-cell-renderer-audit-findings.md
Erik 9559726960 fix(render): Phase A8 — pool aliasing in EnvCellRenderer (visual chaos root cause)
The post-Wave-5 indoor branch chaos (flickering, missing walls, GPU 100%,
~10 FPS) is caused by two interconnected pool-management bugs in
EnvCellRenderer that line-by-line WB comparison surfaced in 30 minutes.
Neither was found by the five post-Wave-5 speculative fixes because none
of them inspected the pool path.

Bug #1 — GetPooledList missing list.Clear():
The reuse branch returned pool lists with prior-frame data still inside.
PrepareRenderBatches' merge phase pattern `gfxDict[k] = list; list.AddRange(...)`
assumes empty lists. Without Clear(), lists grow unbounded each frame, GPU
draws cumulative instance counts, and per-instance transforms become a stew
of past + present data. Mirrors WB ObjectRenderManagerBase.cs:1221-1233.

Bug #2 — Render uses snapshot.BatchedByCell.Count instead of PostPreparePoolIndex:
The snapshot author dropped WB's PostPreparePoolIndex field calling it
"scenery-only," then "compensated" in Render by setting _poolIndex to the
cell count. The cell count has no relation to the pool — Prepare may have
used 50+ pool lists for an 18-cell scene. Render's filter-path GetPooledList
then returns lists that ARE in snapshot.BatchedByCell, corrupting the snapshot
mid-Render. Restoring PostPreparePoolIndex (WB VisibilitySnapshot.cs:31)
correctly places Render's pool cursor past the snapshot's owned region.

Bug #3 (minor) — PopulateRecursive hardcoded isSetup:false for nested parts:
Setup IDs use high-byte 0x02 (per retail). WB ObjectRenderManagerBase.cs:813
checks `(partId >> 24) == 0x02` to detect nested Setups. Our port always
passed isSetup:false, silently dropping any nested Setup (its TryGetRenderData
returns IsSetup=true, Render's `!IsSetup` guard skips the draw). Probably
rare in EnvCells but fixed for completeness.

Regression coverage:
- GetPooledList_ReusedList_IsClearedBeforeReturn — would have failed pre-fix
- GetPooledList_FreshList_IsAlwaysEmpty — sanity check
- Snapshot_PostPreparePoolIndex_IsInitSettable — compile-time guarantee
- Snapshot_PostPreparePoolIndex_DefaultsToZero — defensive default

86/86 App tests pass. Build green. The fix is the audit's primary
deliverable; the GL state probe option-1 apparatus follows in a separate
commit as defense-in-depth for any unidentified residual issue.

Full audit + WB cross-reference in
docs/research/2026-05-28-a8-env-cell-renderer-audit-findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 19:08:49 +02:00

260 lines
9.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase A8 — EnvCellRenderer line-by-line audit findings (2026-05-28 PM)
## TL;DR
The post-Wave-5 visual chaos (flickering colors, missing walls, GPU 100%, ~10 FPS)
is caused by **two interconnected pool-management bugs** in
`src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs`. Both were introduced by the
WB port author's "simplification" — dropping `PostPreparePoolIndex` from the
snapshot (calling it scenery-only) and dropping `list.Clear()` from
`GetPooledList`. Either bug alone is enough to corrupt rendering; together they
explain every reported symptom precisely.
Neither bug was found by the five post-Wave-5 speculative fixes because none
of them inspected the pool-management code path. The bug is in code the
subagent wrote that nobody had read line-by-line against WB. The audit pattern
in the handoff doc (Process retrospective §3 "Trust-but-verify on subagent
work") found it in under 30 minutes once actually applied.
A third minor bug in `PopulateRecursive` is also documented for completeness.
## Bug #1 — `GetPooledList` is missing `list.Clear()`
### Code as shipped
`src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs:959-969`:
```csharp
private List<InstanceData> GetPooledList()
{
lock (_listPool)
{
if (_poolIndex < _listPool.Count) return _listPool[_poolIndex++]; // ← MISSING Clear()
var fresh = new List<InstanceData>();
_listPool.Add(fresh);
_poolIndex++;
return fresh;
}
}
```
### WB canonical implementation
`references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectRenderManagerBase.cs:1221-1233`:
```csharp
protected List<InstanceData> GetPooledList() {
lock (_listPool) {
if (_poolIndex < _listPool.Count) {
var list = _listPool[_poolIndex++];
list.Clear(); // ← CRITICAL: clears stale data from prior frames
return list;
}
var newList = new List<InstanceData>();
_listPool.Add(newList);
_poolIndex++;
return newList;
}
}
```
### Effect of the bug
`PrepareRenderBatches`'s merge phase pattern (`EnvCellRenderer.cs:494-501`) is:
```csharp
if (!gfxDict.TryGetValue(gfxKvp.Key, out var list))
{
list = GetPooledList();
gfxDict[gfxKvp.Key] = list;
}
list.AddRange(gfxKvp.Value); // ← appends to whatever was in list
```
Each frame, when `GetPooledList` returns a previously-used pool list:
- That list STILL contains prior frames' instance data
- `list.AddRange(gfxKvp.Value)` appends the new frame's data on top
- Lists grow unbounded across frames
- The GPU receives buffers with N frames' worth of instance data
After ~50 frames at 60Hz, every batch's instance count is ~50× what it should
be. The GPU processes 50× more vertices per draw. Hence GPU 100% + low FPS.
Worse: the per-instance transforms in the appended data are STALE — they
represent where instances were in prior frames, not now. Hence "flickering
colors" (transforms appear/disappear randomly) and "missing walls" (instances
whose latest data has been buried under stale entries).
## Bug #2 — `_poolIndex = snapshot.BatchedByCell.Count` instead of `snapshot.PostPreparePoolIndex`
### Code as shipped
`src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs:614`:
```csharp
lock (_renderLock)
{
var snapshot = _activeSnapshot;
_shader.Use();
_poolIndex = snapshot.BatchedByCell.Count; // reset point (mirrors WB line 405) ← WRONG
```
And `EnvCellVisibilitySnapshot.cs:11-13` (the snapshot definition):
```csharp
/// The scenery-side VisibleGroups / VisibleGfxObjIds / IntersectingLandblocks
/// / PostPreparePoolIndex are dropped — we render scenery through
/// WbDrawDispatcher, not through this snapshot.
```
### WB canonical implementation
`references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/EnvCellRenderManager.cs:405`:
```csharp
_poolIndex = snapshot.PostPreparePoolIndex;
```
And `references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/VisibilitySnapshot.cs:31`:
```csharp
public int PostPreparePoolIndex { get; init; }
```
Stored at `EnvCellRenderManager.cs:367` during the Prepare→snapshot swap:
```csharp
_activeSnapshot = new VisibilitySnapshot {
BatchedByCell = newBatchedByCell,
...
PostPreparePoolIndex = _poolIndex // ← captured before _poolIndex reset
};
```
### Effect of the bug
The snapshot author's reasoning ("PostPreparePoolIndex is scenery-side, we
don't need it") is **wrong**. PostPreparePoolIndex has nothing to do with
scenery — it's the pool-index high-water mark from Prepare's merge phase,
which tells Render's filter-path where the pool's safe region begins.
Specifically:
- After Prepare, `_listPool[0..K-1]` hold the data referenced by
`snapshot.BatchedByCell`. K = `_poolIndex` after merge.
- For Render to call `GetPooledList` safely (without trampling
snapshot data), `_poolIndex` must be set to K, so new pool calls return
`_listPool[K..]` — past the snapshot's region.
Our broken code sets `_poolIndex = BatchedByCell.Count`, which has NO RELATION
to the pool's high-water mark. For a typical indoor scene at Holtburg with
~5-20 cells visible, `BatchedByCell.Count` ≈ 18, but Prepare may have used
50-100 pool lists (lists per unique (cellId, gfxObjId) combo).
When Render's filter-path (`EnvCellRenderer.cs:684`) calls `GetPooledList`,
it returns a list at index 18 — which IS inside `snapshot.BatchedByCell`. With
Bug #1 fixed (Clear added), `Clear()` wipes that snapshot list. With Bug #1
unfixed, `AddRange` corrupts it. Either way: snapshot data destroyed mid-Render.
## Combined effect — why both bugs together explain the symptoms
| Symptom | Bug #1 alone | Bug #2 alone | Both together |
|---|---|---|---|
| GPU 100%, ~10 FPS | YES (lists grow per frame) | partial | YES (compounding) |
| Flickering colors | YES (stale + new transforms mixed) | YES (snapshot lists overwritten mid-render) | YES (both effects) |
| Missing walls | YES (new instances buried under stale) | YES (snapshot data wiped before draw) | YES (both effects) |
| Black diagonal sliver (final report) | YES (lists hit memory pressure / driver limits) | YES (snapshot corruption produces degenerate transforms) | YES |
The "ColorMask alpha-bit off" / "cull-cache stale" / "double terrain" / "missing
IndoorPass" fixes from the five post-Wave-5 attempts may all be real (some are
real bugs, just not THE bug). They didn't help visually because the pool
aliasing was always the dominant cause.
## Bug #3 (minor) — `PopulateRecursive` always passes `isSetup: false`
### Code as shipped
`src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs:358`:
```csharp
foreach (var (partId, partTransform) in rd.SetupParts)
{
var combined = partTransform * transform;
PopulateRecursive(group, partId, isSetup: false, combined, cellId); // ← always false
}
```
### WB canonical implementation
`references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectRenderManagerBase.cs:813`:
```csharp
foreach (var (partId, partTransform) in renderData.SetupParts) {
PopulateRecursive(groups, partId, (partId >> 24) == 0x02, partTransform * transform, cellId, flags);
}
```
WB detects nested Setups by checking `(partId >> 24) == 0x02` (Setup IDs use
the 0x02 high byte in retail). Our code always treats child parts as plain
GfxObjs.
### Effect
For most cells in Holtburg, this doesn't matter — typical cells reference
simple GfxObj parts (high byte 0x01). But if any Setup contains a nested
Setup, our code adds a stub InstanceData with the nested Setup's ID into the
group dict, then `TryGetRenderData(nestedSetupId)` returns a Setup record,
and Render's `if (renderData != null && !renderData.IsSetup)` check fails →
instance silently dropped. Some renderings missing in specific cases but not
the catastrophic chaos.
Fixed as a defense-in-depth measure to match WB exactly.
## Fix plan
### Required (closes Bugs #1 + #2):
1. **`EnvCellRenderer.cs:959-969`** — Add `list.Clear()` before return in
`GetPooledList` reuse branch.
2. **`EnvCellVisibilitySnapshot.cs`** — Add `public int PostPreparePoolIndex { get; init; }`.
3. **`EnvCellRenderer.cs:523-534`** — Capture `_poolIndex` into snapshot's
`PostPreparePoolIndex` field during the atomic swap.
4. **`EnvCellRenderer.cs:614`** — Read `snapshot.PostPreparePoolIndex`
instead of `snapshot.BatchedByCell.Count`.
### Recommended (closes Bug #3):
5. **`EnvCellRenderer.cs:358`** — `isSetup: (partId >> 24) == 0x02` to
correctly detect nested Setups.
### Apparatus (safety net for any unidentified bug):
6. **GL state assertion probe extension** — extend `EmitDrawOrderProbe` to
log full GL state at each step boundary and compare against WB-expected
state. Lifts cost of finding the NEXT bug (if any) from a multi-hour
speculation cycle to a one-launch evidence cycle.
## Regression coverage
The pool aliasing bug is hard to unit-test without a GL context (Render is the
mutation site, and Render requires a shader and global mesh buffer). A targeted
test for the simpler invariant — `GetPooledList` returns a cleared list — is
sufficient: it directly verifies Bug #1's fix and provides a guard against
future regressions. Bug #2 is verified by the visual gate.
## Process retrospective
The "trust-but-verify subagent work" item in the handoff doc's Process
retrospective §3 names exactly this failure mode: the subagent wrote
`EnvCellRenderer.cs` (Wave 2, ~1013 LOC), build-test-green checks passed, but
the audit never happened. The five post-Wave-5 speculative fixes were chasing
symptoms because the audit step was skipped.
The cost of the missed audit: ~1 hour of speculation across five fix attempts,
plus the user's five failed visual gates.
The cost of the audit, when actually performed: ~30 minutes to find both root
cause bugs via line-by-line comparison.
**Rule for subagent-written code touching production paths:** always read the
diff line-by-line against the cited WB / retail source before merging. Bugs
in subagent code that "build clean and pass tests" can still be 100%-wrong on
the algorithm — the test coverage was not designed to catch them.