feat(render): Phase A8 — kill-switch ACDREAM_A8_INDOOR_BRANCH (default OFF)

After 5 visual-gate failures with speculative fixes that each addressed
plausible-looking symptoms without resolving the chaos (texture flicker,
missing walls, GPU 100%, ~10 FPS), this commit stops the speculation and
ships a kill-switch that reverts default behavior to pre-A8.

The user's verbatim authorization at session start said "no quickfixes
or fixes that might cause issues down the line ... no band-aids." The
post-Wave-5 fix stream WAS band-aids — each fix was pattern-matched
against possible RR7-era causes without confirming the actual root
cause from evidence. Five failures in a row is the signal to stop.

ACDREAM_A8_INDOOR_BRANCH gate:
- Unset or != "1" (DEFAULT): cameraInsideBuilding forced false. Outdoor
  Draw(All) path runs for indoor cells too. Pre-A8 depth-clear-if-inside
  workaround at line ~7314 is restored. Visual behavior = pre-A8.
- Set to "1": indoor branch (RenderInsideOutAcdream) runs. All A8 code
  exercises. Probes ([envcells]/[stencil]/[draworder]/[buildings]) emit.

All Phase A8 scaffolding (Waves 1-5 + post-Wave-5 fix commits) remains
in tree, accessible for the next-session apparatus to test against.
~1,830 LOC of WB-extracted infrastructure preserved.

Handoff doc at docs/research/2026-05-28-a8-wb-port-shipped-but-broken-handoff.md
captures the full chronicle: which fixes were applied, what each
visual-gate launch reported, the root-cause hypotheses tested and
falsified, the remaining unknowns, and the recommended apparatus
approaches (frame-replay harness / per-step GL state probe / WB-renderer
side-by-side / mesh-data audit).

Next session's mission: NO MORE LIVE LAUNCHES until apparatus is built.
This is the same trap the issue #98 saga fell into before the
trajectory-replay harness shipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Erik 2026-05-27 16:26:40 +02:00
parent 2bf5013c2f
commit 3d0ffaa794
2 changed files with 312 additions and 4 deletions

View file

@ -0,0 +1,293 @@
# Phase A8 WB port — scaffolding shipped, indoor branch broken — handoff (2026-05-28)
## TL;DR for next session
The Phase A8 WB RenderInsideOut port shipped across six waves of commits.
The indoor branch FIRES correctly per probe data, but renders incorrectly
on visual inspection (texture flicker, missing walls, GPU 100%, ~10 FPS).
Five post-Wave-5 speculative fixes did not resolve the visual breakage.
**Default behavior is now kill-switched back to pre-A8 (working build).
Set `ACDREAM_A8_INDOOR_BRANCH=1` to re-enable the indoor branch for
further investigation.**
Today's session shipped 16 commits on top of `3e9ff7a` (the RR7 revert
chain). The WB scaffolding is solid (RenderPass enum, Frustum, SceneryInstance/
EnvCellLandblock, VisibilityVisibilitySnapshot, EnvCellRenderer, IndoorCellStencilPipeline
extension). The render-frame integration is broken in a way that
required-more-iteration than this session had budget for.
**Next session's mission:** stop speculating, build the apparatus.
Either a deterministic replay harness for the render frame (record frame
inputs, replay offline, diff per-step GL state) or a side-by-side
comparison harness against WB's own renderer. The pattern is the same one
that finally cracked issue #98 after six months of failed speculative fixes.
User direction (verbatim, 2026-05-27): "no quickfixes or fixes that
might cause issues down the line ... use superpowers but DONT stop me
for questions, be perfect, no bandaids."
Today's session had no bandaids in the WB extraction work (Waves 1-5).
The five post-Wave-5 fix attempts WERE band-aids — each chased a
plausible-looking symptom without confirming root cause. After five in a
row failed, I stopped and shipped the kill-switch instead. That's the
process correction: an apparatus comes before the next attempt.
## What shipped today
| SHA | Wave | Description | Status |
|---|---|---|---|
| `fc68d6d` | 1 | WB scaffolding extraction — RenderPass enum, Frustum + WbBoundingBox + FrustumTestResult, SceneryInstance + EnvCellLandblock, EnvCellVisibilitySnapshot, IndoorCellStencilPipeline.RenderBuildingStencilMask low-level | **KEPT** |
| `f16b8e9` | 2 | EnvCellRenderer (WB EnvCellRenderManager port, 1013 LOC + inline RenderModernMDIInternal) | **KEPT** |
| `aad9ed4` | 2.1 | Shader API: GLSLShader → legacy Shader (Bind→Use, SetUniform→SetInt etc.) | **KEPT** |
| `4b4f687` | 3 | Wire EnvCellRenderer into landblock streaming (BuildInteriorEntitiesForStreaming + FinalizeLandblock + RemoveLandblock) | **KEPT** |
| `f9a644a` | 4 | RenderInsideOutAcdream byte-for-byte WB port in GameWindow.cs | **KEPT** |
| `8532c84` | 5 | Probes — [envcells]/[stencil]/[draworder]/[buildings] | **KEPT** |
| `0fc6003` | 5.1 | Stamp BuildingId on already-loaded cells (richer cellsByCellId dict to BuildingLoader.Build) | **KEPT** |
| `5d41876` | 5.2 | Normalize _buildingRegistries key (`& 0xFFFF0000u`) — RR7.2 saga, landed properly | **KEPT** |
| `9c59910` | 6 | Step 5 gate-off-by-default + GL state restore at cleanup | **KEPT** |
| `9ee42d4` | 7 | Invalidate `_currentVao` / `_currentCullMode` static caches at Render entry | **KEPT** |
| `f143ece` | 8 | Skip line-7200 terrain.Draw when cameraInsideBuilding (avoid double-terrain Z-fight) | **KEPT** |
| `2bf5013` | 9 | Render IndoorPass entities (IsBuildingShell building walls) between Step 3 and Step 4 | **KEPT** |
| `<this commit>` | 10 | **Kill-switch:** ACDREAM_A8_INDOOR_BRANCH gate; default OFF; pre-A8 depth-clear workaround restored when off | **KEPT** |
All ~1,830 LOC of new code remains in tree, accessible for the next-session apparatus to test against.
## Visual-gate chronicle
**Launch 1** (after Waves 1-5 + key/BuildingId fix at `5d41876`)
- `[buildings] camCell=0xA9B40143 camBldgs=[0x7] otherBldgs=109 totalKnown=110`
- `[envcells] cells=18 tris=74932 ourBldgs=1 otherBldgs=109 filterCnt=18`
- `[stencil] op=mark/punch bld=0x7 verts=39`
- `[draworder]` showed full Step 1 → 2 → 3 → 4 → 5{a,b,c,d} cycle
- Probes confirmed the WB pipeline firing — user verification gate triggered.
- User report: **"completely unplayable, textures flickering, can barely move, ACdream crashed especially indoor."**
**Launch 2** (after `9c59910` Step 5 disable + ColorMask restore at cleanup)
- User report: **"still chaos, GPU 100%"**
**Launch 3** (after `9ee42d4` cull-cache invalidation)
- User report: **"Cant see anything, flickering colors, sometimes I see textures and sometimes I see inside, the house is missing lots of walls. 10 FPS."**
**Launch 4** (after `f143ece` skip-double-terrain)
- User report: **"No difference"**
**Launch 5** (after `2bf5013` IndoorPass added)
- User screenshot showed: mostly fog-color screen with a thin black diagonal sliver. "Basically the same. Screen is flickering like this."
- This was the call to ship the kill-switch instead of continuing.
## Root-cause hypotheses tested and falsified
Each of these was a plausible single-cause hypothesis; the visual didn't
improve materially after fixing them. They may be PARTIALLY correct
(some of them ARE bugs the next session needs to keep fixed), but none
were THE cause of the visual chaos.
1. **Step 5 perf / TDR (commit `9c59910`)** — Step 5 was iterating 109
other-buildings/frame with no frustum cull, 545 GL draws/frame, plausible
driver TDR cause. Disabled → "still chaos, GPU 100%." Was a perf risk but
not the visual root cause.
2. **ColorMask alpha-bit off at cleanup (`9c59910`)** — WB exits with
ColorMask(t,t,t,FALSE). Our pipeline uses alpha-to-coverage; subsequent
passes need alpha writes. Fix didn't visually improve anything.
3. **`_currentCullMode` / `_currentVao` stale (`9ee42d4`)** — these are
STATIC caches in EnvCellRenderer; other consumers (dispatcher, terrain
renderer) change actual GL state without updating them, leaving the
cache lying. WB invalidates at Render entry; our port missed that.
Confirmed-real bug, fix didn't visually improve anything.
4. **Double-terrain draw (`f143ece`)** — line 7200 unconditional terrain
+ Step 4 stencil-gated terrain = terrain drawn twice indoor. Was a
perf doubler + Z-fight cause. Fixed; no visual improvement.
5. **Missing IndoorPass (`2bf5013`)** — pre-A8 indoor walls came from
`IsBuildingShell` entities rendered via `Draw(set: IndoorPass)`.
A8's Step 4 calls `Draw(set: OutdoorScenery)` which excludes
shells. EnvCellRenderer provides cell mesh (floor + CellStruct walls)
but landblock-baked exterior wall shells were never drawn. Fixed;
no visual improvement.
## What's STILL likely wrong
The remaining unknowns (any combination could be the visual root cause):
1. **EnvCellRenderer.RenderModernMDIInternal** — the inlined `RenderModernMDI`
extraction from `BaseObjectRenderManager.cs:709-848`. This is single-slot
(vs WB's 3-slot ring) and may have subtle differences from WB. Particularly
suspect: the buffer orphan-and-write pattern, the SSBO bind layout, the
`MemoryBarrier` placement, the `glMultiDrawElementsIndirect` offset
arithmetic. Without a per-draw comparison against WB output, we can't tell.
2. **`InstanceData.CellId` population** — our subagent stored cellId in
`EnvCellSceneryInstance.InstanceId` (a `uint`) because WB's ObjectId
struct wasn't easily portable. Whether `instance.CurrentPreviewCellId != 0
? instance.CurrentPreviewCellId : instance.InstanceId` resolves to the
same value WB expects needs verification — a subtle off-by-one or
wrong-bits here would cause the per-cell filter logic in PrepareRenderBatches
to misclassify which cells go in which group.
3. **Cell-mesh async upload race**`PrepareEnvCellGeomMeshDataAsync` is
fire-and-forget. By the time `Render` runs, `TryGetRenderData(cellGeomId)`
may return null for some cells (mesh not uploaded yet). Render silently
skips those, producing "missing walls until upload completes" — but
the user reports persistent flicker, not transient initial-load black.
4. **Shader uniform leakage between consumers**`_meshShader` is shared
between EnvCellRenderer and WbDrawDispatcher. Each consumer sets
uniforms (`uRenderPass`, `uDrawIDOffset`, `uHighlightColor`,
`uFilterByCell`). They might leak between calls in ways that subtly
affect output. The dispatcher's own `SetInt` calls should overwrite
but ordering matters.
5. **PrepareRenderBatches `filter == null` semantics** — we call
PrepareRenderBatches with `filter = null` at the top of the frame, then
call Render(filter: cellIds) inside Step 3. Per WB EnvCellRenderManager
line 282-285, when PrepareRenderBatches has a non-null filter, it
skips instances whose CellId isn't in the filter. We pass null so it
includes EVERYTHING. Then per-Render(filter:) the filter narrows for
that draw. This is the WB pattern. BUT in our PrepareRenderBatches we
ALWAYS pass null — even WB sometimes passes the visible-cell set into
PrepareRenderBatches. Worth checking if our approach causes the
per-Render filter to silently miss data.
6. **Building shells filter** — IsBuildingShell entities have no
ParentCellId, so they pass the visibleCellIds filter unconditionally
in the IndoorPass call we added. But the dispatcher's `EntityMatchesSet`
logic might filter them OUT in another path. Verify by adding a
`[indoorpass]` probe that counts how many IsBuildingShell entities
actually drew.
## Recommended next-session apparatus
This is the saga that finally needs the apparatus pattern. From
`docs/research/2026-05-23-a6-p3-issue98-comparison-harness-findings.md`
(the #98 saga that took six speculative fixes before the apparatus shipped):
> The right move is to build a deterministic apparatus that captures
> live state, replays it offline, and diffs per-step. Iteration in
> <500ms beats five-minute live-test cycles every time.
For A8, the analog apparatus is:
1. **Frame-replay harness for the indoor branch.** Capture every frame's
inputs to RenderInsideOutAcdream (viewProj, camPos, camera-buildings,
other-buildings, GL state on entry), persist as JSONL. Replay test
in xUnit: instantiate a headless GL context, run RenderInsideOutAcdream
on the captured inputs, snapshot the output framebuffer to PNG, diff
against an "expected" PNG. Five seconds of in-game capture = 300
frames = 300 replayable tests.
2. **Per-step GL state assertion probe.** Extend `EmitDrawOrderProbe` to
also log the full relevant GL state (cull mode, blend mode, polygon
offset, depth range, stencil func/op/mask, color mask). Compare
step-by-step against WB's expected state (lifted from
`references/WorldBuilder/.../VisibilityManager.cs:73-239` line by
line). Any divergence = the bug.
3. **Side-by-side with WB's own renderer.** WB renders Holtburg correctly.
Compile WB's own executable, point it at the same dat, capture a
screenshot from a known camera position inside a Holtburg cottage.
Compare against acdream's screenshot from the same camera. Pixel-diff
tells us how far off we are. If WB's screenshot also looks broken,
the bug is in our dat handling. If WB's looks right, the bug is in
our render-frame integration.
4. **Mesh-data audit.** Add a probe that, when ACDREAM_A8_AUDIT=1,
walks every cell in a camera-building and dumps:
- cellGeomId
- `_meshManager.TryGetRenderData(cellGeomId)` non-null?
- `renderData.Batches.Count`
- `renderData.Batches[i].IndexCount`, `BaseVertex`, `FirstIndex`
- `renderData.Batches[i].CullMode`, `IsTransparent`, `IsAdditive`,
`BindlessTextureHandle`, `TextureIndex`
If batches' IndexCount is 0 or `BindlessTextureHandle` is 0, the mesh
upload is incomplete. If everything is non-zero but visual is wrong,
the bug is in the draw path.
The apparatus pattern means: NO MORE LIVE LAUNCHES until the apparatus
can reproduce the bug deterministically. Five-minute launch cycles
masking the problem are exactly the trap that ate today's session.
## Process retrospective — what the next session should NOT do
These are the patterns from today that drove the session into the saga
trap:
1. **Treating "[probes fire correctly] but [visual broken]" as a small
gap to bridge with one more fix.** That gap was actually a large
integration bug under the surface; each speculative fix wedged
another fragment of behavior in without addressing it.
2. **Pattern-matching old RR7 saga symptoms to today's situation.** The
key/BuildingId fixes WERE real RR7-saga bugs that needed re-fixing.
But the post-key visual breakage was NEW — applying RR7-era fix
templates (ColorMask, depth-clear) without re-deriving from evidence
was lazy.
3. **Trust-but-verify on subagent work.** Both Task 5 (EnvCellRenderer
port) and Task 8 (render-frame integration) were dispatched to
subagents. I verified compile + tests-pass but did NOT line-by-line
read the diff for WB-faithfulness. The flicker bug may be in code I
never actually read.
4. **Counting [draworder] frames as "correctness".** Probe data showed
1,595 indoor frames with valid stencil/cells/buildings outputs. That
counted as "process complete" — but the visual still chaos. The
probes are NECESSARY but NOT SUFFICIENT.
## Files touched
| File | Status |
|---|---|
| `src/AcDream.App/Rendering/Wb/WbRenderPass.cs` | NEW (verbatim) |
| `src/AcDream.App/Rendering/Wb/WbFrustum.cs` | NEW (verbatim + tests) |
| `src/AcDream.App/Rendering/Wb/EnvCellSceneryInstance.cs` | NEW (verbatim with renames + tests) |
| `src/AcDream.App/Rendering/Wb/EnvCellVisibilitySnapshot.cs` | NEW (verbatim narrowed) |
| `src/AcDream.App/Rendering/Wb/EnvCellRenderer.cs` | NEW (1013 LOC, WB EnvCellRenderManager port + tests) |
| `src/AcDream.App/Rendering/Wb/WbMeshAdapter.cs` | MODIFIED (MeshManager accessor) |
| `src/AcDream.App/Rendering/IndoorCellStencilPipeline.cs` | MODIFIED (RenderBuildingStencilMask low-level + probe fields) |
| `src/AcDream.App/Rendering/CellVisibility.cs` | MODIFIED (GetCellsForLandblock for richer stamping dict) |
| `src/AcDream.App/Rendering/GameWindow.cs` | MODIFIED (extensive: ctor wiring, RegisterCell call site, RenderInsideOutAcdream method, probe emitters, kill-switch) |
| `src/AcDream.Core/Rendering/RenderingDiagnostics.cs` | MODIFIED (ProbeEnvCellEnabled flag) |
| `docs/superpowers/plans/2026-05-28-phase-a8-wb-render-inside-out-port.md` | NEW (plan committed at session start) |
| `docs/research/2026-05-28-a8-wb-port-shipped-but-broken-handoff.md` | NEW (this doc) |
## Kill-switch usage
To re-enable the indoor branch for investigation (next session):
```powershell
$env:ACDREAM_A8_INDOOR_BRANCH = "1" # enable the broken indoor branch
$env:ACDREAM_PROBE_VIS = "1" # emit [envcells]/[stencil]/[draworder]/[buildings]
$env:ACDREAM_A8_STEP5 = "1" # optional: enable Step 5 cross-building visibility
```
When `ACDREAM_A8_INDOOR_BRANCH` is unset or != "1":
- `cameraInsideBuilding` is forced false regardless of actual state
- The outdoor branch (`Draw(set: All)`) runs for indoor cells too
- The pre-A8 `if (cameraInsideCell) Clear(DepthBufferBit)` workaround is
restored
- Visual behavior matches pre-A8 baseline
- All A8 code remains in tree (probes silent, indoor branch unreachable)
## Pickup prompt for next session
> Read `docs/research/2026-05-28-a8-wb-port-shipped-but-broken-handoff.md`
> in full. Then read the kill-switch implementation in `GameWindow.cs`
> at the `cameraInsideBuilding` declaration (~line 7110) to see how
> default behavior is restored.
>
> Your mission: BUILD APPARATUS, not more speculative fixes. The
> apparatus options (frame-replay harness / per-step GL state probe /
> WB-renderer side-by-side / mesh-data audit) are listed in the
> "Recommended next-session apparatus" section. Pick one, build it,
> use it.
>
> Do NOT launch the client live with `ACDREAM_A8_INDOOR_BRANCH=1` more
> than ONCE before the apparatus is in place. The session's failure
> mode is exactly the pattern of "one more fix, then test live again"
> that ate six attempts in the #98 saga before the apparatus saved it.
>
> User authorization remains: "no questions, no band-aids." The
> apparatus build is NEITHER — it's the proper investigation path.