docs: Phase A8 — session 2 handoff (pool fix shipped + 4 partial fixes + residuals)

After 5 visual gates, the session shipped 5 commits closing real bugs
(pool aliasing was the catastrophic root cause), but residual symptoms
(transparent floor, texture warping, flickering, distortion) didn't
yield to surgical fixes. Per systematic-debugging skill's >=3-failures
rule, stop and capture state.

Doc covers:
- Pool aliasing root cause + fix (the big win — closes session-1's
  visual chaos).
- Sky-when-building, LiveDynamic, Landblock→None — all real bug closures.
- Apparatus state (GL state probe + per-cell audit + pool diagnostics).
- Three theories for the residual issues (FrontFace=CW global match to
  WB / per-poly Stippling audit / WB side-by-side render).
- Pickup prompt for next session with ranked options.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Erik 2026-05-27 20:32:22 +02:00
parent d5deeb3314
commit e415bb3863


@ -0,0 +1,218 @@
# Phase A8 — Session 2: pool fix shipped, 4 more fixes shipped, residual visuals remain (2026-05-28 PM)
## TL;DR for next session
The session-1 handoff said "BUILD APPARATUS, NOT MORE SPECULATIVE FIXES." I
built the apparatus (per-step GL state probe + per-cell mesh audit + pool
diagnostics) and, before using it, audited `EnvCellRenderer.cs` line by line
against the WB source. Within 30 minutes the audit found **two
high-confidence pool-aliasing bugs**, the root cause of the post-Wave-5
catastrophic visual chaos. The pool fix shipped (`9559726`) and the visual
went from "thin black diagonal sliver, GPU 100%, 10 FPS, can't see
anything" to "walls + objects + sky render cleanly, FPS normal."

Four more targeted fixes followed across visual gates #2-#5, five in total
counting the pool fix. The first four closed real bugs. The fifth (the
cull-restore revert) rested on a hypothesis that the `[draworder]` probe
data invalidated: gate #5 showed cull state was already off at Step 3,
before `EnvCellRenderer.Render` ran, so the propagation theory didn't apply.

**Per the systematic-debugging skill's `≥3-failures → question architecture`
rule, I stopped and wrote this handoff rather than ship a 6th speculative
fix.** The remaining symptoms (transparent floor, texture warping,
flickering, distortion) point to architecture-level issues that need a
different investigation approach.
## Visual progress chronicle
| Gate | Symptoms reported | Cause if known |
|------|-------------------|---|
| Pre-session (from session-1 handoff) | "Thin black diagonal sliver, GPU 100%, 10 FPS, can't see anything" | Pool aliasing (cleared by session-2 commit `9559726`) |
| Gate #1 (`375f9a7`, before the sky fix) | Walls + objects render, no flicker, FPS normal. No sky through windows. Char + doors missing. Floor missing. Purple tint on walls. | Pool fixed (huge win). LiveDynamic/sky/cull not yet addressed. |
| Gate #2 (sky fix + audit probe) | Sky visible through windows ✓. Char + doors still missing. Floor still missing. Purple still. | Sky fix worked. Audit dumped per-cell render data. |
| Gate #3 (LiveDynamic + cull-disable A/B) | Char + doors visible ✓. Floor sometimes visible. See-through-head (cull-off side effect). | LiveDynamic fix worked. Cull-disable proved cull was hiding floor. |
| Gate #4 (Landblock→None + cull-restore) | "BROKEN textures, floor is now transparent"; sky visible through floor | Hypothesis at the time: cull-restore at exit propagated back-face culling to the dispatcher's IndoorPass, culling the cottage shell's floor poly. Invalidated at gate #5. |
| Gate #5 (revert cull-restore) | "No change at all, textures warped, missing textures, floors transparent and flickering" | Revert didn't help: the `[draworder]` probe shows cull was already off at Step 3 entry, so removing the cull-restore at exit doesn't change the inherited state. |
## What's shipped this session
| SHA | Description | Status |
|-----|-------------|--------|
| `9559726` | Pool aliasing root cause fix (Clear + PostPreparePoolIndex + nested-Setup detection) + 4 regression tests + audit findings doc | **KEPT — closes the post-Wave-5 chaos** |
| `375f9a7` | Full GL state probe + pool diagnostics extension (option-1 apparatus) | **KEPT — apparatus** |
| `772d69c` | Sky-when-cameraInsideBuilding fix + per-cell audit probe | **KEPT — sky through windows works** |
| `b19f3c1` | LiveDynamic dispatcher call in indoor branch + ACDREAM_A8_DISABLE_CULL A/B gate | **KEPT — chars + doors visible inside** |
| `0940d79` | Cell-mesh Landblock CullMode → None + cull-state restore at exit | **PARTIALLY KEPT — Landblock→None is good; cull-restore was wrong (reverted in d5deeb3)** |
| `d5deeb3` | Revert cull-restore at EnvCellRenderer exit | **KEPT — leaves cull-off propagating** |
## What's still wrong (visual gate #5 state)
User-reported symptoms with kill-switch ON (`ACDREAM_A8_INDOOR_BRANCH=1`):
1. **Floor transparent** — sky color visible where floor should be. Cell
mesh has Landblock→None override that should render cell polys
double-sided, but the floor poly either (a) isn't in the upload, (b)
has wrong winding/orientation, or (c) is being rendered but z-fails or
alpha-discards.
2. **Texture warping** — vague but visible in screenshots. Some surfaces
show wrong texture or texture appears stretched/distorted.
3. **Flickering** — surfaces alternate between visible/invisible across
frames. Could be Z-fighting (cell mesh vs cottage shell at same depth),
alpha-test threshold instability, or animated camera causing
per-frame frustum-test results to differ.
4. **General distortion** — overall scene "looks broken." Possibly purple
tint on lighting (mentioned in gates #1-#3, not explicitly in #5).
## Apparatus state
These probes are wired in and activate when their env vars are set:
- `ACDREAM_PROBE_VIS=1` — emits `[draworder]` (per-step GL state),
`[stencil]` (per stencil mark/punch), `[buildings]` (camera-building
list), `[envcells]` (cells + tris + pool stats).
- `ACDREAM_A8_AUDIT=1` — one-shot per (cellId, gfxObjId) pair dump of
render data: batches count, total IndexCount, CullModes encountered,
IsTransparent + IsAdditive flags, `BindlessTextureHandle == 0` count.

Sample audit data captured at gate #2 (`a8-visual-gate-2.log`):
```
[a8-audit] cell=0xA9B4013F gfx=0x7F852B220B93AD instances=1 isSetup=False batches=4 totalIdx=144 cull=[Landblock] translucent=0 additive=0 zeroHandle=0
```
Every cell mesh batch has CullMode=Landblock (uniform). Render data
loads correctly (no nulls, no zero handles).
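
For triaging a longer capture, lines like the sample above can be split into fields mechanically. The following is a minimal sketch, assuming every field is a space-separated `key=value` token with no spaces inside the value (true of the sample, not verified for multi-entry `cull=[...]` lists). `A8AuditLine`, `TryParse`, and `NeedsAttention` are hypothetical names, not part of the codebase.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical log helper; not part of the acdream sources.
public static class A8AuditLine
{
    private const string Tag = "[a8-audit]";

    // Splits "[a8-audit] k1=v1 k2=v2 ..." into a key -> value map.
    // Assumes values contain no spaces, as in the gate-#2 sample.
    public static bool TryParse(string line, out Dictionary<string, string> fields)
    {
        fields = new Dictionary<string, string>(StringComparer.Ordinal);
        if (line == null || !line.StartsWith(Tag, StringComparison.Ordinal))
            return false;

        string[] tokens = line.Substring(Tag.Length)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string token in tokens)
        {
            int eq = token.IndexOf('=');
            if (eq > 0)
                fields[token.Substring(0, eq)] = token.Substring(eq + 1);
        }
        return fields.Count > 0;
    }

    // Flags entries that contradict the "uniform Landblock, no zero
    // handles" observation: a non-zero zeroHandle count or any other
    // cull list.
    public static bool NeedsAttention(IReadOnlyDictionary<string, string> fields)
    {
        bool zeroHandles = fields.TryGetValue("zeroHandle", out string z) && z != "0";
        bool otherCull = fields.TryGetValue("cull", out string c) && c != "[Landblock]";
        return zeroHandles || otherCull;
    }
}
```

Running `NeedsAttention` over every parsed line of a capture would surface any cell that breaks the uniform pattern without reading the log by eye.
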

Sample `[draworder]` data captured at gate #5 (`a8-visual-gate-5.log`):
```
[draworder] frame=155 step=3 stencil=off depthFn=0x201 depthMask=True cull=off(back) blend=0x302/0x303 sFunc=0x207:1:0xFF sOp=0x1E00/0x1E00/0x1E01 sMask=0x1 cMask=(RGB-) vao=0 prog=6
```
Cull is OFF at Step 3 entry (Step 1's `gl.Disable(EnableCap.CullFace)`
already disabled it; my cull-restore-at-exit revert had no effect on
incoming state).
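
The hex values in the `[draworder]` line are raw OpenGL enum constants. As a reading aid, here is a small sketch that maps the ones appearing in the sample to their symbolic names. The constant values come from the OpenGL headers; the `GlEnumNames` helper itself is hypothetical and covers only a handful of enums.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical log helper; not part of the acdream sources.
public static class GlEnumNames
{
    // Constant values as defined in the OpenGL headers.
    private static readonly Dictionary<int, string> Names = new Dictionary<int, string>
    {
        [0x0201] = "GL_LESS",
        [0x0203] = "GL_LEQUAL",
        [0x0207] = "GL_ALWAYS",
        [0x0302] = "GL_SRC_ALPHA",
        [0x0303] = "GL_ONE_MINUS_SRC_ALPHA",
        [0x0900] = "GL_CW",
        [0x0901] = "GL_CCW",
        [0x1E00] = "GL_KEEP",
        [0x1E01] = "GL_REPLACE",
    };

    // Maps one hex token such as "0x201" to its symbolic name.
    // Unknown or malformed tokens are returned unchanged.
    public static string Name(string hex)
    {
        string digits = hex.StartsWith("0x", StringComparison.OrdinalIgnoreCase)
            ? hex.Substring(2)
            : hex;
        if (int.TryParse(digits, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out int value)
            && Names.TryGetValue(value, out string name))
        {
            return name;
        }
        return hex;
    }

    // Decodes a slash-separated group such as "0x302/0x303".
    public static string Decode(string group)
    {
        string[] parts = group.Split('/');
        for (int i = 0; i < parts.Length; i++)
            parts[i] = Name(parts[i]);
        return string.Join("/", parts);
    }
}
```

Applied to the sample, `depthFn=0x201` is `GL_LESS`, `blend=0x302/0x303` is `GL_SRC_ALPHA/GL_ONE_MINUS_SRC_ALPHA`, the stencil function `0x207` is `GL_ALWAYS`, and `sOp=0x1E00/0x1E00/0x1E01` is `GL_KEEP/GL_KEEP/GL_REPLACE`.
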
## Root-cause analysis — why the speculative fixes can't close it
### Theory A: AC's polygon winding requires `glFrontFace(CW)`
WB sets `glFrontFace(GLEnum.CW)` globally at
[GameScene.cs:843](references/WorldBuilder/Chorizite.OpenGLSDLBackend/GameScene.cs:843).
Our `WbDrawDispatcher.cs:1056` sets `glFrontFace(CCW)` in the transparent
pass, with a comment claiming "our fan triangulation emits pos-side polys
as (0, i, i+1) — CCW." But the actual triangulation in
`BuildCellStructPolygonIndices` ([ObjectMeshManager.cs:1518-1586](src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs:1518))
emits `(i, i-1, 0)`, which has the opposite winding to `(0, i, i+1)`. The
comment does not describe the winding the code actually produces.

If AC's polys are wound CCW from their PosSurface side (the "front" side
in retail convention), our triangulation produces CW-from-PosSurface
triangles. WB's `FrontFace=CW` makes CW = front, so cull-back removes
the back side correctly. Our `FrontFace=CCW` makes CCW = front, so
cull-back removes the WRONG side — hiding polys whose PosSurface is
camera-facing.
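
The winding claim can be checked numerically on a flat polygon. This is a minimal sketch, assuming the fan loops run over `i = 1..n-2` and `i = 2..n-1` respectively; those bounds are illustrative and not copied from `BuildCellStructPolygonIndices`. `FanWinding` and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only; loop bounds are assumptions, not copied from
// BuildCellStructPolygonIndices.
public static class FanWinding
{
    // Fan order quoted in the dispatcher comment: (0, i, i+1).
    public static List<(int A, int B, int C)> Forward(int vertexCount)
    {
        var tris = new List<(int A, int B, int C)>();
        for (int i = 1; i + 1 < vertexCount; i++)
            tris.Add((0, i, i + 1));
        return tris;
    }

    // Fan order attributed to the cell-struct triangulation: (i, i-1, 0).
    public static List<(int A, int B, int C)> Reversed(int vertexCount)
    {
        var tris = new List<(int A, int B, int C)>();
        for (int i = 2; i < vertexCount; i++)
            tris.Add((i, i - 1, 0));
        return tris;
    }

    // Twice the signed area of a 2D triangle (y up): positive means
    // counter-clockwise, negative means clockwise.
    public static double SignedArea2(
        (double X, double Y) a, (double X, double Y) b, (double X, double Y) c) =>
        (b.X - a.X) * (c.Y - a.Y) - (c.X - a.X) * (b.Y - a.Y);

    public static bool IsCcw((double X, double Y)[] poly, (int A, int B, int C) t) =>
        SignedArea2(poly[t.A], poly[t.B], poly[t.C]) > 0;
}
```

For a polygon whose vertices are listed counter-clockwise, every `Forward` triangle is counter-clockwise and every `Reversed` triangle is clockwise, which is the mismatch this theory rests on.
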

**Verification approach**: change `FrontFace` to CW globally (matching
WB at GameScene.cs:843) and audit every consumer (sky, particles, UI,
translucent crystal mesh) for impact. The dispatcher's CCW set at
line 1056 carries a comment about a Phase 9.2 fix (lifestone crystal
see-through-hollow-interior). That fix might have papered over the
underlying FrontFace mismatch instead of fixing it properly.

**Risk**: changing FrontFace globally might re-introduce the
hollow-interior bug for closed-shell translucent meshes. Needs careful
audit and possibly per-renderer FrontFace push/pop.
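
One possible shape for a per-renderer FrontFace push/pop is a disposable scope that restores the previous winding on exit. The sketch below is hypothetical: `FaceWinding`, `FrontFaceState`, and `Scope` do not exist in the codebase, and it assumes every `glFrontFace` call is routed through the shadow state so no `glGet` is needed. The GL call itself is injected as a delegate so the sketch runs without a GL context.

```csharp
using System;

// Stand-in for GL_CW / GL_CCW so the sketch runs without a GL context.
public enum FaceWinding { CW, CCW }

// Hypothetical shadow of the front-face state. It assumes every
// glFrontFace call in the renderer goes through this object.
public sealed class FrontFaceState
{
    private readonly Action<FaceWinding> _apply;

    public FaceWinding Current { get; private set; }

    // 'apply' is expected to wrap the real glFrontFace call.
    public FrontFaceState(FaceWinding initial, Action<FaceWinding> apply)
    {
        _apply = apply ?? throw new ArgumentNullException(nameof(apply));
        Current = initial;
        _apply(initial);
    }

    // Switches to 'winding' and returns a scope that restores the
    // previous value when disposed.
    public Scope Push(FaceWinding winding)
    {
        FaceWinding previous = Current;
        Set(winding);
        return new Scope(this, previous);
    }

    private void Set(FaceWinding winding)
    {
        if (winding == Current) return; // skip redundant GL calls
        Current = winding;
        _apply(winding);
    }

    public readonly struct Scope : IDisposable
    {
        private readonly FrontFaceState _owner;
        private readonly FaceWinding _previous;

        internal Scope(FrontFaceState owner, FaceWinding previous)
        {
            _owner = owner;
            _previous = previous;
        }

        public void Dispose() => _owner?.Set(_previous);
    }
}
```

A pass that needs a different winding would wrap its draws in `using (frontFace.Push(FaceWinding.CCW)) { ... }`, leaving the global default untouched for every other consumer.
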
### Theory B: Cell polys' floor is filtered out at upload time
`PrepareCellStructMeshData` ([ObjectMeshManager.cs:1295-1306](src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs:1295)):
```csharp
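// Upload the positive side unless the poly is flagged NoPos.
if (!poly.Stippling.HasFlag(StipplingType.NoPos))
    AddSurfaceToBatch(poly, poly.PosSurface, false);

// Upload the negative side when flagged Negative or Both, or when the
// poly is Clockwise-sided and not flagged NoNeg.
bool hasNeg = poly.Stippling.HasFlag(StipplingType.Negative) ||
              poly.Stippling.HasFlag(StipplingType.Both) ||
              (!poly.Stippling.HasFlag(StipplingType.NoNeg) && poly.SidesType == CullMode.Clockwise);
if (hasNeg)
    AddSurfaceToBatch(poly, poly.NegSurface, true);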
```
For a floor poly with `Stippling=NoPos + SidesType=Landblock + no
Negative/Both flag`, NEITHER side is uploaded → no rendering at all.
Plausible if AC encodes floor polys this way.
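
The conditional can be exercised in isolation to see which flag combinations upload nothing. This is a minimal sketch with stand-in enums: the numeric values of `StipplingType` and the member order of `CullMode` are assumptions for illustration and should be checked against the real dat-library definitions. `SideUpload.Sides` is a hypothetical helper that mirrors the quoted logic.

```csharp
using System;

// Stand-in enums so the sketch compiles on its own. The numeric values
// are assumptions; the real StipplingType and CullMode should be
// checked in the dat library.
[Flags]
public enum StipplingType { None = 0, Positive = 1, Negative = 2, Both = 3, NoPos = 4, NoNeg = 8 }

public enum CullMode { Landblock, Clockwise, None }

public static class SideUpload
{
    // Mirrors the quoted conditional and reports which sides of a
    // polygon would be added to a batch.
    public static (bool Pos, bool Neg) Sides(StipplingType stippling, CullMode sidesType)
    {
        bool pos = !stippling.HasFlag(StipplingType.NoPos);
        bool neg = stippling.HasFlag(StipplingType.Negative) ||
                   stippling.HasFlag(StipplingType.Both) ||
                   (!stippling.HasFlag(StipplingType.NoNeg) && sidesType == CullMode.Clockwise);
        return (pos, neg);
    }
}
```

Under these assumed values, `Sides(StipplingType.NoPos, CullMode.Landblock)` yields `(false, false)`, the no-upload case, while the same stippling with `CullMode.Clockwise` yields `(false, true)`.
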

**Verification approach**: dump per-poly Stippling, SidesType, PosSurface,
and NegSurface values for cells. Add to the audit probe.
### Theory C: cottage shell has no floor poly + cell mesh's floor is broken
In retail AC, the cottage's "shell" GfxObj (from `info.Buildings[i].ModelId`)
contains walls + roof + door frame. The floor is provided entirely by the
cell's CellStruct PosSurface polygons. If our cell mesh's floor poly is
broken (winding, missing, wrong texture), nothing else fills in.

**Verification approach**: run WB's executable against the same dat,
take a screenshot from the same camera position inside the same cottage,
and diff it against our screenshot. The comparison shows whether the floor
comes from the cell mesh or from somewhere else.
## Process retrospective — what worked this session
1. **Audit BEFORE apparatus**: line-by-line read of EnvCellRenderer vs
WB source found the pool bug in 30 min. The handoff doc warned about
subagent-written code never being audited; that was the right warning.
2. **Apparatus shipped alongside fix**: GL state probe + audit dumps
captured concrete data that informed subsequent fixes. Gates #1-#5
all relied on probe data, not on visual inspection alone.
3. **Stopping after fix #5**: per the systematic-debugging skill. The
alternative (a 6th speculative attempt) would have either burned more
user testing cycles or shipped another band-aid.
## What this session did NOT do (in scope for next session)
- Match WB's `glFrontFace(CW)` globally + audit consumers.
- Inspect per-poly Stippling/SidesType for cell floors.
- WB renderer side-by-side comparison.
- Investigate purple tint on walls (lighting / scene UBO).
- Investigate texture warping (UV / sampler issues).
- Investigate flickering (Z-fighting / alpha threshold).
- Remove the `ACDREAM_A8_INDOOR_BRANCH` kill-switch (still needed; default
OFF restores pre-A8 behavior).
## Pickup prompt for next session
> Phase A8 indoor branch is partially working as of `d5deeb3`. Pool
> aliasing root cause is fixed. Sky-through-windows, LiveDynamic chars,
> cell-mesh double-sided rendering all work. But the floor is transparent
> (sky visible through it), textures warp, and the scene has residual
> distortion + flickering.
>
> Read this doc end-to-end. Then pick ONE of the three theories above
> and verify before any code change:
>
> 1. **Theory A (FrontFace=CW)**: highest-leverage. WB sets CW globally;
> we set CCW. Audit translucent crystal + sky shaders' winding
> assumption first. If safe, set FrontFace=CW globally and visual-gate.
>
> 2. **Theory B (cell-poly filtered)**: extend the existing
> `ACDREAM_A8_AUDIT=1` probe to dump per-poly Stippling, SidesType,
> and PosSurface/NegSurface for a few cells. Live-capture data; check
> if any floor poly is "no upload" per the conditional.
>
> 3. **Theory C (WB side-by-side)**: build WB's executable from
> `references/WorldBuilder/`, point at same dat dir, screenshot same
> cottage interior. Compare. Confirms or rules out our cell mesh
> upload as the source of the bug.
>
> The kill-switch (`ACDREAM_A8_INDOOR_BRANCH=1`) remains the way to
> reproduce the indoor branch. Pre-A8 behavior (kill-switch unset) is
> still the default and unchanged.
>
> User authorization: "use superpowers but DONT stop me for questions,
> be perfect, no bandaids." The "no bandaids" rule is why this session
> stopped at fix #5 and wrote the handoff instead of attempting fix #6.
> Carry that discipline forward.