diff --git a/docs/research/2026-05-28-a8-session-2-shipped-and-handoff.md b/docs/research/2026-05-28-a8-session-2-shipped-and-handoff.md
new file mode 100644
index 0000000..86641b7
--- /dev/null
+++ b/docs/research/2026-05-28-a8-session-2-shipped-and-handoff.md
@@ -0,0 +1,218 @@
+# Phase A8 — Session 2: pool fix shipped, 4 more fixes shipped, residual visuals remain (2026-05-28 PM)
+
+## TL;DR for next session
+
+The session-1 handoff said "BUILD APPARATUS, NOT MORE SPECULATIVE FIXES." I
+built apparatus (per-step GL state probe + per-cell mesh audit + pool
+diagnostics) AND, before the apparatus was used, line-by-line audited
+`EnvCellRenderer.cs` against WB source. The audit found **two
+high-confidence bugs** (pool aliasing) in 30 minutes — these were the
+root cause of the post-Wave-5 catastrophic visual chaos. Pool fix shipped
+(`9559726`) and the visual went from "thin black diagonal sliver, GPU
+100%, 10 FPS, can't see anything" to "walls + objects + sky render
+cleanly, FPS normal."
+
+Five more targeted fixes shipped across visual gates #1-#5. The first
+four landed real bugs. The fifth (cull-restore revert) was based on a
+hypothesis the [draworder] probe data invalidated — gate-#5 showed cull
+state was already off at Step 3 before EnvCellRenderer.Render ran, so
+the propagation theory didn't apply.
+
+**Per systematic-debugging skill's `≥3-failures → question architecture`
+rule, I stopped and wrote this handoff rather than ship a 6th speculative
+fix.** The remaining symptoms (transparent floor, texture warping,
+distortion) point to architectural-level issues that need a different
+investigation approach.
+
+## Visual progress chronicle
+
+| Gate | Symptoms reported | Cause if known |
+|------|-------------------|---|
+| Pre-session (from session-1 handoff) | "Thin black diagonal sliver, GPU 100%, 10 FPS, can't see anything" | Pool aliasing (cleared by session-2 commit `9559726`) |
+| Gate #1 (`375f9a7` + sky-fix not yet) | Walls + objects render, no flicker, FPS normal. No sky through windows. Char + doors missing. Floor missing. Purple tint on walls. | Pool fixed (huge win). LiveDynamic/sky/cull not yet addressed. |
+| Gate #2 (sky fix + audit probe) | Sky visible through windows ✓. Char + doors still missing. Floor still missing. Purple still. | Sky fix worked. Audit dumped per-cell render data. |
+| Gate #3 (LiveDynamic + cull-disable A/B) | Char + doors visible ✓. Floor sometimes visible. See-through-head (cull-off side effect). | LiveDynamic fix worked. Cull-disable proved cull was hiding floor. |
+| Gate #4 (Landblock→None + cull-restore) | "BROKEN textures, floor is now transparent" — sky visible through floor | Cull-restore at exit propagated cull-back to dispatcher's IndoorPass, culling cottage shell's floor poly. |
+| Gate #5 (revert cull-restore) | "No change at all, textures warped, missing textures, floors transparent and flickering" | Revert didn't help — [draworder] probe shows cull was already off at Step 3 entry, so removing my cull-restore at exit doesn't change inherited state. |
+
+## What's shipped this session
+
+| SHA | Description | Status |
+|-----|-------------|--------|
+| `9559726` | Pool aliasing root cause fix (Clear + PostPreparePoolIndex + nested-Setup detection) + 4 regression tests + audit findings doc | **KEPT — closes the post-Wave-5 chaos** |
+| `375f9a7` | Full GL state probe + pool diagnostics extension (option-1 apparatus) | **KEPT — apparatus** |
+| `772d69c` | Sky-when-cameraInsideBuilding fix + per-cell audit probe | **KEPT — sky through windows works** |
+| `b19f3c1` | LiveDynamic dispatcher call in indoor branch + ACDREAM_A8_DISABLE_CULL A/B gate | **KEPT — chars + doors visible inside** |
+| `0940d79` | Cell-mesh Landblock CullMode → None + cull-state restore at exit | **PARTIALLY KEPT — Landblock→None is good; cull-restore was wrong (reverted in d5deeb3)** |
+| `d5deeb3` | Revert cull-restore at EnvCellRenderer exit | **KEPT — leaves cull-off propagating** |
+
+## What's still wrong (visual gate #5 state)
+
+User-reported symptoms with kill-switch ON (`ACDREAM_A8_INDOOR_BRANCH=1`):
+
+1. **Floor transparent** — sky color visible where floor should be. Cell
+   mesh has Landblock→None override that should render cell polys
+   double-sided, but the floor poly either (a) isn't in the upload, (b)
+   has wrong winding/orientation, or (c) is being rendered but z-fails or
+   alpha-discards.
+
+2. **Texture warping** — vague but visible in screenshots. Some surfaces
+   show wrong texture or texture appears stretched/distorted.
+
+3. **Flickering** — surfaces alternate between visible/invisible across
+   frames. Could be Z-fighting (cell mesh vs cottage shell at same depth),
+   alpha-test threshold instability, or animated camera causing
+   per-frame frustum-test results to differ.
+
+4. **General distortion** — overall scene "looks broken." Possibly purple
+   tint on lighting (mentioned in gates #1-#3, not explicitly in #5).
+
+## Apparatus state
+
+These probes are wired and operate when env vars are set:
+
+- `ACDREAM_PROBE_VIS=1` — emits `[draworder]` (per-step GL state),
+  `[stencil]` (per stencil mark/punch), `[buildings]` (camera-building
+  list), `[envcells]` (cells + tris + pool stats).
+- `ACDREAM_A8_AUDIT=1` — one-shot per (cellId, gfxObjId) pair dump of
+  render data: batches count, total IndexCount, CullModes encountered,
+  IsTransparent + IsAdditive flags, BindlessTextureHandle == 0 count.
+
+Sample audit data captured in gate-#2 (`a8-visual-gate-2.log`):
+```
+[a8-audit] cell=0xA9B4013F gfx=0x7F852B220B93AD instances=1 isSetup=False batches=4 totalIdx=144 cull=[Landblock] translucent=0 additive=0 zeroHandle=0
+```
+Every cell mesh batch has CullMode=Landblock (uniform). Render data
+loads correctly (no nulls, no zero handles).
+
+Sample [draworder] data captured in gate-#5 (`a8-visual-gate-5.log`):
+```
+[draworder] frame=155 step=3 stencil=off depthFn=0x201 depthMask=True cull=off(back) blend=0x302/0x303 sFunc=0x207:1:0xFF sOp=0x1E00/0x1E00/0x1E01 sMask=0x1 cMask=(RGB-) vao=0 prog=6
+```
+Cull is OFF at Step 3 entry (Step 1's `gl.Disable(EnableCap.CullFace)`
+already disabled it; my cull-restore-at-exit revert had no effect on
+incoming state).
+
+## Root-cause analysis — why the speculative fixes can't close it
+
+### Theory A: AC's polygon winding requires `glFrontFace(CW)`
+
+WB sets `glFrontFace(GLEnum.CW)` globally at
+[GameScene.cs:843](references/WorldBuilder/Chorizite.OpenGLSDLBackend/GameScene.cs:843).
+Our `WbDrawDispatcher.cs:1056` sets `glFrontFace(CCW)` in the transparent
+pass with a comment claiming "our fan triangulation emits pos-side polys
+as (0, i, i+1) — CCW." But the actual triangulation in
+`BuildCellStructPolygonIndices` ([ObjectMeshManager.cs:1518-1586](src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs:1518))
+emits `(i, i-1, 0)` — the REVERSE of (0, i, i+1). The comment is wrong
+about our actual winding.
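
The reversal between the two fan patterns can be confirmed with a signed-area check. The following is a minimal sketch, not code from the repository: the `FanWinding` class and its method names are hypothetical, and it assumes a convex 2D polygon listed CCW in a y-up plane as a stand-in for a polygon viewed from its PosSurface side.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of ObjectMeshManager.
public static class FanWinding
{
    // Twice the signed area of triangle (a, b, c) in a y-up plane:
    // positive means CCW, negative means CW.
    public static double SignedArea2(
        (double X, double Y) a, (double X, double Y) b, (double X, double Y) c)
        => (b.X - a.X) * (c.Y - a.Y) - (c.X - a.X) * (b.Y - a.Y);

    // Fan pattern described in the dispatcher comment: (0, i, i+1).
    public static IEnumerable<(int, int, int)> ForwardFan(int vertexCount)
    {
        for (int i = 1; i < vertexCount - 1; i++)
            yield return (0, i, i + 1);
    }

    // Fan pattern this doc attributes to BuildCellStructPolygonIndices: (i, i-1, 0).
    public static IEnumerable<(int, int, int)> ReverseFan(int vertexCount)
    {
        for (int i = 2; i < vertexCount; i++)
            yield return (i, i - 1, 0);
    }

    public static void Main()
    {
        // Unit square listed CCW (assumed input winding).
        var quad = new (double X, double Y)[] { (0, 0), (1, 0), (1, 1), (0, 1) };

        foreach (var (a, b, c) in ForwardFan(quad.Length))
            Console.WriteLine($"forward ({a},{b},{c}) area2={SignedArea2(quad[a], quad[b], quad[c])}");

        foreach (var (a, b, c) in ReverseFan(quad.Length))
            Console.WriteLine($"reverse ({a},{b},{c}) area2={SignedArea2(quad[a], quad[b], quad[c])}");
    }
}
```

For a CCW input polygon the forward fan yields positive areas and the reverse fan yields negative ones, so the two patterns cannot both be front-facing under the same `glFrontFace` setting. Whether AC's source polygons are actually CCW from the PosSurface side is still the open question.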
+
+If AC's polys are wound CCW from their PosSurface side (the "front" side
+in retail convention), our triangulation produces CW-from-PosSurface
+triangles. WB's `FrontFace=CW` makes CW = front, so cull-back removes
+the back side correctly. Our `FrontFace=CCW` makes CCW = front, so
+cull-back removes the WRONG side — hiding polys whose PosSurface is
+camera-facing.
+
+**Verification approach**: change `FrontFace` to CW globally (matching
+WB at GameScene.cs:843) and audit every consumer (sky, particles, UI,
+translucent crystal mesh) for impact. The dispatcher's CCW set at
+line 1056 has a comment about a Phase 9.2 fix (lifestone crystal
+see-through-hollow-interior) — that fix might have papered over the
+underlying FrontFace mismatch instead of fixing it properly.
+
+**Risk**: changing FrontFace globally might re-introduce the
+hollow-interior bug for closed-shell translucent meshes. Needs careful
+audit and possibly per-renderer FrontFace push/pop.
+
+### Theory B: Cell polys' floor is filtered out at upload time
+
+`PrepareCellStructMeshData` ([ObjectMeshManager.cs:1295-1306](src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs:1295)):
+
+```csharp
+if (!poly.Stippling.HasFlag(StipplingType.NoPos))
+    AddSurfaceToBatch(poly, poly.PosSurface, false);
+
+bool hasNeg = poly.Stippling.HasFlag(StipplingType.Negative) ||
+    poly.Stippling.HasFlag(StipplingType.Both) ||
+    (!poly.Stippling.HasFlag(StipplingType.NoNeg) && poly.SidesType == CullMode.Clockwise);
+if (hasNeg)
+    AddSurfaceToBatch(poly, poly.NegSurface, true);
+```
+
+For a floor poly with `Stippling=NoPos + SidesType=Landblock + no
+Negative/Both flag`, NEITHER side is uploaded → no rendering at all.
+Plausible if AC encodes floor polys this way.
+
+**Verification approach**: dump per-poly Stippling + SidesType + PosSurface
++ NegSurface values for cells. Add to the audit probe.
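
One way to structure the per-poly dump is to factor the upload decision into a pure function and log its verdict next to the raw flags. This is a sketch under stated assumptions: `StipplingStandIn` and `CullModeStandIn` are stand-in enums with made-up member values (the real `StipplingType` and `CullMode` come from the dat library and may be laid out differently), and the `[a8-audit-poly]` log tag is hypothetical.

```csharp
using System;

// Stand-in enums for illustration only; member values are assumptions,
// not the real StipplingType / CullMode definitions.
[Flags]
public enum StipplingStandIn { None = 0, Positive = 1, Negative = 2, Both = 3, NoPos = 4, NoNeg = 8 }
public enum CullModeStandIn { None, Clockwise, Landblock }

public static class UploadAudit
{
    // Mirrors the two conditions in the PrepareCellStructMeshData excerpt
    // and reports which sides would be uploaded for a single polygon.
    public static (bool Pos, bool Neg) SidesUploaded(StipplingStandIn stippling, CullModeStandIn sides)
    {
        bool pos = !stippling.HasFlag(StipplingStandIn.NoPos);
        bool neg = stippling.HasFlag(StipplingStandIn.Negative)
                || stippling.HasFlag(StipplingStandIn.Both)
                || (!stippling.HasFlag(StipplingStandIn.NoNeg) && sides == CullModeStandIn.Clockwise);
        return (pos, neg);
    }

    // Hypothetical log line for an extended audit probe.
    public static string Describe(int polyIndex, StipplingStandIn stippling, CullModeStandIn sides)
    {
        var (pos, neg) = SidesUploaded(stippling, sides);
        string verdict = (!pos && !neg) ? "NO-UPLOAD" : "ok";
        return $"[a8-audit-poly] poly={polyIndex} stippling={stippling} sides={sides} pos={pos} neg={neg} {verdict}";
    }

    public static void Main()
    {
        // The combination Theory B worries about: neither side uploaded.
        Console.WriteLine(Describe(0, StipplingStandIn.NoPos, CullModeStandIn.Landblock));
        // Default flags: positive side only.
        Console.WriteLine(Describe(1, StipplingStandIn.None, CullModeStandIn.Landblock));
        // NoPos with Clockwise sides: negative side only.
        Console.WriteLine(Describe(2, StipplingStandIn.NoPos, CullModeStandIn.Clockwise));
    }
}
```

Grepping a capture for `NO-UPLOAD` on floor polys would confirm or rule out Theory B without any rendering change.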
+
+### Theory C: cottage shell has no floor poly + cell mesh's floor is broken
+
+In retail AC, the cottage's "shell" GfxObj (from `info.Buildings[i].ModelId`)
+contains walls + roof + door frame. The floor is provided entirely by the
+cell's CellStruct PosSurface polygons. If our cell mesh's floor poly is
+broken (winding, missing, wrong texture), nothing else fills in.
+
+**Verification approach**: run WB's executable against the same dat,
+take a screenshot from the same camera position inside the same cottage,
+diff against our screenshot. Identifies whether the floor source is
+the cell mesh or somewhere else.
+
+## Process retrospective — what worked this session
+
+1. **Audit BEFORE apparatus**: line-by-line read of EnvCellRenderer vs
+   WB source found the pool bug in 30 min. The handoff doc warned about
+   subagent-written code never being audited; that was the right warning.
+
+2. **Apparatus shipped alongside fix**: GL state probe + audit dumps
+   captured concrete data that informed subsequent fixes. Gates #1-#5
+   all relied on probe data, not pure visual.
+
+3. **Stopping after 4 fixes**: per systematic-debugging skill. The
+   alternative (a 6th speculative attempt) would have either burned more
+   user testing cycles or shipped another band-aid.
+
+## What this session did NOT do (in scope for next session)
+
+- Match WB's `glFrontFace(CW)` globally + audit consumers.
+- Inspect per-poly Stippling/SidesType for cell floors.
+- WB renderer side-by-side comparison.
+- Investigate purple tint on walls (lighting / scene UBO).
+- Investigate texture warping (UV / sampler issues).
+- Investigate flickering (Z-fighting / alpha threshold).
+- Remove the ACDREAM_A8_INDOOR_BRANCH kill-switch (still needed; default
+  OFF restores pre-A8 behavior).
+
+## Pickup prompt for next session
+
+> Phase A8 indoor branch is partially working as of `d5deeb3`. Pool
+> aliasing root cause is fixed. Sky-through-windows, LiveDynamic chars,
+> cell-mesh double-sided rendering all work. But the floor is transparent
+> (sky visible through it), textures warp, and the scene has residual
+> distortion + flickering.
+>
+> Read this doc end-to-end. Then pick ONE of the three theories above
+> and verify before any code change:
+>
+> 1. **Theory A (FrontFace=CW)**: highest-leverage. WB sets CW globally;
+>    we set CCW. Audit translucent crystal + sky shaders' winding
+>    assumption first. If safe, set FrontFace=CW globally and visual-gate.
+>
+> 2. **Theory B (cell-poly filtered)**: extend the existing
+>    `ACDREAM_A8_AUDIT=1` probe to dump per-poly Stippling + SidesType
+>    + PosSurface/NegSurface for a few cells. Live-capture data; check
+>    if any floor poly is "no upload" per the conditional.
+>
+> 3. **Theory C (WB side-by-side)**: build WB's executable from
+>    `references/WorldBuilder/`, point at same dat dir, screenshot same
+>    cottage interior. Compare. Confirms or rules out our cell mesh
+>    upload as the source of the bug.
+>
+> The kill-switch (`ACDREAM_A8_INDOOR_BRANCH=1`) remains the way to
+> reproduce the indoor branch. Pre-A8 behavior (kill-switch unset) is
+> still the default and unchanged.
+>
+> User authorization: "use superpowers but DONT stop me for questions,
+> be perfect, no bandaids." The "no bandaids" rule is why this session
+> stopped at fix #5 and wrote the handoff instead of attempting fix #6.
+> Carry that discipline forward.