acdream/docs/research/2026-05-28-a8-session-2-shipped-and-handoff.md
Erik e415bb3863 docs: Phase A8 — session 2 handoff (pool fix shipped + 4 partial fixes + residuals)
After 5 visual gates, the session shipped 5 commits closing real bugs
(pool aliasing was the catastrophic root cause), but residual symptoms
(transparent floor, texture warping, flickering, distortion) didn't
yield to surgical fixes. Per systematic-debugging skill's >=3-failures
rule, stop and capture state.

Doc covers:
- Pool aliasing root cause + fix (the big win — closes session-1's
  visual chaos).
- Sky-when-building, LiveDynamic, Landblock→None — all real bug closures.
- Apparatus state (GL state probe + per-cell audit + pool diagnostics).
- Three theories for the residual issues (FrontFace=CW global match to
  WB / per-poly Stippling audit / WB side-by-side render).
- Pickup prompt for next session with ranked options.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 20:32:22 +02:00


Phase A8 — Session 2: pool fix shipped, 4 more fixes shipped, residual visuals remain (2026-05-28 PM)

TL;DR for next session

The session-1 handoff said "BUILD APPARATUS, NOT MORE SPECULATIVE FIXES." I built apparatus (per-step GL state probe + per-cell mesh audit + pool diagnostics) AND, before the apparatus was used, line-by-line audited EnvCellRenderer.cs against WB source. The audit found two high-confidence bugs (pool aliasing) in 30 minutes — these were the root cause of the post-Wave-5 catastrophic visual chaos. Pool fix shipped (9559726) and the visual went from "thin black diagonal sliver, GPU 100%, 10 FPS, can't see anything" to "walls + objects + sky render cleanly, FPS normal."

Five more targeted fixes shipped across visual gates #1-#5. The first four closed real bugs. The fifth (the cull-restore revert) rested on a hypothesis that the [draworder] probe data invalidated: gate #5 showed cull state was already off at Step 3, before EnvCellRenderer.Render ran, so the propagation theory didn't apply.

Per systematic-debugging skill's ≥3-failures → question architecture rule, I stopped and wrote this handoff rather than ship a 6th speculative fix. The remaining symptoms (transparent floor, texture warping, distortion) point to architectural-level issues that need a different investigation approach.

Visual progress chronicle

| Gate | Symptoms reported | Cause if known |
| --- | --- | --- |
| Pre-session (from session-1 handoff) | "Thin black diagonal sliver, GPU 100%, 10 FPS, can't see anything" | Pool aliasing (cleared by session-2 commit 9559726) |
| Gate #1 (375f9a7 + sky-fix not yet) | Walls + objects render, no flicker, FPS normal. No sky through windows. Char + doors missing. Floor missing. Purple tint on walls. | Pool fixed (huge win). LiveDynamic/sky/cull not yet addressed. |
| Gate #2 (sky fix + audit probe) | Sky visible through windows ✓. Char + doors still missing. Floor still missing. Purple still. | Sky fix worked. Audit dumped per-cell render data. |
| Gate #3 (LiveDynamic + cull-disable A/B) | Char + doors visible ✓. Floor sometimes visible. See-through-head (cull-off side effect). | LiveDynamic fix worked. Cull-disable proved cull was hiding floor. |
| Gate #4 (Landblock→None + cull-restore) | "BROKEN textures, floor is now transparent"; sky visible through floor | Cull-restore at exit propagated cull-back to dispatcher's IndoorPass, culling cottage shell's floor poly. |
| Gate #5 (revert cull-restore) | "No change at all, textures warped, missing textures, floors transparent and flickering" | Revert didn't help: [draworder] probe shows cull was already off at Step 3 entry, so removing my cull-restore at exit doesn't change inherited state. |

What's shipped this session

| SHA | Description | Status |
| --- | --- | --- |
| 9559726 | Pool aliasing root cause fix (Clear + PostPreparePoolIndex + nested-Setup detection) + 4 regression tests + audit findings doc | KEPT: closes the post-Wave-5 chaos |
| 375f9a7 | Full GL state probe + pool diagnostics extension (option-1 apparatus) | KEPT: apparatus |
| 772d69c | Sky-when-cameraInsideBuilding fix + per-cell audit probe | KEPT: sky through windows works |
| b19f3c1 | LiveDynamic dispatcher call in indoor branch + ACDREAM_A8_DISABLE_CULL A/B gate | KEPT: chars + doors visible inside |
| 0940d79 | Cell-mesh Landblock CullMode → None + cull-state restore at exit | PARTIALLY KEPT: Landblock→None is good; cull-restore was wrong (reverted in d5deeb3) |
| d5deeb3 | Revert cull-restore at EnvCellRenderer exit | KEPT: leaves cull-off propagating |

What's still wrong (visual gate #5 state)

User-reported symptoms with kill-switch ON (ACDREAM_A8_INDOOR_BRANCH=1):

  1. Floor transparent — sky color visible where floor should be. Cell mesh has Landblock→None override that should render cell polys double-sided, but the floor poly either (a) isn't in the upload, (b) has wrong winding/orientation, or (c) is being rendered but z-fails or alpha-discards.

  2. Texture warping — vague but visible in screenshots. Some surfaces show wrong texture or texture appears stretched/distorted.

  3. Flickering — surfaces alternate between visible/invisible across frames. Could be Z-fighting (cell mesh vs cottage shell at same depth), alpha-test threshold instability, or animated camera causing per-frame frustum-test results to differ.

  4. General distortion — overall scene "looks broken." Possibly purple tint on lighting (mentioned in gates #1-#3, not explicitly in #5).

Apparatus state

These probes are wired and operate when env vars are set:

  • ACDREAM_PROBE_VIS=1 — emits [draworder] (per-step GL state), [stencil] (per stencil mark/punch), [buildings] (camera-building list), [envcells] (cells + tris + pool stats).
  • ACDREAM_A8_AUDIT=1 — one-shot per (cellId, gfxObjId) pair dump of render data: batches count, total IndexCount, CullModes encountered, IsTransparent + IsAdditive flags, BindlessTextureHandle == 0 count.

Sample audit data captured in gate-#2 (a8-visual-gate-2.log):

[a8-audit] cell=0xA9B4013F gfx=0x7F852B220B93AD instances=1 isSetup=False batches=4 totalIdx=144 cull=[Landblock] translucent=0 additive=0 zeroHandle=0

Every cell mesh batch has CullMode=Landblock (uniform). Render data loads correctly (no nulls, no zero handles).

Sample [draworder] data captured in gate-#5 (a8-visual-gate-5.log):

[draworder] frame=155 step=3 stencil=off depthFn=0x201 depthMask=True cull=off(back) blend=0x302/0x303 sFunc=0x207:1:0xFF sOp=0x1E00/0x1E00/0x1E01 sMask=0x1 cMask=(RGB-) vao=0 prog=6

Cull is OFF at Step 3 entry (Step 1's gl.Disable(EnableCap.CullFace) already disabled it; my cull-restore-at-exit revert had no effect on incoming state).
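The hex values in a [draworder] line are raw OpenGL enum constants, so a small decoder makes the probe output easier to read at a glance. The following is a minimal sketch, not part of the shipped apparatus: the class name `DrawOrderLine` is hypothetical, and the space-separated `key=value` layout is assumed from the sample line above. The constant values themselves are the standard ones from the OpenGL headers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample line copied from a8-visual-gate-5.log (see above).
const string sample =
    "[draworder] frame=155 step=3 stencil=off depthFn=0x201 depthMask=True " +
    "cull=off(back) blend=0x302/0x303 sFunc=0x207:1:0xFF " +
    "sOp=0x1E00/0x1E00/0x1E01 sMask=0x1 cMask=(RGB-) vao=0 prog=6";

var fields = DrawOrderLine.Parse(sample);
Console.WriteLine($"depthFn = {DrawOrderLine.Name(fields["depthFn"])}");
Console.WriteLine($"blend   = {DrawOrderLine.Names(fields["blend"], '/')}");
Console.WriteLine($"sOp     = {DrawOrderLine.Names(fields["sOp"], '/')}");

// Hypothetical helper: not part of the acdream code base.
static class DrawOrderLine
{
    // Standard OpenGL constants: depth/stencil compare funcs, blend factors, stencil ops.
    private static readonly Dictionary<int, string> Gl = new()
    {
        [0x0200] = "GL_NEVER",     [0x0201] = "GL_LESS",
        [0x0202] = "GL_EQUAL",     [0x0203] = "GL_LEQUAL",
        [0x0204] = "GL_GREATER",   [0x0205] = "GL_NOTEQUAL",
        [0x0206] = "GL_GEQUAL",    [0x0207] = "GL_ALWAYS",
        [0x0300] = "GL_SRC_COLOR", [0x0301] = "GL_ONE_MINUS_SRC_COLOR",
        [0x0302] = "GL_SRC_ALPHA", [0x0303] = "GL_ONE_MINUS_SRC_ALPHA",
        [0x1E00] = "GL_KEEP",      [0x1E01] = "GL_REPLACE",
        [0x1E02] = "GL_INCR",      [0x1E03] = "GL_DECR",
    };

    // Splits "key=value" tokens; tokens without '=' (the "[draworder]" tag) are skipped.
    public static Dictionary<string, string> Parse(string line) =>
        line.Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Where(t => t.Contains('='))
            .ToDictionary(t => t[..t.IndexOf('=')], t => t[(t.IndexOf('=') + 1)..]);

    // Maps one hex literal to its GL name; unknown values are returned unchanged.
    public static string Name(string hex)
    {
        int value = Convert.ToInt32(hex, 16); // base 16 accepts the "0x" prefix
        return Gl.TryGetValue(value, out var name) ? name : hex;
    }

    // Decodes a separator-delimited list such as "0x302/0x303".
    public static string Names(string list, char sep) =>
        string.Join(sep.ToString(), list.Split(sep).Select(Name));
}
```

For the sample line this reads as depth test GL_LESS, blend GL_SRC_ALPHA/GL_ONE_MINUS_SRC_ALPHA, and stencil ops GL_KEEP/GL_KEEP/GL_REPLACE.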

Root-cause analysis — why the speculative fixes can't close it

Theory A: AC's polygon winding requires glFrontFace(CW)

WB sets glFrontFace(GLEnum.CW) globally at GameScene.cs:843. Our WbDrawDispatcher.cs:1056 sets glFrontFace(CCW) in the transparent pass with a comment claiming "our fan triangulation emits pos-side polys as (0, i, i+1) — CCW." But the actual triangulation in BuildCellStructPolygonIndices (ObjectMeshManager.cs:1518-1586) emits (i, i-1, 0) — the REVERSE of (0, i, i+1). The comment is wrong about our actual winding.

If AC's polys are wound CCW from their PosSurface side (the "front" side in retail convention), our triangulation produces CW-from-PosSurface triangles. WB's FrontFace=CW makes CW = front, so cull-back removes the back side correctly. Our FrontFace=CCW makes CCW = front, so cull-back removes the WRONG side — hiding polys whose PosSurface is camera-facing.
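The winding claim above can be checked numerically without touching the renderer. The sketch below triangulates a quad that is listed counter-clockwise and compares the signed area of each triangle under both index patterns. The loop bounds and the `Winding` helper are assumptions for illustration; only the two index patterns, (0, i, i+1) and (i, i-1, 0), come from the text.

```csharp
using System;
using System.Collections.Generic;

// A unit square listed counter-clockwise when viewed with X right, Y up.
(double X, double Y)[] quad = { (0, 0), (1, 0), (1, 1), (0, 1) };

foreach (var (a, b, c) in Winding.FanForward(quad.Length))
    Console.WriteLine($"(0,i,i+1) ({a},{b},{c}) area={Winding.SignedArea(quad[a], quad[b], quad[c])}");

foreach (var (a, b, c) in Winding.FanReversed(quad.Length))
    Console.WriteLine($"(i,i-1,0) ({a},{b},{c}) area={Winding.SignedArea(quad[a], quad[b], quad[c])}");

// Hypothetical helper: models the two fan patterns, not the real mesh code.
static class Winding
{
    // Pattern named in the WbDrawDispatcher comment.
    public static IEnumerable<(int, int, int)> FanForward(int vertexCount)
    {
        for (int i = 1; i < vertexCount - 1; i++)
            yield return (0, i, i + 1);
    }

    // Pattern the text attributes to BuildCellStructPolygonIndices.
    public static IEnumerable<(int, int, int)> FanReversed(int vertexCount)
    {
        for (int i = 2; i < vertexCount; i++)
            yield return (i, i - 1, 0);
    }

    // Positive means counter-clockwise, negative means clockwise.
    public static double SignedArea(
        (double X, double Y) a, (double X, double Y) b, (double X, double Y) c) =>
        0.5 * ((b.X - a.X) * (c.Y - a.Y) - (c.X - a.X) * (b.Y - a.Y));
}
```

Every (0, i, i+1) triangle comes out with area +0.5 and every (i, i-1, 0) triangle with -0.5, so the second pattern flips the polygon's winding.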

Verification approach: change FrontFace to CW globally (matching WB at GameScene.cs:843) and audit every consumer (sky, particles, UI, translucent crystal mesh) for impact. The dispatcher's CCW set at line 1056 has a comment about a Phase 9.2 fix (lifestone crystal see-through-hollow-interior) — that fix might have papered over the underlying FrontFace mismatch instead of fixing it properly.

Risk: changing FrontFace globally might re-introduce the hollow-interior bug for closed-shell translucent meshes. Needs careful audit and possibly per-renderer FrontFace push/pop.
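One possible shape for the per-renderer push/pop mentioned above is a disposable scope that records the current front-face value, applies the desired one, and restores the original on exit. This is only a sketch under stated assumptions: `FrontFaceScope` is a hypothetical type, and the two delegates stand in for whatever the real renderer would use to query and set the state (in plain OpenGL terms, glGetIntegerv with GL_FRONT_FACE and glFrontFace). A plain variable replaces the GL context so the example runs on its own.

```csharp
using System;

// Stand-in for the GL front-face state; a real renderer would query the driver.
int frontFace = FrontFaceScope.Ccw;

using (new FrontFaceScope(() => frontFace, v => frontFace = v, FrontFaceScope.Cw))
{
    // Draw calls that need CW-as-front would go here.
    Console.WriteLine($"inside:  0x{frontFace:X4}");
}
Console.WriteLine($"outside: 0x{frontFace:X4}");

// Hypothetical helper: restores the previous front-face mode when disposed.
sealed class FrontFaceScope : IDisposable
{
    public const int Cw = 0x0900;   // GL_CW
    public const int Ccw = 0x0901;  // GL_CCW

    private readonly Action<int> _set;
    private readonly int _previous;

    public FrontFaceScope(Func<int> get, Action<int> set, int desired)
    {
        _set = set;
        _previous = get(); // remember the inherited state
        set(desired);      // apply the mode this renderer needs
    }

    public void Dispose() => _set(_previous);
}
```

Scoping the change this way would let the cell renderer try FrontFace=CW while the translucent pass keeps the CCW setting from the Phase 9.2 fix, at the cost of one state query per scope.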

Theory B: Cell polys' floor is filtered out at upload time

PrepareCellStructMeshData (ObjectMeshManager.cs:1295-1306):

if (!poly.Stippling.HasFlag(StipplingType.NoPos))
    AddSurfaceToBatch(poly, poly.PosSurface, false);

bool hasNeg = poly.Stippling.HasFlag(StipplingType.Negative) ||
             poly.Stippling.HasFlag(StipplingType.Both) ||
             (!poly.Stippling.HasFlag(StipplingType.NoNeg) && poly.SidesType == CullMode.Clockwise);
if (hasNeg)
    AddSurfaceToBatch(poly, poly.NegSurface, true);

For a floor poly with Stippling=NoPos + SidesType=Landblock + no Negative/Both flag, NEITHER side is uploaded → no rendering at all. Plausible if AC encodes floor polys this way.
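The conditional can be modelled in isolation to confirm which flag combinations upload nothing. The sketch below assumes stand-in enum definitions: the member names match the excerpt, but the numeric values and the `UploadModel` helper are hypothetical and should be checked against the real StipplingType and CullMode declarations before drawing conclusions.

```csharp
using System;

// The case described above: NoPos, Landblock sides, no Negative/Both flag.
var floor = UploadModel.SidesUploaded(StipplingType.NoPos, CullMode.Landblock);
Console.WriteLine($"NoPos + Landblock -> pos={floor.Pos} neg={floor.Neg}");

// Two contrasting cases.
var plain = UploadModel.SidesUploaded(StipplingType.None, CullMode.Landblock);
Console.WriteLine($"None  + Landblock -> pos={plain.Pos} neg={plain.Neg}");

var clockwise = UploadModel.SidesUploaded(StipplingType.None, CullMode.Clockwise);
Console.WriteLine($"None  + Clockwise -> pos={clockwise.Pos} neg={clockwise.Neg}");

// ASSUMED values: stand-ins for the real enums, for illustration only.
[Flags]
enum StipplingType { None = 0, Positive = 1, Negative = 2, Both = 3, NoPos = 4, NoNeg = 8 }

enum CullMode { Landblock, Clockwise, None }

// Hypothetical helper mirroring the two conditions in the excerpt.
static class UploadModel
{
    public static (bool Pos, bool Neg) SidesUploaded(StipplingType stippling, CullMode sidesType)
    {
        bool pos = !stippling.HasFlag(StipplingType.NoPos);
        bool neg = stippling.HasFlag(StipplingType.Negative) ||
                   stippling.HasFlag(StipplingType.Both) ||
                   (!stippling.HasFlag(StipplingType.NoNeg) && sidesType == CullMode.Clockwise);
        return (pos, neg);
    }
}
```

Under these assumed values, NoPos + Landblock uploads neither side, None + Landblock uploads only the positive side, and None + Clockwise uploads both.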

Verification approach: dump per-poly Stippling + SidesType + PosSurface + NegSurface values for cells. Add to the audit probe.

Theory C: cottage shell has no floor poly + cell mesh's floor is broken

In retail AC, the cottage's "shell" GfxObj (from info.Buildings[i].ModelId) contains walls + roof + door frame. The floor is provided entirely by the cell's CellStruct PosSurface polygons. If our cell mesh's floor poly is broken (winding, missing, wrong texture), nothing else fills in.

Verification approach: run WB's executable against the same dat, take a screenshot from the same camera position inside the same cottage, diff against our screenshot. Identifies whether the floor source is the cell mesh or somewhere else.

Process retrospective — what worked this session

  1. Audit BEFORE apparatus: line-by-line read of EnvCellRenderer vs WB source found the pool bug in 30 min. The handoff doc warned about subagent-written code never being audited; that was the right warning.

  2. Apparatus shipped alongside fix: GL state probe + audit dumps captured concrete data that informed subsequent fixes. Gates #1-#5 all relied on probe data, not pure visual.

  3. Stopping after the fifth fix: per the systematic-debugging skill. The alternative (a 6th speculative attempt) would have either burned more user testing cycles or shipped another band-aid.

What this session did NOT do (in scope for next session)

  • Match WB's glFrontFace(CW) globally + audit consumers.
  • Inspect per-poly Stippling/SidesType for cell floors.
  • WB renderer side-by-side comparison.
  • Investigate purple tint on walls (lighting / scene UBO).
  • Investigate texture warping (UV / sampler issues).
  • Investigate flickering (Z-fighting / alpha threshold).
  • Remove the ACDREAM_A8_INDOOR_BRANCH kill-switch (still needed; default OFF restores pre-A8 behavior).

Pickup prompt for next session

Phase A8 indoor branch is partially working as of d5deeb3. Pool aliasing root cause is fixed. Sky-through-windows, LiveDynamic chars, cell-mesh double-sided rendering all work. But the floor is transparent (sky visible through it), textures warp, and the scene has residual distortion + flickering.

Read this doc end-to-end. Then pick ONE of the three theories above and verify before any code change:

  1. Theory A (FrontFace=CW): highest-leverage. WB sets CW globally; we set CCW. Audit translucent crystal + sky shaders' winding assumption first. If safe, set FrontFace=CW globally and visual-gate.

  2. Theory B (cell-poly filtered): extend the existing ACDREAM_A8_AUDIT=1 probe to dump per-poly Stippling + SidesType + PosSurface/NegSurface for a few cells. Live-capture data; check if any floor poly is "no upload" per the conditional.

  3. Theory C (WB side-by-side): build WB's executable from references/WorldBuilder/, point at same dat dir, screenshot same cottage interior. Compare. Confirms or rules out our cell mesh upload as the source of the bug.

The kill-switch (ACDREAM_A8_INDOOR_BRANCH=1) remains the way to reproduce the indoor branch. Pre-A8 behavior (kill-switch unset) is still the default and unchanged.

User authorization: "use superpowers but DONT stop me for questions, be perfect, no bandaids." The "no bandaids" rule is why this session stopped at fix #5 and wrote the handoff instead of attempting fix #6. Carry that discipline forward.