acdream/docs/research/2026-06-06-indoor-render-hang-rootcause.md
Erik 1405dd8e90 feat(render): indoor render WORKS — terminating portal flood + every-cell seal + look-in FPS
Checkpoint of the unified retail-faithful indoor render. The two-week HANG/grey is fixed and the
interior seals (live-verified by the user). Commits the session render-rewrite foundation together
with the fixes that made it functional.

- HANG fix: PortalVisibilityBuilder.Build portal flood did not terminate (the faithful ProjectToClip
  near-side clip drifts per round, defeating the CellView dedup; the BFS had no bound after U.2a removed
  MaxReprocessPerCell). Fix = drift-tolerant snapped/canonical CellView.Add dedup (PortalView.cs) plus
  restored MaxReprocessPerCell=16 bounded re-enqueue (PortalVisibilityBuilder.cs). Re-enqueue is kept
  (load-bearing for late-slice propagation, Build_ViewGrowthAfterDoneCell_PropagatesNewSlicesToExit);
  only its count is capped. CellViewDedupTests added.
- Seal (DrawCells Task 2): RetailPViewRenderer.DrawEnvCellShells draws EVERY visible cell via
  IndoorDrawPlan.ShellPass (was gated on the ClipFrameAssembler slot filter, leaving slot-less cells grey).
- Look-in FPS: GameWindow exterior look-in candidates limited to the player landblock +-1 (was all ~81
  loaded LBs iterated every outdoor frame). No behaviour change (far cells were >48m, already culled).

Remaining dominant issue = the FLAP at transitions: viewer-cell metastability (render roots at the
camera-eye cell, which oscillates outdoor-indoor as the 3rd-person boom drifts across the doorway,
confirmed in render-sig). SEPARATE fix, NOT the DrawCells port. Full handoff + flap fix plan + tracked
follow-ups (#78 terrain, look-in-from-inside, look-in FPS, L-spotlight):
docs/research/2026-06-07-indoor-render-session-handoff.md.

Baselines: build 0 err; App.Tests 210/210; Core.Tests 1331 pass / 4 fail (pre-existing) / 1 skip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 10:14:43 +02:00

12 KiB
Raw Blame History

Indoor render HANG — root cause: PortalVisibilityBuilder.Build non-termination — 2026-06-06

Report-only investigation (user chose "investigate more first"). No code changed. Worktree thirsty-goldberg-51bb9b. This blocks the verbatim-DrawCells port's Task 2 visual gate: every indoor frame can freeze here.

Symptom

Three launches of the client all froze (AppHangB1, Windows Event Log) within seconds-to-minutes of the camera being indoors at the Holtburg cottage. Not a crash — no access violation, no managed exception. The captured managed stack of the frozen render thread (hang-stack.txt, via dotnet-stack) shows it CPU-spinning:

CPU_TIME
CellView.Add(ViewPolygon)
PortalVisibilityBuilder.AddRegion(CellView, List<ViewPolygon>)
PortalVisibilityBuilder.Build(...)
RetailPViewRenderer.DrawInside(...)
GameWindow.OnRender(...)

App.Tests 207/207 and Core 1331/4/1 are green; the bug is invisible to the suite (see §Evidence).

Verdict

It is NOT Task 2 (the verbatim-DrawCells / grey fix). Build(...) runs at the very top of DrawInside (RetailPViewRenderer.cs:43), before any line Task 2 touched, and the call is byte-identical pre/post-change. Task 2's draw logic was independently confirmed correct in the run-1 log: [render-sig] draw=[…] equalled ids=[…] with miss=[], and [shell] showed every visible cell drawing textured (zh=0). The grey fix works.

Root cause: PortalVisibilityBuilder.Build's portal BFS does not terminate for real cottage geometry. It re-enqueues a popped cell every time that cell's CellView grows: queued.Remove(cell.CellId) on pop (:122)

  • if (grew && queued.Add(neighbourId)) on grow (:289). Termination therefore depends entirely on growth stopping. Growth is gated only by CellView.Add's exact-match dedup (SamePolygon, eps 1e-4, PortalView.cs:79). The near-side portal clip (ClipPortalAgainstViewPortalProjection.ProjectToClipClipToRegion, :474/:485) produces a polygon that is a hair different on each A↔B reciprocal round (float drift through the homogeneous project→clip round-trip with a non-identity cell transform). The dedup never matches the drifted near-duplicate → the region grows without bound → the cell re-enqueues forever → CellView.Polygons grows to N, and CellView.Add's O(N) dedup scan makes the whole thing O(N²) → frozen.

Evidence

  1. Captured stack pins the spin to CellView.Add ← AddRegion ← Build, pure managed CPU_TIME (not a GL call, not blocked, not a fault).
  2. The code already documents this exact failure at :694-697: the reciprocal clip deliberately stays on the float-stable ProjectToNdc path because "per-round float drift defeated the CellView SamePolygon dedup, inflating a tight A<->B reciprocal view to ~4x its area." The near-side clip (:474) did not get the same treatment — it uses ProjectToClip.
  3. The only bound was removed this session. :74: "Fixpoint termination replacing the old MaxReprocessPerCell hard cap." The fixpoint never converges under drift; with the cap gone there is no other bound (no iteration cap, no max-polygon cap, no time bound).
  4. It's the dirty-tree rewire the handoff said to KEEP. git diff --stat: PortalVisibilityBuilder.cs +426/45 and PortalProjection.cs +111 are uncommitted. ProjectToClip is part of the new PortalProjection lines. The handoff (2026-06-06-verbatim-drawcells-port-pickup-handoff.md) lists this rewire as the faithful foundation to preserve and says "the clip math is already faithful — do not harden the w-clip." The clip is faithful in the picture it computes; it is the non-termination that is broken.
  5. Why the suite is green: PortalVisibilityBuilderTests build cells with WorldTransform = Matrix4x4.Identity and axis-aligned quads in 2-cell chains (cam → ground → exit). No A↔B cycle, no transform-induced drift → the project→clip round-trip is exact → the dedup collapses duplicates → the BFS converges. The real cottage is a cyclic cell cluster (0x016F0x0175, mutual portals) with non-identity transforms → drift + cycle → non-termination. The suite cannot reach the failing case.
  6. Why run 1 survived 113 frames then froze: Build converges at most camera poses; only specific poses create the non-converging drift cycle. The freeze coincided with the metastable doorway flip ([render-sig] stable went 39→0, visible-cell count 5→4) one frame before the log ended.

Hypotheses (ranked)

  1. (confirmed) Non-terminating BFS: re-enqueue-on-grow + ProjectToClip drift defeats the SamePolygon dedup → unbounded CellView growth. Falsify: a re-process cap, a drift-tolerant dedup, or ProjectToNdc on the near-side clip all make Build terminate.
  2. (ruled out) GPU/driver hang from a malformed draw — the stack is pure managed CPU_TIME in CellView.Add, never a GL call; no fault.
  3. (ruled out) Probe-output stdout saturation — disproven: the probe-free run also hung.
  4. (ruled out) Task 2 — Build is upstream of every Task 2 line and unchanged by it.

Fix options (all additive — none reverts the dirty tree)

Fix Touches Pro Con
A (rec.) Drift-tolerant dedup: round clipped polygon vertices to a small grid (≈1e-3) before AddRegion, or widen/snaps SamePolygon's match, so near-duplicates collapse → growth converges. CellView/AddRegion Fixes the actual root cause ("drift defeats dedup"); keeps the faithful ProjectToClip; preserves growth-propagation. ~10 lines. Tolerance is a tuning constant (pick conservatively; over-merge = minor over-tighten).
B Restore a re-process bound (MaxReprocessPerCell-style cap on the BFS). Build loop Smallest; guarantees termination; doesn't touch clip. A guard, not a root fix; may under-include a late-growing view. The user's "no workarounds" rule applies — this is the band-aid.
C Near-side clip on ProjectToNdc (what the reciprocal clip already uses). ClipPortalAgainstView Removes the drift source directly; consistent with :694. Steps on this session's homogeneous near-eye clip work; the handoff's "don't harden the w-clip" is closest to here.

Recommended next step: approve A (drift-tolerant dedup) — it closes the precise mechanism the code half-acknowledges at :694, terminates structurally, and leaves the faithful clip path intact. Implement in a follow-up (not report-only) session, then re-run the Task 2 visual gate (probe-free) at the cottage + cellar.

What this is NOT

  • NOT Task 2 / the grey fix — that is verified working (draw==ids, miss=[], textured shells).
  • NOT a wrong-pixels / unfaithful-projection bug — it's a termination bug. The handoff's "the clip math is faithful, don't harden the w-clip" is about projection correctness; this is BFS convergence. Don't chase the w-clip.
  • NOT a GPU/shader/driver hang and NOT the probe firehose (both ruled out by the stack and the probe-free repro).

Reassessment — is the dirty-tree builder rewire sound? (post Option A)

Option A (drift-tolerant CellView.Add dedup, CellViewDedupTests green) was implemented and the client relaunched. Result: the hang moved out of CellView.Add (A worked for its target) but relocated to ScreenPolygonClip.ClipByEdge via ApplyReciprocalClip (second captured stack, hang-stack2.txt). ScreenPolygonClip.Intersect/ClipByEdge are both bounded for loops — they cannot spin on one call — so the spin is the outer Build BFS still not terminating and calling them a runaway number of times. Option A is necessary but not sufficient.

Git evidence (what the dirty rewire changed re: termination)

  • HEAD (committed) near-side portal clip = PortalProjection.ProjectToNdc (float-stable; git show HEAD: line 146). The dirty rewire switched it to ProjectToClip (ClipPortalAgainstView, dirty line 474) — the homogeneous near-eye clip, introduced to fix the near/grazing-doorway flap/void.
  • The MaxReprocessPerCell hard cap was removed earlier (committed Phase U.2a d880775), replaced by "fixpoint termination." Neither HEAD nor the dirty tree has a hard iteration bound.
  • The dirty rewire's own comment (PortalVisibilityBuilder.cs:519-522) documents that ProjectToClip "produced per-round float drift that defeated the CellView SamePolygon dedup" — and applied that lesson only to the reciprocal clip (kept on ProjectToNdc), leaving the near-side clip on the drift-prone ProjectToClip.

Soundness verdict

The builder's termination model is unsound by construction. It relies on the clipped regions reaching a geometric fixpoint — re-clipping a cell's view reproduces exactly-equal polygons that the dedup recognises — with no hard iteration bound. That only holds if the clip is float-stable. ProjectToClip (needed for faithful near-doorway projection) injects per-round drift, so re-clipping never reproduces an exactly-equal polygon, the dedup never catches it, and the re-enqueue-on-grow flood never converges → infinite loop. You cannot have BOTH faithful near-doorway projection (ProjectToClip) AND convergence-via-exact-dedup-without-a-bound. HEAD got away with it because ProjectToNdc was stable enough to converge (and it sealed — user-verified); the dirty switch tipped it into non-termination. The rewire fixed the projection and, apparently never having been launched, shipped a hang.

A's drift-tolerant dedup narrows the gap but cannot close it: for some geometry the per-round drift exceeds any fixed snap grid, so growth still produces new keys forever. Only a hard bound guarantees termination.

Paths (for the user to choose)

Path Termination Projection fidelity Risk
1 (rec.) Keep ProjectToClip + add enqueue-once bound (D) — the builder's own comment already calls enqueue-once "the hard termination guarantee"; the re-enqueue-on-grow is the bug. Keep A. Guaranteed (≤N pops) Full (faithful doorway clip kept) Minor under-inclusion of late growth → visual-verify; widen to a cap if needed
2 Keep ProjectToClip + add a re-process cap (B, restore MaxReprocessPerCell). Keep A. Guaranteed (≤N×K) Full Less faithful than enqueue-once; a tuning constant
3 Revert the near-side ProjectToClip → ProjectToNdc (back to HEAD). Restored (HEAD converged) Loses the rewire's near-doorway fix → reintroduces the flap/void (separate bug) Throws away this session's projection work; contradicts the keep-the-dirty-tree directive

A bound (paths 1/2) is the sound fix: it makes termination independent of clip drift, so the faithful ProjectToClip projection AND guaranteed termination coexist. Recommendation: path 1 (enqueue-once + keep A), visual-verify for under-inclusion. Reverting (path 3) only trades the hang back for the flap/void.