Checkpoint of the unified retail-faithful indoor render. The two-week HANG/grey is fixed and the interior seals (live-verified by the user). Commits the session render-rewrite foundation together with the fixes that made it functional.

- HANG fix: PortalVisibilityBuilder.Build portal flood did not terminate (the faithful ProjectToClip near-side clip drifts per round, defeating the CellView dedup; the BFS had no bound after U.2a removed MaxReprocessPerCell). Fix = drift-tolerant snapped/canonical CellView.Add dedup (PortalView.cs) plus restored MaxReprocessPerCell=16 bounded re-enqueue (PortalVisibilityBuilder.cs). Re-enqueue is kept (load-bearing for late-slice propagation, Build_ViewGrowthAfterDoneCell_PropagatesNewSlicesToExit); only its count is capped. CellViewDedupTests added.
- Seal (DrawCells Task 2): RetailPViewRenderer.DrawEnvCellShells draws EVERY visible cell via IndoorDrawPlan.ShellPass (was gated on the ClipFrameAssembler slot filter, leaving slot-less cells grey).
- Look-in FPS: GameWindow exterior look-in candidates limited to the player landblock +-1 (was all ~81 loaded LBs iterated every outdoor frame). No behaviour change (far cells were >48m, already culled).

Remaining dominant issue = the FLAP at transitions: viewer-cell metastability (render roots at the camera-eye cell, which oscillates outdoor-indoor as the 3rd-person boom drifts across the doorway, confirmed in render-sig). SEPARATE fix, NOT the DrawCells port.

Full handoff + flap fix plan + tracked follow-ups (#78 terrain, look-in-from-inside, look-in FPS, L-spotlight): docs/research/2026-06-07-indoor-render-session-handoff.md.

Baselines: build 0 err; App.Tests 210/210; Core.Tests 1331 pass / 4 fail (pre-existing) / 1 skip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Indoor render HANG — root cause: PortalVisibilityBuilder.Build non-termination — 2026-06-06
Report-only investigation (user chose "investigate more first"). No code changed. Worktree
thirsty-goldberg-51bb9b. This blocks the verbatim-DrawCells port's Task 2 visual gate: every indoor frame can freeze here.
Symptom
Three launches of the client all froze (AppHangB1, Windows Event Log) within
seconds-to-minutes of the camera being indoors at the Holtburg cottage. Not a crash —
no access violation, no managed exception. The captured managed stack of the frozen
render thread (hang-stack.txt, via dotnet-stack) shows it CPU-spinning:
CPU_TIME
CellView.Add(ViewPolygon)
PortalVisibilityBuilder.AddRegion(CellView, List<ViewPolygon>)
PortalVisibilityBuilder.Build(...)
RetailPViewRenderer.DrawInside(...)
GameWindow.OnRender(...)
App.Tests 207/207 and Core 1331/4/1 are green; the bug is invisible to the suite (see §Evidence).
Verdict
It is NOT Task 2 (the verbatim-DrawCells / grey fix). Build(...) runs at the very
top of DrawInside (RetailPViewRenderer.cs:43),
before any line Task 2 touched, and the call is byte-identical pre/post-change. Task 2's
draw logic was independently confirmed correct in the run-1 log: [render-sig] draw=[…]
equalled ids=[…] with miss=[], and [shell] showed every visible cell drawing textured
(zh=0). The grey fix works.
Root cause: PortalVisibilityBuilder.Build's portal BFS does not terminate for real
cottage geometry. It re-enqueues a popped cell every time that cell's CellView grows:
- queued.Remove(cell.CellId) on pop (:122)
- if (grew && queued.Add(neighbourId)) on grow (:289)

Termination therefore depends entirely on growth stopping. Growth is gated only by CellView.Add's exact-match dedup (SamePolygon, eps 1e-4, PortalView.cs:79). The near-side portal clip (ClipPortalAgainstView → PortalProjection.ProjectToClip → ClipToRegion, :474/:485) produces a polygon that is a hair different on each A↔B reciprocal round (float drift through the homogeneous project→clip round-trip with a non-identity cell transform). The dedup never matches the drifted near-duplicate → the region grows without bound → the cell re-enqueues forever → CellView.Polygons grows to N, and CellView.Add's O(N) dedup scan makes the whole thing O(N²) → frozen.
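The mechanism can be reproduced in a toy model. The sketch below is not the project's C#; it is a minimal Python stand-in that assumes a two-cell A↔B cycle, represents each "polygon" as a single float, and models the project→clip round-trip as adding a fixed `drift`. The names `flood`, `drift`, and `max_pops` are hypothetical, and `max_pops` exists only so the demonstration returns instead of spinning.

```python
from collections import deque


def flood(drift, eps=1e-4, max_pops=200):
    """Toy re-enqueue-on-grow flood over a two-cell cycle.

    Returns (pops, converged). `drift` stands in for the per-round float
    drift of the clip; `eps` plays the role of the exact-match tolerance.
    """
    views = {"A": [0.0], "B": []}          # per-cell list of stand-in polygons
    neighbour = {"A": "B", "B": "A"}       # mutual portals: A <-> B
    queue, queued = deque(["A"]), {"A"}
    pops = 0
    while queue:
        if pops >= max_pops:               # demo-only guard, absent in the failing code
            return pops, False
        cell = queue.popleft()
        queued.discard(cell)               # mirrors queued.Remove(cell) on pop
        pops += 1
        other = neighbour[cell]
        grew = False
        for poly in list(views[cell]):
            clipped = poly + drift         # drifted near-duplicate
            # O(N) dedup scan: add only if no stored polygon is within eps
            if all(abs(clipped - p) > eps for p in views[other]):
                views[other].append(clipped)
                grew = True
        if grew and other not in queued:   # mirrors grew && queued.Add(neighbour)
            queued.add(other)
            queue.append(other)
    return pops, True


if __name__ == "__main__":
    print(flood(0.0))    # stable clip
    print(flood(1e-3))   # drift larger than eps
```

In this model a drift of zero settles after two pops, while a drift larger than `eps` adds one new entry on every pop and only stops because of the demo guard, which is the unbounded-growth behaviour described above.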
Evidence
- Captured stack pins the spin to CellView.Add ← AddRegion ← Build, pure managed CPU_TIME (not a GL call, not blocked, not a fault).
- The code already documents this exact failure at :694-697: the reciprocal clip deliberately stays on the float-stable ProjectToNdc path because "per-round float drift defeated the CellView SamePolygon dedup, inflating a tight A<->B reciprocal view to ~4x its area." The near-side clip (:474) did not get the same treatment — it uses ProjectToClip.
- The only bound was removed this session. :74: "Fixpoint termination replacing the old MaxReprocessPerCell hard cap." The fixpoint never converges under drift; with the cap gone there is no other bound (no iteration cap, no max-polygon cap, no time bound).
- It's the dirty-tree rewire the handoff said to KEEP. git diff --stat: PortalVisibilityBuilder.cs +426/−45 and PortalProjection.cs +111 are uncommitted. ProjectToClip is part of the new PortalProjection lines. The handoff (2026-06-06-verbatim-drawcells-port-pickup-handoff.md) lists this rewire as the faithful foundation to preserve and says "the clip math is already faithful — do not harden the w-clip." The clip is faithful in the picture it computes; it is the non-termination that is broken.
- Why the suite is green: PortalVisibilityBuilderTests build cells with WorldTransform = Matrix4x4.Identity and axis-aligned quads in 2-cell chains (cam → ground → exit). No A↔B cycle, no transform-induced drift → the project→clip round-trip is exact → the dedup collapses duplicates → the BFS converges. The real cottage is a cyclic cell cluster (0x016F–0x0175, mutual portals) with non-identity transforms → drift + cycle → non-termination. The suite cannot reach the failing case.
- Why run 1 survived 113 frames then froze: Build converges at most camera poses; only specific poses create the non-converging drift cycle. The freeze coincided with the metastable doorway flip ([render-sig] stable went 39→0, visible-cell count 5→4) one frame before the log ended.
Hypotheses (ranked)
- (confirmed) Non-terminating BFS: re-enqueue-on-grow + ProjectToClip drift defeats the SamePolygon dedup → unbounded CellView growth. Falsify: a re-process cap, a drift-tolerant dedup, or ProjectToNdc on the near-side clip all make Build terminate.
- (ruled out) GPU/driver hang from a malformed draw — the stack is pure managed CPU_TIME in CellView.Add, never a GL call; no fault.
- (ruled out) Probe-output stdout saturation — disproven: the probe-free run also hung.
- (ruled out) Task 2 — Build is upstream of every Task 2 line and unchanged by it.
Fix options (all additive — none reverts the dirty tree)
| | Fix | Touches | Pro | Con |
|---|---|---|---|---|
| A (rec.) | Drift-tolerant dedup: round clipped polygon vertices to a small grid (≈1e-3) before AddRegion, or widen/snap SamePolygon's match, so near-duplicates collapse → growth converges. | CellView/AddRegion | Fixes the actual root cause ("drift defeats dedup"); keeps the faithful ProjectToClip; preserves growth-propagation. ~10 lines. | Tolerance is a tuning constant (pick conservatively; over-merge = minor over-tighten). |
| B | Restore a re-process bound (MaxReprocessPerCell-style cap on the BFS). | Build loop | Smallest; guarantees termination; doesn't touch clip. | A guard, not a root fix; may under-include a late-growing view. The user's "no workarounds" rule applies — this is the band-aid. |
| C | Near-side clip on ProjectToNdc (what the reciprocal clip already uses). | ClipPortalAgainstView | Removes the drift source directly; consistent with :694. | Steps on this session's homogeneous near-eye clip work; the handoff's "don't harden the w-clip" is closest to here. |
Recommended next step: approve A (drift-tolerant dedup) — it closes the precise
mechanism the code half-acknowledges at :694, terminates structurally, and leaves the
faithful clip path intact. Implement in a follow-up (not report-only) session, then re-run the
Task 2 visual gate (probe-free) at the cottage + cellar.
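As an illustration of what option A amounts to, here is a minimal Python sketch of a snapped, canonical dedup key. It is not the project's PortalView.cs code: the class shape, the `canonical_key` helper, and the choice to rotate the vertex list so the smallest snapped vertex comes first are all assumptions made for the example; only the ≈1e-3 grid comes from the table above.

```python
def canonical_key(polygon, grid=1e-3):
    """Snap each (x, y) vertex to an integer grid cell, then rotate the
    vertex list so the smallest snapped vertex is first. Two polygons that
    differ only by sub-grid drift or by start vertex usually share a key."""
    snapped = [(round(x / grid), round(y / grid)) for x, y in polygon]
    start = min(range(len(snapped)), key=lambda i: snapped[i])
    return tuple(snapped[start:] + snapped[:start])


class CellView:
    """Hypothetical stand-in for a cell's view: stores polygons and rejects
    near-duplicates through a set lookup instead of an O(N) exact scan."""

    def __init__(self):
        self.polygons = []
        self._keys = set()

    def add(self, polygon, grid=1e-3):
        key = canonical_key(polygon, grid)
        if key in self._keys:
            return False               # near-duplicate: view did not grow
        self._keys.add(key)
        self.polygons.append(polygon)
        return True                    # view grew


if __name__ == "__main__":
    view = CellView()
    print(view.add([(0.1, 0.1), (0.5, 0.1), (0.5, 0.5)]))
    # Same triangle, rotated start vertex, with sub-grid drift on each vertex.
    print(view.add([(0.5000002, 0.1), (0.5, 0.5000001), (0.1000003, 0.1)]))
```

The second `add` is rejected because both triangles snap to the same key. The tolerance caveat from the table still applies: a vertex sitting close to a grid-cell boundary can snap either way, so snapping reduces growth without guaranteeing it stops.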
What this is NOT
- NOT Task 2 / the grey fix — that is verified working (draw==ids, miss=[], textured shells).
- NOT a wrong-pixels / unfaithful-projection bug — it's a termination bug. The handoff's "the clip math is faithful, don't harden the w-clip" is about projection correctness; this is BFS convergence. Don't chase the w-clip.
- NOT a GPU/shader/driver hang and NOT the probe firehose (both ruled out by the stack and the probe-free repro).
Reassessment — is the dirty-tree builder rewire sound? (post Option A)
Option A (drift-tolerant CellView.Add dedup, CellViewDedupTests green) was implemented and the
client relaunched. Result: the hang moved out of CellView.Add (A worked for its target) but
relocated to ScreenPolygonClip.ClipByEdge via ApplyReciprocalClip (second captured stack,
hang-stack2.txt). ScreenPolygonClip.Intersect/ClipByEdge are both bounded for loops —
they cannot spin on one call — so the spin is the outer Build BFS still not terminating and
calling them a runaway number of times. Option A is necessary but not sufficient.
Git evidence (what the dirty rewire changed re: termination)
- HEAD (committed) near-side portal clip = PortalProjection.ProjectToNdc (float-stable; git show HEAD: line 146). The dirty rewire switched it to ProjectToClip (ClipPortalAgainstView, dirty line 474) — the homogeneous near-eye clip, introduced to fix the near/grazing-doorway flap/void.
- The MaxReprocessPerCell hard cap was removed earlier (committed Phase U.2a d880775), replaced by "fixpoint termination." Neither HEAD nor the dirty tree has a hard iteration bound.
- The dirty rewire's own comment (PortalVisibilityBuilder.cs:519-522) documents that ProjectToClip "produced per-round float drift that defeated the CellView SamePolygon dedup" — and applied that lesson only to the reciprocal clip (kept on ProjectToNdc), leaving the near-side clip on the drift-prone ProjectToClip.
Soundness verdict
The builder's termination model is unsound by construction. It relies on the clipped regions
reaching a geometric fixpoint — re-clipping a cell's view reproduces exactly-equal polygons that the
dedup recognises — with no hard iteration bound. That only holds if the clip is float-stable.
ProjectToClip (needed for faithful near-doorway projection) injects per-round drift, so re-clipping
never reproduces an exactly-equal polygon, the dedup never catches it, and the re-enqueue-on-grow flood
never converges → infinite loop. You cannot have BOTH faithful near-doorway projection (ProjectToClip)
AND convergence-via-exact-dedup-without-a-bound. HEAD got away with it because ProjectToNdc was
stable enough to converge (and it sealed — user-verified); the dirty switch tipped it into non-termination.
The rewire fixed the projection and, apparently never having been launched, shipped a hang.
A's drift-tolerant dedup narrows the gap but cannot close it: for some geometry the per-round drift exceeds any fixed snap grid, so growth still produces new keys forever. Only a hard bound guarantees termination.
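A small numeric sketch shows why a fixed snap grid is not a termination guarantee. It assumes, as a simplification the source does not confirm, that a vertex coordinate drifts by the same amount in the same direction every round; `distinct_keys` and its parameters are hypothetical names for this example only.

```python
def distinct_keys(drift_per_round, rounds, grid=1e-3):
    """Count how many different snapped keys one coordinate produces when
    it moves by `drift_per_round` on every round (assumed monotonic drift)."""
    keys = set()
    for k in range(rounds):
        coordinate = k * drift_per_round
        keys.add(round(coordinate / grid))   # same snap rule as a grid dedup
    return len(keys)


if __name__ == "__main__":
    for rounds in (100, 1000):
        print(rounds, distinct_keys(0.0, rounds), distinct_keys(0.4e-3, rounds))
```

Under this assumption a stable coordinate always maps to one key, while a drifting one keeps crossing into new grid cells, so the number of distinct keys rises with the number of rounds and the dedup alone never stops the growth.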
Paths (for the user to choose)
| | Path | Termination | Projection fidelity | Risk |
|---|---|---|---|---|
| 1 (rec.) | Keep ProjectToClip + add enqueue-once bound (D) — the builder's own comment already calls enqueue-once "the hard termination guarantee"; the re-enqueue-on-grow is the bug. Keep A. | Guaranteed (≤N pops) | Full (faithful doorway clip kept) | Minor under-inclusion of late growth → visual-verify; widen to a cap if needed |
| 2 | Keep ProjectToClip + add a re-process cap (B, restore MaxReprocessPerCell). Keep A. | Guaranteed (≤N×K) | Full | Less faithful than enqueue-once; a tuning constant |
| 3 | Revert the near-side ProjectToClip → ProjectToNdc (back to HEAD). | Restored (HEAD converged) | Loses the rewire's near-doorway fix → reintroduces the flap/void (separate bug) | Throws away this session's projection work; contradicts the keep-the-dirty-tree directive |
A bound (paths 1/2) is the sound fix: it makes termination independent of clip drift, so the faithful
ProjectToClip projection AND guaranteed termination coexist. Recommendation: path 1 (enqueue-once +
keep A), visual-verify for under-inclusion. Reverting (path 3) only trades the hang back for the
flap/void.
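The two bounded paths can be sketched with one toy flood in Python. This is an illustration under stated assumptions, not the builder's code: `bounded_flood`, the `grows` callback, and the graph dictionaries are hypothetical, and the default cap of 16 is taken from the MaxReprocessPerCell=16 value in the commit message. A cap of 1 corresponds to enqueue-once (path 1); a larger cap corresponds to the re-process bound (path 2).

```python
from collections import deque

MAX_REPROCESS_PER_CELL = 16  # value quoted in the commit message


def bounded_flood(neighbours, start, grows, cap=MAX_REPROCESS_PER_CELL):
    """Flood that re-enqueues a neighbour on growth, but never lets any
    cell be popped more than `cap` times. Returns pops per cell."""
    pops = {cell: 0 for cell in neighbours}
    queue, queued = deque([start]), {start}
    while queue:
        cell = queue.popleft()
        queued.discard(cell)
        pops[cell] += 1
        for nb in neighbours[cell]:
            # Re-enqueue only while the neighbour still has budget left.
            if grows(cell, nb) and pops[nb] < cap and nb not in queued:
                queued.add(nb)
                queue.append(nb)
    return pops


if __name__ == "__main__":
    cycle = {"A": ["B"], "B": ["A"]}
    never_converges = lambda cell, nb: True   # worst case: every pass "grows"
    print(bounded_flood(cycle, "A", never_converges))          # capped re-process
    print(bounded_flood(cycle, "A", never_converges, cap=1))   # enqueue-once
```

Even with a growth predicate that never settles, each cell is popped at most `cap` times, so the total work is bounded by N×K for N cells and cap K regardless of how the clip drifts. The trade-off named in the table remains: a smaller cap stops sooner but may miss view growth that arrives late.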