Checkpoint of the unified retail-faithful indoor render. The two-week HANG/grey is fixed and the interior seals (live-verified by the user). Commits the session render-rewrite foundation together with the fixes that made it functional. - HANG fix: PortalVisibilityBuilder.Build portal flood did not terminate (the faithful ProjectToClip near-side clip drifts per round, defeating the CellView dedup; the BFS had no bound after U.2a removed MaxReprocessPerCell). Fix = drift-tolerant snapped/canonical CellView.Add dedup (PortalView.cs) plus restored MaxReprocessPerCell=16 bounded re-enqueue (PortalVisibilityBuilder.cs). Re-enqueue is kept (load-bearing for late-slice propagation, Build_ViewGrowthAfterDoneCell_PropagatesNewSlicesToExit); only its count is capped. CellViewDedupTests added. - Seal (DrawCells Task 2): RetailPViewRenderer.DrawEnvCellShells draws EVERY visible cell via IndoorDrawPlan.ShellPass (was gated on the ClipFrameAssembler slot filter, leaving slot-less cells grey). - Look-in FPS: GameWindow exterior look-in candidates limited to the player landblock +-1 (was all ~81 loaded LBs iterated every outdoor frame). No behaviour change (far cells were >48m, already culled). Remaining dominant issue = the FLAP at transitions: viewer-cell metastability (render roots at the camera-eye cell, which oscillates outdoor-indoor as the 3rd-person boom drifts across the doorway, confirmed in render-sig). SEPARATE fix, NOT the DrawCells port. Full handoff + flap fix plan + tracked follow-ups (#78 terrain, look-in-from-inside, look-in FPS, L-spotlight): docs/research/2026-06-07-indoor-render-session-handoff.md. Baselines: build 0 err; App.Tests 210/210; Core.Tests 1331 pass / 4 fail (pre-existing) / 1 skip. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
163 lines
12 KiB
Markdown
163 lines
12 KiB
Markdown
# Indoor render HANG — root cause: `PortalVisibilityBuilder.Build` non-termination — 2026-06-06
|
||
|
||
> Report-only investigation (user chose "investigate more first"). **No code changed.**
|
||
> Worktree `thirsty-goldberg-51bb9b`. This blocks the verbatim-DrawCells port's Task 2
|
||
> visual gate: every indoor frame can freeze here.
|
||
|
||
## Symptom
|
||
|
||
Three launches of the client all **froze** (`AppHangB1`, Windows Event Log) within
|
||
seconds-to-minutes of the camera being indoors at the Holtburg cottage. Not a crash —
|
||
no access violation, no managed exception. The captured managed stack of the frozen
|
||
render thread (`hang-stack.txt`, via `dotnet-stack`) shows it **CPU-spinning**:
|
||
|
||
```
|
||
CPU_TIME
|
||
CellView.Add(ViewPolygon)
|
||
PortalVisibilityBuilder.AddRegion(CellView, List<ViewPolygon>)
|
||
PortalVisibilityBuilder.Build(...)
|
||
RetailPViewRenderer.DrawInside(...)
|
||
GameWindow.OnRender(...)
|
||
```
|
||
|
||
App.Tests 207/207 and Core 1331/4/1 are green; the bug is invisible to the suite (see §Evidence).
|
||
|
||
## Verdict
|
||
|
||
**It is NOT Task 2 (the verbatim-DrawCells / grey fix).** `Build(...)` runs at the very
|
||
top of `DrawInside` ([RetailPViewRenderer.cs:43](../../src/AcDream.App/Rendering/RetailPViewRenderer.cs)),
|
||
**before** any line Task 2 touched, and the call is byte-identical pre/post-change. Task 2's
|
||
draw logic was independently confirmed correct in the run-1 log: `[render-sig] draw=[…]`
|
||
equalled `ids=[…]` with `miss=[]`, and `[shell]` showed every visible cell drawing textured
|
||
(`zh=0`). The grey fix works.
|
||
|
||
**Root cause:** `PortalVisibilityBuilder.Build`'s portal BFS does not terminate for real
|
||
cottage geometry. It **re-enqueues a popped cell every time that cell's `CellView` grows**:
|
||
`queued.Remove(cell.CellId)` on pop ([:122](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs))
|
||
+ `if (grew && queued.Add(neighbourId))` on grow ([:289](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs)).
|
||
Termination therefore depends entirely on growth stopping. Growth is gated only by
|
||
`CellView.Add`'s **exact-match dedup** (`SamePolygon`, eps `1e-4`,
|
||
[PortalView.cs:79](../../src/AcDream.App/Rendering/PortalView.cs)). The **near-side portal clip**
|
||
(`ClipPortalAgainstView` → `PortalProjection.ProjectToClip` → `ClipToRegion`,
|
||
[:474/:485](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs)) produces a polygon
|
||
that is a hair different on each `A↔B` reciprocal round (float drift through the homogeneous
|
||
project→clip round-trip with a non-identity cell transform). The dedup never matches the
|
||
drifted near-duplicate → the region grows without bound → the cell re-enqueues forever →
|
||
`CellView.Polygons` grows to N, and `CellView.Add`'s O(N) dedup scan makes the whole thing
|
||
O(N²) → frozen.
|
||
|
||
## Evidence
|
||
|
||
1. **Captured stack** pins the spin to `CellView.Add ← AddRegion ← Build`, pure managed
|
||
`CPU_TIME` (not a GL call, not blocked, not a fault).
|
||
2. **The code already documents this exact failure** at
|
||
[:694-697](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs): the *reciprocal*
|
||
clip deliberately stays on the float-stable `ProjectToNdc` path *because*
|
||
"per-round float drift defeated the CellView SamePolygon dedup, inflating a tight A<->B
|
||
reciprocal view to ~4x its area." The **near-side** clip ([:474](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs))
|
||
did not get the same treatment — it uses `ProjectToClip`.
|
||
3. **The only bound was removed this session.** [:74](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs):
|
||
"Fixpoint termination replacing the old `MaxReprocessPerCell` hard cap." The fixpoint never
|
||
converges under drift; with the cap gone there is no other bound (no iteration cap, no
|
||
max-polygon cap, no time bound).
|
||
4. **It's the dirty-tree rewire the handoff said to KEEP.** `git diff --stat`:
|
||
`PortalVisibilityBuilder.cs +426/−45` and `PortalProjection.cs +111` are **uncommitted**.
|
||
`ProjectToClip` is part of the new `PortalProjection` lines. The handoff
|
||
(`2026-06-06-verbatim-drawcells-port-pickup-handoff.md`) lists this rewire as the faithful
|
||
foundation to preserve and says "the clip math is already faithful — do not harden the
|
||
w-clip." The clip is faithful in the *picture* it computes; it is the *non-termination*
|
||
that is broken.
|
||
5. **Why the suite is green:** `PortalVisibilityBuilderTests` build cells with
|
||
`WorldTransform = Matrix4x4.Identity` and axis-aligned quads in 2-cell **chains**
|
||
(`cam → ground → exit`). No `A↔B` cycle, no transform-induced drift → the project→clip
|
||
round-trip is exact → the dedup collapses duplicates → the BFS converges. The real cottage
|
||
is a **cyclic** cell cluster (`0x016F–0x0175`, mutual portals) with **non-identity**
|
||
transforms → drift + cycle → non-termination. The suite cannot reach the failing case.
|
||
6. **Why run 1 survived 113 frames then froze:** `Build` converges at most camera poses; only
|
||
specific poses create the non-converging drift cycle. The freeze coincided with the
|
||
metastable doorway flip (`[render-sig] stable` went 39→0, visible-cell count 5→4) one frame
|
||
before the log ended.
|
||
|
||
## Hypotheses (ranked)
|
||
|
||
1. **(confirmed)** Non-terminating BFS: re-enqueue-on-grow + `ProjectToClip` drift defeats the
|
||
`SamePolygon` dedup → unbounded `CellView` growth. Falsify: a re-process cap, a
|
||
drift-tolerant dedup, or `ProjectToNdc` on the near-side clip all make `Build` terminate.
|
||
2. *(ruled out)* GPU/driver hang from a malformed draw — the stack is pure managed `CPU_TIME`
|
||
in `CellView.Add`, never a GL call; no fault.
|
||
3. *(ruled out)* Probe-output stdout saturation — disproven: the probe-free run also hung.
|
||
4. *(ruled out)* Task 2 — `Build` is upstream of every Task 2 line and unchanged by it.
|
||
|
||
## Fix options (all additive — none reverts the dirty tree)
|
||
|
||
| | Fix | Touches | Pro | Con |
|
||
|---|---|---|---|---|
|
||
| **A** *(rec.)* | **Drift-tolerant dedup**: round clipped polygon vertices to a small grid (≈`1e-3`) before `AddRegion`, or widen/snaps `SamePolygon`'s match, so near-duplicates collapse → growth converges. | `CellView`/`AddRegion` | Fixes the actual root cause ("drift defeats dedup"); keeps the faithful `ProjectToClip`; preserves growth-propagation. ~10 lines. | Tolerance is a tuning constant (pick conservatively; over-merge = minor over-tighten). |
|
||
| **B** | **Restore a re-process bound** (`MaxReprocessPerCell`-style cap on the BFS). | `Build` loop | Smallest; guarantees termination; doesn't touch clip. | A guard, not a root fix; may under-include a late-growing view. The user's "no workarounds" rule applies — this is the band-aid. |
|
||
| **C** | **Near-side clip on `ProjectToNdc`** (what the reciprocal clip already uses). | `ClipPortalAgainstView` | Removes the drift source directly; consistent with `:694`. | Steps on this session's homogeneous near-eye clip work; the handoff's "don't harden the w-clip" is closest to here. |
|
||
|
||
**Recommended next step:** approve **A** (drift-tolerant dedup) — it closes the precise
|
||
mechanism the code half-acknowledges at `:694`, terminates structurally, and leaves the
|
||
faithful clip path intact. Implement in a follow-up (not report-only) session, then re-run the
|
||
Task 2 visual gate (probe-free) at the cottage + cellar.
|
||
|
||
## What this is NOT
|
||
|
||
- **NOT** Task 2 / the grey fix — that is verified working (`draw==ids`, `miss=[]`, textured shells).
|
||
- **NOT** a wrong-pixels / unfaithful-projection bug — it's a **termination** bug. The handoff's
|
||
"the clip math is faithful, don't harden the w-clip" is about projection *correctness*; this is
|
||
BFS *convergence*. Don't chase the w-clip.
|
||
- **NOT** a GPU/shader/driver hang and **NOT** the probe firehose (both ruled out by the stack
|
||
and the probe-free repro).
|
||
|
||
---
|
||
|
||
## Reassessment — is the dirty-tree builder rewire sound? (post Option A)
|
||
|
||
Option A (drift-tolerant `CellView.Add` dedup, `CellViewDedupTests` green) was implemented and the
|
||
client relaunched. Result: the hang **moved out of `CellView.Add`** (A worked for its target) but
|
||
**relocated to `ScreenPolygonClip.ClipByEdge`** via `ApplyReciprocalClip` (second captured stack,
|
||
`hang-stack2.txt`). `ScreenPolygonClip.Intersect`/`ClipByEdge` are **both bounded `for` loops** —
|
||
they cannot spin on one call — so the spin is the **outer `Build` BFS** still not terminating and
|
||
calling them a runaway number of times. **Option A is necessary but not sufficient.**
|
||
|
||
### Git evidence (what the dirty rewire changed re: termination)
|
||
|
||
- **HEAD (committed)** near-side portal clip = `PortalProjection.ProjectToNdc` (float-stable;
|
||
`git show HEAD:` line 146). **The dirty rewire switched it to `ProjectToClip`** (`ClipPortalAgainstView`,
|
||
dirty line 474) — the homogeneous near-eye clip, introduced to fix the near/grazing-doorway flap/void.
|
||
- The `MaxReprocessPerCell` **hard cap was removed earlier** (committed Phase U.2a `d880775`), replaced
|
||
by "fixpoint termination." **Neither HEAD nor the dirty tree has a hard iteration bound.**
|
||
- The dirty rewire's own comment (`PortalVisibilityBuilder.cs:519-522`) documents that
|
||
`ProjectToClip` "produced per-round float drift that defeated the CellView SamePolygon dedup" — and
|
||
applied that lesson **only to the reciprocal clip** (kept on `ProjectToNdc`), leaving the **near-side**
|
||
clip on the drift-prone `ProjectToClip`.
|
||
|
||
### Soundness verdict
|
||
|
||
The builder's termination model is **unsound by construction.** It relies on the clipped regions
|
||
reaching a geometric fixpoint — re-clipping a cell's view reproduces *exactly-equal* polygons that the
|
||
dedup recognises — with **no hard iteration bound.** That only holds if the clip is float-stable.
|
||
`ProjectToClip` (needed for faithful near-doorway projection) injects per-round drift, so re-clipping
|
||
never reproduces an exactly-equal polygon, the dedup never catches it, and the re-enqueue-on-grow flood
|
||
never converges → infinite loop. **You cannot have BOTH faithful near-doorway projection (`ProjectToClip`)
|
||
AND convergence-via-exact-dedup-without-a-bound.** HEAD got away with it because `ProjectToNdc` was
|
||
stable enough to converge (and it sealed — user-verified); the dirty switch tipped it into non-termination.
|
||
The rewire fixed the *projection* and, apparently never having been launched, shipped a hang.
|
||
|
||
A's drift-tolerant dedup *narrows* the gap but cannot *close* it: for some geometry the per-round drift
|
||
exceeds any fixed snap grid, so growth still produces new keys forever. Only a **hard bound** guarantees
|
||
termination.
|
||
|
||
### Paths (for the user to choose)
|
||
|
||
| | Path | Termination | Projection fidelity | Risk |
|
||
|---|---|---|---|---|
|
||
| **1** *(rec.)* | Keep `ProjectToClip` + add **enqueue-once** bound (D) — the builder's own comment already calls enqueue-once "the hard termination guarantee"; the re-enqueue-on-grow is the bug. Keep A. | Guaranteed (≤N pops) | Full (faithful doorway clip kept) | Minor under-inclusion of late growth → visual-verify; widen to a cap if needed |
|
||
| **2** | Keep `ProjectToClip` + add a **re-process cap** (B, restore `MaxReprocessPerCell`). Keep A. | Guaranteed (≤N×K) | Full | Less faithful than enqueue-once; a tuning constant |
|
||
| **3** | **Revert** the near-side `ProjectToClip → ProjectToNdc` (back to HEAD). | Restored (HEAD converged) | **Loses** the rewire's near-doorway fix → reintroduces the flap/void (separate bug) | Throws away this session's projection work; contradicts the keep-the-dirty-tree directive |
|
||
|
||
A bound (paths 1/2) is the sound fix: it makes termination independent of clip drift, so the faithful
|
||
`ProjectToClip` projection AND guaranteed termination coexist. **Recommendation: path 1** (enqueue-once +
|
||
keep A), visual-verify for under-inclusion. Reverting (path 3) only trades the hang back for the
|
||
flap/void.
|