acdream/docs/research/2026-06-06-indoor-render-hang-rootcause.md
Erik 1405dd8e90 feat(render): indoor render WORKS — terminating portal flood + every-cell seal + look-in FPS
Checkpoint of the unified retail-faithful indoor render. The two-week HANG/grey is fixed and the
interior seals (live-verified by the user). Commits the session render-rewrite foundation together
with the fixes that made it functional.

- HANG fix: PortalVisibilityBuilder.Build portal flood did not terminate (the faithful ProjectToClip
  near-side clip drifts per round, defeating the CellView dedup; the BFS had no bound after U.2a removed
  MaxReprocessPerCell). Fix = drift-tolerant snapped/canonical CellView.Add dedup (PortalView.cs) plus
  restored MaxReprocessPerCell=16 bounded re-enqueue (PortalVisibilityBuilder.cs). Re-enqueue is kept
  (load-bearing for late-slice propagation, Build_ViewGrowthAfterDoneCell_PropagatesNewSlicesToExit);
  only its count is capped. CellViewDedupTests added.
- Seal (DrawCells Task 2): RetailPViewRenderer.DrawEnvCellShells draws EVERY visible cell via
  IndoorDrawPlan.ShellPass (was gated on the ClipFrameAssembler slot filter, leaving slot-less cells grey).
- Look-in FPS: GameWindow exterior look-in candidates limited to the player landblock +-1 (was all ~81
  loaded LBs iterated every outdoor frame). No behaviour change (far cells were >48m, already culled).

Remaining dominant issue = the FLAP at transitions: viewer-cell metastability (render roots at the
camera-eye cell, which oscillates outdoor-indoor as the 3rd-person boom drifts across the doorway,
confirmed in render-sig). SEPARATE fix, NOT the DrawCells port. Full handoff + flap fix plan + tracked
follow-ups (#78 terrain, look-in-from-inside, look-in FPS, L-spotlight):
docs/research/2026-06-07-indoor-render-session-handoff.md.

Baselines: build 0 err; App.Tests 210/210; Core.Tests 1331 pass / 4 fail (pre-existing) / 1 skip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 10:14:43 +02:00

163 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Indoor render HANG — root cause: `PortalVisibilityBuilder.Build` non-termination — 2026-06-06
> Report-only investigation (user chose "investigate more first"). **No code changed.**
> Worktree `thirsty-goldberg-51bb9b`. This blocks the verbatim-DrawCells port's Task 2
> visual gate: every indoor frame can freeze here.
## Symptom
Three launches of the client all **froze** (`AppHangB1`, Windows Event Log) within
seconds-to-minutes of the camera being indoors at the Holtburg cottage. Not a crash —
no access violation, no managed exception. The captured managed stack of the frozen
render thread (`hang-stack.txt`, via `dotnet-stack`) shows it **CPU-spinning**:
```
CPU_TIME
CellView.Add(ViewPolygon)
PortalVisibilityBuilder.AddRegion(CellView, List<ViewPolygon>)
PortalVisibilityBuilder.Build(...)
RetailPViewRenderer.DrawInside(...)
GameWindow.OnRender(...)
```
App.Tests 207/207 and Core 1331/4/1 are green; the bug is invisible to the suite (see §Evidence).
## Verdict
**It is NOT Task 2 (the verbatim-DrawCells / grey fix).** `Build(...)` runs at the very
top of `DrawInside` ([RetailPViewRenderer.cs:43](../../src/AcDream.App/Rendering/RetailPViewRenderer.cs)),
**before** any line Task 2 touched, and the call is byte-identical pre/post-change. Task 2's
draw logic was independently confirmed correct in the run-1 log: `[render-sig] draw=[…]`
equalled `ids=[…]` with `miss=[]`, and `[shell]` showed every visible cell drawing textured
(`zh=0`). The grey fix works.
**Root cause:** `PortalVisibilityBuilder.Build`'s portal BFS does not terminate for real
cottage geometry. It **re-enqueues a popped cell every time that cell's `CellView` grows**:
`queued.Remove(cell.CellId)` on pop ([:122](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs))
+ `if (grew && queued.Add(neighbourId))` on grow ([:289](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs)).
Termination therefore depends entirely on growth stopping. Growth is gated only by
`CellView.Add`'s **exact-match dedup** (`SamePolygon`, eps `1e-4`,
[PortalView.cs:79](../../src/AcDream.App/Rendering/PortalView.cs)). The **near-side portal clip**
(`ClipPortalAgainstView``PortalProjection.ProjectToClip``ClipToRegion`,
[:474/:485](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs)) produces a polygon
that is a hair different on each `A↔B` reciprocal round (float drift through the homogeneous
project→clip round-trip with a non-identity cell transform). The dedup never matches the
drifted near-duplicate → the region grows without bound → the cell re-enqueues forever →
`CellView.Polygons` grows to N, and `CellView.Add`'s O(N) dedup scan makes the whole thing
O(N²) → frozen.
## Evidence
1. **Captured stack** pins the spin to `CellView.Add ← AddRegion ← Build`, pure managed
`CPU_TIME` (not a GL call, not blocked, not a fault).
2. **The code already documents this exact failure** at
[:694-697](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs): the *reciprocal*
clip deliberately stays on the float-stable `ProjectToNdc` path *because*
"per-round float drift defeated the CellView SamePolygon dedup, inflating a tight A<->B
reciprocal view to ~4x its area." The **near-side** clip ([:474](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs))
did not get the same treatment — it uses `ProjectToClip`.
3. **The only bound was removed this session.** [:74](../../src/AcDream.App/Rendering/PortalVisibilityBuilder.cs):
"Fixpoint termination replacing the old `MaxReprocessPerCell` hard cap." The fixpoint never
converges under drift; with the cap gone there is no other bound (no iteration cap, no
max-polygon cap, no time bound).
4. **It's the dirty-tree rewire the handoff said to KEEP.** `git diff --stat`:
`PortalVisibilityBuilder.cs +426/45` and `PortalProjection.cs +111` are **uncommitted**.
`ProjectToClip` is part of the new `PortalProjection` lines. The handoff
(`2026-06-06-verbatim-drawcells-port-pickup-handoff.md`) lists this rewire as the faithful
foundation to preserve and says "the clip math is already faithful — do not harden the
w-clip." The clip is faithful in the *picture* it computes; it is the *non-termination*
that is broken.
5. **Why the suite is green:** `PortalVisibilityBuilderTests` build cells with
`WorldTransform = Matrix4x4.Identity` and axis-aligned quads in 2-cell **chains**
(`cam → ground → exit`). No `A↔B` cycle, no transform-induced drift → the project→clip
round-trip is exact → the dedup collapses duplicates → the BFS converges. The real cottage
is a **cyclic** cell cluster (`0x016F0x0175`, mutual portals) with **non-identity**
transforms → drift + cycle → non-termination. The suite cannot reach the failing case.
6. **Why run 1 survived 113 frames then froze:** `Build` converges at most camera poses; only
specific poses create the non-converging drift cycle. The freeze coincided with the
metastable doorway flip (`[render-sig] stable` went 39→0, visible-cell count 5→4) one frame
before the log ended.
## Hypotheses (ranked)
1. **(confirmed)** Non-terminating BFS: re-enqueue-on-grow + `ProjectToClip` drift defeats the
`SamePolygon` dedup → unbounded `CellView` growth. Falsify: a re-process cap, a
drift-tolerant dedup, or `ProjectToNdc` on the near-side clip all make `Build` terminate.
2. *(ruled out)* GPU/driver hang from a malformed draw — the stack is pure managed `CPU_TIME`
in `CellView.Add`, never a GL call; no fault.
3. *(ruled out)* Probe-output stdout saturation — disproven: the probe-free run also hung.
4. *(ruled out)* Task 2 — `Build` is upstream of every Task 2 line and unchanged by it.
## Fix options (all additive — none reverts the dirty tree)
| | Fix | Touches | Pro | Con |
|---|---|---|---|---|
| **A** *(rec.)* | **Drift-tolerant dedup**: round clipped polygon vertices to a small grid (≈`1e-3`) before `AddRegion`, or widen/snaps `SamePolygon`'s match, so near-duplicates collapse → growth converges. | `CellView`/`AddRegion` | Fixes the actual root cause ("drift defeats dedup"); keeps the faithful `ProjectToClip`; preserves growth-propagation. ~10 lines. | Tolerance is a tuning constant (pick conservatively; over-merge = minor over-tighten). |
| **B** | **Restore a re-process bound** (`MaxReprocessPerCell`-style cap on the BFS). | `Build` loop | Smallest; guarantees termination; doesn't touch clip. | A guard, not a root fix; may under-include a late-growing view. The user's "no workarounds" rule applies — this is the band-aid. |
| **C** | **Near-side clip on `ProjectToNdc`** (what the reciprocal clip already uses). | `ClipPortalAgainstView` | Removes the drift source directly; consistent with `:694`. | Steps on this session's homogeneous near-eye clip work; the handoff's "don't harden the w-clip" is closest to here. |
**Recommended next step:** approve **A** (drift-tolerant dedup) — it closes the precise
mechanism the code half-acknowledges at `:694`, terminates structurally, and leaves the
faithful clip path intact. Implement in a follow-up (not report-only) session, then re-run the
Task 2 visual gate (probe-free) at the cottage + cellar.
## What this is NOT
- **NOT** Task 2 / the grey fix — that is verified working (`draw==ids`, `miss=[]`, textured shells).
- **NOT** a wrong-pixels / unfaithful-projection bug — it's a **termination** bug. The handoff's
"the clip math is faithful, don't harden the w-clip" is about projection *correctness*; this is
BFS *convergence*. Don't chase the w-clip.
- **NOT** a GPU/shader/driver hang and **NOT** the probe firehose (both ruled out by the stack
and the probe-free repro).
---
## Reassessment — is the dirty-tree builder rewire sound? (post Option A)
Option A (drift-tolerant `CellView.Add` dedup, `CellViewDedupTests` green) was implemented and the
client relaunched. Result: the hang **moved out of `CellView.Add`** (A worked for its target) but
**relocated to `ScreenPolygonClip.ClipByEdge`** via `ApplyReciprocalClip` (second captured stack,
`hang-stack2.txt`). `ScreenPolygonClip.Intersect`/`ClipByEdge` are **both bounded `for` loops**
they cannot spin on one call — so the spin is the **outer `Build` BFS** still not terminating and
calling them a runaway number of times. **Option A is necessary but not sufficient.**
### Git evidence (what the dirty rewire changed re: termination)
- **HEAD (committed)** near-side portal clip = `PortalProjection.ProjectToNdc` (float-stable;
`git show HEAD:` line 146). **The dirty rewire switched it to `ProjectToClip`** (`ClipPortalAgainstView`,
dirty line 474) — the homogeneous near-eye clip, introduced to fix the near/grazing-doorway flap/void.
- The `MaxReprocessPerCell` **hard cap was removed earlier** (committed Phase U.2a `d880775`), replaced
by "fixpoint termination." **Neither HEAD nor the dirty tree has a hard iteration bound.**
- The dirty rewire's own comment (`PortalVisibilityBuilder.cs:519-522`) documents that
`ProjectToClip` "produced per-round float drift that defeated the CellView SamePolygon dedup" — and
applied that lesson **only to the reciprocal clip** (kept on `ProjectToNdc`), leaving the **near-side**
clip on the drift-prone `ProjectToClip`.
### Soundness verdict
The builder's termination model is **unsound by construction.** It relies on the clipped regions
reaching a geometric fixpoint — re-clipping a cell's view reproduces *exactly-equal* polygons that the
dedup recognises — with **no hard iteration bound.** That only holds if the clip is float-stable.
`ProjectToClip` (needed for faithful near-doorway projection) injects per-round drift, so re-clipping
never reproduces an exactly-equal polygon, the dedup never catches it, and the re-enqueue-on-grow flood
never converges → infinite loop. **You cannot have BOTH faithful near-doorway projection (`ProjectToClip`)
AND convergence-via-exact-dedup-without-a-bound.** HEAD got away with it because `ProjectToNdc` was
stable enough to converge (and it sealed — user-verified); the dirty switch tipped it into non-termination.
The rewire fixed the *projection* and, apparently never having been launched, shipped a hang.
A's drift-tolerant dedup *narrows* the gap but cannot *close* it: for some geometry the per-round drift
exceeds any fixed snap grid, so growth still produces new keys forever. Only a **hard bound** guarantees
termination.
### Paths (for the user to choose)
| | Path | Termination | Projection fidelity | Risk |
|---|---|---|---|---|
| **1** *(rec.)* | Keep `ProjectToClip` + add **enqueue-once** bound (D) — the builder's own comment already calls enqueue-once "the hard termination guarantee"; the re-enqueue-on-grow is the bug. Keep A. | Guaranteed (≤N pops) | Full (faithful doorway clip kept) | Minor under-inclusion of late growth → visual-verify; widen to a cap if needed |
| **2** | Keep `ProjectToClip` + add a **re-process cap** (B, restore `MaxReprocessPerCell`). Keep A. | Guaranteed (≤N×K) | Full | Less faithful than enqueue-once; a tuning constant |
| **3** | **Revert** the near-side `ProjectToClip → ProjectToNdc` (back to HEAD). | Restored (HEAD converged) | **Loses** the rewire's near-doorway fix → reintroduces the flap/void (separate bug) | Throws away this session's projection work; contradicts the keep-the-dirty-tree directive |
A bound (paths 1/2) is the sound fix: it makes termination independent of clip drift, so the faithful
`ProjectToClip` projection AND guaranteed termination coexist. **Recommendation: path 1** (enqueue-once +
keep A), visual-verify for under-inclusion. Reverting (path 3) only trades the hang back for the
flap/void.