acdream/docs/research/2026-05-20-indoor-walking-bug-a-handoff.md
Erik 35c266a800 docs(handoff): indoor walking Bug A wrong-scope handoff
Bug B (indoor BSP world-origin fix) shipped today at de8ffde.
Bug A (delete per-frame walkable-plane synthesis) attempted and
reverted at 0a7ce8f. Real bug is deeper than scoped:

Indoor cell floor polys don't cover the player's full XY range when
crossing thresholds (doorways). Step-down probes miss past the floor
edge, Mechanism C (post-OK step-down) can't catch the player,
ContactPlane invalidates, gravity pulls them through the void.

We have all three retail CP retention mechanisms (A, B, C). The
defect is geometry, not retention. Either dat-decoder missing some
floor polys, or cell-transition timing too late, or some retail
mechanism we haven't traced.

Handoff includes:
- State of every commit on this branch + KEEP/REMOVE recommendation
- Bug B evidence and recommendation to ship to main
- Bug A failure analysis with probe data
- Mechanisms A/B/C location in our code vs retail decomp anchors
- 5 prioritized investigation targets for fresh session
- Anti-patterns to avoid (don't repeat Bug A approach)
- Lessons learned (probe-first discipline, risk-as-falsification,
  3-fails-in-a-session stop signal, Matrix4x4.Decompose idiom,
  binary-timestamp paranoia)

Recommendation: merge Bug B alone, leave the rest for fresh session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:38:13 +02:00


Indoor walking — Bug A wrong-scope handoff (2026-05-20)

Status: Bug B shipped (de8ffde). Bug A attempted + reverted (9f874f4, reverted by 0a7ce8f). The real bug is deeper than scoped and needs a fresh session with full context. ISSUES #83 remains OPEN.

This doc captures everything learned today so the next session picks up clean.


TL;DR

I went into today expecting to land "ContactPlane retention" as a 2-slice phase:

  • Slice 1 (Bug B): indoor BSP world-origin fix. SHIPPED at de8ffde. Closed a real corruption (320 corrupt CP writes/session with D≈0 instead of world floor Z).
  • Slice 2 (Bug A): delete the per-frame TryFindIndoorWalkablePlane synthesis on the indoor OK path. REVERTED. Caused worse regression (player fell through ground when crossing thresholds).

The probe + decomp study revealed Slice 2's premise was wrong:

  • Retail's BSPTREE::find_collisions Path 5B (grounded mover) does NOT call find_walkable either. It only checks for walls. So Bug A's "delete the synthesis and trust the BSP" had nothing to fall back on for the no-step-down case.
  • Retail keeps grounded movement coherent via THREE interacting mechanisms — A (Path 6 land), B (LKCP proximity restore), C (post-OK step-down probe). We have all three in our code already.
  • The actual failure mode is when the player crosses a threshold (doorway) and the step-down probe finds no floor poly at the new XY. Step-down returns OK without writing CP, Mechanism B's proximity check fails because the player moved laterally past the cached plane, oi.Contact clears, player goes airborne, gravity wins.

This is a cell geometry / cell-transition problem, not a CP retention problem. Outside Bug A's scope.


What's on main / what's on this branch

Branch: claude/sad-aryabhata-2d2479 (worktree, not merged).

Commits ahead of main (in order):

| SHA | Subject | Status |
|---|---|---|
| 66de00d | feat(physics): [cp-write] probe for ContactPlane retention spike | KEEP: invaluable for next session |
| 865634f | docs(spec): indoor BSP world-origin / world-rotation fix (Bug B) | KEEP: describes the shipped fix |
| 56816fc | docs(plan): indoor BSP world-origin fix implementation plan | KEEP |
| 39d4e65 | test(physics): BSPQuery.FindCollisions writes world-space plane... | KEEP: regression test for Bug B |
| de8ffde | fix(physics): pass cell world-transform to indoor BSP collision | KEEP: the Bug B fix |
| 3bec18f | docs(spec): remove per-frame indoor walkable-plane synthesis (Bug A) | KEEP but mark wrong-approach |
| 686f27f | docs(plan): remove per-frame indoor walkable-plane synthesis (Bug A) | KEEP but mark wrong-approach |
| 9f874f4 | fix(physics): remove per-frame indoor walkable-plane synthesis | REVERTED by next commit |
| 0a7ce8f | Revert "fix(physics): remove per-frame indoor walkable-plane synthesis" | The revert. Brings back pre-Bug-A behavior. |

The branch is in a self-consistent post-Bug-B state: world-origin fix shipped, synthesis re-instated as it was before the session.

Decision for next session: merging Bug B to main is safe (closes a real corruption with strong probe evidence). The Bug A spec/plan + revert can stay on this branch as a tried-and-reverted record, or get cleaned up before merging.


What Bug B actually fixed (slice 1, shipped)

The defect

Indoor cell BSP queries at TransitionTypes.cs:1442 invoked BSPQuery.FindCollisions with Quaternion.Identity + defaulted Vector3.Zero for worldOrigin. Inside the BSP, Path 3 (step_sphere_down) and Path 4 (land-on-surface) use those args via TransformVertices + BuildWorldPlane to produce a world-space ContactPlane. With both args defaulted, the produced plane was in cell-LOCAL space — D ≈ 0 instead of D = -world_floor_Z (e.g., -94.02 for Holtburg cottages).
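
To make the D ≈ 0 symptom concrete, here is a minimal sketch of how a cell-local floor plane maps into world space. It is not the actual BuildWorldPlane code: the `LocalToWorldPlane` helper, the class name, and the X/Y of the sample cell origin are assumptions for illustration. Only the 94.02 floor height comes from the probe data.

```csharp
using System;
using System.Numerics;

public static class WorldPlaneSketch
{
    // Hypothetical helper: maps a plane n·p + D = 0 from cell-local to world
    // space, assuming world = Rotate(local, cellRotation) + cellOrigin.
    public static Plane LocalToWorldPlane(Plane local, Quaternion cellRotation, Vector3 cellOrigin)
    {
        Vector3 worldNormal = Vector3.Transform(local.Normal, cellRotation);
        // Substituting local = Rotate^-1(world - origin) into the local plane
        // equation gives D_world = D_local - dot(worldNormal, origin).
        float worldD = local.D - Vector3.Dot(worldNormal, cellOrigin);
        return new Plane(worldNormal, worldD);
    }

    public static void Main()
    {
        var localFloor = new Plane(Vector3.UnitZ, 0f);      // floor at local Z = 0
        var cellOrigin = new Vector3(100f, 150f, 94.02f);   // X/Y are made up

        // Defaulted args (the defect): the plane stays in cell-local space.
        Plane defaulted = LocalToWorldPlane(localFloor, Quaternion.Identity, Vector3.Zero);
        // Real cell transform (the fix): D becomes -world_floor_Z.
        Plane corrected = LocalToWorldPlane(localFloor, Quaternion.Identity, cellOrigin);

        Console.WriteLine($"defaulted D={defaulted.D} corrected D={corrected.D}");
    }
}
```

With the defaulted arguments the plane keeps D = 0; with the cell origin supplied it comes out at roughly -94.02, which is the shape of the pre-fix and post-fix probe values.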

The fix (de8ffde)

Quaternion cellRotation;
Vector3    cellOrigin;
if (!Matrix4x4.Decompose(cellPhysics.WorldTransform, out _, out cellRotation, out cellOrigin))
{
    Console.WriteLine($"[indoor-bsp] WARN cellPhysics.WorldTransform did not decompose ...");
    cellRotation = Quaternion.Identity;
    cellOrigin   = cellPhysics.WorldTransform.Translation;
}

var cellState = BSPQuery.FindCollisions(
    cellPhysics.BSP.Root, cellPhysics.Resolved, this,
    localSphere, localSphere1, localCurrCenter,
    Vector3.UnitZ, 1.0f,
    cellRotation,
    engine,
    worldOrigin: cellOrigin);

Mirrors the existing correct pattern at TransitionTypes.cs:1808 (object BSP via FindObjCollisions).

Evidence (probe-driven)

Pre-fix (launch-cp-probe.log): 320 [cp-write] caller=BSPQuery.StepSphereDown:1123 writes producing D=-0.000 instead of D=-94.020.

Post-fix (launch-cp-probe-postfix-v2.log): step-down writes show D=-94.020, D=-66.020, D=-158.994, D=-159.129 — all matching the cell's actual world floor Z. The 2 remaining D=0.000 outliers are either polygons legitimately at world Z=0 or marginal edge cases.

Tests

  • Unit test added: BSPQueryTests.FindCollisions_StepDown_TranslatedWorldOrigin_WritesWorldSpacePlane — verifies BSPQuery writes world-space CP when called with a translated worldOrigin.
  • 8-failure physics baseline holds (no new regressions).

Recommendation

Ship Bug B alone. The Bug A spec/plan + revert can stay or get cleaned. The probe (66de00d) is worth keeping in tree until the deeper investigation is complete.


What Bug A tried and why it failed (slice 2, reverted)

The hypothesis

Per the previous handoff (docs/research/2026-05-19-indoor-walkable-plane-bsp-port-shipped-handoff.md) and the subagent's first decomp study, retail's BSPTREE::find_collisions does NOT call find_walkable on the OK path. ContactPlane is retained across OK frames from the prior tick's seed (our equivalent: PhysicsEngine.ResolveWithTransition:583, the init_contact_plane analogue). The synthesis we added in Phase 2 (eb0f772, 2026-05-19) was an unfaithful stop-gap: it runs every frame, MISSES 99.87% of the time due to tangent epsilon rejection in walkable_hits_sphere, and then falls through to outdoor terrain, which yields the wrong CP plane.

Proposed fix: delete the synthesis call + outdoor fallthrough from the indoor OK path. Just return TransitionState.OK; after the indoor BSP returns OK. Let CP retain via the seed and let BSP Path 3/4 refresh it during step-down or landing.

The fix (9f874f4)

Replaced the chain TryFindIndoorWalkablePlane(...) → ValidateWalkable(...) → fallthrough to outdoor terrain with a bare return TransitionState.OK;. Deleted the method, its constant, and 9 tests (-491 lines).

The regression

User report: "I could not get out of the building, I had to jump out of the door, then I fell through the ground."

Probe data (launch-buga-v2.log):

[indoor-bsp] cell=0xA9B40125 wpos=(96.880,159.403,61.536) result=OK
[indoor-bsp] cell=0xA9B40125 wpos=(96.800,159.603,61.336) result=OK
[indoor-bsp] cell=0xA9B40125 wpos=(96.720,159.803,61.130) result=OK
...continues until wpos=(67,233,-262) ~350m below cell floor

The player's Z decreased ~0.2m per tick (gravity step), inside an indoor cell, with the BSP returning OK every frame (no walls below them). No step-down probe lines firing during the fall — oi.Contact had cleared.
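
The per-tick fall rate can be read straight off the [indoor-bsp] lines. The sketch below is a hypothetical log helper, not part of the probe itself; it assumes only the `wpos=(x,y,z)` field format visible in the excerpt above, and `FallRateSketch` and `ZDeltas` are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class FallRateSketch
{
    // Matches the wpos=(x,y,z) field as it appears in the [indoor-bsp] lines.
    private static readonly Regex WposPattern =
        new Regex(@"wpos=\(([-\d.]+),([-\d.]+),([-\d.]+)\)");

    // Returns the Z delta between each consecutive pair of lines carrying wpos.
    public static List<float> ZDeltas(IEnumerable<string> lines)
    {
        var deltas = new List<float>();
        float? previousZ = null;
        foreach (string line in lines)
        {
            Match m = WposPattern.Match(line);
            if (!m.Success) continue; // skip lines without a wpos field
            float z = float.Parse(m.Groups[3].Value, CultureInfo.InvariantCulture);
            if (previousZ.HasValue) deltas.Add(z - previousZ.Value);
            previousZ = z;
        }
        return deltas;
    }

    public static void Main()
    {
        string[] excerpt =
        {
            "[indoor-bsp] cell=0xA9B40125 wpos=(96.880,159.403,61.536) result=OK",
            "[indoor-bsp] cell=0xA9B40125 wpos=(96.800,159.603,61.336) result=OK",
            "[indoor-bsp] cell=0xA9B40125 wpos=(96.720,159.803,61.130) result=OK",
        };
        foreach (float dz in ZDeltas(excerpt))
            Console.WriteLine(dz.ToString("F3", CultureInfo.InvariantCulture));
    }
}
```

On the three excerpt lines this yields deltas of about -0.200 and -0.206, consistent with the ~0.2m-per-tick descent described above.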

Why Mechanisms A/B/C didn't catch this

The full decomp study (in this session's subagent transcript, file C:\Users\erikn\AppData\Local\Temp\claude\C--Users-erikn-source-repos-acdream--claude-worktrees-sad-aryabhata-2d2479\cd9bbcf4-a861-4797-99e3-8c1c623ff66e\tasks\a88c5ab14446853ea.output) mapped retail's three CP retention mechanisms:

| Mechanism | Retail location | acdream location | Status |
|---|---|---|---|
| A: Path 6 collide-path land (set_contact_plane) | acclient_2013_pseudo_c.txt:323924 | BSPQuery.cs:1615 (Path 4) | Present, works |
| B: validate_transition LKCP proximity restore | :272565-272578 | TransitionTypes.cs:2618-2662 | Present, has proximity-guard |
| C: transitional_insert post-OK step-down probe | :273242-273307 | TransitionTypes.cs:896-933 | Present, gated on oi.Contact && !ci.ContactPlaneValid && oi.StepDown |

All three exist in our code. The failure was that they're all gated on conditions that fail in the doorway-crossing case:

  1. Step-down probe (Mech C) fires correctly: log shows ~209 successful Adjusted results from BSPQuery.StepSphereDown.
  2. Player walks toward the cottage doorway. Sub-step moves lpos.Y from -5.994 to -6.398 (past the cottage floor edge).
  3. At new position, step-down probe BSP returns OK + poly=n/a (no floor poly at this XY) — same for Z probes at -0.75, -1.5, -2.25. The cottage's indoor cell has no floor poly extending past the doorway threshold.
  4. Step-down returns OK without writing CP. ci.ContactPlaneValid stays false.
  5. Mechanism B (LKCP proximity) checks distance from sphere to cached plane: sphere moved ~0.4m laterally, the prior plane is at the prior XY, but the proximity check is radius + EPSILON > |angle| where angle = N·sphere + D. For a horizontal floor, angle = sphere.Z - cached_floor_Z. If sphere.Z hasn't moved much vertically, this should pass...
    • Actually: I didn't fully trace this. Mech B might fire correctly. Need next-session probe to confirm.
  6. Either way: by the time the player has traveled a few sub-steps with no floor underneath, oi.Contact clears via the ValidateTransition else-branch (line 2664-2666). Mech C stops firing (it requires oi.Contact).
  7. Player free-falls. Path 5 stops firing (no Contact). Path 6 fires for airborne movement. No CP gets re-established.
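
The open question in step 5 can be checked on paper. Below is a minimal sketch of the proximity guard exactly as worded there (radius + EPSILON > |N·sphere + D|). It is not the code at TransitionTypes.cs:2618-2662; the `Epsilon` value, the sphere radius, and the sample positions are assumptions.

```csharp
using System;
using System.Numerics;

public static class ProximitySketch
{
    // Assumed tolerance; the real EPSILON value is not quoted in this doc.
    private const float Epsilon = 0.001f;

    // Signed distance from the sphere centre to the plane ("angle" in step 5).
    public static float SignedDistance(Plane plane, Vector3 sphereCenter) =>
        Vector3.Dot(plane.Normal, sphereCenter) + plane.D;

    // The guard as described in step 5: radius + EPSILON > |N·sphere + D|.
    public static bool WithinProximity(Plane cachedPlane, Vector3 sphereCenter, float radius) =>
        radius + Epsilon > Math.Abs(SignedDistance(cachedPlane, sphereCenter));

    public static void Main()
    {
        var cachedFloor = new Plane(Vector3.UnitZ, -94.02f); // horizontal floor at world Z = 94.02
        const float radius = 0.48f;                          // assumed sphere radius

        var onFloor  = new Vector3(10f, 0f,      94.02f + radius);
        var pastEdge = new Vector3(10f, -0.404f, 94.02f + radius); // ~0.4 m lateral move, same Z
        var fallen   = new Vector3(10f, -0.404f, 94.02f - 1.0f);   // one metre below the floor

        Console.WriteLine(WithinProximity(cachedFloor, onFloor, radius));
        Console.WriteLine(WithinProximity(cachedFloor, pastEdge, radius));
        Console.WriteLine(WithinProximity(cachedFloor, fallen, radius));
    }
}
```

For a horizontal plane the lateral move leaves the signed distance unchanged, so under these assumptions the guard still passes right after the threshold sub-step and only fails once the sphere has dropped. That supports the suspicion that Mechanism B is not the first thing to break, but it needs the next-session probe to confirm against the real code.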

Why the previous "stuck-falling on 2nd-floor edge" symptom is the same bug

The user's PRIOR symptom (pre-Bug-A): "Walking up the stairs, if I sort of just touch the floor on top of me I get stuck in falling animation."

That's the same root cause manifesting in a different geometry: step-down probe doesn't find a floor poly at the 2nd-floor edge → Mechanism C can't catch → some path through the synthesis → wrong CP → ValidateWalkable marks airborne → falling animation never recovers.

The Phase 2 synthesis (TryFindIndoorWalkablePlane) was a duct-tape over this — it tried to find a "best-guess" floor poly via XY scan. When the scan returned the wrong poly (rare HIT case) or missed (99% case) the player got stuck. But it didn't make them free-fall through the void because the fallthrough to outdoor terrain at least gave them SOMETHING (just slightly below the cottage floor).

Bug A removed the duct-tape. With nothing replacing it, the player free-falls.

Key insight: the duct-tape was hiding a deeper bug

The Phase 2 synthesis (eb0f772) was patching over a real defect: indoor cell floor polygons don't extend to cover the player's full possible XY range when crossing thresholds. Either:

  • (a) Retail's indoor cells have floor polys that extend further than ours do (dat-decoder bug?).
  • (b) Retail's cell-transition timing moves the player into the outdoor cell BEFORE they step past the indoor floor poly edge, so the indoor BSP query at the threshold always has a floor under the sphere.
  • (c) Retail has a mechanism we haven't found yet that handles "no floor poly at this XY" gracefully (e.g., extending the search to neighbor cells via portals).
  • (d) Retail's player-collision-sphere is sized differently so the player physically can't reach the edge of the cottage floor.

Without further investigation, I can't say which. The next session needs to figure this out.


State of the [cp-write] probe

Committed at 66de00d. Converts 8 CollisionInfo fields (CP + LKCP groups, 4 sub-fields each) from public fields to public properties with logging setters. Logging is gated on PhysicsDiagnostics.ProbeContactPlaneEnabled (env var ACDREAM_PROBE_CONTACT_PLANE=1, also runtime-toggleable). When the flag is off, the getters are trivial enough for the JIT to inline to direct field access and the setters add only a flag check, so the cost should be negligible.

Keep this in tree: the next session will need it to validate any new hypothesis before designing the fix. The Bug B + Bug A specs both say "remove the probe when the retention fix lands"; that has not happened yet, so defer the removal.

The probe surfaces:

  • Each write site's source line (PhysicsEngine.cs:583, BSPQuery.cs:1123, BSPQuery.cs:1615, TransitionTypes.cs:663 etc).
  • Old value → new value, only logged when actually changed (value-equality suppression in the setter).
  • Plane Normal + D, CellId, IsWater, Valid flags.
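
For readers who have not opened 66de00d, the general shape of a flag-gated logging setter with value-equality suppression looks roughly like the sketch below. It is not the committed code: `ProbeFlags`, `CollisionInfoSketch`, and the in-memory `Log` list are stand-ins, and caller/source-line capture is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public static class ProbeFlags
{
    // Stand-in for PhysicsDiagnostics.ProbeContactPlaneEnabled (runtime-toggleable).
    public static bool ProbeContactPlaneEnabled;
}

public sealed class CollisionInfoSketch
{
    // Captured log lines; a list keeps the sketch self-contained and testable.
    public readonly List<string> Log = new List<string>();

    private Plane _contactPlane;

    public Plane ContactPlane
    {
        get => _contactPlane;
        set
        {
            // Log only when the probe is on AND the value actually changes.
            if (ProbeFlags.ProbeContactPlaneEnabled && !_contactPlane.Equals(value))
                Log.Add($"[cp-write] D={_contactPlane.D} -> D={value.D}");
            _contactPlane = value;
        }
    }
}

public static class ProbeDemo
{
    public static void Main()
    {
        ProbeFlags.ProbeContactPlaneEnabled = true;
        var ci = new CollisionInfoSketch();
        ci.ContactPlane = new Plane(Vector3.UnitZ, -94.02f); // logged: value changed
        ci.ContactPlane = new Plane(Vector3.UnitZ, -94.02f); // suppressed: same value
        Console.WriteLine(ci.Log.Count);
    }
}
```

The suppression matters for the per-tick seed write: without it, a site that rewrites the same plane every tick would drown the log in no-op entries.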

Caller distribution from the failed Bug A run is in launch-buga-v2.utf8.log:

| Count | Caller | Role |
|---|---|---|
| 57,144 | PhysicsEngine.ResolveWithTransition:583 | Per-tick seed (init_contact_plane) |
| 607 | Transition.FindTransitionalPosition:663 | Sub-step CPV=0 reset |
| 341 | Transition.ValidateWalkable:1488 | Outdoor terrain on-surface |
| 217 | BSPQuery.StepSphereDown:1123 | Path 3 step-down (Mechanism C fires) |
| 19 | Transition.ValidateWalkable:1511 | Outdoor terrain below-surface |
| 0 | Transition.ValidateWalkable (indoor) | Bug A removed the indoor path |
| 0 | [indoor-walkable] lines | Bug A removed the probe |

Investigation targets for next session

If picking this up, the priority order:

  1. Confirm the doorway-edge hypothesis with cdb on retail. The retail debugger toolchain (CLAUDE.md "Retail debugger toolchain" section) lets us attach to a live retail client. Set a breakpoint at BSPLEAF::find_walkable and walk the same cottage threshold. Capture the polygons the floor BSP iterates over. Either:

    • Retail's cell has more floor polys covering the threshold → our dat-decoder is missing some polys.
    • Retail's cell-id changes BEFORE the sphere reaches the edge → our cell-transition timing lags.
    • Retail does something we haven't seen yet.
  2. Cross-reference with WorldBuilder. The CLAUDE.md "Reference hierarchy by domain" table says WB is the production base for EnvCell geometry. Look at WorldBuilder/EnvCellRenderManager.cs and WorldBuilder/PortalRenderManager.cs for how WB handles cell boundaries.

  3. Add a probe that logs each indoor cell's floor poly count + extent. Diagnostic-only. When the player enters an indoor cell, dump the cell's floor polys + their XY bounding boxes. Compare to the player's eventual XY position when step-down misses. Tells us whether the floor poly genuinely doesn't extend that far OR whether something else is wrong.

  4. Look at Phase 2 cell-transition work. The [cell-transit] probe + the portal-graph traversal in CellTransit.FindCellList were shipped 2026-05-19 (commits 1969c55 through eb0f772). Whether they fire in time at the cottage doorway is unclear.

  5. Don't repeat the Bug A approach. "Just delete the synthesis and trust BSP" doesn't work because the BSP genuinely has no floor poly at the threshold. Some replacement is needed — the question is what.
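
Target 3 (the floor-extent probe) could start from something like the sketch below. Everything in it is hypothetical: the vertex-list representation of a floor poly, the `Bounds2D` type, and the sample quad whose edge sits at local Y = -6.0. Only the two lpos.Y values (-5.994 and -6.398) come from the failed run.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public static class FloorExtentSketch
{
    // Axis-aligned XY extent of one floor polygon, in cell-local space.
    public readonly struct Bounds2D
    {
        public readonly float MinX, MinY, MaxX, MaxY;

        public Bounds2D(float minX, float minY, float maxX, float maxY)
        {
            MinX = minX; MinY = minY; MaxX = maxX; MaxY = maxY;
        }

        public bool Contains(float x, float y) =>
            x >= MinX && x <= MaxX && y >= MinY && y <= MaxY;
    }

    // Computes the XY bounding box of a polygon's vertices (expects at least one).
    public static Bounds2D XYBounds(IReadOnlyList<Vector3> vertices)
    {
        float minX = float.MaxValue, minY = float.MaxValue;
        float maxX = float.MinValue, maxY = float.MinValue;
        foreach (Vector3 v in vertices)
        {
            minX = Math.Min(minX, v.X); maxX = Math.Max(maxX, v.X);
            minY = Math.Min(minY, v.Y); maxY = Math.Max(maxY, v.Y);
        }
        return new Bounds2D(minX, minY, maxX, maxY);
    }

    public static void Main()
    {
        // Made-up floor quad; the real extents are what the probe should reveal.
        var floor = new[]
        {
            new Vector3(-4f, -6f, 0f), new Vector3(4f, -6f, 0f),
            new Vector3(4f, 6f, 0f),   new Vector3(-4f, 6f, 0f),
        };
        Bounds2D b = XYBounds(floor);
        Console.WriteLine($"floor extent X[{b.MinX},{b.MaxX}] Y[{b.MinY},{b.MaxY}]");
        Console.WriteLine(b.Contains(0f, -5.994f)); // sphere centre before the sub-step
        Console.WriteLine(b.Contains(0f, -6.398f)); // sphere centre after the sub-step
    }
}
```

A bounding box is only a coarse filter (a point inside the box can still be outside a non-rectangular poly), but it is enough to tell "the floor genuinely ends before this XY" apart from "the floor covers this XY and something else rejected it".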

Anti-patterns the next session should avoid

  • Don't trust the previous handoff's recommendation blindly. The 2026-05-19 handoff said "remove TryFindIndoorWalkablePlane" — that recommendation was based on incomplete decomp analysis. The proper fix requires understanding cell geometry, not just CP retention.
  • Don't design a fix before the probe data points at the right code path. I designed Bug A's spec on a "Mechanism C will catch us" assumption that the data didn't validate.
  • Don't fix two related bugs in one session. Bug B + Bug A were both indoor-CP issues but they had different root causes. Slicing them was the right call; what went wrong was Bug A's design.

Things to definitely KEEP from today's work

  • Bug B fix (de8ffde) — closes a real corruption.
  • The [cp-write] probe (66de00d).
  • The [indoor-bsp] probe (pre-existing, from Phase 1).
  • BSPQuery regression test (39d4e65).
  • Spec + plan docs for Bug B (good engineering artifacts).
  • This handoff doc.

Things to consider removing on next session

  • Bug A spec/plan docs (3bec18f / 686f27f) — they document a wrong approach. Optional to delete; they're useful as a "tried this, didn't work, here's why" record.

How to start a fresh session

Copy this into a new Claude Code session in the acdream worktree:

Pick up the acdream indoor walking issue (ISSUES #83). Read
docs/research/2026-05-20-indoor-walking-bug-a-handoff.md FIRST. The
prior session today shipped Bug B (BSP world-origin fix, de8ffde) but
attempted-and-reverted Bug A. The real bug is deeper than scoped — see
the handoff for the full diagnosis and investigation targets.

1. Don't try Bug A again ("just delete TryFindIndoorWalkablePlane and
   trust retention"). That was today's wrong approach; data showed
   Mechanism C can't catch when there's no floor poly past the
   threshold.
2. The [cp-write] probe (66de00d) + [indoor-bsp] probe should stay in tree
   until the proper fix lands.
3. Investigation targets are in the handoff's "Investigation targets
   for next session" section. The most useful first move is probably
   attaching cdb to retail at the same cottage threshold and watching
   what BSPLEAF::find_walkable iterates over.
4. CLAUDE.md rules apply. No workarounds, no band-aids. Visual
   verification is the acceptance test.
5. M2 critical path candidates remain (F.2 / F.3 / F.5a / L.1c /
   L.1b). If this investigation looks like it'll burn a phase or two
   to nail down, consider whether the user wants you to pivot to M2
   work and address indoor walking in M7 polish.

State the milestone + chosen phase in the first action you take.

Or just say "Read docs/research/2026-05-20-indoor-walking-bug-a-handoff.md and start a fresh session."


Lessons from today (for future Claude)

  1. The user's pickup-prompt language was right: probe-first, design-second. I did the probe spike for Bug B — that worked great. For Bug A I didn't do an equivalent spike for "will Mechanism C catch the no-floor case?" before deleting the synthesis. The R1 risk I called out in the spec was the actual failure mode.

  2. A spec's "Out of scope" + "Risks" sections can lie. I wrote them after the design was decided, and they reflected the design's blind spots, not actual blind spots. Next time: list risks BEFORE writing the design, treat them as falsification tests, validate them with the probe before shipping.

  3. "Three failed visual verifications in a session" is the stop signal. I got to two and pushed on to a third, which produced the user's "Got stuck falling in the staircase" and "I had to jump out of the door, then I fell through the ground" reports and triggered the revert decision. That third failure should have been the trigger to stop and write the handoff; instead I dispatched another subagent and dug deeper. That additional dig was useful (it surfaced the doorway-edge insight) but it would have happened in the fresh session too, with a fresher context budget.

  4. Matrix4x4.Decompose works fine for cell transforms. Bug B's mechanical fix landed cleanly. The pattern (decompose once at the call site, pass rotation + origin to a function that previously took defaults) is a clean idiom for places where we have a Matrix and the API wants a Quaternion + Vector3.

  5. Test build + binary timestamp paranoia is real. During Bug B's first visual verification, my test passed but I'd accidentally rebuilt the AcDream.Core DLL from un-stashed code, so the launched client didn't have the fix. The mismatch was only caught by checking the binary mtime against the source mtime. After every code change to be tested in the client, verify the build is fresh.
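
The mtime check from lesson 5 is easy to script. A minimal sketch, assuming the stale-binary case shows up as a DLL older than the newest *.cs file under the project directory; `BuildFreshnessSketch` and `IsBinaryFresh` are illustrative names, and the paths are whatever the caller passes in.

```csharp
using System;
using System.IO;
using System.Linq;

public static class BuildFreshnessSketch
{
    // True when the binary was written at or after the newest *.cs file under sourceDir.
    public static bool IsBinaryFresh(string binaryPath, string sourceDir)
    {
        DateTime binaryTime = File.GetLastWriteTimeUtc(binaryPath);
        DateTime newestSource = Directory
            .EnumerateFiles(sourceDir, "*.cs", SearchOption.AllDirectories)
            .Select(path => File.GetLastWriteTimeUtc(path))
            .DefaultIfEmpty(DateTime.MinValue) // no sources found: nothing can be newer
            .Max();
        return binaryTime >= newestSource;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: <binaryPath> <sourceDir>");
            return;
        }
        Console.WriteLine(IsBinaryFresh(args[0], args[1])
            ? "binary is fresh"
            : "STALE: rebuild before launching the client");
    }
}
```

This only catches a binary that is older than the sources. It would not catch a fresh build made from the wrong working-tree state, so checking `git status` / `git stash list` before building is still worthwhile.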


Recommendation: merge Bug B to main. Keep the rest of this branch around as a learning artifact. Start fresh on the deeper investigation in a new session with this handoff as the starting brief.