diff --git a/docs/superpowers/specs/2026-05-19-phase2-indoor-cell-rendering-fix-design.md b/docs/superpowers/specs/2026-05-19-phase2-indoor-cell-rendering-fix-design.md new file mode 100644 index 0000000..bbc7c57 --- /dev/null +++ b/docs/superpowers/specs/2026-05-19-phase2-indoor-cell-rendering-fix-design.md @@ -0,0 +1,189 @@ +# Indoor Cell Rendering Fix — Phase 2 Design + +**Status:** Brainstormed 2026-05-19. Awaiting user review. +**Scope:** Surface the silent failure in WB's `PrepareEnvCellMeshData` for 26/123 Holtburg cells, then implement the targeted fix. +**Predecessor:** Phase 1 (`docs/superpowers/specs/2026-05-19-indoor-cell-rendering-fix-design.md`) shipped the five `[indoor-*]` probes that confirmed hypothesis H1. +**Capture evidence:** `docs/research/2026-05-19-indoor-cell-rendering-probe-capture.md`. + +--- + +## 1. What we know + +Phase 1's `ACDREAM_PROBE_INDOOR_ALL=1` capture at Holtburg `0xA9B4` proved: + +- 123 EnvCells requested via `WbMeshAdapter.IncrementRefCount` → only **97 complete**. +- **26 cells** silently fail. They get `[indoor-upload] requested` but never `[indoor-upload] completed`. +- The dispatcher then tries to draw them, `TryGetRenderData` returns null, draw is silently skipped → user sees **missing floor**. +- The first interior cell `0xA9B40100` (likely the inn entry or another major building anchor) is among the 26. + +The smoking gun is in WB's [`ObjectMeshManager.PrepareMeshData`](../../../references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectMeshManager.cs): + +```csharp +catch (Exception ex) { + _logger.LogError(ex, "Error preparing mesh data for 0x{Id:X16}", id); + return null; +} +``` + +WB logs the exception via its injected `_logger`. But [`WbMeshAdapter.cs:71`](../../../src/AcDream.App/Rendering/Wb/WbMeshAdapter.cs:71) constructs `ObjectMeshManager` with `NullLogger.Instance` — so the log goes to `/dev/null`. The exception type and message are lost. + +## 2. Solution — three components + +### Component 1 — Exception-surfacing wrap + +Capture the `Task` returned by `_meshManager.PrepareMeshDataAsync(id, isSetup: false)` and attach a continuation that, for EnvCell IDs only, logs the failure cause. + +Three logged outcomes: + +- **Task faulted** → `[indoor-upload] FAILED cellId=0x... exception=: stack=[]`. Unwrap `AggregateException.InnerException` for cleaner output. +- **Task succeeded with null result** → `[indoor-upload] NULL_RESULT cellId=0x...`. WB's deliberate null-return path (e.g., `ResolveId` returned empty, type was `Unknown`). +- **Task succeeded with non-null result** → no extra log. The existing `Tick()` drain already emits `[indoor-upload] completed`. + +The continuation: +- Runs on `TaskScheduler.Default` (`ThreadPool`) so it doesn't block the render thread. +- Only attached for EnvCell IDs (gated by `RenderingDiagnostics.IsEnvCellId(id)`) when `ProbeIndoorUploadEnabled` is true — zero cost when off. +- Captures `cellId` (a `ulong` value) only; no instance closure leakage. +- Truncates stack trace to top 3 frames. + +Concrete code shape: + +```csharp +if (_metadataPopulated.Add(id)) +{ + PopulateMetadata(id); + var prepTask = _meshManager.PrepareMeshDataAsync(id, isSetup: false); + + if (RenderingDiagnostics.IsEnvCellId(id) && RenderingDiagnostics.ProbeIndoorUploadEnabled) + { + _pendingEnvCellRequests.Add(id); + Console.WriteLine($"[indoor-upload] requested cellId=0x{id:X8}"); + + ulong cellId = id; + _ = prepTask.ContinueWith(t => + { + if (t.IsFaulted && t.Exception is not null) + { + var ex = t.Exception.InnerException ?? t.Exception; + var stack = (ex.StackTrace ?? "").Split('\n') + .Take(3).Select(s => s.Trim()).Where(s => s.Length > 0); + Console.WriteLine( + $"[indoor-upload] FAILED cellId=0x{cellId:X8} " + + $"exception={ex.GetType().Name}: {ex.Message} " + + $"stack=[{string.Join(" | ", stack)}]"); + } + else if (t.IsCompletedSuccessfully && t.Result is null) + { + Console.WriteLine($"[indoor-upload] NULL_RESULT cellId=0x{cellId:X8}"); + } + }, TaskScheduler.Default); + } +} +``` + +`using System.Linq;` and `using System.Threading.Tasks;` may need adding (likely already present). + +### Component 2 — Capture procedure + +Standard launch: + +```powershell +$env:ACDREAM_PROBE_INDOOR_UPLOAD = "1" +dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath launch.log +``` + +Walk into Holtburg Inn, walk into nearby buildings whose cells were on the missing-26 list (`0xA9B40100`, `0xA9B40111`, etc.). Close gracefully. + +Analyze: + +```powershell +Get-Content launch.log | + Where-Object { $_ -match '\[indoor-upload\] (FAILED|NULL_RESULT)' } | + Select-Object -Unique +``` + +Expected output: a per-cell list of distinct exception types or null-return signals. Most cells likely share 1–3 root causes. + +### Component 3 — Targeted fix (shape unknown until Component 2 captures) + +Once Component 2 reveals the exception type + message, the fix is one localized code change. Likely shapes: + +| Captured cause | Fix shape | +|---|---| +| Texture decode `Exception` (e.g. `KeyNotFoundException` on surface ID) | Guard at `WbMeshAdapter.PopulateMetadata` or pre-validate surfaces; possibly patch WB fork. | +| `KeyNotFoundException` for missing `Environment` / `CellStruct` | Log + skip cell with a sentinel render-data; report which dat is stale. | +| `NullReferenceException` in `PrepareCellStructMeshData` | Add null guard at the specific call site. | +| WB internal logic bug | Fork patch to WB. | +| `NULL_RESULT` (ResolveId returned empty / type was Unknown) | Investigate dat file integrity; possibly user needs a dat update. | + +The fix is one or two code edits, lands as a single commit, and is followed by a re-launch verifying: +- `[indoor-upload] FAILED` / `NULL_RESULT` lines disappear for the previously-failing cells. +- `[indoor-upload] completed` appears for those cells. +- Visual verification: floor renders in Holtburg Inn. + +--- + +## 3. Edge cases + +| Scenario | Behavior | +|---|---| +| Probe toggled off mid-session | Continuation still emits if attached at request time. Acceptable — capturing the cause once matters more than honoring runtime toggle. | +| Continuation fires after adapter disposed | Harmless console write on dying process. No memory leak; closure captures only the `ulong` cellId. | +| Same cell requested twice | `_metadataPopulated.Add(id)` guards; continuation attaches exactly once. Re-streaming after Remove+Add keeps the sticky set. First failure is what we want. | +| Cancellation | `t.IsCanceled` is neither `IsFaulted` nor `IsCompletedSuccessfully`. Continuation silently skips. Acceptable — cancellation isn't a failure cause. | +| `Task.Result` on faulted task | Re-throws AggregateException. Our gate `else if (t.IsCompletedSuccessfully && t.Result is null)` ensures we never read Result without a clean success state. | +| WB's `_logger.LogError` for the same exception | WbMeshAdapter passes `NullLogger` — WB's log goes nowhere. Our continuation is what surfaces it. Discussed below. | + +**Why not just inject a real logger into `ObjectMeshManager`?** Could replace `NullLogger.Instance` with a real logger that writes to `Console.WriteLine`. Tradeoff: + +- Real logger: simpler, leverages WB's existing `_logger.LogError` call → catches GfxObj + Setup + EnvCell failures. +- Our continuation: scoped to EnvCell IDs only → less noise. + +Going with the continuation approach because: +1. The probe flag is already in place. +2. Phase 2 is targeted at EnvCells. +3. Real-logger would emit thousands of GfxObj/Setup log lines during landblock streaming, drowning the EnvCell signal. + +We can revisit if a future debugging session calls for broader visibility. + +--- + +## 4. Testing strategy + +### Unit tests + +None for Component 1 — the continuation is straight wiring around an async API; the logic is "if faulted, log; if null result, log." Testing requires either mocking `Task` (low value) or running a real WB instance (impractical in unit tests). + +### Visual verification (end-to-end) + +Component 2's capture procedure is the verification mechanism: + +1. Build green. +2. Launch with probe flag on, walk into Holtburg. +3. Confirm `[indoor-upload] FAILED` or `NULL_RESULT` lines appear for ~26 cells. +4. Apply Component 3's fix. +5. Re-launch, re-walk Holtburg. +6. **Acceptance:** previously-failing cells now produce `[indoor-upload] completed` lines AND the user can see the floor in Holtburg Inn. + +--- + +## 5. What's NOT in this phase + +- Tightening `IsEnvCellId` false-positives (flagged in Phase 1 capture note). Deferred — doesn't block Phase 2 since the upload probe gates on the correct path. +- Cell collision symptoms (no wall collision when exiting, weird open-air collisions). Separate investigation phase. +- Stab-leak-through-walls (Phase 1 Task 3). Deferred. +- Broader WB logger injection for GfxObj/Setup failures. Open if we ever want broader diagnostic visibility. + +--- + +## 6. Acceptance criteria + +- [ ] `WbMeshAdapter.IncrementRefCount` captures the prep task and attaches a continuation for EnvCell IDs. +- [ ] Continuation logs `[indoor-upload] FAILED cellId=0x... exception=: stack=[...]` for faulted tasks. +- [ ] Continuation logs `[indoor-upload] NULL_RESULT cellId=0x...` for clean-null returns. +- [ ] `dotnet build` clean. `dotnet test` clean (no new failures; pre-existing 8 physics/input failures unchanged). +- [ ] Capture launched, FAILED/NULL_RESULT lines appear for the previously-missing cells, distinct causes identified. +- [ ] Component 3 fix designed and implemented for each distinct cause. +- [ ] Re-capture confirms `[indoor-upload] completed` appears for cells previously missing. +- [ ] Visual verification: floor renders in Holtburg Inn. +- [ ] Roadmap updated with Phase 2 shipped. +- [ ] Commit messages cite the captured exception types + the fix rationale.