acdream/docs/superpowers/specs/2026-05-19-phase2-indoor-cell-rendering-fix-design.md
Erik 251763b2c4 docs(spec): Phase 2 indoor cell rendering fix design
Three components:
1. WbMeshAdapter wraps the PrepareMeshDataAsync task with a continuation
   that surfaces faulted-task exceptions + null-result cases for EnvCell
   IDs only (gated by ProbeIndoorUploadEnabled). Two new log shapes:
   [indoor-upload] FAILED  cellId=0x... exception=<TypeName>: <Message>
                            stack=[<top 3 frames>]
   [indoor-upload] NULL_RESULT cellId=0x...
2. Capture procedure: re-launch at Holtburg with the probe on, grep for
   FAILED/NULL_RESULT lines, get definitive per-cell cause for the 26
   missing-completion cells from Phase 1's capture.
3. Targeted fix: code change matching whichever exception type / null
   pattern dominates. Fix shape is data-driven — see the contingency
   table in the spec.

WB's catch at ObjectMeshManager.cs:589 already calls _logger.LogError,
but WbMeshAdapter constructs the manager with NullLogger.Instance, so
the log is dropped. Our continuation surfaces the same data scoped to
EnvCells only (avoids the thousands of GfxObj/Setup log lines a real
logger would emit during landblock streaming).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 12:18:04 +02:00

189 lines
9.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Indoor Cell Rendering Fix — Phase 2 Design
**Status:** Brainstormed 2026-05-19. Awaiting user review.
**Scope:** Surface the silent failure in WB's `PrepareEnvCellMeshData` for 26/123 Holtburg cells, then implement the targeted fix.
**Predecessor:** Phase 1 (`docs/superpowers/specs/2026-05-19-indoor-cell-rendering-fix-design.md`) shipped the five `[indoor-*]` probes that confirmed hypothesis H1.
**Capture evidence:** `docs/research/2026-05-19-indoor-cell-rendering-probe-capture.md`.
---
## 1. What we know
Phase 1's `ACDREAM_PROBE_INDOOR_ALL=1` capture at Holtburg `0xA9B4` proved:
- 123 EnvCells requested via `WbMeshAdapter.IncrementRefCount` → only **97 complete**.
- **26 cells** silently fail. They get `[indoor-upload] requested` but never `[indoor-upload] completed`.
- The dispatcher then tries to draw them, `TryGetRenderData` returns null, draw is silently skipped → user sees **missing floor**.
- The first interior cell `0xA9B40100` (likely the inn entry or another major building anchor) is among the 26.
The smoking gun is in WB's [`ObjectMeshManager.PrepareMeshData`](../../../references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectMeshManager.cs):
```csharp
catch (Exception ex) {
_logger.LogError(ex, "Error preparing mesh data for 0x{Id:X16}", id);
return null;
}
```
WB logs the exception via its injected `_logger`. But [`WbMeshAdapter.cs:71`](../../../src/AcDream.App/Rendering/Wb/WbMeshAdapter.cs:71) constructs `ObjectMeshManager` with `NullLogger<ObjectMeshManager>.Instance` — so the log goes to `/dev/null`. The exception type and message are lost.
## 2. Solution — three components
### Component 1 — Exception-surfacing wrap
Capture the `Task<ObjectMeshData?>` returned by `_meshManager.PrepareMeshDataAsync(id, isSetup: false)` and attach a continuation that, for EnvCell IDs only, logs the failure cause.
Three logged outcomes:
- **Task faulted** → `[indoor-upload] FAILED cellId=0x... exception=<TypeName>: <Message> stack=[<top 3 frames>]`. Unwrap `AggregateException.InnerException` for cleaner output.
- **Task succeeded with null result** → `[indoor-upload] NULL_RESULT cellId=0x...`. WB's deliberate null-return path (e.g., `ResolveId` returned empty, type was `Unknown`).
- **Task succeeded with non-null result** → no extra log. The existing `Tick()` drain already emits `[indoor-upload] completed`.
The continuation:
- Runs on `TaskScheduler.Default` (`ThreadPool`) so it doesn't block the render thread.
- Only attached for EnvCell IDs (gated by `RenderingDiagnostics.IsEnvCellId(id)`) when `ProbeIndoorUploadEnabled` is true — zero cost when off.
- Captures `cellId` (a `ulong` value) only; no instance closure leakage.
- Truncates stack trace to top 3 frames.
Concrete code shape:
```csharp
if (_metadataPopulated.Add(id))
{
PopulateMetadata(id);
var prepTask = _meshManager.PrepareMeshDataAsync(id, isSetup: false);
if (RenderingDiagnostics.IsEnvCellId(id) && RenderingDiagnostics.ProbeIndoorUploadEnabled)
{
_pendingEnvCellRequests.Add(id);
Console.WriteLine($"[indoor-upload] requested cellId=0x{id:X8}");
ulong cellId = id;
_ = prepTask.ContinueWith(t =>
{
if (t.IsFaulted && t.Exception is not null)
{
var ex = t.Exception.InnerException ?? t.Exception;
var stack = (ex.StackTrace ?? "").Split('\n')
.Take(3).Select(s => s.Trim()).Where(s => s.Length > 0);
Console.WriteLine(
$"[indoor-upload] FAILED cellId=0x{cellId:X8} " +
$"exception={ex.GetType().Name}: {ex.Message} " +
$"stack=[{string.Join(" | ", stack)}]");
}
else if (t.IsCompletedSuccessfully && t.Result is null)
{
Console.WriteLine($"[indoor-upload] NULL_RESULT cellId=0x{cellId:X8}");
}
}, TaskScheduler.Default);
}
}
```
`using System.Linq;` and `using System.Threading.Tasks;` may need adding (likely already present).
### Component 2 — Capture procedure
Standard launch:
```powershell
$env:ACDREAM_PROBE_INDOOR_UPLOAD = "1"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath launch.log
```
Walk into Holtburg Inn, walk into nearby buildings whose cells were on the missing-26 list (`0xA9B40100`, `0xA9B40111`, etc.). Close gracefully.
Analyze:
```powershell
Get-Content launch.log |
Where-Object { $_ -match '\[indoor-upload\] (FAILED|NULL_RESULT)' } |
Select-Object -Unique
```
Expected output: a per-cell list of distinct exception types or null-return signals. Most cells likely share 13 root causes.
### Component 3 — Targeted fix (shape unknown until Component 2 captures)
Once Component 2 reveals the exception type + message, the fix is one localized code change. Likely shapes:
| Captured cause | Fix shape |
|---|---|
| Texture decode `Exception` (e.g. `KeyNotFoundException` on surface ID) | Guard at `WbMeshAdapter.PopulateMetadata` or pre-validate surfaces; possibly patch WB fork. |
| `KeyNotFoundException` for missing `Environment` / `CellStruct` | Log + skip cell with a sentinel render-data; report which dat is stale. |
| `NullReferenceException` in `PrepareCellStructMeshData` | Add null guard at the specific call site. |
| WB internal logic bug | Fork patch to WB. |
| `NULL_RESULT` (ResolveId returned empty / type was Unknown) | Investigate dat file integrity; possibly user needs a dat update. |
The fix is one or two code edits, lands as a single commit, and is followed by a re-launch verifying:
- `[indoor-upload] FAILED` / `NULL_RESULT` lines disappear for the previously-failing cells.
- `[indoor-upload] completed` appears for those cells.
- Visual verification: floor renders in Holtburg Inn.
---
## 3. Edge cases
| Scenario | Behavior |
|---|---|
| Probe toggled off mid-session | Continuation still emits if attached at request time. Acceptable — capturing the cause once matters more than honoring runtime toggle. |
| Continuation fires after adapter disposed | Harmless console write on dying process. No memory leak; closure captures only the `ulong` cellId. |
| Same cell requested twice | `_metadataPopulated.Add(id)` guards; continuation attaches exactly once. Re-streaming after Remove+Add keeps the sticky set. First failure is what we want. |
| Cancellation | `t.IsCanceled` is neither `IsFaulted` nor `IsCompletedSuccessfully`. Continuation silently skips. Acceptable — cancellation isn't a failure cause. |
| `Task.Result` on faulted task | Re-throws AggregateException. Our gate `else if (t.IsCompletedSuccessfully && t.Result is null)` ensures we never read Result without a clean success state. |
| WB's `_logger.LogError` for the same exception | WbMeshAdapter passes `NullLogger` — WB's log goes nowhere. Our continuation is what surfaces it. Discussed below. |
**Why not just inject a real logger into `ObjectMeshManager`?** Could replace `NullLogger<ObjectMeshManager>.Instance` with a real logger that writes to `Console.WriteLine`. Tradeoff:
- Real logger: simpler, leverages WB's existing `_logger.LogError` call → catches GfxObj + Setup + EnvCell failures.
- Our continuation: scoped to EnvCell IDs only → less noise.
Going with the continuation approach because:
1. The probe flag is already in place.
2. Phase 2 is targeted at EnvCells.
3. Real-logger would emit thousands of GfxObj/Setup log lines during landblock streaming, drowning the EnvCell signal.
We can revisit if a future debugging session calls for broader visibility.
---
## 4. Testing strategy
### Unit tests
None for Component 1 — the continuation is straight wiring around an async API; the logic is "if faulted, log; if null result, log." Testing requires either mocking `Task<ObjectMeshData?>` (low value) or running a real WB instance (impractical in unit tests).
### Visual verification (end-to-end)
Component 2's capture procedure is the verification mechanism:
1. Build green.
2. Launch with probe flag on, walk into Holtburg.
3. Confirm `[indoor-upload] FAILED` or `NULL_RESULT` lines appear for ~26 cells.
4. Apply Component 3's fix.
5. Re-launch, re-walk Holtburg.
6. **Acceptance:** previously-failing cells now produce `[indoor-upload] completed` lines AND the user can see the floor in Holtburg Inn.
---
## 5. What's NOT in this phase
- Tightening `IsEnvCellId` false-positives (flagged in Phase 1 capture note). Deferred — doesn't block Phase 2 since the upload probe gates on the correct path.
- Cell collision symptoms (no wall collision when exiting, weird open-air collisions). Separate investigation phase.
- Stab-leak-through-walls (Phase 1 Task 3). Deferred.
- Broader WB logger injection for GfxObj/Setup failures. Open if we ever want broader diagnostic visibility.
---
## 6. Acceptance criteria
- [ ] `WbMeshAdapter.IncrementRefCount` captures the prep task and attaches a continuation for EnvCell IDs.
- [ ] Continuation logs `[indoor-upload] FAILED cellId=0x... exception=<TypeName>: <Message> stack=[...]` for faulted tasks.
- [ ] Continuation logs `[indoor-upload] NULL_RESULT cellId=0x...` for clean-null returns.
- [ ] `dotnet build` clean. `dotnet test` clean (no new failures; pre-existing 8 physics/input failures unchanged).
- [ ] Capture launched, FAILED/NULL_RESULT lines appear for the previously-missing cells, distinct causes identified.
- [ ] Component 3 fix designed and implemented for each distinct cause.
- [ ] Re-capture confirms `[indoor-upload] completed` appears for cells previously missing.
- [ ] Visual verification: floor renders in Holtburg Inn.
- [ ] Roadmap updated with Phase 2 shipped.
- [ ] Commit messages cite the captured exception types + the fix rationale.