docs(spec): Phase 2 indoor cell rendering fix design

Three components:
1. WbMeshAdapter wraps the PrepareMeshDataAsync task with a continuation
   that surfaces faulted-task exceptions + null-result cases for EnvCell
   IDs only (gated by ProbeIndoorUploadEnabled). Two new log shapes:
   [indoor-upload] FAILED  cellId=0x... exception=<TypeName>: <Message>
                            stack=[<top 3 frames>]
   [indoor-upload] NULL_RESULT cellId=0x...
2. Capture procedure: re-launch at Holtburg with the probe on, grep for
   FAILED/NULL_RESULT lines, get definitive per-cell cause for the 26
   missing-completion cells from Phase 1's capture.
3. Targeted fix: code change matching whichever exception type / null
   pattern dominates. Fix shape is data-driven — see the contingency
   table in the spec.

WB's catch at ObjectMeshManager.cs:589 already calls _logger.LogError,
but WbMeshAdapter constructs the manager with NullLogger.Instance, so
the log is dropped. Our continuation surfaces the same data scoped to
EnvCells only (avoids the thousands of GfxObj/Setup log lines a real
logger would emit during landblock streaming).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Erik 2026-05-19 12:18:04 +02:00
parent 9f152d9754
commit 251763b2c4

View file

@ -0,0 +1,189 @@
# Indoor Cell Rendering Fix — Phase 2 Design
**Status:** Brainstormed 2026-05-19. Awaiting user review.
**Scope:** Surface the silent failure in WB's `PrepareEnvCellMeshData` for 26/123 Holtburg cells, then implement the targeted fix.
**Predecessor:** Phase 1 (`docs/superpowers/specs/2026-05-19-indoor-cell-rendering-fix-design.md`) shipped the five `[indoor-*]` probes that confirmed hypothesis H1.
**Capture evidence:** `docs/research/2026-05-19-indoor-cell-rendering-probe-capture.md`.
---
## 1. What we know
Phase 1's `ACDREAM_PROBE_INDOOR_ALL=1` capture at Holtburg `0xA9B4` proved:
- 123 EnvCells requested via `WbMeshAdapter.IncrementRefCount` → only **97 complete**.
- **26 cells** silently fail. They get `[indoor-upload] requested` but never `[indoor-upload] completed`.
- The dispatcher then tries to draw them, `TryGetRenderData` returns null, draw is silently skipped → user sees **missing floor**.
- The first interior cell `0xA9B40100` (likely the inn entry or another major building anchor) is among the 26.
The smoking gun is in WB's [`ObjectMeshManager.PrepareMeshData`](../../../references/WorldBuilder/Chorizite.OpenGLSDLBackend/Lib/ObjectMeshManager.cs):
```csharp
catch (Exception ex) {
_logger.LogError(ex, "Error preparing mesh data for 0x{Id:X16}", id);
return null;
}
```
WB logs the exception via its injected `_logger`. But [`WbMeshAdapter.cs:71`](../../../src/AcDream.App/Rendering/Wb/WbMeshAdapter.cs:71) constructs `ObjectMeshManager` with `NullLogger<ObjectMeshManager>.Instance` — so the log goes to `/dev/null`. The exception type and message are lost.
## 2. Solution — three components
### Component 1 — Exception-surfacing wrap
Capture the `Task<ObjectMeshData?>` returned by `_meshManager.PrepareMeshDataAsync(id, isSetup: false)` and attach a continuation that, for EnvCell IDs only, logs the failure cause.
Three logged outcomes:
- **Task faulted**`[indoor-upload] FAILED cellId=0x... exception=<TypeName>: <Message> stack=[<top 3 frames>]`. Unwrap `AggregateException.InnerException` for cleaner output.
- **Task succeeded with null result**`[indoor-upload] NULL_RESULT cellId=0x...`. WB's deliberate null-return path (e.g., `ResolveId` returned empty, type was `Unknown`).
- **Task succeeded with non-null result** → no extra log. The existing `Tick()` drain already emits `[indoor-upload] completed`.
The continuation:
- Runs on `TaskScheduler.Default` (`ThreadPool`) so it doesn't block the render thread.
- Only attached for EnvCell IDs (gated by `RenderingDiagnostics.IsEnvCellId(id)`) when `ProbeIndoorUploadEnabled` is true — zero cost when off.
- Captures `cellId` (a `ulong` value) only; no instance closure leakage.
- Truncates stack trace to top 3 frames.
Concrete code shape:
```csharp
if (_metadataPopulated.Add(id))
{
PopulateMetadata(id);
var prepTask = _meshManager.PrepareMeshDataAsync(id, isSetup: false);
if (RenderingDiagnostics.IsEnvCellId(id) && RenderingDiagnostics.ProbeIndoorUploadEnabled)
{
_pendingEnvCellRequests.Add(id);
Console.WriteLine($"[indoor-upload] requested cellId=0x{id:X8}");
ulong cellId = id;
_ = prepTask.ContinueWith(t =>
{
if (t.IsFaulted && t.Exception is not null)
{
var ex = t.Exception.InnerException ?? t.Exception;
var stack = (ex.StackTrace ?? "").Split('\n')
.Take(3).Select(s => s.Trim()).Where(s => s.Length > 0);
Console.WriteLine(
$"[indoor-upload] FAILED cellId=0x{cellId:X8} " +
$"exception={ex.GetType().Name}: {ex.Message} " +
$"stack=[{string.Join(" | ", stack)}]");
}
else if (t.IsCompletedSuccessfully && t.Result is null)
{
Console.WriteLine($"[indoor-upload] NULL_RESULT cellId=0x{cellId:X8}");
}
}, TaskScheduler.Default);
}
}
```
`using System.Linq;` and `using System.Threading.Tasks;` may need adding (likely already present).
### Component 2 — Capture procedure
Standard launch:
```powershell
$env:ACDREAM_PROBE_INDOOR_UPLOAD = "1"
dotnet run --project src\AcDream.App\AcDream.App.csproj --no-build -c Debug 2>&1 | Tee-Object -FilePath launch.log
```
Walk into Holtburg Inn, walk into nearby buildings whose cells were on the missing-26 list (`0xA9B40100`, `0xA9B40111`, etc.). Close gracefully.
Analyze:
```powershell
Get-Content launch.log |
Where-Object { $_ -match '\[indoor-upload\] (FAILED|NULL_RESULT)' } |
Select-Object -Unique
```
Expected output: a per-cell list of distinct exception types or null-return signals. Most cells likely share 13 root causes.
### Component 3 — Targeted fix (shape unknown until Component 2 captures)
Once Component 2 reveals the exception type + message, the fix is one localized code change. Likely shapes:
| Captured cause | Fix shape |
|---|---|
| Texture decode `Exception` (e.g. `KeyNotFoundException` on surface ID) | Guard at `WbMeshAdapter.PopulateMetadata` or pre-validate surfaces; possibly patch WB fork. |
| `KeyNotFoundException` for missing `Environment` / `CellStruct` | Log + skip cell with a sentinel render-data; report which dat is stale. |
| `NullReferenceException` in `PrepareCellStructMeshData` | Add null guard at the specific call site. |
| WB internal logic bug | Fork patch to WB. |
| `NULL_RESULT` (ResolveId returned empty / type was Unknown) | Investigate dat file integrity; possibly user needs a dat update. |
The fix is one or two code edits, lands as a single commit, and is followed by a re-launch verifying:
- `[indoor-upload] FAILED` / `NULL_RESULT` lines disappear for the previously-failing cells.
- `[indoor-upload] completed` appears for those cells.
- Visual verification: floor renders in Holtburg Inn.
---
## 3. Edge cases
| Scenario | Behavior |
|---|---|
| Probe toggled off mid-session | Continuation still emits if attached at request time. Acceptable — capturing the cause once matters more than honoring runtime toggle. |
| Continuation fires after adapter disposed | Harmless console write on dying process. No memory leak; closure captures only the `ulong` cellId. |
| Same cell requested twice | `_metadataPopulated.Add(id)` guards; continuation attaches exactly once. Re-streaming after Remove+Add keeps the sticky set. First failure is what we want. |
| Cancellation | `t.IsCanceled` is neither `IsFaulted` nor `IsCompletedSuccessfully`. Continuation silently skips. Acceptable — cancellation isn't a failure cause. |
| `Task.Result` on faulted task | Re-throws AggregateException. Our gate `else if (t.IsCompletedSuccessfully && t.Result is null)` ensures we never read Result without a clean success state. |
| WB's `_logger.LogError` for the same exception | WbMeshAdapter passes `NullLogger` — WB's log goes nowhere. Our continuation is what surfaces it. Discussed below. |
**Why not just inject a real logger into `ObjectMeshManager`?** Could replace `NullLogger<ObjectMeshManager>.Instance` with a real logger that writes to `Console.WriteLine`. Tradeoff:
- Real logger: simpler, leverages WB's existing `_logger.LogError` call → catches GfxObj + Setup + EnvCell failures.
- Our continuation: scoped to EnvCell IDs only → less noise.
Going with the continuation approach because:
1. The probe flag is already in place.
2. Phase 2 is targeted at EnvCells.
3. Real-logger would emit thousands of GfxObj/Setup log lines during landblock streaming, drowning the EnvCell signal.
We can revisit if a future debugging session calls for broader visibility.
---
## 4. Testing strategy
### Unit tests
None for Component 1 — the continuation is straight wiring around an async API; the logic is "if faulted, log; if null result, log." Testing requires either mocking `Task<ObjectMeshData?>` (low value) or running a real WB instance (impractical in unit tests).
### Visual verification (end-to-end)
Component 2's capture procedure is the verification mechanism:
1. Build green.
2. Launch with probe flag on, walk into Holtburg.
3. Confirm `[indoor-upload] FAILED` or `NULL_RESULT` lines appear for ~26 cells.
4. Apply Component 3's fix.
5. Re-launch, re-walk Holtburg.
6. **Acceptance:** previously-failing cells now produce `[indoor-upload] completed` lines AND the user can see the floor in Holtburg Inn.
---
## 5. What's NOT in this phase
- Tightening `IsEnvCellId` false-positives (flagged in Phase 1 capture note). Deferred — doesn't block Phase 2 since the upload probe gates on the correct path.
- Cell collision symptoms (no wall collision when exiting, weird open-air collisions). Separate investigation phase.
- Stab-leak-through-walls (Phase 1 Task 3). Deferred.
- Broader WB logger injection for GfxObj/Setup failures. Open if we ever want broader diagnostic visibility.
---
## 6. Acceptance criteria
- [ ] `WbMeshAdapter.IncrementRefCount` captures the prep task and attaches a continuation for EnvCell IDs.
- [ ] Continuation logs `[indoor-upload] FAILED cellId=0x... exception=<TypeName>: <Message> stack=[...]` for faulted tasks.
- [ ] Continuation logs `[indoor-upload] NULL_RESULT cellId=0x...` for clean-null returns.
- [ ] `dotnet build` clean. `dotnet test` clean (no new failures; pre-existing 8 physics/input failures unchanged).
- [ ] Capture launched, FAILED/NULL_RESULT lines appear for the previously-missing cells, distinct causes identified.
- [ ] Component 3 fix designed and implemented for each distinct cause.
- [ ] Re-capture confirms `[indoor-upload] completed` appears for cells previously missing.
- [ ] Visual verification: floor renders in Holtburg Inn.
- [ ] Roadmap updated with Phase 2 shipped.
- [ ] Commit messages cite the captured exception types + the fix rationale.