diff --git a/docs/ISSUES.md b/docs/ISSUES.md index e923a554..de9c7915 100644 --- a/docs/ISSUES.md +++ b/docs/ISSUES.md @@ -44,6 +44,47 @@ Copy this block when adding a new issue: --- +## #105 — Intermittent silent dat-load failure: white/missing cottage walls until relaunch + +**Status:** OPEN +**Severity:** HIGH +**Filed:** 2026-06-09 +**Component:** render, dat, streaming + +**Description:** Intermittently (twice user-confirmed: once under heavy probe logging +2026-06-08, once on a completely clean launch 2026-06-09), the Holtburg cottage wall surfaces +render as background/clear color while the cells' static objects (paintings, furniture, +windows) draw normally. Once broken, broken for the whole session — the failed result is +cached (mesh batches build once, hydration runs once). The failure has never produced a single +log line: every dat-read failure on the walls-relevant paths exits silently. + +**Root cause / status:** NOT YET ATTRIBUTED — but heavily narrowed (full evidence: +`docs/research/2026-06-09-dat-reader-thread-safety-investigation.md`). Refuted: the dat +library's concurrent read path (Chorizite.DatReaderWriter 2.1.7) is exonerated by line-level +source audit + the in-tree hammer test (`DatConcurrencyStressTests`, ~1.1M concurrent reads, +zero anomalies). Fixed nearby: the teardown dispose-during-read AccessViolation (`8fadf77`). +Remaining candidates: a lifecycle/ordering race in our hydration/streaming pipeline, or a +GL-side staged-upload / bindless-residency failure under load. Tripwire log lines +(`[dat-miss]` / `[tex-miss]` / `[tex-skip]` / `[cell-miss]`, commit `7433b70`) now cover every +silent exit; the next occurrence self-attributes in the launch log. Discriminator: magenta +walls = TextureCache decode miss; see-through + `[tex-skip]`/`[dat-miss]` = mesh-build miss; +see-through + `[cell-miss]` = hydration miss; broken with NO tripwire output = GL-side — +instrument `WbMeshAdapter.Tick`'s upload drain next. + +**Files:** `src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs` (texture chains), +`src/AcDream.App/Rendering/Wb/DatCollectionAdapter.cs` (`DatDatabaseWrapper.TryGet`), +`src/AcDream.App/Rendering/TextureCache.cs` (`DecodeFromDats`), +`src/AcDream.App/Rendering/GameWindow.cs` (`BuildInteriorEntitiesForStreaming` hydration). + +**Research:** `docs/research/2026-06-09-dat-reader-thread-safety-investigation.md`. + +**Acceptance:** the next white-wall occurrence is attributed by a tripwire line (or by their +collective silence, pointing GL-side); the attributed root cause is fixed; walls survive 10+ +launches including heavy-load ones. Tripwires can then be stripped (or kept as permanent +anomaly logging — decide at close). + +--- + ## #104 — Scene VFX particles not clipped to the PView visible cell set **Status:** OPEN diff --git a/docs/research/2026-06-09-dat-reader-thread-safety-investigation.md b/docs/research/2026-06-09-dat-reader-thread-safety-investigation.md new file mode 100644 index 00000000..f66f825c --- /dev/null +++ b/docs/research/2026-06-09-dat-reader-thread-safety-investigation.md @@ -0,0 +1,131 @@ +# Dat-reader thread-safety investigation — read path EXONERATED, teardown bug FIXED, white-walls tripwired + +**Date:** 2026-06-09. **Branch:** `claude/thirsty-goldberg-51bb9b`. +**Trigger:** white cottage walls reproduced on a CLEAN (probe-free) launch, refuting the prior +session's "heavy probes starve the dat-reader" framing as the whole story; user picked the +dat-reader stability fix as the session direction. + +> Read this before touching any dat-threading code. It REVERSES the long-standing +> "DatCollection is NOT thread-safe (for reads)" lore for the current package version, and it +> records exactly what was verified, what was refuted, and what remains open. + +--- + +## 0. TL;DR + +1. **Concurrent READS on Chorizite.DatReaderWriter 2.1.7 are SAFE** — verified two independent + ways: (a) a line-level audit of the actual `release/2.1.7` source (cross-checked against the + decompiled NuGet DLL); (b) an in-tree hammer test + (`tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs`) that replays our + four-population access pattern — **~1.1 M concurrent reads, 8 threads, shuffled orders, raw + block path AND full typed unpack path — zero anomalies, byte-identical to single-threaded + golden fingerprints.** +2. **The verified crash bug is teardown, not reads:** `ObjectMeshManager.Dispose` never + quiesced its `Task.Run(ProcessQueueAsync)` decode workers and `LandblockStreamer.Dispose` + gave up its join after 2 s — then `GameWindow.OnClosing` disposed the `DatCollection`, which + unmaps the dats' memory-mapped views (`MemoryMappedBlockAllocator.DestroyMappedFile` → + `_viewPtr = null`). A worker still inside `ReadBlock` then dereferences the dead view + pointer → **uncatchable AccessViolationException with `ReadBlock` on the stack** — the + recorded crash signature, firing on close/relaunch during decode storms. **FIXED this + session** (quiesce-before-dispose in `ObjectMeshManager.Dispose` + entry/loop guards; + 15 s loud join in `LandblockStreamer.Dispose`). +3. **The white-walls mechanism is STILL OPEN** (issue #105) — but it is now narrowed and + instrumented: every silent dat-failure exit on the walls-relevant paths has a tripwire log + line (`[dat-miss]` / `[tex-miss]` / `[tex-skip]` / `[cell-miss]`). They fire ONLY on anomaly + (zero cost when healthy) and stay in the build; the next organic occurrence self-attributes + in the launch log. +4. **Two false trails refuted:** both investigation subagents confidently described a + `ReadBlock` instance-field cursor race — the fields don't exist in 2.1.7 (all cursor state + is method-local over a stable read-only view). One also cited a vendored + `references/DatReaderWriter` copy that doesn't exist. Verify agent claims against source. + +--- + +## 1. The threading topology (verified, file:line) + +ONE `DatCollection` (`GameWindow.cs:1166`, `DatAccessType.Read`) → four `DatDatabase`s +(Portal/Cell/HighRes/Local), each with its own `MemoryMappedBlockAllocator`. Reader +populations on the SAME instances: + +| Population | Entry | Serialization | +|---|---|---| +| Render thread | `CellLoader`-side hydration, `TextureCache.DecodeFromDats`, `TickAnimations`/`MotionResolver`, `WorldPicker` | `_datLock` at 3 sites only (`GameWindow.cs:2285,5603,8948,11224`); the rest **unlocked** | +| Streamer worker | `BuildLandblockForStreaming` (`GameWindow.cs:5142`) | the whole load under `GameWindow._datLock` | +| Decode pool (≤4) | `ObjectMeshManager.ProcessQueueAsync` → `_dats.*` via `DatCollectionAdapter` | `DatDatabaseWrapper._lock` (per-db, `DatCollectionAdapter.cs:120`) — **a different lock object than `_datLock`** | +| Decode pool, raw | `DatCollectionAdapter.ResolveId` (`DatCollectionAdapter.cs:74-96`) | **none** (raw `Tree.TryGetFile`) | + +Two disjoint lock domains + unlocked paths ⇒ concurrent same-instance access is constant. +**Post-investigation verdict: that concurrency is fine for reads** (see §2) — the locks are +not protecting against anything real on the read path. Do NOT "unify the lock domains" as a +fix for a read race that doesn't exist; the locks' remaining value is incidental (they +serialize some of our own cache fills). + +## 2. Why the read path is safe (audit summary, v2.1.7 source) + +- `MemoryMappedBlockAllocator.ReadBlock` (`Lib/IO/BlockAllocators/MemoryMappedBlockAllocator.cs:137-160`): + chain cursor + buffer offset are **locals**; copies are destination-bounded; `_viewPtr` is + written only in the ctor and `Expand` (write-path only — never runs for `DatAccessType.Read`). +- BTree: `DatBTreeReaderWriter` node cache is an internally-locked LRU (`:62-126`); + traversal state is local; nodes are immutable post-unpack. +- `DatDatabase.TryGet`: rents from the thread-safe `ArrayPool`, fills via `ReadBlock`, + unpacks through a **fresh** `DatBinReader` (per-instance `_offset`), caches into a + `ConcurrentDictionary`. `ObjectFactory` = `ConcurrentDictionary.GetOrAdd`; + `DBObjAttributeCache` = benign last-wins lazy init of immutable dictionaries. +- Hammer test confirms empirically (see §0.1). Known minor wart (not our symptom): + `_fileCache[fileId] = value` is published one statement before `value.Id = fileId` + (`DatDatabase.cs:389-393`) — a racing reader can see `Id == 0`. Upstream PR candidate. + +**Correction to project lore:** `feedback_phase_a1_hotfix_saga.md`'s "DatBinReader holds a +buffer position field per database; concurrent `Get` corrupts" does NOT describe 2.1.7 — +`DatBinReader` is created per call. The A.1-era crash predates Phase O's single-reader +unification and was likely a different stack (or version). The A.5 spec's "single-threaded by +construction" claim was composition-false the day it shipped (N.4's pool decodes already +existed) — but per this investigation it also didn't matter for correctness. + +## 3. The teardown fix (this session) + +- `ObjectMeshManager.Dispose` (`src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs`): sets + `IsDisposed` under the queue lock, cancels + drains `_pendingRequests`, then waits (≤10 s, + 5 ms poll) for `_activeWorkers == 0` before returning; logs an error if workers outlive the + wait. `ProcessQueueAsync` re-checks `IsDisposed` before every dequeue; both `Prepare*Async` + entries and enqueue blocks early-out when disposed. +- `LandblockStreamer.Dispose` (`src/AcDream.App/Streaming/LandblockStreamer.cs`): join + extended 2 s → 15 s with a loud `[streamer]` error line on timeout (cancellation is honored + between jobs, so the wait is bounded by one landblock load). +- `GameWindow.OnClosing` order was already correct (streamer → mesh adapter → … → `_dats` + last); with both Disposes now actually quiescing, the order is meaningful. +- Audio was checked and is NOT a dat-reading population (no `_dats` access under + `src/AcDream.App/Audio/`). + +Smoke: 3× close-mid-decode-storm launches, clean exits, no crash signatures (see commit). + +## 4. The white-walls residual (#105) + tripwires + +Observed signature (2× user-confirmed): cottage wall surfaces render as background/clear while +the cell's statics (paintings, furniture, windows) draw; once broken, broken for the session; +strictly intermittent (one clean launch had it, a heavier probe launch didn't); **zero error +output ever** — today's white-wall launch log was 35 lines, no `[wb-error]`. + +All silent exits found and tripwired (they print ONLY on anomaly; keep them): + +| Line | Where | Meaning when it fires | +|---|---|---| +| `[dat-miss]` | `DatDatabaseWrapper.TryGet` (`DatCollectionAdapter.cs`) | a read failed for an id whose BTree entry EXISTS — the smoking gun for any real dat-layer fault | +| `[tex-skip]` | `ObjectMeshManager` GfxObj + CellStruct texture chains (5 exits) | a polygon batch (incl. WALL batches) was silently dropped on a dat miss | +| `[tex-miss]` | `TextureCache.DecodeFromDats` (3 exits) | render-thread texture decode fell back to magenta | +| `[cell-miss]` | `GameWindow` interior hydration (EnvCell / Environment null) | a cell's walls were silently never registered — the exact white-walls geometry signature | + +Discriminator for the next occurrence: **magenta walls** → TextureCache path; **see-through +walls + `[tex-skip]`/`[dat-miss]`** → mesh-build path; **see-through + `[cell-miss]`** → +hydration; **white/missing with NO tripwire output** → the failure is GL-side (staged upload / +bindless residency under load) — instrument `WbMeshAdapter.Tick`'s upload drain next. + +## 5. Apparatus + +- `tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs` — the hammer + (8 threads × 25 shuffled passes; raw `TryGetFileBytes` + typed `TryGet` with + `FileCachingStrategy.Never`; golden FNV-1a fingerprints). Skips cleanly when dats absent. + KEEP as the regression guard for any future dat-reader version bump. +- Package source for reference: `git clone --depth 1 --branch release/2.1.7 + https://github.com/Chorizite/DatReaderWriter` (tag verified == shipped NuGet DLL). +- The tripwire lines (§4) — grep any launch log for `dat-miss|tex-miss|tex-skip|cell-miss`.