# Dat-reader thread-safety investigation — read path EXONERATED, teardown bug FIXED, white-walls tripwired **Date:** 2026-06-09. **Branch:** `claude/thirsty-goldberg-51bb9b`. **Trigger:** white cottage walls reproduced on a CLEAN (probe-free) launch, refuting the prior session's "heavy probes starve the dat-reader" framing as the whole story; user picked the dat-reader stability fix as the session direction. > Read this before touching any dat-threading code. It REVERSES the long-standing > "DatCollection is NOT thread-safe (for reads)" lore for the current package version, and it > records exactly what was verified, what was refuted, and what remains open. --- ## 0. TL;DR 1. **Concurrent READS on Chorizite.DatReaderWriter 2.1.7 are SAFE** — verified two independent ways: (a) a line-level audit of the actual `release/2.1.7` source (cross-checked against the decompiled NuGet DLL); (b) an in-tree hammer test (`tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs`) that replays our four-population access pattern — **~1.1 M concurrent reads, 8 threads, shuffled orders, raw block path AND full typed unpack path — zero anomalies, byte-identical to single-threaded golden fingerprints.** 2. **The verified crash bug is teardown, not reads:** `ObjectMeshManager.Dispose` never quiesced its `Task.Run(ProcessQueueAsync)` decode workers and `LandblockStreamer.Dispose` gave up its join after 2 s — then `GameWindow.OnClosing` disposed the `DatCollection`, which unmaps the dats' memory-mapped views (`MemoryMappedBlockAllocator.DestroyMappedFile` → `_viewPtr = null`). A worker still inside `ReadBlock` then dereferences the dead view pointer → **uncatchable AccessViolationException with `ReadBlock` on the stack** — the recorded crash signature, firing on close/relaunch during decode storms. **FIXED this session** (quiesce-before-dispose in `ObjectMeshManager.Dispose` + entry/loop guards; 15 s loud join in `LandblockStreamer.Dispose`). 3. **The white-walls mechanism is STILL OPEN** (issue #105) — but it is now narrowed and instrumented: every silent dat-failure exit on the walls-relevant paths has a tripwire log line (`[dat-miss]` / `[tex-miss]` / `[tex-skip]` / `[cell-miss]`). They fire ONLY on anomaly (zero cost when healthy) and stay in the build; the next organic occurrence self-attributes in the launch log. 4. **Two false trails refuted:** both investigation subagents confidently described a `ReadBlock` instance-field cursor race — the fields don't exist in 2.1.7 (all cursor state is method-local over a stable read-only view). One also cited a vendored `references/DatReaderWriter` copy that doesn't exist. Verify agent claims against source. --- ## 1. The threading topology (verified, file:line) ONE `DatCollection` (`GameWindow.cs:1166`, `DatAccessType.Read`) → four `DatDatabase`s (Portal/Cell/HighRes/Local), each with its own `MemoryMappedBlockAllocator`. Reader populations on the SAME instances: | Population | Entry | Serialization | |---|---|---| | Render thread | `CellLoader`-side hydration, `TextureCache.DecodeFromDats`, `TickAnimations`/`MotionResolver`, `WorldPicker` | `_datLock` at 3 sites only (`GameWindow.cs:2285,5603,8948,11224`); the rest **unlocked** | | Streamer worker | `BuildLandblockForStreaming` (`GameWindow.cs:5142`) | the whole load under `GameWindow._datLock` | | Decode pool (≤4) | `ObjectMeshManager.ProcessQueueAsync` → `_dats.*` via `DatCollectionAdapter` | `DatDatabaseWrapper._lock` (per-db, `DatCollectionAdapter.cs:120`) — **a different lock object than `_datLock`** | | Decode pool, raw | `DatCollectionAdapter.ResolveId` (`DatCollectionAdapter.cs:74-96`) | **none** (raw `Tree.TryGetFile`) | Two disjoint lock domains + unlocked paths ⇒ concurrent same-instance access is constant. **Post-investigation verdict: that concurrency is fine for reads** (see §2) — the locks are not protecting against anything real on the read path. Do NOT "unify the lock domains" as a fix for a read race that doesn't exist; the locks' remaining value is incidental (they serialize some of our own cache fills). ## 2. Why the read path is safe (audit summary, v2.1.7 source) - `MemoryMappedBlockAllocator.ReadBlock` (`Lib/IO/BlockAllocators/MemoryMappedBlockAllocator.cs:137-160`): chain cursor + buffer offset are **locals**; copies are destination-bounded; `_viewPtr` is written only in the ctor and `Expand` (write-path only — never runs for `DatAccessType.Read`). - BTree: `DatBTreeReaderWriter` node cache is an internally-locked LRU (`:62-126`); traversal state is local; nodes are immutable post-unpack. - `DatDatabase.TryGet`: rents from the thread-safe `ArrayPool`, fills via `ReadBlock`, unpacks through a **fresh** `DatBinReader` (per-instance `_offset`), caches into a `ConcurrentDictionary`. `ObjectFactory` = `ConcurrentDictionary.GetOrAdd`; `DBObjAttributeCache` = benign last-wins lazy init of immutable dictionaries. - Hammer test confirms empirically (see §0.1). Known minor wart (not our symptom): `_fileCache[fileId] = value` is published one statement before `value.Id = fileId` (`DatDatabase.cs:389-393`) — a racing reader can see `Id == 0`. Upstream PR candidate. **Correction to project lore:** `feedback_phase_a1_hotfix_saga.md`'s "DatBinReader holds a buffer position field per database; concurrent `Get` corrupts" does NOT describe 2.1.7 — `DatBinReader` is created per call. The A.1-era crash predates Phase O's single-reader unification and was likely a different stack (or version). The A.5 spec's "single-threaded by construction" claim was composition-false the day it shipped (N.4's pool decodes already existed) — but per this investigation it also didn't matter for correctness. ## 3. The teardown fix (this session) - `ObjectMeshManager.Dispose` (`src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs`): sets `IsDisposed` under the queue lock, cancels + drains `_pendingRequests`, then waits (≤10 s, 5 ms poll) for `_activeWorkers == 0` before returning; logs an error if workers outlive the wait. `ProcessQueueAsync` re-checks `IsDisposed` before every dequeue; both `Prepare*Async` entries and enqueue blocks early-out when disposed. - `LandblockStreamer.Dispose` (`src/AcDream.App/Streaming/LandblockStreamer.cs`): join extended 2 s → 15 s with a loud `[streamer]` error line on timeout (cancellation is honored between jobs, so the wait is bounded by one landblock load). - `GameWindow.OnClosing` order was already correct (streamer → mesh adapter → … → `_dats` last); with both Disposes now actually quiescing, the order is meaningful. - Audio was checked and is NOT a dat-reading population (no `_dats` access under `src/AcDream.App/Audio/`). Smoke: 3× close-mid-decode-storm launches, clean exits, no crash signatures (see commit). ## 4. The white-walls residual (#105) + tripwires Observed signature (2× user-confirmed): cottage wall surfaces render as background/clear while the cell's statics (paintings, furniture, windows) draw; once broken, broken for the session; strictly intermittent (one clean launch had it, a heavier probe launch didn't); **zero error output ever** — today's white-wall launch log was 35 lines, no `[wb-error]`. All silent exits found and tripwired (they print ONLY on anomaly; keep them): | Line | Where | Meaning when it fires | |---|---|---| | `[dat-miss]` | `DatDatabaseWrapper.TryGet` (`DatCollectionAdapter.cs`) | a read failed for an id whose BTree entry EXISTS — the smoking gun for any real dat-layer fault | | `[tex-skip]` | `ObjectMeshManager` GfxObj + CellStruct texture chains (5 exits) | a polygon batch (incl. WALL batches) was silently dropped on a dat miss | | `[tex-miss]` | `TextureCache.DecodeFromDats` (3 exits) | render-thread texture decode fell back to magenta | | `[cell-miss]` | `GameWindow` interior hydration (EnvCell / Environment null) | a cell's walls were silently never registered — the exact white-walls geometry signature | Discriminator for the next occurrence: **magenta walls** → TextureCache path; **see-through walls + `[tex-skip]`/`[dat-miss]`** → mesh-build path; **see-through + `[cell-miss]`** → hydration; **white/missing with NO tripwire output** → the failure is GL-side (staged upload / bindless residency under load) — instrument `WbMeshAdapter.Tick`'s upload drain next. ## 5. Apparatus - `tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs` — the hammer (8 threads × 25 shuffled passes; raw `TryGetFileBytes` + typed `TryGet` with `FileCachingStrategy.Never`; golden FNV-1a fingerprints). Skips cleanly when dats absent. KEEP as the regression guard for any future dat-reader version bump. - Package source for reference: `git clone --depth 1 --branch release/2.1.7 https://github.com/Chorizite/DatReaderWriter` (tag verified == shipped NuGet DLL). - The tripwire lines (§4) — grep any launch log for `dat-miss|tex-miss|tex-skip|cell-miss`.