Records the 2026-06-09 dat-reader thread-safety investigation: concurrent READS on Chorizite.DatReaderWriter 2.1.7 exonerated (source audit + 1.1M-read hammer,b3920d8); the real crash was dispose-during-read at teardown (fixed,8fadf77); the white-walls mechanism remains open as #105 with every silent dat-miss exit tripwired (7433b70) so the next occurrence self-attributes. Also corrects project lore: the A.1-era rule that all dat reads must stay on one thread does not hold for the 2.1.7 read path, and both investigation subagents'' claimed ReadBlock instance-field race does not exist in the shipped source — verify agent claims against source before acting on them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
8.8 KiB
Dat-reader thread-safety investigation — read path EXONERATED, teardown bug FIXED, white-walls tripwired
Date: 2026-06-09. Branch: claude/thirsty-goldberg-51bb9b.
Trigger: white cottage walls reproduced on a CLEAN (probe-free) launch, refuting the prior
session's "heavy probes starve the dat-reader" framing as the whole story; user picked the
dat-reader stability fix as the session direction.
Read this before touching any dat-threading code. It REVERSES the long-standing "DatCollection is NOT thread-safe (for reads)" lore for the current package version, and it records exactly what was verified, what was refuted, and what remains open.
0. TL;DR
- Concurrent READS on Chorizite.DatReaderWriter 2.1.7 are SAFE — verified two independent
ways: (a) a line-level audit of the actual
release/2.1.7source (cross-checked against the decompiled NuGet DLL); (b) an in-tree hammer test (tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs) that replays our four-population access pattern — ~1.1 M concurrent reads, 8 threads, shuffled orders, raw block path AND full typed unpack path — zero anomalies, byte-identical to single-threaded golden fingerprints. - The verified crash bug is teardown, not reads:
ObjectMeshManager.Disposenever quiesced itsTask.Run(ProcessQueueAsync)decode workers andLandblockStreamer.Disposegave up its join after 2 s — thenGameWindow.OnClosingdisposed theDatCollection, which unmaps the dats' memory-mapped views (MemoryMappedBlockAllocator.DestroyMappedFile→_viewPtr = null). A worker still insideReadBlockthen dereferences the dead view pointer → uncatchable AccessViolationException withReadBlockon the stack — the recorded crash signature, firing on close/relaunch during decode storms. FIXED this session (quiesce-before-dispose inObjectMeshManager.Dispose+ entry/loop guards; 15 s loud join inLandblockStreamer.Dispose). - The white-walls mechanism is STILL OPEN (issue #105) — but it is now narrowed and
instrumented: every silent dat-failure exit on the walls-relevant paths has a tripwire log
line (
[dat-miss]/[tex-miss]/[tex-skip]/[cell-miss]). They fire ONLY on anomaly (zero cost when healthy) and stay in the build; the next organic occurrence self-attributes in the launch log. - Two false trails refuted: both investigation subagents confidently described a
ReadBlockinstance-field cursor race — the fields don't exist in 2.1.7 (all cursor state is method-local over a stable read-only view). One also cited a vendoredreferences/DatReaderWritercopy that doesn't exist. Verify agent claims against source.
1. The threading topology (verified, file:line)
ONE DatCollection (GameWindow.cs:1166, DatAccessType.Read) → four DatDatabases
(Portal/Cell/HighRes/Local), each with its own MemoryMappedBlockAllocator. Reader
populations on the SAME instances:
| Population | Entry | Serialization |
|---|---|---|
| Render thread | CellLoader-side hydration, TextureCache.DecodeFromDats, TickAnimations/MotionResolver, WorldPicker |
_datLock at 3 sites only (GameWindow.cs:2285,5603,8948,11224); the rest unlocked |
| Streamer worker | BuildLandblockForStreaming (GameWindow.cs:5142) |
the whole load under GameWindow._datLock |
| Decode pool (≤4) | ObjectMeshManager.ProcessQueueAsync → _dats.* via DatCollectionAdapter |
DatDatabaseWrapper._lock (per-db, DatCollectionAdapter.cs:120) — a different lock object than _datLock |
| Decode pool, raw | DatCollectionAdapter.ResolveId (DatCollectionAdapter.cs:74-96) |
none (raw Tree.TryGetFile) |
Two disjoint lock domains + unlocked paths ⇒ concurrent same-instance access is constant. Post-investigation verdict: that concurrency is fine for reads (see §2) — the locks are not protecting against anything real on the read path. Do NOT "unify the lock domains" as a fix for a read race that doesn't exist; the locks' remaining value is incidental (they serialize some of our own cache fills).
2. Why the read path is safe (audit summary, v2.1.7 source)
MemoryMappedBlockAllocator.ReadBlock(Lib/IO/BlockAllocators/MemoryMappedBlockAllocator.cs:137-160): chain cursor + buffer offset are locals; copies are destination-bounded;_viewPtris written only in the ctor andExpand(write-path only — never runs forDatAccessType.Read).- BTree:
DatBTreeReaderWriternode cache is an internally-locked LRU (:62-126); traversal state is local; nodes are immutable post-unpack. DatDatabase.TryGet<T>: rents from the thread-safeArrayPool, fills viaReadBlock, unpacks through a freshDatBinReader(per-instance_offset), caches into aConcurrentDictionary.ObjectFactory=ConcurrentDictionary.GetOrAdd;DBObjAttributeCache= benign last-wins lazy init of immutable dictionaries.- Hammer test confirms empirically (see §0.1). Known minor wart (not our symptom):
_fileCache[fileId] = valueis published one statement beforevalue.Id = fileId(DatDatabase.cs:389-393) — a racing reader can seeId == 0. Upstream PR candidate.
Correction to project lore: feedback_phase_a1_hotfix_saga.md's "DatBinReader holds a
buffer position field per database; concurrent Get<T> corrupts" does NOT describe 2.1.7 —
DatBinReader is created per call. The A.1-era crash predates Phase O's single-reader
unification and was likely a different stack (or version). The A.5 spec's "single-threaded by
construction" claim was composition-false the day it shipped (N.4's pool decodes already
existed) — but per this investigation it also didn't matter for correctness.
3. The teardown fix (this session)
ObjectMeshManager.Dispose(src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs): setsIsDisposedunder the queue lock, cancels + drains_pendingRequests, then waits (≤10 s, 5 ms poll) for_activeWorkers == 0before returning; logs an error if workers outlive the wait.ProcessQueueAsyncre-checksIsDisposedbefore every dequeue; bothPrepare*Asyncentries and enqueue blocks early-out when disposed.LandblockStreamer.Dispose(src/AcDream.App/Streaming/LandblockStreamer.cs): join extended 2 s → 15 s with a loud[streamer]error line on timeout (cancellation is honored between jobs, so the wait is bounded by one landblock load).GameWindow.OnClosingorder was already correct (streamer → mesh adapter → … →_datslast); with both Disposes now actually quiescing, the order is meaningful.- Audio was checked and is NOT a dat-reading population (no
_datsaccess undersrc/AcDream.App/Audio/).
Smoke: 3× close-mid-decode-storm launches, clean exits, no crash signatures (see commit).
4. The white-walls residual (#105) + tripwires
Observed signature (2× user-confirmed): cottage wall surfaces render as background/clear while
the cell's statics (paintings, furniture, windows) draw; once broken, broken for the session;
strictly intermittent (one clean launch had it, a heavier probe launch didn't); zero error
output ever — today's white-wall launch log was 35 lines, no [wb-error].
All silent exits found and tripwired (they print ONLY on anomaly; keep them):
| Line | Where | Meaning when it fires |
|---|---|---|
[dat-miss] |
DatDatabaseWrapper.TryGet (DatCollectionAdapter.cs) |
a read failed for an id whose BTree entry EXISTS — the smoking gun for any real dat-layer fault |
[tex-skip] |
ObjectMeshManager GfxObj + CellStruct texture chains (5 exits) |
a polygon batch (incl. WALL batches) was silently dropped on a dat miss |
[tex-miss] |
TextureCache.DecodeFromDats (3 exits) |
render-thread texture decode fell back to magenta |
[cell-miss] |
GameWindow interior hydration (EnvCell / Environment null) |
a cell's walls were silently never registered — the exact white-walls geometry signature |
Discriminator for the next occurrence: magenta walls → TextureCache path; see-through
walls + [tex-skip]/[dat-miss] → mesh-build path; see-through + [cell-miss] →
hydration; white/missing with NO tripwire output → the failure is GL-side (staged upload /
bindless residency under load) — instrument WbMeshAdapter.Tick's upload drain next.
5. Apparatus
tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs— the hammer (8 threads × 25 shuffled passes; rawTryGetFileBytes+ typedTryGetwithFileCachingStrategy.Never; golden FNV-1a fingerprints). Skips cleanly when dats absent. KEEP as the regression guard for any future dat-reader version bump.- Package source for reference:
git clone --depth 1 --branch release/2.1.7 https://github.com/Chorizite/DatReaderWriter(tag verified == shipped NuGet DLL). - The tripwire lines (§4) — grep any launch log for
dat-miss|tex-miss|tex-skip|cell-miss.