acdream/docs/research/2026-06-09-dat-reader-thread-safety-investigation.md
Erik 8dc707d43b docs: dat-reader investigation handoff + file #105 (white walls, tripwired)
Records the 2026-06-09 dat-reader thread-safety investigation: concurrent
READS on Chorizite.DatReaderWriter 2.1.7 exonerated (source audit + 1.1M-read
hammer, b3920d8); the real crash was dispose-during-read at teardown (fixed,
8fadf77); the white-walls mechanism remains open as #105 with every silent
dat-miss exit tripwired (7433b70) so the next occurrence self-attributes.

Also corrects project lore: the A.1-era rule that all dat reads must stay on
one thread does not hold for the 2.1.7 read path, and both investigation
subagents'' claimed ReadBlock instance-field race does not exist in the
shipped source — verify agent claims against source before acting on them.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-09 21:29:06 +02:00

8.8 KiB
Raw Blame History

Dat-reader thread-safety investigation — read path EXONERATED, teardown bug FIXED, white-walls tripwired

Date: 2026-06-09. Branch: claude/thirsty-goldberg-51bb9b. Trigger: white cottage walls reproduced on a CLEAN (probe-free) launch, refuting the prior session's "heavy probes starve the dat-reader" framing as the whole story; user picked the dat-reader stability fix as the session direction.

Read this before touching any dat-threading code. It REVERSES the long-standing "DatCollection is NOT thread-safe (for reads)" lore for the current package version, and it records exactly what was verified, what was refuted, and what remains open.


0. TL;DR

  1. Concurrent READS on Chorizite.DatReaderWriter 2.1.7 are SAFE — verified two independent ways: (a) a line-level audit of the actual release/2.1.7 source (cross-checked against the decompiled NuGet DLL); (b) an in-tree hammer test (tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs) that replays our four-population access pattern — ~1.1 M concurrent reads, 8 threads, shuffled orders, raw block path AND full typed unpack path — zero anomalies, byte-identical to single-threaded golden fingerprints.
  2. The verified crash bug is teardown, not reads: ObjectMeshManager.Dispose never quiesced its Task.Run(ProcessQueueAsync) decode workers and LandblockStreamer.Dispose gave up its join after 2 s — then GameWindow.OnClosing disposed the DatCollection, which unmaps the dats' memory-mapped views (MemoryMappedBlockAllocator.DestroyMappedFile_viewPtr = null). A worker still inside ReadBlock then dereferences the dead view pointer → uncatchable AccessViolationException with ReadBlock on the stack — the recorded crash signature, firing on close/relaunch during decode storms. FIXED this session (quiesce-before-dispose in ObjectMeshManager.Dispose + entry/loop guards; 15 s loud join in LandblockStreamer.Dispose).
  3. The white-walls mechanism is STILL OPEN (issue #105) — but it is now narrowed and instrumented: every silent dat-failure exit on the walls-relevant paths has a tripwire log line ([dat-miss] / [tex-miss] / [tex-skip] / [cell-miss]). They fire ONLY on anomaly (zero cost when healthy) and stay in the build; the next organic occurrence self-attributes in the launch log.
  4. Two false trails refuted: both investigation subagents confidently described a ReadBlock instance-field cursor race — the fields don't exist in 2.1.7 (all cursor state is method-local over a stable read-only view). One also cited a vendored references/DatReaderWriter copy that doesn't exist. Verify agent claims against source.

1. The threading topology (verified, file:line)

ONE DatCollection (GameWindow.cs:1166, DatAccessType.Read) → four DatDatabases (Portal/Cell/HighRes/Local), each with its own MemoryMappedBlockAllocator. Reader populations on the SAME instances:

Population Entry Serialization
Render thread CellLoader-side hydration, TextureCache.DecodeFromDats, TickAnimations/MotionResolver, WorldPicker _datLock at 3 sites only (GameWindow.cs:2285,5603,8948,11224); the rest unlocked
Streamer worker BuildLandblockForStreaming (GameWindow.cs:5142) the whole load under GameWindow._datLock
Decode pool (≤4) ObjectMeshManager.ProcessQueueAsync_dats.* via DatCollectionAdapter DatDatabaseWrapper._lock (per-db, DatCollectionAdapter.cs:120) — a different lock object than _datLock
Decode pool, raw DatCollectionAdapter.ResolveId (DatCollectionAdapter.cs:74-96) none (raw Tree.TryGetFile)

Two disjoint lock domains + unlocked paths ⇒ concurrent same-instance access is constant. Post-investigation verdict: that concurrency is fine for reads (see §2) — the locks are not protecting against anything real on the read path. Do NOT "unify the lock domains" as a fix for a read race that doesn't exist; the locks' remaining value is incidental (they serialize some of our own cache fills).

2. Why the read path is safe (audit summary, v2.1.7 source)

  • MemoryMappedBlockAllocator.ReadBlock (Lib/IO/BlockAllocators/MemoryMappedBlockAllocator.cs:137-160): chain cursor + buffer offset are locals; copies are destination-bounded; _viewPtr is written only in the ctor and Expand (write-path only — never runs for DatAccessType.Read).
  • BTree: DatBTreeReaderWriter node cache is an internally-locked LRU (:62-126); traversal state is local; nodes are immutable post-unpack.
  • DatDatabase.TryGet<T>: rents from the thread-safe ArrayPool, fills via ReadBlock, unpacks through a fresh DatBinReader (per-instance _offset), caches into a ConcurrentDictionary. ObjectFactory = ConcurrentDictionary.GetOrAdd; DBObjAttributeCache = benign last-wins lazy init of immutable dictionaries.
  • Hammer test confirms empirically (see §0.1). Known minor wart (not our symptom): _fileCache[fileId] = value is published one statement before value.Id = fileId (DatDatabase.cs:389-393) — a racing reader can see Id == 0. Upstream PR candidate.

Correction to project lore: feedback_phase_a1_hotfix_saga.md's "DatBinReader holds a buffer position field per database; concurrent Get<T> corrupts" does NOT describe 2.1.7 — DatBinReader is created per call. The A.1-era crash predates Phase O's single-reader unification and was likely a different stack (or version). The A.5 spec's "single-threaded by construction" claim was composition-false the day it shipped (N.4's pool decodes already existed) — but per this investigation it also didn't matter for correctness.

3. The teardown fix (this session)

  • ObjectMeshManager.Dispose (src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs): sets IsDisposed under the queue lock, cancels + drains _pendingRequests, then waits (≤10 s, 5 ms poll) for _activeWorkers == 0 before returning; logs an error if workers outlive the wait. ProcessQueueAsync re-checks IsDisposed before every dequeue; both Prepare*Async entries and enqueue blocks early-out when disposed.
  • LandblockStreamer.Dispose (src/AcDream.App/Streaming/LandblockStreamer.cs): join extended 2 s → 15 s with a loud [streamer] error line on timeout (cancellation is honored between jobs, so the wait is bounded by one landblock load).
  • GameWindow.OnClosing order was already correct (streamer → mesh adapter → … → _dats last); with both Disposes now actually quiescing, the order is meaningful.
  • Audio was checked and is NOT a dat-reading population (no _dats access under src/AcDream.App/Audio/).

Smoke: 3× close-mid-decode-storm launches, clean exits, no crash signatures (see commit).

4. The white-walls residual (#105) + tripwires

Observed signature (2× user-confirmed): cottage wall surfaces render as background/clear while the cell's statics (paintings, furniture, windows) draw; once broken, broken for the session; strictly intermittent (one clean launch had it, a heavier probe launch didn't); zero error output ever — today's white-wall launch log was 35 lines, no [wb-error].

All silent exits found and tripwired (they print ONLY on anomaly; keep them):

Line Where Meaning when it fires
[dat-miss] DatDatabaseWrapper.TryGet (DatCollectionAdapter.cs) a read failed for an id whose BTree entry EXISTS — the smoking gun for any real dat-layer fault
[tex-skip] ObjectMeshManager GfxObj + CellStruct texture chains (5 exits) a polygon batch (incl. WALL batches) was silently dropped on a dat miss
[tex-miss] TextureCache.DecodeFromDats (3 exits) render-thread texture decode fell back to magenta
[cell-miss] GameWindow interior hydration (EnvCell / Environment null) a cell's walls were silently never registered — the exact white-walls geometry signature

Discriminator for the next occurrence: magenta walls → TextureCache path; see-through walls + [tex-skip]/[dat-miss] → mesh-build path; see-through + [cell-miss] → hydration; white/missing with NO tripwire output → the failure is GL-side (staged upload / bindless residency under load) — instrument WbMeshAdapter.Tick's upload drain next.

5. Apparatus

  • tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs — the hammer (8 threads × 25 shuffled passes; raw TryGetFileBytes + typed TryGet with FileCachingStrategy.Never; golden FNV-1a fingerprints). Skips cleanly when dats absent. KEEP as the regression guard for any future dat-reader version bump.
  • Package source for reference: git clone --depth 1 --branch release/2.1.7 https://github.com/Chorizite/DatReaderWriter (tag verified == shipped NuGet DLL).
  • The tripwire lines (§4) — grep any launch log for dat-miss|tex-miss|tex-skip|cell-miss.