acdream/docs/research/2026-06-09-dat-reader-thread-safety-investigation.md
Erik 8dc707d43b docs: dat-reader investigation handoff + file #105 (white walls, tripwired)
Records the 2026-06-09 dat-reader thread-safety investigation: concurrent
READS on Chorizite.DatReaderWriter 2.1.7 exonerated (source audit + 1.1M-read
hammer, b3920d8); the real crash was dispose-during-read at teardown (fixed,
8fadf77); the white-walls mechanism remains open as #105 with every silent
dat-miss exit tripwired (7433b70) so the next occurrence self-attributes.

Also corrects project lore: the A.1-era rule that all dat reads must stay on
one thread does not hold for the 2.1.7 read path, and both investigation
subagents'' claimed ReadBlock instance-field race does not exist in the
shipped source — verify agent claims against source before acting on them.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-09 21:29:06 +02:00

131 lines
8.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Dat-reader thread-safety investigation — read path EXONERATED, teardown bug FIXED, white-walls tripwired
**Date:** 2026-06-09. **Branch:** `claude/thirsty-goldberg-51bb9b`.
**Trigger:** white cottage walls reproduced on a CLEAN (probe-free) launch, refuting the prior
session's "heavy probes starve the dat-reader" framing as the whole story; user picked the
dat-reader stability fix as the session direction.
> Read this before touching any dat-threading code. It REVERSES the long-standing
> "DatCollection is NOT thread-safe (for reads)" lore for the current package version, and it
> records exactly what was verified, what was refuted, and what remains open.
---
## 0. TL;DR
1. **Concurrent READS on Chorizite.DatReaderWriter 2.1.7 are SAFE** — verified two independent
ways: (a) a line-level audit of the actual `release/2.1.7` source (cross-checked against the
decompiled NuGet DLL); (b) an in-tree hammer test
(`tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs`) that replays our
four-population access pattern — **~1.1 M concurrent reads, 8 threads, shuffled orders, raw
block path AND full typed unpack path — zero anomalies, byte-identical to single-threaded
golden fingerprints.**
2. **The verified crash bug is teardown, not reads:** `ObjectMeshManager.Dispose` never
quiesced its `Task.Run(ProcessQueueAsync)` decode workers and `LandblockStreamer.Dispose`
gave up its join after 2 s — then `GameWindow.OnClosing` disposed the `DatCollection`, which
unmaps the dats' memory-mapped views (`MemoryMappedBlockAllocator.DestroyMappedFile`
`_viewPtr = null`). A worker still inside `ReadBlock` then dereferences the dead view
pointer → **uncatchable AccessViolationException with `ReadBlock` on the stack** — the
recorded crash signature, firing on close/relaunch during decode storms. **FIXED this
session** (quiesce-before-dispose in `ObjectMeshManager.Dispose` + entry/loop guards;
15 s loud join in `LandblockStreamer.Dispose`).
3. **The white-walls mechanism is STILL OPEN** (issue #105) — but it is now narrowed and
instrumented: every silent dat-failure exit on the walls-relevant paths has a tripwire log
line (`[dat-miss]` / `[tex-miss]` / `[tex-skip]` / `[cell-miss]`). They fire ONLY on anomaly
(zero cost when healthy) and stay in the build; the next organic occurrence self-attributes
in the launch log.
4. **Two false trails refuted:** both investigation subagents confidently described a
`ReadBlock` instance-field cursor race — the fields don't exist in 2.1.7 (all cursor state
is method-local over a stable read-only view). One also cited a vendored
`references/DatReaderWriter` copy that doesn't exist. Verify agent claims against source.
---
## 1. The threading topology (verified, file:line)
ONE `DatCollection` (`GameWindow.cs:1166`, `DatAccessType.Read`) → four `DatDatabase`s
(Portal/Cell/HighRes/Local), each with its own `MemoryMappedBlockAllocator`. Reader
populations on the SAME instances:
| Population | Entry | Serialization |
|---|---|---|
| Render thread | `CellLoader`-side hydration, `TextureCache.DecodeFromDats`, `TickAnimations`/`MotionResolver`, `WorldPicker` | `_datLock` at 3 sites only (`GameWindow.cs:2285,5603,8948,11224`); the rest **unlocked** |
| Streamer worker | `BuildLandblockForStreaming` (`GameWindow.cs:5142`) | the whole load under `GameWindow._datLock` |
| Decode pool (≤4) | `ObjectMeshManager.ProcessQueueAsync``_dats.*` via `DatCollectionAdapter` | `DatDatabaseWrapper._lock` (per-db, `DatCollectionAdapter.cs:120`) — **a different lock object than `_datLock`** |
| Decode pool, raw | `DatCollectionAdapter.ResolveId` (`DatCollectionAdapter.cs:74-96`) | **none** (raw `Tree.TryGetFile`) |
Two disjoint lock domains + unlocked paths ⇒ concurrent same-instance access is constant.
**Post-investigation verdict: that concurrency is fine for reads** (see §2) — the locks are
not protecting against anything real on the read path. Do NOT "unify the lock domains" as a
fix for a read race that doesn't exist; the locks' remaining value is incidental (they
serialize some of our own cache fills).
## 2. Why the read path is safe (audit summary, v2.1.7 source)
- `MemoryMappedBlockAllocator.ReadBlock` (`Lib/IO/BlockAllocators/MemoryMappedBlockAllocator.cs:137-160`):
chain cursor + buffer offset are **locals**; copies are destination-bounded; `_viewPtr` is
written only in the ctor and `Expand` (write-path only — never runs for `DatAccessType.Read`).
- BTree: `DatBTreeReaderWriter` node cache is an internally-locked LRU (`:62-126`);
traversal state is local; nodes are immutable post-unpack.
- `DatDatabase.TryGet<T>`: rents from the thread-safe `ArrayPool`, fills via `ReadBlock`,
unpacks through a **fresh** `DatBinReader` (per-instance `_offset`), caches into a
`ConcurrentDictionary`. `ObjectFactory` = `ConcurrentDictionary.GetOrAdd`;
`DBObjAttributeCache` = benign last-wins lazy init of immutable dictionaries.
- Hammer test confirms empirically (see §0.1). Known minor wart (not our symptom):
`_fileCache[fileId] = value` is published one statement before `value.Id = fileId`
(`DatDatabase.cs:389-393`) — a racing reader can see `Id == 0`. Upstream PR candidate.
**Correction to project lore:** `feedback_phase_a1_hotfix_saga.md`'s "DatBinReader holds a
buffer position field per database; concurrent `Get<T>` corrupts" does NOT describe 2.1.7 —
`DatBinReader` is created per call. The A.1-era crash predates Phase O's single-reader
unification and was likely a different stack (or version). The A.5 spec's "single-threaded by
construction" claim was composition-false the day it shipped (N.4's pool decodes already
existed) — but per this investigation it also didn't matter for correctness.
## 3. The teardown fix (this session)
- `ObjectMeshManager.Dispose` (`src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs`): sets
`IsDisposed` under the queue lock, cancels + drains `_pendingRequests`, then waits (≤10 s,
5 ms poll) for `_activeWorkers == 0` before returning; logs an error if workers outlive the
wait. `ProcessQueueAsync` re-checks `IsDisposed` before every dequeue; both `Prepare*Async`
entries and enqueue blocks early-out when disposed.
- `LandblockStreamer.Dispose` (`src/AcDream.App/Streaming/LandblockStreamer.cs`): join
extended 2 s → 15 s with a loud `[streamer]` error line on timeout (cancellation is honored
between jobs, so the wait is bounded by one landblock load).
- `GameWindow.OnClosing` order was already correct (streamer → mesh adapter → … → `_dats`
last); with both Disposes now actually quiescing, the order is meaningful.
- Audio was checked and is NOT a dat-reading population (no `_dats` access under
`src/AcDream.App/Audio/`).
Smoke: 3× close-mid-decode-storm launches, clean exits, no crash signatures (see commit).
## 4. The white-walls residual (#105) + tripwires
Observed signature (2× user-confirmed): cottage wall surfaces render as background/clear while
the cell's statics (paintings, furniture, windows) draw; once broken, broken for the session;
strictly intermittent (one clean launch had it, a heavier probe launch didn't); **zero error
output ever** — today's white-wall launch log was 35 lines, no `[wb-error]`.
All silent exits found and tripwired (they print ONLY on anomaly; keep them):
| Line | Where | Meaning when it fires |
|---|---|---|
| `[dat-miss]` | `DatDatabaseWrapper.TryGet` (`DatCollectionAdapter.cs`) | a read failed for an id whose BTree entry EXISTS — the smoking gun for any real dat-layer fault |
| `[tex-skip]` | `ObjectMeshManager` GfxObj + CellStruct texture chains (5 exits) | a polygon batch (incl. WALL batches) was silently dropped on a dat miss |
| `[tex-miss]` | `TextureCache.DecodeFromDats` (3 exits) | render-thread texture decode fell back to magenta |
| `[cell-miss]` | `GameWindow` interior hydration (EnvCell / Environment null) | a cell's walls were silently never registered — the exact white-walls geometry signature |
Discriminator for the next occurrence: **magenta walls** → TextureCache path; **see-through
walls + `[tex-skip]`/`[dat-miss]`** → mesh-build path; **see-through + `[cell-miss]`**
hydration; **white/missing with NO tripwire output** → the failure is GL-side (staged upload /
bindless residency under load) — instrument `WbMeshAdapter.Tick`'s upload drain next.
## 5. Apparatus
- `tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs` — the hammer
(8 threads × 25 shuffled passes; raw `TryGetFileBytes` + typed `TryGet` with
`FileCachingStrategy.Never`; golden FNV-1a fingerprints). Skips cleanly when dats absent.
KEEP as the regression guard for any future dat-reader version bump.
- Package source for reference: `git clone --depth 1 --branch release/2.1.7
https://github.com/Chorizite/DatReaderWriter` (tag verified == shipped NuGet DLL).
- The tripwire lines (§4) — grep any launch log for `dat-miss|tex-miss|tex-skip|cell-miss`.