docs: dat-reader investigation handoff + file #105 (white walls, tripwired)
Records the 2026-06-09 dat-reader thread-safety investigation: concurrent READS on Chorizite.DatReaderWriter 2.1.7 exonerated (source audit + 1.1M-read hammer,b3920d8); the real crash was dispose-during-read at teardown (fixed,8fadf77); the white-walls mechanism remains open as #105 with every silent dat-miss exit tripwired (7433b70) so the next occurrence self-attributes. Also corrects project lore: the A.1-era rule that all dat reads must stay on one thread does not hold for the 2.1.7 read path, and both investigation subagents'' claimed ReadBlock instance-field race does not exist in the shipped source — verify agent claims against source before acting on them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
parent
b3920d83f6
commit
8dc707d43b
2 changed files with 172 additions and 0 deletions
|
|
@ -0,0 +1,131 @@
|
|||
# Dat-reader thread-safety investigation — read path EXONERATED, teardown bug FIXED, white-walls tripwired
|
||||
|
||||
**Date:** 2026-06-09. **Branch:** `claude/thirsty-goldberg-51bb9b`.
|
||||
**Trigger:** white cottage walls reproduced on a CLEAN (probe-free) launch, refuting the prior
|
||||
session's "heavy probes starve the dat-reader" framing as the whole story; user picked the
|
||||
dat-reader stability fix as the session direction.
|
||||
|
||||
> Read this before touching any dat-threading code. It REVERSES the long-standing
|
||||
> "DatCollection is NOT thread-safe (for reads)" lore for the current package version, and it
|
||||
> records exactly what was verified, what was refuted, and what remains open.
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
1. **Concurrent READS on Chorizite.DatReaderWriter 2.1.7 are SAFE** — verified two independent
|
||||
ways: (a) a line-level audit of the actual `release/2.1.7` source (cross-checked against the
|
||||
decompiled NuGet DLL); (b) an in-tree hammer test
|
||||
(`tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs`) that replays our
|
||||
four-population access pattern — **~1.1 M concurrent reads, 8 threads, shuffled orders, raw
|
||||
block path AND full typed unpack path — zero anomalies, byte-identical to single-threaded
|
||||
golden fingerprints.**
|
||||
2. **The verified crash bug is teardown, not reads:** `ObjectMeshManager.Dispose` never
|
||||
quiesced its `Task.Run(ProcessQueueAsync)` decode workers and `LandblockStreamer.Dispose`
|
||||
gave up its join after 2 s — then `GameWindow.OnClosing` disposed the `DatCollection`, which
|
||||
unmaps the dats' memory-mapped views (`MemoryMappedBlockAllocator.DestroyMappedFile` →
|
||||
`_viewPtr = null`). A worker still inside `ReadBlock` then dereferences the dead view
|
||||
pointer → **uncatchable AccessViolationException with `ReadBlock` on the stack** — the
|
||||
recorded crash signature, firing on close/relaunch during decode storms. **FIXED this
|
||||
session** (quiesce-before-dispose in `ObjectMeshManager.Dispose` + entry/loop guards;
|
||||
15 s loud join in `LandblockStreamer.Dispose`).
|
||||
3. **The white-walls mechanism is STILL OPEN** (issue #105) — but it is now narrowed and
|
||||
instrumented: every silent dat-failure exit on the walls-relevant paths has a tripwire log
|
||||
line (`[dat-miss]` / `[tex-miss]` / `[tex-skip]` / `[cell-miss]`). They fire ONLY on anomaly
|
||||
(zero cost when healthy) and stay in the build; the next organic occurrence self-attributes
|
||||
in the launch log.
|
||||
4. **Two false trails refuted:** both investigation subagents confidently described a
|
||||
`ReadBlock` instance-field cursor race — the fields don't exist in 2.1.7 (all cursor state
|
||||
is method-local over a stable read-only view). One also cited a vendored
|
||||
`references/DatReaderWriter` copy that doesn't exist. Verify agent claims against source.
|
||||
|
||||
---
|
||||
|
||||
## 1. The threading topology (verified, file:line)
|
||||
|
||||
ONE `DatCollection` (`GameWindow.cs:1166`, `DatAccessType.Read`) → four `DatDatabase`s
|
||||
(Portal/Cell/HighRes/Local), each with its own `MemoryMappedBlockAllocator`. Reader
|
||||
populations on the SAME instances:
|
||||
|
||||
| Population | Entry | Serialization |
|
||||
|---|---|---|
|
||||
| Render thread | `CellLoader`-side hydration, `TextureCache.DecodeFromDats`, `TickAnimations`/`MotionResolver`, `WorldPicker` | `_datLock` at 3 sites only (`GameWindow.cs:2285,5603,8948,11224`); the rest **unlocked** |
|
||||
| Streamer worker | `BuildLandblockForStreaming` (`GameWindow.cs:5142`) | the whole load under `GameWindow._datLock` |
|
||||
| Decode pool (≤4) | `ObjectMeshManager.ProcessQueueAsync` → `_dats.*` via `DatCollectionAdapter` | `DatDatabaseWrapper._lock` (per-db, `DatCollectionAdapter.cs:120`) — **a different lock object than `_datLock`** |
|
||||
| Decode pool, raw | `DatCollectionAdapter.ResolveId` (`DatCollectionAdapter.cs:74-96`) | **none** (raw `Tree.TryGetFile`) |
|
||||
|
||||
Two disjoint lock domains + unlocked paths ⇒ concurrent same-instance access is constant.
|
||||
**Post-investigation verdict: that concurrency is fine for reads** (see §2) — the locks are
|
||||
not protecting against anything real on the read path. Do NOT "unify the lock domains" as a
|
||||
fix for a read race that doesn't exist; the locks' remaining value is incidental (they
|
||||
serialize some of our own cache fills).
|
||||
|
||||
## 2. Why the read path is safe (audit summary, v2.1.7 source)
|
||||
|
||||
- `MemoryMappedBlockAllocator.ReadBlock` (`Lib/IO/BlockAllocators/MemoryMappedBlockAllocator.cs:137-160`):
|
||||
chain cursor + buffer offset are **locals**; copies are destination-bounded; `_viewPtr` is
|
||||
written only in the ctor and `Expand` (write-path only — never runs for `DatAccessType.Read`).
|
||||
- BTree: `DatBTreeReaderWriter` node cache is an internally-locked LRU (`:62-126`);
|
||||
traversal state is local; nodes are immutable post-unpack.
|
||||
- `DatDatabase.TryGet<T>`: rents from the thread-safe `ArrayPool`, fills via `ReadBlock`,
|
||||
unpacks through a **fresh** `DatBinReader` (per-instance `_offset`), caches into a
|
||||
`ConcurrentDictionary`. `ObjectFactory` = `ConcurrentDictionary.GetOrAdd`;
|
||||
`DBObjAttributeCache` = benign last-wins lazy init of immutable dictionaries.
|
||||
- Hammer test confirms empirically (see §0.1). Known minor wart (not our symptom):
|
||||
`_fileCache[fileId] = value` is published one statement before `value.Id = fileId`
|
||||
(`DatDatabase.cs:389-393`) — a racing reader can see `Id == 0`. Upstream PR candidate.
|
||||
|
||||
**Correction to project lore:** `feedback_phase_a1_hotfix_saga.md`'s "DatBinReader holds a
|
||||
buffer position field per database; concurrent `Get<T>` corrupts" does NOT describe 2.1.7 —
|
||||
`DatBinReader` is created per call. The A.1-era crash predates Phase O's single-reader
|
||||
unification and was likely a different stack (or version). The A.5 spec's "single-threaded by
|
||||
construction" claim was composition-false the day it shipped (N.4's pool decodes already
|
||||
existed) — but per this investigation it also didn't matter for correctness.
|
||||
|
||||
## 3. The teardown fix (this session)
|
||||
|
||||
- `ObjectMeshManager.Dispose` (`src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs`): sets
|
||||
`IsDisposed` under the queue lock, cancels + drains `_pendingRequests`, then waits (≤10 s,
|
||||
5 ms poll) for `_activeWorkers == 0` before returning; logs an error if workers outlive the
|
||||
wait. `ProcessQueueAsync` re-checks `IsDisposed` before every dequeue; both `Prepare*Async`
|
||||
entries and enqueue blocks early-out when disposed.
|
||||
- `LandblockStreamer.Dispose` (`src/AcDream.App/Streaming/LandblockStreamer.cs`): join
|
||||
extended 2 s → 15 s with a loud `[streamer]` error line on timeout (cancellation is honored
|
||||
between jobs, so the wait is bounded by one landblock load).
|
||||
- `GameWindow.OnClosing` order was already correct (streamer → mesh adapter → … → `_dats`
|
||||
last); with both Disposes now actually quiescing, the order is meaningful.
|
||||
- Audio was checked and is NOT a dat-reading population (no `_dats` access under
|
||||
`src/AcDream.App/Audio/`).
|
||||
|
||||
Smoke: 3× close-mid-decode-storm launches, clean exits, no crash signatures (see commit).
|
||||
|
||||
## 4. The white-walls residual (#105) + tripwires
|
||||
|
||||
Observed signature (2× user-confirmed): cottage wall surfaces render as background/clear while
|
||||
the cell's statics (paintings, furniture, windows) draw; once broken, broken for the session;
|
||||
strictly intermittent (one clean launch had it, a heavier probe launch didn't); **zero error
|
||||
output ever** — today's white-wall launch log was 35 lines, no `[wb-error]`.
|
||||
|
||||
All silent exits found and tripwired (they print ONLY on anomaly; keep them):
|
||||
|
||||
| Line | Where | Meaning when it fires |
|
||||
|---|---|---|
|
||||
| `[dat-miss]` | `DatDatabaseWrapper.TryGet` (`DatCollectionAdapter.cs`) | a read failed for an id whose BTree entry EXISTS — the smoking gun for any real dat-layer fault |
|
||||
| `[tex-skip]` | `ObjectMeshManager` GfxObj + CellStruct texture chains (5 exits) | a polygon batch (incl. WALL batches) was silently dropped on a dat miss |
|
||||
| `[tex-miss]` | `TextureCache.DecodeFromDats` (3 exits) | render-thread texture decode fell back to magenta |
|
||||
| `[cell-miss]` | `GameWindow` interior hydration (EnvCell / Environment null) | a cell's walls were silently never registered — the exact white-walls geometry signature |
|
||||
|
||||
Discriminator for the next occurrence: **magenta walls** → TextureCache path; **see-through
|
||||
walls + `[tex-skip]`/`[dat-miss]`** → mesh-build path; **see-through + `[cell-miss]`** →
|
||||
hydration; **white/missing with NO tripwire output** → the failure is GL-side (staged upload /
|
||||
bindless residency under load) — instrument `WbMeshAdapter.Tick`'s upload drain next.
|
||||
|
||||
## 5. Apparatus
|
||||
|
||||
- `tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs` — the hammer
|
||||
(8 threads × 25 shuffled passes; raw `TryGetFileBytes` + typed `TryGet` with
|
||||
`FileCachingStrategy.Never`; golden FNV-1a fingerprints). Skips cleanly when dats absent.
|
||||
KEEP as the regression guard for any future dat-reader version bump.
|
||||
- Package source for reference: `git clone --depth 1 --branch release/2.1.7
|
||||
https://github.com/Chorizite/DatReaderWriter` (tag verified == shipped NuGet DLL).
|
||||
- The tripwire lines (§4) — grep any launch log for `dat-miss|tex-miss|tex-skip|cell-miss`.
|
||||
Loading…
Add table
Add a link
Reference in a new issue