docs: dat-reader investigation handoff + file #105 (white walls, tripwired)
Records the 2026-06-09 dat-reader thread-safety investigation: concurrent READS on Chorizite.DatReaderWriter 2.1.7 exonerated (source audit + 1.1M-read hammer,b3920d8); the real crash was dispose-during-read at teardown (fixed,8fadf77); the white-walls mechanism remains open as #105 with every silent dat-miss exit tripwired (7433b70) so the next occurrence self-attributes. Also corrects project lore: the A.1-era rule that all dat reads must stay on one thread does not hold for the 2.1.7 read path, and both investigation subagents'' claimed ReadBlock instance-field race does not exist in the shipped source — verify agent claims against source before acting on them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
parent
b3920d83f6
commit
8dc707d43b
2 changed files with 172 additions and 0 deletions
|
|
@ -44,6 +44,47 @@ Copy this block when adding a new issue:
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## #105 — Intermittent silent dat-load failure: white/missing cottage walls until relaunch
|
||||||
|
|
||||||
|
**Status:** OPEN
|
||||||
|
**Severity:** HIGH
|
||||||
|
**Filed:** 2026-06-09
|
||||||
|
**Component:** render, dat, streaming
|
||||||
|
|
||||||
|
**Description:** Intermittently (twice user-confirmed: once under heavy probe logging
|
||||||
|
2026-06-08, once on a completely clean launch 2026-06-09), the Holtburg cottage wall surfaces
|
||||||
|
render as background/clear color while the cells' static objects (paintings, furniture,
|
||||||
|
windows) draw normally. Once broken, broken for the whole session — the failed result is
|
||||||
|
cached (mesh batches build once, hydration runs once). The failure has never produced a single
|
||||||
|
log line: every dat-read failure on the walls-relevant paths exits silently.
|
||||||
|
|
||||||
|
**Root cause / status:** NOT YET ATTRIBUTED — but heavily narrowed (full evidence:
|
||||||
|
`docs/research/2026-06-09-dat-reader-thread-safety-investigation.md`). Refuted: the dat
|
||||||
|
library's concurrent read path (Chorizite.DatReaderWriter 2.1.7) is exonerated by line-level
|
||||||
|
source audit + the in-tree hammer test (`DatConcurrencyStressTests`, ~1.1M concurrent reads,
|
||||||
|
zero anomalies). Fixed nearby: the teardown dispose-during-read AccessViolation (`8fadf77`).
|
||||||
|
Remaining candidates: a lifecycle/ordering race in our hydration/streaming pipeline, or a
|
||||||
|
GL-side staged-upload / bindless-residency failure under load. Tripwire log lines
|
||||||
|
(`[dat-miss]` / `[tex-miss]` / `[tex-skip]` / `[cell-miss]`, commit `7433b70`) now cover every
|
||||||
|
silent exit; the next occurrence self-attributes in the launch log. Discriminator: magenta
|
||||||
|
walls = TextureCache decode miss; see-through + `[tex-skip]`/`[dat-miss]` = mesh-build miss;
|
||||||
|
see-through + `[cell-miss]` = hydration miss; broken with NO tripwire output = GL-side —
|
||||||
|
instrument `WbMeshAdapter.Tick`'s upload drain next.
|
||||||
|
|
||||||
|
**Files:** `src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs` (texture chains),
|
||||||
|
`src/AcDream.App/Rendering/Wb/DatCollectionAdapter.cs` (`DatDatabaseWrapper.TryGet`),
|
||||||
|
`src/AcDream.App/Rendering/TextureCache.cs` (`DecodeFromDats`),
|
||||||
|
`src/AcDream.App/Rendering/GameWindow.cs` (`BuildInteriorEntitiesForStreaming` hydration).
|
||||||
|
|
||||||
|
**Research:** `docs/research/2026-06-09-dat-reader-thread-safety-investigation.md`.
|
||||||
|
|
||||||
|
**Acceptance:** the next white-wall occurrence is attributed by a tripwire line (or by their
|
||||||
|
collective silence, pointing GL-side); the attributed root cause is fixed; walls survive 10+
|
||||||
|
launches including heavy-load ones. Tripwires can then be stripped (or kept as permanent
|
||||||
|
anomaly logging — decide at close).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## #104 — Scene VFX particles not clipped to the PView visible cell set
|
## #104 — Scene VFX particles not clipped to the PView visible cell set
|
||||||
|
|
||||||
**Status:** OPEN
|
**Status:** OPEN
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,131 @@
|
||||||
|
# Dat-reader thread-safety investigation — read path EXONERATED, teardown bug FIXED, white-walls tripwired
|
||||||
|
|
||||||
|
**Date:** 2026-06-09. **Branch:** `claude/thirsty-goldberg-51bb9b`.
|
||||||
|
**Trigger:** white cottage walls reproduced on a CLEAN (probe-free) launch, refuting the prior
|
||||||
|
session's "heavy probes starve the dat-reader" framing as the whole story; user picked the
|
||||||
|
dat-reader stability fix as the session direction.
|
||||||
|
|
||||||
|
> Read this before touching any dat-threading code. It REVERSES the long-standing
|
||||||
|
> "DatCollection is NOT thread-safe (for reads)" lore for the current package version, and it
|
||||||
|
> records exactly what was verified, what was refuted, and what remains open.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 0. TL;DR
|
||||||
|
|
||||||
|
1. **Concurrent READS on Chorizite.DatReaderWriter 2.1.7 are SAFE** — verified two independent
|
||||||
|
ways: (a) a line-level audit of the actual `release/2.1.7` source (cross-checked against the
|
||||||
|
decompiled NuGet DLL); (b) an in-tree hammer test
|
||||||
|
(`tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs`) that replays our
|
||||||
|
four-population access pattern — **~1.1 M concurrent reads, 8 threads, shuffled orders, raw
|
||||||
|
block path AND full typed unpack path — zero anomalies, byte-identical to single-threaded
|
||||||
|
golden fingerprints.**
|
||||||
|
2. **The verified crash bug is teardown, not reads:** `ObjectMeshManager.Dispose` never
|
||||||
|
quiesced its `Task.Run(ProcessQueueAsync)` decode workers and `LandblockStreamer.Dispose`
|
||||||
|
gave up its join after 2 s — then `GameWindow.OnClosing` disposed the `DatCollection`, which
|
||||||
|
unmaps the dats' memory-mapped views (`MemoryMappedBlockAllocator.DestroyMappedFile` →
|
||||||
|
`_viewPtr = null`). A worker still inside `ReadBlock` then dereferences the dead view
|
||||||
|
pointer → **uncatchable AccessViolationException with `ReadBlock` on the stack** — the
|
||||||
|
recorded crash signature, firing on close/relaunch during decode storms. **FIXED this
|
||||||
|
session** (quiesce-before-dispose in `ObjectMeshManager.Dispose` + entry/loop guards;
|
||||||
|
15 s loud join in `LandblockStreamer.Dispose`).
|
||||||
|
3. **The white-walls mechanism is STILL OPEN** (issue #105) — but it is now narrowed and
|
||||||
|
instrumented: every silent dat-failure exit on the walls-relevant paths has a tripwire log
|
||||||
|
line (`[dat-miss]` / `[tex-miss]` / `[tex-skip]` / `[cell-miss]`). They fire ONLY on anomaly
|
||||||
|
(zero cost when healthy) and stay in the build; the next organic occurrence self-attributes
|
||||||
|
in the launch log.
|
||||||
|
4. **Two false trails refuted:** both investigation subagents confidently described a
|
||||||
|
`ReadBlock` instance-field cursor race — the fields don't exist in 2.1.7 (all cursor state
|
||||||
|
is method-local over a stable read-only view). One also cited a vendored
|
||||||
|
`references/DatReaderWriter` copy that doesn't exist. Verify agent claims against source.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. The threading topology (verified, file:line)
|
||||||
|
|
||||||
|
ONE `DatCollection` (`GameWindow.cs:1166`, `DatAccessType.Read`) → four `DatDatabase`s
|
||||||
|
(Portal/Cell/HighRes/Local), each with its own `MemoryMappedBlockAllocator`. Reader
|
||||||
|
populations on the SAME instances:
|
||||||
|
|
||||||
|
| Population | Entry | Serialization |
|
||||||
|
|---|---|---|
|
||||||
|
| Render thread | `CellLoader`-side hydration, `TextureCache.DecodeFromDats`, `TickAnimations`/`MotionResolver`, `WorldPicker` | `_datLock` at 3 sites only (`GameWindow.cs:2285,5603,8948,11224`); the rest **unlocked** |
|
||||||
|
| Streamer worker | `BuildLandblockForStreaming` (`GameWindow.cs:5142`) | the whole load under `GameWindow._datLock` |
|
||||||
|
| Decode pool (≤4) | `ObjectMeshManager.ProcessQueueAsync` → `_dats.*` via `DatCollectionAdapter` | `DatDatabaseWrapper._lock` (per-db, `DatCollectionAdapter.cs:120`) — **a different lock object than `_datLock`** |
|
||||||
|
| Decode pool, raw | `DatCollectionAdapter.ResolveId` (`DatCollectionAdapter.cs:74-96`) | **none** (raw `Tree.TryGetFile`) |
|
||||||
|
|
||||||
|
Two disjoint lock domains + unlocked paths ⇒ concurrent same-instance access is constant.
|
||||||
|
**Post-investigation verdict: that concurrency is fine for reads** (see §2) — the locks are
|
||||||
|
not protecting against anything real on the read path. Do NOT "unify the lock domains" as a
|
||||||
|
fix for a read race that doesn't exist; the locks' remaining value is incidental (they
|
||||||
|
serialize some of our own cache fills).
|
||||||
|
|
||||||
|
## 2. Why the read path is safe (audit summary, v2.1.7 source)
|
||||||
|
|
||||||
|
- `MemoryMappedBlockAllocator.ReadBlock` (`Lib/IO/BlockAllocators/MemoryMappedBlockAllocator.cs:137-160`):
|
||||||
|
chain cursor + buffer offset are **locals**; copies are destination-bounded; `_viewPtr` is
|
||||||
|
written only in the ctor and `Expand` (write-path only — never runs for `DatAccessType.Read`).
|
||||||
|
- BTree: `DatBTreeReaderWriter` node cache is an internally-locked LRU (`:62-126`);
|
||||||
|
traversal state is local; nodes are immutable post-unpack.
|
||||||
|
- `DatDatabase.TryGet<T>`: rents from the thread-safe `ArrayPool`, fills via `ReadBlock`,
|
||||||
|
unpacks through a **fresh** `DatBinReader` (per-instance `_offset`), caches into a
|
||||||
|
`ConcurrentDictionary`. `ObjectFactory` = `ConcurrentDictionary.GetOrAdd`;
|
||||||
|
`DBObjAttributeCache` = benign last-wins lazy init of immutable dictionaries.
|
||||||
|
- Hammer test confirms empirically (see §0.1). Known minor wart (not our symptom):
|
||||||
|
`_fileCache[fileId] = value` is published one statement before `value.Id = fileId`
|
||||||
|
(`DatDatabase.cs:389-393`) — a racing reader can see `Id == 0`. Upstream PR candidate.
|
||||||
|
|
||||||
|
**Correction to project lore:** `feedback_phase_a1_hotfix_saga.md`'s "DatBinReader holds a
|
||||||
|
buffer position field per database; concurrent `Get<T>` corrupts" does NOT describe 2.1.7 —
|
||||||
|
`DatBinReader` is created per call. The A.1-era crash predates Phase O's single-reader
|
||||||
|
unification and was likely a different stack (or version). The A.5 spec's "single-threaded by
|
||||||
|
construction" claim was composition-false the day it shipped (N.4's pool decodes already
|
||||||
|
existed) — but per this investigation it also didn't matter for correctness.
|
||||||
|
|
||||||
|
## 3. The teardown fix (this session)
|
||||||
|
|
||||||
|
- `ObjectMeshManager.Dispose` (`src/AcDream.App/Rendering/Wb/ObjectMeshManager.cs`): sets
|
||||||
|
`IsDisposed` under the queue lock, cancels + drains `_pendingRequests`, then waits (≤10 s,
|
||||||
|
5 ms poll) for `_activeWorkers == 0` before returning; logs an error if workers outlive the
|
||||||
|
wait. `ProcessQueueAsync` re-checks `IsDisposed` before every dequeue; both `Prepare*Async`
|
||||||
|
entries and enqueue blocks early-out when disposed.
|
||||||
|
- `LandblockStreamer.Dispose` (`src/AcDream.App/Streaming/LandblockStreamer.cs`): join
|
||||||
|
extended 2 s → 15 s with a loud `[streamer]` error line on timeout (cancellation is honored
|
||||||
|
between jobs, so the wait is bounded by one landblock load).
|
||||||
|
- `GameWindow.OnClosing` order was already correct (streamer → mesh adapter → … → `_dats`
|
||||||
|
last); with both Disposes now actually quiescing, the order is meaningful.
|
||||||
|
- Audio was checked and is NOT a dat-reading population (no `_dats` access under
|
||||||
|
`src/AcDream.App/Audio/`).
|
||||||
|
|
||||||
|
Smoke: 3× close-mid-decode-storm launches, clean exits, no crash signatures (see commit).
|
||||||
|
|
||||||
|
## 4. The white-walls residual (#105) + tripwires
|
||||||
|
|
||||||
|
Observed signature (2× user-confirmed): cottage wall surfaces render as background/clear while
|
||||||
|
the cell's statics (paintings, furniture, windows) draw; once broken, broken for the session;
|
||||||
|
strictly intermittent (one clean launch had it, a heavier probe launch didn't); **zero error
|
||||||
|
output ever** — today's white-wall launch log was 35 lines, no `[wb-error]`.
|
||||||
|
|
||||||
|
All silent exits found and tripwired (they print ONLY on anomaly; keep them):
|
||||||
|
|
||||||
|
| Line | Where | Meaning when it fires |
|
||||||
|
|---|---|---|
|
||||||
|
| `[dat-miss]` | `DatDatabaseWrapper.TryGet` (`DatCollectionAdapter.cs`) | a read failed for an id whose BTree entry EXISTS — the smoking gun for any real dat-layer fault |
|
||||||
|
| `[tex-skip]` | `ObjectMeshManager` GfxObj + CellStruct texture chains (5 exits) | a polygon batch (incl. WALL batches) was silently dropped on a dat miss |
|
||||||
|
| `[tex-miss]` | `TextureCache.DecodeFromDats` (3 exits) | render-thread texture decode fell back to magenta |
|
||||||
|
| `[cell-miss]` | `GameWindow` interior hydration (EnvCell / Environment null) | a cell's walls were silently never registered — the exact white-walls geometry signature |
|
||||||
|
|
||||||
|
Discriminator for the next occurrence: **magenta walls** → TextureCache path; **see-through
|
||||||
|
walls + `[tex-skip]`/`[dat-miss]`** → mesh-build path; **see-through + `[cell-miss]`** →
|
||||||
|
hydration; **white/missing with NO tripwire output** → the failure is GL-side (staged upload /
|
||||||
|
bindless residency under load) — instrument `WbMeshAdapter.Tick`'s upload drain next.
|
||||||
|
|
||||||
|
## 5. Apparatus
|
||||||
|
|
||||||
|
- `tests/AcDream.Core.Tests/Conformance/DatConcurrencyStressTests.cs` — the hammer
|
||||||
|
(8 threads × 25 shuffled passes; raw `TryGetFileBytes` + typed `TryGet` with
|
||||||
|
`FileCachingStrategy.Never`; golden FNV-1a fingerprints). Skips cleanly when dats absent.
|
||||||
|
KEEP as the regression guard for any future dat-reader version bump.
|
||||||
|
- Package source for reference: `git clone --depth 1 --branch release/2.1.7
|
||||||
|
https://github.com/Chorizite/DatReaderWriter` (tag verified == shipped NuGet DLL).
|
||||||
|
- The tripwire lines (§4) — grep any launch log for `dat-miss|tex-miss|tex-skip|cell-miss`.
|
||||||
Loading…
Add table
Add a link
Reference in a new issue