diff --git a/docs/research/2026-06-10-105-110-white-textures-nearplane-handoff.md b/docs/research/2026-06-10-105-110-white-textures-nearplane-handoff.md
new file mode 100644
index 00000000..2f146a6c
--- /dev/null
+++ b/docs/research/2026-06-10-105-110-white-textures-nearplane-handoff.md
@@ -0,0 +1,210 @@
+# HANDOFF — #105 intermittent missing indoor textures × #110 near-plane correlation
+
+**Date:** 2026-06-10 (late). **Branch:** `claude/thirsty-goldberg-51bb9b`, HEAD `8bd3492`.
+**Status:** #105 struck twice today with the dat-side tripwires SILENT (= GL-side); the
+retail near-plane fix (`137b4f2`, 0.1 m) was bisect-implicated in those two runs and
+REVERTED (`8bd3492`) pending this investigation. **One investigation, three payoffs:**
+attribute + kill the chronic #105, settle whether the near plane is innocent (#110), and
+re-land `znear=0.1` which closes the §4 corner see-through-wall.
+
+Read this top-to-bottom before touching code. The render digest
+(`claude-memory/project_render_pipeline_digest.md`) carries the distilled state + the
+DO-NOT-RETRY table — this doc is the deep-dive for THIS investigation.
+
+---
+
+## 0. TL;DR
+
+1. **#105 (chronic since ~2026-06-08):** indoor wall/cell textures intermittently render
+   wrong ("missing" — exact appearance white-vs-invisible NOT yet confirmed by the user;
+   question outstanding). Today it struck on 2 consecutive launches. The four dat-side
+   tripwires (`[dat-miss]`/`[tex-miss]`/`[tex-skip]`/`[cell-miss]`, commit `7433b70`)
+   produced ZERO output on both bad runs → per the #105 protocol the failure is
+   **GL-side**: staged mesh/texture upload, bindless handle creation/residency, or the
+   per-batch handle plumbing — NOT a failed dat read.
+2. **#110:** both bad runs were on the near-plane build (`znear=0.1`, `137b4f2`); the
+   very next run with `znear=1.0` (working-tree bisect) rendered clean. 2-bad-on-0.1 /
+   1-good-on-1.0 is *suggestive, not conclusive* — #105 is intermittent and could have
+   coincided.
+   No mechanism is known by which znear touches texturing (see §4 for the
+   honest candidate list). The change is reverted; all four cameras carry a ⚠️ comment
+   pointing here.
+3. **The leverage:** if #105 is independent (most likely), fixing it exonerates the near
+   plane → re-land `0.1` → §4 corner fix complete. If the near plane genuinely raises
+   #105's trigger probability (e.g. more close-up geometry → more upload pressure), the
+   fix is still in the #105 path — the near plane just becomes the best repro lever.
+
+## 1. Today's run matrix (the evidence)
+
+| Run (log, worktree root, untracked) | Build (near) | Indoor textures | Notes |
+|---|---|---|---|
+| `flood-fix-gate.log` | `dac8f6a` (1.0) | OK | user gated the flood fix: transitions clean |
+| `flood-fix-gate2.log` | `dac8f6a` (1.0) | OK (no complaint) | #107 spawn wedge run |
+| `flood-fix-gate3.log` | `dac8f6a` (1.0) | OK (no complaint) | #107 again |
+| `nearplane-gate.log` | `137b4f2` (0.1) | UNKNOWN | user asked to relaunch without detail |
+| `nearplane-gate2.log` | `137b4f2` (0.1) | **MISSING** | tripwires silent (checked: 0/0/0/0) |
+| `nearplane-gate3.log` | `137b4f2` (0.1) | **MISSING** | fresh launch, same |
+| `nearplane-bisect.log` | tree (1.0 on RetailChaseCamera) | **OK** | the bisect run |
+
+Also relevant: the clean-launch #105 occurrence on 2026-06-09 (35-line log, zero errors,
+PRE-near-plane) — proof #105 strikes on `znear=1.0` builds too. The near plane cannot be
+the sole cause; the open question is independence vs trigger-probability.
+
+## 2. #105 history — what is already settled (DO NOT REDO)
+
+From `docs/research/2026-06-09-dat-reader-thread-safety-investigation.md` + the digest:
+
+- **Concurrent dat READS are SAFE** (Chorizite.DatReaderWriter 2.1.7): source audit + the
+in-tree hammer `DatConcurrencyStressTests` (~1.1 M concurrent reads, zero anomalies).
+  The "thread-unsafe dat reader" lore is refuted for the read path. Do not re-litigate.
+- **The teardown AccessViolations were dispose-during-read** (decode pool + streamer not
+  quiesced before `DatCollection.Dispose` unmapped views) — FIXED `8fadf77`.
+- **The "heavy probes cause white walls" framing is PARTIAL at best** — a clean 35-line
+  launch reproduced white walls; a heavily-probed run rendered fine. Probe load skews
+  timing (still avoid `ACDREAM_PROBE_FLAP` for visual gates) but is not the cause.
+- **Every silent dat-miss exit is tripwired** (`7433b70`): `[dat-miss]` (DatCollection
+  returns null), `[tex-miss]`/`[tex-skip]` (texture resolve/upload skips), `[cell-miss]`
+  (EnvCell load misses). Zero output when healthy. Both of today's bad runs: zero output
+  → **the dat → decode → staged-data side delivered; the loss is between staging and the
+  draw.**
+
+## 3. The GL-side texture path — anatomy + where it can lose textures
+
+The modern pipeline (N.4/N.5, mandatory — see `reference_modern_rendering_pipeline.md`):
+
+1. **Decode/stage:** `ObjectMeshManager.PrepareMeshDataAsync(id, isSetup)`
+   background-decodes mesh + texture data → auto-enqueues to `_stagedMeshData`.
+2. **Drain:** `WbMeshAdapter.Tick()` (render thread, per frame) drains the staged queue,
+   creates GL resources, populates `AcSurfaceMetadataTable` (per-batch translucency /
+   luminosity / fog metadata).
+3. **Texture upload:** `TextureCache` `GetOrUpload*Bindless` → GL texture (parallel
+   Texture2DArray uploads via `UploadRgba8AsLayer1Array`) → `glGetTextureHandleARB` →
+   `glMakeTextureHandleResidentARB`. Returns the 64-bit handle.
+4. **Per-batch plumbing:** the handle lands in `ObjectRenderBatch.BindlessTextureHandle`.
+   - Entities: `WbDrawDispatcher` Phase 5 uploads `_batchSsbo` (binding=1,
+     `(uvec2 handle, uint layer, uint flags)` per group).
+   - Cell shells: `EnvCellRenderer.RenderModernMDIInternal` packs `ModernBatchData
+     { TextureHandle, TextureIndex }` → `_modernBatchBuffer` (SSBO binding=1, bound at
+     EnvCellRenderer.cs:1211).
+5.
+   **Sampling:** `mesh_modern.frag` constructs `sampler2DArray(handle)` from the uvec2.
+   **A zero handle samples garbage/black/white (undefined)** — this is the classic
+   "white walls" appearance.
+
+Loss candidates between staging and draw (ranked):
+
+- **(a) Zero handle at draw time** — the batch was prepared before its texture upload
+  completed, and nothing back-patches the handle. Known to exist transiently (textures
+  pop in); a bug would make it PERSIST. ⚠️ **There is an EXISTING probe for exactly
+  this:** `ACDREAM_PROBE_SHELL=1` (`RenderingDiagnostics.ProbeShellEnabled`) prints, per
+  visible cell, gfxObj/batch counts AND `zh=` (zero-bindless-handle batch count) — see
+  RenderingDiagnostics.cs:120-130. **One bad-run launch with this probe splits the
+  search space in half** (zh>0 ⇒ upload/handle side; zh==0 ⇒ resident-but-wrong-content
+  or sampling/state side).
+- **(b) Residency loss / never-made-resident** — handle non-zero but
+  `MakeTextureHandleResidentARB` skipped or undone → same visual, but zh probe reads 0.
+  Needs a residency assert (glIsTextureHandleResidentARB sweep) or RenderDoc.
+- **(c) Upload raced/dropped under pressure** — `MaxCompletionsPerFrame` (QualityPreset)
+  caps streaming completions per frame; a drop/requeue bug under burst load would lose
+  whole cells' textures. Would likely show as *some* cells white, others fine.
+- **(d) Texture content wrong but handle valid** — array-layer mixups (zh==0, content
+  white). RenderDoc territory.
+
+## 4. #110 — what `znear` can and cannot plausibly do (senior-dev honest list)
+
+`znear` enters the system in exactly one object: the projection matrix
+(`CreatePerspectiveFieldOfView(FovY, aspect, znear, 5000f)` in RetailChaseCamera /
+ChaseCamera / FlyCamera / OrbitCamera — all currently 1.0 with ⚠️ comments).
+Downstream consumers of `viewProj`:
+
+| Consumer | Effect of 0.1 vs 1.0 | Texture relevance |
+|---|---|---|
+| Rasterization | geometry 0.1–1.0 m from the eye now draws; depth distribution shifts (D24 @ 5 m: ~1.5 µm → ~15 µm) | none direct; z-fighting would flicker, not "lose" textures |
+| Frustum culls (terrain/entity/EnvCell prepare) | strictly MORE visible (near plane closer) | **more batches prepared per frame → more uploads in flight → raises (a)/(c) trigger probability** ← the only credible #105×#110 link found |
+| PortalVisibilityBuilder flood | changing znear alters only the z row of viewProj (clip-space z); per-vertex x, y, w are unchanged, and the flood clip planes are (nx,ny,0,dw), i.e. x,y,w-based; the flood is conformance-gated (`CornerSweep_FloodIsCompleteAndMonotone`) | none |
+| gl_ClipDistance regions / terrain UBO | x,y,w-based, near-independent | none |
+| Doorway scissor | computed from NdcAabb, projection-independent at the box level | none |
+
+**Conclusion to verify, not assume:** the most credible story is that `znear=0.1` makes
+close-up geometry (the wall right behind the camera, the doorframe you're brushing)
+*newly visible*, inflating per-frame prepare/upload pressure indoors, which raises the
+probability of the pre-existing #105 loss. If true: fixing #105 exonerates the near
+plane entirely. The alternative (0.1 breaks texturing via a mechanism not in this table)
+needs RenderDoc evidence before being believed.
+
+## 5. Investigation plan (staged, evidence-first)
+
+**Phase A — attribute with the existing probe (cheap, decisive split):**
+1. Launch with `ACDREAM_PROBE_SHELL=1` (+ the always-on dat tripwires). Flip-launch until
+   a bad run reproduces (today it was 2/3 on the 0.1 build — consider temporarily
+   re-applying 0.1 to the working tree as the REPRO LEVER ONLY, clearly uncommitted).
+2. On a bad run: read `[shell]` lines for the affected cells. `zh>0` ⇒ path (a)/(c):
+   zero/never-patched handles — go to Phase C1. `zh==0` ⇒ path (b)/(d) — go to Phase C2.
+3.
+   ALSO capture the user's answer: **white surfaces vs invisible walls** (outstanding
+   question — invisible would point at visibility/depth instead and reshape this plan).
+
+**Phase B — settle the #110 correlation statistically (parallel, mechanical):**
+Alternate launches 0.1 / 1.0 (working-tree flip on RetailChaseCamera only), ≥4 runs per
+arm, record texture state per run (the `[shell] zh=` counts make this detectable WITHOUT
+user eyes if path (a)). Independence ⇒ bad runs appear in both arms. The 2026-06-09
+clean-launch occurrence already proves 1.0 is not immune.
+
+**Phase C — root-cause:**
+- **C1 (zero handles):** instrument the staging→handle path: log every batch that reaches
+  the draw SSBO with handle==0 (entity + EnvCell sides), plus `WbMeshAdapter.Tick` drain
+  counts and `TextureCache` upload completions per frame. Find who created a batch before
+  its texture and never back-patched. Fix = the back-patch / ordering, NOT a retry loop.
+- **C2 (valid handles, wrong output):** RenderDoc the bad frame (GPU truth): inspect
+  binding=1 SSBO contents, handle residency, sampled texture content, and the draw state
+  of an affected wall batch.
+
+**Phase D — close out:** fix #105 root cause → flip-test again (both arms clean) →
+re-land `znear=0.1` (re-apply the `137b4f2` payload: 4 cameras + restore the retail
+citation comments) → user re-gates the §4 corner press (wall must stay solid at the
+camera) + a distance scan for any new z-shimmer (none expected; retail ships 0.1).
+
+## 6. Repro notes + session-ops gotchas (cost real time today)
+
+- **Repro spot:** Holtburg houses near the player's parked position (the user was trying
+  on a house interior; exact house id unconfirmed — textures were missing across the
+  interior). Frequency today: 2 of 3 launches on the 0.1 build.
+- **#107 interference:** logging in while parked INDOORS wedges the player (stuck in
+  air/wall, 3-for-3 today — filed).
+  For THIS investigation prefer ending test sessions
+  with the character OUTDOORS so logins are clean. If wedged: relaunch; it intermittently
+  recovers.
+- **ACE session hold:** graceful window close ⇒ ~3–5 s; hard kill ⇒ ~3 min of
+  `session failed` (exit 29). The launch protocol + wait loop used all day is in this
+  session's transcript; `auto-entered player mode` is the in-world marker.
+- **⚠️ PowerShell 5.1 `Get-Content`/`Set-Content` MANGLES UTF-8 source files** (reads
+  CP1252, writes mojibake — corrupted all four camera files today; recovered via
+  `git checkout` + redoing edits with the Edit tool). **Never bulk-edit source with
+  PS5.1 string replace.**
+- **Tee-Object logs are UTF-16LE** — Python analyzers must BOM-detect; PowerShell
+  `Select-String` handles them natively.
+- Probes: `ACDREAM_PROBE_SHELL` is heavy-ish (per-prepare dumps) — short runs. The dat
+  tripwires are always-on and free. NEVER judge visuals under `ACDREAM_PROBE_FLAP`.
+
+## 7. What ELSE is open (do not drift into these)
+
+Priority order (set 2026-06-10, digest carries it): **this investigation (#105+#110)** →
+**#107** indoor-login spawn wedge (physics; `ACDREAM_CAPTURE_RESOLVE` apparatus ready) →
+**#108** cellar-ascent grass sweep + **#109** far-exit-door oscillation (one render
+session, probe captures at their spots) → **#99/A6.P4** per-cell shadow architecture
+(planned phase). The §4 flood strobe is FIXED (`dac8f6a`, user-gated) — its conformance
+gate (`CornerSweep_FloodIsCompleteAndMonotone`) and the corner-seal characterization
+(`CameraCornerSealReplayTests`) must stay green through any change here.
+
+## 8.
+Today's commit ledger (context for blame/diff archaeology)
+
+| Commit | What |
+|---|---|
+| `682cba3` | [clip-route] probe apparatus (outdoor flap) |
+| `c4df241` | outdoor full-world flap FIX (EnvCellRenderer DepthMask(false) leak → depth-clear no-op) |
+| `df2ef7c` | flap close-out doc |
+| `b21bb28` | corner-seal replay — camera-penetration hypothesis REFUTED (openings, not walls) |
+| `482b0de` | corner-seal handoff doc |
+| `dac8f6a` | §4 flood strobe FIX (homogeneous reciprocal clip + collinear-aware dedup) — user-gated |
+| `137b4f2` | near plane 1.0→0.1 (retail znear) + issues #107–#109 filed |
+| `8bd3492` | near plane reverted to 1.0 pending #110; #110 filed |
+
+Test baseline: App **223**, Core **1377** green + 4 pre-existing #99-era failures
+(DoorBugTrajectoryReplay ×2 / DoorCollisionApparatus / BSPStepUp) + 1 skip, UI **420**,
+Net **294**. ACE on `127.0.0.1:9000`, `testaccount/testpassword`, `+Acdream`.