docs: #105 x #110 handoff - white-texture GL-side investigation plan + near-plane re-land path

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Erik 2026-06-10 11:17:36 +02:00
parent 8bd3492612
commit 5d63038b61

@ -0,0 +1,210 @@
# HANDOFF — #105 intermittent missing indoor textures × #110 near-plane correlation
**Date:** 2026-06-10 (late). **Branch:** `claude/thirsty-goldberg-51bb9b`, HEAD `8bd3492`.
**Status:** #105 struck twice today with the dat-side tripwires SILENT (= GL-side); the
retail near-plane fix (`137b4f2`, 0.1 m) was bisect-implicated in those two runs and
REVERTED (`8bd3492`) pending this investigation. **One investigation, three payoffs:**
attribute + kill the chronic #105, settle whether the near plane is innocent (#110), and
re-land `znear=0.1`, which closes the §4 corner see-through-wall.
Read this top-to-bottom before touching code. The render digest
(`claude-memory/project_render_pipeline_digest.md`) carries the distilled state + the
DO-NOT-RETRY table — this doc is the deep-dive for THIS investigation.
---
## 0. TL;DR
1. **#105 (chronic since ~2026-06-08):** indoor wall/cell textures intermittently render
wrong ("missing" — exact appearance white-vs-invisible NOT yet confirmed by the user;
question outstanding). Today it struck on 2 consecutive launches. The four dat-side
tripwires (`[dat-miss]`/`[tex-miss]`/`[tex-skip]`/`[cell-miss]`, commit `7433b70`)
produced ZERO output on both bad runs → per the #105 protocol the failure is
**GL-side**: staged mesh/texture upload, bindless handle creation/residency, or the
per-batch handle plumbing — NOT a failed dat read.
2. **#110:** both bad runs were on the near-plane build (`znear=0.1`, `137b4f2`); the
very next run with `znear=1.0` (working-tree bisect) rendered clean. 2-bad-on-0.1 /
1-good-on-1.0 is *suggestive, not conclusive*: #105 is intermittent and could have
coincided. No mechanism is known by which znear touches texturing (see §4 for the
honest candidate list). The change is reverted; all four cameras carry a ⚠️ comment
pointing here.
3. **The leverage:** if #105 is independent (most likely), fixing it exonerates the near
plane → re-land `0.1` → §4 corner fix complete. If the near plane genuinely raises
#105's trigger probability (e.g. more close-up geometry → more upload pressure), the
fix is still in the #105 path — the near plane just becomes the best repro lever.
## 1. Today's run matrix (the evidence)
| Run (log, worktree root, untracked) | Build (near) | Indoor textures | Notes |
|---|---|---|---|
| `flood-fix-gate.log` | `dac8f6a` (1.0) | OK | user gated the flood fix: transitions clean |
| `flood-fix-gate2.log` | `dac8f6a` (1.0) | OK (no complaint) | #107 spawn wedge run |
| `flood-fix-gate3.log` | `dac8f6a` (1.0) | OK (no complaint) | #107 again |
| `nearplane-gate.log` | `137b4f2` (0.1) | UNKNOWN | user asked to relaunch without detail |
| `nearplane-gate2.log` | `137b4f2` (0.1) | **MISSING** | tripwires silent (checked: 0/0/0/0) |
| `nearplane-gate3.log` | `137b4f2` (0.1) | **MISSING** | fresh launch, same |
| `nearplane-bisect.log` | tree (1.0 on RetailChaseCamera) | **OK** | the bisect run |
Also relevant: the clean-launch #105 occurrence on 2026-06-09 (35-line log, zero errors,
PRE-near-plane) — proof #105 strikes on `znear=1.0` builds too. The near plane cannot be
the sole cause; the open question is independence vs trigger-probability.
## 2. #105 history — what is already settled (DO NOT REDO)
From `docs/research/2026-06-09-dat-reader-thread-safety-investigation.md` + the digest:
- **Concurrent dat READS are SAFE** (Chorizite.DatReaderWriter 2.1.7): source audit + the
in-tree hammer `DatConcurrencyStressTests` (~1.1 M concurrent reads, zero anomalies).
The "thread-unsafe dat reader" lore is refuted for the read path. Do not re-litigate.
- **The teardown AccessViolations were dispose-during-read** (decode pool + streamer not
quiesced before `DatCollection.Dispose` unmapped views) — FIXED `8fadf77`.
- **The "heavy probes cause white walls" framing is PARTIAL at best** — a clean 35-line
launch reproduced white walls; a heavily-probed run rendered fine. Probe load skews
timing (still avoid `ACDREAM_PROBE_FLAP` for visual gates) but is not the cause.
- **Every silent dat-miss exit is tripwired** (`7433b70`): `[dat-miss]` (DatCollection
returns null), `[tex-miss]`/`[tex-skip]` (texture resolve/upload skips), `[cell-miss]`
(EnvCell load misses). Zero output when healthy. Both of today's bad runs: zero output
→ **the dat → decode → staged-data side delivered; the loss is between staging and the
draw.**
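The "tripwires silent (0/0/0/0)" check can be scripted rather than eyeballed. A minimal sketch, assuming only that each tripwire writes its bracketed tag somewhere on a log line (the rest of the line format is not assumed, and the function names are hypothetical):

```python
# The four dat-side tripwire tags named in this doc (commit 7433b70).
TRIPWIRE_TAGS = ("[dat-miss]", "[tex-miss]", "[tex-skip]", "[cell-miss]")


def count_tripwires(lines):
    """Count log lines carrying each tripwire tag (plain substring match)."""
    counts = {tag: 0 for tag in TRIPWIRE_TAGS}
    for line in lines:
        for tag in TRIPWIRE_TAGS:
            if tag in line:
                counts[tag] += 1
    return counts


def dat_side_silent(lines):
    """True when no tripwire fired, i.e. the #105 protocol points GL-side."""
    return all(n == 0 for n in count_tripwires(lines).values())
```

Feed it already-decoded log lines; the Tee-Object logs are UTF-16LE, so decode them before matching.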
## 3. The GL-side texture path — anatomy + where it can lose textures
The modern pipeline (N.4/N.5, mandatory — see `reference_modern_rendering_pipeline.md`):
1. **Decode/stage:** `ObjectMeshManager.PrepareMeshDataAsync(id, isSetup)` background-
decodes mesh + texture data → auto-enqueues to `_stagedMeshData`.
2. **Drain:** `WbMeshAdapter.Tick()` (render thread, per frame) drains the staged queue,
creates GL resources, populates `AcSurfaceMetadataTable` (per-batch translucency /
luminosity / fog metadata).
3. **Texture upload:** `TextureCache` `GetOrUpload*Bindless` → GL texture (parallel
Texture2DArray uploads via `UploadRgba8AsLayer1Array`) → `glGetTextureHandleARB` →
`glMakeTextureHandleResidentARB`. Returns the 64-bit handle.
4. **Per-batch plumbing:** the handle lands in `ObjectRenderBatch.BindlessTextureHandle`.
- Entities: `WbDrawDispatcher` Phase 5 uploads `_batchSsbo` (binding=1,
`(uvec2 handle, uint layer, uint flags)` per group).
- Cell shells: `EnvCellRenderer.RenderModernMDIInternal` packs `ModernBatchData
{ TextureHandle, TextureIndex }` → `_modernBatchBuffer` (SSBO binding=1, bound at
EnvCellRenderer.cs:1211).
5. **Sampling:** `mesh_modern.frag` constructs `sampler2DArray(handle)` from the uvec2.
**A zero handle samples garbage/black/white (undefined)** — this is the classic
"white walls" appearance.
Loss candidates between staging and draw (ranked):
- **(a) Zero handle at draw time** — the batch was prepared before its texture upload
completed, and nothing back-patches the handle. Known to exist transiently (textures
pop in); a bug would make it PERSIST. ⚠️ **There is an EXISTING probe for exactly
this:** `ACDREAM_PROBE_SHELL=1` (`RenderingDiagnostics.ProbeShellEnabled`) prints, per
visible cell, gfxObj/batch counts AND `zh=` (zero-bindless-handle batch count) — see
RenderingDiagnostics.cs:120-130. **One bad-run launch with this probe splits the
search space in half** (zh>0 ⇒ upload/handle side; zh==0 ⇒ resident-but-wrong-content
or sampling/state side).
- **(b) Residency loss / never-made-resident** — handle non-zero but
`MakeTextureHandleResidentARB` skipped or undone → same visual, but zh probe reads 0.
Needs a residency assert (glIsTextureHandleResidentARB sweep) or RenderDoc.
- **(c) Upload raced/dropped under pressure** — `MaxCompletionsPerFrame` (QualityPreset)
caps streaming completions per frame; a drop/requeue bug under burst load would lose
whole cells' textures. Would likely show as *some* cells white, others fine.
- **(d) Texture content wrong but handle valid** — array-layer mixups (zh==0, content
white). RenderDoc territory.
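For candidate (c), the property a correct capped drain must keep is conservation: everything staged is eventually completed, in order, however large the burst. The toy model below is NOT the project's streaming code; `drain_capped` and its queue are hypothetical stand-ins for a `MaxCompletionsPerFrame`-style cap, meant only to make that invariant concrete.

```python
from collections import deque


def drain_capped(staged, max_per_frame):
    """Complete at most max_per_frame items; the rest stay queued for the next frame."""
    done = []
    while staged and len(done) < max_per_frame:
        done.append(staged.popleft())
    return done


def run_until_empty(items, max_per_frame):
    """Drain a burst frame by frame; returns (completed items, frames used)."""
    staged = deque(items)
    completed, frames = [], 0
    while staged:
        completed.extend(drain_capped(staged, max_per_frame))
        frames += 1
    return completed, frames
```

A drop/requeue bug would show up as `completed` missing items or reordering them under a burst. The equivalent check in the real path is comparing staged counts against completed counts per cell.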
## 4. #110 — what `znear` can and cannot plausibly do (senior-dev honest list)
`znear` enters the system in exactly one object: the projection matrix
(`CreatePerspectiveFieldOfView(FovY, aspect, znear, 5000f)` in RetailChaseCamera /
ChaseCamera / FlyCamera / OrbitCamera — all currently 1.0 with ⚠️ comments).
Downstream consumers of `viewProj`:
| Consumer | Effect of 0.1 vs 1.0 | Texture relevance |
|---|---|---|
| Rasterization | geometry 0.1–1.0 m from the eye now draws; depth distribution shifts (D24 @ 5 m: ~1.5 µm → ~15 µm) | none direct; z-fighting would flicker, not "lose" textures |
| Frustum culls (terrain/entity/EnvCell prepare) | strictly MORE visible (near plane closer) | **more batches prepared per frame → more uploads in flight → raises (a)/(c) trigger probability** ← the only credible #105×#110 link found |
| PortalVisibilityBuilder flood | `znear` changes only the z-row of the projection, so per-vertex clip-space x, y, w are unchanged; flood clip planes are (nx,ny,0,dw), i.e. x,y,w-based; the flood is conformance-gated (`CornerSweep_FloodIsCompleteAndMonotone`) | none |
| gl_ClipDistance regions / terrain UBO | x,y,w-based, near-independent | none |
| Doorway scissor | computed from NdcAabb, projection-independent at the box level | none |
**Conclusion to verify, not assume:** the most credible story is that `znear=0.1` makes
close-up geometry (the wall right behind the camera, the doorframe you're brushing)
*newly visible*, inflating per-frame prepare/upload pressure indoors, which raises the
probability of the pre-existing #105 loss. If true: fixing #105 exonerates the near
plane entirely. The alternative (0.1 breaks texturing via a mechanism not in this table)
needs RenderDoc evidence before being believed.
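The D24 figures in the rasterization row can be re-derived. A minimal sketch, assuming a conventional (non-reversed) perspective depth mapping and a 24-bit fixed-point depth buffer; under those assumptions window depth has slope `far*near / ((far-near) * z^2)` at eye distance `z`, and one depth-buffer step covers the inverse of that slope divided by `2^24`:

```python
def depth_step_m(z, near, far=5000.0, bits=24):
    """Smallest resolvable eye-space depth difference (metres) at distance z.

    Assumes standard perspective depth (not reversed-Z) and a fixed-point
    depth buffer with `bits` bits; far=5000 matches the cameras in this doc.
    """
    slope = far * near / ((far - near) * z * z)  # d(window depth) / d(eye z)
    return 1.0 / (slope * (2 ** bits))


step_retail = depth_step_m(5.0, near=0.1)   # about 15 micrometres
step_current = depth_step_m(5.0, near=1.0)  # about 1.5 micrometres
```

With `far` much larger than `near` the step scales as `z^2 / near`, so moving the near plane from 1.0 to 0.1 coarsens depth by about 10x at every distance. That is a z-shimmer risk at range, not a texturing mechanism.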
## 5. Investigation plan (staged, evidence-first)
**Phase A — attribute with the existing probe (cheap, decisive split):**
1. Launch with `ACDREAM_PROBE_SHELL=1` (+ the always-on dat tripwires). Flip-launch until
a bad run reproduces (today it was 2/3 on the 0.1 build — consider temporarily
re-applying 0.1 to the working tree as the REPRO LEVER ONLY, clearly uncommitted).
2. On a bad run: read `[shell]` lines for the affected cells. `zh>0` ⇒ path (a)/(c):
zero/never-patched handles; go to Phase C1. `zh==0` ⇒ path (b)/(d); go to Phase C2.
3. ALSO capture the user's answer: **white surfaces vs invisible walls** (outstanding
question — invisible would point at visibility/depth instead and reshape this plan).
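The Phase A split can be read off a log mechanically. A minimal sketch, assuming only that probe lines contain the literal `[shell]` and a `zh=<integer>` token; every other field on the line is ignored, and the function name and return labels are hypothetical:

```python
import re

ZH_RE = re.compile(r"\bzh=(\d+)")


def phase_after_probe(lines):
    """Return 'C1' if any [shell] line reports zero handles, 'C2' if all report
    zh=0, and 'no-data' if the probe produced no parsable lines."""
    seen = 0
    zero_handles = 0
    for line in lines:
        if "[shell]" not in line:
            continue
        match = ZH_RE.search(line)
        if match:
            seen += 1
            zero_handles += int(match.group(1))
    if seen == 0:
        return "no-data"
    return "C1" if zero_handles > 0 else "C2"
```

Zero handles are known to exist transiently while textures pop in, so restrict the input to lines captured after the scene has settled. A `zh>0` count that persists is the signal, not a single early line.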
**Phase B — settle the #110 correlation statistically (parallel, mechanical):**
Alternate launches 0.1 / 1.0 (working-tree flip on RetailChaseCamera only), ≥4 runs per
arm, record texture state per run (the `[shell] zh=` counts make this detectable WITHOUT
user eyes if path (a)). Independence ⇒ bad runs appear in both arms. The 2026-06-09
clean-launch occurrence already proves 1.0 is not immune.
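One way to put a number on "suggestive, not conclusive" is Fisher's exact test on the bad/good counts per arm. This is a suggested reading aid, not part of the plan as written; the sketch uses only the standard library and sums the hypergeometric probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(bad_a, good_a, bad_b, good_b):
    """Two-sided Fisher exact p-value for a 2x2 table of runs (arm A vs arm B)."""
    row_a = bad_a + good_a
    row_b = bad_b + good_b
    bad_total = bad_a + bad_b
    denom = comb(row_a + row_b, bad_total)

    def prob(k):
        # Probability that arm A holds exactly k of the bad runs by chance.
        return comb(row_a, k) * comb(row_b, bad_total - k) / denom

    lo = max(0, bad_total - row_b)
    hi = min(row_a, bad_total)
    p_obs = prob(bad_a)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


p_today = fisher_exact_two_sided(2, 0, 0, 1)  # 2 bad on 0.1, 1 good on 1.0
p_clean_split = fisher_exact_two_sided(4, 0, 0, 4)  # hypothetical 4-vs-4 outcome
```

Today's 2-bad / 1-good split gives p = 1/3, which is no evidence either way. Even a perfectly separated 4-vs-4 outcome only reaches p = 2/70, so 4 runs per arm is the floor for a meaningful answer, and a single bad run in the 1.0 arm settles independence more cheaply than any p-value.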
**Phase C — root-cause:**
- **C1 (zero handles):** instrument the staging→handle path: log every batch that reaches
the draw SSBO with handle==0 (entity + EnvCell sides), plus `WbMeshAdapter.Tick` drain
counts and `TextureCache` upload completions per frame. Find who created a batch before
its texture and never back-patched. Fix = the back-patch / ordering, NOT a retry loop.
- **C2 (valid handles, wrong output):** RenderDoc the bad frame (GPU truth): inspect
binding=1 SSBO contents, handle residency, sampled texture content, and the draw state
of an affected wall batch.
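If the binding=1 buffer is exported from a capture as raw bytes, zero handles can be found offline. A minimal sketch for the entity-side layout only, ASSUMING tightly packed little-endian 16-byte records of `(uvec2 handle, uint layer, uint flags)` as described in §3; the EnvCell-side `ModernBatchData` layout is not documented here and is not covered. The function name is hypothetical.

```python
import struct

# Assumed record: 64-bit handle (uvec2 = low word, high word), layer, flags.
RECORD = struct.Struct("<QII")


def zero_handle_records(raw):
    """Return indices of records whose 64-bit bindless handle is zero."""
    usable = len(raw) - len(raw) % RECORD.size  # ignore a trailing partial record
    return [
        index
        for index, (handle, _layer, _flags) in enumerate(RECORD.iter_unpack(raw[:usable]))
        if handle == 0
    ]
```

Verify the record stride against the shader's struct declaration before trusting the indices. A wrong stride makes every result meaningless.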
**Phase D — close out:** fix #105 root cause → flip-test again (both arms clean) →
re-land `znear=0.1` (re-apply the `137b4f2` payload: 4 cameras + restore the retail
citation comments) → user re-gates the §4 corner press (wall must stay solid at the
camera) + a distance scan for any new z-shimmer (none expected; retail ships 0.1).
## 6. Repro notes + session-ops gotchas (cost real time today)
- **Repro spot:** Holtburg houses near the player's parked position (the user was testing
inside a house; exact house id unconfirmed; textures were missing across the whole
interior). Frequency today: 2 of 3 launches on the 0.1 build.
- **#107 interference:** logging in while parked INDOORS wedges the player (stuck in
air/wall, 3-for-3 today — filed). For THIS investigation prefer ending test sessions
with the character OUTDOORS so logins are clean. If wedged: relaunch; it intermittently
recovers.
- **ACE session hold:** graceful window close ⇒ ~35 s; hard kill ⇒ ~3 min of
`session failed` (exit 29). The launch protocol + wait loop used all day is in this
session's transcript; `auto-entered player mode` is the in-world marker.
- **⚠️ PowerShell 5.1 `Get-Content`/`Set-Content` MANGLES UTF-8 source files** (reads
CP1252, writes mojibake — corrupted all four camera files today; recovered via
`git checkout` + redoing edits with the Edit tool). **Never bulk-edit source with
PS5.1 string replace.**
- **Tee-Object logs are UTF-16LE** — Python analyzers must BOM-detect; PowerShell
`Select-String` handles them natively.
- Probes: `ACDREAM_PROBE_SHELL` is heavy-ish (per-prepare dumps) — short runs. The dat
tripwires are always-on and free. NEVER judge visuals under `ACDREAM_PROBE_FLAP`.
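The BOM detection that Python analyzers need for Tee-Object logs can be done on the raw bytes. A minimal sketch using only the standard `codecs` constants; it falls back to UTF-8 when no BOM is present and does not try to distinguish UTF-32 (the function name is hypothetical):

```python
import codecs


def decode_log_bytes(raw):
    """Decode log bytes by BOM: UTF-8, UTF-16LE (Tee-Object), or UTF-16BE."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No BOM: assume UTF-8 and keep going on stray bytes rather than crash.
    return raw.decode("utf-8", errors="replace")
```

Typical use is `decode_log_bytes(open(path, "rb").read()).splitlines()`, which yields plain strings for any tag matching.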
## 7. What ELSE is open (do not drift into these)
Priority order (set 2026-06-10, digest carries it): **this investigation (#105+#110)** →
**#107** indoor-login spawn wedge (physics; `ACDREAM_CAPTURE_RESOLVE` apparatus ready) →
**#108** cellar-ascent grass sweep + **#109** far-exit-door oscillation (one render
session, probe captures at their spots) → **#99/A6.P4** per-cell shadow architecture
(planned phase). The §4 flood strobe is FIXED (`dac8f6a`, user-gated) — its conformance
gate (`CornerSweep_FloodIsCompleteAndMonotone`) and the corner-seal characterization
(`CameraCornerSealReplayTests`) must stay green through any change here.
## 8. Today's commit ledger (context for blame/diff archaeology)
| Commit | What |
|---|---|
| `682cba3` | [clip-route] probe apparatus (outdoor flap) |
| `c4df241` | outdoor full-world flap FIX (EnvCellRenderer DepthMask(false) leak → depth-clear no-op) |
| `df2ef7c` | flap close-out doc |
| `b21bb28` | corner-seal replay — camera-penetration hypothesis REFUTED (openings, not walls) |
| `482b0de` | corner-seal handoff doc |
| `dac8f6a` | §4 flood strobe FIX (homogeneous reciprocal clip + collinear-aware dedup) — user-gated |
| `137b4f2` | near plane 1.0→0.1 (retail znear) + issues #107–#109 filed |
| `8bd3492` | near plane reverted to 1.0 pending #110; #110 filed |
Test baseline: App **223**, Core **1377** green + 4 pre-existing #99-era failures
(DoorBugTrajectoryReplay ×2 / DoorCollisionApparatus / BSPStepUp) + 1 skip, UI **420**,
Net **294**. ACE on `127.0.0.1:9000`, `testaccount/testpassword`, `+Acdream`.