acdream/docs/research/2026-06-10-105-110-white-textures-nearplane-handoff.md
2026-06-10 12:17:09 +02:00


HANDOFF — #105 intermittent missing indoor textures × #110 near-plane correlation

CLOSED 2026-06-10 (same day). Root cause: the per-frame staged-texture flush (WB GameScene.cs:975 → ObjectMeshManager.GenerateMipmaps()) was dropped in the N.4/O-T4 extraction; fix c787201, znear=0.1 re-landed d4b5c71. §4's "only credible link" (upload pressure) was exactly right. Read the close-out instead: 2026-06-10-105-110-CLOSED-staged-texture-flush-drop.md. This document is historical.

Date: 2026-06-10 (late). Branch: claude/thirsty-goldberg-51bb9b, HEAD 8bd3492. Status: #105 struck twice today with the dat-side tripwires SILENT (= GL-side); the retail near-plane fix (137b4f2, 0.1 m) was bisect-implicated in those two runs and REVERTED (8bd3492) pending this investigation. One investigation, three payoffs: attribute + kill the chronic #105, settle whether the near plane is innocent (#110), and re-land znear=0.1 which closes the §4 corner see-through-wall.

Read this top-to-bottom before touching code. The render digest (claude-memory/project_render_pipeline_digest.md) carries the distilled state + the DO-NOT-RETRY table — this doc is the deep-dive for THIS investigation.


0. TL;DR

  1. #105 (chronic since ~2026-06-08): indoor wall/cell textures intermittently render wrong ("missing" — exact appearance white-vs-invisible NOT yet confirmed by the user; question outstanding). Today it struck on 2 consecutive launches. The four dat-side tripwires ([dat-miss]/[tex-miss]/[tex-skip]/[cell-miss], commit 7433b70) produced ZERO output on both bad runs → per the #105 protocol the failure is GL-side: staged mesh/texture upload, bindless handle creation/residency, or the per-batch handle plumbing — NOT a failed dat read.
  2. #110: both bad runs were on the near-plane build (znear=0.1, 137b4f2); the very next run with znear=1.0 (working-tree bisect) rendered clean. 2-bad-on-0.1 / 1-good-on-1.0 is suggestive, not conclusive — #105 is intermittent and could have coincided. No mechanism is known by which znear touches texturing (see §4 for the honest candidate list). The change is reverted; all four cameras carry a ⚠️ comment pointing here.
  3. The leverage: if #105 is independent (most likely), fixing it exonerates the near plane → re-land 0.1 → §4 corner fix complete. If the near plane genuinely raises #105's trigger probability (e.g. more close-up geometry → more upload pressure), the fix is still in the #105 path — the near plane just becomes the best repro lever.

1. Today's run matrix (the evidence)

| Run (log, worktree root, untracked) | Build (near) | Indoor textures | Notes |
|---|---|---|---|
| flood-fix-gate.log | dac8f6a (1.0) | OK | user gated the flood fix: transitions clean |
| flood-fix-gate2.log | dac8f6a (1.0) | OK (no complaint) | #107 spawn wedge run |
| flood-fix-gate3.log | dac8f6a (1.0) | OK (no complaint) | #107 again |
| nearplane-gate.log | 137b4f2 (0.1) | UNKNOWN | user asked to relaunch without detail |
| nearplane-gate2.log | 137b4f2 (0.1) | MISSING | tripwires silent (checked: 0/0/0/0) |
| nearplane-gate3.log | 137b4f2 (0.1) | MISSING | fresh launch, same |
| nearplane-bisect.log | tree (1.0 on RetailChaseCamera) | OK | the bisect run |

Also relevant: the clean-launch #105 occurrence on 2026-06-09 (35-line log, zero errors, PRE-near-plane) — proof #105 strikes on znear=1.0 builds too. The near plane cannot be the sole cause; the open question is independence vs trigger-probability.

2. #105 history — what is already settled (DO NOT REDO)

From docs/research/2026-06-09-dat-reader-thread-safety-investigation.md + the digest:

  • Concurrent dat READS are SAFE (Chorizite.DatReaderWriter 2.1.7): source audit + the in-tree hammer DatConcurrencyStressTests (~1.1 M concurrent reads, zero anomalies). The "thread-unsafe dat reader" lore is refuted for the read path. Do not re-litigate.
  • The teardown AccessViolations were dispose-during-read (decode pool + streamer not quiesced before DatCollection.Dispose unmapped views) — FIXED 8fadf77.
  • The "heavy probes cause white walls" framing is PARTIAL at best — a clean 35-line launch reproduced white walls; a heavily-probed run rendered fine. Probe load skews timing (still avoid ACDREAM_PROBE_FLAP for visual gates) but is not the cause.
  • Every silent dat-miss exit is tripwired (7433b70): [dat-miss] (DatCollection returns null), [tex-miss]/[tex-skip] (texture resolve/upload skips), [cell-miss] (EnvCell load misses). Zero output when healthy. Both of today's bad runs: zero output → the dat → decode → staged-data side delivered; the loss is between staging and the draw.

3. The GL-side texture path — anatomy + where it can lose textures

The modern pipeline (N.4/N.5, mandatory — see reference_modern_rendering_pipeline.md):

  1. Decode/stage: ObjectMeshManager.PrepareMeshDataAsync(id, isSetup) background-decodes mesh + texture data → auto-enqueues to _stagedMeshData.
  2. Drain: WbMeshAdapter.Tick() (render thread, per frame) drains the staged queue, creates GL resources, populates AcSurfaceMetadataTable (per-batch translucency / luminosity / fog metadata).
  3. Texture upload: TextureCache GetOrUpload*Bindless → GL texture (parallel Texture2DArray uploads via UploadRgba8AsLayer1Array) → glGetTextureHandleARB → glMakeTextureHandleResidentARB. Returns the 64-bit handle.
  4. Per-batch plumbing: the handle lands in ObjectRenderBatch.BindlessTextureHandle.
    • Entities: WbDrawDispatcher Phase 5 uploads _batchSsbo (binding=1, (uvec2 handle, uint layer, uint flags) per group).
    • Cell shells: EnvCellRenderer.RenderModernMDIInternal packs ModernBatchData { TextureHandle, TextureIndex } → _modernBatchBuffer (SSBO binding=1, bound at EnvCellRenderer.cs:1211).
  5. Sampling: mesh_modern.frag constructs sampler2DArray(handle) from the uvec2. A zero handle samples garbage/black/white (undefined) — this is the classic "white walls" appearance.

Loss candidates between staging and draw (ranked):

  • (a) Zero handle at draw time — the batch was prepared before its texture upload completed, and nothing back-patches the handle. Known to exist transiently (textures pop in); a bug would make it PERSIST. ⚠️ There is an EXISTING probe for exactly this: ACDREAM_PROBE_SHELL=1 (RenderingDiagnostics.ProbeShellEnabled) prints, per visible cell, gfxObj/batch counts AND zh= (zero-bindless-handle batch count) — see RenderingDiagnostics.cs:120-130. One bad-run launch with this probe splits the search space in half (zh>0 ⇒ upload/handle side; zh==0 ⇒ resident-but-wrong-content or sampling/state side).
  • (b) Residency loss / never-made-resident — handle non-zero but MakeTextureHandleResidentARB skipped or undone → same visual, but zh probe reads 0. Needs a residency assert (glIsTextureHandleResidentARB sweep) or RenderDoc.
  • (c) Upload raced/dropped under pressure: MaxCompletionsPerFrame (QualityPreset) caps streaming completions per frame; a drop/requeue bug under burst load would lose whole cells' textures. Would likely show as some cells white, others fine.
  • (d) Texture content wrong but handle valid — array-layer mixups (zh==0, content white). RenderDoc territory.
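
Candidate (a) is easier to reason about with a toy model. The Python sketch below is not the engine's C# code: `ToyBatch`, `ToyTextureCache`, and `final_handle` are hypothetical names, and the "upload" is just a dictionary write. It only illustrates how a batch that captured handle 0 stays at 0 forever if either the per-frame flush or the back-patch step is missing.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBatch:
    texture_id: int
    handle: int = 0  # 0 stands in for "no bindless handle yet"


@dataclass
class ToyTextureCache:
    staged: list = field(default_factory=list)   # texture ids decoded but not uploaded
    handles: dict = field(default_factory=dict)  # texture id -> fake non-zero handle

    def stage(self, texture_id: int) -> None:
        self.staged.append(texture_id)

    def flush(self) -> None:
        # Stand-in for a per-frame staged-texture flush: upload and mint handles.
        for tid in self.staged:
            self.handles[tid] = 0x1000 + tid
        self.staged.clear()

    def lookup(self, texture_id: int) -> int:
        return self.handles.get(texture_id, 0)


def final_handle(frames: int, flush_each_frame: bool, back_patch: bool) -> int:
    """Return the handle a batch ends up with after `frames` simulated frames."""
    cache = ToyTextureCache()
    cache.stage(7)
    # The batch is prepared before any upload ran, so it captures handle 0.
    batch = ToyBatch(texture_id=7, handle=cache.lookup(7))
    for _ in range(frames):
        if flush_each_frame:
            cache.flush()
        if back_patch:
            batch.handle = cache.lookup(batch.texture_id)
    return batch.handle
```

In this model a transient zero handle ("textures pop in") is the case where both steps run; removing either one reproduces the persistent zero that the zh= probe is meant to count.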

4. #110 — what znear can and cannot plausibly do (senior-dev honest list)

znear enters the system in exactly one object: the projection matrix (CreatePerspectiveFieldOfView(FovY, aspect, znear, 5000f) in RetailChaseCamera / ChaseCamera / FlyCamera / OrbitCamera — all currently 1.0 with ⚠️ comments). Downstream consumers of viewProj:

| Consumer | Effect of 0.1 vs 1.0 | Texture relevance |
|---|---|---|
| Rasterization | geometry 0.1–1.0 m from the eye now draws; depth distribution shifts (D24 @ 5 m: ~1.5 µm → ~15 µm) | none direct; z-fighting would flicker, not "lose" textures |
| Frustum culls (terrain/entity/EnvCell prepare) | strictly MORE visible (near plane closer) | more batches prepared per frame → more uploads in flight → raises (a)/(c) trigger probability ← the only credible #105×#110 link found |
| PortalVisibilityBuilder flood | znear changes viewProj only in its z-row, so per-vertex x, y, w are unchanged; flood clip planes are (nx,ny,0,dw), i.e. x,y,w-based; the flood is conformance-gated (CornerSweep_FloodIsCompleteAndMonotone) | none |
| gl_ClipDistance regions / terrain UBO | x,y,w-based, near-independent | none |
| Doorway scissor | computed from NdcAabb, projection-independent at the box level | none |
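
The D24 figures quoted for rasterization can be sanity-checked with a few lines of Python. This is a sketch under two assumptions that the text does not state: a standard perspective depth mapping, and a 24-bit normalized integer depth buffer. The far plane (5000) comes from the projection call above; `depth_step_m` is a hypothetical helper name.

```python
def depth_step_m(z: float, near: float, far: float = 5000.0, bits: int = 24) -> float:
    """Approximate eye-space distance covered by one depth-buffer step at distance z.

    Window depth for a standard perspective projection is
        d(z) = (1/near - 1/z) / (1/near - 1/far)
    so d'(z) = 1 / (z**2 * (1/near - 1/far)), and one quantization step of
    2**-bits in d corresponds to roughly step / d'(z) in z.
    """
    return z * z * (1.0 / near - 1.0 / far) * 2.0 ** -bits


if __name__ == "__main__":
    for near in (1.0, 0.1):
        step_um = depth_step_m(5.0, near) * 1e6
        print(f"znear={near}: about {step_um:.1f} micrometres per depth step at 5 m")
```

Under those assumptions the step at 5 m grows by roughly 10x when znear drops from 1.0 to 0.1, which matches the ~1.5 µm → ~15 µm shift in the table and is far too small to explain missing textures.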

Conclusion to verify, not assume: the most credible story is that znear=0.1 makes close-up geometry (the wall right behind the camera, the doorframe you're brushing) newly visible, inflating per-frame prepare/upload pressure indoors, which raises the probability of the pre-existing #105 loss. If true: fixing #105 exonerates the near plane entirely. The alternative (0.1 breaks texturing via a mechanism not in this table) needs RenderDoc evidence before being believed.

5. Investigation plan (staged, evidence-first)

Phase A — attribute with the existing probe (cheap, decisive split):

  1. Launch with ACDREAM_PROBE_SHELL=1 (+ the always-on dat tripwires). Flip-launch until a bad run reproduces (today it was 2/3 on the 0.1 build — consider temporarily re-applying 0.1 to the working tree as the REPRO LEVER ONLY, clearly uncommitted).
  2. On a bad run: read [shell] lines for the affected cells. zh>0 ⇒ path (a)/(c): zero/never-patched handles; go to Phase C1. zh==0 ⇒ path (b)/(d); go to Phase C2.
  3. ALSO capture the user's answer: white surfaces vs invisible walls (outstanding question — invisible would point at visibility/depth instead and reshape this plan).
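
The Phase A split can be automated once a bad-run log exists. The sketch below assumes only that probe lines contain the literal tag `[shell]` and a `zh=<integer>` token; the rest of the line format is not documented here, so the sample lines in the usage note are invented for illustration. `classify_shell_lines` is a hypothetical helper.

```python
import re

# Matches the zero-bindless-handle batch count token, e.g. "zh=3".
ZH_RE = re.compile(r"\bzh=(\d+)")


def classify_shell_lines(lines):
    """Return 'C1' if any [shell] line reports zh>0, 'C2' if all report zh=0,
    and 'no-data' if no [shell] line carried a zh= token."""
    seen = 0
    total_zh = 0
    for line in lines:
        if "[shell]" not in line:
            continue
        match = ZH_RE.search(line)
        if match is None:
            continue
        seen += 1
        total_zh += int(match.group(1))
    if seen == 0:
        return "no-data"
    return "C1" if total_zh > 0 else "C2"
```

For example, `classify_shell_lines(["[shell] cell=A batches=12 zh=3"])` returns `"C1"`, while a log whose `[shell]` lines all carry `zh=0` returns `"C2"`. Filter the input to the affected cells first if the log also covers healthy ones.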

Phase B — settle the #110 correlation statistically (parallel, mechanical): Alternate launches 0.1 / 1.0 (working-tree flip on RetailChaseCamera only), ≥4 runs per arm, record texture state per run (the [shell] zh= counts make this detectable WITHOUT user eyes if path (a)). Independence ⇒ bad runs appear in both arms. The 2026-06-09 clean-launch occurrence already proves 1.0 is not immune.
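
To put a number on "suggestive, not conclusive", the per-arm tallies can be fed to a Fisher exact test. The sketch below is a plain standard-library implementation (no SciPy dependency assumed); `fisher_exact_two_sided` is a hypothetical helper, and the two-sided rule used is the common "sum all tables no more likely than the observed one" convention.

```python
from math import comb


def fisher_exact_two_sided(bad_a: int, good_a: int, bad_b: int, good_b: int) -> float:
    """Two-sided Fisher exact p-value for a 2x2 table of (bad, good) runs per arm."""
    n_a = bad_a + good_a
    n_b = bad_b + good_b
    bad = bad_a + bad_b
    denom = comb(n_a + n_b, bad)

    def prob(k: int) -> float:
        # Hypergeometric probability that arm A holds k of the bad runs.
        return comb(n_a, k) * comb(n_b, bad - k) / denom

    p_obs = prob(bad_a)
    lo = max(0, bad - n_b)
    hi = min(n_a, bad)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    # Today's evidence as summarized in the TL;DR: 2 bad on 0.1, 1 good on 1.0.
    print(fisher_exact_two_sided(2, 0, 0, 1))
    # Most extreme outcome possible with 4 runs per arm.
    print(fisher_exact_two_sided(4, 0, 0, 4))
```

Today's 2-bad / 1-good split gives p = 1/3, which cannot separate the arms. Four runs per arm is the smallest design where even a perfect split reaches p ≈ 0.029, so treat ≥4 as a floor rather than a comfortable sample. A large p-value would mean only that no difference was detected, not that the arms are proven independent.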

Phase C — root-cause:

  • C1 (zero handles): instrument the staging→handle path: log every batch that reaches the draw SSBO with handle==0 (entity + EnvCell sides), plus WbMeshAdapter.Tick drain counts and TextureCache upload completions per frame. Find who created a batch before its texture and never back-patched. Fix = the back-patch / ordering, NOT a retry loop.
  • C2 (valid handles, wrong output): RenderDoc the bad frame (GPU truth): inspect binding=1 SSBO contents, handle residency, sampled texture content, and the draw state of an affected wall batch.

Phase D — close out: fix #105 root cause → flip-test again (both arms clean) → re-land znear=0.1 (re-apply the 137b4f2 payload: 4 cameras + restore the retail citation comments) → user re-gates the §4 corner press (wall must stay solid at the camera) + a distance scan for any new z-shimmer (none expected; retail ships 0.1).

6. Repro notes + session-ops gotchas (cost real time today)

  • Repro spot: Holtburg houses near the player's parked position (the user was trying on a house interior; exact house id unconfirmed — textures were missing across the interior). Frequency today: 2 of 3 launches on the 0.1 build.
  • #107 interference: logging in while parked INDOORS wedges the player (stuck in air/wall, 3-for-3 today — filed). For THIS investigation prefer ending test sessions with the character OUTDOORS so logins are clean. If wedged: relaunch; it intermittently recovers.
  • ACE session hold: graceful window close ⇒ ~35 s; hard kill ⇒ ~3 min of session failed (exit 29). The launch protocol + wait loop used all day is in this session's transcript; auto-entered player mode is the in-world marker.
  • ⚠️ PowerShell 5.1 Get-Content/Set-Content MANGLES UTF-8 source files (reads CP1252, writes mojibake — corrupted all four camera files today; recovered via git checkout + redoing edits with the Edit tool). Never bulk-edit source with PS5.1 string replace.
  • Tee-Object logs are UTF-16LE — Python analyzers must BOM-detect; PowerShell Select-String handles them natively.
  • Probes: ACDREAM_PROBE_SHELL is heavy-ish (per-prepare dumps) — short runs. The dat tripwires are always-on and free. NEVER judge visuals under ACDREAM_PROBE_FLAP.
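
Since the Tee-Object logs are UTF-16LE, a Python analyzer has to sniff the BOM before decoding. The following is a minimal sketch: the four tripwire tags are the ones listed in this document, while `decode_log` and `count_tripwires` are hypothetical helper names, and the fallback to UTF-8 for BOM-less input is an assumption.

```python
import codecs

TRIPWIRES = ("[dat-miss]", "[tex-miss]", "[tex-skip]", "[cell-miss]")


def decode_log(raw: bytes) -> str:
    """Decode log bytes using the BOM if present, else assume UTF-8."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    return raw.decode("utf-8", errors="replace")


def count_tripwires(text: str) -> dict:
    """Count how many lines carry each dat-side tripwire tag."""
    counts = {tag: 0 for tag in TRIPWIRES}
    for line in text.splitlines():
        for tag in TRIPWIRES:
            if tag in line:
                counts[tag] += 1
    return counts
```

Typical use is `count_tripwires(decode_log(open(path, "rb").read()))`; an all-zero result corresponds to the "0/0/0/0" check recorded in the run matrix.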

7. What ELSE is open (do not drift into these)

Priority order (set 2026-06-10, digest carries it): this investigation (#105+#110) → #107 indoor-login spawn wedge (physics; ACDREAM_CAPTURE_RESOLVE apparatus ready) → #108 cellar-ascent grass sweep + #109 far-exit-door oscillation (one render session, probe captures at their spots) → #99/A6.P4 per-cell shadow architecture (planned phase). The §4 flood strobe is FIXED (dac8f6a, user-gated); its conformance gate (CornerSweep_FloodIsCompleteAndMonotone) and the corner-seal characterization (CameraCornerSealReplayTests) must stay green through any change here.

8. Today's commit ledger (context for blame/diff archaeology)

| Commit | What |
|---|---|
| 682cba3 | [clip-route] probe apparatus (outdoor flap) |
| c4df241 | outdoor full-world flap FIX (EnvCellRenderer DepthMask(false) leak → depth-clear no-op) |
| df2ef7c | flap close-out doc |
| b21bb28 | corner-seal replay: camera-penetration hypothesis REFUTED (openings, not walls) |
| 482b0de | corner-seal handoff doc |
| dac8f6a | §4 flood strobe FIX (homogeneous reciprocal clip + collinear-aware dedup), user-gated |
| 137b4f2 | near plane 1.0→0.1 (retail znear) + issues #107–#109 filed |
| 8bd3492 | near plane reverted to 1.0 pending #110; #110 filed |

Test baseline: App 223, Core 1377 green + 4 pre-existing #99-era failures (DoorBugTrajectoryReplay ×2 / DoorCollisionApparatus / BSPStepUp) + 1 skip, UI 420, Net 294. ACE on 127.0.0.1:9000, testaccount/testpassword, +Acdream.