close #125: bounded upload retry kills the sticky-drop debt (failed GL uploads were never re-staged)

The GL root cause was fixed in fcade06 (the gpu_us query-ring stale
errors). This commit closes the remaining design debt: a genuinely
failed UploadMeshData call was dropped permanently and never retried.

Exact mechanism (traced this session): UploadMeshData's catch returns
null, the staged item is already consumed, and _renderData stays empty -
but the prepared data lingers in _cpuMeshCache, so the #128 EnsureLoaded
re-arm hits PrepareMeshDataAsync's CPU-cache short-circuit
(ObjectMeshManager.cs:448-453) which returns the cached data WITHOUT
re-staging it for upload. The mesh stays invisible until CPU-cache
eviction - session-sticky under low cache pressure (the in-tower
scenario).
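
A minimal Python model of that sticky path, for illustration only. The
real code is the C# ObjectMeshManager; every class and function name
below is hypothetical, and the upload is a stub rather than a GL call.

```python
from collections import deque


class StickyPipeline:
    """Toy model of the pre-fix pipeline; all names here are hypothetical."""

    def __init__(self):
        self.cpu_cache = {}     # stands in for _cpuMeshCache
        self.staged = deque()   # items waiting for the per-frame upload drain
        self.render_data = {}   # stands in for _renderData

    def prepare(self, mesh_id):
        # Cache-hit short-circuit: returns the data but does NOT re-stage it.
        if mesh_id in self.cpu_cache:
            return self.cpu_cache[mesh_id]
        data = f"mesh-{mesh_id}"
        self.cpu_cache[mesh_id] = data
        self.staged.append(mesh_id)
        return data

    def drain(self, upload):
        # The staged item is consumed before the upload result is known.
        while self.staged:
            mesh_id = self.staged.popleft()
            gpu_handle = upload(self.cpu_cache[mesh_id])
            if gpu_handle is not None:
                self.render_data[mesh_id] = gpu_handle
            # On failure (None) the item is simply gone from the queue.


def failing_upload(data):
    return None  # models UploadMeshData's catch returning null


def working_upload(data):
    return f"gpu:{data}"


pipeline = StickyPipeline()
pipeline.prepare(42)
pipeline.drain(failing_upload)     # the one genuine upload failure
pipeline.prepare(42)               # re-arm: cache hit, nothing re-staged
pipeline.drain(working_upload)     # queue is empty, so nothing is uploaded
print(42 in pipeline.render_data)  # prints False: the mesh stays invisible
```

In this model the mesh only recovers once the cache entry is evicted
and prepare() takes the slow path again, which mirrors the
session-sticky behaviour described above.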

Fix: the per-frame Tick drain (WbMeshAdapter) now re-stages a failed
upload for the NEXT frame via ObjectMeshManager.UploadOrRequeue, bounded
by MaxUploadRetries (3). The attempt counter lives on the ObjectMeshData
object so it resets to 0 naturally on re-prepare. Re-stages are
collected and re-enqueued AFTER the drain loop, never inside it, so a
deterministic failure cannot spin the queue within a single frame; past
the cap it gives up with a loud [up-retry] ... giving up line - a
genuine GL defect now surfaces instead of the old silent permanent drop
or an unbounded retry storm. Retail loads content synchronously and has
no such failure mode; this converges the async pipeline toward that
guarantee.
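
The drain-then-requeue shape can be sketched in Python as below. This
is a toy model, not the C# in WbMeshAdapter or ObjectMeshManager: tick,
MeshData, upload_attempts and the exact log wording are hypothetical,
and it assumes the cap means "one initial attempt plus up to three
retries", which the message above does not spell out.

```python
from collections import deque
from dataclasses import dataclass

MAX_UPLOAD_RETRIES = 3  # cap value taken from the commit message


@dataclass
class MeshData:
    mesh_id: int
    upload_attempts: int = 0  # a freshly prepared object starts at 0


def tick(staged, upload, render_data, log):
    """Drain the staged queue once; re-stage failures for the next call."""
    requeue = []
    while staged:
        data = staged.popleft()
        handle = upload(data)
        if handle is not None:
            render_data[data.mesh_id] = handle
            continue
        data.upload_attempts += 1
        if data.upload_attempts <= MAX_UPLOAD_RETRIES:
            requeue.append(data)  # retried on the NEXT tick, not this one
        else:
            # Hypothetical wording; the real line is only quoted in part above.
            log(f"[up-retry] mesh {data.mesh_id}: giving up")
    staged.extend(requeue)  # re-enqueue only after the drain loop has finished


def always_fails(data):
    return None  # models a deterministic upload failure


staged = deque([MeshData(mesh_id=7)])
render_data, log_lines = {}, []
frames = 0
while staged:
    tick(staged, always_fails, render_data, log_lines.append)
    frames += 1
print(frames, log_lines)
```

Because failures are appended to a separate list and merged back only
after the loop, each tick attempts a given item at most once, so a
deterministic failure costs one upload per frame until the cap is hit.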

The uncaught GenerateMipmaps path (open-question c) is INTENTIONALLY
left to surface errors - a blanket catch there would mask future real
defects (no-workarounds rule), and its trigger (fcade06) is retired.

No visual gate (robustness). Build green; App.Tests 264 + WbMeshAdapter
tests green. No GL-context test seam exists for the upload path, so the
bounded retry is verified by construction + the regression suite.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Erik 2026-06-13 10:27:26 +02:00
parent bf18a54369
commit 8682a8db70
3 changed files with 81 additions and 8 deletions


@@ -4453,8 +4453,9 @@ aperture instead of see-through to the world behind.
## #125 — GL InvalidOperation during staged texture upload: failed uploads are STICKY (never retried) + uncaught crash in GenerateMipmaps
-**Status:** ROOT CAUSE FIXED 2026-06-11 (`fcade06`, live-verified) —
-remaining: the sticky-drop design debt (below).
+**Status:** CLOSED 2026-06-12 — the GL root cause was fixed in `fcade06`
+(2026-06-11, live-verified); the remaining sticky-drop DESIGN DEBT is now
+fixed too (bounded upload retry, below). No visual gate (robustness).
**RESOLVED (root cause):** the GL errors were the gpu_us QUERY RING's own
— a glGenQueries name isn't a query object until first glBeginQuery, and
@@ -4469,11 +4470,28 @@ slot; read only begun queries. Live-verified in-tower: 0 [wb-error]
time under pview, meshMissing=0. **Normal runs (WB_DIAG off) never had
these errors — this mechanism is RETIRED for #119.**
-**Remaining debt (keep open under this number):** UploadMeshData removes
-the preparation task BEFORE uploading, so any genuinely-failed upload is
-never retried — permanently invisible mesh with one [wb-error] line.
-The trigger is gone but the design flaw isn't; add retry/re-prepare
-semantics in a maintenance pass.
+**Remaining debt — FIXED 2026-06-12 (bounded upload retry):** the exact
+stick was the CPU-cache short-circuit, not just the early `TryRemove`: a
+failed `UploadMeshData` (catch → null) consumed the staged item and left
+`_renderData` empty while the prepared data lingered in `_cpuMeshCache`,
+so `PrepareMeshDataAsync`'s cache-hit path (`ObjectMeshManager.cs:448-453`)
+returned it WITHOUT re-staging → never re-uploaded until CPU-cache
+eviction (effectively session-sticky under low cache pressure). Fix: the
+Tick drain (`WbMeshAdapter.cs`) now re-stages a failed upload for the NEXT
+frame via `ObjectMeshManager.UploadOrRequeue`, bounded by
+`MaxUploadRetries` (3) using a counter on the `ObjectMeshData` object
+(resets to 0 on re-prepare). Re-stages are collected and re-enqueued
+AFTER the drain loop — never inside it — so a deterministic failure can't
+spin the queue in one frame; past the cap it gives up with a loud
+`[up-retry] … giving up` line (surfaces a genuine GL defect instead of
+the old silent permanent drop). Retail loads synchronously and has no
+such failure mode; this converges the async pipeline toward that
+guarantee. Build + App.Tests (264) green; no GL-context test seam exists
+for the upload path so the retry is verified by construction + the
+regression suite. The uncaught `GenerateMipmaps` path (open-question c)
+is INTENTIONALLY left to surface errors — adding a blanket catch there
+would mask future real defects (no-workarounds rule); its trigger
+(`fcade06`) is already retired.
**Filed:** 2026-06-11 (in-tower WB_DIAG launch, `tower-wbdiag3.log` — preserved in the worktree root)
**Component:** render — WB staged texture pipeline (ObjectMeshManager / ManagedGLTextureArray)