diff --git a/docs/research/2026-05-10-tier1-retry-handoff.md b/docs/research/2026-05-10-tier1-retry-handoff.md new file mode 100644 index 0000000..4612468 --- /dev/null +++ b/docs/research/2026-05-10-tier1-retry-handoff.md @@ -0,0 +1,203 @@ +# Phase Post-A.5 — Tier 1 Retry (ISSUE #53) — Cold-Start Handoff + +**Created:** 2026-05-10, immediately after closing ISSUES #52 (lifestone) + #54 (JobKind plumbing) and merging to main. +**Audience:** the next agent picking up Priority 3 of the Post-A.5 polish phase. +**Purpose:** drop straight into the Tier 1 entity-classification cache retry without re-litigating what the prior session settled. + +--- + +## TL;DR + +Post-A.5 polish was sized at three priorities. **2 of 3 shipped to main** during the 2026-05-10 session; **only Priority 3 (Tier 1 retry, ISSUE #53) remains.** Tier 1 is the biggest perf headroom in the post-A.5 phase: it should drop the entity dispatcher cpu_us median from ~3.5 ms to ~1-1.5 ms, putting the dispatcher inside the spec's 2.0 ms budget and unlocking ~300-400 FPS at standstill. + +The first Tier 1 attempt (commit `3639a6f`, reverted at `9b49009`) broke animation. The next attempt MUST start with an animation-mutation audit. **This handoff has the audit pre-started** — there's specific evidence captured below that the previous handoff didn't have. + +Sized: ~5-7 days including audit + design + spec + implementation + visual gate. + +--- + +## Where main is + +- **`main` HEAD: `da08490`** — Merge of `claude/cranky-varahamihira-fe423f`. Includes the lifestone fix + JobKind plumbing. +- **CLAUDE.md "Currently in flight"** updated to *"Post-A.5 polish — Tier 1 retry (only remaining priority)"*. +- **`docs/ISSUES.md`** has both #52 and #54 in *Recently closed* with full root-cause writeups; only #53 remains in *Active issues*. +- **N.5b conformance sentinel: 94/94.** Full suite: 1688/1696 passing (8 pre-existing physics/input failures unchanged across all session work). + +Recent commit chain on main (newest first): + +| SHA | Subject | +|---|---| +| `da08490` | Merge branch 'claude/cranky-varahamihira-fe423f' — Post-A.5 polish: close #52 (lifestone) + #54 (JobKind plumbing) | +| `9a55354` | docs(post-A.5 #54): close JobKind plumbing issue + update CLAUDE.md flight status | +| `bf31e59` | fix(streaming): close #54 — plumb JobKind through BuildLandblockForStreaming | +| `b19f1d1` | docs(post-A.5 #52): close lifestone issue + update CLAUDE.md flight status | +| `e40159f` | fix(render): close #52 — lifestone visible (alpha-test + cull + uDrawIDOffset) | +| `c111312` | docs(post-A.5): cold-start handoff for the next session (the prior handoff this work used) | + +--- + +## What shipped this session + +### Priority 1 — ISSUE #52 (lifestone missing) — closed by `e40159f` + +Three independent root causes regressed with the WB rendering migration (Phase N.5 retirement amendment, commit `dcae2b6`, 2026-05-08): + +1. **Alpha-test discard** in `mesh_modern.frag` transparent pass killed high-α pixels of dat-flagged transparent surfaces. The lifestone crystal core (surface `0x080011DE`) decoded with α≥0.95, so 100% of fragments were discarded. Fix: remove `α >= 0.95 discard` from transparent pass; keep `α < 0.05 discard` as a fragment-cost optimization. +2. **Cull state regression**: `WbDrawDispatcher.Draw` Phase 8 had no GL cull state — Phase 9.2's `Enable(CullFace) + Back + CCW` setup (commit `6f1971a`, 2026-04-11) was lost when the legacy `StaticMeshRenderer` was deleted. Closed-shell translucents composited back-faces over front-faces in iteration order under `DepthMask(false)`. Fix: re-establish Phase 9.2's GL state at the top of Phase 8. +3. **`uDrawIDOffset` indexing bug**: `gl_DrawIDARB` resets to 0 at the start of each `glMultiDrawElementsIndirect`, so the transparent pass was reading `Batches[0..transparentCount)` (the OPAQUE section) instead of `Batches[opaqueCount..end)`. The lifestone flickered to whatever opaque batch sorted to index 0 each frame. Fix: add `uniform int uDrawIDOffset` to `mesh_modern.vert`, set per-pass in dispatcher (0 for opaque, `_opaqueDrawCount` for transparent). Mirrors WB's `BaseObjectRenderManager.cs:845`. + +User-confirmed visually via `+Acdream` test character at the Holtburg outdoor lifestone (Z=94 platform). + +### Priority 2 — ISSUE #54 (JobKind plumbing) — closed by `bf31e59` + +`LandblockStreamer.cs` primary ctor signature changed from `Func` to `Func`. A back-compat overload preserves the old signature for the 5 ctor sites in `LandblockStreamerTests.cs` (no test changes needed). `BuildLandblockForStreaming(uint, JobKind)` in `GameWindow.cs` early-outs for `LoadFar` with a heightmap-only path. The Bug A post-load entity strip in `LandblockStreamer.HandleJob` is retained as a `Debug.Assert` + Release safety net. + +Per-LB worker cost on far-tier dropped from ~tens of ms (full hydration including `LandBlockInfo` + `SceneryGenerator` + interior cells) to ~sub-ms (single `LandBlock` dat read). + +### Memory entry from this session + +`feedback_wb_migration_state_audit.md` — captures the meta-lesson that WB-migration phases need a systematic GL-state and shader-uniform diff vs the legacy renderer being replaced. Future phases at risk: Sky/Particles modern path migration, EnvCell modern path, Shadow mapping. Also captures the workflow lesson: when the user says *"we had this nailed down before"*, the first move is `git log -- ` BEFORE adding new diagnostic instrumentation. + +--- + +## Priority 3 — ISSUE #53 — Tier 1 entity-classification cache retry + +### What the first attempt was and why it failed + +Commit `3639a6f` (reverted at `9b49009`) cached `meshRef.PartTransform` baked into per-(entity, batch) classification at first-frame visit. For static entities this is stable; for animated entities the cache froze the pose and NPCs/players stopped animating. Some buildings also showed at wrong positions (likely entities incorrectly flagged as animated). + +The "trust MeshRefs as the source of truth" comment in the dispatcher gave false confidence. MeshRefs IS the source of truth, but it's mutated EVERY frame for animated entities. + +### The audit (PRE-STARTED in the prior session — read this carefully) + +The previous handoff and ISSUE #53 describe the bug as *"AnimationSequencer mutates `meshRef.PartTransform` every frame to apply the current skeletal pose."* **That framing is technically wrong** in a way that matters for the retry design. Discovered during the post-A.5 lifestone session: + +- `MeshRef` at `src/AcDream.Core/World/MeshRef.cs:15` is a `readonly record struct` — its fields **cannot be mutated in place**: + ```csharp + public readonly record struct MeshRef(uint GfxObjId, Matrix4x4 PartTransform) + ``` +- The actual per-frame mutation for animated entities is the **entire `MeshRefs` LIST replacement** at `src/AcDream.App/Rendering/GameWindow.cs:7474-7553`: + ```csharp + var newMeshRefs = new List(partCount); + // ... loop building per-part transforms from sequencer.Advance(dt) ... + ae.Entity.MeshRefs = newMeshRefs; + ``` +- The source of truth for *which* entities go through that per-frame path is the `_animatedEntities` dictionary at `GameWindow.cs:160`: + ```csharp + private readonly Dictionary _animatedEntities = new(); + ``` + Population: `_animatedEntities[entity.Id] = new AnimatedEntity{...}` at GameWindow.cs:2724 (spawn). Removal: `_animatedEntities.Remove(...)` at GameWindow.cs:2935 (despawn). + +**Therefore: a static entity is one whose `Id` is NOT in `_animatedEntities`.** Its MeshRefs list is the same instance from spawn until rare events (ObjDesc / palette swap / part hide). Other static-entity write sites that must be invalidation-aware: +- `src/AcDream.App/Rendering/GameWindow.cs:2333` and `:2365` — ObjDescEvent / AnimPartChange events rebuild a `MeshRef` element. Network-driven, infrequent. +- `src/AcDream.App/Rendering/GameWindow.cs:2524` — entity scale apply at spawn (one-shot). +- Lines 4682-4924, 4996-5074 — dat-side hydration paths in OnLoad / scenery / interior. Spawn-time only. + +### What this means for cache design + +The cleanest design is now clearer than the original handoff suggested: + +**Recommended approach (option a from the original handoff): static-only cache with explicit invalidation hooks.** + +1. Cache the (entity, batch) → InstanceGroup-key + model-matrix mapping for entities where `_animatedEntities.ContainsKey(entity.Id) == false`. +2. Animated entities skip the cache entirely; they go through today's per-frame `ClassifyBatches` path. +3. Invalidate the cache for an entity on: + - **ObjDesc / AnimPartChange events** (`GameWindow.cs:2333, 2365`) — rebuild that entity's cache entry. + - **Palette override changes** (rare; usually only on initial server spawn or a re-equip event). + - **Entity despawn** — drop the cache entry. +4. Static entities never animate. The dispatcher's per-frame work for cached entities reduces from "walk + classify all batches" to "walk + lookup-and-emit-pre-classified". + +Why this is safer than the first attempt: the first attempt cached the POSE (model matrix). This attempt would cache only the (group key, texture handle, blend mode, per-part `meshRef.PartTransform * entityWorld` for the spawn-time stable subset). Animation never enters the cache surface. + +### Cache design options reconsidered + +(a) **Static-only cache (recommended).** As described above. Clean invariant: animated entities skip the cache; static entities go through it. Requires careful enumeration of all writes to `entity.MeshRefs` for static entities (see audit list above) so each one fires invalidation. + +(b) **Dynamic-aware cache with invalidation hooks.** Cache everything but expose `InvalidateEntity(uint)` / `RefreshEntityPalette(uint)` hooks; wire from network handlers. More complex but might let some animated entities also benefit if their per-frame mutations are localized. NOT RECOMMENDED for a first retry — error-prone and the first attempt already failed at this scope. + +(c) **Static-only + animated-bypass + DEBUG cross-check.** Like (a), but in DEBUG builds, log a warning every frame if a cached entity's `MeshRefs` reference no longer matches the cached snapshot (catches mis-classified dynamics). Belt-and-suspenders. Recommended IF you're nervous about the audit being incomplete. + +### Acceptance criteria (from the original handoff, refined) + +- Build green; existing 999+ tests pass; 8 pre-existing physics/input failures stay at 8. +- 1-3 new tests covering: cache hit for static entity (lookup), cache bypass for animated entity (no-op), cache invalidation on entity despawn, cache invalidation on ObjDesc/palette event. +- N.5b conformance sentinel intact (89+ tests; in this session it's 94/94 — must stay clean). +- Visual gate: launch + walk Holtburg → North Yanshi at horizon-safe preset; confirm: + - Animation works (NPCs, player character animate normally — including the lifestone crystal closed by #52). + - Buildings at correct positions. + - No new visual regressions. +- Perf gate (with `[WB-DIAG]` under `ACDREAM_WB_DIAG=1`): + - Entity dispatcher cpu_us median drops from ~3.5 ms to ≤2.0 ms (matches spec budget). + - p95 stays ≤2.5 ms. + +--- + +## Files to read before brainstorming + +In rough order: + +1. **This handoff** end-to-end — captures audit insights from the prior session that the original handoff didn't have. +2. **`docs/research/2026-05-10-post-a5-polish-handoff.md`** — the prior handoff. §"Priority 3" has the original (slightly outdated) framing of the bug. Read for context but trust THIS handoff's audit insights over its. +3. **`docs/ISSUES.md` issue #53** — the issue's own description (now updated post-#52/#54 close). +4. **`docs/superpowers/specs/2026-05-09-phase-a5-two-tier-streaming-design.md`** — A.5 spec for the entity dispatcher's data-flow context (esp. §4.10 Quality Preset and §11 deferred items). +5. **`docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md`** — the perf-tier roadmap. Tier 1 is in scope; Tier 2 + Tier 3 are explicitly NOT (those are dedicated multi-week phases). +6. **`memory/feedback_wb_migration_state_audit.md`** — the new memory entry on WB migration state-loss patterns. Tier 1 doesn't touch the WB migration directly, but the meta-lesson "audit before assume" is exactly what this priority needs. +7. **`memory/project_phase_a5_state.md`** — the 5 gotchas. **Critical for avoiding the same traps**, especially #3 (caching mutable per-frame state breaks animation silently) — the exact bug the first Tier 1 attempt hit. +8. **`src/AcDream.Core/World/MeshRef.cs`** — confirm the `readonly record struct` shape; understand that "mutating PartTransform" actually means "replacing the whole MeshRef record." +9. **`src/AcDream.App/Rendering/GameWindow.cs:7340-7560`** — the per-frame animation rebuild loop. Read this end-to-end for the audit. Find every line that writes to `entity.MeshRefs` for animated entities. +10. **`src/AcDream.App/Rendering/GameWindow.cs:160` + lines 2710-2760, 2920-2940** — `_animatedEntities` declaration + spawn/despawn population. +11. **`src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`** — `Draw` and `ClassifyBatches`. Where the cache will land. +12. **`src/AcDream.Core/Physics/AnimationSequencer.cs`** — the per-frame animation engine. Audit any field it mutates that the dispatcher reads. +13. **`src/AcDream.Core/Physics/AnimationHookRouter.cs`** — secondary mutation source via animation hooks. + +--- + +## Workflow for the next session + +1. **Read this handoff in full.** +2. **Verify build green:** `dotnet build`. Verify ~1688 tests pass: `dotnet test --no-build`. Verify N.5b sentinel: filter `TerrainSlot|TerrainModernConformance|Wb|MatrixComposition|TextureCacheBindless|SplitFormulaDivergence` → expect 94 passing. +3. **Read the files above** in order. Especially deep on §"Files to read" #8-#13. +4. **Audit step (1-2 days):** open a fresh research note `docs/research/2026-05-10-tier1-mutation-audit.md` and write down: + - Every code path that writes `entity.MeshRefs = ...` for any entity. + - Tag each as **STATIC** (one-shot at spawn or rare event) or **DYNAMIC** (per-frame). + - For each STATIC write, identify the trigger (network event, scale apply, etc.) and design the invalidation hook. + - For each DYNAMIC write, confirm it fires only for entities in `_animatedEntities` (which means cache bypass is the right answer). +5. **Spec (~1 day):** brainstorm the cache design with the user (use `superpowers:brainstorming`). Write `docs/superpowers/specs/2026-05-10-issue-53-tier1-cache-design.md`. Include the audit findings, the chosen cache approach (probably option (a)), the invariants, the invalidation API, the test plan, the perf-gate measurement plan. +6. **Implement (~2-3 days):** TDD via `superpowers:test-driven-development`. Tests first for cache hit/miss/invalidation, then implementation in `WbDrawDispatcher`. Wire invalidation hooks into the relevant write sites in `GameWindow.cs`. +7. **Visual gate:** launch + walk; confirm animation works on a moving NPC; confirm static buildings/scenery still render at correct positions; confirm lifestone (closed by #52) still renders. +8. **Perf gate:** capture `[WB-DIAG]` cpu_us median + p95 with `ACDREAM_WB_DIAG=1` at horizon-safe preset (NEAR=4, FAR=12). Compare to today's ~3.5 ms baseline; expect ≤2.0 ms. +9. **Ship:** commit, close #53 in ISSUES.md, update CLAUDE.md "Currently in flight" (this would close out the post-A.5 polish phase entirely), update memory with any new gotchas captured during the audit/implementation. +10. **Next phase after #53 ships:** N.6 (perf polish) per the roadmap. Or escalate to Tier 2 (static/dynamic split with persistent groups) per `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` if Tier 1 alone doesn't hit the perf target. + +--- + +## Things to NOT do + +- **Don't skip the audit.** The whole reason the first attempt failed was that the audit was implicit and incomplete. The audit step should produce a written list of every MeshRefs write site, classified static vs dynamic, before any cache code is written. +- **Don't bundle Tier 2 or Tier 3 into this phase.** Those are dedicated multi-week phases per `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md`. If the audit reveals Tier 1 alone can't hit the perf target, file a follow-up issue and escalate as a separate phase. +- **Don't re-add the `Tier1` cache that was reverted.** Start fresh after the audit. Cherry-picking commit `3639a6f` reintroduces the animation freeze. +- **Don't break the N.5b conformance sentinel.** Run the filter on every commit: + ``` + dotnet test --no-build --filter "FullyQualifiedName~TerrainSlot|FullyQualifiedName~TerrainModernConformance|FullyQualifiedName~Wb|FullyQualifiedName~MatrixComposition|FullyQualifiedName~TextureCacheBindless|FullyQualifiedName~SplitFormulaDivergence" + ``` + Expect 94 passing, 0 failures. +- **Don't skip the visual gate.** Animation has been the highest-risk regression in this codebase repeatedly (Tier 1 first attempt, the lifestone crystal in this session, the foundry statue earlier). Confirm visually with a moving animated NPC, a stationary building, and the lifestone before declaring done. +- **Don't trust "it was working in prod before."** That was the first Tier 1 attempt's posture. The audit is what makes it actually safe. + +--- + +## Reference: Tier 1 perf math + +Per the perf-tier roadmap and A.5 final state: +- **Today** (post-A.5 ship + #52/#54): entity dispatcher cpu_us median ~3.5 ms at radius=12 on Radeon RX 9070 XT @ 1440p. ~200-240 FPS at standstill. +- **After Tier 1**: ~1.0-1.5 ms median expected. ~300-400 FPS at standstill. Inside the spec's 2.0 ms budget. +- **After Tier 2 (separate phase)**: ~0.5-1.0 ms. ~400-600 FPS. +- **After Tier 3 (GPU compute culling, separate phase)**: ~0.05 ms. ~600-1000+ FPS. + +Tier 1 is the lowest-risk, highest-leverage perf win remaining for the post-A.5 polish phase. + +--- + +Good luck. The audit is the load-bearing thing — invest in it. The implementation is mechanical once the audit is solid. + +Holler at the user if any of the audit reveals a write site that doesn't fit the static/dynamic dichotomy cleanly.