acdream/docs/research/2026-05-10-tier1-retry-handoff.md
Erik 15376c7a73 docs(post-A.5): cold-start handoff for the Tier 1 retry session (#53)
After the post-A.5 lifestone (#52) + JobKind plumbing (#54) work shipped,
only Priority 3 (Tier 1 entity-classification cache retry, ISSUE #53)
remains. This handoff captures the audit insights gathered during the
#52 investigation that the original post-A.5 handoff didn't have:

- MeshRef is a `readonly record struct` — its fields can NOT be mutated
  in place. The actual per-frame mutation for animated entities is the
  entire MeshRefs LIST replacement at GameWindow.cs:7474-7553. This
  reframes the cache design.
- _animatedEntities dict at GameWindow.cs:160 is the source of truth
  for which entities go through the per-frame rebuild path.
- Static entity = entity.Id NOT in _animatedEntities. Its MeshRefs is
  the same instance from spawn until rare events (ObjDesc / palette
  swap / part hide / scale apply).
- Recommended cache approach: static-only with explicit invalidation
  hooks on the network/spawn-time write sites enumerated in the doc.

Doc covers: where main is, what shipped this session, why the first
Tier 1 attempt failed, the pre-started audit, cache design options,
acceptance criteria, files to read, workflow for the next session, and
things-to-NOT-do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 16:14:24 +02:00


Phase Post-A.5 — Tier 1 Retry (ISSUE #53) — Cold-Start Handoff

Created: 2026-05-10, immediately after closing ISSUES #52 (lifestone) + #54 (JobKind plumbing) and merging to main.
Audience: the next agent picking up Priority 3 of the Post-A.5 polish phase.
Purpose: drop straight into the Tier 1 entity-classification cache retry without re-litigating what the prior session settled.


TL;DR

Post-A.5 polish was sized at three priorities. Two of the three shipped to main during the 2026-05-10 session; only Priority 3 (Tier 1 retry, ISSUE #53) remains. Tier 1 is the largest remaining perf headroom in the post-A.5 phase: it should drop the entity dispatcher cpu_us median from ~3.5 ms to ~1.0-1.5 ms, putting the dispatcher inside the spec's 2.0 ms budget and unlocking ~300-400 FPS at standstill.

The first Tier 1 attempt (commit 3639a6f, reverted at 9b49009) broke animation. The next attempt MUST start with an animation-mutation audit. This handoff has the audit pre-started — there's specific evidence captured below that the previous handoff didn't have.

Sized: ~5-7 days including audit + design + spec + implementation + visual gate.


Where main is

  • main HEAD: da08490 — Merge of claude/cranky-varahamihira-fe423f. Includes the lifestone fix + JobKind plumbing.
  • CLAUDE.md "Currently in flight" updated to "Post-A.5 polish — Tier 1 retry (only remaining priority)".
  • docs/ISSUES.md has both #52 and #54 in Recently closed with full root-cause writeups; only #53 remains in Active issues.
  • N.5b conformance sentinel: 94/94. Full suite: 1688/1696 passing (8 pre-existing physics/input failures unchanged across all session work).

Recent commit chain on main (newest first):

SHA Subject
da08490 Merge branch 'claude/cranky-varahamihira-fe423f' — Post-A.5 polish: close #52 (lifestone) + #54 (JobKind plumbing)
9a55354 docs(post-A.5 #54): close JobKind plumbing issue + update CLAUDE.md flight status
bf31e59 fix(streaming): close #54 — plumb JobKind through BuildLandblockForStreaming
b19f1d1 docs(post-A.5 #52): close lifestone issue + update CLAUDE.md flight status
e40159f fix(render): close #52 — lifestone visible (alpha-test + cull + uDrawIDOffset)
c111312 docs(post-A.5): cold-start handoff for the next session (the prior handoff this work used)

What shipped this session

Priority 1 — ISSUE #52 (lifestone missing) — closed by e40159f

Three independent root causes regressed with the WB rendering migration (Phase N.5 retirement amendment, commit dcae2b6, 2026-05-08):

  1. Alpha-test discard in mesh_modern.frag transparent pass killed high-α pixels of dat-flagged transparent surfaces. The lifestone crystal core (surface 0x080011DE) decoded with α≥0.95, so 100% of fragments were discarded. Fix: remove α >= 0.95 discard from transparent pass; keep α < 0.05 discard as a fragment-cost optimization.
  2. Cull state regression: WbDrawDispatcher.Draw Phase 8 had no GL cull state — Phase 9.2's Enable(CullFace) + Back + CCW setup (commit 6f1971a, 2026-04-11) was lost when the legacy StaticMeshRenderer was deleted. Closed-shell translucents composited back-faces over front-faces in iteration order under DepthMask(false). Fix: re-establish Phase 9.2's GL state at the top of Phase 8.
  3. uDrawIDOffset indexing bug: gl_DrawIDARB resets to 0 at the start of each glMultiDrawElementsIndirect, so the transparent pass was reading Batches[0..transparentCount) (the OPAQUE section) instead of Batches[opaqueCount..end). The lifestone flickered to whatever opaque batch sorted to index 0 each frame. Fix: add uniform int uDrawIDOffset to mesh_modern.vert, set per-pass in dispatcher (0 for opaque, _opaqueDrawCount for transparent). Mirrors WB's BaseObjectRenderManager.cs:845.
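Root cause 3 comes down to index arithmetic, which can be sketched in C# without any GL. This is a simulation only, not the shader: `BatchIndex` is a hypothetical helper standing in for what mesh_modern.vert computes, and the batch names are made up.

```csharp
using System;
using System.Linq;

// Batches laid out as in the text: opaque section first, transparent section after it.
string[] batches = { "opaque0", "opaque1", "opaque2", "lifestoneCore", "glassPane" };
int opaqueCount = 3;
int transparentCount = batches.Length - opaqueCount;

// Buggy transparent pass: no offset, so draw IDs 0..1 land in the opaque section.
var buggy = Enumerable.Range(0, transparentCount)
    .Select(id => batches[DrawIndexing.BatchIndex(id, 0)]);
// Fixed transparent pass: offset by the opaque draw count.
var fixedPass = Enumerable.Range(0, transparentCount)
    .Select(id => batches[DrawIndexing.BatchIndex(id, opaqueCount)]);

Console.WriteLine(string.Join(", ", buggy));     // opaque0, opaque1
Console.WriteLine(string.Join(", ", fixedPass)); // lifestoneCore, glassPane

static class DrawIndexing
{
    // Hypothetical helper: the per-call draw ID restarts at 0 for each
    // multi-draw call, so the pass offset has to be added explicitly.
    public static int BatchIndex(int drawId, int drawIdOffset) => drawIdOffset + drawId;
}
```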

User-confirmed visually via +Acdream test character at the Holtburg outdoor lifestone (Z=94 platform).

Priority 2 — ISSUE #54 (JobKind plumbing) — closed by bf31e59

LandblockStreamer.cs primary ctor signature changed from Func<uint, LoadedLandblock?> to Func<uint, LandblockStreamJobKind, LoadedLandblock?>. A back-compat overload preserves the old signature for the 5 ctor sites in LandblockStreamerTests.cs (no test changes needed). BuildLandblockForStreaming(uint, JobKind) in GameWindow.cs early-outs for LoadFar with a heightmap-only path. The Bug A post-load entity strip in LandblockStreamer.HandleJob is retained as a Debug.Assert + Release safety net.

Per-LB worker cost on the far tier dropped from roughly tens of milliseconds (full hydration including LandBlockInfo + SceneryGenerator + interior cells) to under a millisecond (a single LandBlock dat read).
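The ctor change plus back-compat overload described above follows a common adapter pattern. A minimal sketch, assuming simplified stand-in types: `StreamerSketch`, `JobKindSketch`, `LoadedLandblockSketch`, and the `LoadNear` member are all hypothetical and are not the real LandblockStreamer API.

```csharp
#nullable enable
using System;

// Old-style loader (no job kind) still works through the back-compat overload.
var legacy = new StreamerSketch(id => new LoadedLandblockSketch(id, true));
// New-style loader can early-out for far-tier jobs.
var modern = new StreamerSketch((id, kind) =>
    new LoadedLandblockSketch(id, kind != JobKindSketch.LoadFar));

Console.WriteLine(legacy.Load(1, JobKindSketch.LoadFar)?.HasEntities); // True
Console.WriteLine(modern.Load(1, JobKindSketch.LoadFar)?.HasEntities); // False

enum JobKindSketch { LoadNear, LoadFar }

record LoadedLandblockSketch(uint Id, bool HasEntities);

class StreamerSketch
{
    private readonly Func<uint, JobKindSketch, LoadedLandblockSketch?> _build;

    // Primary ctor: the loader receives the job kind.
    public StreamerSketch(Func<uint, JobKindSketch, LoadedLandblockSketch?> build) => _build = build;

    // Back-compat overload: adapts the old delegate shape by ignoring the kind.
    public StreamerSketch(Func<uint, LoadedLandblockSketch?> build)
        : this((id, _) => build(id)) { }

    public LoadedLandblockSketch? Load(uint id, JobKindSketch kind) => _build(id, kind);
}
```

Because the two lambdas differ in parameter count, overload resolution picks the right ctor without casts, which is why existing call sites can stay unchanged in a design like this.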

Memory entry from this session

feedback_wb_migration_state_audit.md — captures the meta-lesson that WB-migration phases need a systematic GL-state and shader-uniform diff vs the legacy renderer being replaced. Future phases at risk: Sky/Particles modern path migration, EnvCell modern path, Shadow mapping. Also captures the workflow lesson: when the user says "we had this nailed down before", the first move is git log -- <legacy file> BEFORE adding new diagnostic instrumentation.


Priority 3 — ISSUE #53 — Tier 1 entity-classification cache retry

What the first attempt was and why it failed

Commit 3639a6f (reverted at 9b49009) cached meshRef.PartTransform baked into per-(entity, batch) classification at first-frame visit. For static entities this is stable; for animated entities the cache froze the pose and NPCs/players stopped animating. Some buildings also showed at wrong positions (likely entities incorrectly flagged as animated).

The "trust MeshRefs as the source of truth" comment in the dispatcher gave false confidence. MeshRefs IS the source of truth, but for animated entities the whole list is replaced EVERY frame.

The audit (PRE-STARTED in the prior session — read this carefully)

The previous handoff and ISSUE #53 describe the bug as "AnimationSequencer mutates meshRef.PartTransform every frame to apply the current skeletal pose." That framing is technically wrong in a way that matters for the retry design. Discovered during the post-A.5 lifestone session:

  • MeshRef at src/AcDream.Core/World/MeshRef.cs:15 is a readonly record struct — its fields cannot be mutated in place:
    public readonly record struct MeshRef(uint GfxObjId, Matrix4x4 PartTransform)
    
  • The actual per-frame mutation for animated entities is the entire MeshRefs LIST replacement at src/AcDream.App/Rendering/GameWindow.cs:7474-7553:
    var newMeshRefs = new List<AcDream.Core.World.MeshRef>(partCount);
    // ... loop building per-part transforms from sequencer.Advance(dt) ...
    ae.Entity.MeshRefs = newMeshRefs;
    
  • The source of truth for which entities go through that per-frame path is the _animatedEntities dictionary at GameWindow.cs:160:
    private readonly Dictionary<uint, AnimatedEntity> _animatedEntities = new();
    
    Population: _animatedEntities[entity.Id] = new AnimatedEntity{...} at GameWindow.cs:2724 (spawn). Removal: _animatedEntities.Remove(...) at GameWindow.cs:2935 (despawn).
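The distinction between "mutating a field" and "replacing the list" can be checked in isolation. A minimal sketch: the `MeshRef` declaration is copied from the bullet above, while `EntitySketch`, the GfxObjId value, and the translation are hypothetical placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

var entity = new EntitySketch();
entity.MeshRefs.Add(new MeshRef(0x01000001, Matrix4x4.Identity));
var before = entity.MeshRefs;

// entity.MeshRefs[0].PartTransform = ...;
// The line above does not compile: the indexer returns a copy and the property is init-only.

// What a per-frame rebuild has to do instead: build a new list, then swap the reference.
var posed = new List<MeshRef>(before.Count);
foreach (var m in before)
    posed.Add(m with { PartTransform = Matrix4x4.CreateTranslation(0, 0, 1) });
entity.MeshRefs = posed;

Console.WriteLine(ReferenceEquals(before, entity.MeshRefs)); // False
Console.WriteLine(before[0].PartTransform.IsIdentity);        // True

public readonly record struct MeshRef(uint GfxObjId, Matrix4x4 PartTransform);

class EntitySketch
{
    public List<MeshRef> MeshRefs { get; set; } = new();
}
```

The old list is untouched after the swap, so anything that cached values out of it keeps seeing the spawn-time pose. That is consistent with the freeze described for the first attempt.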

Therefore: a static entity is one whose Id is NOT in _animatedEntities. Its MeshRefs list is the same instance from spawn until rare events (ObjDesc / palette swap / part hide). Other static-entity write sites that must be invalidation-aware:

  • src/AcDream.App/Rendering/GameWindow.cs:2333 and :2365 — ObjDescEvent / AnimPartChange events rebuild a MeshRef element. Network-driven, infrequent.
  • src/AcDream.App/Rendering/GameWindow.cs:2524 — entity scale apply at spawn (one-shot).
  • Lines 4682-4924, 4996-5074 — dat-side hydration paths in OnLoad / scenery / interior. Spawn-time only.

What this means for cache design

The cleanest design is now clearer than the original handoff suggested:

Recommended approach (option a from the original handoff): static-only cache with explicit invalidation hooks.

  1. Cache the (entity, batch) → InstanceGroup-key + model-matrix mapping for entities where _animatedEntities.ContainsKey(entity.Id) == false.
  2. Animated entities skip the cache entirely; they go through today's per-frame ClassifyBatches path.
  3. Invalidate the cache for an entity on:
    • ObjDesc / AnimPartChange events (GameWindow.cs:2333, 2365) — rebuild that entity's cache entry.
    • Palette override changes (rare; usually only on initial server spawn or a re-equip event).
    • Entity despawn — drop the cache entry.
  4. Static entities never animate. The dispatcher's per-frame work for cached entities reduces from "walk + classify all batches" to "walk + lookup-and-emit-pre-classified".
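The four points above can be sketched as a small cache type. This is a sketch under stated assumptions, not the design: `StaticClassificationCache`, `ClassifiedBatches`, `GetOrClassify`, and `Invalidate` are hypothetical names, and the `isAnimated` delegate stands in for `_animatedEntities.ContainsKey`.

```csharp
using System;
using System.Collections.Generic;

var animated = new HashSet<uint> { 2 };
var cache = new StaticClassificationCache(animated.Contains);
int calls = 0;
ClassifiedBatches Classify(uint id) { calls++; return new ClassifiedBatches($"entity {id}"); }

cache.GetOrClassify(1, Classify); cache.GetOrClassify(1, Classify); // static: classified once
cache.GetOrClassify(2, Classify); cache.GetOrClassify(2, Classify); // animated: every call
cache.Invalidate(1);                                                // e.g. an ObjDesc event
cache.GetOrClassify(1, Classify);                                   // reclassified
Console.WriteLine(calls); // 4

// Hypothetical stand-in for whatever per-entity result ClassifyBatches produces.
sealed record ClassifiedBatches(string Description);

sealed class StaticClassificationCache
{
    private readonly Dictionary<uint, ClassifiedBatches> _entries = new();
    private readonly Func<uint, bool> _isAnimated;

    public StaticClassificationCache(Func<uint, bool> isAnimated) => _isAnimated = isAnimated;

    public int Count => _entries.Count;

    // Animated entities always reclassify; static ones classify once and reuse.
    public ClassifiedBatches GetOrClassify(uint entityId, Func<uint, ClassifiedBatches> classify)
    {
        if (_isAnimated(entityId))
            return classify(entityId);
        if (!_entries.TryGetValue(entityId, out var cached))
        {
            cached = classify(entityId);
            _entries[entityId] = cached;
        }
        return cached;
    }

    // Intended to be called from the ObjDesc / AnimPartChange / palette / despawn sites.
    public bool Invalidate(uint entityId) => _entries.Remove(entityId);
}
```

Checking the animated set on every lookup, rather than once at insert time, means an entity that becomes animated after spawn is bypassed immediately even if a stale entry is still present.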

Why this is safer than the first attempt: the first attempt cached the POSE (model matrix) for every entity, animated ones included. This attempt caches only spawn-time-stable data, and only for static entities: the group key, texture handle, blend mode, and per-part meshRef.PartTransform * entityWorld. Animation never enters the cache surface.

Cache design options reconsidered

(a) Static-only cache (recommended). As described above. Clean invariant: animated entities skip the cache; static entities go through it. Requires careful enumeration of all writes to entity.MeshRefs for static entities (see audit list above) so each one fires invalidation.

(b) Dynamic-aware cache with invalidation hooks. Cache everything but expose InvalidateEntity(uint) / RefreshEntityPalette(uint) hooks; wire from network handlers. More complex but might let some animated entities also benefit if their per-frame mutations are localized. NOT RECOMMENDED for a first retry — error-prone and the first attempt already failed at this scope.

(c) Static-only + animated-bypass + DEBUG cross-check. Like (a), but in DEBUG builds, log a warning every frame if a cached entity's MeshRefs reference no longer matches the cached snapshot (catches mis-classified dynamics). Belt-and-suspenders. Recommended IF you're nervous about the audit being incomplete.
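The cross-check in option (c) could be as small as a reference comparison against the list instance seen at cache time. A minimal sketch, assuming a hypothetical `MeshRefsSnapshotCheck` helper; in a real build the call site would sit behind `#if DEBUG` and log instead of returning a bool.

```csharp
using System;
using System.Collections.Generic;

var check = new MeshRefsSnapshotCheck();
var original = new List<int> { 1, 2, 3 };   // stands in for an entity's MeshRefs list
check.Remember(7, original);

Console.WriteLine(check.StillMatches(7, original));                // True
Console.WriteLine(check.StillMatches(7, new List<int>(original))); // False: same contents, new instance

sealed class MeshRefsSnapshotCheck
{
    private readonly Dictionary<uint, object> _snapshots = new();

    // Record the list instance that was current when the entity was cached.
    public void Remember(uint entityId, object meshRefsList) => _snapshots[entityId] = meshRefsList;

    // True only when the entity's current list is the very instance that was cached.
    public bool StillMatches(uint entityId, object currentMeshRefsList) =>
        _snapshots.TryGetValue(entityId, out var cached) && ReferenceEquals(cached, currentMeshRefsList);
}
```

One caveat: a reference check catches whole-list replacement but not a write site that replaces a single element in place. If the audit finds such a site, the cross-check would need an element-wise comparison or a version counter as well.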

Acceptance criteria (from the original handoff, refined)

  • Build green; the existing suite still passes (1688/1696 at the time of this handoff); the 8 pre-existing physics/input failures stay at 8.
  • 1-3 new tests covering: cache hit for static entity (lookup), cache bypass for animated entity (no-op), cache invalidation on entity despawn, cache invalidation on ObjDesc/palette event.
  • N.5b conformance sentinel intact (89+ tests; in this session it's 94/94 — must stay clean).
  • Visual gate: launch + walk Holtburg → North Yanshi at horizon-safe preset; confirm:
    • Animation works (NPCs, player character animate normally — including the lifestone crystal closed by #52).
    • Buildings at correct positions.
    • No new visual regressions.
  • Perf gate (with [WB-DIAG] under ACDREAM_WB_DIAG=1):
    • Entity dispatcher cpu_us median drops from ~3.5 ms to ≤2.0 ms (matches spec budget).
    • p95 stays ≤2.5 ms.
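The perf gate reduces to two percentile checks over captured cpu_us samples. A minimal sketch, assuming the samples have already been pulled out of the [WB-DIAG] output (its format is not assumed here); `PerfGate` is a hypothetical helper and the sample values are made up, not measurements.

```csharp
using System;
using System.Linq;

double[] samples = { 1200, 1300, 1250, 1400, 2400, 1350, 1280, 1310, 1290, 1330 };
Console.WriteLine(PerfGate.Percentile(samples, 50)); // 1300
Console.WriteLine(PerfGate.Percentile(samples, 95)); // 2400
Console.WriteLine(PerfGate.Passes(samples));         // True

static class PerfGate
{
    // Nearest-rank percentile over a sorted copy of the samples (p in (0, 100]).
    public static double Percentile(double[] samplesUs, double p)
    {
        if (samplesUs.Length == 0) throw new ArgumentException("no samples");
        var sorted = samplesUs.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }

    // Thresholds from the acceptance criteria: median <= 2.0 ms, p95 <= 2.5 ms.
    public static bool Passes(double[] samplesUs) =>
        Percentile(samplesUs, 50) <= 2000.0 && Percentile(samplesUs, 95) <= 2500.0;
}
```

Nearest-rank returns the lower middle value for an even sample count; with a few hundred frames per capture the difference from an interpolated median should be negligible.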

Files to read before brainstorming

In rough order:

  1. This handoff end-to-end — captures audit insights from the prior session that the original handoff didn't have.
  2. docs/research/2026-05-10-post-a5-polish-handoff.md: the prior handoff. §"Priority 3" has the original (slightly outdated) framing of the bug. Read it for context, but trust THIS handoff's audit insights over that document's.
  3. docs/ISSUES.md issue #53 — the issue's own description (now updated post-#52/#54 close).
  4. docs/superpowers/specs/2026-05-09-phase-a5-two-tier-streaming-design.md — A.5 spec for the entity dispatcher's data-flow context (esp. §4.10 Quality Preset and §11 deferred items).
  5. docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md — the perf-tier roadmap. Tier 1 is in scope; Tier 2 + Tier 3 are explicitly NOT (those are dedicated multi-week phases).
  6. memory/feedback_wb_migration_state_audit.md — the new memory entry on WB migration state-loss patterns. Tier 1 doesn't touch the WB migration directly, but the meta-lesson "audit before assume" is exactly what this priority needs.
  7. memory/project_phase_a5_state.md — the 5 gotchas. Critical for avoiding the same traps, especially #3 (caching mutable per-frame state breaks animation silently) — the exact bug the first Tier 1 attempt hit.
  8. src/AcDream.Core/World/MeshRef.cs — confirm the readonly record struct shape; understand that "mutating PartTransform" actually means "replacing the whole MeshRef record."
  9. src/AcDream.App/Rendering/GameWindow.cs:7340-7560 — the per-frame animation rebuild loop. Read this end-to-end for the audit. Find every line that writes to entity.MeshRefs for animated entities.
  10. src/AcDream.App/Rendering/GameWindow.cs:160 + lines 2710-2760, 2920-2940: _animatedEntities declaration + spawn/despawn population.
  11. src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs: Draw and ClassifyBatches. Where the cache will land.
  12. src/AcDream.Core/Physics/AnimationSequencer.cs — the per-frame animation engine. Audit any field it mutates that the dispatcher reads.
  13. src/AcDream.Core/Physics/AnimationHookRouter.cs — secondary mutation source via animation hooks.

Workflow for the next session

  1. Read this handoff in full.
  2. Verify build green: dotnet build. Verify ~1688 tests pass: dotnet test --no-build. Verify N.5b sentinel: filter TerrainSlot|TerrainModernConformance|Wb|MatrixComposition|TextureCacheBindless|SplitFormulaDivergence → expect 94 passing.
  3. Read the files above in order. Go especially deep on items 8-13 of "Files to read before brainstorming".
  4. Audit step (1-2 days): open a fresh research note docs/research/2026-05-10-tier1-mutation-audit.md and write down:
    • Every code path that writes entity.MeshRefs = ... for any entity.
    • Tag each as STATIC (one-shot at spawn or rare event) or DYNAMIC (per-frame).
    • For each STATIC write, identify the trigger (network event, scale apply, etc.) and design the invalidation hook.
    • For each DYNAMIC write, confirm it fires only for entities in _animatedEntities (which means cache bypass is the right answer).
  5. Spec (~1 day): brainstorm the cache design with the user (use superpowers:brainstorming). Write docs/superpowers/specs/2026-05-10-issue-53-tier1-cache-design.md. Include the audit findings, the chosen cache approach (probably option (a)), the invariants, the invalidation API, the test plan, the perf-gate measurement plan.
  6. Implement (~2-3 days): TDD via superpowers:test-driven-development. Tests first for cache hit/miss/invalidation, then implementation in WbDrawDispatcher. Wire invalidation hooks into the relevant write sites in GameWindow.cs.
  7. Visual gate: launch + walk; confirm animation works on a moving NPC; confirm static buildings/scenery still render at correct positions; confirm lifestone (closed by #52) still renders.
  8. Perf gate: capture [WB-DIAG] cpu_us median + p95 with ACDREAM_WB_DIAG=1 at horizon-safe preset (NEAR=4, FAR=12). Compare to today's ~3.5 ms baseline; expect ≤2.0 ms.
  9. Ship: commit, close #53 in ISSUES.md, update CLAUDE.md "Currently in flight" (this would close out the post-A.5 polish phase entirely), update memory with any new gotchas captured during the audit/implementation.
  10. Next phase after #53 ships: N.6 (perf polish) per the roadmap. Or escalate to Tier 2 (static/dynamic split with persistent groups) per docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md if Tier 1 alone doesn't hit the perf target.
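The audit in step 4 is essentially a table of write sites, and keeping it as data makes the "every STATIC write has a hook" rule checkable. A minimal sketch: `WriteSite`, `WriteKind`, and `Audit.Unhooked` are hypothetical names, the locations and triggers are taken from the audit list in this handoff, and the hook assignments are placeholders rather than findings.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

var sites = new List<WriteSite>
{
    new("GameWindow.cs:7474-7553", WriteKind.Dynamic, "per-frame animation rebuild", null),
    new("GameWindow.cs:2333", WriteKind.Static, "ObjDescEvent", "Invalidate(entityId)"),
    new("GameWindow.cs:2524", WriteKind.Static, "scale apply at spawn", null),
};

foreach (var s in Audit.Unhooked(sites))
    Console.WriteLine(s.Location); // GameWindow.cs:2524

enum WriteKind { Static, Dynamic }

// InvalidationHook stays null until a hook has been designed for the site.
record WriteSite(string Location, WriteKind Kind, string Trigger, string? InvalidationHook);

static class Audit
{
    // Static write sites without a hook still need a decision before cache code is written.
    public static IEnumerable<WriteSite> Unhooked(IEnumerable<WriteSite> sites) =>
        sites.Where(s => s.Kind == WriteKind.Static && s.InvalidationHook is null);
}
```

A site flagged here may turn out not to need a hook (for example, a one-shot write that always runs before the entity is first cached), but that conclusion belongs in the audit note rather than being left implicit.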

Things to NOT do

  • Don't skip the audit. The whole reason the first attempt failed was that the audit was implicit and incomplete. The audit step should produce a written list of every MeshRefs write site, classified static vs dynamic, before any cache code is written.
  • Don't bundle Tier 2 or Tier 3 into this phase. Those are dedicated multi-week phases per docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md. If the audit reveals Tier 1 alone can't hit the perf target, file a follow-up issue and escalate as a separate phase.
  • Don't re-add the Tier 1 cache that was reverted. Start fresh after the audit. Cherry-picking commit 3639a6f reintroduces the animation freeze.
  • Don't break the N.5b conformance sentinel. Run the filter on every commit:
    dotnet test --no-build --filter "FullyQualifiedName~TerrainSlot|FullyQualifiedName~TerrainModernConformance|FullyQualifiedName~Wb|FullyQualifiedName~MatrixComposition|FullyQualifiedName~TextureCacheBindless|FullyQualifiedName~SplitFormulaDivergence"
    
    Expect 94 passing, 0 failures.
  • Don't skip the visual gate. Animation has been the highest-risk regression in this codebase repeatedly (Tier 1 first attempt, the lifestone crystal in this session, the foundry statue earlier). Confirm visually with a moving animated NPC, a stationary building, and the lifestone before declaring done.
  • Don't trust "it was working in prod before." That was the first Tier 1 attempt's posture. The audit is what makes it actually safe.

Reference: Tier 1 perf math

Per the perf-tier roadmap and A.5 final state:

  • Today (post-A.5 ship + #52/#54): entity dispatcher cpu_us median ~3.5 ms at radius=12 on Radeon RX 9070 XT @ 1440p. ~200-240 FPS at standstill.
  • After Tier 1: ~1.0-1.5 ms median expected. ~300-400 FPS at standstill. Inside the spec's 2.0 ms budget.
  • After Tier 2 (separate phase): ~0.5-1.0 ms. ~400-600 FPS.
  • After Tier 3 (GPU compute culling, separate phase): ~0.05 ms. ~600-1000+ FPS.

Tier 1 is the lowest-risk, highest-leverage perf win remaining for the post-A.5 polish phase.


Good luck. The audit is the load-bearing thing — invest in it. The implementation is mechanical once the audit is solid.

Holler at the user if the audit reveals any write site that doesn't fit the static/dynamic dichotomy cleanly.