After the post-A.5 lifestone (#52) + JobKind plumbing (#54) work shipped, only Priority 3 (Tier 1 entity-classification cache retry, ISSUE #53) remains. This handoff captures the audit insights gathered during the #52 investigation that the original post-A.5 handoff didn't have:

- MeshRef is a `readonly record struct`; its fields can NOT be mutated in place. The actual per-frame mutation for animated entities is the entire MeshRefs LIST replacement at GameWindow.cs:7474-7553. This reframes the cache design.
- _animatedEntities dict at GameWindow.cs:160 is the source of truth for which entities go through the per-frame rebuild path.
- Static entity = entity.Id NOT in _animatedEntities. Its MeshRefs is the same instance from spawn until rare events (ObjDesc / palette swap / part hide / scale apply).
- Recommended cache approach: static-only with explicit invalidation hooks on the network/spawn-time write sites enumerated in the doc.

Doc covers: where main is, what shipped this session, why the first Tier 1 attempt failed, the pre-started audit, cache design options, acceptance criteria, files to read, workflow for the next session, and things-to-NOT-do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
203 lines
18 KiB
Markdown
# Phase Post-A.5 — Tier 1 Retry (ISSUE #53) — Cold-Start Handoff

**Created:** 2026-05-10, immediately after closing ISSUES #52 (lifestone) + #54 (JobKind plumbing) and merging to main.
**Audience:** the next agent picking up Priority 3 of the Post-A.5 polish phase.
**Purpose:** drop straight into the Tier 1 entity-classification cache retry without re-litigating what the prior session settled.

---
## TL;DR

Post-A.5 polish was sized at three priorities. **2 of 3 shipped to main** during the 2026-05-10 session; **only Priority 3 (Tier 1 retry, ISSUE #53) remains.** Tier 1 is the biggest perf headroom in the post-A.5 phase: it should drop the entity dispatcher cpu_us median from ~3.5 ms to ~1-1.5 ms, putting the dispatcher inside the spec's 2.0 ms budget and unlocking ~300-400 FPS at standstill.

The first Tier 1 attempt (commit `3639a6f`, reverted at `9b49009`) broke animation. The next attempt MUST start with an animation-mutation audit. **This handoff has the audit pre-started** — there's specific evidence captured below that the previous handoff didn't have.

Sized: ~5-7 days including audit + design + spec + implementation + visual gate.

---
## Where main is

- **`main` HEAD: `da08490`** — Merge of `claude/cranky-varahamihira-fe423f`. Includes the lifestone fix + JobKind plumbing.
- **CLAUDE.md "Currently in flight"** updated to *"Post-A.5 polish — Tier 1 retry (only remaining priority)"*.
- **`docs/ISSUES.md`** has both #52 and #54 in *Recently closed* with full root-cause writeups; only #53 remains in *Active issues*.
- **N.5b conformance sentinel: 94/94.** Full suite: 1688/1696 passing (8 pre-existing physics/input failures unchanged across all session work).

Recent commit chain on main (newest first):

| SHA | Subject |
|---|---|
| `da08490` | Merge branch 'claude/cranky-varahamihira-fe423f' — Post-A.5 polish: close #52 (lifestone) + #54 (JobKind plumbing) |
| `9a55354` | docs(post-A.5 #54): close JobKind plumbing issue + update CLAUDE.md flight status |
| `bf31e59` | fix(streaming): close #54 — plumb JobKind through BuildLandblockForStreaming |
| `b19f1d1` | docs(post-A.5 #52): close lifestone issue + update CLAUDE.md flight status |
| `e40159f` | fix(render): close #52 — lifestone visible (alpha-test + cull + uDrawIDOffset) |
| `c111312` | docs(post-A.5): cold-start handoff for the next session (the prior handoff this work used) |

---
## What shipped this session

### Priority 1 — ISSUE #52 (lifestone missing) — closed by `e40159f`

Three independent root causes regressed with the WB rendering migration (Phase N.5 retirement amendment, commit `dcae2b6`, 2026-05-08):

1. **Alpha-test discard** in `mesh_modern.frag` transparent pass killed high-α pixels of dat-flagged transparent surfaces. The lifestone crystal core (surface `0x080011DE`) decoded with α≥0.95, so 100% of fragments were discarded. Fix: remove `α >= 0.95 discard` from transparent pass; keep `α < 0.05 discard` as a fragment-cost optimization.
2. **Cull state regression**: `WbDrawDispatcher.Draw` Phase 8 had no GL cull state — Phase 9.2's `Enable(CullFace) + Back + CCW` setup (commit `6f1971a`, 2026-04-11) was lost when the legacy `StaticMeshRenderer` was deleted. Closed-shell translucents composited back-faces over front-faces in iteration order under `DepthMask(false)`. Fix: re-establish Phase 9.2's GL state at the top of Phase 8.
3. **`uDrawIDOffset` indexing bug**: `gl_DrawIDARB` resets to 0 at the start of each `glMultiDrawElementsIndirect`, so the transparent pass was reading `Batches[0..transparentCount)` (the OPAQUE section) instead of `Batches[opaqueCount..end)`. The lifestone flickered to whatever opaque batch sorted to index 0 each frame. Fix: add `uniform int uDrawIDOffset` to `mesh_modern.vert`, set per-pass in dispatcher (0 for opaque, `_opaqueDrawCount` for transparent). Mirrors WB's `BaseObjectRenderManager.cs:845`.

User-confirmed visually via `+Acdream` test character at the Holtburg outdoor lifestone (Z=94 platform).
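
The indexing bug in root cause 3 can be modeled on the CPU. This is a minimal sketch, not the shader code: `DrawIdOffsetDemo` and `BatchesRead` are hypothetical names, the batch array is made-up data, and the loop variable stands in for `gl_DrawIDARB`, which restarts at 0 for each multi-draw call.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical CPU-side model of the per-draw batch lookup.
// In the real renderer this lookup happens in mesh_modern.vert.
public static class DrawIdOffsetDemo
{
    // Returns the batch that each draw of one multi-draw call would read.
    // drawId restarts at 0 for every call, like gl_DrawIDARB does.
    public static List<string> BatchesRead(string[] batches, int drawCount, int drawIdOffset)
    {
        var read = new List<string>(drawCount);
        for (int drawId = 0; drawId < drawCount; drawId++)
        {
            // Without the offset, the second pass indexes from 0 again.
            read.Add(batches[drawIdOffset + drawId]);
        }
        return read;
    }
}
```

With three opaque batches followed by two transparent ones, `BatchesRead(batches, 2, 0)` returns the first two opaque batches (the bug), while `BatchesRead(batches, 2, 3)` returns the two transparent batches, which is what passing `_opaqueDrawCount` as the offset achieves.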
### Priority 2 — ISSUE #54 (JobKind plumbing) — closed by `bf31e59`

`LandblockStreamer.cs` primary ctor signature changed from `Func<uint, LoadedLandblock?>` to `Func<uint, LandblockStreamJobKind, LoadedLandblock?>`. A back-compat overload preserves the old signature for the 5 ctor sites in `LandblockStreamerTests.cs` (no test changes needed). `BuildLandblockForStreaming(uint, JobKind)` in `GameWindow.cs` early-outs for `LoadFar` with a heightmap-only path. The Bug A post-load entity strip in `LandblockStreamer.HandleJob` is retained as a `Debug.Assert` + Release safety net.

Per-landblock worker cost on the far tier dropped from tens of milliseconds (full hydration, including `LandBlockInfo`, `SceneryGenerator`, and interior cells) to under a millisecond (a single `LandBlock` dat read).
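
The back-compat overload pattern described above can be sketched as follows. This is an illustration under assumptions, not the shipped code: `StreamerSketch`, the two-field `LoadedLandblock` record, and the `LoadNear` enum member are hypothetical stand-ins; only the delegate shapes and `LoadFar` come from the text.

```csharp
#nullable enable
using System;

// Hypothetical stand-ins for the real types.
public enum LandblockStreamJobKind { LoadNear, LoadFar }
public sealed record LoadedLandblock(uint Id, bool HasEntities);

public sealed class StreamerSketch
{
    private readonly Func<uint, LandblockStreamJobKind, LoadedLandblock?> _build;

    // New primary shape: the builder is told which kind of job it is serving.
    public StreamerSketch(Func<uint, LandblockStreamJobKind, LoadedLandblock?> build)
        => _build = build ?? throw new ArgumentNullException(nameof(build));

    // Back-compat overload: adapts an old-style builder by ignoring the job kind,
    // so existing call sites keep compiling unchanged.
    public StreamerSketch(Func<uint, LoadedLandblock?> build)
        : this(Wrap(build)) { }

    private static Func<uint, LandblockStreamJobKind, LoadedLandblock?> Wrap(
        Func<uint, LoadedLandblock?> build)
    {
        if (build is null) throw new ArgumentNullException(nameof(build));
        return (id, _) => build(id);
    }

    public LoadedLandblock? Load(uint id, LandblockStreamJobKind kind) => _build(id, kind);
}
```

A two-parameter lambda binds to the new constructor and a one-parameter lambda binds to the overload, so a builder that wants the far-tier early-out can branch on `kind` while older callers are unaffected.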
### Memory entry from this session

`feedback_wb_migration_state_audit.md` — captures the meta-lesson that WB-migration phases need a systematic GL-state and shader-uniform diff vs the legacy renderer being replaced. Future phases at risk: Sky/Particles modern path migration, EnvCell modern path, Shadow mapping. Also captures the workflow lesson: when the user says *"we had this nailed down before"*, the first move is `git log -- <legacy file>` BEFORE adding new diagnostic instrumentation.

---
## Priority 3 — ISSUE #53 — Tier 1 entity-classification cache retry

### What the first attempt was and why it failed

Commit `3639a6f` (reverted at `9b49009`) cached `meshRef.PartTransform` baked into per-(entity, batch) classification at first-frame visit. For static entities this is stable; for animated entities the cache froze the pose and NPCs/players stopped animating. Some buildings also showed at wrong positions (likely entities incorrectly flagged as animated).

The "trust MeshRefs as the source of truth" comment in the dispatcher gave false confidence. MeshRefs IS the source of truth, but for animated entities it changes EVERY frame.
### The audit (PRE-STARTED in the prior session — read this carefully)

The previous handoff and ISSUE #53 describe the bug as *"AnimationSequencer mutates `meshRef.PartTransform` every frame to apply the current skeletal pose."* **That framing is technically wrong** in a way that matters for the retry design. Discovered during the post-A.5 lifestone session:

- `MeshRef` at `src/AcDream.Core/World/MeshRef.cs:15` is a `readonly record struct` — its fields **cannot be mutated in place**:
  ```csharp
  public readonly record struct MeshRef(uint GfxObjId, Matrix4x4 PartTransform)
  ```
- The actual per-frame mutation for animated entities is the **entire `MeshRefs` LIST replacement** at `src/AcDream.App/Rendering/GameWindow.cs:7474-7553`:
  ```csharp
  var newMeshRefs = new List<AcDream.Core.World.MeshRef>(partCount);
  // ... loop building per-part transforms from sequencer.Advance(dt) ...
  ae.Entity.MeshRefs = newMeshRefs;
  ```
- The source of truth for *which* entities go through that per-frame path is the `_animatedEntities` dictionary at `GameWindow.cs:160`:
  ```csharp
  private readonly Dictionary<uint, AnimatedEntity> _animatedEntities = new();
  ```
  Population: `_animatedEntities[entity.Id] = new AnimatedEntity{...}` at GameWindow.cs:2724 (spawn). Removal: `_animatedEntities.Remove(...)` at GameWindow.cs:2935 (despawn).

**Therefore: a static entity is one whose `Id` is NOT in `_animatedEntities`.** Its MeshRefs list is the same instance from spawn until rare events (ObjDesc / palette swap / part hide). Other static-entity write sites that must be invalidation-aware:

- `src/AcDream.App/Rendering/GameWindow.cs:2333` and `:2365` — ObjDescEvent / AnimPartChange events rebuild a `MeshRef` element. Network-driven, infrequent.
- `src/AcDream.App/Rendering/GameWindow.cs:2524` — entity scale apply at spawn (one-shot).
- Lines 4682-4924, 4996-5074 — dat-side hydration paths in OnLoad / scenery / interior. Spawn-time only.
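
The difference between mutating a `MeshRef` and replacing the `MeshRefs` list can be seen in a small standalone sketch. `EntitySketch` and `AdvanceAnimation` are hypothetical stand-ins for the real entity type and the rebuild loop, and the rotation is a placeholder for whatever the sequencer produces; only the `MeshRef` shape is taken from the source quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public readonly record struct MeshRef(uint GfxObjId, Matrix4x4 PartTransform);

// Hypothetical stand-in for the real entity type.
public sealed class EntitySketch
{
    public uint Id { get; init; }
    public List<MeshRef> MeshRefs { get; set; } = new();
}

public static class MeshRefMutationDemo
{
    // Stand-in for the per-frame rebuild: builds a NEW list and swaps it in.
    public static void AdvanceAnimation(EntitySketch entity, float angle)
    {
        var newMeshRefs = new List<MeshRef>(entity.MeshRefs.Count);
        foreach (var part in entity.MeshRefs)
        {
            // `part.PartTransform = ...` would not compile (init-only on a readonly
            // record struct); `with` produces a fresh copy instead.
            newMeshRefs.Add(part with { PartTransform = Matrix4x4.CreateRotationZ(angle) });
        }
        entity.MeshRefs = newMeshRefs;
    }
}
```

Anything that captured the old list, or a `PartTransform` read from it, keeps the pre-animation values after `AdvanceAnimation` runs. That is the same stale-snapshot effect that froze poses in the first attempt.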
### What this means for cache design

The cleanest design is now clearer than the original handoff suggested:

**Recommended approach (option a from the original handoff): static-only cache with explicit invalidation hooks.**

1. Cache the (entity, batch) → InstanceGroup-key + model-matrix mapping for entities where `_animatedEntities.ContainsKey(entity.Id) == false`.
2. Animated entities skip the cache entirely; they go through today's per-frame `ClassifyBatches` path.
3. Invalidate the cache for an entity on:
   - **ObjDesc / AnimPartChange events** (`GameWindow.cs:2333, 2365`) — rebuild that entity's cache entry.
   - **Palette override changes** (rare; usually only on initial server spawn or a re-equip event).
   - **Entity despawn** — drop the cache entry.
4. Static entities never animate. The dispatcher's per-frame work for cached entities reduces from "walk + classify all batches" to "walk + lookup-and-emit-pre-classified".

Why this is safer than the first attempt: the first attempt cached the POSE (model matrix) for every entity, animated ones included. This attempt would cache only the group key, texture handle, blend mode, and per-part `meshRef.PartTransform * entityWorld`, and only for the static subset whose values are stable from spawn. Animation never enters the cache surface.
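
The four rules above can be sketched as a small cache type. This is one possible shape under stated assumptions, not the design the spec will settle on: `StaticClassificationCache`, `GetOrClassify`, and `Invalidate` are hypothetical names, `TEntry` stands in for whatever per-entity classification result the dispatcher stores, and the `isAnimated` predicate is assumed to be backed by `_animatedEntities.ContainsKey`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical static-only cache: animated entities always bypass it.
public sealed class StaticClassificationCache<TEntry>
{
    private readonly Dictionary<uint, TEntry> _entries = new();
    private readonly Func<uint, bool> _isAnimated;

    public StaticClassificationCache(Func<uint, bool> isAnimated)
        => _isAnimated = isAnimated ?? throw new ArgumentNullException(nameof(isAnimated));

    public int Count => _entries.Count;

    // Static entities: classify once, then reuse. Animated entities: classify
    // every call and never store the result.
    public TEntry GetOrClassify(uint entityId, Func<uint, TEntry> classify)
    {
        if (_isAnimated(entityId))
        {
            _entries.Remove(entityId); // drop any entry cached before it became animated
            return classify(entityId);
        }
        if (!_entries.TryGetValue(entityId, out var entry))
        {
            entry = classify(entityId);
            _entries[entityId] = entry;
        }
        return entry;
    }

    // Called from the ObjDesc / palette / despawn hooks; returns true if an entry existed.
    public bool Invalidate(uint entityId) => _entries.Remove(entityId);
}
```

Checking the predicate on every call, rather than only at insertion, is a deliberate choice in this sketch: an entity that enters `_animatedEntities` after it was cached falls back to the per-frame path without needing its own hook.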
### Cache design options reconsidered

(a) **Static-only cache (recommended).** As described above. Clean invariant: animated entities skip the cache; static entities go through it. Requires careful enumeration of all writes to `entity.MeshRefs` for static entities (see audit list above) so each one fires invalidation.

(b) **Dynamic-aware cache with invalidation hooks.** Cache everything but expose `InvalidateEntity(uint)` / `RefreshEntityPalette(uint)` hooks; wire from network handlers. More complex but might let some animated entities also benefit if their per-frame mutations are localized. NOT RECOMMENDED for a first retry — error-prone and the first attempt already failed at this scope.

(c) **Static-only + animated-bypass + DEBUG cross-check.** Like (a), but in DEBUG builds, log a warning every frame if a cached entity's `MeshRefs` reference no longer matches the cached snapshot (catches mis-classified dynamics). Belt-and-suspenders. Recommended IF you're nervous about the audit being incomplete.
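
The DEBUG cross-check in option (c) amounts to a reference comparison against the list instance seen at cache time. A minimal sketch, assuming a hypothetical `MeshRefsSnapshotGuard` helper that the dispatcher would call only inside an `#if DEBUG` block:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers which list instance each cached entity had.
public sealed class MeshRefsSnapshotGuard
{
    private readonly Dictionary<uint, object> _snapshots = new();

    // Call when the entity's classification is cached.
    public void Record(uint entityId, object meshRefs) => _snapshots[entityId] = meshRefs;

    // Call on invalidation or despawn.
    public void Forget(uint entityId) => _snapshots.Remove(entityId);

    // True only if the entity's current list is the very instance recorded earlier.
    public bool StillValid(uint entityId, object currentMeshRefs)
        => _snapshots.TryGetValue(entityId, out var snapshot)
           && ReferenceEquals(snapshot, currentMeshRefs);
}
```

One limitation worth keeping in mind: a reference comparison notices a replaced list but not an element overwritten in place (`list[i] = ...`). Whether any static write site works that way is something the audit would need to confirm before relying on this check alone.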
### Acceptance criteria (from the original handoff, refined)

- Build green; existing tests pass (1688/1696 as of this session); 8 pre-existing physics/input failures stay at 8.
- 1-3 new tests covering: cache hit for static entity (lookup), cache bypass for animated entity (no-op), cache invalidation on entity despawn, cache invalidation on ObjDesc/palette event.
- N.5b conformance sentinel intact (89+ tests; in this session it's 94/94 — must stay clean).
- Visual gate: launch + walk Holtburg → North Yanshi at horizon-safe preset; confirm:
  - Animation works (NPCs, player character animate normally — including the lifestone crystal closed by #52).
  - Buildings at correct positions.
  - No new visual regressions.
- Perf gate (with `[WB-DIAG]` under `ACDREAM_WB_DIAG=1`):
  - Entity dispatcher cpu_us median drops from ~3.5 ms to ≤2.0 ms (matches spec budget).
  - p95 stays ≤2.5 ms.
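
The perf gate above reduces to two percentile checks over captured `cpu_us` samples. The sketch below assumes the samples have already been parsed out of the `[WB-DIAG]` output into an array of microsecond values; `PerfGate` is a hypothetical helper, and the nearest-rank percentile used here is one common definition, not necessarily the one the diagnostics use.

```csharp
using System;
using System.Linq;

// Hypothetical helper for evaluating the perf gate offline.
public static class PerfGate
{
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    public static double Percentile(double[] samples, double p)
    {
        if (samples is null || samples.Length == 0)
            throw new ArgumentException("no samples", nameof(samples));
        var sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }

    // Budgets in microseconds: 2.0 ms median and 2.5 ms p95, per the criteria above.
    public static bool Passes(double[] cpuUs, double medianBudgetUs = 2000.0, double p95BudgetUs = 2500.0)
        => Percentile(cpuUs, 50) <= medianBudgetUs && Percentile(cpuUs, 95) <= p95BudgetUs;
}
```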
---
## Files to read before brainstorming

In rough order:

1. **This handoff** end-to-end — captures audit insights from the prior session that the original handoff didn't have.
2. **`docs/research/2026-05-10-post-a5-polish-handoff.md`** — the prior handoff. §"Priority 3" has the original (slightly outdated) framing of the bug. Read it for context, but where the two disagree, trust THIS handoff's audit insights.
3. **`docs/ISSUES.md` issue #53** — the issue's own description (now updated post-#52/#54 close).
4. **`docs/superpowers/specs/2026-05-09-phase-a5-two-tier-streaming-design.md`** — A.5 spec for the entity dispatcher's data-flow context (esp. §4.10 Quality Preset and §11 deferred items).
5. **`docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md`** — the perf-tier roadmap. Tier 1 is in scope; Tier 2 + Tier 3 are explicitly NOT (those are dedicated multi-week phases).
6. **`memory/feedback_wb_migration_state_audit.md`** — the new memory entry on WB migration state-loss patterns. Tier 1 doesn't touch the WB migration directly, but the meta-lesson "audit before assume" is exactly what this priority needs.
7. **`memory/project_phase_a5_state.md`** — the 5 gotchas. **Critical for avoiding the same traps**, especially #3 (caching mutable per-frame state breaks animation silently) — the exact bug the first Tier 1 attempt hit.
8. **`src/AcDream.Core/World/MeshRef.cs`** — confirm the `readonly record struct` shape; understand that "mutating PartTransform" actually means "replacing the whole MeshRef record."
9. **`src/AcDream.App/Rendering/GameWindow.cs:7340-7560`** — the per-frame animation rebuild loop. Read this end-to-end for the audit. Find every line that writes to `entity.MeshRefs` for animated entities.
10. **`src/AcDream.App/Rendering/GameWindow.cs:160` + lines 2710-2760, 2920-2940** — `_animatedEntities` declaration + spawn/despawn population.
11. **`src/AcDream.App/Rendering/Wb/WbDrawDispatcher.cs`** — `Draw` and `ClassifyBatches`. Where the cache will land.
12. **`src/AcDream.Core/Physics/AnimationSequencer.cs`** — the per-frame animation engine. Audit any field it mutates that the dispatcher reads.
13. **`src/AcDream.Core/Physics/AnimationHookRouter.cs`** — secondary mutation source via animation hooks.

---
## Workflow for the next session

1. **Read this handoff in full.**
2. **Verify build green:** `dotnet build`. Verify ~1688 tests pass: `dotnet test --no-build`. Verify N.5b sentinel: filter `TerrainSlot|TerrainModernConformance|Wb|MatrixComposition|TextureCacheBindless|SplitFormulaDivergence` → expect 94 passing.
3. **Read the files above** in order. Go especially deep on items 8-13 of §"Files to read".
4. **Audit step (1-2 days):** open a fresh research note `docs/research/2026-05-10-tier1-mutation-audit.md` and write down:
   - Every code path that writes `entity.MeshRefs = ...` for any entity.
   - Tag each as **STATIC** (one-shot at spawn or rare event) or **DYNAMIC** (per-frame).
   - For each STATIC write, identify the trigger (network event, scale apply, etc.) and design the invalidation hook.
   - For each DYNAMIC write, confirm it fires only for entities in `_animatedEntities` (which means cache bypass is the right answer).
5. **Spec (~1 day):** brainstorm the cache design with the user (use `superpowers:brainstorming`). Write `docs/superpowers/specs/2026-05-10-issue-53-tier1-cache-design.md`. Include the audit findings, the chosen cache approach (probably option (a)), the invariants, the invalidation API, the test plan, the perf-gate measurement plan.
6. **Implement (~2-3 days):** TDD via `superpowers:test-driven-development`. Tests first for cache hit/miss/invalidation, then implementation in `WbDrawDispatcher`. Wire invalidation hooks into the relevant write sites in `GameWindow.cs`.
7. **Visual gate:** launch + walk; confirm animation works on a moving NPC; confirm static buildings/scenery still render at correct positions; confirm lifestone (closed by #52) still renders.
8. **Perf gate:** capture `[WB-DIAG]` cpu_us median + p95 with `ACDREAM_WB_DIAG=1` at horizon-safe preset (NEAR=4, FAR=12). Compare to today's ~3.5 ms baseline; expect ≤2.0 ms.
9. **Ship:** commit, close #53 in ISSUES.md, update CLAUDE.md "Currently in flight" (this would close out the post-A.5 polish phase entirely), update memory with any new gotchas captured during the audit/implementation.
10. **Next phase after #53 ships:** N.6 (perf polish) per the roadmap. Or escalate to Tier 2 (static/dynamic split with persistent groups) per `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md` if Tier 1 alone doesn't hit the perf target.
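
The audit note from step 4 is essentially a table of write sites, and it can help to think of it as data. The sketch below seeds such a table with the sites already listed in this handoff. `MeshRefsWriteSite`, `WriteKind`, and `AuditSketch` are hypothetical, the hook name reuses `InvalidateEntity` from option (b) purely as a placeholder, and a `null` hook means "not yet decided", so `MissingHooks` lists the STATIC sites that still need a decision.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public enum WriteKind { Static, Dynamic }

// One row of the audit: where MeshRefs is written, how often, and what reacts to it.
public sealed record MeshRefsWriteSite(string Location, WriteKind Kind, string Trigger, string? InvalidationHook);

public static class AuditSketch
{
    // Seeded from the sites named in this handoff; the real audit would extend it.
    public static readonly IReadOnlyList<MeshRefsWriteSite> Sites = new[]
    {
        new MeshRefsWriteSite("GameWindow.cs:7474-7553", WriteKind.Dynamic, "per-frame animation rebuild", null),
        new MeshRefsWriteSite("GameWindow.cs:2333", WriteKind.Static, "ObjDescEvent / AnimPartChange", "InvalidateEntity"),
        new MeshRefsWriteSite("GameWindow.cs:2365", WriteKind.Static, "ObjDescEvent / AnimPartChange", "InvalidateEntity"),
        new MeshRefsWriteSite("GameWindow.cs:2524", WriteKind.Static, "scale apply at spawn", null),
    };

    // STATIC sites with no hook assigned yet. DYNAMIC sites bypass the cache, so they need none.
    public static IEnumerable<MeshRefsWriteSite> MissingHooks()
        => Sites.Where(s => s.Kind == WriteKind.Static && s.InvalidationHook is null);
}
```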
---
## Things to NOT do

- **Don't skip the audit.** The whole reason the first attempt failed was that the audit was implicit and incomplete. The audit step should produce a written list of every MeshRefs write site, classified static vs dynamic, before any cache code is written.
- **Don't bundle Tier 2 or Tier 3 into this phase.** Those are dedicated multi-week phases per `docs/plans/2026-05-10-perf-tiers-2-3-roadmap.md`. If the audit reveals Tier 1 alone can't hit the perf target, file a follow-up issue and escalate as a separate phase.
- **Don't re-add the `Tier1` cache that was reverted.** Start fresh after the audit. Cherry-picking commit `3639a6f` reintroduces the animation freeze.
- **Don't break the N.5b conformance sentinel.** Run the filter on every commit:
  ```
  dotnet test --no-build --filter "FullyQualifiedName~TerrainSlot|FullyQualifiedName~TerrainModernConformance|FullyQualifiedName~Wb|FullyQualifiedName~MatrixComposition|FullyQualifiedName~TextureCacheBindless|FullyQualifiedName~SplitFormulaDivergence"
  ```
  Expect 94 passing, 0 failures.
- **Don't skip the visual gate.** Animation has been the highest-risk regression in this codebase repeatedly (Tier 1 first attempt, the lifestone crystal in this session, the foundry statue earlier). Confirm visually with a moving animated NPC, a stationary building, and the lifestone before declaring done.
- **Don't trust "it was working in prod before."** That was the first Tier 1 attempt's posture. The audit is what makes it actually safe.

---
## Reference: Tier 1 perf math

Per the perf-tier roadmap and A.5 final state:

- **Today** (post-A.5 ship + #52/#54): entity dispatcher cpu_us median ~3.5 ms at radius=12 on Radeon RX 9070 XT @ 1440p. ~200-240 FPS at standstill.
- **After Tier 1**: ~1.0-1.5 ms median expected. ~300-400 FPS at standstill. Inside the spec's 2.0 ms budget.
- **After Tier 2 (separate phase)**: ~0.5-1.0 ms. ~400-600 FPS.
- **After Tier 3 (GPU compute culling, separate phase)**: ~0.05 ms. ~600-1000+ FPS.

Tier 1 is the lowest-risk, highest-leverage perf win remaining for the post-A.5 polish phase.

---
Good luck. The audit is the load-bearing thing — invest in it. The implementation is mechanical once the audit is solid.

Holler at the user if the audit reveals a write site that doesn't fit the static/dynamic dichotomy cleanly.