Asheron's Call EoR memory leak hunt: fixes + investigation tooling
acbot 57b5e43d0e Initial commit — leak-hunt project complete
Five bugs identified and patched in retail Asheron's Call client:
- v3b: palette refcount over-increment (3-byte NOP at two sites)
- v5: RenderSurface PurgeResource no-op stub (vtable slot 2 thunk)
- v11: two dangling-pointer crash guards (NULL-check + reorder)
- v14: CEnvCell::Destroy ClipPlaneList leak (18-byte JMP to cleanup thunk)
- v22: unpacker stale-pointer SEH guard (whole-function __try/__except)

All five ship in leakfix.dll (117 KB, SHA d282f23c…) which is loaded
by acclient.exe at process start via PE import table patching by
tools/install_leakfix.py.

Controlled 15-client fleet soak: unpatched control died at 26h with
palette exhaustion; all 14 patched clients survived past that point
and reached ≥5-day uptime.

Residual ~15 MB/h growth traced to d3d9.dll's internal slab allocator
(260KB surface backing buffers retained after Release). See REPORT.md
§10 for the full investigation; conclusion is that it's unfixable from
outside d3d9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 21:07:58 +02:00

Retail AC Memory Leak Hunt

Status: COMPLETE 2026-05-22. Five bugs found and patched in the retail AC client. In a controlled fleet soak, the unpatched control died at 26h with palette exhaustion; all 14 patched clients survived past that point and reached ≥5 days of uptime. The residual ~15 MB/h growth was traced to d3d9.dll's internal slab allocator and is unfixable from outside d3d9.

If you just want to install: drop dll/leakfix/dist/leakfix.dll into your AC directory and run python tools/install_leakfix.py "C:\path\to\AC". The installer patches acclient.exe's import table to load leakfix.dll at startup. Idempotent — safe to re-run.


What ships

| Patch | Bug | One-line fix |
| --- | --- | --- |
| v3b | Palette refcount over-increment in makeModifiedPalette | NOP the inc [eax+0x24] at two sites |
| v5 | RenderSurface::PurgeResource is a no-op stub | Override vtable slot 2 to call Destroy() for real |
| v11 | Two dangling-pointer dereferences in delete_contents + ~GXTri3Mesh | NULL-check guards |
| v14 | CEnvCell::Destroy leaks the ClipPlaneList (just zeros the count) | Replace the 18-byte buggy block with a JMP to a thunk that actually frees the list |
| v22 | Server-driven AV in the unpacker function at 0x00526A50 (5-client mass crash 2026-05-21 09:00) | Wrap the function in __try / __except, return 0 on AV (which the engine already handles as the size-check-failure code path) |

All five plus a crash-handler ship in leakfix.dll. Patches are applied 30 seconds after process start (deferred so Decal/UB win their own init race first). Crash handler is installed immediately so any crashes during the 30s window are still captured.

Patch pseudo-code

v3b — palette refcount over-increment

The engine's palette-cache hit path increments the cached entry's refcount twice (once in the cache lookup, once in the constructor that wraps it). Result: the refcount grows monotonically, no entry ever reaches zero, and palettes accumulate until the 32-bit address space is exhausted (~26h on heavy-loot clients).

; at 0x0053EFFE (and 0x0053F19C, the sibling overload)
;   before patch:                            after patch:
;   ff 40 24    inc dword ptr [eax+0x24]     90 90 90  nop nop nop
// effect, expressed in C:
// before: refcount++ twice per cache hit
// after:  refcount++ once per cache hit (the outer increment is removed)
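A byte patch like this is safest when the patcher first confirms that the bytes it is about to overwrite are the ones it expects. The sketch below shows that verify-then-NOP pattern on an in-memory buffer. It is an illustration only: `nop_out` and the toy buffer are hypothetical names, and nothing here is taken from the leakfix.dll source.

```python
NOP = 0x90

def nop_out(image: bytearray, offset: int, expected: bytes) -> bool:
    """Overwrite `expected` at `offset` with NOPs. Returns False if already patched."""
    size = len(expected)
    current = bytes(image[offset:offset + size])
    if current == bytes([NOP]) * size:
        return False                      # already patched: re-running is harmless
    if current != expected:
        # Refuse to touch a build whose bytes differ from what we analysed.
        raise ValueError(f"unexpected bytes at {offset:#x}: {current.hex(' ')}")
    image[offset:offset + size] = bytes([NOP]) * size
    return True

# Toy buffer standing in for the code around one patch site (surrounding bytes are arbitrary).
INC_REFCOUNT = bytes.fromhex("ff 40 24")  # inc dword ptr [eax+0x24]
code = bytearray(b"\x8b\xc1" + INC_REFCOUNT + b"\xc3")
first = nop_out(code, 2, INC_REFCOUNT)    # applies the patch
second = nop_out(code, 2, INC_REFCOUNT)   # detects the earlier patch
```

Checking for the already-patched state separately from the mismatch case is what lets the same routine run twice without raising.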

v5 — RenderSurface PurgeResource override

RenderSurface's and RenderTexture's PurgeResource virtual slot points at 0x004154A0, which is mov al, 1; ret — a no-op stub. When the resource manager's purge sweep walks s_Resources and calls PurgeResource() on each entry, the call returns "1 = purged" but the resource's D3D handle + heap state is never touched. Result: purged-shell accumulation in s_Resources.

// before — at slot 2 of the RenderSurface vtable (0x0079A684):
//   PurgeResource = noop_stub;   // 0x004154A0
//   int noop_stub() { return 1; }
//
// after — slot 2 redirected to our thunk in leakfix.dll:
int purge_rendersurface_thunk(RenderSurface* self) {
    RenderSurface::Destroy(self);   // real cleanup
    return 1;                       // engine marks entry purged
}
// same fix mirrored to RenderTexture slot 2 (0x0079C1A0).

v11 — two dangling-pointer crash guards

Two places where the engine dereferences a pointer that's been freed elsewhere. Both manifest as AVs that take the process down.

Site 1: delete_contents hash walk (0x00587126). The loop falls through into a dereference of an already-freed bucket node when the bucket chain was rebuilt mid-walk. Fix: retarget the JMP so the freed-bucket branch jumps to the epilogue, skipping the deref.

;   before: eb 07    jmp +0x07   ; into the deref
;   after:  eb 42    jmp +0x42   ; into the epilogue (skip deref)
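The retarget only changes the one-byte displacement of a short jump. A short `EB rel8` jump lands at the address of the jump plus 2 (its own length) plus the signed displacement. The helper below is a sketch of that arithmetic; it assumes, without confirmation from the source, that 0x00587126 is the address of the `EB` opcode itself.

```python
def short_jmp_target(jmp_addr: int, rel8: int) -> int:
    """Destination of a 2-byte `EB rel8` jump located at jmp_addr."""
    disp = rel8 - 0x100 if rel8 >= 0x80 else rel8   # rel8 is a signed byte
    return jmp_addr + 2 + disp

def short_jmp_rel8(jmp_addr: int, target: int) -> int:
    """Displacement byte needed to reach `target` from a short jump at jmp_addr."""
    disp = target - (jmp_addr + 2)
    if not -128 <= disp <= 127:
        raise ValueError("target out of range for a short jump")
    return disp & 0xFF

SITE = 0x00587126   # assumption: address of the EB opcode
before = short_jmp_target(SITE, 0x07)   # original destination (the deref)
after = short_jmp_target(SITE, 0x42)    # patched destination (the epilogue)
```

Under that assumption the patch moves the destination forward by 0x3B bytes while leaving the instruction length unchanged, so no surrounding code has to shift.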

Site 2: ~GXTri3Mesh slot 0 deref (0x005E565D). The GXTri3Mesh destructor reads its slot[0] and then zeros it. If slot[0] is stale (some other path already freed it), the deref AVs. Fix: reorder so we zero first; never deref a slot we can't trust.

;   before:                          after:
;   8B 08        mov ecx, [eax]      89 5E 08    mov [esi+8], ebx  ; zero first
;   50           push eax            90 ... 90   nop x6            ; skip deref + call
;   FF 51 08     call [ecx+8]
;   89 5E 08     mov [esi+8], ebx

v14 — CEnvCell::ClipPlaneList leak

CEnvCell::Destroy contains an 18-byte cleanup block that only zeros cplane_num — never frees the underlying ClipPlaneList object hanging off [this+0xDC]. Every cell unload leaks one of these. Replace the broken block with a JMP to a thunk in leakfix.dll that does the real cleanup:

// thunk pseudo-code:
void v14_clipplane_cleanup_thunk(CEnvCell* self) {
    ClipPlaneListWrapper* outer = self->cplane_wrapper;   // [esi+0xDC]
    if (outer) {
        ClipPlaneList* inner = outer->inner;              // [outer+0x0]
        if (inner) {
            inner->~ClipPlaneList();
            operator delete(inner);
        }
        operator delete[](outer);
        self->cplane_wrapper = nullptr;
    }
    // jump back to V14_RESUME_VA (just past the original 18-byte block)
}
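Replacing a fixed-size block with a jump means writing a 5-byte `E9 rel32` and padding the remainder with NOPs. The following sketch computes those bytes. The addresses are hypothetical placeholders chosen to make the arithmetic visible; the real patch site and thunk address are not given in this section.

```python
import struct

def jmp_rel32_patch(site: int, target: int, block_len: int) -> bytes:
    """Bytes that replace `block_len` bytes at `site` with `JMP target` plus NOP padding."""
    if block_len < 5:
        raise ValueError("need at least 5 bytes for E9 rel32")
    rel = (target - (site + 5)) & 0xFFFFFFFF   # relative to the end of the 5-byte JMP
    return b"\xE9" + struct.pack("<I", rel) + b"\x90" * (block_len - 5)

# Hypothetical addresses, not the real ones.
SITE, THUNK, BLOCK = 0x00401000, 0x10002000, 18
patch = jmp_rel32_patch(SITE, THUNK, BLOCK)
resume = SITE + BLOCK          # where the thunk jumps back to (just past the block)
```

The padding is never executed, because the thunk returns to `resume`; it only keeps the disassembly after the patch site aligned with the original instruction boundaries.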

v22 — unpacker stale-pointer SEH guard

A small inline unpacker at 0x00526A50 pulls 4 DWORDs from arg1->buffer. On 2026-05-21 the server fed five clients simultaneously a buffer pointing into freed/kernel memory; all five AV'd on the 4th deref. The engine already has a code path for "buffer too small / unpack failed" (line 1 of the function checks a size field and returns 0). We just wrap the whole function body in SEH and route AVs to that same return-0 path.

// 1. Copy the original 73 bytes of the function to executable memory.
// 2. Patch the original entry with JMP rel32 to our wrapper.
int v22_unpacker_wrapper(this, arg1, count) {
    __try {
        return original_copy(this, arg1, count);  // run the real unpacker
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        // log + return 0 (engine treats this as size-check failure)
        return 0;
    }
}

Install

# 1. Copy leakfix.dll into your AC directory
Copy-Item .\dll\leakfix\dist\leakfix.dll "C:\Turbine\Asheron's Call\"

# 2. Patch acclient.exe to import leakfix.dll
python tools\install_leakfix.py "C:\Turbine\Asheron's Call"

# 3. Verify
python tools\install_leakfix.py "C:\Turbine\Asheron's Call" verify

The installer adds a .limport PE section to acclient.exe containing the rebuilt import table. It backs up the original to acclient.exe.bare_original on first run, and is idempotent.
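One quick way to tell whether an executable has already been patched is to look for the `.limport` section in its section table. The sketch below reads section names using only `struct`. It is not the logic of install_leakfix.py or check_acclient_imports.py, whose internals are not shown here; `pe_section_names`, `looks_patched` and `fake_pe` are hypothetical names, and `fake_pe` builds a header-only synthetic image purely to exercise the parser.

```python
import struct

def pe_section_names(image: bytes) -> list:
    """Return the section names of a PE image (checks signatures only)."""
    if image[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_off,) = struct.unpack_from("<I", image, 0x3C)          # e_lfanew
    if image[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    (num_sections,) = struct.unpack_from("<H", image, pe_off + 4 + 2)
    (opt_size,) = struct.unpack_from("<H", image, pe_off + 4 + 16)
    table = pe_off + 4 + 20 + opt_size                         # first section header
    names = []
    for i in range(num_sections):
        raw = image[table + i * 40 : table + i * 40 + 8]       # 8-byte name field
        names.append(raw.rstrip(b"\0").decode("ascii", "replace"))
    return names

def looks_patched(image: bytes) -> bool:
    return ".limport" in pe_section_names(image)

def fake_pe(names):
    """Synthetic headers only: DOS stub, PE signature, COFF header, section table."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)                    # PE header follows the stub
    coff = struct.pack("<HHIIIHH", 0x14C, len(names), 0, 0, 0, 0, 0)
    sections = b"".join(n.encode("ascii").ljust(8, b"\0") + bytes(32) for n in names)
    return bytes(dos) + b"PE\0\0" + coff + sections

unpatched = fake_pe([".text", ".rdata", ".data"])
patched = fake_pe([".text", ".rdata", ".data", ".limport"])
```

A section-name check is only a heuristic; confirming that leakfix.dll is actually listed in the import table requires walking the import directory.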

Roll back

Copy-Item "C:\Turbine\Asheron's Call\acclient.exe.bare_original" `
          "C:\Turbine\Asheron's Call\acclient.exe" -Force
Remove-Item "C:\Turbine\Asheron's Call\leakfix.dll" -ErrorAction Ignore

Files

  • dll/leakfix/src/ — DLL source (C++ with inline asm for the naked thunks)
  • dll/leakfix/dist/leakfix.dll — current production build (117 KB)
  • dll/leakfix/build.bat — build script (VS 2022 BuildTools required)
  • tools/install_leakfix.py — patches acclient.exe to import leakfix.dll
  • tools/check_acclient_imports.py — verify import table contains leakfix.dll
  • references/ — symbol table, pseudo-C, header for the 2013 client (PDB-backed)

The rest of this document is the original VM operator brief that drove the investigation. Preserved for context but no longer operationally relevant — the hunt is done.


Retail AC Memory Leak Hunt — VM Operator Brief

You are picking this up cold on a freshly-provisioned Windows VM. This document is your full mission brief. Read it end-to-end before running anything, then drive the work autonomously, using ScheduleWakeup (Claude Code) to pace long-running operations between your active turns.


1. Mission

Find and patch a memory leak in the retail Asheron's Call client. The production symptom is a hard crash after ~4-5 days of continuous play on the End-of-Retail (EoR, ~Jan 2017) client. We don't have symbols for that binary, but we do have full PDB symbols for the Sept 2013 v11.4186 client, which almost certainly carries the same leak (AC was in pure maintenance mode 2013→2017, very little net new code).

The hunt happens on the 2013 client (symbolized). The patch ships against the EoR client (via BinDiff-forward).

What "done" looks like

  1. A specific function in the 2013 client is identified as the leak source, with evidence: monotonic UMDH growth across multiple snapshot diffs attributed to that function's call stack.
  2. The corresponding function in the EoR client is located via BinDiff (this step happens on the host machine, not the VM — the BNDB files live there).
  3. A DLL-injection patch is built that hooks the EoR function and plugs the leak (typically: adds a missing delete/Release/decref on a known path).
  4. A 5+ day soak on EoR with the patch installed completes without the OOM crash that reproduces unpatched in the same window.

Hard scope boundary

This is a self-contained side quest. Do not expand it into a general retail-instrumentation framework, a fork of the controller DLL into a fully-featured bot, a parallel acdream feature, or "while I'm here" refactors of the AC2D/Mosswart tooling. Find leak → patch → validate → ship → done. If you catch yourself reaching for adjacent work, stop and re-read this paragraph.


2. Why this works (assumptions you can rely on)

  • Compiler & toolchain stability. 2013 and EoR were both built with the same VC++ family on the same Turbine build farm. Binary structure is highly similar.
  • Code stability. AC went into maintenance after Throne of Destiny (2005) and stayed there. Most of the codebase did not change meaningfully between 2013 and EoR. A leak severe enough to crash in 4-5 days has almost certainly been present for many years.
  • PDB → BinDiff path is mature. BinDiff and Diaphora routinely achieve 80-95% function-match rates across related VC++ binaries. Once you identify the leaking function in 2013 (with name), porting the symbol forward to EoR is signature-scan-able.

What you're betting on, and the fallback

  • Primary bet: the leak repros on the 2013 client. UMDH on the 2013 client + activity bot reveals it within hours-to-days. You identify a named function, hand the name to the host for BinDiff, receive the EoR signature back, build the patch DLL, validate.
  • Fallback: the leak does NOT repro on 2013 — i.e. it was introduced after Sept 2013. In that case, you fall back to hunting on the EoR client without symbols, using BinDiff-transferred names for whatever functions match the 2013 codebase. This is slower but still feasible. The primary-vs-fallback determination is Phase 1 Decision Gate below.

3. Package contents

leak-hunt-vm-2026-05-12/
├── README.md                ← you are here
├── MANIFEST.md              ← list of out-of-repo files copied in
├── CLAUDE.md                ← VM-side project rules (persistent)
├── templates/
│   ├── supervisor.ps1       ← skeleton — start ACE, start client, snapshot loop
│   ├── snapshot.ps1         ← UMDH single-shot
│   ├── activity-phases.json ← phase schedule template
│   ├── login.ahk            ← AutoHotkey login skeleton
│   └── trace.cdb            ← cdb scripting template
├── tools/
│   ├── check_exe_pdb.py     ← verify binary ↔ PDB GUID match
│   ├── dump_pdb_info.py     ← PDB metadata
│   └── pdb_extract.py       ← regenerate symbols.json if needed
├── pdb/
│   └── acclient.pdb         ← (29 MB, copied per MANIFEST)
└── references/
    ├── symbols.json         ← 18,366 named functions + addresses (grep-friendly)
    ├── types.json           ← 5,371 struct/class type definitions
    ├── acclient.h           ← verbatim retail header structs
    └── acclient_2013_pseudo_c.txt  ← 64 MB symbolized Binary Ninja pseudo-C

The Python tools are stdlib-only (no pip). Everything else is data.


4. What you need on the VM (one-time, before starting)

If any of these is missing, ask the user before guessing.

| Component | Where | Notes |
| --- | --- | --- |
| Retail AC client (2013 v11.4186) | C:\Turbine\Asheron's Call\ | Standard install path. Verify match with check_exe_pdb.py before any other work. The _NT_SYMBOL_PATH must include pdb/. |
| Retail AC dat files | inside the install | client_portal.dat, client_cell_1.dat, client_highres.dat, client_local_English.dat |
| ACE server | 127.0.0.1:9000 on VM | Use ACEmulator from github.com/ACEmulator/ACE. Same config as user's dev box. Confirm it accepts logins before continuing. |
| Test character | on the VM's ACE | Suggested name: +Leakhunt. GM-marker + so debug commands are available. |
| Windows Debugging Tools | Microsoft Store WinDbg or Win10/11 SDK | Need cdb.exe, umdh.exe, gflags.exe. 32-bit (x86) versions, because acclient.exe is 32-bit. |
| AutoHotkey v2 | autohotkey.com | For login automation. v2 only; templates assume v2 syntax. |
| Sysinternals procdump | sysinternals.com | Crash-dump capture. |
| MinHook (optional, for patch DLL) | github.com/TsudaKageyu/minhook | Only needed at Phase 8. Defer. |
| Shared folder or mounted drive | Z:\ or similar | For passing snapshots back to host. Configure at VM-setup time. |

5. Configuration questions to ask the user at session start

Ask these first, before running anything. They materially affect the harness.

  1. Where is ACE running — same VM (recommended; snapshot-clean) or on the host with VM networking through to it? Default assumption: same VM.
  2. What's the AC install path if it's not the standard C:\Turbine\Asheron's Call\?
  3. Output flow — shared folder path? Or push artifacts to a git branch (e.g. leak-hunt-vm/2026-05-12)? Default: shared folder to Z:\leak-hunt\ on host.
  4. Test character name on the VM ACE? Default: +Leakhunt.
  5. VM specs: RAM and core count? (Affects whether to enable gflags+UST from the start, which costs ~20-30% perf.)
  6. EoR binary location on host — confirm the user has it at C:\Users\erikn\source\repos\acdream\refs\acclient-eor-2024-09-11.bndb (Binary Ninja db). This isn't needed on the VM but is critical for Phase 7 BinDiff on the host.
  7. Wake-up cadence preference — do they want you to use ScheduleWakeup for hours-long gaps, or stay continuously active? Default: ScheduleWakeup for any gap > 30 min.

Save the user's answers as memory entries before proceeding past Phase 0 so a future session can pick up cold.


6. Phased plan

Each phase has a goal, commands, decision gate, and estimated time. Don't skip ahead. Don't run multiple phases in parallel until Phase 4.

Phase 0 — Verify the bench (target: 30 min)

Goal: prove the environment can launch AC, log in, and observe memory.

  1. py tools/check_exe_pdb.py "C:\Turbine\Asheron's Call\acclient.exe". Expect: === MATCH: this exe pairs with our acclient.pdb ===. If MISMATCH → stop, ask the user which build they installed.
  2. py tools/dump_pdb_info.py pdb/acclient.pdb. Confirm GUID 9e847e2f-777c-4bd9-886c-22256bb87f32, age 1.
  3. Start ACE locally (dotnet run in the ACE checkout, or ACE.exe if pre-built). Confirm it listens on 127.0.0.1:9000.
  4. Manually launch AC, log in with the test character, walk one step, log out. This proves the bench works before you add instrumentation.
  5. Take a clean Hyper-V / VMware snapshot named bench-verified. The supervisor will revert to this before each run.

Decision gate: can you launch, log in, walk, log out, clean? If no, fix this before anything else. If yes, proceed.
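The exe-to-PDB pairing in steps 1 and 2 rests on the CodeView record embedded in the executable: the marker `RSDS`, a 16-byte GUID, a 4-byte age, and the PDB file name. The sketch below extracts those fields so they can be compared with the GUID and age above. It is a simplified illustration rather than the logic of check_exe_pdb.py: it scans for the raw `RSDS` marker instead of walking the PE debug directory, and `find_rsds` is a hypothetical name.

```python
import struct
import uuid

def find_rsds(image: bytes):
    """Return (guid, age, pdb_name) from the first RSDS record found, or None."""
    pos = image.find(b"RSDS")
    if pos < 0 or pos + 24 > len(image):
        return None
    guid = uuid.UUID(bytes_le=image[pos + 4 : pos + 20])   # GUID fields are little-endian
    (age,) = struct.unpack_from("<I", image, pos + 20)
    end = image.find(b"\0", pos + 24)                       # name is NUL-terminated
    name = image[pos + 24 : end if end >= 0 else len(image)].decode("utf-8", "replace")
    return str(guid), age, name

# Synthetic record built from the GUID and age quoted in step 2.
EXPECTED_GUID = "9e847e2f-777c-4bd9-886c-22256bb87f32"
record = b"RSDS" + uuid.UUID(EXPECTED_GUID).bytes_le + struct.pack("<I", 1) + b"acclient.pdb\0"
result = find_rsds(b"\0" * 64 + record)
```

On a real binary the marker scan could hit a false positive inside unrelated data, which is why a production check should locate the record through the debug directory.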


Phase 1 — Idle baseline + decide hunt platform (target: 4 hours)

Goal: does the leak reproduce on the 2013 client when the player sits at the lifestone doing nothing? If yes, primary plan; if no, Phase 2 will find the right activity profile.

  1. Enable the user-mode stack trace database, which UMDH needs to attribute allocations to call stacks: gflags /i acclient.exe +ust. This is registry-set; survives reboots. (Disable later with gflags /i acclient.exe -ust.)
  2. Set _NT_SYMBOL_PATH=<vm-path-to-pdb-dir>.
  3. Launch AC via templates/supervisor.ps1 (which sets env and spawns the process). Log in manually OR via templates/login.ahk.
  4. Walk to a quiet spot (lifestone interior, away from spawn). Sit.
  5. templates/snapshot.ps1 -ProcessId <pid> -Out snap_001.txt immediately. Wait 30 min. Take snap_002. Repeat for at least 4 hours (8 snapshots).
  6. umdh -d snap_001.txt snap_008.txt -f:diff_idle.txt. Read the top 20 growing stacks. Save the diff to Z:\leak-hunt\phase1\.

Decision gate:

  • Total committed memory grew >50 MB over 4h idle? Leak repros at idle. Skip Phase 2, jump to Phase 4 (long-soak idle).
  • Total committed grew 5-50 MB? Leak may need amplification. Proceed to Phase 2.
  • Total committed grew <5 MB? Leak is activity-specific or doesn't exist on 2013. Proceed to Phase 2.
  • Memory dropped or oscillates around 0? No leak signal at idle. Phase 2 is where you'll find it (or won't).
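The gate above can be written down as a small function so that the decision is applied the same way on every run. This is a sketch under two assumptions that the brief does not spell out: exactly 50 MB falls in the middle band, and "dropped or oscillates around 0" is treated as growth of zero or less. The function names are hypothetical.

```python
def mb_per_hour(start_bytes: int, end_bytes: int, hours: float) -> float:
    """Average growth rate of committed memory in MB/h."""
    return (end_bytes - start_bytes) / 2**20 / hours

def phase1_gate(growth_mb: float) -> str:
    """Map total committed-memory growth over the 4h idle run to the next phase."""
    if growth_mb > 50:
        return "phase 4: leak repros at idle"
    if growth_mb >= 5:
        return "phase 2: leak may need amplification"
    if growth_mb > 0:
        return "phase 2: activity-specific or absent on 2013"
    return "phase 2: no leak signal at idle"
```

For example, a run that grows by 60 MB over 4 hours averages 15 MB/h and goes straight to Phase 4.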

Record the baseline growth-rate number in memory: leak_hunt_phase1_baseline_mb_per_hour.


Phase 2 — Activity-phase characterization (target: 1-2 days)

Goal: find which player activity causes the leak. The bot is not yet built — you drive this manually with the activity-phase template running as an AHK macro, or by playing 30-min phases yourself if no bot is available.

The five canonical phases (see templates/activity-phases.json):

  1. idle — stand at lifestone, no input
  2. wander — walk a fixed route around Holtburg
  3. chat — spam say/tell/global chat
  4. target-cycle — Tab through nearby NPCs/mobs, no combat
  5. ui-cycle — open/close inventory, character pane, spells

Procedure per session:

  1. Start fresh from bench-verified snapshot.
  2. Run a single phase for 1 hour with snapshots every 15 min.
  3. umdh -d diff the snapshot pair for that phase.
  4. Record growth-rate per phase to memory.
  5. Repeat for each phase, single VM, single phase per run.

(If user has authorized multiple parallel VMs, run different phases simultaneously instead of sequentially.)

Decision gate: rank phases by growth rate. The top phase is your target for Phase 4 amplification. Save ranking to memory.
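Ranking is easier to compare across sessions if every phase is normalised to MB per hour first. A minimal sketch follows; `rank_phases` is a hypothetical helper, and the numbers in the example are invented placeholders, not measurements.

```python
def rank_phases(growth: dict) -> list:
    """growth maps phase name -> (MB grown, hours run). Returns [(phase, MB/h)], fastest first."""
    rates = {name: mb / hours for name, (mb, hours) in growth.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Placeholder values for illustration only.
example = {
    "idle": (1.0, 1.0),
    "wander": (12.0, 1.0),
    "chat": (3.0, 1.5),
    "target-cycle": (0.5, 1.0),
    "ui-cycle": (4.0, 1.0),
}
ranking = rank_phases(example)
top_phase = ranking[0][0]
```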


Phase 3 — Controller DLL (target: 1-2 days)

Goal: build a small DLL that drives the leaking phase deterministically and reproducibly, faster than a human can.

Build approach:

  • C++ DLL, 32-bit, compiled against Visual Studio Build Tools.
  • MinHook for function hooking.
  • LoadLibrary'd into acclient.exe via a small launcher EXE (CreateProcess SUSPENDED → WriteProcessMemory of LoadLibrary trampoline → ResumeThread). Standard injection pattern.
  • Hook a frame-loop function from symbols.json — search references/symbols.json for CGameLoop, WorldFilter, Tick, ProcessFrame and pick the highest-frequency one with stable signature.
  • Call retail functions directly via PDB-resolved addresses. Examples: CPhysicsObj::set_velocity, CChatManager::SendSay, CPlayerSystem::SelectTarget. These take a this pointer in ecx (thiscall) — you'll need a small asm trampoline or use __thiscall calling-convention helpers.

The bot's job:

  • Drive the top-ranked Phase 2 activity continuously.
  • Emit a heartbeat to a log file every 30s so the supervisor can detect wedging.
  • Run a position watchdog with auto-restart: if CPhysicsObj::position hasn't changed in 5 min during a movement phase, signal the supervisor to revert and retry.
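On the supervisor side, the heartbeat and the position watchdog reduce to two staleness checks. The sketch below models them in Python for clarity, even though the bot itself is a C++ DLL. The 90-second heartbeat timeout (three missed 30 s heartbeats) is an assumption, and `BotStatus` and `check` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class BotStatus:
    last_heartbeat: float         # timestamp of the newest heartbeat line, in seconds
    last_position_change: float   # timestamp when the position last changed
    in_movement_phase: bool

def check(status: BotStatus, now: float,
          heartbeat_timeout: float = 90.0, stuck_timeout: float = 300.0) -> str:
    """Return "wedged", "stuck" or "ok" for the supervisor to act on."""
    if now - status.last_heartbeat > heartbeat_timeout:
        return "wedged"           # no heartbeat: the process is hung or dead
    if status.in_movement_phase and now - status.last_position_change >= stuck_timeout:
        return "stuck"            # alive but not moving for 5 min
    return "ok"
```

Checking the heartbeat first matters: a hung process also stops moving, and "wedged" is the more accurate diagnosis in that case.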

Reuse opportunity: the user maintains MosswartMassacre and MosswartOverlord — both are AC client DLL-injection projects. Ask the user for read access before designing from scratch; they may have a working injector + MinHook scaffolding you can port from in hours rather than days. Do not assume; ask.

Decision gate: bot runs the leaking phase for 1 hour unattended, emits heartbeats, produces measurable UMDH growth.


Phase 4 — Long-soak with amplification (target: 12-48 hours)

Goal: generate a clean signal — one or two leaking call stacks visibly dominate the UMDH diff.

  1. Revert VM to bench-verified.
  2. Launch via supervisor.ps1 with the controller DLL injected.
  3. Snapshot every 15 min for 12+ hours.
  4. Run umdh -d on snap_001 vs snap_N every couple of hours during your active turns. Between active turns, use ScheduleWakeup with a delay of 1800-3600 s and the reason "long-soak snapshot check".

Decision gate: UMDH diff shows one or more call stacks with monotonic growth across all adjacent-pair diffs, dominating the total by ≥10× over the next-highest. That's your leak candidate(s).
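The gate combines two tests: growth in every adjacent snapshot pair, and a ≥10× lead over whatever comes next. One possible reading is sketched below, given per-stack byte totals already extracted from the snapshots. The interpretation of "dominating" (the smallest candidate must be at least 10× the next-ranked stack) is an assumption, as is the name `leak_candidates`.

```python
def leak_candidates(series: dict, dominance: float = 10.0) -> list:
    """series maps stack id -> [bytes at snapshot 1, 2, ...]. Returns dominant stacks or []."""
    totals = {s: v[-1] - v[0] for s, v in series.items() if len(v) >= 2}
    monotonic = {s for s, v in series.items()
                 if len(v) >= 2 and all(b > a for a, b in zip(v, v[1:]))}
    ranked = sorted(totals, key=totals.get, reverse=True)
    for k in range(1, len(ranked) + 1):
        if ranked[k - 1] not in monotonic:
            break                                  # a non-monotonic stack ends the run
        runner_up = totals[ranked[k]] if k < len(ranked) else 0
        if totals[ranked[k - 1]] >= dominance * max(runner_up, 0):
            return ranked[:k]                      # top k stacks dominate the rest
    return []

clear = {"A": [10, 200, 400, 600], "B": [5, 6, 5, 7], "C": [1, 2, 3, 4]}
murky = {"A": [0, 10, 20], "B": [0, 5, 9], "C": [0, 3, 2]}
pair = {"A": [0, 100, 200], "B": [0, 90, 180], "C": [0, 1, 2]}
```

An empty result corresponds to the ambiguous case listed in §8, where growth is present but no stack dominates.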


Phase 5 — Identify the leaking function (target: 24 hours)

Goal: convert the UMDH call stack into a named function we can study and patch.

  1. The top growing stack will look like:
    ntdll!RtlAllocateHeap+0x...
    acclient!operator new+0x...
    acclient!CFoo::AllocateBar+0x42
    acclient!CFoo::DoTheThing+0x18
    acclient!CGameLoop::Tick+0x...
    
  2. The named function is CFoo::AllocateBar. Grep references/acclient_2013_pseudo_c.txt for CFoo::AllocateBar to read its body.
  3. Identify the paired free function (CFoo::ReleaseBar, ~CFoo, etc.) and confirm by reading both.
  4. Find every call site of CFoo::AllocateBar (grep the pseudo-C for the function name) and verify each has a matching paired release. The one that doesn't is the bug.
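Picking the named function out of a stack is mechanical: skip frames outside acclient, skip the allocator wrappers, and take the first frame that remains. The sketch below does this for stacks written in the simplified `module!symbol+0xNN` form of the example above. Real UMDH output carries extra fields, so a real parser would need to strip those first. The set of allocator symbols is an assumption, and `first_caller_above_allocator` is a hypothetical name.

```python
ALLOCATOR_SYMBOLS = {"operator new", "operator new[]", "malloc"}  # assumed wrapper names

def first_caller_above_allocator(stack, module="acclient"):
    """Return the first symbol in `module` that is not an allocator wrapper, else None."""
    for line in stack:
        frame = line.strip()
        mod, sep, rest = frame.partition("!")
        if not sep or mod != module:
            continue                               # e.g. ntdll frames
        symbol = rest.rsplit("+0x", 1)[0]          # drop the trailing offset
        if symbol in ALLOCATOR_SYMBOLS:
            continue
        return symbol
    return None

stack = [
    "ntdll!RtlAllocateHeap+0x...",
    "acclient!operator new+0x...",
    "acclient!CFoo::AllocateBar+0x42",
    "acclient!CFoo::DoTheThing+0x18",
    "acclient!CGameLoop::Tick+0x...",
]
suspect = first_caller_above_allocator(stack)
```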

Decision gate: you have (a) the leaking function name, (b) the specific call site that doesn't free, (c) a hypothesis for the patch (typically: add a delete or Release() on a specific code path). Save these to memory + a write-up file.


Phase 6 — Cross-reference with retail debugger trace (optional, target: 2 hours)

Goal: confirm the leak path is actually hit at runtime in a real play scenario, not just statically possible.

This step is optional but recommended if the leak path is conditional (e.g. "only when the chat buffer wraps"). Use the cdb workflow documented in templates/trace.cdb and the retail-debugger section in CLAUDE.md. Attach to acclient.exe, breakpoint on the alloc function with a non-blocking action (r $t0=@$t0+1; gc), let it accumulate for 30 min, count hits, correlate with UMDH growth bytes.


Phase 7 — BinDiff to EoR (HOST MACHINE — not VM)

Goal: produce an EoR-binary signature and offset for the leaking function.

This phase does not happen on the VM. The Binary Ninja databases live on the host (refs/acclient-eor-2024-09-11.bndb, refs/acclient_2013-2024-09-11.bndb).

You (VM Claude) do:

  1. Write a structured handoff file Z:\leak-hunt\phase7-handoff.md containing: the function name, its 2013 RVA, the paired release function name, the suspected missing-free call site, the call-graph context, and a 32-48 byte AOB signature with wildcards over relocatable operands (cite the byte sequence from the pseudo-C/disassembly).
  2. Notify the user that Phase 7 is ready.

The user (or a Claude session on the host) does:

  3. Load both BNDBs in Binary Ninja, run BinDiff (or use BN's native diff). Locate the matching function in EoR.
  4. Verify the AOB signature still matches in EoR (small mods are OK; adjust wildcards as needed).
  5. Write back to Z:\leak-hunt\phase7-result.md: EoR RVA, confirmed signature, any structural differences worth knowing.

You (VM Claude) resume once that file appears.
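An AOB signature is a byte pattern in which some positions are wildcards, and a signature is only useful if it matches exactly once in the target binary. The sketch below shows a straightforward scanner. The `??` wildcard syntax and the function names are assumptions, and the example bytes are borrowed from the v11 listing earlier in this README purely as sample data.

```python
def parse_aob(pattern: str) -> list:
    """'8B 08 ?? FF' -> [0x8B, 0x08, None, 0xFF]; None marks a wildcard byte."""
    return [None if tok in ("?", "??") else int(tok, 16) for tok in pattern.split()]

def find_aob(data: bytes, pattern: str) -> list:
    """Return every offset in `data` where the pattern matches."""
    sig = parse_aob(pattern)
    if not sig:
        raise ValueError("empty pattern")
    hits = []
    for start in range(len(data) - len(sig) + 1):
        if all(b is None or data[start + i] == b for i, b in enumerate(sig)):
            hits.append(start)
    return hits

SIGNATURE = "8B 08 50 FF 51 ?? 89 5E 08"                       # wildcard over the vtable offset
old_build = bytes.fromhex("00 8b 08 50 ff 51 08 89 5e 08 c3")
new_build = bytes.fromhex("90 90 8b 08 50 ff 51 0c 89 5e 08")  # offset byte differs
```

Both sample buffers match exactly once even though the wildcarded byte differs between them, which is the property the handoff needs: one hit per build, tolerant of operands that move.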


Phase 8 — Patch DLL (target: 1 day)

Goal: ship a DLL that, when loaded into acclient.exe (any build that matches the signature), plugs the leak.

  • Same scaffold as the controller DLL — MinHook + a launcher EXE.
  • Hook the leaking function (or its caller, whichever is cleanest).
  • Wrap the existing logic so the missing free is performed on the bug path.
  • Provide a versioned filename and a small README so it's clear which client build it targets.
  • Verify on the 2013 client first — same UMDH soak, expect the top growing stack to vanish.

Phase 9 — Multi-day soak validation (target: 5+ days)

Goal: prove the patch fixes the production crash.

  1. Install the patch DLL injector into the EoR client setup.
  2. Launch under the supervisor with snapshots every hour (lower freq — we're not hunting now, just confirming).
  3. Run a controlled activity profile (the bot's full rotation) for 5+ days continuous.
  4. Pass: no OOM crash, committed memory stable or decreasing slope. Fail: crash before 5 days → back to Phase 5 to find the second leak.

Phase 10 — Ship

  1. Write up findings:
    • The leak's root cause (one paragraph)
    • The patch's mechanism (one paragraph)
    • The 2013-vs-EoR signature note
    • Validation evidence (UMDH diffs, soak duration, growth-rate plot if you have one)
  2. Save the writeup to Z:\leak-hunt\REPORT.md and to memory.
  3. Notify the user. Stop the loop. Done.

7. Wake-up protocol

Use ScheduleWakeup to self-pace between phases. Default cadence table:

| Situation | Delay | Why |
| --- | --- | --- |
| Active analysis (reading UMDH diffs, writing code) | none (stay engaged) | Full-context work |
| Between snapshots inside a soak (Phase 1/4/9) | 1500-1800 s | Cache stays warm-ish, snapshots accumulate |
| Overnight gap (≥6 h) | 3600 s and chain | One cache miss is cheap vs. burning per-hour |
| Waiting for user (Phase 7 handoff) | 3600 s | Poll for the result file |

Pass the same loop prompt each turn. The reason field should identify the phase and what you'll check (e.g. "phase 4 snapshot check — read snap_012 and diff against snap_001").


8. When to stop and ask the user

  • Phase 0 verification fails (PDB mismatch, login fails, ACE not reachable). Don't guess at fixes.
  • The bot wedges and auto-recovery fails twice in a row.
  • You're about to expand scope (refactor the supervisor into a framework, build a UI for snapshot review, port code into acdream's tree). Stop and ask. Default answer is no.
  • You hit a decision gate where the data is genuinely ambiguous (e.g. growth rate is moderate but no single stack dominates).
  • Phase 5 produces a function name that isn't in symbols.json. Probably means an indirect call or vtable dispatch — ask before spending hours decoding it.
  • The patch in Phase 9 doesn't validate. Don't iterate indefinitely; surface findings and re-plan.

9. Memory protocol

Save findings as memory entries so a session that wakes 8 hours later can resume cold. Specifically:

  • project_leak_hunt.md — top-level project context, current phase, open questions
  • leak_hunt_phase_N.md — per-phase findings, growth rates, decisions
  • leak_hunt_candidate_<funcname>.md — once a function is suspected, everything you know about it
  • feedback_leak_hunt_<topic>.md — if the user gives operational feedback during this hunt, record it

Update MEMORY.md index entry for each. Keep entries short and factual; long writeups go in Z:\leak-hunt\ files referenced from memory.


10. Hard rules (do not violate)

  1. Don't run anything that touches the user's host machine. The VM is isolated for a reason. All output goes through the shared folder.
  2. Don't disable gflags+UST mid-run. If you need to disable, stop the supervisor, disable, take a fresh baseline.
  3. Don't modify acclient.exe on disk. All patches are runtime DLL hooks. If you ever feel tempted to binary-patch the exe directly, ask the user first.
  4. Don't auto-update MEMORY.md without first saving the underlying memory file. The index must point at real files.
  5. Don't claim a leak is found without the evidence checklist:
    • ≥3 consecutive UMDH diffs showing the same top stack growing
    • The stack is attributed to a named function in symbols.json
    • The call site is identified in acclient_2013_pseudo_c.txt
    • The hypothesis for the missing free is stated
  6. Don't proceed to Phase 8 without Phase 7 handoff complete. The patch must target the EoR signature, not the 2013 RVA.

11. First action when you start a fresh session

1. Read README.md (this file) end-to-end.
2. Read CLAUDE.md (project rules — concise).
3. Run the configuration questions in §5 by the user.
4. Save their answers as memory.
5. Begin Phase 0.

Good hunting.