leakhunt/README.md
acbot 57b5e43d0e Initial commit — leak-hunt project complete
Five bugs identified and patched in retail Asheron's Call client:
- v3b: palette refcount over-increment (3-byte NOP at two sites)
- v5: RenderSurface PurgeResource no-op stub (vtable slot 2 thunk)
- v11: two dangling-pointer crash guards (NULL-check + reorder)
- v14: CEnvCell::Destroy ClipPlaneList leak (18-byte JMP to cleanup thunk)
- v22: unpacker stale-pointer SEH guard (whole-function __try/__except)

All five ship in leakfix.dll (117 KB, SHA d282f23c…) which is loaded
by acclient.exe at process start via PE import table patching by
tools/install_leakfix.py.

Controlled 15-client fleet soak: unpatched control died at 26h with
palette exhaustion; all 14 patched clients survived past that point
and reached ≥5-day uptime.

Residual ~15 MB/h growth traced to d3d9.dll's internal slab allocator
(260KB surface backing buffers retained after Release). See REPORT.md
§10 for the full investigation; conclusion is that it's unfixable from
outside d3d9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 21:07:58 +02:00


# Retail AC Memory Leak Hunt
> **Status: COMPLETE 2026-05-22.** Five bugs found and patched in the
> retail AC client. Controlled fleet soak showed the unpatched control
> died at 26h with palette exhaustion; all 14 patched clients survived
> past that point and ran for ≥5-day uptime. The residual ~15 MB/h
> growth was traced to d3d9.dll's internal slab allocator and is
> unfixable from outside d3d9.
>
> If you just want to install: drop `dll/leakfix/dist/leakfix.dll`
> into your AC directory and run
> `python tools/install_leakfix.py "C:\path\to\AC"`. The installer
> patches `acclient.exe`'s import table to load `leakfix.dll` at
> startup. Idempotent — safe to re-run.
---
## What ships
| Patch | Bug | One-line fix |
|---|---|---|
| **v3b** | Palette refcount over-increment in `makeModifiedPalette` | NOP the `inc [eax+0x24]` at two sites |
| **v5** | `RenderSurface::PurgeResource` is a no-op stub | Override vtable slot 2 to call `Destroy()` for real |
| **v11** | Two dangling-pointer dereferences in `delete_contents` + `~GXTri3Mesh` | NULL-check guards |
| **v14** | `CEnvCell::Destroy` leaks the `ClipPlaneList` (just zeros the count) | Replace the 18-byte buggy block with a JMP to a thunk that actually frees the list |
| **v22** | Server-driven AV in the unpacker function at `0x00526A50` (5-client mass crash 2026-05-21 09:00) | Wrap the function in `__try / __except`, return 0 on AV (which the engine already handles as the size-check-failure code path) |
All five plus a crash-handler ship in `leakfix.dll`. Patches are
applied 30 seconds after process start (deferred so Decal/UB win their
own init race first). Crash handler is installed immediately so any
crashes during the 30s window are still captured.
## Patch pseudo-code
### v3b — palette refcount over-increment
The engine's palette-cache hit path increments the cached entry's
refcount **twice** (once in the cache lookup, once in the constructor
that wraps it). Result: refcount grows monotonically; nothing ever
hits zero; palettes accumulate until the 32-bit address space is
exhausted (~26h on heavy-loot clients).
```asm
; at 0x0053EFFE (and 0x0053F19C, the sibling overload)
; before patch:                          after patch:
;   ff 40 24   inc dword ptr [eax+0x24]    90 90 90   nop nop nop
```
```c
// effect, expressed in C:
// before: refcount++ twice per cache hit
// after: refcount++ once per cache hit (the outer increment is removed)
```
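The byte-level edit above can be sketched as a guarded in-place patch. This is an illustrative Python sketch, not the code that ships in `leakfix.dll`: `apply_patch` and the toy buffer are hypothetical, and only the `ff 40 24` and `90 90 90` byte sequences come from the listing above. Checking the expected bytes first makes a mismatched client build fail loudly instead of corrupting unrelated code.

```python
def apply_patch(image: bytearray, offset: int, expected: bytes, replacement: bytes) -> bool:
    """Overwrite `expected` with `replacement` at `offset`.

    Returns True if bytes were changed and False if the site was already
    patched. Raises ValueError if the site holds anything else.
    """
    if len(expected) != len(replacement):
        raise ValueError("patch must not change the code size")
    current = bytes(image[offset:offset + len(expected)])
    if current == replacement:
        return False  # already patched, nothing to do
    if current != expected:
        raise ValueError(f"unexpected bytes at {offset:#x}: {current.hex()}")
    image[offset:offset + len(expected)] = replacement
    return True


INC_REFCOUNT = bytes.fromhex("ff4024")  # inc dword ptr [eax+0x24]
NOP3 = bytes.fromhex("909090")          # nop; nop; nop

# Toy buffer standing in for a slice of the code section (hypothetical layout).
code = bytearray(b"\x8b\xc1" + INC_REFCOUNT + b"\xc3")
changed = apply_patch(code, 2, INC_REFCOUNT, NOP3)
```

Calling `apply_patch` a second time on the same site returns `False`, which is one simple way to keep a patcher safe to re-run.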
### v5 — RenderSurface PurgeResource override
`RenderSurface`'s and `RenderTexture`'s `PurgeResource` virtual slot
points at `0x004154A0`, which is `mov al, 1; ret` — a no-op stub.
When the resource manager's purge sweep walks `s_Resources` and calls
`PurgeResource()` on each entry, the call returns "1 = purged" but
the resource's D3D handle + heap state is never touched. Result:
purged-shell accumulation in `s_Resources`.
```c
// before — at slot 2 of the RenderSurface vtable (0x0079A684):
// PurgeResource = noop_stub; // 0x004154A0
// int noop_stub() { return 1; }
//
// after — slot 2 redirected to our thunk in leakfix.dll:
int purge_rendersurface_thunk(RenderSurface* self) {
    RenderSurface::Destroy(self);   // real cleanup
    return 1;                       // engine marks entry purged
}
// same fix mirrored to RenderTexture slot 2 (0x0079C1A0).
```
### v11 — two dangling-pointer crash guards
Two places where the engine dereferences a pointer that's been freed
elsewhere. Both manifest as AVs that take the process down.
**Site 1:** `delete_contents` hash walk (`0x00587126`).
The loop falls through into a dereference of an already-freed bucket
node when the bucket chain was rebuilt mid-walk. Fix: retarget the
JMP so the freed-bucket branch jumps to the epilogue, skipping the
deref.
```asm
; before: eb 07 jmp +0x07 ; into the deref
; after: eb 42 jmp +0x42 ; into the epilogue (skip deref)
```
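To see where each form of the jump lands, the target of a 2-byte `EB rel8` instruction is the address of the next instruction plus the sign-extended displacement. The sketch below is illustrative only: `short_jmp_target` is a hypothetical helper, and it assumes the `EB` opcode itself sits at the site address quoted above, which the listing does not state explicitly.

```python
def short_jmp_target(insn_addr: int, rel8: int) -> int:
    """Target of `EB rel8`: next-instruction address plus the signed 8-bit displacement."""
    if not 0 <= rel8 <= 0xFF:
        raise ValueError("rel8 must be a single byte")
    signed = rel8 - 0x100 if rel8 >= 0x80 else rel8
    return (insn_addr + 2 + signed) & 0xFFFFFFFF


SITE = 0x00587126  # assumption: address of the EB opcode itself
before = short_jmp_target(SITE, 0x07)  # original displacement
after = short_jmp_target(SITE, 0x42)   # patched displacement
print(f"before -> {before:#010x}, after -> {after:#010x}")
```

Because only the one-byte displacement changes, the patch stays the same size as the original instruction.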
**Site 2:** `~GXTri3Mesh` slot 0 deref (`0x005E565D`).
Destructor of `GXTri3Mesh` reads its slot[0] *then* zeros it. If
slot[0] is stale (some other path already freed it), the deref AVs.
Fix: reorder so we zero first; never deref a slot we can't trust.
```asm
; before:                         after:
;   8B 08      mov ecx, [eax]       89 5E 08    mov [esi+8], ebx   ; zero first
;   50         push eax             90 ... 90   nop x6             ; skip deref + call
;   FF 51 08   call [ecx+8]
;   89 5E 08   mov [esi+8], ebx
```
### v14 — CEnvCell::ClipPlaneList leak
`CEnvCell::Destroy` contains an 18-byte cleanup block that **only
zeros `cplane_num`** — never frees the underlying `ClipPlaneList`
object hanging off `[this+0xDC]`. Every cell unload leaks one of
these. Replace the broken block with a `JMP` to a thunk in
leakfix.dll that does the real cleanup:
```c
// thunk pseudo-code:
void v14_clipplane_cleanup_thunk(CEnvCell* self) {
    ClipPlaneListWrapper* outer = self->cplane_wrapper;   // [esi+0xDC]
    if (outer) {
        ClipPlaneList* inner = outer->inner;              // [outer+0x0]
        if (inner) {
            inner->~ClipPlaneList();
            operator delete(inner);
        }
        operator delete[](outer);
        self->cplane_wrapper = nullptr;
    }
    // jump back to V14_RESUME_VA (just past the original 18-byte block)
}
```
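Redirecting the 18-byte block to the thunk comes down to writing a 5-byte `E9 rel32` jump whose displacement is measured from the end of the jump instruction. A minimal Python sketch of that encoding follows. Both addresses are hypothetical placeholders, and padding the leftover 13 bytes with NOPs is an assumption for tidiness, since the thunk resumes past the block and those bytes would not execute.

```python
import struct


def encode_jmp_rel32(src: int, dst: int, total_len: int = 5) -> bytes:
    """Encode `E9 rel32` at `src` jumping to `dst`, NOP-padded to `total_len` bytes."""
    if total_len < 5:
        raise ValueError("a JMP rel32 needs at least 5 bytes")
    rel = (dst - (src + 5)) & 0xFFFFFFFF  # relative to the next instruction
    return b"\xE9" + struct.pack("<I", rel) + b"\x90" * (total_len - 5)


def decode_jmp_rel32(src: int, code: bytes) -> int:
    """Return the absolute target of an `E9 rel32` located at `src`."""
    if code[0] != 0xE9:
        raise ValueError("not a JMP rel32")
    (rel,) = struct.unpack("<i", code[1:5])
    return (src + 5 + rel) & 0xFFFFFFFF


BLOCK_VA = 0x00500000  # hypothetical address of the 18-byte block
THUNK_VA = 0x10001000  # hypothetical address of the thunk in leakfix.dll
patch = encode_jmp_rel32(BLOCK_VA, THUNK_VA, total_len=18)
```

Decoding the patch back to its target is a cheap sanity check before writing it into the process.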
### v22 — unpacker stale-pointer SEH guard
A small inline unpacker at `0x00526A50` pulls 4 DWORDs from
`arg1->buffer`. On 2026-05-21 the server simultaneously fed five
clients a buffer pointing into freed/kernel memory; all five
AV'd on the 4th deref. The engine *already* has a code path for
"buffer too small / unpack failed" (line 1 of the function checks a
size field and returns 0). We just wrap the whole function body in
SEH and route AVs to that same return-0 path.
```c
// 1. Copy the original 73 bytes of the function to executable memory.
// 2. Patch the original entry with JMP rel32 to our wrapper.
int v22_unpacker_wrapper(this, arg1, count) {
    __try {
        return original_copy(this, arg1, count);   // run the real unpacker
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        // log + return 0 (engine treats this as size-check failure)
        return 0;
    }
}
```
## Install
```powershell
# 1. Copy leakfix.dll into your AC directory
Copy-Item .\dll\leakfix\dist\leakfix.dll "C:\Turbine\Asheron's Call\"
# 2. Patch acclient.exe to import leakfix.dll
python tools\install_leakfix.py "C:\Turbine\Asheron's Call"
# 3. Verify
python tools\install_leakfix.py "C:\Turbine\Asheron's Call" verify
```
The installer adds a `.limport` PE section to acclient.exe containing
the rebuilt import table. It backs up the original to
`acclient.exe.bare_original` on first run, and is idempotent.
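The backup-on-first-run behaviour can be sketched in a few lines. This is not the actual code in `tools/install_leakfix.py`; `backup_once` is a hypothetical helper that only illustrates why a re-run never overwrites the pristine copy with an already-patched executable.

```python
import shutil
import tempfile
from pathlib import Path


def backup_once(exe: Path, suffix: str = ".bare_original") -> bool:
    """Copy `exe` to `exe + suffix` unless that backup already exists."""
    backup = exe.with_name(exe.name + suffix)
    if backup.exists():
        return False  # keep the first (pristine) backup untouched
    shutil.copy2(exe, backup)
    return True


# Demonstration on a throwaway directory, not a real AC install.
with tempfile.TemporaryDirectory() as tmp:
    exe = Path(tmp) / "acclient.exe"
    exe.write_bytes(b"original image")
    first = backup_once(exe)            # creates the backup
    exe.write_bytes(b"patched image")   # simulate the import-table patch
    second = backup_once(exe)           # re-run: backup is left alone
    preserved = (Path(tmp) / "acclient.exe.bare_original").read_bytes()
```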
## Roll back
```powershell
Copy-Item "C:\Turbine\Asheron's Call\acclient.exe.bare_original" `
"C:\Turbine\Asheron's Call\acclient.exe" -Force
Remove-Item "C:\Turbine\Asheron's Call\leakfix.dll" -ErrorAction Ignore
```
## Files
- `dll/leakfix/src/` — DLL source (C++ with inline asm for the naked thunks)
- `dll/leakfix/dist/leakfix.dll` — current production build (117 KB)
- `dll/leakfix/build.bat` — build script (VS 2022 BuildTools required)
- `tools/install_leakfix.py` — patches acclient.exe to import leakfix.dll
- `tools/check_acclient_imports.py` — verify import table contains leakfix.dll
- `references/` — symbol table, pseudo-C, header for the 2013 client (PDB-backed)
The rest of this document is the original VM operator brief that
drove the investigation. Preserved for context but no longer
operationally relevant — the hunt is done.
---
# Retail AC Memory Leak Hunt — VM Operator Brief
**You are picking this up cold on a freshly-provisioned Windows VM.**
This document is your full mission brief. Read it end-to-end before
running anything, then drive the work autonomously, using
`ScheduleWakeup` (Claude Code) to pace long-running operations between
your active turns.
---
## 1. Mission
Find and patch a memory leak in the retail Asheron's Call client. The
production symptom is a hard crash after ~45 days of continuous play
on the **End-of-Retail (EoR, ~Jan 2017) client**. We don't have symbols
for that binary — but we have **full PDB symbols for the Sept 2013
v11.4186 client**, which almost certainly carries the same leak (AC was
in pure maintenance mode 2013→2017, very little net new code).
**The hunt happens on the 2013 client (symbolized).**
**The patch ships against the EoR client (via BinDiff-forward).**
### What "done" looks like
1. A specific function in the 2013 client is identified as the leak
source, with evidence: monotonic UMDH growth across multiple
snapshot diffs attributed to that function's call stack.
2. The corresponding function in the EoR client is located via
BinDiff (this step happens on the **host machine**, not the VM —
the BNDB files live there).
3. A DLL-injection patch is built that hooks the EoR function and
plugs the leak (typically: adds a missing `delete`/`Release`/decref
on a known path).
4. A 5+ day soak on EoR with the patch installed completes without
the OOM crash that reproduces unpatched in the same window.
### Hard scope boundary
This is a self-contained side quest. **Do not** expand it into a
general retail-instrumentation framework, a fork of the controller
DLL into a fully-featured bot, a parallel acdream feature, or "while
I'm here" refactors of the AC2D/Mosswart tooling. Find leak → patch →
validate → ship → done. If you catch yourself reaching for adjacent
work, stop and re-read this paragraph.
---
## 2. Why this works (assumptions you can rely on)
- **Compiler & toolchain stability.** 2013 and EoR were both built with
the same VC++ family on the same Turbine build farm. Binary structure
is highly similar.
- **Code stability.** AC went into maintenance after Throne of
Destiny (2005) and stayed there. Most of the codebase did not change
meaningfully between 2013 and EoR. A leak severe enough to crash in
45 days has almost certainly been present for many years.
- **PDB → BinDiff path is mature.** `BinDiff` and `Diaphora` routinely
achieve 80-95% function-match rates across related VC++ binaries.
Once you identify the leaking function in 2013 (with name), porting
the symbol forward to EoR is signature-scan-able.
### What you're betting on, and the fallback
- **Primary bet:** the leak repros on the 2013 client. UMDH on the
2013 client + activity bot reveals it within hours-to-days. You
identify a named function, hand the name to the host for BinDiff,
receive the EoR signature back, build the patch DLL, validate.
- **Fallback:** the leak does NOT repro on 2013 — i.e. it was
introduced after Sept 2013. In that case, you fall back to hunting
on the EoR client without symbols, using BinDiff-transferred names
for whatever functions match the 2013 codebase. This is slower but
still feasible. The primary-vs-fallback determination is **Phase 1
Decision Gate** below.
---
## 3. Package contents
```
leak-hunt-vm-2026-05-12/
├── README.md ← you are here
├── MANIFEST.md ← list of out-of-repo files copied in
├── CLAUDE.md ← VM-side project rules (persistent)
├── templates/
│ ├── supervisor.ps1 ← skeleton — start ACE, start client, snapshot loop
│ ├── snapshot.ps1 ← UMDH single-shot
│ ├── activity-phases.json ← phase schedule template
│ ├── login.ahk ← AutoHotkey login skeleton
│ └── trace.cdb ← cdb scripting template
├── tools/
│ ├── check_exe_pdb.py ← verify binary ↔ PDB GUID match
│ ├── dump_pdb_info.py ← PDB metadata
│ └── pdb_extract.py ← regenerate symbols.json if needed
├── pdb/
│ └── acclient.pdb ← (29 MB, copied per MANIFEST)
└── references/
├── symbols.json ← 18,366 named functions + addresses (grep-friendly)
├── types.json ← 5,371 struct/class type definitions
├── acclient.h ← verbatim retail header structs
└── acclient_2013_pseudo_c.txt ← 64 MB symbolized Binary Ninja pseudo-C
```
The Python tools are stdlib-only (no pip). Everything else is data.
---
## 4. What you need on the VM (one-time, before starting)
If any of these is missing, **ask the user before guessing**.
| Component | Where | Notes |
|---|---|---|
| Retail AC client (2013 v11.4186) | `C:\Turbine\Asheron's Call\` | Standard install path. Verify match with `check_exe_pdb.py` before any other work. The `_NT_SYMBOL_PATH` must include `pdb/`. |
| Retail AC dat files | inside the install | `client_portal.dat`, `client_cell_1.dat`, `client_highres.dat`, `client_local_English.dat` |
| ACE server | `127.0.0.1:9000` on VM | Use ACEmulator from github.com/ACEmulator/ACE. Same config as user's dev box. Confirm it accepts logins before continuing. |
| Test character | on the VM's ACE | Suggested name: `+Leakhunt`. GM-marker `+` so debug commands are available. |
| Windows Debugging Tools | Microsoft Store WinDbg or Win10/11 SDK | Need `cdb.exe`, `umdh.exe`, `gflags.exe`. 32-bit (`x86`) versions — `acclient.exe` is 32-bit. |
| AutoHotkey v2 | autohotkey.com | For login automation. v2 only — templates assume v2 syntax. |
| Sysinternals `procdump` | sysinternals.com | Crash-dump capture. |
| MinHook (optional, for patch DLL) | github.com/TsudaKageyu/minhook | Only needed at Phase 8. Defer. |
| Shared folder or mounted drive | `Z:\` or similar | For passing snapshots back to host. Configure at VM-setup time. |
---
## 5. Configuration questions to ask the user at session start
**Ask these first, before running anything.** They materially affect
the harness.
1. **Where is ACE running** — same VM (recommended; snapshot-clean) or
on the host with VM networking through to it? Default assumption:
same VM.
2. **What's the AC install path** if it's not the standard
`C:\Turbine\Asheron's Call\`?
3. **Output flow** — shared folder path? Or push artifacts to a git
branch (e.g. `leak-hunt-vm/2026-05-12`)? Default: shared folder
to `Z:\leak-hunt\` on host.
4. **Test character name** on the VM ACE? Default: `+Leakhunt`.
5. **VM specs** — RAM and core count? (Affects whether to enable
gflags+UST from the start, which costs ~20-30% perf.)
6. **EoR binary location on host** — confirm the user has it at
`C:\Users\erikn\source\repos\acdream\refs\acclient-eor-2024-09-11.bndb`
(Binary Ninja db). This isn't needed on the VM but is critical for
Phase 7 BinDiff on the host.
7. **Wake-up cadence preference** — do they want you to use
`ScheduleWakeup` for hours-long gaps, or stay continuously
active? Default: ScheduleWakeup for any gap > 30 min.
Save the user's answers as memory entries before proceeding past
Phase 0 so a future session can pick up cold.
---
## 6. Phased plan
Each phase has a **goal**, **commands**, **decision gate**, and
**estimated time**. Don't skip ahead. Don't run multiple phases in
parallel until Phase 4.
### Phase 0 — Verify the bench (target: 30 min)
**Goal:** prove the environment can launch AC, log in, and observe
memory.
1. `py tools/check_exe_pdb.py "C:\Turbine\Asheron's Call\acclient.exe"`.
Expect: `=== MATCH: this exe pairs with our acclient.pdb ===`.
If MISMATCH → stop, ask the user which build they installed.
2. `py tools/dump_pdb_info.py pdb/acclient.pdb`. Confirm GUID
`9e847e2f-777c-4bd9-886c-22256bb87f32`, age 1.
3. Start ACE locally (`dotnet run` in the ACE checkout, or
`ACE.exe` if pre-built). Confirm it listens on `127.0.0.1:9000`.
4. Manually launch AC, log in with the test character, walk one
step, log out. **This proves the bench works before you add
instrumentation.**
5. Take a clean Hyper-V / VMware snapshot named `bench-verified`.
The supervisor will revert to this before each run.
**Decision gate:** can you launch, log in, walk, and log out cleanly?
If no, fix this before anything else. If yes, proceed.
---
### Phase 1 — Idle baseline + decide hunt platform (target: 4 hours)
**Goal:** does the leak reproduce on the 2013 client when the player
sits at the lifestone doing nothing? If yes, primary plan; if no,
Phase 2 will find the right activity profile.
1. Enable the user-mode stack trace database, which UMDH needs to
attribute allocations to call stacks:
`gflags /i acclient.exe +ust`. This is registry-set; survives
reboots. (Disable later with `gflags /i acclient.exe -ust`.)
2. Set `_NT_SYMBOL_PATH=<vm-path-to-pdb-dir>`.
3. Launch AC via `templates/supervisor.ps1` (which sets env and
spawns the process). Log in manually OR via `templates/login.ahk`.
4. Walk to a quiet spot (lifestone interior, away from spawn). Sit.
5. `templates/snapshot.ps1 -ProcessId <pid> -Out snap_001.txt`
immediately. Wait 30 min. Take `snap_002`. Repeat for at least
4 hours (8 snapshots).
6. `umdh -d snap_001.txt snap_008.txt -f:diff_idle.txt`. Read the top 20
growing stacks. Save the diff to `Z:\leak-hunt\phase1\`.
**Decision gate:**
- **Total committed memory grew >50 MB over 4h idle?** Leak repros at
idle. Skip Phase 2, jump to Phase 4 (long-soak idle).
- **Total committed grew 5-50 MB?** Leak may need amplification.
Proceed to Phase 2.
- **Total committed grew <5 MB?** Leak is activity-specific or
doesn't exist on 2013. Proceed to Phase 2.
- **Memory dropped or oscillates around 0?** No leak signal at idle.
Phase 2 is where you'll find it (or won't).
Record the baseline growth-rate number in memory:
`leak_hunt_phase1_baseline_mb_per_hour`.
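The gate above can be written down as a small function so the decision is applied the same way on every run. This is a sketch only: the function names are hypothetical, and treating exactly 5 MB and exactly 50 MB as part of the middle band is an assumption, since the thresholds above do not say which side the boundaries fall on.

```python
def growth_mb_per_hour(start_mb: float, end_mb: float, hours: float) -> float:
    """Average committed-memory growth rate over the observation window."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (end_mb - start_mb) / hours


def phase1_gate(growth_mb: float) -> tuple[str, int]:
    """Map 4h-idle committed-memory growth (MB) to (reading, next phase)."""
    if growth_mb > 50:
        return ("leak reproduces at idle", 4)
    if growth_mb >= 5:
        return ("leak may need amplification", 2)
    if growth_mb > 0:
        return ("activity-specific or absent on 2013", 2)
    return ("no leak signal at idle", 2)


# Made-up numbers: 312 MB committed at start, 380 MB after 4 hours.
baseline = growth_mb_per_hour(312.0, 380.0, 4.0)
reading, next_phase = phase1_gate(380.0 - 312.0)
```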
---
### Phase 2 — Activity-phase characterization (target: 1-2 days)
**Goal:** find which player activity causes the leak. The bot is not
yet built, so you drive this manually with the activity-phase template
running as an AHK macro, or by playing 30-min phases yourself if no
bot is available.
The five canonical phases (see `templates/activity-phases.json`):
1. **idle**: stand at lifestone, no input
2. **wander**: walk a fixed route around Holtburg
3. **chat**: spam say/tell/global chat
4. **target-cycle**: Tab through nearby NPCs/mobs, no combat
5. **ui-cycle**: open/close inventory, character pane, spells
**Procedure per session:**
1. Start fresh from `bench-verified` snapshot.
2. Run a single phase for 1 hour with snapshots every 15 min.
3. `umdh -d` diff the snapshot pair for that phase.
4. Record growth-rate per phase to memory.
5. Repeat for each phase, single VM, single phase per run.
(If user has authorized multiple parallel VMs, run different phases
simultaneously instead of sequentially.)
**Decision gate:** rank phases by growth rate. The top phase is your
target for Phase 4 amplification. Save ranking to memory.
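Ranking is easiest if every phase is normalised to MB per hour first, so that runs of different lengths stay comparable. A minimal sketch, with a hypothetical `rank_phases` helper and made-up measurements:

```python
def rank_phases(samples: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """`samples` maps phase name -> (growth in MB, duration in hours).

    Returns (phase, MB per hour) pairs, fastest-growing first.
    """
    rates = {name: mb / hours for name, (mb, hours) in samples.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers for illustration only.
measured = {
    "idle": (2.0, 1.0),
    "wander": (14.0, 1.0),
    "chat": (3.5, 1.0),
    "target-cycle": (6.0, 1.0),
    "ui-cycle": (9.0, 1.0),
}
ranking = rank_phases(measured)
```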
---
### Phase 3 — Controller DLL (target: 1-2 days)
**Goal:** build a small DLL that drives the leaking phase
deterministically and reproducibly, faster than a human can.
**Build approach:**
- C++ DLL, 32-bit, compiled against Visual Studio Build Tools.
- MinHook for function hooking.
- LoadLibrary'd into `acclient.exe` via a small launcher EXE
(CreateProcess SUSPENDED → WriteProcessMemory of LoadLibrary
trampoline → ResumeThread). Standard injection pattern.
- Hook a frame-loop function from `symbols.json`: search
`references/symbols.json` for `CGameLoop`, `WorldFilter`, `Tick`,
`ProcessFrame`, and pick the highest-frequency one with a stable
signature.
- Call retail functions directly via PDB-resolved addresses. Examples:
`CPhysicsObj::set_velocity`, `CChatManager::SendSay`,
`CPlayerSystem::SelectTarget`. These take a `this` pointer in `ecx`
(thiscall), so you'll need a small asm trampoline or
`__thiscall` calling-convention helpers.
**The bot's job:**
- Drive the top-ranked Phase 2 activity continuously.
- Emit a heartbeat to a log file every 30s so the supervisor can
detect wedging.
- Run a position watchdog for auto-restart: if `CPhysicsObj::position`
hasn't changed in 5 min during a movement phase, signal the
supervisor to revert and retry.
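The two liveness checks in the list above (heartbeat and position watchdog) can be sketched as pure functions that the supervisor evaluates on each poll. Everything here is an assumption for illustration: the function names, the choice to tolerate three missed heartbeats before declaring a wedge, and the sample format are not taken from the templates.

```python
def is_wedged(last_heartbeat: float, now: float,
              interval_s: float = 30.0, missed_allowed: int = 3) -> bool:
    """True if more than `missed_allowed` heartbeat intervals have passed silently."""
    return (now - last_heartbeat) > interval_s * missed_allowed


def position_stalled(samples: list[tuple[float, tuple[float, float, float]]],
                     now: float, window_s: float = 300.0) -> bool:
    """`samples` holds (timestamp, (x, y, z)) pairs, oldest first.

    True if the latest position has been unchanged for at least `window_s`.
    """
    if not samples:
        return False
    latest = samples[-1][1]
    unchanged_since = samples[-1][0]
    for ts, pos in reversed(samples):
        if pos != latest:
            break
        unchanged_since = ts  # walk back to the first sample at this position
    return now - unchanged_since >= window_s


# Made-up trace: the character moved once at t=100 s and then stood still.
trace = [(0.0, (1.0, 1.0, 0.0)), (100.0, (2.0, 1.0, 0.0)),
         (200.0, (2.0, 1.0, 0.0)), (400.0, (2.0, 1.0, 0.0))]
stalled = position_stalled(trace, now=400.0)
```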
**Reuse opportunity:** the user maintains MosswartMassacre and
MosswartOverlord; both are AC client DLL-injection projects. **Ask
the user for read access** before designing from scratch; they may
have a working injector + MinHook scaffolding you can port from in
hours rather than days. Do not assume; ask.
**Decision gate:** bot runs the leaking phase for 1 hour
unattended, emits heartbeats, produces measurable UMDH growth.
---
### Phase 4 — Long-soak with amplification (target: 12-48 hours)
**Goal:** generate a clean signal: one or two leaking call stacks
visibly dominate the UMDH diff.
1. Revert VM to `bench-verified`.
2. Launch via `supervisor.ps1` with the controller DLL injected.
3. Snapshot every 15 min for 12+ hours.
4. `umdh -d` snap_001 vs snap_N every couple of hours during your
active turns. Between active turns, use `ScheduleWakeup` with
delay 1800-3600s and the reason `"long-soak snapshot check"`.
**Decision gate:** UMDH diff shows one or more call stacks with
monotonic growth across all adjacent-pair diffs, dominating the
total by 10× over the next-highest. That's your leak candidate(s).
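One way to make this gate mechanical is to track outstanding bytes per stack across snapshots and apply both tests (strictly monotonic growth, then 10× dominance over the next stack). The sketch below is one possible reading of the gate, not a prescribed method: `leak_candidates`, the stack identifiers, and the input format are hypothetical, and it assumes the per-stack byte counts have already been extracted from the UMDH logs.

```python
def leak_candidates(series: dict[str, list[int]], dominance: float = 10.0) -> list[str]:
    """`series` maps a stack id to its outstanding bytes at each snapshot.

    Returns the top-growing stacks if each grew in every adjacent pair and the
    last of them outgrew the next stack by `dominance`; otherwise returns [].
    """
    growth = {s: v[-1] - v[0] for s, v in series.items() if len(v) >= 2}
    monotonic = {s for s, v in series.items()
                 if len(v) >= 2 and all(b > a for a, b in zip(v, v[1:]))}
    ranked = sorted(growth, key=growth.get, reverse=True)
    for k, stack in enumerate(ranked, start=1):
        if stack not in monotonic:
            return []  # a non-monotonic stack sits too high in the ranking
        runner_up = growth[ranked[k]] if k < len(ranked) else 0
        if growth[stack] >= dominance * max(runner_up, 0):
            return ranked[:k]
    return []


# Made-up data: one clear leaker plus two noisy stacks.
clear = leak_candidates({
    "CFoo::AllocateBar": [100, 400, 900, 1600],
    "noise_a": [50, 60, 55, 70],
    "noise_b": [10, 10, 10, 10],
})
```

An empty result corresponds to the ambiguous case where no stack clearly dominates, which is a point to stop and ask rather than guess.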
---
### Phase 5 — Identify the leaking function (target: 2-4 hours)
**Goal:** convert the UMDH call stack into a named function we can
study and patch.
1. The top growing stack will look like:
```
ntdll!RtlAllocateHeap+0x...
acclient!operator new+0x...
acclient!CFoo::AllocateBar+0x42
acclient!CFoo::DoTheThing+0x18
acclient!CGameLoop::Tick+0x...
```
2. The named function is `CFoo::AllocateBar`. Grep
`references/acclient_2013_pseudo_c.txt` for `CFoo::AllocateBar`
to read its body.
3. Identify the paired free function (`CFoo::ReleaseBar`,
`~CFoo`, etc.) and confirm by reading both.
4. Find every call site of `CFoo::AllocateBar` (grep the pseudo-C
for the function name) and verify each has a matching paired
release. The one that doesn't is the bug.
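Picking the named function out of a stack like the one in step 1 can be automated: skip frames from other modules and the allocator wrappers, then take the first remaining frame. A minimal sketch, assuming the frames have already been extracted one per line in the `module!symbol+0xNN` form shown above; `leak_frame` and its skip list are hypothetical.

```python
from typing import Optional

ALLOCATOR_WRAPPERS = ("operator new", "operator new[]", "malloc")


def leak_frame(stack: list[str], module: str = "acclient") -> Optional[str]:
    """Return the first `module` frame below the allocator wrappers, without its offset."""
    for line in stack:
        mod, sep, rest = line.strip().partition("!")
        if not sep or mod != module:
            continue  # frame belongs to another module (e.g. ntdll)
        symbol = rest.rsplit("+0x", 1)[0]
        if symbol in ALLOCATOR_WRAPPERS:
            continue  # generic allocation entry point, not the leak site
        return symbol
    return None


example = [
    "ntdll!RtlAllocateHeap+0x...",
    "acclient!operator new+0x...",
    "acclient!CFoo::AllocateBar+0x42",
    "acclient!CFoo::DoTheThing+0x18",
    "acclient!CGameLoop::Tick+0x...",
]
suspect = leak_frame(example)
```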
**Decision gate:** you have (a) the leaking function name, (b) the
specific call site that doesn't free, (c) a hypothesis for the
patch (typically: add a `delete` or `Release()` on a specific code
path). Save these to memory + a write-up file.
---
### Phase 6 — Cross-reference with retail debugger trace (optional, target: 2 hours)
**Goal:** confirm the leak path is actually hit at runtime in a real
play scenario, not just statically possible.
This step is optional but recommended if the leak path is conditional
(e.g. "only when the chat buffer wraps"). Use the cdb workflow
documented in `templates/trace.cdb` and the retail-debugger section
in `CLAUDE.md`. Attach to acclient.exe, breakpoint on the alloc
function with a non-blocking action (`r $t0=@$t0+1; gc`), let it
accumulate for 30 min, count hits, correlate with UMDH growth bytes.
---
### Phase 7 — BinDiff to EoR (**HOST MACHINE — not VM**)
**Goal:** produce an EoR-binary signature and offset for the leaking
function.
This phase does not happen on the VM. The Binary Ninja databases
live on the host (`refs/acclient-eor-2024-09-11.bndb`,
`refs/acclient_2013-2024-09-11.bndb`).
**You (VM Claude) do:**
1. Write a structured handoff file
`Z:\leak-hunt\phase7-handoff.md` containing: the function name,
its 2013 RVA, the paired release function name, the suspected
missing-free call site, the call-graph context, and a 32-48 byte
AOB signature with wildcards over relocatable operands (cite
the byte sequence from the pseudo-C/disassembly).
2. Notify the user that Phase 7 is ready.
**The user (or a Claude session on the host) does:**
3. Load both BNDBs in Binary Ninja, run BinDiff (or use BN's native
diff). Locate the matching function in EoR.
4. Verify the AOB signature still matches in EoR (small mods are
OK — adjust wildcards as needed).
5. Write back to `Z:\leak-hunt\phase7-result.md`: EoR RVA,
confirmed signature, any structural differences worth knowing.
You (VM Claude) resume once that file appears.
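The AOB signature in the handoff is only useful if it matches exactly once in the target image. A minimal wildcard scanner in Python is sketched below; the `??` wildcard notation and both helper names are assumptions for illustration, and the byte string is a made-up buffer rather than bytes from either client.

```python
def parse_aob(pattern: str) -> list:
    """'8B 4E ?? E8' -> [0x8B, 0x4E, None, 0xE8]; '??' marks a wildcard byte."""
    sig = [None if tok in ("??", "?") else int(tok, 16) for tok in pattern.split()]
    if not sig:
        raise ValueError("empty signature")
    return sig


def find_aob(image: bytes, pattern: str) -> list[int]:
    """Return every offset in `image` where the signature matches."""
    sig = parse_aob(pattern)
    hits = []
    for start in range(len(image) - len(sig) + 1):
        if all(want is None or image[start + i] == want
               for i, want in enumerate(sig)):
            hits.append(start)
    return hits


# Made-up bytes: the same instruction shape appears twice with different operands.
image = bytes.fromhex("90 8B 4E 10 E8 11 22 33 44 C3 8B 4E 24 E8 AA BB CC DD")
loose = find_aob(image, "8B 4E ?? E8 ?? ?? ?? ??")
tight = find_aob(image, "C3 8B 4E ?? E8")
```

Two hits for the loose pattern show why wildcards should cover only the relocatable operands: extending the signature with stable neighbouring bytes narrows it back to a single match.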
---
### Phase 8 — Patch DLL (target: 1 day)
**Goal:** ship a DLL that, when loaded into `acclient.exe` (any
build that matches the signature), plugs the leak.
- Same scaffold as the controller DLL — MinHook + a launcher EXE.
- Hook the leaking function (or its caller, whichever is cleanest).
- Wrap the existing logic so the missing free is performed on the
bug path.
- Provide a versioned filename and a small README so it's clear
which client build it targets.
- **Verify on the 2013 client first** — same UMDH soak, expect the
top growing stack to vanish.
---
### Phase 9 — Multi-day soak validation (target: 5+ days)
**Goal:** prove the patch fixes the production crash.
1. Install the patch DLL injector into the EoR client setup.
2. Launch under the supervisor with snapshots every hour (lower
freq — we're not hunting now, just confirming).
3. Run a controlled activity profile (the bot's full rotation) for
5+ days continuous.
4. **Pass:** no OOM crash, committed memory stable or decreasing
slope. **Fail:** crash before 5 days → back to Phase 5 to find
the second leak.
---
### Phase 10 — Ship
1. Write up findings:
   - The leak's root cause (one paragraph)
   - The patch's mechanism (one paragraph)
   - The 2013-vs-EoR signature note
   - Validation evidence (UMDH diffs, soak duration, growth-rate
     plot if you have one)
2. Save the writeup to `Z:\leak-hunt\REPORT.md` and to memory.
3. Notify the user. Stop the loop. Done.
---
## 7. Wake-up protocol
Use `ScheduleWakeup` to self-pace between phases. **Default cadence
table:**
| Situation | Delay | Why |
|---|---|---|
| Active analysis (reading UMDH diffs, writing code) | none — stay engaged | Full-context work |
| Between snapshots inside a soak (Phase 1/4/9) | 1500-1800 s | Cache stays warm-ish, snapshots accumulate |
| Overnight gap (≥6 h) | 3600 s and chain | One cache miss is cheap vs. burning per-hour |
| Waiting for user (Phase 7 handoff) | 3600 s | Poll for the result file |
Pass the same loop prompt each turn. The `reason` field should
identify the phase and what you'll check (e.g. `"phase 4 snapshot
check: read snap_012 and diff against snap_001"`).
---
## 8. When to stop and ask the user
- **Phase 0 verification fails** (PDB mismatch, login fails, ACE
not reachable). Don't guess at fixes.
- **The bot wedges and auto-recovery fails twice in a row.**
- **You're about to expand scope** (refactor the supervisor into a
framework, build a UI for snapshot review, port code into
acdream's tree). Stop and ask. Default answer is no.
- **You hit a decision gate where the data is genuinely ambiguous**
(e.g. growth rate is moderate but no single stack dominates).
- **Phase 5 produces a function name that isn't in symbols.json.**
Probably means an indirect call or vtable dispatch — ask before
spending hours decoding it.
- **The patch in Phase 9 doesn't validate.** Don't iterate
indefinitely; surface findings and re-plan.
---
## 9. Memory protocol
Save findings as memory entries so a session that wakes 8 hours later
can resume cold. Specifically:
- `project_leak_hunt.md` — top-level project context, current phase,
open questions
- `leak_hunt_phase_N.md` — per-phase findings, growth rates, decisions
- `leak_hunt_candidate_<funcname>.md` — once a function is suspected,
everything you know about it
- `feedback_leak_hunt_<topic>.md` — if the user gives operational
feedback during this hunt, record it
Update `MEMORY.md` index entry for each. Keep entries short and
factual; long writeups go in `Z:\leak-hunt\` files referenced from
memory.
---
## 10. Hard rules (do not violate)
1. **Don't run anything that touches the user's host machine.** The
VM is isolated for a reason. All output goes through the shared
folder.
2. **Don't disable gflags+UST mid-run.** If you need to disable,
stop the supervisor, disable, take a fresh baseline.
3. **Don't modify `acclient.exe` on disk.** All patches are
runtime DLL hooks. If you ever feel tempted to binary-patch the
exe directly, ask the user first.
4. **Don't auto-update `MEMORY.md`** without first saving the
underlying memory file. The index must point at real files.
5. **Don't claim a leak is found** without the evidence checklist:
   - ≥3 consecutive UMDH diffs showing the same top stack growing
   - The stack is attributed to a named function in `symbols.json`
   - The call site is identified in `acclient_2013_pseudo_c.txt`
   - The hypothesis for the missing free is stated
6. **Don't proceed to Phase 8 without Phase 7 handoff complete.**
The patch must target the EoR signature, not the 2013 RVA.
---
## 11. First action when you start a fresh session
```
1. Read README.md (this file) end-to-end.
2. Read CLAUDE.md (project rules — concise).
3. Run the configuration questions in §5 by the user.
4. Save their answers as memory.
5. Begin Phase 0.
```
Good hunting.