Five bugs identified and patched in retail Asheron's Call client:

- v3b: palette refcount over-increment (3-byte NOP at two sites)
- v5: RenderSurface PurgeResource no-op stub (vtable slot 2 thunk)
- v11: two dangling-pointer crash guards (NULL-check + reorder)
- v14: CEnvCell::Destroy ClipPlaneList leak (18-byte JMP to cleanup thunk)
- v22: unpacker stale-pointer SEH guard (whole-function `__try`/`__except`)

All five ship in leakfix.dll (117 KB, SHA d282f23c…), which is loaded by acclient.exe at process start via PE import table patching by tools/install_leakfix.py.

Controlled 15-client fleet soak: unpatched control died at 26h with palette exhaustion; all 14 patched clients survived past that point and reached ≥5-day uptime.

Residual ~15 MB/h growth traced to d3d9.dll's internal slab allocator (260 KB surface backing buffers retained after Release). See REPORT.md §10 for the full investigation; conclusion is that it's unfixable from outside d3d9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Retail AC Memory Leak Hunt

> **Status: COMPLETE 2026-05-22.** Five bugs found and patched in the
> retail AC client. Controlled fleet soak showed the unpatched control
> died at 26h with palette exhaustion; all 14 patched clients survived
> past that point and reached ≥5-day uptime. The residual ~15 MB/h
> growth was traced to d3d9.dll's internal slab allocator and is
> unfixable from outside d3d9.
>
> If you just want to install: drop `dll/leakfix/dist/leakfix.dll`
> into your AC directory and run
> `python tools/install_leakfix.py "C:\path\to\AC"`. The installer
> patches `acclient.exe`'s import table to load `leakfix.dll` at
> startup. It is idempotent, so it is safe to re-run.

---

## What ships

| Patch | Bug | One-line fix |
|---|---|---|
| **v3b** | Palette refcount over-increment in `makeModifiedPalette` | NOP the `inc [eax+0x24]` at two sites |
| **v5** | `RenderSurface::PurgeResource` is a no-op stub | Override vtable slot 2 to call `Destroy()` for real |
| **v11** | Two dangling-pointer dereferences in `delete_contents` + `~GXTri3Mesh` | Crash guards: retarget a `JMP` past the stale deref; zero the slot first and skip the deref |
| **v14** | `CEnvCell::Destroy` leaks the `ClipPlaneList` (just zeros the count) | Replace the 18-byte buggy block with a JMP to a thunk that actually frees the list |
| **v22** | Server-driven AV in the unpacker function at `0x00526A50` (5-client mass crash 2026-05-21 09:00) | Wrap the function in `__try / __except`, return 0 on AV (which the engine already handles as the size-check-failure code path) |

All five, plus a crash handler, ship in `leakfix.dll`. Patches are
applied 30 seconds after process start (deferred so Decal/UB win their
own init race first). The crash handler is installed immediately, so any
crashes during the 30s window are still captured.

## Patch pseudo-code

### v3b: palette refcount over-increment
The engine's palette-cache hit path increments the cached entry's
refcount **twice** (once in the cache lookup, once in the constructor
that wraps it). Result: refcount grows monotonically; nothing ever
hits zero; palettes accumulate until the 32-bit address space is
exhausted (~26h on heavy-loot clients).

```asm
; at 0x0053EFFE (and 0x0053F19C, the sibling overload)
; before patch:                          after patch:
; ff 40 24   inc dword ptr [eax+0x24]    90 90 90   nop nop nop
```

```c
// effect, expressed in C:
//   before: refcount++ twice per cache hit
//   after:  refcount++ once per cache hit (the outer increment is removed)
```

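
The byte-level edit above can be sketched as a guarded overwrite. This is a minimal illustration on a synthetic buffer, not the code in `leakfix.dll`: the helper name `nop_out` and the sample bytes around the `ff 40 24` sequence are assumptions made for the example. It shows two properties any such patcher would want: refuse to write if the expected bytes are not there, and do nothing if the site is already patched.

```python
NOP = 0x90
INC_EAX_24 = bytes.fromhex("ff4024")  # inc dword ptr [eax+0x24]


def nop_out(image: bytearray, offset: int, expected: bytes) -> bool:
    """Overwrite `expected` at `offset` with NOPs.

    Returns True if bytes were changed, False if the site was already NOPed.
    Raises ValueError if the site holds anything else (wrong build or offset).
    """
    nops = bytes([NOP]) * len(expected)
    current = bytes(image[offset:offset + len(expected)])
    if current == nops:
        return False  # already patched; re-running is harmless
    if current != expected:
        raise ValueError(f"unexpected bytes at {offset:#x}: {current.hex()}")
    image[offset:offset + len(expected)] = nops
    return True


# Synthetic stand-in for a code section (hypothetical bytes, not the client's).
image = bytearray(b"\x8b\x45\x08" + INC_EAX_24 + b"\xc3")
first = nop_out(image, 3, INC_EAX_24)   # patches the site
second = nop_out(image, 3, INC_EAX_24)  # second run is a no-op
```

Checking the original bytes before writing is what keeps a patch like this from corrupting a client build whose code has shifted.
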
### v5: RenderSurface PurgeResource override
`RenderSurface`'s and `RenderTexture`'s `PurgeResource` virtual slot
points at `0x004154A0`, which is `mov al, 1; ret`, a no-op stub.
When the resource manager's purge sweep walks `s_Resources` and calls
`PurgeResource()` on each entry, the call returns "1 = purged" but
the resource's D3D handle and heap state are never touched. Result:
purged-shell accumulation in `s_Resources`.

```c
// before: slot 2 of the RenderSurface vtable (0x0079A684):
//   PurgeResource = noop_stub;          // 0x004154A0
//   int noop_stub() { return 1; }
//
// after: slot 2 redirected to our thunk in leakfix.dll:
int purge_rendersurface_thunk(RenderSurface* self) {
    RenderSurface::Destroy(self);        // real cleanup
    return 1;                            // engine marks entry purged
}
// same fix mirrored to RenderTexture slot 2 (0x0079C1A0).
```

### v11: two dangling-pointer crash guards
Two places where the engine dereferences a pointer that has been freed
elsewhere. Both manifest as AVs that take the process down.

**Site 1**: `delete_contents` hash walk (`0x00587126`):
The loop falls through into a dereference of an already-freed bucket
node when the bucket chain was rebuilt mid-walk. Fix: retarget the
JMP so the freed-bucket branch jumps to the epilogue, skipping the
deref.

```asm
; before:  eb 07        jmp +0x07   ; into the deref
; after:   eb 42        jmp +0x42   ; into the epilogue (skip deref)
```

**Site 2**: `~GXTri3Mesh` slot 0 deref (`0x005E565D`):
The destructor of `GXTri3Mesh` reads its slot[0] *then* zeros it. If
slot[0] is stale (some other path already freed it), the deref AVs.
Fix: reorder so we zero first; never deref a slot we can't trust.

```asm
; before:                           after:
; 8B 08      mov ecx, [eax]         89 5E 08    mov [esi+8], ebx   ; zero first
; 50         push eax               90 ... 90   nop x6             ; skip deref + call
; FF 51 08   call [ecx+8]
; 89 5E 08   mov [esi+8], ebx
```

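
For Site 1, the effect of changing the single displacement byte can be worked out from how x86 short jumps are encoded: `EB rel8` jumps to the address of the next instruction plus a signed 8-bit offset. The sketch below does that arithmetic. It assumes the address quoted for Site 1 (`0x00587126`) is the address of the `JMP` itself, which the text does not state explicitly, so treat the resulting targets as illustrative.

```python
def short_jmp_target(insn_addr: int, rel8: int) -> int:
    """Target of a 2-byte `EB rel8` jump in 32-bit code.

    The displacement is signed and is relative to the end of the instruction.
    """
    if rel8 >= 0x80:
        rel8 -= 0x100  # sign-extend the 8-bit displacement
    return (insn_addr + 2 + rel8) & 0xFFFFFFFF


JMP_ADDR = 0x00587126  # assumption: the quoted Site 1 address is the JMP opcode

before = short_jmp_target(JMP_ADDR, 0x07)  # original: lands in the deref
after = short_jmp_target(JMP_ADDR, 0x42)   # patched: lands in the epilogue
skipped = after - before                   # bytes of code jumped over
```

Under that assumption the patch moves the branch target forward by `0x3B` bytes, which is why a one-byte change is enough to bypass the stale dereference.
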
### v14: CEnvCell::ClipPlaneList leak
`CEnvCell::Destroy` contains an 18-byte cleanup block that **only
zeros `cplane_num`** and never frees the underlying `ClipPlaneList`
object hanging off `[this+0xDC]`. Every cell unload leaks one of
these. Replace the broken block with a `JMP` to a thunk in
leakfix.dll that does the real cleanup:

```c
// thunk pseudo-code:
void v14_clipplane_cleanup_thunk(CEnvCell* self) {
    ClipPlaneListWrapper* outer = self->cplane_wrapper;   // [esi+0xDC]
    if (outer) {
        ClipPlaneList* inner = outer->inner;              // [outer+0x0]
        if (inner) {
            inner->~ClipPlaneList();
            operator delete(inner);
        }
        operator delete[](outer);
        self->cplane_wrapper = nullptr;
    }
    // jump back to V14_RESUME_VA (just past the original 18-byte block)
}
```

### v22: unpacker stale-pointer SEH guard
A small inline unpacker at `0x00526A50` pulls 4 DWORDs from
`arg1->buffer`. On 2026-05-21 the server fed five clients
simultaneously a buffer pointing into freed/kernel memory; all five
AV'd on the 4th deref. The engine *already* has a code path for
"buffer too small / unpack failed" (line 1 of the function checks a
size field and returns 0). We just wrap the whole function body in
SEH and route AVs to that same return-0 path.

```c
// Pseudo-code: parameter types are omitted, and `self` stands for the
// engine's `this` pointer.
// 1. Copy the original 73 bytes of the function to executable memory.
// 2. Patch the original entry with JMP rel32 to our wrapper.
int v22_unpacker_wrapper(self, arg1, count) {
    __try {
        return original_copy(self, arg1, count);   // run the real unpacker
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        // log + return 0 (engine treats this as size-check failure)
        return 0;
    }
}
```

## Install

```powershell
# 1. Copy leakfix.dll into your AC directory
Copy-Item .\dll\leakfix\dist\leakfix.dll "C:\Turbine\Asheron's Call\"

# 2. Patch acclient.exe to import leakfix.dll
python tools\install_leakfix.py "C:\Turbine\Asheron's Call"

# 3. Verify
python tools\install_leakfix.py "C:\Turbine\Asheron's Call" verify
```

The installer adds a `.limport` PE section to acclient.exe containing
the rebuilt import table. It backs up the original to
`acclient.exe.bare_original` on first run, and is idempotent.

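
The "back up on first run, idempotent afterwards" behaviour described above can be sketched as follows. This is not the code in `tools/install_leakfix.py`; the helper name `backup_once` is hypothetical, and the example runs against a temporary directory rather than a real AC install. The point it illustrates is that the backup must only be written while the on-disk file is still the unpatched original.

```python
import shutil
import tempfile
from pathlib import Path


def backup_once(exe: Path, suffix: str = ".bare_original") -> bool:
    """Copy `exe` to `exe + suffix` unless that backup already exists.

    Returns True if a backup was created on this call.
    """
    backup = exe.with_name(exe.name + suffix)
    if backup.exists():
        return False  # a previous run already saved the pristine binary
    shutil.copy2(exe, backup)
    return True


with tempfile.TemporaryDirectory() as tmp:
    exe = Path(tmp) / "acclient.exe"
    exe.write_bytes(b"original")
    made_first = backup_once(exe)      # first run: backup is created
    exe.write_bytes(b"patched")        # stand-in for the import-table patch
    made_second = backup_once(exe)     # re-run: backup must not be clobbered
    preserved = (Path(tmp) / "acclient.exe.bare_original").read_bytes()
```

If the backup were rewritten on every run, a second install would replace the pristine copy with an already-patched binary and the rollback below would stop working.
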
## Roll back

```powershell
Copy-Item "C:\Turbine\Asheron's Call\acclient.exe.bare_original" `
          "C:\Turbine\Asheron's Call\acclient.exe" -Force
Remove-Item "C:\Turbine\Asheron's Call\leakfix.dll" -ErrorAction Ignore
```

## Files

- `dll/leakfix/src/`: DLL source (C++ with inline asm for the naked thunks)
- `dll/leakfix/dist/leakfix.dll`: current production build (117 KB)
- `dll/leakfix/build.bat`: build script (VS 2022 BuildTools required)
- `tools/install_leakfix.py`: patches acclient.exe to import leakfix.dll
- `tools/check_acclient_imports.py`: verifies that the import table contains leakfix.dll
- `references/`: symbol table, pseudo-C, header for the 2013 client (PDB-backed)

The rest of this document is the original VM operator brief that
drove the investigation. It is preserved for context but is no longer
operationally relevant: the hunt is done.

---

# Retail AC Memory Leak Hunt: VM Operator Brief

**You are picking this up cold on a freshly-provisioned Windows VM.**
This document is your full mission brief. Read it end-to-end before
running anything, then drive the work autonomously, using
`ScheduleWakeup` (Claude Code) to pace long-running operations between
your active turns.

---

## 1. Mission

Find and patch a memory leak in the retail Asheron's Call client. The
production symptom is a hard crash after ~4–5 days of continuous play
on the **End-of-Retail (EoR, ~Jan 2017) client**. We don't have symbols
for that binary, but we have **full PDB symbols for the Sept 2013
v11.4186 client**, which almost certainly carries the same leak (AC was
in pure maintenance mode 2013→2017, very little net new code).

**The hunt happens on the 2013 client (symbolized).**
**The patch ships against the EoR client (via BinDiff-forward).**

### What "done" looks like

1. A specific function in the 2013 client is identified as the leak
   source, with evidence: monotonic UMDH growth across multiple
   snapshot diffs attributed to that function's call stack.
2. The corresponding function in the EoR client is located via
   BinDiff (this step happens on the **host machine**, not the VM;
   the BNDB files live there).
3. A DLL-injection patch is built that hooks the EoR function and
   plugs the leak (typically: adds a missing `delete`/`Release`/decref
   on a known path).
4. A 5+ day soak on EoR with the patch installed completes without
   the OOM crash that reproduces unpatched in the same window.

### Hard scope boundary

This is a self-contained side quest. **Do not** expand it into a
general retail-instrumentation framework, a fork of the controller
DLL into a fully-featured bot, a parallel acdream feature, or "while
I'm here" refactors of the AC2D/Mosswart tooling. Find leak → patch →
validate → ship → done. If you catch yourself reaching for adjacent
work, stop and re-read this paragraph.

---

## 2. Why this works (assumptions you can rely on)

- **Compiler & toolchain stability.** 2013 and EoR were both built with
  the same VC++ family on the same Turbine build farm. Binary structure
  is highly similar.
- **Code stability.** AC went into maintenance after Throne of
  Destiny (2005) and stayed there. Most of the codebase did not change
  meaningfully between 2013 and EoR. A leak severe enough to crash in
  4–5 days has almost certainly been present for many years.
- **PDB → BinDiff path is mature.** `BinDiff` and `Diaphora` routinely
  achieve 80–95% function-match rates across related VC++ binaries.
  Once you identify the leaking function in 2013 (with name), porting
  the symbol forward to EoR is signature-scan-able.

### What you're betting on, and the fallback

- **Primary bet:** the leak repros on the 2013 client. UMDH on the
  2013 client + activity bot reveals it within hours-to-days. You
  identify a named function, hand the name to the host for BinDiff,
  receive the EoR signature back, build the patch DLL, validate.
- **Fallback:** the leak does NOT repro on 2013, i.e. it was
  introduced after Sept 2013. In that case, you fall back to hunting
  on the EoR client without symbols, using BinDiff-transferred names
  for whatever functions match the 2013 codebase. This is slower but
  still feasible. The primary-vs-fallback determination is **Phase 1
  Decision Gate** below.

---

## 3. Package contents

```
leak-hunt-vm-2026-05-12/
├── README.md                       ← you are here
├── MANIFEST.md                     ← list of out-of-repo files copied in
├── CLAUDE.md                       ← VM-side project rules (persistent)
├── templates/
│   ├── supervisor.ps1              ← skeleton: start ACE, start client, snapshot loop
│   ├── snapshot.ps1                ← UMDH single-shot
│   ├── activity-phases.json        ← phase schedule template
│   ├── login.ahk                   ← AutoHotkey login skeleton
│   └── trace.cdb                   ← cdb scripting template
├── tools/
│   ├── check_exe_pdb.py            ← verify binary ↔ PDB GUID match
│   ├── dump_pdb_info.py            ← PDB metadata
│   └── pdb_extract.py              ← regenerate symbols.json if needed
├── pdb/
│   └── acclient.pdb                ← (29 MB, copied per MANIFEST)
└── references/
    ├── symbols.json                ← 18,366 named functions + addresses (grep-friendly)
    ├── types.json                  ← 5,371 struct/class type definitions
    ├── acclient.h                  ← verbatim retail header structs
    └── acclient_2013_pseudo_c.txt  ← 64 MB symbolized Binary Ninja pseudo-C
```

The Python tools are stdlib-only (no pip). Everything else is data.

---

## 4. What you need on the VM (one-time, before starting)

If any of these is missing, **ask the user before guessing**.

| Component | Where | Notes |
|---|---|---|
| Retail AC client (2013 v11.4186) | `C:\Turbine\Asheron's Call\` | Standard install path. Verify match with `check_exe_pdb.py` before any other work. The `_NT_SYMBOL_PATH` must include `pdb/`. |
| Retail AC dat files | inside the install | `client_portal.dat`, `client_cell_1.dat`, `client_highres.dat`, `client_local_English.dat` |
| ACE server | `127.0.0.1:9000` on VM | Use ACEmulator from github.com/ACEmulator/ACE. Same config as user's dev box. Confirm it accepts logins before continuing. |
| Test character | on the VM's ACE | Suggested name: `+Leakhunt`. GM-marker `+` so debug commands are available. |
| Windows Debugging Tools | Microsoft Store WinDbg or Win10/11 SDK | Need `cdb.exe`, `umdh.exe`, `gflags.exe`. Use the 32-bit (`x86`) versions, since `acclient.exe` is 32-bit. |
| AutoHotkey v2 | autohotkey.com | For login automation. v2 only: templates assume v2 syntax. |
| Sysinternals `procdump` | sysinternals.com | Crash-dump capture. |
| MinHook (optional, for patch DLL) | github.com/TsudaKageyu/minhook | Only needed at Phase 8. Defer. |
| Shared folder or mounted drive | `Z:\` or similar | For passing snapshots back to host. Configure at VM-setup time. |

---

## 5. Configuration questions to ask the user at session start

**Ask these first, before running anything.** They materially affect
the harness.

1. **Where is ACE running**: same VM (recommended; snapshot-clean) or
   on the host with VM networking through to it? Default assumption:
   same VM.
2. **What's the AC install path** if it's not the standard
   `C:\Turbine\Asheron's Call\`?
3. **Output flow**: shared folder path? Or push artifacts to a git
   branch (e.g. `leak-hunt-vm/2026-05-12`)? Default: shared folder
   to `Z:\leak-hunt\` on host.
4. **Test character name** on the VM ACE? Default: `+Leakhunt`.
5. **VM specs**: RAM and core count? (Affects whether to enable
   gflags+UST from the start, which costs ~20–30% perf.)
6. **EoR binary location on host**: confirm the user has it at
   `C:\Users\erikn\source\repos\acdream\refs\acclient-eor-2024-09-11.bndb`
   (Binary Ninja db). This isn't needed on the VM but is critical for
   Phase 7 BinDiff on the host.
7. **Wake-up cadence preference**: do they want you to use
   `ScheduleWakeup` for hours-long gaps, or stay continuously
   active? Default: ScheduleWakeup for any gap > 30 min.

Save the user's answers as memory entries before proceeding past
Phase 0 so a future session can pick up cold.

---

## 6. Phased plan

Each phase has a **goal**, **commands**, **decision gate**, and
**estimated time**. Don't skip ahead. Don't run multiple phases in
parallel until Phase 4.

### Phase 0: Verify the bench (target: 30 min)

**Goal:** prove the environment can launch AC, log in, and observe
memory.

1. `py tools/check_exe_pdb.py "C:\Turbine\Asheron's Call\acclient.exe"`.
   Expect: `=== MATCH: this exe pairs with our acclient.pdb ===`.
   If MISMATCH → stop, ask the user which build they installed.
2. `py tools/dump_pdb_info.py pdb/acclient.pdb`. Confirm GUID
   `9e847e2f-777c-4bd9-886c-22256bb87f32`, age 1.
3. Start ACE locally (`dotnet run` in the ACE checkout, or
   `ACE.exe` if pre-built). Confirm it listens on `127.0.0.1:9000`.
4. Manually launch AC, log in with the test character, walk one
   step, log out. **This proves the bench works before you add
   instrumentation.**
5. Take a clean Hyper-V / VMware snapshot named `bench-verified`.
   The supervisor will revert to this before each run.

**Decision gate:** can you launch, log in, walk, log out, clean?
If no, fix this before anything else. If yes, proceed.

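
For background on what an exe ↔ PDB match means: a PE built with a PDB embeds a CodeView `RSDS` record holding the PDB's GUID, age, and file name, and a debugger pairs the two by comparing those values. The sketch below reads such a record from raw bytes. It is not a reproduction of `check_exe_pdb.py`, whose implementation is not shown here; the function name `find_rsds` is hypothetical, and a naive search for the `RSDS` tag is a simplification (a robust tool would walk the PE debug directory instead).

```python
import struct
import uuid


def find_rsds(data: bytes):
    """Return (guid, age, pdb_name) from the first CodeView RSDS record, or None.

    Layout: b"RSDS", 16-byte GUID (mixed-endian), 4-byte little-endian age,
    then a NUL-terminated PDB path.
    """
    pos = data.find(b"RSDS")
    if pos < 0 or pos + 24 > len(data):
        return None
    guid = uuid.UUID(bytes_le=data[pos + 4:pos + 20])
    (age,) = struct.unpack_from("<I", data, pos + 20)
    end = data.find(b"\x00", pos + 24)
    if end < 0:
        end = len(data)
    name = data[pos + 24:end].decode("ascii", "replace")
    return guid, age, name


# Synthetic record built from the GUID and age quoted in step 2 above.
expected = uuid.UUID("9e847e2f-777c-4bd9-886c-22256bb87f32")
blob = (b"\x00" * 8 + b"RSDS" + expected.bytes_le
        + struct.pack("<I", 1) + b"acclient.pdb\x00")
guid, age, name = find_rsds(blob)
```
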

---

### Phase 1: Idle baseline + decide hunt platform (target: 4 hours)

**Goal:** does the leak reproduce on the 2013 client when the player
sits at the lifestone doing nothing? If yes, primary plan; if no,
Phase 2 will find the right activity profile.

1. Enable the user-mode stack trace database (UST) that UMDH reads:
   `gflags /i acclient.exe +ust`. This is registry-set; survives
   reboots. (Disable later with `gflags /i acclient.exe -ust`.)
2. Set `_NT_SYMBOL_PATH=<vm-path-to-pdb-dir>`.
3. Launch AC via `templates/supervisor.ps1` (which sets env and
   spawns the process). Log in manually OR via `templates/login.ahk`.
4. Walk to a quiet spot (lifestone interior, away from spawn). Sit.
5. `templates/snapshot.ps1 -ProcessId <pid> -Out snap_001.txt`
   immediately. Wait 30 min. Take `snap_002`. Repeat for at least
   4 hours (9 snapshots, counting the initial one).
6. `umdh -d snap_001.txt snap_009.txt -f:diff_idle.txt`. Read top 20
   growing stacks. Save the diff to `Z:\leak-hunt\phase1\`.

**Decision gate:**
- **Total committed memory grew >50 MB over 4h idle?** Leak repros at
  idle. Skip Phase 2, jump to Phase 4 (long-soak idle).
- **Total committed grew 5–50 MB?** Leak may need amplification.
  Proceed to Phase 2.
- **Total committed grew <5 MB?** Leak is activity-specific or
  doesn't exist on 2013. Proceed to Phase 2.
- **Memory dropped or oscillates around 0?** No leak signal at idle.
  Phase 2 is where you'll find it (or won't).

Record the baseline growth-rate number in memory:
`leak_hunt_phase1_baseline_mb_per_hour`.

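
The decision gate and the baseline number above can be computed mechanically. The sketch below is one way to do it, under stated assumptions: committed-memory samples are collected separately as `(hours, MB)` pairs, the growth rate is taken as a least-squares slope (the brief does not prescribe a method), and the function names and sample values are hypothetical.

```python
def growth_mb_per_hour(samples):
    """Least-squares slope of (hours_elapsed, committed_mb) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def phase1_gate(total_growth_mb):
    """Map total committed-memory growth over the idle run to (next phase, reading)."""
    if total_growth_mb > 50:
        return "phase4", "leak repros at idle"
    if total_growth_mb >= 5:
        return "phase2", "leak may need amplification"
    if total_growth_mb > 0:
        return "phase2", "activity-specific or absent on 2013"
    return "phase2", "no leak signal at idle"


# Illustrative samples only: nine snapshots, 30 min apart, growing 16 MB/h.
samples = [(0.5 * i, 300.0 + 8.0 * i) for i in range(9)]
rate = growth_mb_per_hour(samples)
total = samples[-1][1] - samples[0][1]
decision = phase1_gate(total)
```

A fitted slope is less sensitive to a single noisy snapshot than a simple first-to-last difference, which matters when growth sits near the 5 MB boundary.
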

---

### Phase 2: Activity-phase characterization (target: 1–2 days)

**Goal:** find which player activity causes the leak. The bot is not
yet built, so you drive this manually with the activity-phase template
running as an AHK macro, or by playing 30-min phases yourself if no
bot is available.

The five canonical phases (see `templates/activity-phases.json`):

1. **idle**: stand at lifestone, no input
2. **wander**: walk a fixed route around Holtburg
3. **chat**: spam say/tell/global chat
4. **target-cycle**: Tab through nearby NPCs/mobs, no combat
5. **ui-cycle**: open/close inventory, character pane, spells

**Procedure per session:**
1. Start fresh from `bench-verified` snapshot.
2. Run a single phase for 1 hour with snapshots every 15 min.
3. `umdh -d` diff the first and last snapshots for that phase.
4. Record growth-rate per phase to memory.
5. Repeat for each phase, single VM, single phase per run.

(If user has authorized multiple parallel VMs, run different phases
simultaneously instead of sequentially.)

**Decision gate:** rank phases by growth rate. The top phase is your
target for Phase 4 amplification. Save ranking to memory.

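
The ranking step in the decision gate is simple enough to sketch directly. The growth figures below are invented for illustration, and `rank_phases` is a hypothetical helper name; only the five phase names come from the list above.

```python
def rank_phases(growth_mb_per_hour):
    """Return (phase, rate) pairs sorted from fastest-growing to slowest."""
    return sorted(growth_mb_per_hour.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-phase growth rates in MB/h, one value per 1-hour run.
measured = {
    "idle": 0.4,
    "wander": 11.2,
    "chat": 1.1,
    "target-cycle": 2.5,
    "ui-cycle": 0.9,
}
ranking = rank_phases(measured)
target_phase = ranking[0][0]  # the phase to amplify in Phase 4
```
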

---

### Phase 3: Controller DLL (target: 1–2 days)

**Goal:** build a small DLL that drives the leaking phase
deterministically and reproducibly, faster than a human can.

**Build approach:**
- C++ DLL, 32-bit, compiled against Visual Studio Build Tools.
- MinHook for function hooking.
- LoadLibrary'd into `acclient.exe` via a small launcher EXE
  (CreateProcess SUSPENDED → WriteProcessMemory of LoadLibrary
  trampoline → ResumeThread). Standard injection pattern.
- Hook a frame-loop function from `symbols.json`: search
  `references/symbols.json` for `CGameLoop`, `WorldFilter`, `Tick`,
  `ProcessFrame` and pick the highest-frequency one with a stable
  signature.
- Call retail functions directly via PDB-resolved addresses. Examples:
  `CPhysicsObj::set_velocity`, `CChatManager::SendSay`,
  `CPlayerSystem::SelectTarget`. These take a `this` pointer in `ecx`
  (thiscall), so you'll need a small asm trampoline or
  `__thiscall` calling-convention helpers.

**The bot's job:**
- Drive the top-ranked Phase 2 activity continuously.
- Emit a heartbeat to a log file every 30s so the supervisor can
  detect wedging.
- Run a self-position watchdog with auto-restart: if
  `CPhysicsObj::position` hasn't changed in 5 min during a movement
  phase, signal the supervisor to revert and retry.

**Reuse opportunity:** the user maintains MosswartMassacre and
MosswartOverlord, both AC client DLL-injection projects. **Ask
the user for read access** before designing from scratch; they may
have a working injector + MinHook scaffolding you can port from in
hours rather than days. Do not assume; ask.

**Decision gate:** bot runs the leaking phase for 1 hour
unattended, emits heartbeats, produces measurable UMDH growth.

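
On the supervisor side, wedge detection from the 30-second heartbeat can be as small as a file-age check. The sketch below assumes the bot touches a single heartbeat file and that three missed beats (90 s) counts as wedged; both the threshold and the name `is_wedged` are assumptions for illustration, not something the brief specifies.

```python
import os
import tempfile
from pathlib import Path


def is_wedged(heartbeat: Path, now: float, max_age_s: float = 90.0) -> bool:
    """True if the heartbeat file is missing or was last written too long ago."""
    try:
        age = now - heartbeat.stat().st_mtime
    except FileNotFoundError:
        return True  # the bot never started, or the log was removed
    return age > max_age_s


with tempfile.TemporaryDirectory() as tmp:
    hb = Path(tmp) / "heartbeat.log"
    hb.write_text("alive\n")
    os.utime(hb, (1_000_000.0, 1_000_000.0))       # pin mtime for a repeatable demo
    fresh = is_wedged(hb, now=1_000_030.0)          # one beat old
    stale = is_wedged(hb, now=1_000_300.0)          # five minutes old
    missing = is_wedged(Path(tmp) / "none.log", now=1_000_000.0)
```

Passing `now` in explicitly keeps the check testable; a real supervisor loop would supply `time.time()`.
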

---

### Phase 4: Long-soak with amplification (target: 12–48 hours)

**Goal:** generate a clean signal in which one or two leaking call
stacks visibly dominate the UMDH diff.

1. Revert VM to `bench-verified`.
2. Launch via `supervisor.ps1` with the controller DLL injected.
3. Snapshot every 15 min for 12+ hours.
4. `umdh -d` snap_001 vs snap_N every couple of hours during your
   active turns. Between active turns, use `ScheduleWakeup` with
   delay 1800–3600s and the reason `"long-soak snapshot check"`.

**Decision gate:** UMDH diff shows one or more call stacks with
monotonic growth across all adjacent-pair diffs, dominating the
total by ≥10× over the next-highest. That's your leak candidate(s).

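
The gate above combines two tests: growth in every adjacent snapshot pair, and a ≥10× margin over the next stack down. One possible reading is sketched below. It assumes per-stack byte totals have already been extracted from each snapshot into a dictionary (that extraction is not shown), and it interprets "dominating" as the smallest leading group of monotonic stacks whose weakest member still beats the next stack by the ratio. The function name and data are hypothetical.

```python
def leak_candidates(series, ratio=10.0):
    """series maps stack id -> [bytes at snapshot 0, 1, 2, ...].

    Returns the leading stacks that grow in every adjacent pair and whose
    total growth is at least `ratio` times that of the next stack down.
    Returns [] when no such clear separation exists.
    """
    totals = {s: v[-1] - v[0] for s, v in series.items()}
    monotonic = {s for s, v in series.items()
                 if all(b > a for a, b in zip(v, v[1:]))}
    ranked = sorted(totals, key=totals.get, reverse=True)
    for k in range(1, len(ranked)):
        if ranked[k - 1] not in monotonic:
            break  # a non-monotonic stack cannot be part of the candidate set
        runner_up = max(totals[ranked[k]], 1)
        if totals[ranked[k - 1]] >= ratio * runner_up:
            return ranked[:k]
    return []


# Invented byte counts across four snapshots.
clear = {
    "stack_A": [100, 5100, 10100, 15100],   # steady growth
    "stack_B": [50, 80, 60, 90],            # noise
    "stack_C": [10, 20, 30, 40],            # tiny growth
}
ambiguous = {
    "stack_A": [0, 100, 200],
    "stack_B": [0, 60, 120],
    "stack_C": [0, 50, 100],
}
clear_result = leak_candidates(clear)
ambiguous_result = leak_candidates(ambiguous)
```

An empty result corresponds to the "genuinely ambiguous" case in §8, where the brief says to stop and ask rather than pick a stack.
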

---

### Phase 5: Identify the leaking function (target: 2–4 hours)

**Goal:** convert the UMDH call stack into a named function we can
study and patch.

1. The top growing stack will look like:
   ```
   ntdll!RtlAllocateHeap+0x...
   acclient!operator new+0x...
   acclient!CFoo::AllocateBar+0x42
   acclient!CFoo::DoTheThing+0x18
   acclient!CGameLoop::Tick+0x...
   ```
2. The named function is `CFoo::AllocateBar`. Grep
   `references/acclient_2013_pseudo_c.txt` for `CFoo::AllocateBar`
   to read its body.
3. Identify the paired free function (`CFoo::ReleaseBar`,
   `~CFoo`, etc.) and confirm by reading both.
4. Find every call site of `CFoo::AllocateBar` (grep the pseudo-C
   for the function name) and verify each has a matching paired
   release. The one that doesn't is the bug.

**Decision gate:** you have (a) the leaking function name, (b) the
specific call site that doesn't free, (c) a hypothesis for the
patch (typically: add a `delete` or `Release()` on a specific code
path). Save these to memory + a write-up file.

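
Step 2 above picks the first `acclient` frame that is not an allocator wrapper. That selection can be sketched as follows. The frame format handled here is the simplified `module!symbol+0xNN` shape from the example stack; real UMDH output may carry extra fields, so the regex is an assumption, as are the helper name and the allocator list.

```python
import re

# Matches "module!symbol+0x1c"; the symbol may contain spaces and "::".
FRAME = re.compile(r"^\s*(?P<module>\w+)!(?P<symbol>.+?)\+(?:0x)?[0-9A-Fa-f]+\s*$")

# Frames that only forward the allocation and are never the interesting caller.
ALLOCATOR_SYMBOLS = {"RtlAllocateHeap", "operator new", "operator new[]", "malloc"}


def first_caller(stack_lines, module="acclient"):
    """Return the first symbol in `module` that is not an allocator wrapper."""
    for line in stack_lines:
        m = FRAME.match(line)
        if m is None:
            continue  # skip headers, blank lines, unparsed frames
        if m["module"] == module and m["symbol"] not in ALLOCATOR_SYMBOLS:
            return m["symbol"]
    return None


# Same shape as the example stack, with made-up offsets filled in.
stack = [
    "ntdll!RtlAllocateHeap+0x274",
    "acclient!operator new+0x1c",
    "acclient!CFoo::AllocateBar+0x42",
    "acclient!CFoo::DoTheThing+0x18",
    "acclient!CGameLoop::Tick+0x9a",
]
suspect = first_caller(stack)
```
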

---

### Phase 6: Cross-reference with retail debugger trace (optional, target: 2 hours)

**Goal:** confirm the leak path is actually hit at runtime in a real
play scenario, not just statically possible.

This step is optional but recommended if the leak path is conditional
(e.g. "only when the chat buffer wraps"). Use the cdb workflow
documented in `templates/trace.cdb` and the retail-debugger section
in `CLAUDE.md`. Attach to acclient.exe, breakpoint on the alloc
function with a non-blocking action (`r $t0=@$t0+1; gc`), let it
accumulate for 30 min, count hits, correlate with UMDH growth bytes.

---

### Phase 7: BinDiff to EoR (**HOST MACHINE, not VM**)

**Goal:** produce an EoR-binary signature and offset for the leaking
function.

This phase does not happen on the VM. The Binary Ninja databases
live on the host (`refs/acclient-eor-2024-09-11.bndb`,
`refs/acclient_2013-2024-09-11.bndb`).

**You (VM Claude) do:**
1. Write a structured handoff file
   `Z:\leak-hunt\phase7-handoff.md` containing: the function name,
   its 2013 RVA, the paired release function name, the suspected
   missing-free call site, the call-graph context, and a 32–48 byte
   AOB signature with wildcards over relocatable operands (cite
   the byte sequence from the pseudo-C/disassembly).
2. Notify the user that Phase 7 is ready.

**The user (or a Claude session on the host) does:**

3. Load both BNDBs in Binary Ninja, run BinDiff (or use BN's native
   diff). Locate the matching function in EoR.
4. Verify the AOB signature still matches in EoR (small mods are
   OK; adjust wildcards as needed).
5. Write back to `Z:\leak-hunt\phase7-result.md`: EoR RVA,
   confirmed signature, any structural differences worth knowing.

You (VM Claude) resume once that file appears.

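
An AOB ("array of bytes") signature with wildcards is matched by comparing fixed bytes and skipping the wildcard positions, which is where relocatable operands such as call targets sit. The sketch below is a minimal scanner over an in-memory byte string. The `??` wildcard notation, the function names, and the sample bytes are assumptions for illustration; the handoff file may use a different notation.

```python
def parse_aob(pattern):
    """'8B 4E ?? E8' -> [0x8B, 0x4E, None, 0xE8]; None marks a wildcard byte."""
    return [None if tok in ("?", "??") else int(tok, 16) for tok in pattern.split()]


def find_aob(image, pattern):
    """Return every offset in `image` where the signature matches."""
    sig = parse_aob(pattern)
    hits = []
    for i in range(len(image) - len(sig) + 1):
        if all(b is None or image[i + j] == b for j, b in enumerate(sig)):
            hits.append(i)
    return hits


# Made-up code bytes: two similar call sequences with different operands.
image = bytes.fromhex("90 8b 4e 10 e8 11 22 33 44 c3 8b 4e 20 e8 55 66 77 88")

loose = find_aob(image, "8B 4E ?? E8 ?? ?? ?? ??")   # too many wildcards
tight = find_aob(image, "8B 4E 10 E8 ?? ?? ?? ??")   # pins the stack offset
```

A usable signature should match exactly once in the target binary; the loose pattern here matches twice, which is the signal to lengthen it or wildcard fewer bytes.
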

---

### Phase 8: Patch DLL (target: 1 day)

**Goal:** ship a DLL that, when loaded into `acclient.exe` (any
build that matches the signature), plugs the leak.

- Same scaffold as the controller DLL: MinHook + a launcher EXE.
- Hook the leaking function (or its caller, whichever is cleanest).
- Wrap the existing logic so the missing free is performed on the
  bug path.
- Provide a versioned filename and a small README so it's clear
  which client build it targets.
- **Verify on the 2013 client first**: same UMDH soak, expect the
  top growing stack to vanish.

---

### Phase 9: Multi-day soak validation (target: 5+ days)

**Goal:** prove the patch fixes the production crash.

1. Install the patch DLL injector into the EoR client setup.
2. Launch under the supervisor with snapshots every hour (lower
   frequency, since we're not hunting now, just confirming).
3. Run a controlled activity profile (the bot's full rotation) for
   5+ days continuous.
4. **Pass:** no OOM crash, committed memory stable or decreasing
   slope. **Fail:** crash before 5 days → back to Phase 5 to find
   the second leak.

---

### Phase 10: Ship

1. Write up findings:
   - The leak's root cause (one paragraph)
   - The patch's mechanism (one paragraph)
   - The 2013-vs-EoR signature note
   - Validation evidence (UMDH diffs, soak duration, growth-rate
     plot if you have one)
2. Save the writeup to `Z:\leak-hunt\REPORT.md` and to memory.
3. Notify the user. Stop the loop. Done.

---

## 7. Wake-up protocol

Use `ScheduleWakeup` to self-pace between phases. **Default cadence
table:**

| Situation | Delay | Why |
|---|---|---|
| Active analysis (reading UMDH diffs, writing code) | none (stay engaged) | Full-context work |
| Between snapshots inside a soak (Phase 1/4/9) | 1500–1800 s | Cache stays warm-ish, snapshots accumulate |
| Overnight gap (≥6 h) | 3600 s and chain | One cache miss is cheap vs. burning per-hour |
| Waiting for user (Phase 7 handoff) | 3600 s | Poll for the result file |

Pass the same loop prompt each turn. The `reason` field should
identify the phase and what you'll check (e.g. `"phase 4 snapshot
check: read snap_012 and diff against snap_001"`).

---

## 8. When to stop and ask the user

- **Phase 0 verification fails** (PDB mismatch, login fails, ACE
  not reachable). Don't guess at fixes.
- **The bot wedges and auto-recovery fails twice in a row.**
- **You're about to expand scope** (refactor the supervisor into a
  framework, build a UI for snapshot review, port code into
  acdream's tree). Stop and ask. Default answer is no.
- **You hit a decision gate where the data is genuinely ambiguous**
  (e.g. growth rate is moderate but no single stack dominates).
- **Phase 5 produces a function name that isn't in symbols.json.**
  Probably means an indirect call or vtable dispatch; ask before
  spending hours decoding it.
- **The patch in Phase 9 doesn't validate.** Don't iterate
  indefinitely; surface findings and re-plan.

---

## 9. Memory protocol

Save findings as memory entries so a session that wakes 8 hours later
can resume cold. Specifically:

- `project_leak_hunt.md`: top-level project context, current phase,
  open questions
- `leak_hunt_phase_N.md`: per-phase findings, growth rates, decisions
- `leak_hunt_candidate_<funcname>.md`: once a function is suspected,
  everything you know about it
- `feedback_leak_hunt_<topic>.md`: if the user gives operational
  feedback during this hunt, record it

Update the `MEMORY.md` index entry for each. Keep entries short and
factual; long writeups go in `Z:\leak-hunt\` files referenced from
memory.

---

## 10. Hard rules (do not violate)

1. **Don't run anything that touches the user's host machine.** The
   VM is isolated for a reason. All output goes through the shared
   folder.
2. **Don't disable gflags+UST mid-run.** If you need to disable,
   stop the supervisor, disable, take a fresh baseline.
3. **Don't modify `acclient.exe` on disk.** All patches are
   runtime DLL hooks. If you ever feel tempted to binary-patch the
   exe directly, ask the user first.
4. **Don't auto-update `MEMORY.md`** without first saving the
   underlying memory file. The index must point at real files.
5. **Don't claim a leak is found** without the evidence checklist:
   - ≥3 consecutive UMDH diffs showing the same top stack growing
   - The stack is attributed to a named function in `symbols.json`
   - The call site is identified in `acclient_2013_pseudo_c.txt`
   - The hypothesis for the missing free is stated
6. **Don't proceed to Phase 8 without Phase 7 handoff complete.**
   The patch must target the EoR signature, not the 2013 RVA.

---

## 11. First action when you start a fresh session

```
1. Read README.md (this file) end-to-end.
2. Read CLAUDE.md (project rules, concise).
3. Run the configuration questions in §5 by the user.
4. Save their answers as memory.
5. Begin Phase 0.
```

Good hunting.