Initial commit — leak-hunt project complete
Five bugs identified and patched in retail Asheron's Call client:

- v3b: palette refcount over-increment (3-byte NOP at two sites)
- v5: RenderSurface PurgeResource no-op stub (vtable slot 2 thunk)
- v11: two dangling-pointer crash guards (NULL-check + reorder)
- v14: CEnvCell::Destroy ClipPlaneList leak (18-byte JMP to cleanup thunk)
- v22: unpacker stale-pointer SEH guard (whole-function __try/__except)

All five ship in leakfix.dll (117 KB, SHA d282f23c…) which is loaded by
acclient.exe at process start via PE import table patching by
tools/install_leakfix.py.

Controlled 15-client fleet soak: unpatched control died at 26h with
palette exhaustion; all 14 patched clients survived past that point and
reached ≥5-day uptime.

Residual ~15 MB/h growth traced to d3d9.dll's internal slab allocator
(260KB surface backing buffers retained after Release). See REPORT.md §10
for the full investigation; conclusion is that it's unfixable from
outside d3d9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 57b5e43d0e
199 changed files with 1648333 additions and 0 deletions
dll/leakfix/src/sweep_design.md
# Iter 4 — CPhysicsObj sweep design (DRAFT, NOT YET IMPLEMENTED)

## Goal

Periodically destroy abandoned CPhysicsObj instances to recover the
residual leak documented in §6.1 of REPORT.md. **Highest-risk patch
class** (physics-state mutation, same risk profile as v13, which
killed Larsson at 98 min). A long soak after every change is mandatory.

## What iter 3 told us

After 13 minutes on Unkle Leo (PID 16044), a typical scan shows:

```
total=971 no_parent=546 no_cell=278 orphan_hash=697 both=234 triple=111
```

So ~11% of all CPhysicsObj instances (111 of 971) pass the strict triple
predicate. On a fresh client the triple count is ~100 (startup residual).
Growth is +1-2 candidates per minute during normal play.

Strict-candidate sample dumps confirm:
- `parent`, `cell`, `hash_next` all NULL ✓
- `part_array` non-NULL (heap allocation that should be freed)
- `shadow_objects.data` non-NULL (heap allocation that should be freed)
- `state` has small bits set (e.g., 0x00000414, normal active flags)

This matches the v17 owner-vtable diagnostic's "abandoned but heap state
still allocated" pattern.
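
The ~11% figure can be reproduced from the scan line itself. The following is a minimal sketch, assuming the summary always uses the space-separated `key=value` form shown above. `parse_scan_line` and `triple_fraction` are illustrative names, not functions that exist in leakfix.dll.

```cpp
#include <exception>
#include <map>
#include <sstream>
#include <string>

// Parse a scan summary line of the form "key=value key=value ..." into
// named counters. Tokens without '=' or with a non-numeric value are skipped.
std::map<std::string, long> parse_scan_line(const std::string& line) {
    std::map<std::string, long> counters;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {
        const std::size_t eq = token.find('=');
        if (eq == std::string::npos || eq == 0) continue;
        try {
            counters[token.substr(0, eq)] = std::stol(token.substr(eq + 1));
        } catch (const std::exception&) {
            // Malformed value: ignore this token and keep going.
        }
    }
    return counters;
}

// Fraction of all scanned instances that pass the strict triple predicate.
// Returns 0.0 when either counter is missing or total is not positive.
double triple_fraction(const std::map<std::string, long>& counters) {
    const auto total = counters.find("total");
    const auto triple = counters.find("triple");
    if (total == counters.end() || triple == counters.end()) return 0.0;
    if (total->second <= 0) return 0.0;
    return static_cast<double>(triple->second) /
           static_cast<double>(total->second);
}
```

Logging this fraction next to the raw counts on every scan would make drift in the candidate share visible without recomputing it by hand.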

## Candidate destruction call

The engine already has correct teardown:

```c
// EoR 0x005145D0: CPhysicsObj::Destroy
void __thiscall CPhysicsObj::Destroy(CPhysicsObj* this);
```

Per the v17 owner-diag, `CPhysicsObj::Destroy` correctly tears down
all owned heap state (`CPartArray::DestroyParts`, etc.). The leak is
that it's never **called** on these abandoned objects.

After Destroy, the CPhysicsObj itself (~408 bytes) still needs to be
freed via `operator delete`.

## Predicate hardening (BEFORE we destroy anything)

The triple predicate may not be conservative enough. Additional
checks before destroy:

1. **`update_time` is stale**: the field at +0xD4 is an 8-byte double
   (timestamp). If it is less than `now() - 60s`, the object hasn't been
   touched in a minute. Compare against `timeGetTime()` or a similar
   engine time global; which clock the engine writes into `update_time`
   is still unknown.
2. **`state` is not "currently active"**: need to identify which
   bits indicate "being processed." For now, skip if state has any
   of the high 16 bits set (mask 0xFFFF0000).
3. **`weenie_obj == NULL`**: at +0x?? (need to verify offset).
   If a weenie-object still owns this physobj, the engine considers
   it alive even if other tracking is gone.
4. **`movement_manager == NULL`**: at +0xC4 per acclient.h
   (LongHashData base 12 + ... + 0xB8 should be it). If there's an
   active mover, the object is in flight.
5. **`hooks == NULL`**: at +0xE? (offset unverified). A non-NULL value
   means animation hooks are pending.

The candidate must pass ALL of these AND the iter-3 triple predicate,
which makes the check stricter than iter 3.

## Safety protocol

1. **Throttle:** max 1 destruction per scan cycle (5 min). Even if 100
   candidates qualify, destroy ONE per scan. Surface latent bugs slowly.
2. **Sample-first:** for the first 2 hours, LOG candidate addresses
   but do NOT destroy. Verify the candidates stay candidates over
   multiple scans (i.e., they're not transient).
3. **Forensic logging:** before each destruction, log the address and a
   pre-destroy field dump. If the process crashes afterwards, we have
   the last destroyed object for forensics.
4. **Kill switch:** check the `LEAKFIX_NO_SWEEP=1` env var at scan start.
   If set, skip destruction. Once the code lands, the sweep defaults to
   ON (destroy) when the variable is unset.
5. **Initial test target:** Unkle Leo (current designated guinea pig
   per CLAUDE.md). One client only. 4-hour soak before declaring safe.
6. **Failure recovery:** if Unkle Leo crashes within 1 hour of the
   destruction logic being enabled, set the env var, restart with sweep
   disabled, mark iter-4 as failed in memory, and do not retry without
   a redesign.
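
The kill switch and the throttle are both small enough to sketch in portable C++. This is one possible shape under stated assumptions: only the exact value `1` is treated as "set", and the caller supplies a millisecond clock. `kill_switch_set` and `SweepThrottle` are hypothetical names introduced here; only `LEAKFIX_NO_SWEEP` comes from the protocol above.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Pure check so the rule can be tested without touching the environment.
// Assumption: only the exact string "1" disables the sweep.
bool kill_switch_set(const char* value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Read the switch at scan start, as the protocol requires.
bool env_skip_sweep() {
    return kill_switch_set(std::getenv("LEAKFIX_NO_SWEEP"));
}

// Allows at most one destruction per interval. The clock is passed in by
// the caller, so the same class covers "1 per scan" and "1 per hour".
class SweepThrottle {
public:
    explicit SweepThrottle(std::uint64_t min_interval_ms)
        : min_interval_ms_(min_interval_ms) {}

    // Returns true and records now_ms if a destruction is allowed.
    bool try_acquire(std::uint64_t now_ms) {
        if (has_last_ && now_ms - last_ms_ < min_interval_ms_) return false;
        last_ms_ = now_ms;
        has_last_ = true;
        return true;
    }

private:
    std::uint64_t min_interval_ms_;
    std::uint64_t last_ms_ = 0;
    bool has_last_ = false;
};
```

Keeping the clock as a parameter means the throttle can be exercised in a unit test with synthetic timestamps before it ever runs inside the client.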

## Implementation outline (when ready)

```cpp
#include <windows.h>   // EXCEPTION_EXECUTE_HANDLER for __try/__except
#include <stddef.h>    // offsetof
#include <stdint.h>

// 32-bit client layout. pack(4) keeps the 8-byte update_time at +0xD4,
// which is not 8-byte aligned; default packing would move it to +0xD8.
#pragma pack(push, 4)
struct CPhysicsObj {
    void*    vtable;            // +0x00
    void*    hash_next;         // +0x04
    uint32_t id;                // +0x08
    void*    netblob_list;      // +0x0C
    void*    part_array;        // +0x10
    char     pad_14[0x14];      // +0x14: 20 bytes of player_vector/distance/CYpt
    void*    sound_table;       // +0x28
    uint32_t pad_exam;          // +0x2C
    void*    script_manager;    // +0x30
    void*    physics_script;    // +0x34
    uint32_t default_script;    // +0x38
    float    script_intensity;  // +0x3C
    void*    parent;            // +0x40
    void*    children;          // +0x44
    char     position[72];      // +0x48
    void*    cell;              // +0x90
    uint32_t num_shadow;        // +0x94
    char     shadow_arr[16];    // +0x98: DArray
    uint32_t state;             // +0xA8
    uint32_t transient_state;   // +0xAC
    char     pad_b0[0x14];      // +0xB0: 20 bytes of floats
    void*    movement_manager;  // +0xC4
    void*    position_manager;  // +0xC8
    int      last_move_auto;    // +0xCC
    int      jumped_frame;      // +0xD0
    double   update_time;       // +0xD4 (8 bytes)
    // ... remaining fields not mapped
    void*    weenie_obj;        // +0x?? TBD: placeholder position only, real offset unverified
};
#pragma pack(pop)

// Catch layout drift at compile time (32-bit build only).
static_assert(offsetof(CPhysicsObj, parent) == 0x40, "parent offset");
static_assert(offsetof(CPhysicsObj, cell) == 0x90, "cell offset");
static_assert(offsetof(CPhysicsObj, update_time) == 0xD4, "update_time offset");

// Zero-argument __thiscall emulated as __fastcall: ECX = this, EDX = unused.
typedef void (__fastcall *destroy_fn_t)(CPhysicsObj* self, void* edx);
// Global operator delete is a plain __cdecl function: pointer on the stack.
typedef void (__cdecl *delete_fn_t)(void* p);

// Not constexpr: an integer-to-pointer cast is not a constant expression.
static const destroy_fn_t CPHYSICSOBJ_DESTROY = (destroy_fn_t)0x005145D0;
static const delete_fn_t  OP_DELETE           = (delete_fn_t)0x005DF15E;

// leakfix helpers assumed to exist elsewhere in the DLL. get_engine_time()
// and the first/next enumeration pair are hypothetical until the engine
// time global is found and the iter-3 scan walk is wired in.
bool env_skip_sweep();
void logf(const char* fmt, ...);
void dump_physobj(uintptr_t addr);
double get_engine_time();                  // need to find this, e.g. 0x????
CPhysicsObj* first_physobj();
CPhysicsObj* next_physobj(CPhysicsObj* p);

bool is_truly_abandoned(CPhysicsObj* p) {
    if (p->parent) return false;
    if (p->cell) return false;
    if (p->hash_next) return false;
    if (p->movement_manager) return false;
    // state mask: bits 0..15 are flags we tolerate; high bits suggest
    // active processing
    if ((p->state & 0xFFFF0000) != 0) return false;
    if (p->weenie_obj) return false;  // need offset verified
    // update_time stale check
    double now = get_engine_time();
    if (now - p->update_time < 60.0) return false;
    return true;
}

void sweep_once() {
    if (env_skip_sweep()) return;
    // Walk all CPhysicsObj instances and stop at the first hit.
    CPhysicsObj* victim = nullptr;
    for (CPhysicsObj* p = first_physobj(); p; p = next_physobj(p)) {
        if (is_truly_abandoned(p)) { victim = p; break; }  // ONLY ONE
    }
    if (!victim) return;

    logf("SWEEP destroying CPhysicsObj @ 0x%p (state=0x%08x)", victim, victim->state);
    dump_physobj((uintptr_t)victim);  // pre-destroy forensics
    __try {
        CPHYSICSOBJ_DESTROY(victim, nullptr);
        OP_DELETE(victim);
        logf("SWEEP ok");
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        logf("SWEEP exception, abandoning sweep this scan");
    }
}
```
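
`dump_physobj` is called in the outline but not defined there. One possible shape for its formatting half is sketched below. `hex_dump` is a hypothetical helper, and the row format (16 bytes per row, each row prefixed by its offset) is an assumption rather than the format leakfix.dll actually logs. Inside the client, the read would also need a guard, since a candidate address may no longer be readable; that guard is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Format len bytes starting at data as rows of 16 hex bytes. Each row is
// prefixed with its offset from data, e.g. "+0x0a0:", so fields can be
// matched against the struct offsets by eye.
std::string hex_dump(const void* data, std::size_t len) {
    const std::uint8_t* bytes = static_cast<const std::uint8_t*>(data);
    std::string out;
    char buf[32];
    for (std::size_t i = 0; i < len; ++i) {
        if (i % 16 == 0) {
            // Start a new row; every row except the first begins on a new line.
            std::snprintf(buf, sizeof buf, "%s+0x%03zx:", i ? "\n" : "", i);
            out += buf;
        }
        std::snprintf(buf, sizeof buf, " %02x", static_cast<unsigned>(bytes[i]));
        out += buf;
    }
    return out;
}
```

Returning a string rather than writing to the log directly keeps the formatter testable off-client and lets the caller emit the whole dump in a single log call.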

## Known unknowns to resolve before coding

1. **Engine time global address**: for the stale-`update_time` check
2. **`weenie_obj` offset**: need to read acclient.h carefully or sample dumps
3. **State-bit meanings**: which bits indicate "in active processing"
4. **Does `operator delete` of a CPhysicsObj that already had Destroy() called work?**
   Destroy probably tears down state but may not free `this`.
5. **What if the object is mid-iteration in some other code?**
   Destroying it would leave dangling iterators. Need to check that the
   render loop / update loop doesn't hold outstanding refs.

These are NOT minor: getting any one of them wrong means a v13-class crash.

## Recommended path

1. **Iter 4a (logging-only):** add the harder predicates (`movement_manager`,
   `weenie_obj`, `update_time` stale, state mask). Log candidate count
   passing the harder set. Compare to iter-3 triple count. If much
   smaller, predicates are stricter and we have higher confidence.
2. **Iter 4b (sample-first):** dump 3 candidates that pass the hard
   set every scan. Verify they look genuinely abandoned across multiple
   scans.
3. **Iter 4c (destroy 1 per hour, not per scan):** initial mutation
   test at the slowest possible rate. Soak 8h+ before declaring safe.
4. **Iter 4d (destroy N per scan, where N = current candidate count):**
   only after 4c passes 24h soak.

This is a 3-day minimum process if everything goes right. If a v13-class
crash happens anywhere, restart from 4a with a redesigned predicate.
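
The "stay candidates across multiple scans" check in 4b can be made mechanical. The following is a sketch, assuming each scan yields a list of candidate addresses. `CandidateTracker` is a hypothetical class, not existing leakfix code. One caveat it does not handle: if the engine frees an object and reuses the same address for a new one, the streak will overcount.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Tracks how many consecutive scans each address has been a candidate.
class CandidateTracker {
public:
    // Record one scan. Addresses missing from this scan are dropped, so a
    // candidate that disappears and comes back restarts at 1.
    void observe_scan(const std::vector<std::uintptr_t>& candidates) {
        std::unordered_map<std::uintptr_t, unsigned> next;
        next.reserve(candidates.size());
        for (std::uintptr_t addr : candidates) {
            const auto it = streak_.find(addr);
            next[addr] = (it == streak_.end()) ? 1u : it->second + 1u;
        }
        streak_.swap(next);
    }

    // Consecutive-scan count for one address (0 if not currently a candidate).
    unsigned streak(std::uintptr_t addr) const {
        const auto it = streak_.find(addr);
        return it == streak_.end() ? 0u : it->second;
    }

    // Addresses that have been candidates for at least min_scans scans in
    // a row, sorted ascending for stable log output.
    std::vector<std::uintptr_t> stable(unsigned min_scans) const {
        std::vector<std::uintptr_t> out;
        for (const auto& entry : streak_) {
            if (entry.second >= min_scans) out.push_back(entry.first);
        }
        std::sort(out.begin(), out.end());
        return out;
    }

private:
    std::unordered_map<std::uintptr_t, unsigned> streak_;
};
```

With a 5-minute scan cycle, requiring a streak of, say, 3 before an address is even eligible for 4c would mean every victim has looked abandoned for at least 10 minutes; the threshold itself is a tuning choice, not something the soak data fixes.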

## Decision gate

Per the soak data on Unkle Leo:
- triple candidate growth: ~5 per 5 min = ~1/min
- After 1 hour without sweep: ~60 abandoned physobjs added
- After 24h: ~1440 abandoned
- At ~1KB heap state per physobj: ~1.4 MB/day from this exact predicate
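
The arithmetic behind these bullets is simple enough to keep as a helper, so the projection can be re-run as the measured growth rate changes. This is a sketch only: `project_leak` is a hypothetical name, and "~1KB" is taken as 1024 bytes with MB meaning 1024 × 1024 bytes.

```cpp
// Daily projection for objects leaked at a steady rate.
struct LeakProjection {
    double objects_per_day;
    double megabytes_per_day;  // 1 MB = 1024 * 1024 bytes here
};

// candidates_per_min: observed growth of the candidate count.
// bytes_per_object:   estimated heap state retained per abandoned object.
inline LeakProjection project_leak(double candidates_per_min,
                                   double bytes_per_object) {
    const double objects = candidates_per_min * 60.0 * 24.0;
    const double megabytes = objects * bytes_per_object / (1024.0 * 1024.0);
    return LeakProjection{objects, megabytes};
}
```

For example, `project_leak(1.0, 1024.0)` gives 1440 objects and roughly 1.4 MB per day, while doubling the rate to 2 per minute gives roughly 2.8 MB per day.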

Compare to the agent's CObjCell-family estimate of 7-8 MB/hr, which is
roughly 170-190 MB/day. The triple subset accounts for under 1% of that
total, and the subset passing the harder predicates will be smaller still.

**Question for the decision-maker (the human):** is recovering
~1-2 MB/day per active client worth a v13-class risk? Given the
project's 5-day soak target is already met without iter 4, **the
honest answer is probably NO**: iter 4 buys marginal improvement
at meaningful risk.

If the goal is 10-day uptime for heavy looters, iter 4 might help,
but the residual is dominated by other classes (CObjCell, gm*UI
recycle pool, palette outside v3b's scope).

## Recommendation

**Defer iter 4 indefinitely.** Iter 3 instrumentation gives us data
to argue for or against. The DLL form's basic patches (v3b/v5/v11/v14)
are what produce the soak win. Adding the sweep is high-risk,
low-marginal-reward.

Keep this document for reference in case a future analyst decides
the residual leak warrants the risk.