leakhunt/dll/leakfix/stable/src.iter3/sweep_design.md

# Iter 4 — CPhysicsObj sweep design (DRAFT, NOT YET IMPLEMENTED)

## Goal

Periodically destroy abandoned CPhysicsObj instances to recover the
residual leak documented in §6.1 of REPORT.md. **Highest-risk patch
class** (physics-state mutation, same risk profile as v13 which
killed Larsson at 98 min). Long soak per change is mandatory.

## What iter 3 told us

After 13 minutes on Unkle Leo (PID 16044), a typical scan shows:

```
total=971 no_parent=546 no_cell=278 orphan_hash=697 both=234 triple=111
```

So ~11% of all CPhysicsObj instances pass the strict triple predicate.
On a fresh client triple count is ~100 (startup residual). Growth is
+1-2 candidates per minute during normal play.
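
As a rough illustration of how the counters in that scan line relate to one another, the tally can be sketched as below. The `TrackingView` struct and the function names are hypothetical, and the definitions of `both` (no parent and no cell) and `triple` (all three fields null) are assumptions inferred from the sample dumps, not taken from the iter-3 source.

```cpp
#include <cstdint>
#include <vector>

// Minimal view of the three tracking fields the scan inspects.
// Illustrative only; not the real CPhysicsObj layout.
struct TrackingView {
    std::uint32_t parent;     // 0 = no parent
    std::uint32_t cell;       // 0 = not placed in a cell
    std::uint32_t hash_next;  // 0 = counted as orphan_hash
};

struct ScanCounts {
    unsigned total = 0, no_parent = 0, no_cell = 0;
    unsigned orphan_hash = 0, both = 0, triple = 0;
};

// Assumption: `both` = no parent AND no cell; `triple` = all three null.
ScanCounts tally(const std::vector<TrackingView>& objs) {
    ScanCounts c;
    for (const TrackingView& o : objs) {
        const bool np = (o.parent == 0);
        const bool nc = (o.cell == 0);
        const bool oh = (o.hash_next == 0);
        ++c.total;
        if (np) ++c.no_parent;
        if (nc) ++c.no_cell;
        if (oh) ++c.orphan_hash;
        if (np && nc) ++c.both;
        if (np && nc && oh) ++c.triple;
    }
    return c;
}

// Share of all instances that pass the triple predicate, in percent.
double triple_percent(const ScanCounts& c) {
    return c.total ? 100.0 * c.triple / c.total : 0.0;
}
```

With `total=971` and `triple=111` this gives roughly 11.4%, which is where the ~11% figure comes from.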

Strict-candidate sample dumps confirm:
- `parent`, `cell`, `hash_next` all NULL ✓
- `part_array` non-NULL (heap allocation that should be freed)
- `shadow_objects.data` non-NULL (heap allocation that should be freed)
- `state` has small bits set (e.g., 0x00000414 — normal active flags)

This matches the v17 owner-vtable diagnostic's "abandoned but heap state
still allocated" pattern.

## Candidate destruction call

The engine already has correct teardown:

```c
// EoR 0x005145D0 — CPhysicsObj::Destroy
void __thiscall CPhysicsObj::Destroy(CPhysicsObj* this);
```

Per the v17 owner-diag, `CPhysicsObj::Destroy` correctly tears down
all owned heap state (`CPartArray::DestroyParts`, etc.). The leak is
that it's never **called** on these abandoned objects.

After Destroy, the CPhysicsObj itself (~408 bytes) needs to be freed
via `operator delete`.

## Predicate hardening (BEFORE we destroy anything)

The triple predicate may not be conservative enough. Additional
checks before destroy:

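1. **`update_time` is stale**: the field at +0xD4 is an 8-byte `double`
   timestamp. If it is less than `now() - 60s`, the object hasn't been
   touched in a minute. `now()` must come from the same clock the engine
   writes into `update_time` (global not yet located); `timeGetTime()`
   returns DWORD milliseconds and is only usable if the field turns out
   to share that time base.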
2. **`state` is not "currently active"**: the bits that indicate
   "being processed" still need to be identified. For now, skip the
   object if any of bits 16..31 is set (`state & 0xFFFF0000`).
3. **`weenie_obj == NULL`** — at +0x?? (need to verify offset).
   If a weenie-object still owns this physobj, the engine considers
   it alive even if other tracking is gone.
4. **`movement_manager == NULL`** — at +0xC4 per acclient.h
   (LongHashData base 12 + ... + 0xB8 should be it). If there's an
   active mover, the object is in flight.
5. **`hooks == NULL`** — at +0xE? — animation hooks pending.

The candidate must pass ALL these AND the iter-3 triple predicate.
Stricter than iter 3.

## Safety protocol

1. **Throttle:** max 1 destruction per scan cycle (5 min). Even if 100
   candidates qualify, destroy ONE per scan. Surface latent bugs slowly.
2. **Sample-first:** for the first 2 hours, LOG candidate addresses
   but do NOT destroy. Verify the candidates stay candidates over
   multiple scans (i.e., they're not transient).
3. **Forensic log:** before each destruction, log the address and a
   pre-destroy field dump. If the process crashes during or after the
   call, the log names the last destroyed object.
4. **Kill switch:** check `LEAKFIX_NO_SWEEP=1` env var at scan start.
   If set, skip destruction. Default ON (=destroy) once code lands.
5. **Initial test target:** Unkle Leo (current designated guinea pig
   per CLAUDE.md). One client only. 4-hour soak before declaring safe.
6. **Failure recovery:** if Unkle Leo crashes within 1 hour of
   destruction logic enabling, set the env var, restart with sweep
   disabled, mark iter-4 as failed in memory, do not retry without
   redesign.
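
The kill switch, the throttle, and the sample-first window from the list above could be sketched as follows. This is a minimal sketch under assumptions: `sweep_disabled`, `SweepThrottle`, and `in_observation_window` are hypothetical names, the millisecond clock is supplied by the caller, and the kill switch is assumed to trigger only on the exact value `1`.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Kill switch: true when the variable's value is exactly "1".
// Takes the raw value so the logic can be checked without touching the environment.
bool sweep_disabled(const char* env_value) {
    return env_value != nullptr && std::strcmp(env_value, "1") == 0;
}

// Read LEAKFIX_NO_SWEEP at scan start.
bool env_skip_sweep() {
    return sweep_disabled(std::getenv("LEAKFIX_NO_SWEEP"));
}

// Throttle: at most one destruction per interval, however many candidates qualify.
class SweepThrottle {
public:
    explicit SweepThrottle(std::uint64_t interval_ms) : interval_ms_(interval_ms) {}

    // Returns true (and records the time) if a destruction is allowed now.
    bool try_acquire(std::uint64_t now_ms) {
        if (has_last_ && now_ms - last_ms_ < interval_ms_) return false;
        last_ms_ = now_ms;
        has_last_ = true;
        return true;
    }

private:
    std::uint64_t interval_ms_;
    std::uint64_t last_ms_ = 0;
    bool has_last_ = false;
};

// Sample-first window: log candidates but destroy nothing for the first two hours.
constexpr std::uint64_t kObserveMs = 2ull * 60 * 60 * 1000;

bool in_observation_window(std::uint64_t start_ms, std::uint64_t now_ms) {
    return now_ms - start_ms < kObserveMs;
}
```

A scan would check `env_skip_sweep()` first, then `in_observation_window(...)`, and only then ask the throttle for permission, so each guard can veto destruction independently.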

## Implementation outline (when ready)

```cpp
struct CPhysicsObj {
    void*       vtable;          // +0x00
    void*       hash_next;       // +0x04
    uint32_t    id;              // +0x08
    void*       netblob_list;    // +0x0C
    void*       part_array;      // +0x10
    // ... 20 bytes (+0x14..+0x27) of player_vector/distance/CYpt
    void*       sound_table;     // +0x28
    uint32_t    pad_exam;        // +0x2C
    void*       script_manager;  // +0x30
    void*       physics_script;  // +0x34
    uint32_t    default_script;  // +0x38
    float       script_intensity;// +0x3C
    void*       parent;          // +0x40
    void*       children;        // +0x44
    char        position[72];    // +0x48
    void*       cell;            // +0x90
    uint32_t    num_shadow;      // +0x94
    char        shadow_arr[16];  // +0x98 — DArray
    uint32_t    state;           // +0xA8
    uint32_t    transient_state; // +0xAC
    // ... floats
    void*       movement_manager;// +0xC4
    void*       position_manager;// +0xC8
    int         last_move_auto;  // +0xCC
    int         jumped_frame;    // +0xD0
    double      update_time;     // +0xD4 (8 bytes); needs #pragma pack(4), else the compiler pads it to +0xD8
    // ...
    void*       weenie_obj;      // +0x??  TBD
};

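// __fastcall with a dummy EDX argument stands in for __thiscall (this in ECX).
typedef void (__fastcall *destroy_fn_t)(CPhysicsObj* self, void* edx);
// Integer-to-pointer casts are not constant expressions, so these are const rather than constexpr.
static const destroy_fn_t CPHYSICSOBJ_DESTROY = (destroy_fn_t)0x005145D0;
static void* const OP_DELETE = (void*)0x005DF15E;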

bool is_truly_abandoned(CPhysicsObj* p) {
    if (p->parent) return false;
    if (p->cell) return false;
    if (p->hash_next) return false;
    if (p->movement_manager) return false;
    // state mask: bits 0..15 are flags we tolerate; high bits suggest
    // active processing
    if ((p->state & 0xFFFF0000) != 0) return false;
    if (p->weenie_obj) return false;  // need offset verified
    // update_time stale check
    double now = get_engine_time();   // need to find this — e.g., 0x????
    // Written as a positive test so a NaN or garbage timestamp fails closed.
    if (!(now - p->update_time >= 60.0)) return false;
    return true;
}

void sweep_once() {
    if (env_skip_sweep()) return;
    // Walk all CPhysicsObj instances...
    CPhysicsObj* victim = nullptr;
    for (each CPhysicsObj p) {
        if (is_truly_abandoned(p)) { victim = p; break; }  // ONLY ONE
    }
    if (!victim) return;

    logf("SWEEP destroying CPhysicsObj @ 0x%p (state=0x%08x)", victim, victim->state);
    dump_physobj((uintptr_t)victim);  // pre-destroy forensics
    __try {
        CPHYSICSOBJ_DESTROY(victim, 0);
        // operator delete is a plain __cdecl function (argument on the stack), not a member call.
        ((void(__cdecl*)(void*))OP_DELETE)(victim);
        logf("SWEEP ok");
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        logf("SWEEP exception — abandoning sweep this scan");
    }
}
```
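
Because every predicate reads fields at fixed offsets, it may be worth pinning the layout down with compile-time checks before any of the code above runs. The sketch below is a hypothetical mirror struct (`PhysObjLayout` is not an engine type): pointer fields are stored as `uint32_t` so the offsets also hold in a 64-bit build, the two `// ...` gaps are filled with opaque byte arrays sized from the neighbouring offsets, and `weenie_obj` is omitted because its offset is still unknown.

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 4)  // keep the 8-byte update_time at +0xD4 instead of +0xD8
struct PhysObjLayout {
    std::uint32_t vtable;            // +0x00
    std::uint32_t hash_next;         // +0x04
    std::uint32_t id;                // +0x08
    std::uint32_t netblob_list;      // +0x0C
    std::uint32_t part_array;        // +0x10
    char          gap_14[0x14];      // +0x14..+0x27 player_vector/distance/CYpt
    std::uint32_t sound_table;       // +0x28
    std::uint32_t pad_exam;          // +0x2C
    std::uint32_t script_manager;    // +0x30
    std::uint32_t physics_script;    // +0x34
    std::uint32_t default_script;    // +0x38
    float         script_intensity;  // +0x3C
    std::uint32_t parent;            // +0x40
    std::uint32_t children;          // +0x44
    char          position[72];      // +0x48
    std::uint32_t cell;              // +0x90
    std::uint32_t num_shadow;        // +0x94
    char          shadow_arr[16];    // +0x98
    std::uint32_t state;             // +0xA8
    std::uint32_t transient_state;   // +0xAC
    char          gap_b0[0x14];      // +0xB0..+0xC3 floats
    std::uint32_t movement_manager;  // +0xC4
    std::uint32_t position_manager;  // +0xC8
    std::int32_t  last_move_auto;    // +0xCC
    std::int32_t  jumped_frame;      // +0xD0
    double        update_time;       // +0xD4
};
#pragma pack(pop)

// Fail the build if any field the predicate reads has drifted.
static_assert(offsetof(PhysObjLayout, hash_next) == 0x04, "hash_next offset");
static_assert(offsetof(PhysObjLayout, sound_table) == 0x28, "sound_table offset");
static_assert(offsetof(PhysObjLayout, parent) == 0x40, "parent offset");
static_assert(offsetof(PhysObjLayout, cell) == 0x90, "cell offset");
static_assert(offsetof(PhysObjLayout, state) == 0xA8, "state offset");
static_assert(offsetof(PhysObjLayout, movement_manager) == 0xC4, "movement_manager offset");
static_assert(offsetof(PhysObjLayout, update_time) == 0xD4, "update_time offset");
```

These checks only prove the struct agrees with the offsets written in this document; whether those offsets match the running client still has to be confirmed against sample dumps.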

## Known unknowns to resolve before coding

1. **Engine time global address** — for the stale-`update_time` check
2. **`weenie_obj` offset** — need to read acclient.h carefully or sample dumps
3. **State-bit meanings** — which bits indicate "in active processing"
4. **Does `operator delete` of a CPhysicsObj that already had Destroy() called work?** —
   Destroy probably tears down state but may not free `this`.
5. **What if the object is mid-iteration in some other code?** —
   destroying it would leave dangling iterators. Need to check the
   render loop / update loop doesn't have outstanding refs.

These are NOT minor — getting any wrong = v13-class crash.

## Recommended path

1. **Iter 4a (logging-only):** add the harder predicates (`movement_manager`,
   `weenie_obj`, `update_time` stale, state mask). Log candidate count
   passing the harder set. Compare to iter-3 triple count. If much
   smaller, predicates are stricter and we have higher confidence.
2. **Iter 4b (sample-first):** dump 3 candidates that pass the hard
   set every scan. Verify they look genuinely abandoned across multiple
   scans.
3. **Iter 4c (destroy 1 per hour, not per scan):** initial mutation
   test at the slowest possible rate. Soak 8h+ before declaring safe.
4. **Iter 4d (destroy N per scan, where N = current candidate count):**
   only after 4c passes 24h soak.
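
For the sample-first stage, "genuinely abandoned across multiple scans" can be made measurable by counting how many consecutive scans each address has passed the hard predicate. The sketch below is one possible shape for that bookkeeping; `CandidateTracker` and its methods are hypothetical names, and keying on the raw address is an assumption that ignores heap address reuse.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Tracks consecutive-scan streaks for candidate addresses.
class CandidateTracker {
public:
    // Feed the candidate addresses found by one scan. Addresses missing from
    // this scan are dropped, so a transient object restarts at 1 if it returns.
    void record_scan(const std::vector<std::uintptr_t>& candidates) {
        std::unordered_map<std::uintptr_t, unsigned> next;
        for (std::uintptr_t addr : candidates) {
            auto it = streak_.find(addr);
            next[addr] = (it == streak_.end() ? 0u : it->second) + 1u;
        }
        streak_.swap(next);
    }

    // Number of scans in a row for which `addr` has been a candidate.
    unsigned streak(std::uintptr_t addr) const {
        auto it = streak_.find(addr);
        return it == streak_.end() ? 0u : it->second;
    }

    // Addresses that have been candidates for at least `min_scans` scans in a row.
    std::vector<std::uintptr_t> stable(unsigned min_scans) const {
        std::vector<std::uintptr_t> out;
        for (const auto& kv : streak_) {
            if (kv.second >= min_scans) out.push_back(kv.first);
        }
        return out;
    }

private:
    std::unordered_map<std::uintptr_t, unsigned> streak_;
};
```

Since a freed slot can be handed to a new CPhysicsObj, pairing the address with the object's `id` field would make the streak harder to fool; that refinement is left out here for brevity.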

This is a 3-day minimum process if everything goes right. If a v13-class
crash happens anywhere, restart from 4a with a redesigned predicate.

## Decision gate

Per the soak data on Unkle Leo:
- triple candidate growth: ~5/5min = 1/min
- After 1 hour without sweep: ~60 abandoned physobjs added
- After 24h: ~1440 abandoned
- At ~1KB heap state per physobj: ~1.4 MB/day from this exact predicate

Compare to the agent's CObjCell-family estimate of 7-8 MB/hr. The
triple subset is much smaller than the agent's total. The harder
predicates will be smaller still.
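
The arithmetic behind these figures, worked through as a small sketch. The inputs are the numbers quoted above (1 candidate per minute, ~1 KB per object, the low end of the 7-8 MB/hr estimate); the constant and function names are illustrative only.

```cpp
// Inputs taken from the soak figures in this section.
constexpr double kGrowthPerMin   = 1.0;  // triple candidates added per minute
constexpr double kHeapKbPerObj   = 1.0;  // ~1 KB of heap state per physobj
constexpr double kAgentMbPerHour = 7.0;  // low end of the 7-8 MB/hr estimate

// 1/min * 60 * 24 = 1440 abandoned objects per day.
constexpr double objs_per_day() { return kGrowthPerMin * 60.0 * 24.0; }

// 1440 KB / 1024 = about 1.4 MB per day.
constexpr double leak_mb_per_day() { return objs_per_day() * kHeapKbPerObj / 1024.0; }

// 7 MB/hr * 24 = 168 MB per day for the CObjCell-family estimate.
constexpr double agent_mb_per_day() { return kAgentMbPerHour * 24.0; }

// Fraction of the CObjCell-family estimate that the triple subset accounts for.
constexpr double share_of_agent_total() { return leak_mb_per_day() / agent_mb_per_day(); }

static_assert(objs_per_day() == 1440.0, "1 per minute over 24 hours");
static_assert(share_of_agent_total() < 0.01, "triple subset is under 1% of the estimate");
```

On these inputs the triple subset comes to roughly 0.8% of the 7 MB/hr figure, which is the quantitative basis for calling it "much smaller".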

**Question for the decision-maker (the human):** is recovering
~1-2 MB/day per active client worth a v13-class risk? Given the
project's 5-day soak target is already met without iter 4, **the
honest answer is probably NO** — iter 4 buys marginal improvement
at meaningful risk.

If the goal is 10-day uptime for heavy looters, iter 4 might help
but the residual is dominated by other classes (CObjCell, gm*UI
recycle pool, palette outside v3b's scope).

## Recommendation

**Defer iter 4 indefinitely.** Iter 3 instrumentation gives us data
to argue for or against. The DLL form's basic patches (v3b/v5/v11/v14)
are what produce the soak win. Adding sweep is high-risk,
low-marginal-reward.

Keep this document for future reference if a future analyst decides
the residual leak warrants the risk.