acdream/tools/pdb-extract
Erik 69d884a3d6 tools(pdb-extract): #8 PDB -> symbols.json + types.json sidecar
Pure-Python MSF 7.00 PDB extractor (no deps, stdlib only). Reads
refs/acclient.pdb directly:
  - DBI stream (3) -> symbol record stream index + section header
    stream index
  - Section headers stream (9) -> per-segment image VA bases
  - Symbol record stream (8) -> S_PUB32 records with image VAs
  - TPI stream (2) -> LF_CLASS / LF_STRUCTURE named records (not
    forward-declared), with size leaf + name

Includes a best-effort MSVC C++ demangler so symbols.json is
grep-friendly:
  ?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z
  -> CEnchantmentRegistry::EnchantAttribute

Both demangled `name` + raw `mangled` emitted per entry so callers
can choose. Operator overloads, vtables, and other special forms
where a partial demangle would be misleading are kept mangled.

Outputs committed to docs/research/named-retail/:
  - symbols.json (2.9 MB) — 18,366 named public function symbols
  - types.json (506 KB) — 5,371 unique named class/struct records

Spot check (matches discovery agent's earlier finding):
  CEnchantmentRegistry::EnchantAttribute -> 0x00594570 ✓

Updated docs/research/acclient_function_map.md header preamble to
direct readers at the new symbols.json as the authoritative name
source; the hand-curated table stays as the cross-port (ACE/ACME)
index. Several addresses there are wrong vs the PDB and will be
swept in the issue #9 close (Phase E).

Closes #8 (filed in Phase D's commit). Foundation for the address
sweep + name-driven workflows from here on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:31:52 +02:00
pdb_extract.py tools(pdb-extract): #8 PDB -> symbols.json + types.json sidecar 2026-04-25 17:31:52 +02:00
README.md tools(pdb-extract): #8 PDB -> symbols.json + types.json sidecar 2026-04-25 17:31:52 +02:00

pdb-extract — pure-Python MSF 7.00 PDB extractor

Reads refs/acclient.pdb (Sept 2013 EoR build, 28 MB) and writes two grep-friendly JSON sidecars to docs/research/named-retail/:

  • symbols.json — 18,366 named public function symbols from the PDB's S_PUB32 records. Each entry has address (image VA), name (MSVC-demangled Class::Method form), and mangled (raw C++ ABI symbol for callers that need exact mangling).
  • types.json — 5,371 unique named struct/class type records from the TPI stream (LF_CLASS / LF_STRUCTURE). Each entry has name, size (bytes), and kind (class or struct).

Usage

py tools\pdb-extract\pdb_extract.py refs\acclient.pdb

Runs in <1 second. No external dependencies — uses Python stdlib only.

Schema

symbols.json:

[
  {
    "address": "0x00594570",
    "name": "CEnchantmentRegistry::EnchantAttribute",
    "mangled": "?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z"
  },
  ...
]

types.json:

[
  {
    "name": "CEnchantmentRegistry",
    "size": 32,
    "kind": "class"
  },
  ...
]

Workflow integration

The committed JSON sidecars are the named-retail counterpart to the acclient_2013_pseudo_c.txt text dump. Pseudo-C is for reading function bodies; symbols.json is for programmatic lookups. Use jq to query:

# Find a function by exact name
jq '.[] | select(.name == "CEnchantmentRegistry::EnchantAttribute")' docs/research/named-retail/symbols.json

# Find all functions on a class
jq '.[] | select(.name | startswith("CACQualities::"))' docs/research/named-retail/symbols.json

# Reverse lookup by exact function-start address (a mid-body address will not match)
jq '.[] | select(.address == "0x00594570")' docs/research/named-retail/symbols.json

# Find a type by name
jq '.[] | select(.name == "Enchantment")' docs/research/named-retail/types.json
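
The address query above compares strings for equality, so it only finds a symbol whose start address is exactly the one given. For an address inside a function body, a nearest-preceding-symbol search is one option. The following is a minimal Python sketch, assuming the symbols.json schema shown earlier. The helper names build_index and containing_symbol are hypothetical, and so is the second sample entry. Because only public symbols are present, the result is the nearest named symbol at or below the address, not a guaranteed containing function.

```python
import bisect

# Sample entries in the symbols.json shape. The second entry is made up
# for illustration; in practice load the list with json.load().
symbols = [
    {"address": "0x00594570",
     "name": "CEnchantmentRegistry::EnchantAttribute",
     "mangled": "?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z"},
    {"address": "0x00594700",
     "name": "Example::Hypothetical",
     "mangled": "?Hypothetical@Example@@QAEXXZ"},
]

def build_index(entries):
    """Return (sorted start addresses, names in the same order)."""
    pairs = sorted((int(e["address"], 16), e["name"]) for e in entries)
    return [a for a, _ in pairs], [n for _, n in pairs]

def containing_symbol(index, address):
    """Nearest symbol starting at or below address, plus the offset into it."""
    starts, names = index
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return None  # address lies below the first known symbol
    return names[i], address - starts[i]

index = build_index(symbols)
print(containing_symbol(index, 0x00594580))
```

Building the index once and reusing it keeps each lookup at O(log n) over the 18,366 entries.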

Address mapping caveat

The PDB is from the Sept 2013 EoR build. Its addresses generally agree with the binary used to produce our docs/research/decompiled/ Ghidra chunks to within ~0xC00 bytes (the two come from different build runs of the same source revision), so an address taken from one is not guaranteed to be exact in the other. When using symbols.json to correct entries in acclient_function_map.md, match by name, not by raw address.
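
As an illustration of matching by name, the sketch below compares hand-curated entries against symbols.json and reports which addresses need attention. It assumes the curated table has already been parsed into a list of dicts with name and address keys. That shape, the function name reconcile_by_name, and the sample curated data are all hypothetical. Demangled names drop the signature, so overloads can share a name, and the sketch reports those as ambiguous rather than guessing.

```python
def reconcile_by_name(curated, symbols):
    """Yield (name, curated_address, status, pdb_addresses) per curated entry."""
    by_name = {}
    for s in symbols:
        by_name.setdefault(s["name"], []).append(int(s["address"], 16))
    for entry in curated:
        old = int(entry["address"], 16)
        candidates = by_name.get(entry["name"], [])
        if not candidates:
            status = "missing"    # name not present in the PDB output
        elif old in candidates:
            status = "ok"         # curated address already agrees
        elif len(candidates) == 1:
            status = "fix"        # exactly one PDB address: safe to adopt
        else:
            status = "ambiguous"  # overloads share the name: review by hand
        yield entry["name"], old, status, candidates

# Illustrative data only; the curated addresses are invented.
symbols = [{"address": "0x00594570",
            "name": "CEnchantmentRegistry::EnchantAttribute",
            "mangled": "?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z"}]
curated = [
    {"name": "CEnchantmentRegistry::EnchantAttribute", "address": "0x00594000"},
    {"name": "Example::NotInPdb", "address": "0x00401000"},
]
for name, old, status, found in reconcile_by_name(curated, symbols):
    print(name, hex(old), status, [hex(a) for a in found])
```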

Implementation notes

The script is a self-contained MSF 7.00 reader. References used:

  • LLVM PDB documentation (https://llvm.org/docs/PDB/): file format spec
  • pdbparse (community Python library for Microsoft PDB files, not a Microsoft project): implementation cross-check
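
For orientation, the first step of any MSF 7.00 reader is the superblock at the start of the file. The sketch below follows the layout described in the LLVM PDB documentation: a 32-byte magic followed by six little-endian 32-bit fields. It is not taken from pdb_extract.py, and the function name read_superblock and the returned dict keys are assumptions made for illustration.

```python
import struct

# 32-byte signature that opens every MSF 7.00 file.
MSF_MAGIC = b"Microsoft C/C++ MSF 7.00\r\n\x1aDS\x00\x00\x00"

def read_superblock(data):
    """Parse the MSF superblock from the first 56 bytes of a PDB file."""
    if data[:32] != MSF_MAGIC:
        raise ValueError("not an MSF 7.00 file")
    (block_size, free_block_map_block, num_blocks,
     num_directory_bytes, _unknown, block_map_addr) = struct.unpack_from("<6I", data, 32)
    if block_size == 0:
        raise ValueError("invalid block size")
    return {
        "block_size": block_size,
        "free_block_map_block": free_block_map_block,
        "num_blocks": num_blocks,
        "num_directory_bytes": num_directory_bytes,
        # Block holding the list of blocks that make up the stream directory.
        "block_map_addr": block_map_addr,
        # The directory spans this many blocks (ceiling division).
        "directory_blocks": -(-num_directory_bytes // block_size),
    }
```

From here a reader follows block_map_addr to the stream directory, which gives the size and block list of each numbered stream.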

Streams consumed:

  • 3 (DBI) — parses the header to extract the symbol-record stream index + the optional debug-header sub-stream's section-headers index.
  • 9 (section headers) — parses IMAGE_SECTION_HEADER entries to build a section-base table for VA computation.
  • 8 (sym record stream) — iterates records, picks S_PUB32 with the PUBSYM_FLAG_CODE bit set, computes VA = section_base + offset.
  • 2 (TPI) — iterates type records, picks LF_CLASS / LF_STRUCTURE that aren't forward-declared, parses size leaf + name.
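
The symbol-record step above can be sketched as follows. This is an illustration under stated assumptions, not the code in pdb_extract.py: the record layout (16-bit length, 16-bit kind, then 32-bit flags, 32-bit offset, 16-bit segment, NUL-terminated name) and the values 0x110E and 0x1 follow the public CodeView definitions. The function name iter_public_code_symbols is hypothetical, and section_bases is assumed to already hold absolute VAs (image base plus each section's VirtualAddress).

```python
import struct

S_PUB32 = 0x110E        # CodeView record kind for public symbols
PUBSYM_FLAG_CODE = 0x1  # low bit of the flags field: symbol is code

def iter_public_code_symbols(stream, section_bases):
    """Yield (va, mangled_name) for each S_PUB32 code symbol in the stream.

    section_bases[i] is the absolute VA of section i + 1, because the
    segment numbers stored in S_PUB32 records are 1-based.
    """
    pos = 0
    while pos + 4 <= len(stream):
        rec_len, rec_kind = struct.unpack_from("<HH", stream, pos)
        if rec_len < 2:
            break  # malformed record; stop rather than loop forever
        end = pos + 2 + rec_len  # rec_len excludes its own two bytes
        if rec_kind == S_PUB32 and end <= len(stream) and end - pos >= 14:
            flags, offset, segment = struct.unpack_from("<IIH", stream, pos + 4)
            raw_name = stream[pos + 14:end]
            name = raw_name.split(b"\0", 1)[0].decode("ascii", "replace")
            if flags & PUBSYM_FLAG_CODE and 1 <= segment <= len(section_bases):
                yield section_bases[segment - 1] + offset, name
        pos = end
```

With a first section based at 0x00401000 (an assumed value for illustration), a record carrying segment 1 and offset 0x193570 would yield the VA 0x00594570.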

The MSVC name demangler (_demangle) is best-effort: it handles the common ?Method@Class@Outer@@<sig> patterns, constructors (??0), and destructors (??1). It returns the mangled string unchanged for operator overloads (such as ??2 operator new and ??3 operator delete), vtables and other ??_ special names, and any other form where a partial demangle would be misleading. Both name (demangled) and mangled (raw) are emitted in symbols.json so consumers can choose.
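
To make the behaviour described above concrete, here is a simplified sketch of a demangler with the same contract: qualified names, constructors, and destructors are rewritten, and everything else is returned unchanged. It is not the script's _demangle. The name demangle_simple is hypothetical, and the sketch ignores templates, back-references, and the signature portion entirely.

```python
def demangle_simple(mangled):
    """Best-effort MSVC demangle to Outer::Class::Method, or return input as-is."""
    if not mangled.startswith("?"):
        return mangled
    body = mangled[1:]
    special = None
    if body.startswith("?"):
        code = body[1:2]
        if code == "0":
            special = "ctor"
        elif code == "1":
            special = "dtor"
        else:
            return mangled  # operators, ??_ specials: keep the raw form
        body = body[2:]
    qualified, sep, _signature = body.partition("@@")
    if not sep or not qualified:
        return mangled
    parts = qualified.split("@")
    # Empty or '?'-prefixed components indicate templates or other forms
    # this sketch does not understand.
    if any(not p or p.startswith("?") for p in parts):
        return mangled
    scopes = list(reversed(parts))  # mangled order is innermost first
    if special == "ctor":
        scopes.append(scopes[-1])
    elif special == "dtor":
        scopes.append("~" + scopes[-1])
    return "::".join(scopes)

print(demangle_simple("?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z"))
```

Returning the input unchanged on any unrecognised form keeps the name field from ever holding a half-decoded string.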

When to regenerate

  • Whenever refs/acclient.pdb is updated (rare).
  • Whenever pdb_extract.py is changed (e.g. better demangler, more type info recovery).

The output JSONs are committed because they are stable and small (~3 MB combined), and grepping them is faster than re-parsing the PDB in every session.

Future work (out of scope here)

The current types.json carries only name, size, and kind. A more ambitious version would walk LF_FIELDLIST records to recover field names, offsets, and types, giving us a JSON-encoded acclient.h. This has not been done because acclient.h is already committed at docs/research/named-retail/acclient.h. Consider it if a future panel needs offsetof() at runtime.