# pdb-extract: pure-Python MSF 7.00 PDB extractor

Reads `refs/acclient.pdb` (Sept 2013 EoR build, 28 MB) and writes two grep-friendly JSON sidecars to `docs/research/named-retail/`:

- **`symbols.json`**: 18,366 named public function symbols from the PDB's S_PUB32 records. Each entry has `address` (image VA), `name` (MSVC-demangled `Class::Method` form), and `mangled` (raw C++ ABI symbol for callers that need exact mangling).
- **`types.json`**: 5,371 unique named struct/class type records from the TPI stream (LF_CLASS / LF_STRUCTURE). Each entry has `name`, `size` (bytes), and `kind` (`class` or `struct`).

## Usage

```powershell
py tools\pdb-extract\pdb_extract.py refs\acclient.pdb
```

Runs in <1 second. No external dependencies; uses Python stdlib only.

## Schema

`symbols.json`:

```json
[
  {
    "address": "0x00594570",
    "name": "CEnchantmentRegistry::EnchantAttribute",
    "mangled": "?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z"
  },
  ...
]
```

`types.json`:

```json
[
  { "name": "CEnchantmentRegistry", "size": 32, "kind": "class" },
  ...
]
```

## Workflow integration

The committed JSON sidecars are the named-retail counterpart to the `acclient_2013_pseudo_c.txt` text dump. Pseudo-C is for reading function bodies; `symbols.json` is for programmatic lookups. Use `jq` to query:

```bash
# Find a function by exact name
jq '.[] | select(.name == "CEnchantmentRegistry::EnchantAttribute")' docs/research/named-retail/symbols.json

# Find all functions on a class
jq '.[] | select(.name | startswith("CACQualities::"))' docs/research/named-retail/symbols.json

# Reverse lookup by address (e.g. mid-body fix-up)
jq '.[] | select(.address == "0x00594570")' docs/research/named-retail/symbols.json

# Find a type by name
jq '.[] | select(.name == "Enchantment")' docs/research/named-retail/types.json
```

## Address mapping caveat

The PDB is from the Sept 2013 EoR build.
Addresses generally agree with the binary used to produce our `docs/research/decompiled/` Ghidra chunks to within ~0xC00 bytes (the two come from different build runs of the same source revision). **When using `symbols.json` to correct entries in `acclient_function_map.md`, match by name, not by raw address.**

## Implementation notes

The script is a self-contained MSF 7.00 reader. References used:

- LLVM PDB documentation (https://llvm.org/docs/PDB/): file format spec
- `pdbparse` (community parser for Microsoft PDB files): implementation cross-check

Streams consumed:

- **3 (DBI)**: parses the header to extract the symbol-record stream index and the section-headers index from the optional debug-header sub-stream.
- **9 (section headers)**: parses `IMAGE_SECTION_HEADER` entries to build a section-base table for VA computation.
- **8 (sym record stream)**: iterates records, picks `S_PUB32` with the `PUBSYM_FLAG_CODE` bit set, computes `VA = section_base + offset`.
- **2 (TPI)**: iterates type records, picks `LF_CLASS` / `LF_STRUCTURE` that aren't forward-declared, parses size leaf + name.

The MSVC name demangler (`_demangle`) is best-effort: it handles the common `?Method@Class@Outer@@` patterns, constructors (`??0`), and destructors (`??1`). It returns the mangled string unchanged for operator overloads (`??2`, `??3`), vtables (`??_`), and other forms where a partial demangle would be misleading. Both `name` (demangled) and `mangled` (raw) are emitted in `symbols.json` so consumers can choose.

## When to regenerate

- Whenever `refs/acclient.pdb` is updated (rare).
- Whenever `pdb_extract.py` is changed (e.g. better demangler, more type info recovery).

The output JSONs are committed because they're stable and small (~3 MB combined), and grepping them is faster than re-parsing the PDB on every session.

## Future work (out of scope here)

The current `types.json` carries only name, size, and kind. A more ambitious version would walk LF_FIELDLIST records to recover field names + offsets + types, giving us a JSON-encoded `acclient.h`.
Not done yet because `acclient.h` is already committed at `docs/research/named-retail/acclient.h`. Consider this if a future panel needs `offsetof()` at runtime.
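
## Example: Python lookups (sketch)

Separately from the future work above, the `jq` queries under "Workflow integration" match an address only when it is exactly a function's start. For an address that falls inside a function body, a nearest-preceding-symbol search is one option. The following is a minimal sketch, assuming only the `symbols.json` schema documented above; the helper names (`load_symbols`, `build_indexes`, `containing_symbol`) and the second sample entry are hypothetical and not part of `pdb_extract.py`.

```python
import bisect
import json
from pathlib import Path


def load_symbols(path):
    """Load symbols.json: a list of {"address", "name", "mangled"} dicts."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def build_indexes(symbols):
    """Return (by_name, starts, ordered) for name and address lookups."""
    by_name = {}
    for sym in symbols:
        # Keep a list per name in case several symbols share a demangled name.
        by_name.setdefault(sym["name"], []).append(sym)
    ordered = sorted(symbols, key=lambda s: int(s["address"], 16))
    starts = [int(s["address"], 16) for s in ordered]
    return by_name, starts, ordered


def containing_symbol(starts, ordered, va):
    """Return the symbol with the greatest start address <= va, or None.

    This is only a guess at the enclosing function: the sidecar carries no
    function sizes, so padding or unnamed code after a function still maps
    to that function.
    """
    i = bisect.bisect_right(starts, va) - 1
    return ordered[i] if i >= 0 else None


# Inline sample so the sketch runs without the repo checked out.
SAMPLE = [
    {
        "address": "0x00594570",
        "name": "CEnchantmentRegistry::EnchantAttribute",
        "mangled": "?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z",
    },
    # Hypothetical entry, invented for this example only.
    {
        "address": "0x00594600",
        "name": "Example::Hypothetical",
        "mangled": "?Hypothetical@Example@@QAEXXZ",
    },
]

by_name, starts, ordered = build_indexes(SAMPLE)

# Mid-body address: resolves to the nearest symbol at or below it.
hit = containing_symbol(starts, ordered, 0x00594584)
print(hit["name"])

# Name-based match, as the address mapping caveat recommends.
print(by_name["CEnchantmentRegistry::EnchantAttribute"][0]["address"])
```

To run this against the real sidecar, replace `SAMPLE` with `load_symbols("docs/research/named-retail/symbols.json")`. The address search is only meaningful for addresses from the same build as the PDB; for addresses taken from the Ghidra chunks, the ~0xC00 drift noted in the caveat can land on a neighbouring function, so prefer the name index there.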