Pure-Python MSF 7.00 PDB extractor (no deps, stdlib only). Reads
refs/acclient.pdb directly:

- DBI stream (3) -> symbol record stream index + section header
  stream index
- Section headers stream (9) -> per-segment image VA bases
- Symbol record stream (8) -> S_PUB32 records with image VAs
- TPI stream (2) -> LF_CLASS / LF_STRUCTURE named records (not
  forward-declared), with size leaf + name

Includes a best-effort MSVC C++ demangler so symbols.json is
grep-friendly:

    ?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z
      -> CEnchantmentRegistry::EnchantAttribute

Both demangled `name` + raw `mangled` emitted per entry so callers
can choose. Operator overloads, vtables, and other special forms
where a partial demangle would be misleading are kept mangled.

Outputs committed to docs/research/named-retail/:

- symbols.json (2.9 MB): 18,366 named public function symbols
- types.json (506 KB): 5,371 unique named class/struct records

Spot check (matches discovery agent's earlier finding):

    CEnchantmentRegistry::EnchantAttribute -> 0x00594570 ✓

Updated docs/research/acclient_function_map.md header preamble to
direct readers at the new symbols.json as the authoritative name
source; the hand-curated table stays as the cross-port (ACE/ACME)
index. Several addresses there are wrong vs the PDB and will be
swept in the issue #9 close (Phase E).

Closes #8 (filed in Phase D's commit). Foundation for the address
sweep + name-driven workflows from here on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# pdb-extract: pure-Python MSF 7.00 PDB extractor

Reads `refs/acclient.pdb` (Sept 2013 EoR build, 28 MB) and writes two
grep-friendly JSON sidecars to `docs/research/named-retail/`:

- **`symbols.json`**: 18,366 named public function symbols from the
  PDB's S_PUB32 records. Each entry has `address` (image VA), `name`
  (MSVC-demangled `Class::Method` form), and `mangled` (raw C++ ABI
  symbol for callers that need exact mangling).
- **`types.json`**: 5,371 unique named struct/class type records from
  the TPI stream (LF_CLASS / LF_STRUCTURE). Each entry has `name`,
  `size` (bytes), and `kind` (`class` or `struct`).

## Usage

```powershell
py tools\pdb-extract\pdb_extract.py refs\acclient.pdb
```

Runs in under a second. No external dependencies; uses the Python stdlib only.

## Schema

`symbols.json`:
```json
[
  {
    "address": "0x00594570",
    "name": "CEnchantmentRegistry::EnchantAttribute",
    "mangled": "?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z"
  },
  ...
]
```

`types.json`:
```json
[
  {
    "name": "CEnchantmentRegistry",
    "size": 32,
    "kind": "class"
  },
  ...
]
```
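As a rough illustration of consuming this schema from Python, the sketch below indexes entries by demangled name. It uses an inline sample instead of the real file so that it runs anywhere; `index_symbols` is a hypothetical helper, not part of `pdb_extract.py`. Because demangled names drop the signature, overloads can share a name, so each name maps to a list of addresses.

```python
import json
from collections import defaultdict

# Inline sample in the documented schema. In practice, load
# docs/research/named-retail/symbols.json with json.load() instead.
SAMPLE = """[
  {"address": "0x00594570",
   "name": "CEnchantmentRegistry::EnchantAttribute",
   "mangled": "?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z"}
]"""


def index_symbols(entries):
    """Map demangled name -> list of integer addresses (hypothetical helper)."""
    by_name = defaultdict(list)
    for entry in entries:
        # Addresses are stored as hex strings such as "0x00594570".
        by_name[entry["name"]].append(int(entry["address"], 16))
    return dict(by_name)


by_name = index_symbols(json.loads(SAMPLE))
print([hex(a) for a in by_name["CEnchantmentRegistry::EnchantAttribute"]])
```

Keeping addresses as integers after loading makes range comparisons and sorting straightforward; format them back to hex only for display.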

## Workflow integration

The committed JSON sidecars are the named-retail counterpart to the
`acclient_2013_pseudo_c.txt` text dump. Pseudo-C is for reading
function bodies; `symbols.json` is for programmatic lookups. Use
`jq` to query:

```bash
# Find a function by exact name
jq '.[] | select(.name == "CEnchantmentRegistry::EnchantAttribute")' docs/research/named-retail/symbols.json

# Find all functions on a class
jq '.[] | select(.name | startswith("CACQualities::"))' docs/research/named-retail/symbols.json

# Reverse lookup by address (exact match on a symbol's start address)
jq '.[] | select(.address == "0x00594570")' docs/research/named-retail/symbols.json

# Find a type by name
jq '.[] | select(.name == "Enchantment")' docs/research/named-retail/types.json
```
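The address query above only matches a symbol's exact start address. For an address that falls inside a function body, one option is a nearest-preceding-symbol search. The following is a minimal sketch under the assumption that entries follow the documented schema; `build_lookup` and `containing_symbol` are hypothetical helpers, and the second sample entry is made up for illustration.

```python
import bisect


def build_lookup(entries):
    """Return (sorted start addresses, parallel (address, name) table)."""
    table = sorted((int(e["address"], 16), e["name"]) for e in entries)
    return [addr for addr, _ in table], table


def containing_symbol(addrs, table, va):
    """Return (name, offset) of the nearest symbol starting at or before va."""
    i = bisect.bisect_right(addrs, va) - 1
    if i < 0:
        return None  # va lies below every known symbol
    start, name = table[i]
    return name, va - start


entries = [
    {"address": "0x00594570", "name": "CEnchantmentRegistry::EnchantAttribute"},
    # Hypothetical neighbour, present only to make the example non-trivial.
    {"address": "0x00594600", "name": "Example::NextFunction"},
]
addrs, table = build_lookup(entries)
print(containing_symbol(addrs, table, 0x00594580))
```

`symbols.json` carries no function sizes, so the result is the nearest preceding public symbol, which is not guaranteed to be the function that actually contains the address.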

## Address mapping caveat

The PDB is from the Sept 2013 EoR build. Addresses generally agree with
the binary used to produce our `docs/research/decompiled/` Ghidra chunks
to within ~0xC00 bytes; the two come from different build runs of the
same source revision, so they are close but not identical.
**When using `symbols.json` to correct entries in
`acclient_function_map.md`, match by name, not by raw address.**
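A name-keyed correction pass could look roughly like the sketch below. It assumes the hand-curated map has already been parsed into `(name, address)` pairs; that parsing step, the `find_corrections` helper, and the deliberately wrong sample address are all hypothetical. Names that are missing from the PDB or that resolve to more than one address (overloads) are skipped rather than guessed.

```python
def find_corrections(map_entries, symbols):
    """Return (name, old_address, pdb_address) for entries that disagree."""
    pdb = {}
    for sym in symbols:
        pdb.setdefault(sym["name"], set()).add(int(sym["address"], 16))

    corrections = []
    for name, old_addr in map_entries:
        candidates = pdb.get(name, set())
        if len(candidates) != 1:
            continue  # unknown name or ambiguous overload: leave for manual review
        (pdb_addr,) = candidates
        if pdb_addr != old_addr:
            corrections.append((name, old_addr, pdb_addr))
    return corrections


symbols = [
    {"address": "0x00594570", "name": "CEnchantmentRegistry::EnchantAttribute"},
]
# Hypothetical map rows; the first address is intentionally wrong.
map_entries = [
    ("CEnchantmentRegistry::EnchantAttribute", 0x00594170),
    ("Example::NotInPdb", 0x00401000),
]
fixes = find_corrections(map_entries, symbols)
print([(n, hex(old), hex(new)) for n, old, new in fixes])
```

Looking up by name sidesteps the build-to-build address drift entirely, which is why the address in the map is treated only as the value to be replaced.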

## Implementation notes

The script is a self-contained MSF 7.00 reader. References used:
- LLVM PDB documentation (https://llvm.org/docs/PDB/): file format spec
- `pdbparse` (community Python library for Microsoft PDB files): implementation cross-check
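For orientation, the MSF 7.00 container starts with a fixed superblock, as described in the LLVM documentation. The sketch below parses that superblock from raw bytes. It is an independent illustration, not the code in `pdb_extract.py`; `parse_superblock` and the synthetic header values are assumptions made for the example.

```python
import struct

# 32-byte signature at the start of every MSF 7.00 file.
MSF7_MAGIC = b"Microsoft C/C++ MSF 7.00\r\n\x1aDS\x00\x00\x00"


def parse_superblock(data):
    """Parse the MSF 7.00 superblock (magic plus six little-endian u32 fields)."""
    if data[:32] != MSF7_MAGIC:
        raise ValueError("not an MSF 7.00 file")
    (block_size, free_block_map, num_blocks,
     num_directory_bytes, _unknown, block_map_addr) = struct.unpack_from("<6I", data, 32)
    return {
        "block_size": block_size,
        "free_block_map_block": free_block_map,
        "num_blocks": num_blocks,
        "num_directory_bytes": num_directory_bytes,
        "block_map_addr": block_map_addr,
        # The stream directory spans this many blocks (ceiling division).
        "num_directory_blocks": -(-num_directory_bytes // block_size),
    }


# Synthetic header with made-up values, for demonstration only.
header = MSF7_MAGIC + struct.pack("<6I", 4096, 1, 7000, 9000, 0, 3)
info = parse_superblock(header)
print(info["block_size"], info["num_directory_blocks"])
```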

Streams consumed:
- **3 (DBI)**: parses the header to extract the symbol-record stream
  index + the optional debug-header sub-stream's section-headers index.
- **9 (section headers)**: parses `IMAGE_SECTION_HEADER` entries to
  build a section-base table for VA computation.
- **8 (sym record stream)**: iterates records, picks `S_PUB32` with
  the `PUBSYM_FLAG_CODE` bit set, computes `VA = section_base + offset`.
- **2 (TPI)**: iterates type records, picks `LF_CLASS` / `LF_STRUCTURE`
  that aren't forward-declared, parses size leaf + name.

The MSVC name demangler (`_demangle`) is best-effort: handles the
common `?Method@Class@Outer@@<sig>` patterns, constructors (`??0`),
and destructors (`??1`). Returns the mangled string unchanged for
operator overloads (`??2`, `??3`), vtables (`??_`), and other forms
where a partial demangle would be misleading. Both `name` (demangled)
and `mangled` (raw) are emitted in `symbols.json` so consumers can
choose.
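To make the best-effort behaviour concrete, here is a small sketch of the same idea. It is not the script's `_demangle`; `demangle_simple` is a hypothetical stand-in that handles only plain qualified names, `??0`, and `??1`, and returns everything else (templates, back-references, operators, vtables) unchanged.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def demangle_simple(mangled):
    """Best-effort MSVC demangle to Outer::Class::Method; else return input."""
    if not mangled.startswith("?"):
        return mangled
    # The qualified name ends at the first "@@"; the signature follows it.
    qualified, sep, _signature = mangled[1:].partition("@@")
    if not sep:
        return mangled
    special = ""
    if qualified.startswith("?"):
        special = qualified[1:2]
        if special not in ("0", "1"):
            return mangled  # operators, vtables, and other special names
        qualified = qualified[2:]
    parts = qualified.split("@")  # innermost name first
    if not all(_IDENT.fullmatch(p) for p in parts):
        return mangled  # templates, back-references, anonymous scopes
    scope = "::".join(reversed(parts))
    if special == "0":
        return f"{scope}::{parts[0]}"    # constructor
    if special == "1":
        return f"{scope}::~{parts[0]}"   # destructor
    return scope


print(demangle_simple("?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z"))
```

Returning the input untouched whenever a component is not a plain identifier keeps the output honest: a consumer sees either a fully reversed scope chain or the raw symbol, never a half-decoded mix.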

## When to regenerate

- Whenever `refs/acclient.pdb` is updated (rare).
- Whenever `pdb_extract.py` is changed (e.g. better demangler, more
  type info recovery).

The output JSONs are committed because they're stable + small (~3 MB
combined), and grepping them is faster than re-parsing the PDB on
every session.

## Future work (out of scope here)

The current `types.json` only carries name, size, and kind. A more
ambitious version would walk LF_FIELDLIST records to recover field
names + offsets + types, giving us a JSON-encoded `acclient.h`. Not
done yet because `acclient.h` is already committed at
`docs/research/named-retail/acclient.h`. Consider this if a future
panel needs offsetof() at runtime.