Pure-docs sweep. Cross-checked 63 hand-curated entries in
acclient_function_map.md against docs/research/named-retail/symbols.json
(the PDB-derived authoritative name table) using the new helper at
tools/pdb-extract/check_function_map.py.
Findings:
- Zero entries matched on both address and name. This confirms
the PDB build is from a different revision than the binary that
produced our Ghidra chunks (the delta is ~0x800-0xC10 bytes and
varies by function cluster). Match by NAME, not by raw address.
- 38 entries corrected by PDB name lookup. The "Was" column
preserves the old address for traceability against existing
code comments. Old entries pointed mid-body of the actual
function; the corrected addresses point to function starts.
- 25 entries have no PDB match. Either inlined / non-public
(no S_PUB32 record) or our hand-derived names were synthesized
from call-site analysis and don't match the MSVC mangled form
in the PDB. Several had wrong class assignments (e.g. 0x5387C0
claimed as CTransition::find_collisions, actually
CPolygon::polygon_hits_sphere). Flagged for re-derivation in
acclient_2013_pseudo_c.txt.
Pattern: kept the table format with two address columns (PDB +
legacy) so existing code references using the old addresses can
still be looked up. Added a sweep-summary section at the bottom of
the file documenting the methodology + findings.
Helper script at tools/pdb-extract/check_function_map.py is reusable
for future re-runs (re-run after every PDB regeneration / function
map edit).
Closes #9.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pdb-extract — pure-Python MSF 7.00 PDB extractor
Reads refs/acclient.pdb (Sept 2013 EoR build, 28 MB) and writes two
grep-friendly JSON sidecars to docs/research/named-retail/:
- symbols.json: 18,366 named public function symbols from the PDB's S_PUB32 records. Each entry has address (image VA), name (MSVC-demangled Class::Method form), and mangled (raw C++ ABI symbol for callers that need exact mangling).
- types.json: 5,371 unique named struct/class type records from the TPI stream (LF_CLASS / LF_STRUCTURE). Each entry has name, size (bytes), and kind (class or struct).
Usage
py tools\pdb-extract\pdb_extract.py refs\acclient.pdb
Runs in <1 second. No external dependencies — uses Python stdlib only.
Schema
symbols.json:
[
  {
    "address": "0x00594570",
    "name": "CEnchantmentRegistry::EnchantAttribute",
    "mangled": "?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z"
  },
  ...
]
types.json:
[
  {
    "name": "CEnchantmentRegistry",
    "size": 32,
    "kind": "class"
  },
  ...
]
Workflow integration
The committed JSON sidecars are the named-retail counterpart to the
acclient_2013_pseudo_c.txt text dump. Pseudo-C is for reading
function bodies; symbols.json is for programmatic lookups. Use
jq to query:
# Find a function by exact name
cat docs/research/named-retail/symbols.json | jq '.[] | select(.name == "CEnchantmentRegistry::EnchantAttribute")'
# Find all functions on a class
cat docs/research/named-retail/symbols.json | jq '.[] | select(.name | startswith("CACQualities::"))'
# Reverse lookup by exact address (matches function starts only; a mid-body address returns nothing)
cat docs/research/named-retail/symbols.json | jq '.[] | select(.address == "0x00594570")'
# Find a type by name
cat docs/research/named-retail/types.json | jq '.[] | select(.name == "Enchantment")'
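The exact-match address query only hits an address that is a function start. For an address that lands inside a function body, one option is a nearest-preceding-symbol search. The following is a minimal sketch, assuming records shaped like symbols.json; the helper names and the second sample record are hypothetical and not part of pdb_extract.py.

```python
import bisect

def build_address_index(symbols):
    """Sort (address, name) pairs so bisect can search them."""
    pairs = sorted((int(s["address"], 16), s["name"]) for s in symbols)
    addrs = [a for a, _ in pairs]
    names = [n for _, n in pairs]
    return addrs, names

def containing_symbol(addrs, names, va):
    """Return (name, offset) of the closest symbol at or below va, else None."""
    i = bisect.bisect_right(addrs, va) - 1
    if i < 0:
        return None
    return names[i], va - addrs[i]

# Sample records in the symbols.json shape. The first address comes from the
# schema example; the second record is made up for illustration.
symbols = [
    {"address": "0x00594570", "name": "CEnchantmentRegistry::EnchantAttribute"},
    {"address": "0x00594600", "name": "Hypothetical::NextFunction"},
]
addrs, names = build_address_index(symbols)
print(containing_symbol(addrs, names, 0x00594584))
```

Because symbols.json only lists public symbols, the nearest preceding entry is a candidate rather than a guarantee: a non-public function could sit between it and the queried address.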
Address mapping caveat
The PDB is from the Sept 2013 EoR build. Addresses generally match the
binary used to produce our docs/research/decompiled/ Ghidra chunks
within ~0xC00 bytes (different build runs of the same source revision).
When using symbols.json to correct entries in
acclient_function_map.md, match by name, not by raw address.
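Matching by name can be sketched as below. This is an illustration under stated assumptions, not the logic of check_function_map.py: the function name, the (name, legacy address) tuple format for map entries, and the legacy addresses in the sample are all hypothetical.

```python
def correct_by_name(entries, symbols):
    """Look each hand-curated entry up by name; report PDB address and delta.

    entries: iterable of (name, legacy_address) tuples (assumed format).
    symbols: list of dicts shaped like symbols.json.
    If several symbols share a demangled name, the last one wins here.
    """
    pdb = {s["name"]: int(s["address"], 16) for s in symbols}
    corrected, unmatched = [], []
    for name, legacy_addr in entries:
        if name in pdb:
            delta = pdb[name] - legacy_addr
            corrected.append((name, pdb[name], legacy_addr, delta))
        else:
            unmatched.append((name, legacy_addr))
    return corrected, unmatched

symbols = [
    {"address": "0x00594570", "name": "CEnchantmentRegistry::EnchantAttribute"},
]
entries = [
    ("CEnchantmentRegistry::EnchantAttribute", 0x00593970),  # made-up legacy address
    ("Hypothetical::Inlined", 0x00401000),                   # no PDB record
]
corrected, unmatched = correct_by_name(entries, symbols)
for name, pdb_addr, legacy, delta in corrected:
    print(f"{name}: PDB 0x{pdb_addr:08X}, was 0x{legacy:08X}, delta 0x{delta:X}")
print("unmatched:", [name for name, _ in unmatched])
```

Keeping the legacy address next to the PDB address leaves old code comments traceable, and the per-entry delta makes it easy to spot entries whose offset falls outside the expected range.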
Implementation notes
The script is a self-contained MSF 7.00 reader. References used:
- LLVM PDB documentation (https://llvm.org/docs/PDB/): file format spec
- Microsoft pdbparse (community): implementation cross-check
Streams consumed:
- 3 (DBI): parses the header to extract the symbol-record stream index + the optional debug-header sub-stream's section-headers index.
- 9 (section headers): parses IMAGE_SECTION_HEADER entries to build a section-base table for VA computation.
- 8 (sym record stream): iterates records, picks S_PUB32 with the PUBSYM_FLAG_CODE bit set, computes VA = section_base + offset.
- 2 (TPI): iterates type records, picks LF_CLASS / LF_STRUCTURE that aren't forward-declared, parses size leaf + name.
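The section-header and symbol-record steps can be sketched roughly as follows. This is not the code in pdb_extract.py: the function names are hypothetical, the flag value 0x1 for the code bit follows the public CodeView definitions, and the 0x00400000 image base is an assumption (the PDB streams give section RVAs, not the image base). The demo bytes are synthetic.

```python
import struct

S_PUB32 = 0x110E          # CodeView public-symbol record kind
PUBSYM_FLAG_CODE = 0x1    # low bit of the flags field marks code symbols
IMAGE_BASE = 0x00400000   # ASSUMED default image base for a 32-bit EXE

def section_bases(section_headers):
    """Return the base VA of each 40-byte IMAGE_SECTION_HEADER."""
    bases = []
    for off in range(0, len(section_headers) - 39, 40):
        # Layout: Name[8], VirtualSize, VirtualAddress, then fields ignored here.
        _name, _vsize, vaddr = struct.unpack_from("<8sII", section_headers, off)
        bases.append(IMAGE_BASE + vaddr)
    return bases

def iter_public_code_symbols(records, bases):
    """Yield (va, mangled_name) for each S_PUB32 record flagged as code."""
    pos = 0
    while pos + 4 <= len(records):
        rec_len, rec_kind = struct.unpack_from("<HH", records, pos)
        body = pos + 4
        if rec_kind == S_PUB32:
            flags, offset, segment = struct.unpack_from("<IIH", records, body)
            name_end = records.index(b"\0", body + 10)
            name = records[body + 10:name_end].decode("ascii", "replace")
            # Segment numbers are 1-based indexes into the section table.
            if flags & PUBSYM_FLAG_CODE and 1 <= segment <= len(bases):
                yield bases[segment - 1] + offset, name
        pos += 2 + rec_len   # rec_len does not count its own two bytes

# Synthetic demo: one .text section at RVA 0x1000 and one code symbol.
text_header = struct.pack("<8sIIIIIIHHI", b".text", 0x1000, 0x1000,
                          0, 0, 0, 0, 0, 0, 0)
body = (struct.pack("<IIH", PUBSYM_FLAG_CODE, 0x193570, 1)
        + b"?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z\0")
record = struct.pack("<HH", 2 + len(body), S_PUB32) + body

for va, name in iter_public_code_symbols(record, section_bases(text_header)):
    print(f"0x{va:08X} {name}")
```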
The MSVC name demangler (_demangle) is best-effort: handles the
common ?Method@Class@Outer@@<sig> patterns, constructors (??0),
and destructors (??1). Returns the mangled string unchanged for
operator overloads (??2, ??3), vtables (??_), and other forms
where a partial demangle would be misleading. Both name (demangled)
and mangled (raw) are emitted in symbols.json so consumers can
choose.
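The best-effort behaviour described above can be illustrated with a simplified sketch. It is not the actual _demangle implementation: the function name, the bail-out rules for templates, and the sample constructor/destructor symbols are assumptions chosen to match the description.

```python
def demangle(mangled):
    """Best-effort: return Outer::Class::Method, or the input unchanged."""
    if not mangled.startswith("?"):
        return mangled
    # Everything before the first "@@" is the qualified name; the rest is
    # the signature, which this sketch ignores.
    head, sep, _sig = mangled.partition("@@")
    if not sep:
        return mangled
    special = None
    if head.startswith("??"):
        code = head[2:3]
        if code not in ("0", "1"):
            return mangled              # operators, vtables, other special forms
        special = code
        parts = head[3:].split("@")
    else:
        parts = head[1:].split("@")
    if not all(parts) or any("?" in p or "$" in p for p in parts):
        return mangled                  # templates and similar: bail out
    scope = list(reversed(parts))       # mangled order is innermost-first
    if special == "0":
        scope.append(scope[-1])         # constructor: Class::Class
    elif special == "1":
        scope.append("~" + scope[-1])   # destructor: Class::~Class
    return "::".join(scope)

print(demangle("?EnchantAttribute@CEnchantmentRegistry@@QBEHKAAK@Z"))
print(demangle("??0Foo@@QAE@XZ"))
print(demangle("??2@YAPAXI@Z"))
```

Returning the input unchanged for anything unrecognised keeps the name field honest: a consumer can tell a symbol was not demangled because name and mangled are identical.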
When to regenerate
- Whenever refs/acclient.pdb is updated (rare).
- Whenever pdb_extract.py is changed (e.g. better demangler, more type info recovery).
The output JSONs are committed because they're stable + small (~3 MB combined), and grepping them is faster than re-parsing the PDB on every session.
Future work (out of scope here)
The current types.json only carries name + size + kind. A more ambitious
version would walk LF_FIELDLIST records to recover field names +
offsets + types — giving us a JSON-encoded acclient.h. Not done yet
because acclient.h already exists committed at
docs/research/named-retail/acclient.h. Consider this if a future
panel needs offsetof() at runtime.