Fresh-session prompt: rewrite the Overlord backend in Go (parallel run)
Paste everything below the line into a new Claude Code session started in
C:\Users\erikn\source\repos\dereth-workspace.
You are starting a side project: rewrite the MosswartOverlord backend (currently Python/FastAPI) in Go, and deploy it in parallel with the live Python service so we can compare them on identical real traffic before cutting over. This is a strangler-fig migration, not a big-bang rewrite — the live Python service must keep running untouched the entire time.
Read these first (do not skip)
- C:\Users\erikn\source\repos\dereth-workspace\CLAUDE.md: cross-repo overview, WebSocket event families, deploy, nginx, SSH.
- MosswartOverlord\CLAUDE.md: backend specifics (components, WS endpoints + auth, DB, route conventions, deploy).
- MosswartOverlord\README.md: HTTP API reference and architecture.
- MosswartOverlord\main.py (~4200 lines): the de-facto spec. The Pydantic models in it ARE the WebSocket payload schema. db_async.py is the DB schema (there are no alembic migrations; schema lives in code + idempotent DDL in init_db_async).
- MosswartOverlord\nginx\overlord.conf: reverse-proxy layout.
What the system is (one paragraph)
"Dereth Tracker" ingests real-time telemetry from ~70 Asheron's Call game clients (a C# DECAL plugin, MosswartMassacre) over a WebSocket, persists to PostgreSQL/TimescaleDB, and serves a React dashboard (live map, player sidebar, stats, inventory search). A separate inventory-service (FastAPI + its own Postgres) handles item data. There's also a Discord rare bot and a host-side overlord-agent (shells out to claude — leave that alone).
Why Go (the actual motivation — don't lose sight of it)
The Python service runs a single uvicorn worker / single asyncio event loop, so it's capped at one CPU core and can't use the host's other cores (in-memory state — plugin connections, live snapshots — prevents multi-worker). Under load it saturated that core (telemetry processing lagged, the dashboard flickered). Go's value here is true multicore concurrency (goroutines + shared state via sync/channels) plus ~10–50× cheaper per-message work. The win is the concurrency model, not raw speed — this is an I/O-bound service, so design for correctness and parallelism, not micro-optimization.
Scope
In scope (rewrite in Go):
- The main tracker (main.py): WebSocket ingest (/ws/position), browser WebSocket (/ws/live), the HTTP read API (/live, /trails, /stats/*, /total-rares, /total-kills, /character-stats/*, /quest-status, etc.), the 5s /live cache loop, persistence to TimescaleDB, and serving the React static bundle.
- Optionally later: inventory-service and discord-rare-monitor (both smaller, good follow-ups).
Out of scope (keep as-is):
- The React frontend (frontend/): it stays; Go just serves the same built static/ bundle and implements the same API/WS contract. No frontend changes should be needed if the contract matches.
- The overlord-agent (host-side, shells to claude): leave in Python.
- The DECAL plugin: do NOT change it. Go must speak the existing wire protocol.
- The databases: Go uses the same PostgreSQL/TimescaleDB.
The parallel-run plan (this is the core of the project)
Run Go as a new container in the same docker-compose stack, on a new loopback port (e.g. 127.0.0.1:8770), reachable via a separate nginx path (e.g. https://overlord.snakedesert.se/go/) so it's testable side-by-side with the live Python app. Phases:
Phase 0 — scaffold. New Go module in a new directory (suggest MosswartOverlord/go-tracker/ or a sibling repo MosswartOverlord-go/). Dockerfile, compose service dereth-tracker-go (loopback-bound), nginx location /go/. Health endpoint. Deploy it doing nothing useful yet, confirm the plumbing.
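As a starting point for the Phase 0 health endpoint, here is a minimal sketch using only the standard library. The /healthz path and the {"status":"ok"} body are assumptions made for illustration; nothing in the existing contract defines them. The handler is exercised in-process with httptest so the example runs without binding a port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newMux returns a router that exposes only a liveness endpoint.
// ASSUMPTION: the /healthz path and the response shape are illustrative,
// not part of the existing Python contract.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	})
	return mux
}

// probe calls a handler in-process and returns the status code and body.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe(newMux(), "/healthz")
	fmt.Printf("%d %s", code, body)
	// In the deployed container this would instead be something like
	// http.ListenAndServe(":8770", newMux()), with docker-compose
	// publishing the port on 127.0.0.1 only.
}
```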
Phase 1 — read-side parity (zero risk, do this first). Go connects read-only to the existing dereth TimescaleDB and reimplements the HTTP read API + serves the React bundle. Then compare Go vs Python on identical data: hit https://.../live (Python) and https://.../go/live (Go) and diff the JSON. They should match (semantically). This validates the read/serve half — which is most of the user-facing behavior — without touching ingest. Build a small comparison script and iterate until they match.
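One way the Phase 1 comparison could work is sketched below: decode both responses and walk them together, reporting the JSON paths where they differ. Object key order and number formatting (1 vs 1.0) are ignored, while array order is treated as significant; whether that is the right notion of "semantically equal" for /live is an assumption to check against real output. The function names and the sample payload fields are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// diffJSON decodes two JSON documents and returns the paths at which they
// differ. An empty result means the documents match under these rules.
func diffJSON(pyBody, goBody []byte) ([]string, error) {
	var py, g any
	if err := json.Unmarshal(pyBody, &py); err != nil {
		return nil, fmt.Errorf("python body: %w", err)
	}
	if err := json.Unmarshal(goBody, &g); err != nil {
		return nil, fmt.Errorf("go body: %w", err)
	}
	var diffs []string
	walk("$", py, g, &diffs)
	sort.Strings(diffs) // map iteration is unordered; sort for stable reports
	return diffs, nil
}

// walk compares two decoded values recursively and appends differences.
func walk(path string, a, b any, diffs *[]string) {
	switch av := a.(type) {
	case map[string]any:
		bv, ok := b.(map[string]any)
		if !ok {
			*diffs = append(*diffs, path+": type mismatch")
			return
		}
		for k, x := range av {
			y, present := bv[k]
			if !present {
				*diffs = append(*diffs, path+"."+k+": missing in go")
				continue
			}
			walk(path+"."+k, x, y, diffs)
		}
		for k := range bv {
			if _, present := av[k]; !present {
				*diffs = append(*diffs, path+"."+k+": missing in python")
			}
		}
	case []any:
		bv, ok := b.([]any)
		if !ok {
			*diffs = append(*diffs, path+": type mismatch")
			return
		}
		if len(av) != len(bv) {
			*diffs = append(*diffs, fmt.Sprintf("%s: length %d vs %d", path, len(av), len(bv)))
			return
		}
		for i := range av {
			walk(fmt.Sprintf("%s[%d]", path, i), av[i], bv[i], diffs)
		}
	default: // string, float64, bool, or nil
		if !reflect.DeepEqual(a, b) {
			*diffs = append(*diffs, fmt.Sprintf("%s: %v vs %v", path, a, b))
		}
	}
}

func main() {
	// Made-up sample payloads; the real field names come from main.py.
	py := []byte(`{"players":[{"name":"A","kills":3}]}`)
	g := []byte(`{"players":[{"kills":3.0,"name":"A"}]}`)
	diffs, err := diffJSON(py, g)
	fmt.Println(len(diffs), err)
}
```

In practice the two byte slices would come from fetching /live and /go/live at close to the same moment; fields that legitimately change between the two requests would need to be excluded or tolerated.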
Phase 2 — ingest in shadow. Implement the plugin WebSocket ingest (/ws/position) and browser WS (/ws/live) in Go. To test ingest in parallel without stealing plugin connections or double-writing the live tables: have the Python tracker tee a copy of every received plugin message to the Go service (a small, low-risk addition to main.py — forward each raw message to Go over an internal channel/HTTP/WS), and have Go write to its own separate schema or database (e.g. a dereth_go DB) so you can compare ingest results against Python's without conflicts. Compare row counts, latencies, and /live outputs.
Phase 3 — the rest. Commands (browser→plugin envelopes), inventory forwarding to inventory-service, share_*, dungeon_map, combat_stats accumulation, Discord death/idle webhook, etc.
Phase 4 — cutover. Once Go matches Python on real traffic for long enough, flip nginx to route the real paths to Go, point the plugin endpoint at Go, retire the Python container. Keep Python deployable for rollback.
Contract & correctness facts you MUST preserve (learned the hard way)
- Wire format: snake_case JSON, exact field names, events routed by a type field, ISO8601 UTC timestamps. The Pydantic models in main.py are the schema. Match them exactly or the plugin/frontend break.
- /live "online" window MUST use the SERVER receive-time, not the client timestamp. Game machines' clocks drift up to ~90s apart; telemetry carries the client's DateTime.UtcNow. Python recently added a telemetry_events.received_at (server-stamped) column and windows "online" on COALESCE(received_at, timestamp) > now()-30s. Go must stamp its own server receive-time and window on that, or the player count flaps. (See the June 2026 fix; ACTIVE_WINDOW = 30s.)
- Inventory deltas are a firehose: the plugin debounces "update" events to a 2–5 min randomized flush, but adds/removes are immediate, and forwards still arrive bursty. Python caps concurrent forwards to inventory-service with a semaphore(8) + a bounded httpx client. Go must similarly bound concurrency so an ingest burst can't starve telemetry.
- Auth: browser endpoints use a session cookie signed with itsdangerous URLSafeTimedSerializer(SECRET_KEY) (HMAC, 30-day expiry). If Go reuses the same SECRET_KEY and replicates the format, the same login works on both during the parallel run, so do that. Plugin /ws/position auth is an X-Plugin-Secret header vs SHARED_SECRET (env). ⚠ Currently the live deploy runs with SHARED_SECRET_LEGACY=your_shared_secret accepted (a migration escape hatch). Don't be surprised by the placeholder; read MosswartOverlord/CLAUDE.md "Integration contract".
- Internal-trust rule: Python treats a request as internal (skips cookie auth) only if it comes from a private source IP and has no X-Forwarded-For (nginx adds XFF to all proxied traffic). Preserve these semantics; never trust the raw 172.x range.
- DB: telemetry_events and spawn_events are TimescaleDB hypertables (partitioned on timestamp) with retention policies. There are NO migrations; schema is created in db_async.init_db_async. Read it for the exact tables/columns/indexes. Don't break the hypertable partition key (timestamp): keep writing timestamp (client) for partitioning AND received_at (server) for the window.
- Deploy reality: main.py / db_async.py / static/ are bind-mounted into the Python container (restart applies changes). The full-rebuild flow bakes a BUILD_VERSION for the UI version stamp. Postgres ports are bound to loopback; DB ports are NOT public. SSH: erik@overlord.snakedesert.se (key-based). Read-only DB: docker exec dereth-db psql -U postgres -d dereth.
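Two of the facts above, the server-time "online" window and the internal-trust rule, reduce to small pure functions. The sketch below is one hedged way to express them in Go. The sample struct, the function names, and the use of a zero time.Time to stand in for a NULL received_at are assumptions for illustration; how Python treats loopback addresses is not stated here, so this version accepts only addresses that net/netip classifies as private, and that should be checked against main.py.

```go
package main

import (
	"fmt"
	"net/netip"
	"time"
)

// activeWindow mirrors ACTIVE_WINDOW = 30s from the Python service.
const activeWindow = 30 * time.Second

// sample is a hypothetical in-memory view of one telemetry row.
type sample struct {
	Timestamp  time.Time // client clock (DateTime.UtcNow on the game machine)
	ReceivedAt time.Time // server clock; the zero value stands in for SQL NULL
}

// isOnline mirrors COALESCE(received_at, timestamp) > now() - 30s.
func isOnline(s sample, now time.Time) bool {
	effective := s.ReceivedAt
	if effective.IsZero() {
		effective = s.Timestamp // legacy rows without a server stamp
	}
	return effective.After(now.Add(-activeWindow))
}

// isInternal reports whether a request may skip cookie auth: the peer
// address must be private AND no X-Forwarded-For header may be present.
// A private address alone is never enough.
func isInternal(remoteAddr string, hasXFF bool) bool {
	if hasXFF {
		return false
	}
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return false // unparseable peer address: treat as external
	}
	return ap.Addr().Unmap().IsPrivate()
}

// exampleNow is a fixed instant so the demo is deterministic.
var exampleNow = time.Date(2026, time.June, 1, 12, 0, 0, 0, time.UTC)

func main() {
	// Client clock is 90s behind, but the server saw the message 5s ago.
	drifted := sample{
		Timestamp:  exampleNow.Add(-90 * time.Second),
		ReceivedAt: exampleNow.Add(-5 * time.Second),
	}
	fmt.Println(isOnline(drifted, exampleNow))
	fmt.Println(isInternal("172.18.0.5:40000", false))
	fmt.Println(isInternal("172.18.0.5:40000", true))
}
```

In an HTTP handler the second function would be called along the lines of isInternal(r.RemoteAddr, len(r.Header.Values("X-Forwarded-For")) > 0).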
Suggested Go stack (decide for yourself, but these fit)
- HTTP: stdlib net/http + go-chi/chi router. WebSocket: coder/websocket (formerly nhooyr) or gorilla/websocket.
- Postgres/TimescaleDB: jackc/pgx v5 + pgxpool. JSON: stdlib encoding/json (fine) or goccy/go-json if profiling says so. Logging: stdlib log/slog. Config: env vars matching the Python service.
- Concurrency: shared in-memory state (live snapshots, plugin connections) behind sync.RWMutex or sharded maps; per-connection goroutines; bounded worker pools (golang.org/x/sync/semaphore or buffered channels) for inventory forwarding.
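The bounded inventory forwarding mentioned above can be sketched with a buffered channel used as a counting semaphore, which keeps the example on the standard library. The forwarder type, its send callback (a stand-in for the HTTP call to inventory-service), and the peakInFlight helper are all hypothetical names; the limit of 8 simply mirrors Python's semaphore(8).

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// forwarder caps how many inventory forwards run at once.
// ASSUMPTION: send stands in for the real call to inventory-service.
type forwarder struct {
	slots chan struct{}
	send  func(msg []byte) error
}

func newForwarder(limit int, send func(msg []byte) error) *forwarder {
	return &forwarder{slots: make(chan struct{}, limit), send: send}
}

// forward waits for a free slot (or for ctx to end), then sends.
func (f *forwarder) forward(ctx context.Context, msg []byte) error {
	select {
	case f.slots <- struct{}{}: // acquired a slot
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-f.slots }() // release the slot
	return f.send(msg)
}

// peakInFlight fires jobs concurrent forwards through a forwarder with the
// given limit and returns the highest number of sends observed in flight.
func peakInFlight(limit, jobs int) int {
	var mu sync.Mutex
	inFlight, peak := 0, 0
	f := newForwarder(limit, func([]byte) error {
		mu.Lock()
		inFlight++
		if inFlight > peak {
			peak = inFlight
		}
		mu.Unlock()
		time.Sleep(time.Millisecond) // simulate a slow downstream call
		mu.Lock()
		inFlight--
		mu.Unlock()
		return nil
	})
	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = f.forward(context.Background(), []byte("delta"))
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak in flight:", peakInFlight(8, 50))
}
```

This bounds concurrent sends but not the number of goroutines waiting for a slot; a real service would probably also want a bounded queue or a per-call timeout via ctx so a long burst cannot pile up indefinitely.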
How to work
- Evidence-driven and parallel-safe. Never disrupt the live Python service. Before claiming parity, diff the actual outputs against Python on real data and show the comparison.
- Commit frequently. Keep the Go service in its own directory/repo. Don't touch main.py except the tiny Phase-2 tee (and even that, behind a flag).
- Start by reading the docs above and main.py, then deliver Phase 0 + Phase 1 (a Go service deployed at /go/ that serves a /go/live matching Python's /live). Report the comparison.
- Ask the user before: changing the live Python service, repointing the plugin endpoint, or any cutover step.
First, read the listed files and main.py, then propose your Phase 0/1 plan (Go module layout, the compose + nginx additions, and how you'll compare /go/live to /live) before writing code.