MosswartOverlord/docs/go-rewrite-prompt.md
Erik b8fd449d62 docs(go-prompt): inventory-service + discord bot are also Go rewrite targets
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-24 08:51:34 +02:00

Fresh-session prompt: rewrite the Overlord backend in Go (parallel run)

Paste everything below the line into a new Claude Code session started in C:\Users\erikn\source\repos\dereth-workspace.

---

You are starting a side project: rewrite the MosswartOverlord backend (currently Python/FastAPI) in Go, and deploy it in parallel with the live Python service so we can compare them on identical real traffic before cutting over. This is a strangler-fig migration, not a big-bang rewrite — the live Python service must keep running untouched the entire time.

Read these first (do not skip)

  • C:\Users\erikn\source\repos\dereth-workspace\CLAUDE.md — cross-repo overview, WebSocket event families, deploy, nginx, SSH.
  • MosswartOverlord\CLAUDE.md — backend specifics: components, WS endpoints + auth, DB, route conventions, deploy.
  • MosswartOverlord\README.md — HTTP API reference and architecture.
  • MosswartOverlord\main.py (~4200 lines) — the de-facto spec. The Pydantic models in it ARE the WebSocket payload schema. db_async.py is the DB schema (there are no alembic migrations; schema lives in code + idempotent DDL in init_db_async).
  • MosswartOverlord\nginx\overlord.conf — reverse-proxy layout.

What the system is (one paragraph)

"Dereth Tracker" ingests real-time telemetry from ~70 Asheron's Call game clients (a C# DECAL plugin, MosswartMassacre) over a WebSocket, persists to PostgreSQL/TimescaleDB, and serves a React dashboard (live map, player sidebar, stats, inventory search). A separate inventory-service (FastAPI + its own Postgres) handles item data. There's also a Discord rare bot and a host-side overlord-agent (shells out to claude — leave that alone).

Why Go (the actual motivation — don't lose sight of it)

The Python service runs a single uvicorn worker on a single asyncio event loop, so it is capped at one CPU core and cannot use the host's other cores (in-memory state such as plugin connections and live snapshots prevents running multiple workers). Under load it saturated that core: telemetry processing lagged and the dashboard flickered. Go's value here is true multicore concurrency (goroutines + shared state via sync/channels) plus ~10-50× cheaper per-message work. The win is the concurrency model, not raw speed. This is an I/O-bound service, so design for correctness and parallelism, not micro-optimization.

Scope

In scope — rewrite in Go. Three separate services, each independently deployable and parallel-testable:

  1. discord-rare-monitor (discord-rare-monitor/) — do this FIRST as the Go warm-up; it's the smallest and most isolated. A Discord bot that connects to the tracker's /ws/live (subscribes to rare/chat), classifies rares (the ~71-name common-rares list → common vs great channel), posts embeds to Discord, and relays allegiance chat. In Go: a coder/websocket client + bwmarrin/discordgo. Parallel test: run the Go copy against the same /ws/live but pointed at a TEST Discord channel (so it doesn't double-post to the real ones), and compare its output to the Python bot's.

  2. inventory-service (inventory-service/) — a separate FastAPI app with its own Postgres (inventory_db, container inventory-db, port 5433). Receives inventory payloads over HTTP from the tracker (POST /inventory/{char}/item, /process-inventory), does item enum translation (comprehensive_enum_database_v2.json) + DB writes, and serves item search + the suitbuilder constraint solver (suitbuilder.py — the heaviest piece; port carefully and validate against the Python solver's results). In Go: net/http + pgx. Parallel test: Go copy on a separate port with its own DB (or read-only against the same one); have the tracker tee inventory forwards to it; diff outputs.

  3. Main tracker (main.py) — the big one, do last: WS ingest /ws/position, browser WS /ws/live, the HTTP read API (/live, /trails, /stats/*, /total-rares, /total-kills, /character-stats/*, /quest-status, …), the 5s /live cache loop, persistence to TimescaleDB, and serving the React static/ bundle. Follow the phased parallel-run plan below.

Suggested order: (1) discord bot → (2) tracker read-side (Phase 1 below) → (3) inventory-service → (4) tracker ingest + cutover. The three services can also progress somewhat independently.
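For the Discord bot warm-up, the common-vs-great routing is a simple set lookup. The sketch below shows one way it might look in Go. The `commonRares` entry is a placeholder, not a real name from the ~71-name list, and the case-insensitive, whitespace-trimmed matching is an assumption that should be checked against what the Python bot actually does.

```go
package main

import (
	"fmt"
	"strings"
)

// commonRares stands in for the ~71-name common-rares list kept by the
// Python bot. The single entry here is a hypothetical placeholder.
var commonRares = map[string]struct{}{
	"example common rare": {},
}

// classifyRare returns "common" when the name is on the common list and
// "great" otherwise. Lowercasing and trimming are assumptions; mirror the
// Python bot's exact matching rules before relying on this.
func classifyRare(name string) string {
	key := strings.ToLower(strings.TrimSpace(name))
	if _, ok := commonRares[key]; ok {
		return "common"
	}
	return "great"
}

func main() {
	fmt.Println(classifyRare("Example Common Rare"))
	fmt.Println(classifyRare("Some Unlisted Rare"))
}
```

During the parallel test, the returned label would select between two TEST channel IDs rather than the real ones.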

Out of scope (keep as-is):

  • The React frontend (frontend/) — it stays; the Go tracker serves the same built static/ bundle and implements the same API/WS contract. No frontend changes should be needed if the contract matches.
  • The overlord-agent (host-side, shells to claude) — leave in Python.
  • The DECAL plugin — do NOT change it. Go must speak the existing wire protocol.
  • The databases themselves — Go reuses the same PostgreSQL/TimescaleDB and inventory Postgres.

The parallel-run plan (this is the core of the project)

Run Go as a new container in the same docker-compose stack, on a new loopback port (e.g. 127.0.0.1:8770), reachable via a separate nginx path (e.g. https://overlord.snakedesert.se/go/) so it's testable side-by-side with the live Python app. Phases:

Phase 0 — scaffold. New Go module in a new directory (suggest MosswartOverlord/go-tracker/ or a sibling repo MosswartOverlord-go/). Dockerfile, compose service dereth-tracker-go (loopback-bound), nginx location /go/. Health endpoint. Deploy it doing nothing useful yet, confirm the plumbing.

Phase 1 — read-side parity (zero risk, do this first). Go connects read-only to the existing dereth TimescaleDB and reimplements the HTTP read API + serves the React bundle. Then compare Go vs Python on identical data: hit https://.../live (Python) and https://.../go/live (Go) and diff the JSON. They should match (semantically). This validates the read/serve half — which is most of the user-facing behavior — without touching ingest. Build a small comparison script and iterate until they match.
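The "match (semantically)" check can be sketched as decoding both responses into generic values and comparing those, so key order and whitespace are ignored. This is a minimal sketch, not the required comparison script: the function name `semanticallyEqual` is made up for illustration, numbers are compared as float64 (so `1` and `1.0` count as equal), and array order still matters, which may or may not be what you want for the player list.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// semanticallyEqual reports whether two JSON documents decode to the same
// value. Object key order and whitespace are ignored; array order is not.
func semanticallyEqual(a, b []byte) (bool, error) {
	var va, vb any
	if err := json.Unmarshal(a, &va); err != nil {
		return false, fmt.Errorf("first document: %w", err)
	}
	if err := json.Unmarshal(b, &vb); err != nil {
		return false, fmt.Errorf("second document: %w", err)
	}
	// Decoded values are built only from map[string]any, []any, float64,
	// string, bool and nil, all of which reflect.DeepEqual handles.
	return reflect.DeepEqual(va, vb), nil
}

func main() {
	// Stand-ins for the bodies fetched from /live and /go/live.
	python := []byte(`{"a": 1, "b": [1, 2]}`)
	golang := []byte(`{"b":[1,2],"a":1.0}`)

	same, err := semanticallyEqual(python, golang)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("match:", same)
}
```

A real script would fetch both URLs close together in time, since /live changes every few seconds, and would likely report which keys differ rather than a single boolean.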

Phase 2 — ingest in shadow. Implement the plugin WebSocket ingest (/ws/position) and browser WS (/ws/live) in Go. To test ingest in parallel without stealing plugin connections or double-writing the live tables: have the Python tracker tee a copy of every received plugin message to the Go service (a small, low-risk addition to main.py — forward each raw message to Go over an internal channel/HTTP/WS), and have Go write to its own separate schema or database (e.g. a dereth_go DB) so you can compare ingest results against Python's without conflicts. Compare row counts, latencies, and /live outputs.

Phase 3 — the rest. Commands (browser→plugin envelopes), inventory forwarding to inventory-service, share_*, dungeon_map, combat_stats accumulation, Discord death/idle webhook, etc.

Phase 4 — cutover. Once Go matches Python on real traffic for long enough, flip nginx to route the real paths to Go, point the plugin endpoint at Go, retire the Python container. Keep Python deployable for rollback.

Contract & correctness facts you MUST preserve (learned the hard way)

  • Wire format: snake_case JSON, exact field names, events routed by a type field, ISO8601 UTC timestamps. The Pydantic models in main.py are the schema. Match them exactly or the plugin/frontend break.
  • /live "online" window MUST use the SERVER receive-time, not the client timestamp. Game machines' clocks drift up to ~90s apart; telemetry carries the client's DateTime.UtcNow. Python recently added a telemetry_events.received_at (server-stamped) column and windows "online" on COALESCE(received_at, timestamp) > now()-30s. Go must stamp its own server receive-time and window on that, or the player count flaps. (See the June 2026 fix; ACTIVE_WINDOW = 30s.)
  • Inventory deltas are a firehose: the plugin debounces "update" events to a 2-5 min randomized flush, but adds/removes are immediate, and forwards still arrive in bursts. Python caps concurrent forwards to inventory-service with a semaphore(8) + a bounded httpx client. Go must similarly bound concurrency so an ingest burst can't starve telemetry.
  • Auth: browser endpoints use a session cookie signed with itsdangerous URLSafeTimedSerializer(SECRET_KEY) (HMAC, 30-day expiry). If Go reuses the same SECRET_KEY and replicates the format, the same login works on both during the parallel run — do that. Plugin /ws/position auth is an X-Plugin-Secret header vs SHARED_SECRET (env). ⚠ Currently the live deploy runs with SHARED_SECRET_LEGACY=your_shared_secret accepted (a migration escape hatch) — don't be surprised by the placeholder; read MosswartOverlord/CLAUDE.md "Integration contract".
  • Internal-trust rule: Python treats a request as internal (skips cookie auth) only if it comes from a private source IP and has no X-Forwarded-For (nginx adds XFF to all proxied traffic). Preserve these semantics; never trust the raw 172.x range.
  • DB: telemetry_events and spawn_events are TimescaleDB hypertables (partitioned on timestamp) with retention policies. There are NO migrations — schema is created in db_async.init_db_async. Read it for the exact tables/columns/indexes. Don't break the hypertable partition key (timestamp) — keep writing timestamp (client) for partitioning AND received_at (server) for the window.
  • Deploy reality: main.py/db_async.py/static/ are bind-mounted into the Python container (restart applies changes). The full-rebuild flow bakes a BUILD_VERSION for the UI version stamp. Postgres ports are bound to loopback; DB ports are NOT public. SSH: erik@overlord.snakedesert.se (key-based). Read-only DB: docker exec dereth-db psql -U postgres -d dereth.

Suggested Go stack (decide for yourself, but these fit)

  • HTTP: stdlib net/http + go-chi/chi router. WebSocket: coder/websocket (formerly nhooyr.io/websocket) or gorilla/websocket.
  • Postgres/TimescaleDB: jackc/pgx v5 + pgxpool. JSON: stdlib encoding/json (fine) or goccy/go-json if profiling says so. Logging: stdlib log/slog. Config: env vars matching the Python service.
  • Concurrency: shared in-memory state (live snapshots, plugin connections) behind sync.RWMutex or sharded maps; per-connection goroutines; bounded worker pools (golang.org/x/sync/semaphore or buffered channels) for inventory forwarding.

How to work

  • Evidence-driven and parallel-safe. Never disrupt the live Python service. Before claiming parity, diff the actual outputs against Python on real data and show the comparison.
  • Commit frequently. Keep the Go service in its own directory/repo. Don't touch main.py except the tiny Phase-2 tee (and even that, behind a flag).
  • Start by reading the docs above and main.py, then deliver Phase 0 + Phase 1 (a Go service deployed at /go/ that serves a /go/live matching Python's /live). Report the comparison.
  • Ask the user before: changing the live Python service, repointing the plugin endpoint, or any cutover step.

First, read the listed files and main.py, then propose your Phase 0/1 plan (Go module layout, the compose + nginx additions, and how you'll compare /go/live to /live) before writing code.