# Fresh-session prompt: rewrite the Overlord backend in Go (parallel run)
Paste everything below the line into a new Claude Code session started in
`C:\Users\erikn\source\repos\dereth-workspace`.

---
You are starting a side project: **rewrite the MosswartOverlord backend (currently Python/FastAPI) in Go**, and **deploy it in parallel with the live Python service** so we can compare them on identical real traffic before cutting over. This is a strangler-fig migration, not a big-bang rewrite — the live Python service must keep running untouched the entire time.
## Read these first (do not skip)
- `C:\Users\erikn\source\repos\dereth-workspace\CLAUDE.md` — cross-repo overview, WebSocket event families, deploy, nginx, SSH.
- `MosswartOverlord\CLAUDE.md` — backend specifics: components, WS endpoints + auth, DB, route conventions, deploy.
- `MosswartOverlord\README.md` — HTTP API reference and architecture.
- `MosswartOverlord\main.py` (~4200 lines) — the de-facto spec. The Pydantic models in it ARE the WebSocket payload schema. `db_async.py` is the DB schema (there are no alembic migrations; schema lives in code + idempotent DDL in `init_db_async`).
- `MosswartOverlord\nginx\overlord.conf` — reverse-proxy layout.
## What the system is (one paragraph)
"Dereth Tracker" ingests real-time telemetry from ~70 Asheron's Call game clients (a C# DECAL plugin, `MosswartMassacre`) over a WebSocket, persists to PostgreSQL/TimescaleDB, and serves a React dashboard (live map, player sidebar, stats, inventory search). A separate `inventory-service` (FastAPI + its own Postgres) handles item data. There's also a Discord rare bot and a host-side `overlord-agent` (shells out to `claude` — leave that alone).
## Why Go (the actual motivation — don't lose sight of it)
The Python service runs a **single uvicorn worker / single asyncio event loop**, so it is capped at one CPU core; its in-memory state (plugin connections, live snapshots) rules out running multiple workers. Under load that core saturated: telemetry processing lagged and the dashboard flickered. Go's value here is **true multicore concurrency** (goroutines + shared state via `sync`/channels) plus ~10–50× cheaper per-message work. The win is the concurrency model, not raw speed: the service is mostly I/O-bound, so design for correctness and parallelism, not micro-optimization.
## Scope
**In scope — rewrite in Go. Three separate services, each independently deployable and parallel-testable:**
1. **discord-rare-monitor** (`discord-rare-monitor/`) — **do this FIRST as the Go warm-up; it's the smallest and most isolated.** A Discord bot that connects to the tracker's `/ws/live` (subscribes to `rare`/`chat`), classifies rares (the ~71-name common-rares list → common vs great channel), posts embeds to Discord, and relays allegiance chat. In Go: a `coder/websocket` client + `bwmarrin/discordgo`. Parallel test: run the Go copy against the same `/ws/live` but pointed at a **TEST Discord channel** (so it doesn't double-post to the real ones), and compare its output to the Python bot's.
2. **inventory-service** (`inventory-service/`) — a separate FastAPI app with its **own Postgres** (`inventory_db`, container `inventory-db`, port 5433). Receives inventory payloads over HTTP from the tracker (`POST /inventory/{char}/item`, `/process-inventory`), does item **enum translation** (`comprehensive_enum_database_v2.json`) + DB writes, and serves item search + the **suitbuilder constraint solver** (`suitbuilder.py` — the heaviest piece; port carefully and validate against the Python solver's results). In Go: `net/http` + `pgx`. Parallel test: Go copy on a separate port with its own DB (or read-only against the same one); have the tracker tee inventory forwards to it; diff outputs.
3. **Main tracker** (`main.py`), the big one, do last: WS ingest `/ws/position`, browser WS `/ws/live`, the HTTP read API (`/live`, `/trails`, `/stats/*`, `/total-rares`, `/total-kills`, `/character-stats/*`, `/quest-status`, …), the 5s `/live` cache loop, persistence to TimescaleDB, and serving the React `static/` bundle. Follow the phased parallel-run plan below.

**Suggested order:** (1) discord bot → (2) tracker read-side (Phase 1 below) → (3) inventory-service → (4) tracker ingest + cutover. The three services can also progress somewhat independently.

**Out of scope (keep as-is):**
- The **React frontend** (`frontend/`) — it stays; the Go tracker serves the same built `static/` bundle and implements the same API/WS contract. No frontend changes should be needed if the contract matches.
- The **overlord-agent** (host-side, shells to `claude`) — leave in Python.
- The **DECAL plugin** — do NOT change it. Go must speak the existing wire protocol.
- The **databases themselves** — Go reuses the same PostgreSQL/TimescaleDB and inventory Postgres.
## The parallel-run plan (this is the core of the project)
Run Go as a **new container in the same docker-compose stack**, on a new loopback port (e.g. `127.0.0.1:8770`), reachable via a **separate nginx path** (e.g. `https://overlord.snakedesert.se/go/`) so it's testable side-by-side with the live Python app. Phases:

**Phase 0: scaffold.** New Go module in a new directory (suggest `MosswartOverlord/go-tracker/` or a sibling repo `MosswartOverlord-go/`). Dockerfile, compose service `dereth-tracker-go` (loopback-bound), nginx `location /go/`. Health endpoint. Deploy it doing nothing useful yet and confirm the plumbing.

**Phase 1: read-side parity (lowest risk; do this first for the tracker).** Go connects **read-only** to the existing `dereth` TimescaleDB and reimplements the HTTP read API + serves the React bundle. Then **compare Go vs Python on identical data**: hit `https://.../live` (Python) and `https://.../go/live` (Go) and diff the JSON. They should match semantically (same keys and values; key order and whitespace may differ). This validates the read/serve half, which is most of the user-facing behavior, without touching ingest. Build a small comparison script and iterate until they match.

**Phase 2: ingest in shadow.** Implement the plugin WebSocket ingest (`/ws/position`) and browser WS (`/ws/live`) in Go. To test ingest in parallel **without stealing plugin connections or double-writing the live tables**: have the Python tracker **tee a copy** of every received plugin message to the Go service (a small, low-risk addition to `main.py`: forward each raw message to Go over an internal channel/HTTP/WS), and have **Go write to its own separate schema or database** (e.g. a `dereth_go` DB) so you can compare ingest results against Python's without conflicts. Compare row counts, latencies, and `/live` outputs.

**Phase 3: the rest.** Commands (browser→plugin envelopes), inventory forwarding to inventory-service, share_*, dungeon_map, combat_stats accumulation, Discord death/idle webhook, etc.

**Phase 4: cutover.** Once Go matches Python on real traffic for long enough, flip nginx to route the real paths to Go, point the plugin endpoint at Go, and retire the Python container. Keep Python deployable for rollback.
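The Phase 1 comparison can start as a structural JSON diff rather than a byte-for-byte one. The following is a minimal sketch, not the project's actual script: `diffJSON` and `compareBodies` are hypothetical names, and the sample payload fields (`players`, `character_name`, `kills`, `count`) are placeholders, not the real `/live` schema. It ignores object key order, treats array order as significant, and reports the path of each mismatch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// diffJSON returns the paths at which two decoded JSON values differ.
// Object key order is ignored; array order is treated as significant.
func diffJSON(path string, a, b any) []string {
	switch av := a.(type) {
	case map[string]any:
		bv, ok := b.(map[string]any)
		if !ok {
			return []string{path + ": type mismatch"}
		}
		// Union of keys, sorted so the report is deterministic.
		seen := map[string]bool{}
		for k := range av {
			seen[k] = true
		}
		for k := range bv {
			seen[k] = true
		}
		keys := make([]string, 0, len(seen))
		for k := range seen {
			keys = append(keys, k)
		}
		sort.Strings(keys)

		var out []string
		for _, k := range keys {
			x, inA := av[k]
			y, inB := bv[k]
			child := path + "." + k
			switch {
			case !inA:
				out = append(out, child+": missing in first")
			case !inB:
				out = append(out, child+": missing in second")
			default:
				out = append(out, diffJSON(child, x, y)...)
			}
		}
		return out
	case []any:
		bv, ok := b.([]any)
		if !ok {
			return []string{path + ": type mismatch"}
		}
		if len(av) != len(bv) {
			return []string{fmt.Sprintf("%s: length %d vs %d", path, len(av), len(bv))}
		}
		var out []string
		for i := range av {
			out = append(out, diffJSON(fmt.Sprintf("%s[%d]", path, i), av[i], bv[i])...)
		}
		return out
	default:
		// Scalars (string, float64, bool, nil) compare by value.
		if !reflect.DeepEqual(a, b) {
			return []string{fmt.Sprintf("%s: %v vs %v", path, a, b)}
		}
		return nil
	}
}

// compareBodies decodes two response bodies and diffs them from the root.
func compareBodies(first, second []byte) ([]string, error) {
	var x, y any
	if err := json.Unmarshal(first, &x); err != nil {
		return nil, fmt.Errorf("first body: %w", err)
	}
	if err := json.Unmarshal(second, &y); err != nil {
		return nil, fmt.Errorf("second body: %w", err)
	}
	return diffJSON("$", x, y), nil
}

func main() {
	// Placeholder payloads; the real ones would be the two HTTP responses.
	pyBody := []byte(`{"count":1,"players":[{"character_name":"A","kills":3}]}`)
	goBody := []byte(`{"players":[{"kills":4,"character_name":"A"}],"count":1}`)
	diffs, err := compareBodies(pyBody, goBody)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, d := range diffs {
		fmt.Println(d) // prints: $.players[0].kills: 3 vs 4
	}
}
```

In a real run the two byte slices would come from fetching `/live` and `/go/live` close together in time. Since `/live` changes as telemetry arrives, some reported differences may be timing noise rather than bugs.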
## Contract & correctness facts you MUST preserve (learned the hard way)
- **Wire format:** snake_case JSON, exact field names, events routed by a `type` field, ISO8601 UTC timestamps. The Pydantic models in `main.py` are the schema. Match them exactly or the plugin/frontend break.
- **`/live` "online" window MUST use the SERVER receive-time, not the client timestamp.** Game machines' clocks drift up to ~90s apart; telemetry carries the client's `DateTime.UtcNow`. Python recently added a `telemetry_events.received_at` (server-stamped) column and windows "online" on `COALESCE(received_at, timestamp) > now()-30s`. Go must stamp its own server receive-time and window on that, or the player count flaps. (See the June 2026 fix; `ACTIVE_WINDOW` = 30s.)
- **Inventory deltas are a firehose** — the plugin debounces "update" events to a 2–5 min randomized flush, but adds/removes are immediate, and forwards still arrive bursty. Python caps concurrent forwards to inventory-service with a semaphore(8) + a bounded httpx client. Go must similarly bound concurrency so an ingest burst can't starve telemetry.
- **Auth:** browser endpoints use a session cookie signed with `itsdangerous` `URLSafeTimedSerializer(SECRET_KEY)` (HMAC, 30-day expiry). If Go reuses the same `SECRET_KEY` and replicates the format, the same login works on both during the parallel run — do that. Plugin `/ws/position` auth is an `X-Plugin-Secret` header vs `SHARED_SECRET` (env). ⚠ Currently the live deploy runs with `SHARED_SECRET_LEGACY=your_shared_secret` accepted (a migration escape hatch) — don't be surprised by the placeholder; read `MosswartOverlord/CLAUDE.md` "Integration contract".
- **Internal-trust rule:** Python treats a request as internal (skips cookie auth) only if it comes from a private source IP **and has no `X-Forwarded-For`** (nginx adds XFF to all proxied traffic). Preserve these semantics; never trust the raw 172.x range.
- **DB:** `telemetry_events` and `spawn_events` are TimescaleDB hypertables (partitioned on `timestamp`) with retention policies. There are NO migrations — schema is created in `db_async.init_db_async`. Read it for the exact tables/columns/indexes. Don't break the hypertable partition key (`timestamp`) — keep writing `timestamp` (client) for partitioning AND `received_at` (server) for the window.
- **Deploy reality:** `main.py`/`db_async.py`/`static/` are bind-mounted into the Python container (restart applies changes). The full-rebuild flow bakes a `BUILD_VERSION` for the UI version stamp. Postgres ports are bound to loopback; DB ports are NOT public. SSH: `erik@overlord.snakedesert.se` (key-based). Read-only DB: `docker exec dereth-db psql -U postgres -d dereth`.
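Two of the rules above are easy to get subtly wrong, so here is a minimal Go sketch of each, under stated assumptions. `isOnline`, `isInternal`, `newRequest`, and `demoNow` are hypothetical names; only the 30s window and the private-IP-plus-no-`X-Forwarded-For` rule come from the list above. Whether loopback addresses should also count as internal is not stated here and would need checking against `main.py`; this sketch accepts only the private ranges recognized by `net.IP.IsPrivate`.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// ActiveWindow mirrors the 30s ACTIVE_WINDOW described above.
const ActiveWindow = 30 * time.Second

// demoNow is a fixed reference instant so the example is deterministic.
var demoNow = time.Date(2026, time.January, 1, 12, 0, 0, 0, time.UTC)

// isOnline decides "online" from the server-stamped receive time only;
// the client-supplied timestamp is never consulted.
func isOnline(receivedAt, now time.Time) bool {
	return now.Sub(receivedAt) < ActiveWindow
}

// isInternal reports whether a request qualifies as internal:
// a private source address AND no X-Forwarded-For header at all.
// Assumption: loopback is NOT treated as internal in this sketch.
func isInternal(r *http.Request) bool {
	if len(r.Header.Values("X-Forwarded-For")) > 0 {
		return false // anything that came through the proxy is external
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return false // unparseable peer address: fail closed
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsPrivate()
}

// newRequest builds a bare request for the demo; xff == "" means no header.
func newRequest(remoteAddr, xff string) *http.Request {
	r := &http.Request{RemoteAddr: remoteAddr, Header: http.Header{}}
	if xff != "" {
		r.Header.Set("X-Forwarded-For", xff)
	}
	return r
}

func main() {
	fmt.Println(isOnline(demoNow.Add(-10*time.Second), demoNow)) // true
	fmt.Println(isOnline(demoNow.Add(-45*time.Second), demoNow)) // false

	fmt.Println(isInternal(newRequest("172.18.0.5:40000", "")))            // true
	fmt.Println(isInternal(newRequest("172.18.0.5:40000", "203.0.113.9"))) // false
	fmt.Println(isInternal(newRequest("203.0.113.9:40000", "")))           // false
}
```

The second `isInternal` case is the one the rule exists for: a 172.x peer address alone proves nothing, because proxied public traffic arrives from the same range.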
## Suggested Go stack (decide for yourself, but these fit)
- HTTP: stdlib `net/http` + `go-chi/chi` router. WebSocket: `coder/websocket` (formerly `nhooyr.io/websocket`) or `gorilla/websocket`.
- Postgres/TimescaleDB: `jackc/pgx` v5 + `pgxpool`. JSON: stdlib `encoding/json` (fine) or `goccy/go-json` if profiling says so. Logging: stdlib `log/slog`. Config: env vars matching the Python service.
- Concurrency: shared in-memory state (live snapshots, plugin connections) behind `sync.RWMutex` or sharded maps; per-connection goroutines; bounded worker pools (`golang.org/x/sync/semaphore` or buffered channels) for inventory forwarding.
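The bounded-forwarding idea in the last bullet can be sketched with a buffered channel used as a counting semaphore. This is an illustration under assumptions, not the service's design: `Forwarder`, `NewForwarder`, and `peakInFlight` are hypothetical names, the `send` callback stands in for the HTTP call to inventory-service, and the limit of 8 simply mirrors the Python `semaphore(8)` mentioned earlier.

```go
package main

import (
	"fmt"
	"log"
	"sync"
	"time"
)

// Forwarder caps the number of concurrent forwards.
type Forwarder struct {
	slots chan struct{} // one token per in-flight forward
	wg    sync.WaitGroup
	send  func(payload string) error
}

// NewForwarder allows at most limit sends to run at the same time.
func NewForwarder(limit int, send func(payload string) error) *Forwarder {
	return &Forwarder{slots: make(chan struct{}, limit), send: send}
}

// Forward blocks the caller until a slot is free, then sends in a new goroutine.
func (f *Forwarder) Forward(payload string) {
	f.slots <- struct{}{} // acquire; blocks while limit sends are in flight
	f.wg.Add(1)
	go func() {
		defer f.wg.Done()
		defer func() { <-f.slots }() // release the slot
		if err := f.send(payload); err != nil {
			log.Printf("forward failed: %v", err)
		}
	}()
}

// Wait returns once every accepted forward has finished.
func (f *Forwarder) Wait() { f.wg.Wait() }

// peakInFlight pushes jobs through a Forwarder and reports the highest
// number of sends that were ever running simultaneously.
func peakInFlight(limit, jobs int) int {
	var mu sync.Mutex
	inFlight, peak := 0, 0
	f := NewForwarder(limit, func(string) error {
		mu.Lock()
		inFlight++
		if inFlight > peak {
			peak = inFlight
		}
		mu.Unlock()
		time.Sleep(2 * time.Millisecond) // stand-in for the HTTP round trip
		mu.Lock()
		inFlight--
		mu.Unlock()
		return nil
	})
	for i := 0; i < jobs; i++ {
		f.Forward(fmt.Sprintf("item-%d", i))
	}
	f.Wait()
	return peak
}

func main() {
	// A burst of 50 forwards never runs more than 8 at once.
	fmt.Println("peak in-flight:", peakInFlight(8, 50))
}
```

Because `Forward` blocks when all slots are busy, it applies back-pressure to whichever goroutine calls it. If that goroutine also reads telemetry from the same connection, a queue or a non-blocking `select` with a `default` branch may be a better fit; which one is right depends on how the ingest loop is structured.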
## How to work
- **Evidence-driven and parallel-safe.** Never disrupt the live Python service. Before claiming parity, *diff the actual outputs* against Python on real data and show the comparison.
- Commit frequently. Keep the Go service in its own directory/repo. Don't touch `main.py` except the tiny Phase-2 tee (and even that, behind a flag).
- Start by reading the docs above and `main.py`, then deliver **Phase 0 + Phase 1** (a Go service deployed at `/go/` that serves a `/go/live` matching Python's `/live`). Report the comparison.
- Ask the user before: changing the live Python service, repointing the plugin endpoint, or any cutover step.

First, read the listed files and `main.py`, then propose your Phase 0/1 plan (Go module layout, the compose + nginx additions, and how you'll compare `/go/live` to `/live`) before writing code.