security: enforce real plugin secret, fix proxy auth bypass, loopback DB ports, nightly backups

- SHARED_SECRET now read from env and fail-closed: unset/placeholder refuses
  ALL plugin connections (constant-time compare). The old hardcoded
  'your_shared_secret' in this public repo was no auth at all. Dockerfile
  default removed; generate_data.py reads the env var.
- SECRET_KEY fails closed at startup (main.py and agent/auth.py) instead of
  falling back to a publicly-known signing key; agent systemd unit now
  requires /etc/overlord/agent.env (no '-' prefix).
- AuthMiddleware + /ws/live: replace the 172.x source-IP trust (which every
  nginx-proxied internet request satisfied via docker-proxy — full session
  bypass and unauthenticated in-game command injection) with
  private-source AND no X-Forwarded-For, i.e. only genuinely internal
  callers (overlord-agent on the host, compose-network services). Invariant
  documented in nginx/overlord.conf: every tracker-bound location must set
  X-Forwarded-For.
- /character-stats/test endpoints gated behind admin (they upsert real rows).
- docker-compose: bind 5432/5433 to 127.0.0.1 (both DBs were internet-
  reachable; active brute-force observed in dereth-db logs).
- discord-rare-monitor: drop dead SHARED_SECRET constant.
- scripts/backup-databases.sh + docs/backups.md: nightly pg_dump of both DBs
  (telemetry/spawn hypertable data excluded), 10MB canary, umask 077,
  TimescaleDB restore procedure.
- Remove stray mangled-path css file from repo root.

Adversarially reviewed pre-deploy (3-lens workflow): ship verdict; deploy-
sequencing blockers addressed (secret staged before enforcement, exec bit
set, cron uses bash).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Erik 2026-06-10 17:02:47 +02:00
parent c6a1af0c39
commit a28b61511c
12 changed files with 261 additions and 2579 deletions

docs/backups.md Normal file

@@ -0,0 +1,102 @@
# Database backups
Nightly logical backups of both databases, taken by
[`scripts/backup-databases.sh`](../scripts/backup-databases.sh) via a cron
job on the live host (user `erik`, who is in the `docker` group — no sudo
needed). Install with:
```
mkdir -p /home/erik/backups # MUST exist before the first run —
# cron opens the log redirect before
# the script's own mkdir executes
crontab -e # add the line below
15 3 * * * bash /home/erik/MosswartOverlord/scripts/backup-databases.sh >> /home/erik/backups/backup.log 2>&1
```
Dumps land in `/home/erik/backups/postgres/` as `dereth-YYYYMMDD-HHMM.dump`
and `inventory-YYYYMMDD-HHMM.dump` (pg_dump custom format, compressed,
mode 0600). Retention: ~8 days of dailies (`-mtime +7`), pruned by the
script itself only after a successful run. The nightly `backup.log` will
contain pg_dump circular-FK warnings about hypertable chunks — those are
normal; the canary to watch is the printed dump sizes (a healthy dereth
dump is ~50 MB, and the script aborts if it drops below 10 MB).
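
The size canary and the retention pruning described above could look roughly like the following. This is a minimal sketch, not the contents of `scripts/backup-databases.sh`: the function names `dump_is_healthy` and `prune_old_dumps` are hypothetical, and reading "10 MB" as `10 * 1024 * 1024` bytes is an assumption.

```bash
# Sketch only: these helpers are hypothetical and are NOT taken from
# scripts/backup-databases.sh.

# Succeed only if FILE exists and is at least MIN_BYTES bytes large.
dump_is_healthy() {
    local file="$1" min_bytes="$2" size
    [ -f "$file" ] || return 1
    size=$(( $(wc -c < "$file") ))   # arithmetic strips any padding from wc
    [ "$size" -ge "$min_bytes" ]
}

# Delete *.dump files directly inside DIR whose age exceeds 7 full days
# (the same `-mtime +7` test the retention note refers to).
prune_old_dumps() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f -name '*.dump' -mtime +7 -delete
}

# Possible use (assumed threshold of 10 MiB); prune only after the check passes:
#   dump_is_healthy "$dump" $((10 * 1024 * 1024)) && prune_old_dumps "$(dirname "$dump")"
```

Chaining the prune behind the size check mirrors the rule that old dumps are removed only after a successful run, so a broken backup never pushes out the last good one.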
## What is and isn't included
- **dereth** (TimescaleDB): everything EXCEPT the row data of the
`telemetry_events` and `spawn_events` hypertables (their chunk data in
`_timescaledb_internal._hyper_*` is excluded). That data is ~12 GB and
expires through retention policies within 730 days anyway. The
irreplaceable tables (`users`, `char_stats`, `rare_stats`,
`rare_stats_sessions`, `rare_events`, `combat_stats`,
`combat_stats_sessions`, `portals`, `character_stats`, `server_status`)
are fully included. Table *schemas* for the excluded hypertables are
still dumped, so a restore recreates them empty.
- **inventory_db**: full dump (items, combat stats, enhancements, spells,
requirements, ratings, raw JSON).
⚠ The `_timescaledb_internal._hyper_*` exclusion drops the chunk data of
**every** hypertable, present and future. If an irreplaceable table is ever
converted to a hypertable (or a continuous aggregate is added), revisit the
exclusion list — otherwise its data silently disappears from backups.
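
One way to notice such a conversion early is to list the hypertables and flag any that are not one of the two intentionally excluded ones. The sketch below is an assumption-laden illustration: the helper name `unexpected_hypertables` is hypothetical, and the database name `dereth` in the commented usage is assumed from the section heading. It only covers regular hypertables; continuous aggregates would need their own check.

```bash
# Sketch only: hypothetical helper, not part of the repository.
# Reads hypertable names from stdin (one per line) and prints every name
# that is NOT one of the two hypertables whose row data is excluded on
# purpose. Prints nothing when the exclusion is still safe.
unexpected_hypertables() {
    grep -v -x -e 'telemetry_events' -e 'spawn_events' || true
}

# Possible use (database name `dereth` is an assumption):
#   docker exec dereth-db psql -U postgres -d dereth -At \
#     -c "SELECT hypertable_name FROM timescaledb_information.hypertables;" \
#     | unexpected_hypertables
```

Any name this prints is a table whose row data the current exclusion pattern would drop from the dump.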
## Off-host copies (recommended, not yet automated)
The dumps live on the same disk as the databases. Sync them off-host
periodically, e.g. from another machine:
```
rsync -av erik@overlord.snakedesert.se:backups/postgres/ ./overlord-backups/
```
## Restore
### inventory_db (plain Postgres)
```bash
docker exec -i inventory-db pg_restore -U inventory_user -d inventory_db --clean --if-exists < inventory-<stamp>.dump
```
### dereth (TimescaleDB — needs pre/post restore calls)
TimescaleDB requires putting the extension into restore mode around the
`pg_restore`, otherwise catalog rows fail:
```bash
# 1. Create a fresh DB (or use --clean against the existing one)
docker exec dereth-db psql -U postgres -c "CREATE DATABASE dereth_restore;"
docker exec dereth-db psql -U postgres -d dereth_restore -c "CREATE EXTENSION IF NOT EXISTS timescaledb;"
# 2. Pre-restore mode
docker exec dereth-db psql -U postgres -d dereth_restore -c "SELECT timescaledb_pre_restore();"
# 3. Restore the dump
docker exec -i dereth-db pg_restore -U postgres -d dereth_restore --no-owner < dereth-<stamp>.dump
# 4. Post-restore mode (re-enables background workers, validates catalog)
docker exec dereth-db psql -U postgres -d dereth_restore -c "SELECT timescaledb_post_restore();"
```
Notes:
- Step 3 reports one ignorable error — the dump's `CREATE EXTENSION
timescaledb` collides with the extension pre-created in step 1
("already exists", `errors ignored on restore: 1`). That is expected,
not a failed restore.
- The TimescaleDB **version** at restore time must be the **same** as at
dump time (restore first, then `ALTER EXTENSION timescaledb UPDATE` if
upgrading). Same-container restores with the image pinned in
docker-compose.yml (`timescale/timescaledb:2.19.3-pg14`) are fine.

After the restore, either point `DATABASE_URL` at the restored DB or rename the databases.
The `telemetry_events`/`spawn_events` hypertables come back empty (by
design); retention/compression policies are part of the dump and reattach.
## Verifying a backup
```bash
pg_restore --list dereth-<stamp>.dump | head                   # table of contents
pg_restore --list dereth-<stamp>.dump | grep -c 'TABLE DATA'   # number of tables whose row data is in the dump
```
A dump that suddenly shrinks dramatically (check `backup.log` sizes) is the
canary for silent failure.
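
The shrink check can also be done directly on two dump files instead of reading `backup.log`. The following is a minimal sketch under stated assumptions: the helper name `not_shrunk` and the default threshold of 50 percent are hypothetical choices, not values from the backup script.

```bash
# Sketch only: hypothetical helper, not part of the repository.
# Succeeds if NEW is at least PCT percent of the size of OLD
# (PCT defaults to an assumed 50).
not_shrunk() {
    local old="$1" new="$2" pct="${3:-50}"
    local old_size new_size
    old_size=$(( $(wc -c < "$old") ))
    new_size=$(( $(wc -c < "$new") ))
    [ $(( new_size * 100 )) -ge $(( old_size * pct )) ]
}

# Possible use, comparing yesterday's and today's dereth dumps:
#   not_shrunk dereth-<yesterday>.dump dereth-<today>.dump || echo "dump shrank, investigate"
```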