security: enforce real plugin secret, fix proxy auth bypass, loopback DB ports, nightly backups

- SHARED_SECRET now read from env and fail-closed: unset/placeholder refuses
  ALL plugin connections (constant-time compare). The old hardcoded
  'your_shared_secret' in this public repo was no auth at all. Dockerfile
  default removed; generate_data.py reads the env var.
- SECRET_KEY fails closed at startup (main.py and agent/auth.py) instead of
  falling back to a publicly-known signing key; agent systemd unit now
  requires /etc/overlord/agent.env (no '-' prefix).
- AuthMiddleware + /ws/live: replace the 172.x source-IP trust (which every
  nginx-proxied internet request satisfied via docker-proxy — full session
  bypass and unauthenticated in-game command injection) with
  private-source AND no X-Forwarded-For, i.e. only genuinely internal
  callers (overlord-agent on the host, compose-network services). Invariant
  documented in nginx/overlord.conf: every tracker-bound location must set
  X-Forwarded-For.
- /character-stats/test endpoints gated behind admin (they upsert real rows).
- docker-compose: bind 5432/5433 to 127.0.0.1 (both DBs were internet-
  reachable; active brute-force observed in dereth-db logs).
- discord-rare-monitor: drop dead SHARED_SECRET constant.
- scripts/backup-databases.sh + docs/backups.md: nightly pg_dump of both DBs
  (telemetry/spawn hypertable data excluded), 10MB canary, umask 077,
  TimescaleDB restore procedure.
- Remove stray mangled-path css file from repo root.

Adversarially reviewed pre-deploy (3-lens workflow): ship verdict; deploy-
sequencing blockers addressed (secret staged before enforcement, exec bit
set, cron uses bash).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Erik 2026-06-10 17:02:47 +02:00
parent c6a1af0c39
commit a28b61511c
12 changed files with 261 additions and 2579 deletions

@ -42,7 +42,7 @@ Dereth Tracker is a real-time telemetry platform for Asheron's Call world tracki
- Connection pool: `min_size=5, max_size=100, command_timeout=120` (`db_async.py:21`). Postgres `max_connections` is the default 100, shared with Grafana and the agent's read-only role — don't widen the pool further.
- Persisted event types: telemetry, spawn, rare, portal, character_stats, combat_stats. Everything else (vitals, quest, cantrips, nearby_objects, dungeon_map, share_*) is memory-only.
- Read-only agent role `overlord_agent_ro` is provisioned manually via `agent/sql/0001_overlord_agent_ro.sql` (SELECT-only).
- There is **no backup mechanism** — durability is the two Docker volumes (`timescale-data`, `inventory-data`).
- Backups: nightly cron on the host runs `scripts/backup-databases.sh` (pg_dump both DBs to `/home/erik/backups/postgres/`, 7-day retention; telemetry/spawn hypertable data deliberately excluded). Restore procedure: `docs/backups.md` — TimescaleDB needs `timescaledb_pre_restore()/post_restore()`.
- `db.py` is a dead legacy SQLite layer — nothing imports it. All persistence goes through `db_async.py`.
## Route conventions

File diff suppressed because it is too large

@ -36,13 +36,14 @@ ARG BUILD_VERSION=dev
ENV APP_VERSION=$BUILD_VERSION
## Default environment variables for application configuration
## NOTE: no SHARED_SECRET default here on purpose — main.py fails closed
## (refuses plugin connections) unless a real value arrives via compose/.env.
ENV DATABASE_URL=postgresql://postgres:password@db:5432/dereth \
DB_MAX_SIZE_MB=2048 \
DB_RETENTION_DAYS=7 \
DB_MAX_SQL_LENGTH=1000000000 \
DB_MAX_SQL_VARIABLES=32766 \
DB_WAL_AUTOCHECKPOINT_PAGES=1000 \
SHARED_SECRET=your_shared_secret
DB_WAL_AUTOCHECKPOINT_PAGES=1000
## Launch the FastAPI app using Uvicorn
CMD ["uvicorn","main:app","--host","0.0.0.0","--port","8765","--workers","1","--no-access-log","--log-level","warning"]

@ -12,8 +12,15 @@ import os
from fastapi import HTTPException, Request, status
from itsdangerous import BadSignature, SignatureExpired, URLSafeTimedSerializer
# Mirror main.py:996-998
SECRET_KEY = os.getenv("SECRET_KEY", "change-me-in-production-please")
# Mirror main.py — and fail closed like it does: starting with a known
# default key would let anyone forge a valid session cookie.
SECRET_KEY = os.getenv("SECRET_KEY", "")
if not SECRET_KEY or SECRET_KEY == "change-me-in-production-please":
    raise RuntimeError(
        "SECRET_KEY env var must be set (shared with dereth-tracker; see "
        "/etc/overlord/agent.env) — refusing to start with a forgeable "
        "session-signing key"
    )
SESSION_MAX_AGE = 30 * 24 * 3600 # 30 days
_serializer = URLSafeTimedSerializer(SECRET_KEY)
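
The fail-closed pattern in this hunk can be sketched as a small standalone helper. This is an illustration under assumptions, not code from the repo: `require_secret`, its parameters, and `KNOWN_DEFAULTS` are hypothetical names. In real use the first argument would be `os.environ`.

```python
from typing import Mapping


def require_secret(env: Mapping[str, str], name: str, placeholders: frozenset[str]) -> str:
    """Return env[name], or raise when it is missing, empty, or a known placeholder."""
    value = env.get(name, "")
    if not value or value in placeholders:
        # Refuse to continue rather than fall back to a publicly known value.
        raise RuntimeError(f"{name} env var must be set to a real value")
    return value


# Hypothetical placeholder set; the string is the old default shown in the diff.
KNOWN_DEFAULTS = frozenset({"change-me-in-production-please"})

print(require_secret({"SECRET_KEY": "s3cr3t-example"}, "SECRET_KEY", KNOWN_DEFAULTS))
try:
    require_secret({}, "SECRET_KEY", KNOWN_DEFAULTS)
except RuntimeError as exc:
    print("refused:", exc)
```

Raising at import time, as the diff does, means the process never reaches the point where it could sign or verify a cookie with a forgeable key.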

@ -20,8 +20,10 @@ WorkingDirectory=/home/erik/MosswartOverlord
# HOME explicitly set so claude reads /var/lib/overlord-agent/.claude/*
# instead of trying /home/erik/.claude/* (which is now 0700, locked out).
Environment="HOME=/var/lib/overlord-agent"
# Secrets file (root:overlord-agent 0640).
EnvironmentFile=-/etc/overlord/agent.env
# Secrets file (root:overlord-agent 0640). REQUIRED (no leading '-'):
# a missing secrets file must abort startup, not fail open — auth.py also
# refuses to start without SECRET_KEY.
EnvironmentFile=/etc/overlord/agent.env
# Run inside the venv populated by install.sh.
ExecStart=/home/erik/MosswartOverlord/agent/.venv/bin/python -m agent.service
Restart=on-failure

@ -34,7 +34,6 @@ logger = logging.getLogger(__name__)
# Configuration from environment variables
DISCORD_TOKEN = os.getenv('DISCORD_RARE_BOT_TOKEN')
WEBSOCKET_URL = os.getenv('DERETH_TRACKER_WS_URL', 'ws://dereth-tracker:8765/ws/live')
SHARED_SECRET = 'your_shared_secret'
ACLOG_CHANNEL_ID = int(os.getenv('ACLOG_CHANNEL_ID', '1349649482786275328'))
COMMON_RARE_CHANNEL_ID = int(os.getenv('COMMON_RARE_CHANNEL_ID', '1355328792184226014'))
GREAT_RARE_CHANNEL_ID = int(os.getenv('GREAT_RARE_CHANNEL_ID', '1353676584334131211'))

@ -62,7 +62,11 @@ services:
volumes:
- timescale-data:/var/lib/postgresql/data
ports:
- "5432:5432"
# Loopback only — Docker-published ports bypass ufw, and this host is
# internet-facing (active brute-force on the open port observed June
# 2026). In-stack consumers use the compose network; host-side tools
# (psql, overlord-agent) use 127.0.0.1.
- "127.0.0.1:5432:5432"
restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
@ -104,7 +108,8 @@ services:
volumes:
- inventory-data:/var/lib/postgresql/data
ports:
- "5433:5432"
# Loopback only — see db service note.
- "127.0.0.1:5433:5432"
restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "pg_isready -U inventory_user"]
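
A check like the following could guard against a published port drifting back to all interfaces. It is a rough sketch, not part of this commit: `binds_loopback_only` is a hypothetical helper, and it only understands the IPv4 short syntax `HOST_IP:HOST_PORT:CONTAINER_PORT` used in the hunks above.

```python
import ipaddress


def binds_loopback_only(port_spec: str) -> bool:
    """True when a short-syntax ports entry pins the host side to a loopback IP.

    Entries without a host IP (e.g. "5432:5432") are treated as not
    loopback-only, since Docker publishes those on all interfaces by default.
    """
    parts = port_spec.split(":")
    if len(parts) != 3:
        return False
    try:
        return ipaddress.ip_address(parts[0]).is_loopback
    except ValueError:
        # Host part is not an IP literal; be conservative.
        return False


for spec in ("5432:5432", "127.0.0.1:5432:5432", "0.0.0.0:5433:5432"):
    print(spec, binds_loopback_only(spec))
```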

102
docs/backups.md Normal file

@ -0,0 +1,102 @@
# Database backups
Nightly logical backups of both databases, taken by
[`scripts/backup-databases.sh`](../scripts/backup-databases.sh) via a cron
job on the live host (user `erik`, who is in the `docker` group — no sudo
needed). Install with:
```
mkdir -p /home/erik/backups # MUST exist before the first run —
# cron opens the log redirect before
# the script's own mkdir executes
crontab -e # add the line below
15 3 * * * bash /home/erik/MosswartOverlord/scripts/backup-databases.sh >> /home/erik/backups/backup.log 2>&1
```
Dumps land in `/home/erik/backups/postgres/` as `dereth-YYYYMMDD-HHMM.dump`
and `inventory-YYYYMMDD-HHMM.dump` (pg_dump custom format, compressed,
mode 0600). Retention: ~8 days of dailies (`-mtime +7`), pruned by the
script itself only after a successful run. The nightly `backup.log` will
contain pg_dump circular-FK warnings about hypertable chunks — those are
normal; the canary to watch is the printed dump sizes (a healthy dereth
dump is ~50 MB, and the script aborts if it drops below 10 MB).
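
The "~8 days" figure follows from how `find -mtime +N` counts: a file's age is truncated to whole 24-hour periods, and `+N` matches only when that count is strictly greater than N. A minimal Python sketch of the rule (the name `is_pruned` is illustrative and does not appear in the script):

```python
import math

SECONDS_PER_DAY = 24 * 3600


def is_pruned(age_seconds: float, keep_days: int = 7) -> bool:
    """Mimic `find -mtime +keep_days`: truncate the age to whole days,
    then match only when that count is strictly greater than keep_days."""
    whole_days = math.floor(age_seconds / SECONDS_PER_DAY)
    return whole_days > keep_days


# A dump that is 7 days 23 hours old is kept; at 8 full days it is deleted.
print(is_pruned(7 * SECONDS_PER_DAY + 23 * 3600))  # False
print(is_pruned(8 * SECONDS_PER_DAY))              # True
```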
## What is and isn't included
- **dereth** (TimescaleDB): everything EXCEPT the row data of the
`telemetry_events` and `spawn_events` hypertables (their chunk data in
`_timescaledb_internal._hyper_*` is excluded). That data is ~12 GB and
expires through retention policies within 7-30 days anyway. The
irreplaceable tables (`users`, `char_stats`, `rare_stats`,
`rare_stats_sessions`, `rare_events`, `combat_stats`,
`combat_stats_sessions`, `portals`, `character_stats`, `server_status`)
are fully included. Table *schemas* for the excluded hypertables are
still dumped, so a restore recreates them empty.
- **inventory_db**: full dump (items, combat stats, enhancements, spells,
requirements, ratings, raw JSON).
⚠ The `_timescaledb_internal._hyper_*` exclusion drops the chunk data of
**every** hypertable, present and future. If an irreplaceable table is ever
converted to a hypertable (or a continuous aggregate is added), revisit the
exclusion list — otherwise its data silently disappears from backups.
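
To see why the exclusion is hypertable-agnostic, here is a rough sketch that approximates pg_dump's `*` wildcard with Python's shell-style globbing. `fnmatchcase` is only a stand-in for pg_dump's own pattern matching, and the relation names and chunk ids below are invented for illustration.

```python
from fnmatch import fnmatchcase

EXCLUDE_PATTERN = "_timescaledb_internal._hyper_*"

# Hypothetical relation names, for illustration only.
relations = [
    "_timescaledb_internal._hyper_1_5_chunk",
    "_timescaledb_internal._hyper_7_2_chunk",
    "public.users",
]


def data_excluded(qualified_name: str) -> bool:
    """Approximate the wildcard match with case-sensitive shell globbing."""
    return fnmatchcase(qualified_name, EXCLUDE_PATTERN)


for name in relations:
    print(name, data_excluded(name))
```

The pattern names no particular hypertable, so chunks belonging to any hypertable match it, while ordinary tables such as `public.users` do not.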
## Off-host copies (recommended, not yet automated)
The dumps live on the same disk as the databases. Sync them off-host
periodically, e.g. from another machine:
```
rsync -av erik@overlord.snakedesert.se:backups/postgres/ ./overlord-backups/
```
## Restore
### inventory_db (plain Postgres)
```bash
docker exec -i inventory-db pg_restore -U inventory_user -d inventory_db --clean --if-exists < inventory-<stamp>.dump
```
### dereth (TimescaleDB — needs pre/post restore calls)
TimescaleDB requires putting the extension into restore mode around the
`pg_restore`, otherwise catalog rows fail:
```bash
# 1. Create a fresh DB (or use --clean against the existing one)
docker exec dereth-db psql -U postgres -c "CREATE DATABASE dereth_restore;"
docker exec dereth-db psql -U postgres -d dereth_restore -c "CREATE EXTENSION IF NOT EXISTS timescaledb;"
# 2. Pre-restore mode
docker exec dereth-db psql -U postgres -d dereth_restore -c "SELECT timescaledb_pre_restore();"
# 3. Restore the dump
docker exec -i dereth-db pg_restore -U postgres -d dereth_restore --no-owner < dereth-<stamp>.dump
# 4. Post-restore mode (re-enables background workers, validates catalog)
docker exec dereth-db psql -U postgres -d dereth_restore -c "SELECT timescaledb_post_restore();"
```
Notes:
- Step 3 reports one ignorable error — the dump's `CREATE EXTENSION
timescaledb` collides with the extension pre-created in step 1
("already exists", `errors ignored on restore: 1`). That is expected,
not a failed restore.
- The TimescaleDB **version** at restore time must be the **same** as at
dump time (restore first, then `ALTER EXTENSION timescaledb UPDATE` if
upgrading). Same-container restores with the image pinned in
docker-compose.yml (`timescale/timescaledb:2.19.3-pg14`) are fine.
Then either point `DATABASE_URL` at the restored DB or rename databases.
The `telemetry_events`/`spawn_events` hypertables come back empty (by
design); retention/compression policies are part of the dump and reattach.
## Verifying a backup
```bash
pg_restore --list dereth-<stamp>.dump | head # table of contents
pg_restore --list dereth-<stamp>.dump | grep -c 'TABLE DATA'
```
A dump that suddenly shrinks dramatically (check `backup.log` sizes) is the
canary for silent failure.
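
One way to automate the shrink check, as a hedged sketch: compare the newest dump's size with the median of earlier dumps. `looks_shrunk` and the 50% threshold are assumptions for illustration; the backup script itself only enforces the fixed 10 MB floor.

```python
from statistics import median


def looks_shrunk(previous_sizes: list[int], newest_size: int, ratio: float = 0.5) -> bool:
    """Return True when the newest dump is smaller than `ratio` times the
    median of earlier dumps. With no history there is nothing to compare."""
    if not previous_sizes:
        return False
    return newest_size < ratio * median(previous_sizes)


history = [52_000_000, 51_500_000, 53_200_000]  # hypothetical byte sizes
print(looks_shrunk(history, 50_900_000))  # False
print(looks_shrunk(history, 12_000_000))  # True
```

A relative check like this would flag a dump that is still above 10 MB but far smaller than usual.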

@ -7,6 +7,7 @@ fabricated TelemetrySnapshot payloads at regular intervals. Useful for:
- Demonstrating real-time map updates without a live game client
"""
import asyncio # Async event loop and sleep support
import os
import websockets # WebSocket client for Python
import json # JSON serialization of payloads
from datetime import datetime, timedelta, timezone
@ -32,8 +33,10 @@ async def main() -> None:
# Starting coordinates (E/W and N/S)
ew = 0.0
ns = 0.0
# WebSocket endpoint for plugin telemetry (include secret for auth)
uri = "ws://localhost:8000/ws/position?secret=your_shared_secret"
# WebSocket endpoint for plugin telemetry. The secret must match the
# backend's SHARED_SECRET env var (no insecure default anymore).
secret = os.environ["SHARED_SECRET"]
uri = f"ws://localhost:8000/ws/position?secret={secret}"
# Connect to the plugin WebSocket endpoint with authentication
# Establish WebSocket connection to the server
async with websockets.connect(uri) as websocket:

100
main.py

@ -8,7 +8,9 @@ endpoints for browser clients to retrieve live and historical data, trails, and
from collections import defaultdict
from datetime import datetime, timedelta, timezone
import hmac
import html as _html
import ipaddress
import json
import logging
import os
@ -990,10 +992,25 @@ live_equipment_cantrip_states: Dict[str, dict] = {}
live_nearby_objects: Dict[str, dict] = {}
dungeon_map_cache: Dict[str, dict] = {} # landblock hex string -> dungeon map data
# Shared secret used to authenticate plugin WebSocket connections (override for production)
SHARED_SECRET = "your_shared_secret"
# Secret key for signing session cookies (override via SECRET_KEY env var)
SECRET_KEY = os.getenv("SECRET_KEY", "change-me-in-production-please")
# Shared secret used to authenticate plugin WebSocket connections.
# MUST come from the environment — this repo is public, so a hardcoded value
# is no auth at all. When unset (or left at the old placeholder) we fail
# closed: every plugin connection is refused until it is configured.
SHARED_SECRET = os.getenv("SHARED_SECRET", "")
_SHARED_SECRET_OK = bool(SHARED_SECRET) and SHARED_SECRET != "your_shared_secret"
if not _SHARED_SECRET_OK:
    logger.critical(
        "SHARED_SECRET env var is unset or still the placeholder — "
        "refusing ALL plugin WebSocket connections until it is set in .env"
    )
# Secret key for signing session cookies. Fail closed: running with a
# publicly-known default would let anyone forge admin sessions.
SECRET_KEY = os.getenv("SECRET_KEY", "")
if not SECRET_KEY or SECRET_KEY == "change-me-in-production-please":
    raise RuntimeError(
        "SECRET_KEY env var must be set to a strong random value — "
        "session cookies are signed with it"
    )
SESSION_MAX_AGE = 30 * 24 * 3600 # 30 days in seconds
_serializer = URLSafeTimedSerializer(SECRET_KEY)
@ -1024,6 +1041,19 @@ _PUBLIC_PATHS = {"/login", "/logout"}
_PUBLIC_PREFIXES = ("/ws/position",) # Plugin WS uses X-Plugin-Secret
def _is_private_addr(host: str) -> bool:
    """True when `host` is a private/loopback address (RFC1918, 127/8, ::1).
    Used by the internal-trust rule: a private TCP peer WITHOUT an
    X-Forwarded-For header cannot have come through nginx and therefore
    cannot originate from the internet.
    """
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        return False
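
The two checks this file's diff introduces (the internal-trust rule and the constant-time secret compare) can be exercised in isolation with a sketch like the one below. `is_internal` and `secret_matches` are hypothetical standalone names, and the plain `Mapping` stands in for the framework's case-insensitive headers object, so header names are assumed to be lower-cased already.

```python
import hmac
import ipaddress
from typing import Mapping


def is_private_addr(host: str) -> bool:
    """True for private or loopback addresses; False for anything unparsable."""
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        return False


def is_internal(client_host: str, headers: Mapping[str, str]) -> bool:
    """Internal means: private TCP peer AND no X-Forwarded-For header."""
    return is_private_addr(client_host) and "x-forwarded-for" not in headers


def secret_matches(presented: str, expected: str) -> bool:
    """Constant-time comparison; an unset expected secret never matches."""
    if not expected:
        return False
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


print(is_internal("172.18.0.1", {}))                            # True
print(is_internal("172.18.0.1", {"x-forwarded-for": "8.8.8.8"}))  # False
print(is_internal("8.8.8.8", {}))                               # False
print(secret_matches("abc", "abc"), secret_matches("abc", ""))  # True False
```

The second call models an nginx-proxied request: the peer address is private, but the header marks it as proxied, so it has to present a session cookie.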
class AuthMiddleware(BaseHTTPMiddleware):
"""Redirect unauthenticated requests to /login."""
@ -1046,20 +1076,20 @@ class AuthMiddleware(BaseHTTPMiddleware):
if path.startswith("/ws/live"):
return await call_next(request)
# Trust internal connections (Docker network gateway + loopback). The
# tracker port (8765) is bound to 127.0.0.1 in docker-compose.yml and
# only the host or other compose-network containers can reach it.
# This lets host-side helpers (overlord-agent, discord-rare-monitor,
# etc.) call any endpoint without forging a session cookie.
#
# IMPORTANT: We still try to decode the session cookie if present, so
# that endpoints like /me which check `request.state.user` work for
# real authenticated browsers proxied through nginx → docker-proxy
# (which makes them look like they're coming from 172.x). Without
# this, /me returned 401 even for logged-in users, silently
# disabling the admin-only UI on the dashboard.
# Trust genuinely internal callers only. The tracker port (8765) is
# published on 127.0.0.1, so host-side helpers (overlord-agent) and
# compose-network containers reach it directly — but so does ALL
# external browser traffic, via nginx → docker-proxy, which makes it
# arrive with a 172.x source IP. Source IP alone therefore proves
# nothing. The distinguishing signal is X-Forwarded-For: nginx sets
# it on every proxied request, while direct internal calls have no
# proxy in front of them and lack the header. A request with a
# private source AND no X-Forwarded-For cannot have come through
# nginx, i.e. cannot originate from the internet.
client_host = request.client.host if request.client else ""
if client_host.startswith("172.") or client_host in ("127.0.0.1", "::1", "localhost"):
if _is_private_addr(client_host) and "x-forwarded-for" not in request.headers:
# Still decode the cookie if present so request.state.user works
# for internal tools that do log in.
token = request.cookies.get("session")
if token:
user = verify_session_cookie(token)
@ -2945,9 +2975,13 @@ async def ws_receive_snapshots(
"""
global _plugin_connections
# Authenticate plugin connection using shared secret
key = secret or x_plugin_secret
if key != SHARED_SECRET:
# Authenticate plugin connection using shared secret (constant-time
# compare; refuse everything when the secret is not configured).
key = secret or x_plugin_secret or ""
# compare bytes: compare_digest(str, str) raises TypeError on non-ASCII
if not _SHARED_SECRET_OK or not hmac.compare_digest(
key.encode("utf-8", "replace"), SHARED_SECRET.encode("utf-8")
):
# Reject without completing the WebSocket handshake
logger.warning(
f"Plugin WebSocket authentication failed from {websocket.client}"
@ -3693,11 +3727,16 @@ async def ws_live_updates(websocket: WebSocket):
Manages a set of connected browser clients; listens for incoming command messages
and forwards them to the appropriate plugin client WebSocket.
"""
# Require valid session cookie for browser WebSocket.
# Internal Docker network connections (172.x.x.x) are trusted — this allows
# the Discord bot and other internal services to connect without a cookie.
# Require a valid session cookie for browser WebSockets. Internal
# services (discord-rare-monitor connects over the compose network) are
# identified by a private source IP WITHOUT an X-Forwarded-For header —
# nginx-proxied browser traffic always carries X-Forwarded-For, so an
# internet client can never satisfy this check (same rule as
# AuthMiddleware; see comment there).
client_host = websocket.client.host if websocket.client else ""
is_internal = client_host.startswith("172.") or client_host in ("127.0.0.1", "::1", "localhost")
is_internal = (
_is_private_addr(client_host) and "x-forwarded-for" not in websocket.headers
)
if not is_internal:
token = websocket.cookies.get("session")
if not token or not verify_session_cookie(token):
@ -3865,15 +3904,18 @@ async def get_stats(character_name: str):
@app.post("/character-stats/test")
async def test_character_stats_default():
"""Inject mock character_stats data for frontend development."""
return await test_character_stats("TestCharacter")
async def test_character_stats_default(request: Request):
"""Inject mock character_stats data for frontend development (admin only)."""
_require_admin(request)
return await test_character_stats("TestCharacter", request)
@app.post("/character-stats/test/{name}")
async def test_character_stats(name: str):
async def test_character_stats(name: str, request: Request):
"""Inject mock character_stats data for a specific character name.
Processes through the same pipeline as real plugin data."""
Processes through the same pipeline as real plugin data: it OVERWRITES
the real character_stats row for {name}, hence admin-only."""
_require_admin(request)
mock_data = {
"type": "character_stats",
"timestamp": datetime.utcnow().isoformat() + "Z",

@ -14,6 +14,12 @@
# WebSockets are long-lived; nginx's default 60s timeout drops idle clients.
# Removing these timeouts caused all plugin connections to drop every
# ~60s when no data flowed from backend to client (April 2026 incident).
# - SECURITY INVARIANT: every location that proxies to the `tracker`
# upstream MUST set proxy_set_header X-Forwarded-For. The backend treats
# a private-source request WITHOUT that header as internal (host/compose
# callers) and skips session auth — a tracker-bound location that forgot
# the header would silently bypass login for the whole internet. This
# includes any future port-80 or alternate server block.
# - /grafana/ panel embeds rely on Grafana's anonymous Viewer auth
# (GF_AUTH_ANONYMOUS_ENABLED=true in docker-compose.yml) — no credentials
# in this file. Do NOT hardcode tokens here: this file is committed to a

53
scripts/backup-databases.sh Executable file

@ -0,0 +1,53 @@
#!/usr/bin/env bash
# Nightly logical backups for both MosswartOverlord databases.
# Install as a cron job on the live host (see docs/backups.md). Note `bash`
# in the cron line (survives a lost executable bit) and that /home/erik/backups
# must exist BEFORE the first run (cron sets up the >> redirection before this
# script's mkdir runs):
# 15 3 * * * bash /home/erik/MosswartOverlord/scripts/backup-databases.sh >> /home/erik/backups/backup.log 2>&1
#
# What is backed up:
# - dereth (TimescaleDB): full schema + all data EXCEPT the raw
# telemetry_events/spawn_events hypertable chunks. Those tables hold
# ~12 GB of data that expires via retention policies in 7-30 days
# anyway; the irreplaceable rows (users, char_stats, rare_stats,
# rare_events, combat_stats*, portals, character_stats, server_status)
# are all included.
# - inventory_db (postgres): full dump (~1 GB raw, much smaller compressed).
#
# Restore procedure: docs/backups.md (TimescaleDB needs pre/post restore calls).
set -euo pipefail
# Dumps contain the users table (bcrypt hashes) — keep them owner-only.
umask 077
BACKUP_DIR="${BACKUP_DIR:-/home/erik/backups/postgres}"
KEEP_DAYS="${KEEP_DAYS:-7}"
STAMP="$(date -u +%Y%m%d-%H%M)"
mkdir -p "$BACKUP_DIR"
# dereth: -Fc is compressed; exclude hypertable chunk DATA (schema kept so a
# restore recreates the tables empty and retention/compression jobs reattach).
docker exec dereth-db pg_dump -U postgres -Fc \
--exclude-table-data='public.telemetry_events' \
--exclude-table-data='public.spawn_events' \
--exclude-table-data='_timescaledb_internal._hyper_*' \
dereth > "$BACKUP_DIR/dereth-$STAMP.dump.tmp"
# Canary: a healthy dereth dump is ~50 MB; a tiny one means pg_dump silently
# produced garbage (fail the run so the old dumps are kept and cron logs it).
if [ "$(stat -c%s "$BACKUP_DIR/dereth-$STAMP.dump.tmp")" -lt 10000000 ]; then
  echo "$(date -u +%FT%TZ) FAIL dereth dump under 10MB — keeping old backups" >&2
  exit 1
fi
mv "$BACKUP_DIR/dereth-$STAMP.dump.tmp" "$BACKUP_DIR/dereth-$STAMP.dump"
docker exec inventory-db pg_dump -U inventory_user -Fc inventory_db \
> "$BACKUP_DIR/inventory-$STAMP.dump.tmp"
mv "$BACKUP_DIR/inventory-$STAMP.dump.tmp" "$BACKUP_DIR/inventory-$STAMP.dump"
# Retention: keep KEEP_DAYS days of dailies.
find "$BACKUP_DIR" -name 'dereth-*.dump' -mtime +"$KEEP_DAYS" -delete
find "$BACKUP_DIR" -name 'inventory-*.dump' -mtime +"$KEEP_DAYS" -delete
# Clean up aborted runs older than a day.
find "$BACKUP_DIR" -name '*.dump.tmp' -mtime +1 -delete
echo "$(date -u +%FT%TZ) OK dereth=$(du -h "$BACKUP_DIR/dereth-$STAMP.dump" | cut -f1) inventory=$(du -h "$BACKUP_DIR/inventory-$STAMP.dump" | cut -f1)"