security: enforce real plugin secret, fix proxy auth bypass, loopback DB ports, nightly backups

- SHARED_SECRET now read from env and fail-closed: unset/placeholder refuses
  ALL plugin connections (constant-time compare). The old hardcoded
  'your_shared_secret' in this public repo was no auth at all. Dockerfile
  default removed; generate_data.py reads the env var.
- SECRET_KEY fails closed at startup (main.py and agent/auth.py) instead of
  falling back to a publicly-known signing key; agent systemd unit now
  requires /etc/overlord/agent.env (no '-' prefix).
- AuthMiddleware + /ws/live: replace the 172.x source-IP trust (which every
  nginx-proxied internet request satisfied via docker-proxy — full session
  bypass and unauthenticated in-game command injection) with
  private-source AND no X-Forwarded-For, i.e. only genuinely internal
  callers (overlord-agent on the host, compose-network services). Invariant
  documented in nginx/overlord.conf: every tracker-bound location must set
  X-Forwarded-For.
- /character-stats/test endpoints gated behind admin (they upsert real rows).
- docker-compose: bind 5432/5433 to 127.0.0.1 (both DBs were internet-
  reachable; active brute-force observed in dereth-db logs).
- discord-rare-monitor: drop dead SHARED_SECRET constant.
- scripts/backup-databases.sh + docs/backups.md: nightly pg_dump of both DBs
  (telemetry/spawn hypertable data excluded), 10MB canary, umask 077,
  TimescaleDB restore procedure.
- Remove stray mangled-path css file from repo root.
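
The trust rule from the AuthMiddleware bullet above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in main.py: the names `is_private_addr` and `is_internal_request` are hypothetical, and the headers are modelled as a plain dict with lowercase keys (a real Starlette request does case-insensitive header lookup; a plain dict does not).

```python
import ipaddress
from typing import Mapping


def is_private_addr(host: str) -> bool:
    """True for private/loopback IP literals, False for anything unparsable."""
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        # Hostnames such as "localhost" and the empty string are not IP literals.
        return False


def is_internal_request(client_host: str, headers: Mapping[str, str]) -> bool:
    """Internal means: private TCP peer AND no X-Forwarded-For header.

    Assumption for this sketch: header keys are already lowercase.
    """
    return is_private_addr(client_host) and "x-forwarded-for" not in headers


# Direct compose-network caller: private peer, no proxy header.
print(is_internal_request("172.18.0.5", {}))  # True
# Browser traffic via nginx: private peer (docker-proxy), header present.
print(is_internal_request("172.18.0.1", {"x-forwarded-for": "203.0.113.7"}))  # False
# Public peer without the header is still not internal.
print(is_internal_request("8.8.8.8", {}))  # False
```

Note that a `startswith("172.")` check also matches public addresses outside 172.16.0.0/12, whereas `ipaddress` only accepts the real private ranges. The rule is only as strong as the nginx invariant: it assumes every proxied request carries X-Forwarded-For.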

Adversarially reviewed pre-deploy (3-lens workflow): ship verdict; deploy-
sequencing blockers addressed (secret staged before enforcement, exec bit
set, cron uses bash).
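
For reference, the fail-closed plugin check described in the first bullet can be sketched like this. It is a simplified stand-alone version, assuming hypothetical helper names (`secret_is_configured`, `plugin_auth_ok`); the real check lives inline in the WebSocket handler.

```python
import hmac
from typing import Optional

# The old public default; treated the same as "not configured".
PLACEHOLDER = "your_shared_secret"


def secret_is_configured(configured: str) -> bool:
    """A secret counts only if it is non-empty and not the old placeholder."""
    return bool(configured) and configured != PLACEHOLDER


def plugin_auth_ok(presented: Optional[str], configured: str) -> bool:
    """Fail closed: refuse everything until a real secret is configured."""
    if not secret_is_configured(configured):
        return False
    # Compare bytes: compare_digest(str, str) raises TypeError on non-ASCII.
    return hmac.compare_digest(
        (presented or "").encode("utf-8", "replace"),
        configured.encode("utf-8"),
    )


print(plugin_auth_ok("wrong", "s3cret-value"))            # False
print(plugin_auth_ok("s3cret-value", "s3cret-value"))     # True
print(plugin_auth_ok(PLACEHOLDER, PLACEHOLDER))           # False (placeholder)
```

The ordering matters for deployment: with this check in place, plugins are refused until the environment provides a real value, which is why the secret has to be staged before enforcement goes live.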

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Erik 2026-06-10 17:02:47 +02:00
parent c6a1af0c39
commit a28b61511c
12 changed files with 261 additions and 2579 deletions

@ -42,7 +42,7 @@ Dereth Tracker is a real-time telemetry platform for Asheron's Call world tracki
- Connection pool: `min_size=5, max_size=100, command_timeout=120` (`db_async.py:21`). Postgres `max_connections` is the default 100, shared with Grafana and the agent's read-only role — don't widen the pool further. - Connection pool: `min_size=5, max_size=100, command_timeout=120` (`db_async.py:21`). Postgres `max_connections` is the default 100, shared with Grafana and the agent's read-only role — don't widen the pool further.
- Persisted event types: telemetry, spawn, rare, portal, character_stats, combat_stats. Everything else (vitals, quest, cantrips, nearby_objects, dungeon_map, share_*) is memory-only. - Persisted event types: telemetry, spawn, rare, portal, character_stats, combat_stats. Everything else (vitals, quest, cantrips, nearby_objects, dungeon_map, share_*) is memory-only.
- Read-only agent role `overlord_agent_ro` is provisioned manually via `agent/sql/0001_overlord_agent_ro.sql` (SELECT-only). - Read-only agent role `overlord_agent_ro` is provisioned manually via `agent/sql/0001_overlord_agent_ro.sql` (SELECT-only).
- There is **no backup mechanism** — durability is the two Docker volumes (`timescale-data`, `inventory-data`). - Backups: nightly cron on the host runs `scripts/backup-databases.sh` (pg_dump both DBs to `/home/erik/backups/postgres/`, 7-day retention; telemetry/spawn hypertable data deliberately excluded). Restore procedure: `docs/backups.md` — TimescaleDB needs `timescaledb_pre_restore()/post_restore()`.
- `db.py` is a dead legacy SQLite layer — nothing imports it. All persistence goes through `db_async.py`. - `db.py` is a dead legacy SQLite layer — nothing imports it. All persistence goes through `db_async.py`.
## Route conventions ## Route conventions

File diff suppressed because it is too large

@ -36,13 +36,14 @@ ARG BUILD_VERSION=dev
ENV APP_VERSION=$BUILD_VERSION ENV APP_VERSION=$BUILD_VERSION
## Default environment variables for application configuration ## Default environment variables for application configuration
## NOTE: no SHARED_SECRET default here on purpose — main.py fails closed
## (refuses plugin connections) unless a real value arrives via compose/.env.
ENV DATABASE_URL=postgresql://postgres:password@db:5432/dereth \ ENV DATABASE_URL=postgresql://postgres:password@db:5432/dereth \
DB_MAX_SIZE_MB=2048 \ DB_MAX_SIZE_MB=2048 \
DB_RETENTION_DAYS=7 \ DB_RETENTION_DAYS=7 \
DB_MAX_SQL_LENGTH=1000000000 \ DB_MAX_SQL_LENGTH=1000000000 \
DB_MAX_SQL_VARIABLES=32766 \ DB_MAX_SQL_VARIABLES=32766 \
DB_WAL_AUTOCHECKPOINT_PAGES=1000 \ DB_WAL_AUTOCHECKPOINT_PAGES=1000
SHARED_SECRET=your_shared_secret
## Launch the FastAPI app using Uvicorn ## Launch the FastAPI app using Uvicorn
CMD ["uvicorn","main:app","--host","0.0.0.0","--port","8765","--workers","1","--no-access-log","--log-level","warning"] CMD ["uvicorn","main:app","--host","0.0.0.0","--port","8765","--workers","1","--no-access-log","--log-level","warning"]

@ -12,8 +12,15 @@ import os
from fastapi import HTTPException, Request, status from fastapi import HTTPException, Request, status
from itsdangerous import BadSignature, SignatureExpired, URLSafeTimedSerializer from itsdangerous import BadSignature, SignatureExpired, URLSafeTimedSerializer
# Mirror main.py:996-998 # Mirror main.py — and fail closed like it does: starting with a known
SECRET_KEY = os.getenv("SECRET_KEY", "change-me-in-production-please") # default key would let anyone forge a valid session cookie.
SECRET_KEY = os.getenv("SECRET_KEY", "")
if not SECRET_KEY or SECRET_KEY == "change-me-in-production-please":
raise RuntimeError(
"SECRET_KEY env var must be set (shared with dereth-tracker; see "
"/etc/overlord/agent.env) — refusing to start with a forgeable "
"session-signing key"
)
SESSION_MAX_AGE = 30 * 24 * 3600 # 30 days SESSION_MAX_AGE = 30 * 24 * 3600 # 30 days
_serializer = URLSafeTimedSerializer(SECRET_KEY) _serializer = URLSafeTimedSerializer(SECRET_KEY)

@ -20,8 +20,10 @@ WorkingDirectory=/home/erik/MosswartOverlord
# HOME explicitly set so claude reads /var/lib/overlord-agent/.claude/* # HOME explicitly set so claude reads /var/lib/overlord-agent/.claude/*
# instead of trying /home/erik/.claude/* (which is now 0700, locked out). # instead of trying /home/erik/.claude/* (which is now 0700, locked out).
Environment="HOME=/var/lib/overlord-agent" Environment="HOME=/var/lib/overlord-agent"
# Secrets file (root:overlord-agent 0640). # Secrets file (root:overlord-agent 0640). REQUIRED (no leading '-'):
EnvironmentFile=-/etc/overlord/agent.env # a missing secrets file must abort startup, not fail open — auth.py also
# refuses to start without SECRET_KEY.
EnvironmentFile=/etc/overlord/agent.env
# Run inside the venv populated by install.sh. # Run inside the venv populated by install.sh.
ExecStart=/home/erik/MosswartOverlord/agent/.venv/bin/python -m agent.service ExecStart=/home/erik/MosswartOverlord/agent/.venv/bin/python -m agent.service
Restart=on-failure Restart=on-failure

@ -34,7 +34,6 @@ logger = logging.getLogger(__name__)
# Configuration from environment variables # Configuration from environment variables
DISCORD_TOKEN = os.getenv('DISCORD_RARE_BOT_TOKEN') DISCORD_TOKEN = os.getenv('DISCORD_RARE_BOT_TOKEN')
WEBSOCKET_URL = os.getenv('DERETH_TRACKER_WS_URL', 'ws://dereth-tracker:8765/ws/live') WEBSOCKET_URL = os.getenv('DERETH_TRACKER_WS_URL', 'ws://dereth-tracker:8765/ws/live')
SHARED_SECRET = 'your_shared_secret'
ACLOG_CHANNEL_ID = int(os.getenv('ACLOG_CHANNEL_ID', '1349649482786275328')) ACLOG_CHANNEL_ID = int(os.getenv('ACLOG_CHANNEL_ID', '1349649482786275328'))
COMMON_RARE_CHANNEL_ID = int(os.getenv('COMMON_RARE_CHANNEL_ID', '1355328792184226014')) COMMON_RARE_CHANNEL_ID = int(os.getenv('COMMON_RARE_CHANNEL_ID', '1355328792184226014'))
GREAT_RARE_CHANNEL_ID = int(os.getenv('GREAT_RARE_CHANNEL_ID', '1353676584334131211')) GREAT_RARE_CHANNEL_ID = int(os.getenv('GREAT_RARE_CHANNEL_ID', '1353676584334131211'))

@ -62,7 +62,11 @@ services:
volumes: volumes:
- timescale-data:/var/lib/postgresql/data - timescale-data:/var/lib/postgresql/data
ports: ports:
- "5432:5432" # Loopback only — Docker-published ports bypass ufw, and this host is
# internet-facing (active brute-force on the open port observed June
# 2026). In-stack consumers use the compose network; host-side tools
# (psql, overlord-agent) use 127.0.0.1.
- "127.0.0.1:5432:5432"
restart: unless-stopped restart: unless-stopped
healthcheck: healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"] test: ["CMD-SHELL", "pg_isready -U postgres"]
@ -104,7 +108,8 @@ services:
volumes: volumes:
- inventory-data:/var/lib/postgresql/data - inventory-data:/var/lib/postgresql/data
ports: ports:
- "5433:5432" # Loopback only — see db service note.
- "127.0.0.1:5433:5432"
restart: unless-stopped restart: unless-stopped
healthcheck: healthcheck:
test: ["CMD-SHELL", "pg_isready -U inventory_user"] test: ["CMD-SHELL", "pg_isready -U inventory_user"]

102
docs/backups.md Normal file

@ -0,0 +1,102 @@
# Database backups
Nightly logical backups of both databases, taken by
[`scripts/backup-databases.sh`](../scripts/backup-databases.sh) via a cron
job on the live host (user `erik`, who is in the `docker` group — no sudo
needed). Install with:
```
mkdir -p /home/erik/backups # MUST exist before the first run —
# cron opens the log redirect before
# the script's own mkdir executes
crontab -e # add the line below
15 3 * * * bash /home/erik/MosswartOverlord/scripts/backup-databases.sh >> /home/erik/backups/backup.log 2>&1
```
Dumps land in `/home/erik/backups/postgres/` as `dereth-YYYYMMDD-HHMM.dump`
and `inventory-YYYYMMDD-HHMM.dump` (pg_dump custom format, compressed,
mode 0600). Retention: ~8 days of dailies (`-mtime +7`), pruned by the
script itself only after a successful run. The nightly `backup.log` will
contain pg_dump circular-FK warnings about hypertable chunks — those are
normal; the canary to watch is the printed dump sizes (a healthy dereth
dump is ~50 MB, and the script aborts if it drops below 10 MB).
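
Why `-mtime +7` keeps roughly 8 days rather than 7: `find` drops the fractional part of a file's age in days before comparing, so a dump only matches once it is at least 8 full days old. A small sketch of that arithmetic (the function name `matches_mtime_plus` is made up for illustration, and this mimics only the age test, not `find` itself):

```python
SECONDS_PER_DAY = 86_400


def matches_mtime_plus(age_seconds: int, n: int) -> bool:
    """Mimic the age test of `find -mtime +n`.

    The age is counted in whole 24-hour periods (fraction dropped) and
    must be strictly greater than n.
    """
    return age_seconds // SECONDS_PER_DAY > n


# 7 days 23 hours old: counts as 7 whole days, 7 > 7 is false, dump is kept.
print(matches_mtime_plus(7 * SECONDS_PER_DAY + 23 * 3600, 7))  # False
# Exactly 8 days old: counts as 8 whole days, dump is pruned.
print(matches_mtime_plus(8 * SECONDS_PER_DAY, 7))  # True
```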
## What is and isn't included
- **dereth** (TimescaleDB): everything EXCEPT the row data of the
`telemetry_events` and `spawn_events` hypertables (their chunk data in
`_timescaledb_internal._hyper_*` is excluded). That data is ~12 GB and
expires through retention policies within 7-30 days anyway. The
irreplaceable tables (`users`, `char_stats`, `rare_stats`,
`rare_stats_sessions`, `rare_events`, `combat_stats`,
`combat_stats_sessions`, `portals`, `character_stats`, `server_status`)
are fully included. Table *schemas* for the excluded hypertables are
still dumped, so a restore recreates them empty.
- **inventory_db**: full dump (items, combat stats, enhancements, spells,
requirements, ratings, raw JSON).
⚠ The `_timescaledb_internal._hyper_*` exclusion drops the chunk data of
**every** hypertable, present and future. If an irreplaceable table is ever
converted to a hypertable (or a continuous aggregate is added), revisit the
exclusion list — otherwise its data silently disappears from backups.
## Off-host copies (recommended, not yet automated)
The dumps live on the same disk as the databases. Sync them off-host
periodically, e.g. from another machine:
```
rsync -av erik@overlord.snakedesert.se:backups/postgres/ ./overlord-backups/
```
## Restore
### inventory_db (plain Postgres)
```bash
docker exec -i inventory-db pg_restore -U inventory_user -d inventory_db --clean --if-exists < inventory-<stamp>.dump
```
### dereth (TimescaleDB — needs pre/post restore calls)
TimescaleDB requires putting the extension into restore mode around the
`pg_restore`, otherwise catalog rows fail:
```bash
# 1. Create a fresh DB (or use --clean against the existing one)
docker exec dereth-db psql -U postgres -c "CREATE DATABASE dereth_restore;"
docker exec dereth-db psql -U postgres -d dereth_restore -c "CREATE EXTENSION IF NOT EXISTS timescaledb;"
# 2. Pre-restore mode
docker exec dereth-db psql -U postgres -d dereth_restore -c "SELECT timescaledb_pre_restore();"
# 3. Restore the dump
docker exec -i dereth-db pg_restore -U postgres -d dereth_restore --no-owner < dereth-<stamp>.dump
# 4. Post-restore mode (re-enables background workers, validates catalog)
docker exec dereth-db psql -U postgres -d dereth_restore -c "SELECT timescaledb_post_restore();"
```
Notes:
- Step 3 reports one ignorable error — the dump's `CREATE EXTENSION
timescaledb` collides with the extension pre-created in step 1
("already exists", `errors ignored on restore: 1`). That is expected,
not a failed restore.
- The TimescaleDB **version** at restore time must be the **same** as at
dump time (restore first, then `ALTER EXTENSION timescaledb UPDATE` if
upgrading). Same-container restores with the image pinned in
docker-compose.yml (`timescale/timescaledb:2.19.3-pg14`) are fine.
Then either point `DATABASE_URL` at the restored DB or rename databases.
The `telemetry_events`/`spawn_events` hypertables come back empty (by
design); retention/compression policies are part of the dump and reattach.
## Verifying a backup
```bash
pg_restore --list dereth-<stamp>.dump | head # table of contents
pg_restore --list dereth-<stamp>.dump | grep -c 'TABLE DATA'
```
A dump that suddenly shrinks dramatically (check `backup.log` sizes) is the
canary for silent failure.
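
If the check should be scripted, the `grep -c 'TABLE DATA'` step can be reproduced in Python. This is a sketch only: `count_table_data` is a hypothetical helper, and the sample listing is hand-written to resemble `pg_restore --list` output rather than copied from a real dump.

```python
def count_table_data(listing: str) -> int:
    """Count lines containing 'TABLE DATA', like `grep -c 'TABLE DATA'`."""
    return sum(1 for line in listing.splitlines() if "TABLE DATA" in line)


# Hand-written sample for illustration; real IDs and names will differ.
sample_listing = """\
; (header comments omitted)
215; 1259 16386 TABLE public users postgres
3402; 0 16386 TABLE DATA public users postgres
3403; 0 16401 TABLE DATA public portals postgres
"""

print(count_table_data(sample_listing))  # 2
```

To run it against a real dump, the listing text could come from `subprocess.run(["pg_restore", "--list", path], capture_output=True, text=True, check=True).stdout`, assuming `pg_restore` is on the PATH of the machine doing the check.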

@ -7,6 +7,7 @@ fabricated TelemetrySnapshot payloads at regular intervals. Useful for:
- Demonstrating real-time map updates without a live game client - Demonstrating real-time map updates without a live game client
""" """
import asyncio # Async event loop and sleep support import asyncio # Async event loop and sleep support
import os
import websockets # WebSocket client for Python import websockets # WebSocket client for Python
import json # JSON serialization of payloads import json # JSON serialization of payloads
from datetime import datetime, timedelta, timezone from datetime import datetime, timedelta, timezone
@ -32,8 +33,10 @@ async def main() -> None:
# Starting coordinates (E/W and N/S) # Starting coordinates (E/W and N/S)
ew = 0.0 ew = 0.0
ns = 0.0 ns = 0.0
# WebSocket endpoint for plugin telemetry (include secret for auth) # WebSocket endpoint for plugin telemetry. The secret must match the
uri = "ws://localhost:8000/ws/position?secret=your_shared_secret" # backend's SHARED_SECRET env var (no insecure default anymore).
secret = os.environ["SHARED_SECRET"]
uri = f"ws://localhost:8000/ws/position?secret={secret}"
# Connect to the plugin WebSocket endpoint with authentication # Connect to the plugin WebSocket endpoint with authentication
# Establish WebSocket connection to the server # Establish WebSocket connection to the server
async with websockets.connect(uri) as websocket: async with websockets.connect(uri) as websocket:

100
main.py

@ -8,7 +8,9 @@ endpoints for browser clients to retrieve live and historical data, trails, and
from collections import defaultdict from collections import defaultdict
from datetime import datetime, timedelta, timezone from datetime import datetime, timedelta, timezone
import hmac
import html as _html import html as _html
import ipaddress
import json import json
import logging import logging
import os import os
@ -990,10 +992,25 @@ live_equipment_cantrip_states: Dict[str, dict] = {}
live_nearby_objects: Dict[str, dict] = {} live_nearby_objects: Dict[str, dict] = {}
dungeon_map_cache: Dict[str, dict] = {} # landblock hex string -> dungeon map data dungeon_map_cache: Dict[str, dict] = {} # landblock hex string -> dungeon map data
# Shared secret used to authenticate plugin WebSocket connections (override for production) # Shared secret used to authenticate plugin WebSocket connections.
SHARED_SECRET = "your_shared_secret" # MUST come from the environment — this repo is public, so a hardcoded value
# Secret key for signing session cookies (override via SECRET_KEY env var) # is no auth at all. When unset (or left at the old placeholder) we fail
SECRET_KEY = os.getenv("SECRET_KEY", "change-me-in-production-please") # closed: every plugin connection is refused until it is configured.
SHARED_SECRET = os.getenv("SHARED_SECRET", "")
_SHARED_SECRET_OK = bool(SHARED_SECRET) and SHARED_SECRET != "your_shared_secret"
if not _SHARED_SECRET_OK:
logger.critical(
"SHARED_SECRET env var is unset or still the placeholder — "
"refusing ALL plugin WebSocket connections until it is set in .env"
)
# Secret key for signing session cookies. Fail closed: running with a
# publicly-known default would let anyone forge admin sessions.
SECRET_KEY = os.getenv("SECRET_KEY", "")
if not SECRET_KEY or SECRET_KEY == "change-me-in-production-please":
raise RuntimeError(
"SECRET_KEY env var must be set to a strong random value — "
"session cookies are signed with it"
)
SESSION_MAX_AGE = 30 * 24 * 3600 # 30 days in seconds SESSION_MAX_AGE = 30 * 24 * 3600 # 30 days in seconds
_serializer = URLSafeTimedSerializer(SECRET_KEY) _serializer = URLSafeTimedSerializer(SECRET_KEY)
@ -1024,6 +1041,19 @@ _PUBLIC_PATHS = {"/login", "/logout"}
_PUBLIC_PREFIXES = ("/ws/position",) # Plugin WS uses X-Plugin-Secret _PUBLIC_PREFIXES = ("/ws/position",) # Plugin WS uses X-Plugin-Secret
def _is_private_addr(host: str) -> bool:
"""True when `host` is a private/loopback address (RFC1918, 127/8, ::1).
Used by the internal-trust rule: a private TCP peer WITHOUT an
X-Forwarded-For header cannot have come through nginx and therefore
cannot originate from the internet.
"""
try:
return ipaddress.ip_address(host).is_private
except ValueError:
return False
class AuthMiddleware(BaseHTTPMiddleware): class AuthMiddleware(BaseHTTPMiddleware):
"""Redirect unauthenticated requests to /login.""" """Redirect unauthenticated requests to /login."""
@ -1046,20 +1076,20 @@ class AuthMiddleware(BaseHTTPMiddleware):
if path.startswith("/ws/live"): if path.startswith("/ws/live"):
return await call_next(request) return await call_next(request)
# Trust internal connections (Docker network gateway + loopback). The # Trust genuinely internal callers only. The tracker port (8765) is
# tracker port (8765) is bound to 127.0.0.1 in docker-compose.yml and # published on 127.0.0.1, so host-side helpers (overlord-agent) and
# only the host or other compose-network containers can reach it. # compose-network containers reach it directly — but so does ALL
# This lets host-side helpers (overlord-agent, discord-rare-monitor, # external browser traffic, via nginx → docker-proxy, which makes it
# etc.) call any endpoint without forging a session cookie. # arrive with a 172.x source IP. Source IP alone therefore proves
# # nothing. The distinguishing signal is X-Forwarded-For: nginx sets
# IMPORTANT: We still try to decode the session cookie if present, so # it on every proxied request, while direct internal calls have no
# that endpoints like /me which check `request.state.user` work for # proxy in front of them and lack the header. A request with a
# real authenticated browsers proxied through nginx → docker-proxy # private source AND no X-Forwarded-For cannot have come through
# (which makes them look like they're coming from 172.x). Without # nginx, i.e. cannot originate from the internet.
# this, /me returned 401 even for logged-in users, silently
# disabling the admin-only UI on the dashboard.
client_host = request.client.host if request.client else "" client_host = request.client.host if request.client else ""
if client_host.startswith("172.") or client_host in ("127.0.0.1", "::1", "localhost"): if _is_private_addr(client_host) and "x-forwarded-for" not in request.headers:
# Still decode the cookie if present so request.state.user works
# for internal tools that do log in.
token = request.cookies.get("session") token = request.cookies.get("session")
if token: if token:
user = verify_session_cookie(token) user = verify_session_cookie(token)
@ -2945,9 +2975,13 @@ async def ws_receive_snapshots(
""" """
global _plugin_connections global _plugin_connections
# Authenticate plugin connection using shared secret # Authenticate plugin connection using shared secret (constant-time
key = secret or x_plugin_secret # compare; refuse everything when the secret is not configured).
if key != SHARED_SECRET: key = secret or x_plugin_secret or ""
# compare bytes: compare_digest(str, str) raises TypeError on non-ASCII
if not _SHARED_SECRET_OK or not hmac.compare_digest(
key.encode("utf-8", "replace"), SHARED_SECRET.encode("utf-8")
):
# Reject without completing the WebSocket handshake # Reject without completing the WebSocket handshake
logger.warning( logger.warning(
f"Plugin WebSocket authentication failed from {websocket.client}" f"Plugin WebSocket authentication failed from {websocket.client}"
@ -3693,11 +3727,16 @@ async def ws_live_updates(websocket: WebSocket):
Manages a set of connected browser clients; listens for incoming command messages Manages a set of connected browser clients; listens for incoming command messages
and forwards them to the appropriate plugin client WebSocket. and forwards them to the appropriate plugin client WebSocket.
""" """
# Require valid session cookie for browser WebSocket. # Require a valid session cookie for browser WebSockets. Internal
# Internal Docker network connections (172.x.x.x) are trusted — this allows # services (discord-rare-monitor connects over the compose network) are
# the Discord bot and other internal services to connect without a cookie. # identified by a private source IP WITHOUT an X-Forwarded-For header —
# nginx-proxied browser traffic always carries X-Forwarded-For, so an
# internet client can never satisfy this check (same rule as
# AuthMiddleware; see comment there).
client_host = websocket.client.host if websocket.client else "" client_host = websocket.client.host if websocket.client else ""
is_internal = client_host.startswith("172.") or client_host in ("127.0.0.1", "::1", "localhost") is_internal = (
_is_private_addr(client_host) and "x-forwarded-for" not in websocket.headers
)
if not is_internal: if not is_internal:
token = websocket.cookies.get("session") token = websocket.cookies.get("session")
if not token or not verify_session_cookie(token): if not token or not verify_session_cookie(token):
@ -3865,15 +3904,18 @@ async def get_stats(character_name: str):
@app.post("/character-stats/test") @app.post("/character-stats/test")
async def test_character_stats_default(): async def test_character_stats_default(request: Request):
"""Inject mock character_stats data for frontend development.""" """Inject mock character_stats data for frontend development (admin only)."""
return await test_character_stats("TestCharacter") _require_admin(request)
return await test_character_stats("TestCharacter", request)
@app.post("/character-stats/test/{name}") @app.post("/character-stats/test/{name}")
async def test_character_stats(name: str): async def test_character_stats(name: str, request: Request):
"""Inject mock character_stats data for a specific character name. """Inject mock character_stats data for a specific character name.
Processes through the same pipeline as real plugin data.""" Processes through the same pipeline as real plugin data; it OVERWRITES
the real character_stats row for {name}, hence admin-only."""
_require_admin(request)
mock_data = { mock_data = {
"type": "character_stats", "type": "character_stats",
"timestamp": datetime.utcnow().isoformat() + "Z", "timestamp": datetime.utcnow().isoformat() + "Z",

@ -14,6 +14,12 @@
# WebSockets are long-lived; nginx's default 60s timeout drops idle clients. # WebSockets are long-lived; nginx's default 60s timeout drops idle clients.
# Removing these timeouts caused all plugin connections to drop every # Removing these timeouts caused all plugin connections to drop every
# ~60s when no data flowed from backend to client (April 2026 incident). # ~60s when no data flowed from backend to client (April 2026 incident).
# - SECURITY INVARIANT: every location that proxies to the `tracker`
# upstream MUST set proxy_set_header X-Forwarded-For. The backend treats
# a private-source request WITHOUT that header as internal (host/compose
# callers) and skips session auth — a tracker-bound location that forgot
# the header would silently bypass login for the whole internet. This
# includes any future port-80 or alternate server block.
# - /grafana/ panel embeds rely on Grafana's anonymous Viewer auth # - /grafana/ panel embeds rely on Grafana's anonymous Viewer auth
# (GF_AUTH_ANONYMOUS_ENABLED=true in docker-compose.yml) — no credentials # (GF_AUTH_ANONYMOUS_ENABLED=true in docker-compose.yml) — no credentials
# in this file. Do NOT hardcode tokens here: this file is committed to a # in this file. Do NOT hardcode tokens here: this file is committed to a

53
scripts/backup-databases.sh Executable file

@ -0,0 +1,53 @@
#!/usr/bin/env bash
# Nightly logical backups for both MosswartOverlord databases.
# Install as a cron job on the live host (see docs/backups.md). Note `bash`
# in the cron line (survives a lost executable bit) and that /home/erik/backups
# must exist BEFORE the first run (cron sets up the >> redirection before this
# script's mkdir runs):
# 15 3 * * * bash /home/erik/MosswartOverlord/scripts/backup-databases.sh >> /home/erik/backups/backup.log 2>&1
#
# What is backed up:
# - dereth (TimescaleDB): full schema + all data EXCEPT the raw
# telemetry_events/spawn_events hypertable chunks. Those tables hold
# ~12 GB of data that expires via retention policies in 7-30 days
# anyway; the irreplaceable rows (users, char_stats, rare_stats,
# rare_events, combat_stats*, portals, character_stats, server_status)
# are all included.
# - inventory_db (postgres): full dump (~1 GB raw, much smaller compressed).
#
# Restore procedure: docs/backups.md (TimescaleDB needs pre/post restore calls).
set -euo pipefail
# Dumps contain the users table (bcrypt hashes) — keep them owner-only.
umask 077
BACKUP_DIR="${BACKUP_DIR:-/home/erik/backups/postgres}"
KEEP_DAYS="${KEEP_DAYS:-7}"
STAMP="$(date -u +%Y%m%d-%H%M)"
mkdir -p "$BACKUP_DIR"
# dereth: -Fc is compressed; exclude hypertable chunk DATA (schema kept so a
# restore recreates the tables empty and retention/compression jobs reattach).
docker exec dereth-db pg_dump -U postgres -Fc \
--exclude-table-data='public.telemetry_events' \
--exclude-table-data='public.spawn_events' \
--exclude-table-data='_timescaledb_internal._hyper_*' \
dereth > "$BACKUP_DIR/dereth-$STAMP.dump.tmp"
# Canary: a healthy dereth dump is ~50 MB; a tiny one means pg_dump silently
# produced garbage (fail the run so the old dumps are kept and cron logs it).
if [ "$(stat -c%s "$BACKUP_DIR/dereth-$STAMP.dump.tmp")" -lt 10000000 ]; then
echo "$(date -u +%FT%TZ) FAIL dereth dump under 10MB — keeping old backups" >&2
exit 1
fi
mv "$BACKUP_DIR/dereth-$STAMP.dump.tmp" "$BACKUP_DIR/dereth-$STAMP.dump"
docker exec inventory-db pg_dump -U inventory_user -Fc inventory_db \
> "$BACKUP_DIR/inventory-$STAMP.dump.tmp"
mv "$BACKUP_DIR/inventory-$STAMP.dump.tmp" "$BACKUP_DIR/inventory-$STAMP.dump"
# Retention: keep KEEP_DAYS days of dailies.
find "$BACKUP_DIR" -name 'dereth-*.dump' -mtime +"$KEEP_DAYS" -delete
find "$BACKUP_DIR" -name 'inventory-*.dump' -mtime +"$KEEP_DAYS" -delete
# Clean up aborted runs older than a day.
find "$BACKUP_DIR" -name '*.dump.tmp' -mtime +1 -delete
echo "$(date -u +%FT%TZ) OK dereth=$(du -h "$BACKUP_DIR/dereth-$STAMP.dump" | cut -f1) inventory=$(du -h "$BACKUP_DIR/inventory-$STAMP.dump" | cut -f1)"