12 commits

Author SHA1 Message Date
Erik
1196746dbe fix(agent): SQL parser robust against sqlglot version drift
The query_telemetry_db tool was crashing with AttributeError because
exp.AlterTable doesn't exist in this sqlglot version (renamed to Alter).
Made the deny-class list build defensively via getattr and dropped any
classes that the installed sqlglot doesn't expose.

Also broadened the deny list (Alter, AlterColumn, AlterDatabase, Truncate,
Grant, Revoke, Copy) and made the toplevel allowlist tolerant of missing
classes too. The walk() return shape is also normalized in case sqlglot
versions yield (node, parent, key) tuples vs. bare nodes.

Belt-and-suspenders is fine — the GRANT-SELECT-only PG role is the real
write barrier; the parser is just a faster/friendlier reject path.
2026-04-25 23:07:00 +02:00
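
The defensive deny-list construction described in 1196746dbe could look roughly like the sketch below. It is not the code from the commit. The function names are hypothetical, and a `SimpleNamespace` stands in for `sqlglot.expressions` so the example runs without sqlglot installed.

```python
from types import SimpleNamespace

# Class names taken from the commit message; "AlterTable" is the older spelling.
DENY_NAMES = ("Alter", "AlterTable", "AlterColumn", "AlterDatabase",
              "Truncate", "Grant", "Revoke", "Copy")


def build_deny_classes(exp_module, names=DENY_NAMES):
    """Return only the classes from `names` that `exp_module` actually exposes."""
    found = (getattr(exp_module, name, None) for name in names)
    return tuple(cls for cls in found if isinstance(cls, type))


def normalize_walk_item(item):
    """Return the node whether walk() yielded a bare node or a tuple."""
    return item[0] if isinstance(item, tuple) else item


def contains_denied(walk_items, deny_classes):
    """True if any walked node is an instance of a denied class."""
    return any(isinstance(normalize_walk_item(i), deny_classes)
               for i in walk_items)


# Stand-in for a sqlglot version that has Alter but no AlterTable (assumption).
class Alter: ...
class Truncate: ...
class Select: ...

fake_exp = SimpleNamespace(Alter=Alter, Truncate=Truncate, Select=Select)
deny = build_deny_classes(fake_exp)
```

With real sqlglot, `fake_exp` would be replaced by the `sqlglot.expressions` module and `walk_items` by the result of walking the parsed tree.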
Erik
0633865598 fix(agent): block Agent + Gmail/Drive/Calendar tools, brief model not to probe
Two complementary changes after observing the model probe boundaries
(it tried mcp__claude_ai_Gmail__search_threads, then tried to delegate
to a subagent via the Agent tool, then suggested the user edit
settings.local.json to add Gmail tools):

1. claude_wrapper.py adds to --disallowed-tools:
   - Agent  (subagent spawning — should never delegate)
   - WebFetch (already disallowed; settings.json re-allows acpedia.org only)
   - Every Gmail/Calendar/Drive connector tool name we know about

2. CLAUDE.md adds a 'Non-negotiable scope rules' section:
   - Be a read-only game-state QA service, nothing else
   - Don't attempt tools outside your role
   - Don't explain how to bypass restrictions
   - Don't suggest settings.json edits
   - Don't enumerate hidden tools when asked

Soft (system-prompt) + hard (CLI flag) defenses combined.
2026-04-25 22:45:39 +02:00
Erik
e780f249d1 fix(agent): keep strict permissions server-side, not in repo
The previous commit put .claude/settings.json IN THE REPO, which would
have applied its strict deny rules to ANY Claude Code invocation from
this cwd — including the human user's interactive dev sessions on their
own machine. That's wrong; the production agent's lockdown should not
constrain the developer.

Remove the committed file and gitignore .claude/ entirely. The repo is
permission-neutral now.

Strict permissions for the production agent come from two server-only
sources:
  1. CLI flags in agent/claude_wrapper.py (--allowed-tools +
     --disallowed-tools, passed by the systemd-spawned subprocess only)
  2. /var/lib/overlord-agent/.claude/settings.json (the agent's own HOME
     — separate from any user's .claude/)

Also bumps claude_wrapper.py with the explicit --disallowed-tools list
of meta-tools (ToolSearch, Monitor, TodoWrite, TaskOutput, Skill, cron
tools, etc.) that the --allowed-tools whitelist does not block on its
own. Verified empirically: with only --allowed-tools, ToolSearch was
still callable; --disallowed-tools is required.
2026-04-25 22:26:02 +02:00
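
A minimal sketch of how a wrapper might assemble the two flags named in e780f249d1. The helper name, the comma-joined value format, and the `mcp__overlord__...` tool names are assumptions for illustration; only `claude -p`, `--allowed-tools`, `--disallowed-tools`, and the meta-tool names come from the commit messages.

```python
# Meta-tools the commit lists as not blocked by the allowlist alone.
DISALLOWED_META_TOOLS = ["ToolSearch", "Monitor", "TodoWrite", "TaskOutput", "Skill"]


def build_claude_argv(prompt, allowed_tools, disallowed_tools):
    """Build the argv for one headless `claude -p` invocation."""
    overlap = set(allowed_tools) & set(disallowed_tools)
    if overlap:
        # A tool on both lists is almost certainly a configuration mistake.
        raise ValueError(f"tools both allowed and disallowed: {sorted(overlap)}")
    argv = ["claude", "-p", prompt]
    if allowed_tools:
        # Comma-joined value is an assumption about the CLI's list format.
        argv += ["--allowed-tools", ",".join(allowed_tools)]
    if disallowed_tools:
        argv += ["--disallowed-tools", ",".join(disallowed_tools)]
    return argv


# Hypothetical MCP tool names, for illustration only.
argv = build_claude_argv(
    "who is online?",
    allowed_tools=["mcp__overlord__get_live_players"],
    disallowed_tools=DISALLOWED_META_TOOLS,
)
```

Passing the result as a list to `subprocess.run` avoids shell quoting issues with the user-supplied prompt.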
Erik
f894399165 feat(agent): isolate from erik — dedicated overlord-agent user
The agent service was running as User=erik, which meant:
- Sessions polluted erik's ~/.claude/projects/
- erik's .claude/settings.local.json (months of accumulated dev permissions
  for docker/git/dotnet/etc.) was loaded by the production agent, defeating
  the --allowed-tools whitelist
- Subscription rate quota mingled between human-erik's interactive Claude
  Code use and the production assistant
- Theoretical access to /home/erik/.ssh, .bash_history, .gitconfig

Now:
- User=overlord-agent (system account, no shell, /var/lib/overlord-agent home)
- HOME=/var/lib/overlord-agent — claude state fully isolated from erik
- /home/erik/.claude permissions tightened to 0700 (was 0755)
- group=overlord-agent on the repo + /etc/overlord/agent.env (read-only)

Project settings:
- New strict committed .claude/settings.json: deny Bash/Read/Write/Edit/
  Glob/Grep/NotebookEdit/WebSearch; allow only WebFetch(domain:acpedia.org)
- .claude/settings.local.json now gitignored (was leaking dev permissions
  to the server through the deploy)
2026-04-25 21:50:57 +02:00
Erik
49ae4369e0 fix(agent): relax SystemCallFilter — Node needs @cpu-emulation etc.
The extra ~@cpu-emulation ~@obsolete ~@swap ~@raw-io negations on top of
@system-service killed Claude Code (Node) with SIGSYS during startup.

Keep just the truly dangerous groups blocked: ~@privileged ~@reboot
~@mount. The base @system-service preset already excludes others (no
@debug, no @resources, etc. are included by default in that preset).
2026-04-25 21:31:14 +02:00
Erik
5cf052cedf fix(agent): drop MemoryDenyWriteExecute — breaks Node.js V8 JIT
Claude Code is a Node app. V8 JIT requires W^X transitions via mprotect
with PROT_EXEC on JIT'd code pages. MemoryDenyWriteExecute kills the
process with SIGTRAP/abort during startup (~10ms in).

Without JIT we'd have to use --jitless mode, which destroys performance.
The other systemd hardening (ProtectSystem, ProtectHome,
InaccessiblePaths, NoNewPrivileges, capability drop, syscall filter,
PrivateTmp, etc.) still gives strong filesystem and privilege isolation.
The remaining shellcode-injection risk is theoretical — there is no
Bash/Write/Edit tool exposed for an attacker to chain into.

Also: MemoryLimit -> MemoryMax (MemoryLimit is the deprecated form).
2026-04-25 21:29:16 +02:00
Erik
9d4c724b7f feat(agent): security hardening — systemd lockdown, rate limit, audit log
systemd unit now applies defense-in-depth:
- ProtectSystem=strict + ProtectHome=read-only (rest of FS sealed)
- ReadWritePaths only for ~/.claude (session JSONLs) and venv + audit log
- InaccessiblePaths blocks /etc/shadow, /etc/ssh, /root, ~/.ssh, shell history
- NoNewPrivileges + dropped capabilities (no setuid escalation, no caps)
- PrivateTmp, PrivateDevices, ProtectKernel*, MemoryDenyWriteExecute
- SystemCallFilter @system-service ~@privileged ~@debug ~@mount etc.
- RestrictAddressFamilies blocks raw/packet sockets

Application layer:
- Per-user rate limit 60/hour (configurable via AGENT_RATE_MAX)
- Per-user concurrency cap of 1 in-flight (no parallel claude burns)
- JSONL audit log of every /agent/ask to /var/log/overlord-agent/audit.jsonl
  Logs username, message preview, result preview, timing, errors.

Plus secrets migration: EnvironmentFile now prefers /etc/overlord/agent.env
(root:erik 0640) over /home/erik/MosswartOverlord/.env, so even the
read-only /home doesn't expose them. Falls back to old path during
transition.
2026-04-25 21:25:40 +02:00
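
The application-layer limits in 9d4c724b7f (60 requests per hour and one in-flight request per user) can be sketched as a sliding-window gate. The class and method names are hypothetical, and this is an illustration of the technique rather than the commit's implementation. It is not thread-safe as written.

```python
import time
from collections import defaultdict, deque


class UserGate:
    """Sliding-window rate limit plus a per-user in-flight cap."""

    def __init__(self, max_per_window=60, window_seconds=3600,
                 max_in_flight=1, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.max_in_flight = max_in_flight
        self._clock = clock
        self._hits = defaultdict(deque)      # user -> timestamps of accepted requests
        self._in_flight = defaultdict(int)   # user -> currently running requests

    def try_acquire(self, user):
        """Return True and record the request if the user may proceed."""
        now = self._clock()
        hits = self._hits[user]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if self._in_flight[user] >= self.max_in_flight:
            return False
        if len(hits) >= self.max_per_window:
            return False
        hits.append(now)
        self._in_flight[user] += 1
        return True

    def release(self, user):
        """Mark one of the user's requests as finished."""
        if self._in_flight[user] > 0:
            self._in_flight[user] -= 1
```

A caller would wrap each request in `try_acquire` / `release` (ideally with `try`/`finally`), and could read the window maximum from the `AGENT_RATE_MAX` environment variable mentioned above.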
Erik
4ae18536be feat(agent): cross-char search_items tool + bump timeouts
Adds an MCP tool wrapping the inventory-service /search/items endpoint
with include_all_characters=true, so questions like 'find me a bracelet
with Legendary Acid Ward on any unequipped char' resolve in ONE tool call
instead of looping get_inventory over 60+ chars (which timed out at 120s).

- agent/tools.py: search_items_global wrapper
- agent/mcp_overlord.py: register new tool with detailed schema doc
- agent/claude_wrapper.py: include in --allowed-tools whitelist;
  bump timeout 120s -> 240s
- nginx/overlord.conf: bump /api/agent/ proxy timeout 180s -> 300s
- CLAUDE.md: brief Claude to USE search_items for cross-char searches
2026-04-25 21:13:26 +02:00
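
As a rough illustration of the single-call search in 4ae18536be, the sketch below only builds the request URL. The base URL and the `text` filter name are hypothetical; the `/search/items` path and `include_all_characters=true` are the only parts taken from the commit message.

```python
from urllib.parse import urlencode


def build_search_items_url(base_url, **filters):
    """Build a /search/items URL that always spans every character."""
    # Skip filters the caller left unset.
    params = {k: v for k, v in filters.items() if v is not None}
    # Forced on, so one request replaces a per-character loop.
    params["include_all_characters"] = "true"
    return f"{base_url.rstrip('/')}/search/items?{urlencode(params)}"


# Hypothetical host and filter name, for illustration only.
url = build_search_items_url("http://inventory.invalid:8000/", text="bracelet")
```

Setting the flag inside the wrapper, rather than exposing it as a tool parameter, keeps the model from accidentally falling back to per-character queries.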
Erik
d3943e894c fix(agent): SECURITY — replace bypassPermissions with dontAsk
bypassPermissions ignores --allowed-tools entirely (per
permission-modes.md docs). With it, the model could call Bash, Write,
Edit, Read, etc. — confirmed by writing /tmp/owned.sh in a test.

dontAsk is the correct production headless mode: auto-DENIES anything
outside the --allowed-tools whitelist instead of prompting. Without
this, our entire MCP whitelist was effectively useless.
2026-04-25 21:05:53 +02:00
Erik
6d5819d297 fix(agent): use --resume on existing sessions, --session-id only for new
Claude Code rejects --session-id on a session that already exists on disk
('Session ID ... is already in use'). The first message of a conversation
must use --session-id to create; every message after must use --resume.

Detect by checking ~/.claude/projects/<encoded-cwd>/<uuid>.jsonl. Plus a
belt-and-suspenders retry: if --session-id surprisingly fails with the
'already in use' string, automatically retry with --resume.

This was the bug that caused chat windows to fail on the second message.
2026-04-25 20:51:46 +02:00
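
The flag selection described in 6d5819d297 can be sketched as follows. The function names are hypothetical, and the project directory is passed in as a parameter because the cwd encoding used under `~/.claude/projects/` is not spelled out here.

```python
import tempfile
from pathlib import Path


def session_args(project_dir: Path, session_id: str) -> list[str]:
    """Use --resume if the session's JSONL already exists, else --session-id."""
    if (project_dir / f"{session_id}.jsonl").exists():
        return ["--resume", session_id]
    return ["--session-id", session_id]


def should_retry_with_resume(stderr: str) -> bool:
    """Fallback check for the 'already in use' error quoted in the commit."""
    return "already in use" in stderr


# Demonstration against a temporary directory standing in for the project dir.
with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp)
    sid = "00000000-0000-0000-0000-000000000000"
    first_call = session_args(project_dir, sid)      # no file yet
    (project_dir / f"{sid}.jsonl").write_text("{}\n")
    later_call = session_args(project_dir, sid)      # file now exists
```

The string check is the retry path: if a `--session-id` call fails and `should_retry_with_resume` matches its stderr, the same request is reissued with `--resume`.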
Erik
a3353e572d fix(agent): whitelist MCP tools + bypass permissions for unattended service
2026-04-25 20:46:42 +02:00
Erik
79cf88d3f7 feat(agent): Phase 1 — chat-window AI assistant via Claude Code subprocess
Adds an in-dashboard AI assistant that answers questions about live game
state. Designed reactively (no background loops) — every user message in
the chat window or via /api/agent/ask runs one `claude -p` invocation.

Architecture:
- New host-side FastAPI service (agent/) on 127.0.0.1:8767, OUTSIDE the
  dereth-tracker Docker container because `claude` and ~/.claude
  credentials live on the host.
- nginx routes /api/agent/* to the host service.
- The same browser session cookie the tracker issues authenticates
  agent requests (shared SECRET_KEY).
- The agent shells out to `claude -p --session-id <uuid>` with
  cwd=/home/erik/MosswartOverlord. Sessions persist as JSONL on disk
  via Claude Code's built-in machinery.
- An MCP stdio server (agent/mcp_overlord.py) exposes tools to Claude:
  get_live_players, get_recent_rares, query_telemetry_db (read-only,
  parsed by sqlglot to reject DML/DDL), get_player_state, get_inventory,
  get_inventory_search, get_combat_stats, get_equipment_cantrips,
  get_quest_status, get_server_health, suitbuilder_search.
- Read-only PG role (overlord_agent_ro) is the second line of defense
  on the SQL tool — even a parser bypass can't mutate.

Frontend:
- AgentWindow.tsx — draggable chat window with localStorage-pinned
  session UUID, "New Chat" button, on-mount rehydration from
  /agent/sessions/{id}/history (parses Claude Code's JSONL).
- Wired into WindowRenderer + Sidebar (🤖 Assistant button).

Operational:
- systemd unit (overlord-agent.service) + install.sh.
- agent/README.md documents env vars, deploy flow, smoke tests.
- nginx/overlord.conf gets a new /api/agent/ location with 180s timeout.
- CLAUDE.md gets an "Overlord Assistant Mode" section briefing the
  agent on which tools to use and how to behave.

NOT YET DEPLOYED — server still needs:
1. Apply agent/sql/0001_overlord_agent_ro.sql + ALTER ROLE password
2. Add AGENT_DB_DSN to /home/erik/MosswartOverlord/.env
3. bash agent/install.sh (creates venv, installs unit, starts service)
4. sudo cp /home/erik/MosswartOverlord/nginx/overlord.conf to nginx + reload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 20:43:59 +02:00