fix(live): window 'online' on server receive-time, not client clock

The player-count flapping was caused by client clock skew. Telemetry is stamped
with the game machine's DateTime.UtcNow (WebSocket.cs), and the machines' clocks
differ by up to ~90s: per-character offsets span -31s..+59s at a steady 6s
cadence. A wrong server clock would shift every offset by the same amount, so
the spread shows the clients disagree with each other, and a timestamp 59s in
the future cannot be explained by lag, which only makes rows look older. /live
windowed on that client timestamp, so characters whose skewed clock put them
near the 30s boundary blinked in and out.
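
The spread argument above can be sketched in a few lines. This is an
illustration only: the sample rows, the character names, and the helper names
per_char_offsets and offset_spread are hypothetical, and timestamps are assumed
to be plain epoch seconds.

```python
from statistics import median


def per_char_offsets(samples):
    """Median (client_ts - received_ts) per character, in seconds.

    samples: iterable of (character, client_ts, received_ts) tuples,
    all in epoch seconds (an assumption made for this sketch).
    """
    by_char = {}
    for name, client_ts, received_ts in samples:
        by_char.setdefault(name, []).append(client_ts - received_ts)
    return {name: median(vals) for name, vals in by_char.items()}


def offset_spread(offsets):
    """Distance between the most-ahead and most-behind client clocks."""
    return max(offsets.values()) - min(offsets.values())


# Hypothetical samples: (character, client timestamp, server receive time).
samples = [
    ("Alpha", 1000.0 - 31.0, 1000.0),  # clock 31s behind
    ("Alpha", 1006.0 - 31.0, 1006.0),
    ("Bravo", 1001.0 + 59.0, 1001.0),  # clock 59s ahead
    ("Bravo", 1007.0 + 59.0, 1007.0),
    ("Carol", 1002.0 + 2.0, 1002.0),   # roughly in sync
]

offsets = per_char_offsets(samples)
spread = offset_spread(offsets)

# A wrong server clock subtracts the same constant from every offset,
# so it moves all values together and leaves the spread untouched.
server_error = 45.0
shifted = {name: off - server_error for name, off in offsets.items()}
shifted_spread = offset_spread(shifted)
```

Because the spread survives any constant shift, a non-zero spread can only come
from the clients disagreeing with each other.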

Fix: stamp each telemetry row with the server's receive time (received_at) and
window the /live 'online' query on COALESCE(received_at, timestamp) instead of
the client timestamp. A coarse 10-minute bound on timestamp is kept only so
TimescaleDB can still prune chunks. The column is added idempotently in
init_db_async, and COALESCE falls back to the client timestamp for
pre-migration rows. Verified on the live DB: the query is valid, runs in 8ms,
and returns equivalent results before received_at is populated. CPU cost is
negligible (one datetime.now() per insert at ~14 inserts/sec).
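
A minimal sketch of the windowing change, not the actual /live query. It
assumes an in-memory SQLite table as a stand-in for the TimescaleDB hypertable,
epoch-second floats instead of TIMESTAMPTZ, and hypothetical character names
and constants; only the COALESCE fallback and the coarse timestamp bound are
taken from the description above.

```python
import sqlite3

WINDOW_S = 30   # "online" window, per the 30s boundary mentioned above
PRUNE_S = 600   # coarse 10-minute bound kept for chunk pruning

# Old behaviour: window on the client's self-reported timestamp.
OLD_SQL = """
SELECT DISTINCT character_name
FROM telemetry_events
WHERE timestamp > :now - :window
ORDER BY character_name
"""

# New behaviour: window on server receive time, falling back to the
# client timestamp for rows written before received_at existed.
NEW_SQL = """
SELECT DISTINCT character_name
FROM telemetry_events
WHERE timestamp > :now - :prune
  AND COALESCE(received_at, timestamp) > :now - :window
ORDER BY character_name
"""


def online(conn, sql, now):
    """Return the sorted list of character names the query considers online."""
    params = {"now": now, "window": WINDOW_S, "prune": PRUNE_S}
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telemetry_events "
    "(character_name TEXT, timestamp REAL, received_at REAL)"
)
now = 10_000.0
conn.executemany(
    "INSERT INTO telemetry_events VALUES (?, ?, ?)",
    [
        ("Slow", now - 3 - 31, now - 3),  # clock 31s behind, received 3s ago
        ("Fast", now - 3 + 59, now - 3),  # clock 59s ahead, received 3s ago
        ("Gone", now - 120, now - 120),   # last seen 2 minutes ago
        ("Legacy", now - 5, None),        # pre-migration row, no received_at
    ],
)

old_online = online(conn, OLD_SQL, now)
new_online = online(conn, NEW_SQL, now)
```

In this sketch the old query drops "Slow" even though its row arrived 3 seconds
ago, while the new query keeps it and still treats the pre-migration row by its
client timestamp.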

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Erik 2026-06-23 23:34:35 +02:00
parent 645feef9aa
commit 0565a54ae5
2 changed files with 36 additions and 3 deletions

@@ -49,6 +49,11 @@ telemetry_events = Table(
    Column("cpu_pct", Float, nullable=True),
    Column("mem_handles", Integer, nullable=True),
    Column("latency_ms", Float, nullable=True),
    # Server-side receive time. The `timestamp` column above is the CLIENT's
    # self-reported wall clock and drifts up to ~90s across machines, so the
    # "online" window must use this server-stamped value instead (see /live
    # cache query). Nullable so pre-migration rows fall back to `timestamp`.
    Column("received_at", DateTime(timezone=True), nullable=True),
)
# Composite index to accelerate Grafana queries filtering by character_name then ordering by timestamp
Index(
@@ -256,6 +261,18 @@ async def init_db_async():
        print(
            f"Warning: failed to create composite index ix_telemetry_events_char_ts: {e}"
        )
    # Add the server-receive-time column to existing deployments (idempotent).
    # Used as the clock-skew-proof basis for the "online" window in /live.
    try:
        with engine.connect() as conn:
            conn.execute(
                text(
                    "ALTER TABLE telemetry_events "
                    "ADD COLUMN IF NOT EXISTS received_at TIMESTAMPTZ"
                )
            )
    except Exception as e:
        print(f"Warning: failed to add telemetry_events.received_at column: {e}")
    # Add retention and compression policies on the hypertable
    try:
        with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn: