Production Runbook

Operational reference for running Race Platform in production. Pair with Deployment for the architecture context — this page is the "what to do when things break" companion.

On-call escalation today is @revanth (the founder). When a larger team comes online this list grows; for now treat any incident that survives the playbook below as a page.

Stack at a glance

| Layer | Production | Local |
| --- | --- | --- |
| API | Cloudflare Workers (race-api) | Node tsx watch in container race-api |
| Realtime | Cloudflare Durable Objects | Node ws in container race-realtime |
| DB | Supabase Postgres + Auth + Realtime | Container race-postgres |
| Object storage | Cloudflare R2 | Container race-minio |
| KV / pub-sub | Cloudflare KV | Container race-redis |
| Queue | Cloudflare Queues | (in-process fire-and-forget) |
| TLS termination | Cloudflare edge | Traefik (Let's Encrypt) on the host |
| Web client | Cloudflare Pages | Container race-web :6166 |
| Docs | Cloudflare Pages | Container race-docs :6167 |

See Deployment for the full driver matrix.

The smoke test — first response to any incident

tools/smoke-test.sh is the canonical "is the stack alive" probe. Run it before doing anything else; if it passes, the problem is narrower than a total outage.

make smoke
# or
API_URL=https://api.race.example.com ./tools/smoke-test.sh

The smoke test exercises 6 end-to-end paths:

  1. /health returns ok
  2. Login as jamie@apexracing.io returns a JWT
  3. GET /events returns the seeded Sebring event
  4. GET /run-sheets returns the seeded run sheet
  5. GET /laps returns the seeded laps
  6. WebSocket session-room handshake completes

Any failed step prints the curl payload so the failure mode is self-diagnosing in 90% of cases.
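The six steps share one shape: run a probe, report pass or fail, and show the captured output on failure. A minimal sketch of such a step runner follows. The names run_step and failures are hypothetical, and the real tools/smoke-test.sh may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical step runner; not taken from tools/smoke-test.sh.

failures=0

# run_step NAME CMD [ARGS...]
# Runs CMD, prints PASS or FAIL with the step name, and on failure also
# prints whatever the command wrote to stdout/stderr.
run_step() {
  local name="$1"
  shift
  local output
  if output="$("$@" 2>&1)"; then
    printf 'PASS  %s\n' "$name"
  else
    printf 'FAIL  %s\n%s\n' "$name" "$output"
    failures=$((failures + 1))
  fi
}
```

A health probe could then be written as run_step "health" curl -fsS "$API_URL/health", with the script exiting non-zero at the end when failures is greater than 0.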

The status pill

Every authenticated screen surfaces a sync status pill in the shell's status strip (top-right of the ribbon row). Three states:

| Pill | Meaning | Action |
| --- | --- | --- |
| Online (solid green dot) | API + WS reachable, last successful poll < 30 s ago | None; system healthy |
| Reconnecting (pulsing amber dot) | Last poll failed but still under the 5-attempt retry window | Wait 30 s; if it persists, page on-call |
| Offline (solid red dot) | Five consecutive failed polls; client is in offline-cache mode | Run the smoke test; if green, refresh the client |

Click the pill to see the last-seen timestamps for /health, /auth/me, and the WS connection. The connection-status banner that appears below the pill on Offline carries the most recent error message — copy that into the incident channel before clearing.
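The three states can be read as a function of the number of consecutive failed polls. The sketch below models only that reading of the table. pill_state is a hypothetical helper, not the web client's code, and it ignores the 30-second freshness check on the Online state.

```shell
# pill_state CONSECUTIVE_FAILED_POLLS
# Maps a failure count to the pill state described in the table:
#   0      -> Online
#   1 to 4 -> Reconnecting (still inside the 5-attempt retry window)
#   5+     -> Offline
pill_state() {
  local failed="$1"
  if [ "$failed" -le 0 ]; then
    echo "Online"
  elif [ "$failed" -lt 5 ]; then
    echo "Reconnecting"
  else
    echo "Offline"
  fi
}
```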

Common operational tasks

Rotate the JWT signing secret

The JWT_SECRET env var is read at API cold-start and used to sign / verify every Bearer token.

# 1. Generate a strong secret
openssl rand -hex 64

# 2. Set it on the production API
wrangler secret put JWT_SECRET --name race-api

# 3. Roll the API workers (Cloudflare deploys are atomic)
wrangler deploy --name race-api

All existing JWTs are invalidated immediately. Users must re-authenticate. API keys (X-API-Key: race_…) are not affected — they're hashed against keyHash, not signed.
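Because a rotation logs every user out, it can be worth checking the new value before it is stored, for example to catch a truncated paste. The following is a sketch only; is_valid_hex_secret is a hypothetical helper that accepts exactly the shape openssl rand -hex 64 produces.

```shell
# is_valid_hex_secret SECRET
# Succeeds only for exactly 128 lowercase hex characters: 64 random bytes
# printed as two hex digits each.
is_valid_hex_secret() {
  local secret="$1"
  [ "${#secret}" -eq 128 ] || return 1
  case "$secret" in
    *[!0-9a-f]*) return 1 ;;  # any character outside 0-9a-f is rejected
  esac
  return 0
}
```

One possible use is to generate the secret into a variable, run is_valid_hex_secret "$secret", and only then hand the value to wrangler secret put JWT_SECRET --name race-api.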

Replay failed webhook deliveries

The webhook dispatcher persists each delivery before the first POST attempt. After three failed retries, the row is final until manually replayed.

In the admin UI:

  1. Admin → Integrations → Webhooks
  2. Pick the failing subscription
  3. Open the Recent Deliveries drill-down
  4. Click the refresh icon next to any failed row

Programmatically:

# List failed deliveries
curl -H "Authorization: Bearer $TOKEN" \
"https://api.race.example.com/webhooks/$WEBHOOK_ID/deliveries?limit=500" \
| jq '.items[] | select(.succeededAt == null) | .id'

# Replay
curl -X POST -H "Authorization: Bearer $TOKEN" \
"https://api.race.example.com/webhooks/$WEBHOOK_ID/deliveries/$DELIVERY_ID/replay"

See Reference → Webhook Events for the retry / backoff specifics.
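When many deliveries have failed, the list and replay calls above can be chained. The sketch below assumes the same routes shown above; replay_failed_deliveries, API_BASE, and DRY_RUN are hypothetical names introduced for illustration. With DRY_RUN=1 it only prints the requests it would send.

```shell
# Base URL of the API; defaults to the host used in the examples above.
API_BASE="${API_BASE:-https://api.race.example.com}"

# replay_failed_deliveries WEBHOOK_ID
# Reads delivery IDs from stdin, one per line, and POSTs to the replay route
# for each. Blank lines are skipped. Expects TOKEN in the environment unless
# DRY_RUN=1 is set.
replay_failed_deliveries() {
  local webhook_id="$1" delivery_id url
  while IFS= read -r delivery_id; do
    [ -n "$delivery_id" ] || continue
    url="$API_BASE/webhooks/$webhook_id/deliveries/$delivery_id/replay"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf 'POST %s\n' "$url"
    else
      curl -fsS -X POST -H "Authorization: Bearer $TOKEN" "$url" >/dev/null \
        || printf 'replay failed: %s\n' "$delivery_id" >&2
    fi
  done
}
```

To feed it, pipe the list query through jq -r (raw output, so the IDs arrive without JSON quotes) into replay_failed_deliveries "$WEBHOOK_ID". Running once with DRY_RUN=1 first shows exactly which deliveries would be replayed.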

Clear the plugin cache

The plugin host caches compiled bundles in-memory per Worker isolate (pluginRuntime.compileCache). After a plugin upload the next invocation re-compiles, but if a bad bundle gets cached you can force a flush:

# Production — restart the Workers (rolls the isolates)
wrangler deploy --name race-api

# Local — restart the api container
docker-compose -f infra/docker/docker-compose.yml restart api

There's no /plugins/cache/clear route by design — the cache is process-local and doesn't survive a restart, so a redeploy is the canonical "drop everything cached" hammer.

Restore from backup

Automated backup and restore is currently a roadmap item. Phase 3 ships without automated point-in-time backups; Supabase's daily PITR is in play once we cut over from local Postgres.

Manual recovery procedure for the local stack (runs on the container's named volume):

# Take a snapshot
docker exec race-postgres pg_dump -U race race \
| gzip > backup-$(date +%Y%m%d).sql.gz

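# Restore (destructive). Stop the API first: DROP DATABASE fails while clients are connected.
# Each statement gets its own -c because DROP DATABASE cannot run inside the
# single transaction psql uses for a multi-statement -c string.
docker exec race-postgres psql -U race -d postgres \
-c "DROP DATABASE IF EXISTS race;" \
-c "CREATE DATABASE race;"
gunzip < backup-20260605.sql.gz \
| docker exec -i race-postgres psql -U race race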

For the production target (Supabase) the equivalent flow is their PITR console — see Supabase docs for the click-path. The runbook will grow a make backup and make restore once the production cutover happens.
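Since the local restore drops the database before loading the dump, a quick integrity check on the archive beforehand is cheap insurance. This is a sketch under that assumption; verify_backup is a hypothetical helper and only checks that the file is a readable gzip stream, not that the SQL inside is complete.

```shell
# verify_backup FILE
# Succeeds if FILE exists, is non-empty, and passes gzip's integrity test.
verify_backup() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  if ! gzip -t "$file" 2>/dev/null; then
    echo "corrupt gzip: $file" >&2
    return 1
  fi
  echo "ok: $file"
}
```

Running verify_backup backup-20260605.sql.gz before the DROP DATABASE step means a damaged archive is caught while the old data still exists.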

Re-run migrations

Migrations are append-only — never edit a committed file. Generate a new one and apply:

cd apps/api
DATABASE_URL=... pnpm exec drizzle-kit generate # creates a new SQL file
make migrate # applies pending files

Production deploys run migrations as a pre-deploy step in CI. Manual application from a workstation requires the production DATABASE_URL; treat that secret like a master key.
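The append-only rule can be checked mechanically before a branch is merged. The sketch below filters the output of git diff --name-status and flags any migration file that was modified, deleted, or renamed rather than added. The directory apps/api/drizzle is an assumption (the source does not name the migrations folder), and find_edited_migrations is a hypothetical helper.

```shell
# Assumed location of the generated SQL files; adjust to the real path.
MIGRATIONS_DIR="${MIGRATIONS_DIR:-apps/api/drizzle}"

# find_edited_migrations
# Reads `git diff --name-status` output on stdin. Prints every entry under
# MIGRATIONS_DIR whose status is not "A" (added) and exits 1 if any exist.
find_edited_migrations() {
  awk -v dir="$MIGRATIONS_DIR/" '
    index($2, dir) == 1 && $1 != "A" { print $1, $2; bad = 1 }
    END { exit bad + 0 }
  '
}
```

A typical invocation would be git diff --name-status origin/main...HEAD | find_edited_migrations, where origin/main stands in for whatever the base branch is.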

Re-seed demo data (local only)

./teardown.sh --wipe # nukes volumes
./setup.sh # brings everything back up + migrates + seeds

Never run make seed against production — it inserts a hardcoded jamie@apexracing.io user.
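One way to make that rule harder to break is a guard that refuses to proceed unless the database host is clearly local. The sketch below assumes the seed step reads DATABASE_URL and that local hosts are localhost, 127.0.0.1, or the race-postgres container; both are assumptions, and is_local_database_url is a hypothetical helper.

```shell
# is_local_database_url URL
# Succeeds only when the host part of URL is a known local host.
# Anything else, including an empty URL, is treated as non-local.
is_local_database_url() {
  local url="$1" host
  host="${url#*://}"       # drop the scheme
  host="${host##*@}"       # drop user:password@ if present
  host="${host%%[:/?]*}"   # cut at the first port, path, or query delimiter
  case "$host" in
    localhost|127.0.0.1|race-postgres) return 0 ;;
    *) return 1 ;;
  esac
}
```

A wrapper could then run is_local_database_url "$DATABASE_URL" && make seed, so a production URL left in the environment stops the command instead of seeding the demo user.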

Incident playbooks

API down (5xx storm or no response)

  1. Run the smoke test. If /health 200s but /events 5xxs the issue is downstream of the Worker (DB, R2, KV).
  2. Check the API logs.
    • Production: wrangler tail race-api --format pretty
    • Local: docker-compose -f infra/docker/docker-compose.yml logs -f api
  3. Common signatures:
    • getaddrinfo ENOTFOUND → DNS / network from the Worker
    • relation "x" does not exist → migrations didn't apply → make migrate
    • JWT_SECRET is required → env var missing on cold start
  4. If the DB is up but the API is wedged, a single Worker restart usually clears it (wrangler deploy or docker-compose restart api).
  5. Page @revanth if the smoke test is still red after a restart.
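The signatures in step 3 lend themselves to a small lookup that turns a log line into the suggested next action. This is a sketch of that mapping only; suggest_action is a hypothetical helper, and the three patterns are the ones listed above.

```shell
# suggest_action LOG_LINE
# Prints the playbook hint for a known error signature.
# Returns 1 (and prints nothing) when the line matches no signature.
suggest_action() {
  case "$1" in
    *"getaddrinfo ENOTFOUND"*)
      echo "DNS / network from the Worker" ;;
    *"relation "*" does not exist"*)
      echo "migrations did not apply: run make migrate" ;;
    *"JWT_SECRET is required"*)
      echo "JWT_SECRET env var missing on cold start" ;;
    *)
      return 1 ;;
  esac
}
```

Feeding it the last few hundred log lines one at a time and counting the distinct hints (for example with sort | uniq -c) gives a quick picture of which failure mode dominates.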

DB down (Postgres / Supabase)

  1. Check Supabase status page (production) or docker-compose -f infra/docker/docker-compose.yml ps postgres (local).
  2. Local recovery:
    docker-compose -f infra/docker/docker-compose.yml restart postgres
    make migrate
  3. Production recovery: open Supabase support if the project is reporting unhealthy. Most outages are read-only mode triggered by quota — check the dashboard's connection-pool / disk usage panels first.
  4. The API caches nothing across restarts — bringing the DB back restores service immediately. No client-side migration needed.
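In the local recovery step, Postgres may still be starting when make migrate runs right after the restart. A small retry helper can bridge that gap. This is a generic sketch, not part of the repository; retry is a hypothetical helper, and the readiness probe mentioned below uses pg_isready, which ships with Postgres.

```shell
# retry ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds, sleeping DELAY_SECONDS between tries.
# Gives up and returns 1 after ATTEMPTS failed tries.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For example, retry 30 2 docker exec race-postgres pg_isready -U race -d race && make migrate waits up to roughly a minute for the database to accept connections before applying migrations.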

TLS / certificate issues

Production certs are managed by Cloudflare at the edge — nothing to renew. Local Traefik renews Let's Encrypt automatically; if it lapses:

docker-compose -f infra/docker/docker-compose.yml restart traefik
docker-compose -f infra/docker/docker-compose.yml logs traefik | grep -i acme

The Traefik logs will reveal the renewal path. Most failures are DNS validation issues (the dev domain points at the wrong IP) or rate-limit lockouts (5 failed renewals in 1 hour locks the host out for an hour).

Realtime room dropping subscribers

WebSocket sessions are scoped per session-room. Symptoms: clients disconnect every ~60 s, multi-user editing stops propagating.

  1. Check the realtime container is alive: docker-compose ps realtime
  2. Tail logs for subscriber timeout / pong missed lines: docker-compose logs -f realtime
  3. Restart the Durable Object / ws server: docker-compose restart realtime (local) / wrangler deploy --name race-realtime (prod)
  4. Clients auto-reconnect with exponential backoff — no client-side intervention needed.
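For a sense of how long clients take to come back after a restart, the sketch below computes a capped exponential backoff schedule. The 1-second base and 30-second cap are illustrative assumptions, not values taken from the client, and backoff_delay is a hypothetical helper.

```shell
# backoff_delay ATTEMPT [BASE_SECONDS] [CAP_SECONDS]
# Prints the wait before reconnect attempt number ATTEMPT (0-based):
# BASE doubled ATTEMPT times, never exceeding CAP.
backoff_delay() {
  local attempt="$1" base="${2:-1}" cap="${3:-30}"
  local delay="$base" i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}
```

With these defaults the waits run 1, 2, 4, 8, 16 seconds and then stay at 30. Real clients commonly add random jitter on top so that reconnects do not arrive in lockstep.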

Useful one-liners

# Most recent errors and warnings in the API logs
docker-compose logs --tail=200 api | grep -iE "error|warn"

# How many open issues are critical right now
curl -s -H "Authorization: Bearer $TOKEN" \
"$API_URL/issues?severity=critical&stateId=todo" \
| jq '.items | length'

# Webhook subscriptions whose most recent delivery failed (HTTP status >= 400)
curl -s -H "Authorization: Bearer $TOKEN" "$API_URL/webhooks" \
| jq '.items[] | select(.lastDeliveryStatus != null and .lastDeliveryStatus >= 400)'

# DB row counts for a fresh-instance sanity check
docker exec race-postgres psql -U race -d race \
-c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 20;"

On-call escalation

Today: @revanth (Slack DM or email).

Future tiers (once the team grows):

  1. L1 — on-call engineer, 15 min response SLA for sev-1
  2. L2 — platform engineer, 1 hour response SLA for sev-2
  3. L3 — founder (@revanth), final escalation

Always file a progress.md learnings entry after the incident is closed, and (if the playbook missed something) update this page in the same PR.

What's coming

  • Automated daily backups with PITR for the production DB
  • Synthetic monitoring running the smoke test every 5 min from multiple regions
  • Per-account quota alerts when an account approaches the rate-limit ceiling
  • PagerDuty / Opsgenie hookup once on-call rotates beyond one person