Production Runbook

Operational reference for running Race Platform in production. Pair with Deployment for the architecture context — this page is the "what to do when things break" companion.

On-call escalation today is @revanth (the founder). When a larger team comes online this list grows; for now treat any incident that survives the playbook below as a page.

Stack at a glance

Layer	Production	Local
API	Cloudflare Workers (`race-api`)	Node `tsx watch` in container `race-api`
Realtime	Cloudflare Durable Objects	Node `ws` in container `race-realtime`
DB	Supabase Postgres + Auth + Realtime	Container `race-postgres`
Object storage	Cloudflare R2	Container `race-minio`
KV / pub-sub	Cloudflare KV	Container `race-redis`
Queue	Cloudflare Queues	(in-process fire-and-forget)
TLS termination	Cloudflare edge	Traefik (Let's Encrypt) on the host
Web client	Cloudflare Pages	Container `race-web` :6166
Docs	Cloudflare Pages	Container `race-docs` :6167

See Deployment for the full driver matrix.

The smoke test — first response to any incident

tools/smoke-test.sh is the canonical "is the stack alive" probe. Run it before doing anything else; if it passes, the problem is narrower than a total outage.

make smoke
# or
API_URL=https://api.race.example.com ./tools/smoke-test.sh

The smoke test exercises 6 end-to-end paths:

/health returns ok
Login as jamie@apexracing.io returns a JWT
GET /events returns the seeded Sebring event
GET /run-sheets returns the seeded run sheet
GET /laps returns the seeded laps
WebSocket session-room handshake completes

Any failed step prints the curl payload so the failure mode is self-diagnosing in 90% of cases.
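The pass/fail structure of a probe like this can be sketched as follows. This is an illustration, not the contents of tools/smoke-test.sh; `run_step` and `smoke_summary` are hypothetical helper names, and the commented example mirrors only the first step listed above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a step runner; NOT the real tools/smoke-test.sh.
PASSED=0
FAILED=0

# run_step <label> <command...>: run a command, record the result,
# and print the captured output when the step fails.
run_step() {
  local label="$1"
  shift
  local output
  if output=$("$@" 2>&1); then
    PASSED=$((PASSED + 1))
    echo "PASS  $label"
  else
    FAILED=$((FAILED + 1))
    echo "FAIL  $label"
    echo "      $output"
  fi
}

# smoke_summary: print totals; return non-zero if any step failed.
smoke_summary() {
  echo "$PASSED passed, $FAILED failed"
  [ "$FAILED" -eq 0 ]
}

# Example wiring (commented out so sourcing this file makes no network calls):
# run_step "/health returns ok" curl -fsS "$API_URL/health"
# smoke_summary
```

Printing the captured output only on failure keeps a green run short while still surfacing the payload that explains a red one.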

The status pill

Every authenticated screen surfaces a sync status pill in the shell's status strip (top-right of the ribbon row). Three states:

Pill	Meaning	Action
Online (solid green dot)	API + WS reachable, last successful poll < 30 s ago	None — system healthy
Reconnecting (pulsing amber dot)	Last poll failed but still under the 5-attempt retry window	Wait 30 s; if it persists, page on-call
Offline (solid red dot)	Five consecutive failed polls; client is in offline-cache mode	Run the smoke test; if green, refresh the client

Click the pill to see the last-seen timestamps for /health, /auth/me, and the WS connection. The connection-status banner that appears below the pill on Offline carries the most recent error message — copy that into the incident channel before clearing.
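The thresholds in the table can be read as a small state function. The sketch below is an illustration in shell, not the web client's code; `pill_state` and its single input (the count of consecutive failed polls) are assumed names, and the 30 s freshness check on the last successful poll is left out for brevity.

```shell
#!/usr/bin/env bash
# pill_state <consecutive_failed_polls>
# Hypothetical helper mapping a failure count to the pill states in the table.
pill_state() {
  local failures="$1"
  if [ "$failures" -le 0 ]; then
    echo "Online"          # last poll succeeded
  elif [ "$failures" -lt 5 ]; then
    echo "Reconnecting"    # still inside the 5-attempt retry window
  else
    echo "Offline"         # five or more consecutive failures
  fi
}
```

For example, `pill_state 3` prints Reconnecting and `pill_state 5` prints Offline.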

Common operational tasks

Rotate the JWT signing secret

The JWT_SECRET env var is read at API cold-start and used to sign / verify every Bearer token.

# 1. Generate a strong secret
openssl rand -hex 64

# 2. Set it on the production API (wrangler prompts for the value; paste the output of step 1)
wrangler secret put JWT_SECRET --name race-api

# 3. Roll the API workers (Cloudflare deploys are atomic)
wrangler deploy --name race-api

All existing JWTs are invalidated immediately. Users must re-authenticate. API keys (X-API-Key: race_…) are not affected — they're hashed against keyHash, not signed.
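If you would rather not have the secret appear in terminal scrollback, steps 1 and 2 can be combined by piping the generated value into wrangler, which reads the secret from stdin when it is not attached to a terminal. A minimal sketch, assuming wrangler is already authenticated; `gen_jwt_secret` and `rotate_jwt_secret` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# gen_jwt_secret: 64 random bytes, hex-encoded (128 characters).
gen_jwt_secret() {
  openssl rand -hex 64
}

# rotate_jwt_secret: pipe a fresh secret straight into wrangler so the
# value is never echoed to the terminal.
rotate_jwt_secret() {
  gen_jwt_secret | wrangler secret put JWT_SECRET --name race-api
}

# Usage (commented out so sourcing this file changes nothing):
# rotate_jwt_secret
```

The trade-off is that nobody ever sees the value, so it cannot be copied elsewhere afterwards; rotate again if another system needs the same secret.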

Replay failed webhook deliveries

The webhook dispatcher persists each delivery before the first POST attempt. After three failed retries, the row is final until manually replayed.

In the admin UI:

Admin → Integrations → Webhooks
Pick the failing subscription
Open the Recent Deliveries drill-down
Click the refresh icon next to any failed row

Programmatically:

# List deliveries with no recorded success (may include ones still retrying)
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.race.example.com/webhooks/$WEBHOOK_ID/deliveries?limit=500" \
  | jq '.items[] | select(.succeededAt == null) | .id'

# Replay
curl -X POST -H "Authorization: Bearer $TOKEN" \
  "https://api.race.example.com/webhooks/$WEBHOOK_ID/deliveries/$DELIVERY_ID/replay"

See Reference → Webhook Events for the retry / backoff specifics.
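When many deliveries failed at once, the two calls above can be chained. The sketch below is one way to do that under stated assumptions: `replay_url` and `replay_all` are hypothetical helper names, and API_URL, WEBHOOK_ID, and TOKEN are expected in the environment. It uses only the replay endpoint already shown.

```shell
#!/usr/bin/env bash
# Sketch only: requires API_URL, WEBHOOK_ID and TOKEN in the environment.

# replay_url <delivery_id>: build the replay endpoint for one delivery.
replay_url() {
  printf '%s/webhooks/%s/deliveries/%s/replay\n' "$API_URL" "$WEBHOOK_ID" "$1"
}

# replay_all: read delivery IDs from stdin (one per line) and replay each.
# Returns non-zero if any replay request fails.
replay_all() {
  local id rc=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue   # skip blank lines
    if curl -fsS -X POST -H "Authorization: Bearer $TOKEN" \
        "$(replay_url "$id")" >/dev/null; then
      echo "replayed $id"
    else
      echo "FAILED   $id" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage: feed it the listing query, with jq -r so IDs arrive unquoted:
# curl -s -H "Authorization: Bearer $TOKEN" \
#   "$API_URL/webhooks/$WEBHOOK_ID/deliveries?limit=500" \
#   | jq -r '.items[] | select(.succeededAt == null) | .id' | replay_all
```

Failures go to stderr and do not stop the loop, so one bad row does not block the rest of the batch.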

Clear the plugin cache

The plugin host caches compiled bundles in-memory per Worker isolate (pluginRuntime.compileCache). After a plugin upload the next invocation re-compiles, but if a bad bundle gets cached you can force a flush:

# Production — restart the Workers (rolls the isolates)
wrangler deploy --name race-api

# Local — restart the api container
docker-compose -f infra/docker/docker-compose.yml restart api

There's no /plugins/cache/clear route by design — the cache is process-local and doesn't survive a restart, so a redeploy is the canonical "drop everything cached" hammer.

Restore from backup

Automated backup and restore is currently a roadmap item. Phase 3 ships without automated point-in-time backups; Supabase's daily backups with PITR come into play once we cut over from local Postgres.

Manual recovery procedure for the local stack (runs on the container's named volume):

# Take a snapshot
docker exec race-postgres pg_dump -U race race \
  | gzip > backup-$(date +%Y%m%d).sql.gz

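# Restore (destructive). Stop the api container first: DROP DATABASE fails
# while other sessions are connected. Each statement needs its own -c, because
# a multi-statement -c string runs as one transaction and DROP DATABASE cannot.
docker exec -i race-postgres psql -U race -d postgres \
  -c "DROP DATABASE IF EXISTS race;" -c "CREATE DATABASE race;"
gunzip < backup-20260605.sql.gz \
  | docker exec -i race-postgres psql -U race race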

For the production target (Supabase) the equivalent flow is their PITR console — see Supabase docs for the click-path. The runbook will grow a make backup and make restore once the production cutover happens.
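Before running the destructive restore step in the local procedure above, it is worth confirming that the snapshot file is intact. A minimal sketch, with `check_backup` as a hypothetical helper name; it verifies only the gzip container, not that the SQL inside is complete or valid.

```shell
#!/usr/bin/env bash
# check_backup <file>: succeed only if the file exists, is non-empty,
# and passes gzip's integrity test.
check_backup() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  if ! gzip -t "$file" 2>/dev/null; then
    echo "corrupt gzip: $file" >&2
    return 1
  fi
  echo "ok: $file"
}

# Usage before dropping the database:
# check_backup backup-20260605.sql.gz && echo "safe to restore"
```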

Re-run migrations

Migrations are append-only — never edit a committed file. Generate a new one and apply:

cd apps/api
DATABASE_URL=... pnpm exec drizzle-kit generate   # creates a new SQL file
make migrate                                       # applies pending files

Production deploys run migrations as a pre-deploy step in CI. Manual application from a workstation requires the production DATABASE_URL; treat that secret like a master key.

Re-seed demo data (local only)

./teardown.sh --wipe   # nukes volumes
./setup.sh             # brings everything back up + migrates + seeds

Never run make seed against production — it inserts a hardcoded jamie@apexracing.io user.
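One way to make that rule harder to break is a guard that inspects DATABASE_URL before seeding. The sketch below is an assumption, not something the repo ships: `db_host` and `is_local_db` are hypothetical helpers, and the allow-list of local hostnames is a guess based on the container names in the stack table.

```shell
#!/usr/bin/env bash
# db_host <url>: extract the host from postgres://user:pass@host:port/db
db_host() {
  local rest="${1#*://}"   # drop the scheme
  rest="${rest%%/*}"       # drop the path (database name)
  rest="${rest##*@}"       # drop credentials, if any
  echo "${rest%%:*}"       # drop the port
}

# is_local_db <url>: succeed only for hosts treated as local.
# The allow-list is an assumption; adjust it to your environment.
is_local_db() {
  case "$(db_host "$1")" in
    localhost|127.0.0.1|race-postgres) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
# is_local_db "$DATABASE_URL" && make seed
```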

Incident playbooks

API down (5xx storm or no response)

Run the smoke test. If /health 200s but /events 5xxs the issue is downstream of the Worker (DB, R2, KV).
Check the API logs.
- Production: wrangler tail race-api --format pretty
- Local: docker-compose -f infra/docker/docker-compose.yml logs -f api
Common signatures:
- getaddrinfo ENOTFOUND → DNS / network from the Worker
- relation "x" does not exist → migrations didn't apply → make migrate
- JWT_SECRET is required → env var missing on cold start
If the DB is up but the API is wedged, a single Worker restart usually clears it (wrangler deploy or docker-compose restart api).
Page @revanth if the smoke test is still red after a restart.
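The common signatures listed in this playbook can be turned into a quick triage filter for log output. A minimal sketch; `triage_log_line` is a hypothetical helper, and the patterns are exactly the three signatures named above and nothing more.

```shell
#!/usr/bin/env bash
# triage_log_line <line>: map a log line to the cause suggested in the playbook.
triage_log_line() {
  case "$1" in
    *"getaddrinfo ENOTFOUND"*)   echo "DNS / network from the Worker" ;;
    *relation*"does not exist"*) echo "migrations not applied: run make migrate" ;;
    *"JWT_SECRET is required"*)  echo "JWT_SECRET env var missing on cold start" ;;
    *)                           echo "unknown signature" ;;
  esac
}

# Usage against saved log output, one line at a time:
# while IFS= read -r line; do triage_log_line "$line"; done < api.log | sort | uniq -c
```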

DB down (Postgres / Supabase)

Check the Supabase status page (production) or docker-compose -f infra/docker/docker-compose.yml ps postgres (local).

Local recovery:

docker-compose -f infra/docker/docker-compose.yml restart postgres
make migrate

Production recovery: open Supabase support if the project is reporting unhealthy. Most outages are read-only mode triggered by quota — check the dashboard's connection-pool / disk usage panels first.
The API caches nothing across restarts — bringing the DB back restores service immediately. No client-side migration needed.

TLS / certificate issues

Production certs are managed by Cloudflare at the edge — nothing to renew. Local Traefik renews Let's Encrypt automatically; if it lapses:

docker-compose -f infra/docker/docker-compose.yml restart traefik
docker-compose -f infra/docker/docker-compose.yml logs traefik | grep -i acme

The Traefik logs show which ACME challenge failed and why. Most failures are DNS validation issues (the dev domain points at the wrong IP) or Let's Encrypt rate-limit lockouts (5 failed validations for a hostname within 1 hour block further attempts for that hostname for up to an hour).

Realtime room dropping subscribers

WebSocket sessions are scoped per session-room. Symptoms: clients disconnect every ~60 s, multi-user editing stops propagating.

Check the realtime container is alive: docker-compose ps realtime
Tail logs for subscriber timeout / pong missed lines: docker-compose logs -f realtime
Restart the Durable Object / ws server: docker-compose restart realtime (local) / wrangler deploy --name race-realtime (prod)
Clients auto-reconnect with exponential backoff — no client-side intervention needed.
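To set expectations for how long reconnection can take after a restart, here is what a capped exponential backoff schedule looks like. The client's real parameters are not documented on this page; the 1 s base and 30 s cap below are assumptions chosen purely for illustration, and `backoff_delay` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# backoff_delay <attempt> [base_seconds] [cap_seconds]
# The delay doubles with each attempt (attempt 0 = base) until it hits the cap.
# Defaults of 1 s and 30 s are illustrative assumptions, not the client's values.
backoff_delay() {
  local attempt="$1" base="${2:-1}" cap="${3:-30}"
  local delay="$base" i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}

# Usage: print the schedule for the first six attempts
# for n in 0 1 2 3 4 5; do backoff_delay "$n"; done
```

With these assumed values, a client that starts retrying immediately after a restart waits 1, 2, 4, 8, 16, and then 30 seconds between attempts.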

Useful one-liners

# Most recent errors and warnings in the API logs (last 200 lines)
docker-compose logs --tail=200 api | grep -iE "error|warn"

# How many open issues are critical right now
curl -s -H "Authorization: Bearer $TOKEN" \
  "$API_URL/issues?severity=critical&stateId=todo" \
  | jq '.items | length'

# Webhook subscriptions whose most recent delivery failed (HTTP status >= 400)
curl -s -H "Authorization: Bearer $TOKEN" "$API_URL/webhooks" \
  | jq '.items[] | select(.lastDeliveryStatus != null and .lastDeliveryStatus >= 400)'

# Approximate DB row counts for a fresh-instance sanity check (n_live_tup is an estimate)
docker exec race-postgres psql -U race -d race \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 20;"

On-call escalation

Today: @revanth (Slack DM or email).

Future tiers (once the team grows):

L1 — on-call engineer, 15 min response SLA for sev-1
L2 — platform engineer, 1 hour response SLA for sev-2
L3 — founder (@revanth), final escalation

Always file a progress.md learnings entry after the incident is closed, and (if the playbook missed something) update this page in the same PR.

What's coming

Automated daily backups with PITR for the production DB
Synthetic monitoring running the smoke test every 5 min from multiple regions
Per-account quota alerts when an account approaches the rate-limit ceiling
PagerDuty / Opsgenie hookup once on-call rotates beyond one person
