Production Runbook
Operational reference for running Race Platform in production. Pair with Deployment for the architecture context — this page is the "what to do when things break" companion.
On-call escalation today is @revanth (the founder). When a larger team comes online this list grows; for now treat any incident that survives the playbook below as a page.
Stack at a glance
| Layer | Production | Local |
|---|---|---|
| API | Cloudflare Workers (race-api) | Node tsx watch in container race-api |
| Realtime | Cloudflare Durable Objects | Node ws in container race-realtime |
| DB | Supabase Postgres + Auth + Realtime | Container race-postgres |
| Object storage | Cloudflare R2 | Container race-minio |
| KV / pub-sub | Cloudflare KV | Container race-redis |
| Queue | Cloudflare Queues | (in-process fire-and-forget) |
| TLS termination | Cloudflare edge | Traefik (Let's Encrypt) on the host |
| Web client | Cloudflare Pages | Container race-web :6166 |
| Docs | Cloudflare Pages | Container race-docs :6167 |
See Deployment for the full driver matrix.
The smoke test — first response to any incident
tools/smoke-test.sh is the canonical "is the stack alive"
probe. Run it before doing anything else; if it passes the
problem is narrower than total outage.
make smoke
# or
API_URL=https://api.race.example.com ./tools/smoke-test.sh
The smoke test exercises 6 end-to-end paths:
- /health returns ok
- Login as jamie@apexracing.io returns a JWT
- GET /events returns the seeded Sebring event
- GET /run-sheets returns the seeded run sheet
- GET /laps returns the seeded laps
- WebSocket session-room handshake completes
Any failed step prints the curl payload so the failure mode is self-diagnosing in 90% of cases.
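The "print the payload on failure" behaviour can be sketched as a small wrapper around each probe. This is a minimal illustration only: the check function name and its output format are hypothetical, and tools/smoke-test.sh may be structured differently.

```shell
# Hypothetical helper, not taken from tools/smoke-test.sh.
# Runs one probe command, prints PASS or FAIL, and on failure
# dumps whatever the command wrote so the cause is visible.
check() {
  local name="$1"; shift
  local output
  if output="$("$@" 2>&1)"; then
    printf 'PASS  %s\n' "$name"
  else
    printf 'FAIL  %s\n' "$name"
    printf '      payload: %s\n' "$output"
    return 1
  fi
}

# Example invocation (not executed here; assumes API_URL is set):
#   check "/health" curl -fsS "$API_URL/health"
```

Declaring output separately from assigning it keeps the probe's exit status, which a combined local assignment would mask.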
The status pill
Every authenticated screen surfaces a sync status pill in the shell's status strip (top-right of the ribbon row). Three states:
| Pill | Meaning | Action |
|---|---|---|
| Online (solid green dot) | API + WS reachable, last successful poll < 30 s ago | None — system healthy |
| Reconnecting (pulsing amber dot) | Last poll failed but still under the 5-attempt retry window | Wait 30 s; if it persists, page on-call |
| Offline (solid red dot) | Five consecutive failed polls; client is in offline-cache mode | Run the smoke test; if green, refresh the client |
Click the pill to see the last-seen timestamps for /health,
/auth/me, and the WS connection. The connection-status banner
that appears below the pill on Offline carries the most recent
error message — copy that into the incident channel before
clearing.
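As a rough mental model, the three pill states in the table reduce to a count of consecutive failed polls. The sketch below is an assumption-laden simplification: pill_state is a hypothetical name, it ignores the 30 s freshness condition for Online, and the real client logic is not shown in this runbook.

```shell
# Hypothetical mapping from consecutive failed polls to pill state.
# Simplification: the "last successful poll < 30 s ago" check is omitted.
pill_state() {
  local failures="$1"
  if [ "$failures" -le 0 ]; then
    echo "Online"          # no failed polls
  elif [ "$failures" -lt 5 ]; then
    echo "Reconnecting"    # still inside the 5-attempt retry window
  else
    echo "Offline"         # five or more consecutive failures
  fi
}
```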
Common operational tasks
Rotate the JWT signing secret
The JWT_SECRET env var is read at API cold-start and used to
sign / verify every Bearer token.
# 1. Generate a strong secret
openssl rand -hex 64
# 2. Set it on the production API
wrangler secret put JWT_SECRET --name race-api
# 3. Roll the API workers (Cloudflare deploys are atomic)
wrangler deploy --name race-api
All existing JWTs are invalidated immediately. Users must
re-authenticate. API keys (X-API-Key: race_…) are not affected
— they're hashed against keyHash, not signed.
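To see why rotation invalidates every outstanding token, it helps to compute a signature by hand. The sketch below assumes the API signs with HMAC-SHA256 (HS256); the runbook only says JWT_SECRET signs and verifies tokens, so treat the algorithm as an assumption. The function names are hypothetical.

```shell
# Assumption: tokens are signed with HMAC-SHA256 (HS256).
# b64url converts binary stdin to unpadded base64url, as JWTs use.
b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }

# jwt_signature SECRET SIGNING_INPUT
# SIGNING_INPUT is the "<header>.<payload>" part of a token.
jwt_signature() {
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" -binary | b64url
}

# Example (not executed here):
#   jwt_signature "$OLD_SECRET" "$HEADER.$PAYLOAD"
#   jwt_signature "$NEW_SECRET" "$HEADER.$PAYLOAD"
```

The same header and payload produce a different signature under the new secret, so a token minted before the rotation no longer verifies.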
Replay failed webhook deliveries
The webhook dispatcher persists each delivery before the first POST attempt. After three failed retries, the row is final until manually replayed.
In the admin UI:
- Admin → Integrations → Webhooks
- Pick the failing subscription
- Open the Recent Deliveries drill-down
- Click the refresh icon next to any failed row
Programmatically:
# List failed deliveries
curl -H "Authorization: Bearer $TOKEN" \
"https://api.race.example.com/webhooks/$WEBHOOK_ID/deliveries?limit=500" \
| jq '.items[] | select(.succeededAt == null) | .id'
# Replay
curl -X POST -H "Authorization: Bearer $TOKEN" \
"https://api.race.example.com/webhooks/$WEBHOOK_ID/deliveries/$DELIVERY_ID/replay"
See Reference → Webhook Events for the retry / backoff specifics.
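When many deliveries have failed, the list and replay calls above can be chained. The following is a minimal sketch, not a shipped tool: replay_failed is a hypothetical helper, and it assumes API_URL, TOKEN, and WEBHOOK_ID are exported in the shell.

```shell
# Hypothetical helper: replays every delivery ID read from stdin,
# one ID per line. Assumes API_URL, TOKEN and WEBHOOK_ID are set.
replay_failed() {
  local id count=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue   # skip blank lines
    if curl -fsS -X POST -H "Authorization: Bearer $TOKEN" \
        "$API_URL/webhooks/$WEBHOOK_ID/deliveries/$id/replay" >/dev/null; then
      count=$((count + 1))
    else
      echo "replay failed for $id" >&2
    fi
  done
  echo "replayed $count deliveries"
}

# Example (not executed here). Note jq -r, so IDs arrive unquoted:
#   curl -fsS -H "Authorization: Bearer $TOKEN" \
#     "$API_URL/webhooks/$WEBHOOK_ID/deliveries?limit=500" \
#     | jq -r '.items[] | select(.succeededAt == null) | .id' \
#     | replay_failed
```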
Clear the plugin cache
The plugin host caches compiled bundles in-memory per Worker
isolate (pluginRuntime.compileCache). After a plugin upload
the next invocation re-compiles, but if a bad bundle gets
cached you can force a flush:
# Production — restart the Workers (rolls the isolates)
wrangler deploy --name race-api
# Local — restart the api container
docker-compose -f infra/docker/docker-compose.yml restart api
There's no /plugins/cache/clear route by design — the cache is
process-local and doesn't survive a restart, so a redeploy is
the canonical "drop everything cached" hammer.
Restore from backup
Automated backup and restore is currently a roadmap item. Phase 3 ships without automated point-in-time backups; Supabase's daily PITR comes into play once we cut over from local Postgres.
Manual recovery procedure for the local stack (runs on the container's named volume):
# Take a snapshot
docker exec race-postgres pg_dump -U race race \
| gzip > backup-$(date +%Y%m%d).sql.gz
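# Restore (destructive)
# DROP DATABASE cannot run inside a multi-statement -c string
# (psql wraps it in one transaction), so pass each statement separately.
docker exec race-postgres psql -U race -d postgres \
-c "DROP DATABASE IF EXISTS race;" \
-c "CREATE DATABASE race;"
gunzip < backup-20260605.sql.gz \
| docker exec -i race-postgres psql -U race race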
For the production target (Supabase) the equivalent flow is
their PITR console — see Supabase docs for the click-path.
The runbook will grow a make backup and make restore once
the production cutover happens.
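Because the restore step drops the database before loading the dump, it is worth confirming the archive is readable first. A minimal sketch of such a pre-check follows; verify_backup is a hypothetical name and is not part of the repository.

```shell
# Hypothetical pre-restore check: rejects a missing, empty,
# or corrupt gzip archive before anything destructive runs.
verify_backup() {
  local file="$1"
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  gzip -t < "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
  echo "ok: $file"
}

# Example (not executed here):
#   verify_backup backup-20260605.sql.gz && echo "safe to restore"
```

This only validates the gzip container; it does not prove the SQL inside is a complete dump.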
Re-run migrations
Migrations are append-only — never edit a committed file. Generate a new one and apply:
cd apps/api
DATABASE_URL=... pnpm exec drizzle-kit generate # creates a new SQL file
make migrate # applies pending files
Production deploys run migrations as a pre-deploy step in CI.
Manual application from a workstation requires the production
DATABASE_URL; treat that secret like a master key.
Re-seed demo data (local only)
./teardown.sh --wipe # nukes volumes
./setup.sh # brings everything back up + migrates + seeds
Never run make seed against production — it inserts a
hardcoded jamie@apexracing.io user.
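One way to make that rule harder to break is a guard that refuses to seed unless DATABASE_URL points at a local host. The sketch below is illustrative: is_local_db is a hypothetical helper, and the allowed host list (localhost, 127.0.0.1, race-postgres) is an assumption based on the local stack table.

```shell
# Hypothetical guard: succeeds only when the URL's host is local.
# The allowed hosts are an assumption; IPv6 literals are not handled.
is_local_db() {
  local url="$1" host
  host="${url#*://}"       # strip scheme
  host="${host%%[/?]*}"    # keep only the authority part
  host="${host##*@}"       # strip user:password@
  host="${host%%:*}"       # strip port
  case "$host" in
    localhost|127.0.0.1|race-postgres) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not executed here):
#   is_local_db "$DATABASE_URL" && make seed
```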
Incident playbooks
API down (5xx storm or no response)
- Run the smoke test. If /health 200s but /events 5xxs, the issue is downstream of the Worker (DB, R2, KV).
- Check the API logs.
  - Production: wrangler tail race-api --format pretty
  - Local: docker-compose -f infra/docker/docker-compose.yml logs -f api
- Common signatures:
  - getaddrinfo ENOTFOUND → DNS / network from the Worker
  - relation "x" does not exist → migrations didn't apply → make migrate
  - JWT_SECRET is required → env var missing on cold start
- If the DB is up but the API is wedged, a single Worker restart usually clears it (wrangler deploy or docker-compose restart api).
- Page @revanth if the smoke test is still red after a restart.
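The common signatures listed in this playbook lend themselves to a quick filter over captured logs. The following sketch is a hypothetical helper (triage is not an existing tool) that maps each signature to its likely cause.

```shell
# Hypothetical triage helper: reads log text on stdin and prints
# the likely cause for each known signature it finds.
triage() {
  local log found=0
  log="$(cat)"
  case "$log" in *'getaddrinfo ENOTFOUND'*)
    echo "DNS / network from the Worker"; found=1 ;;
  esac
  case "$log" in *'relation "'*'" does not exist'*)
    echo "migrations did not apply: run make migrate"; found=1 ;;
  esac
  case "$log" in *'JWT_SECRET is required'*)
    echo "JWT_SECRET missing on cold start"; found=1 ;;
  esac
  [ "$found" -eq 1 ] || { echo "no known signature"; return 1; }
}

# Example (not executed here). Use a finite log dump, not a live tail,
# because the helper reads stdin to the end before matching:
#   docker-compose -f infra/docker/docker-compose.yml logs api | triage
```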