# Production Runbook - WhatsApp Inbox

## 1) Normal Deployment

- Deploy code:
  - optional pre-step on server:
    - `git pull`
    - `npm ci`
    - `npm run build`
- Run migration:
  - `npm run db:deploy`
- Optional schema safety check (if the DB changes):
  - `npm run ops:readiness`
- Update environment variables and secrets.
- Start or restart the app service:
  - `npm run ops:safe-restart`
- Start the retry worker (`daemon` or cron).
- Start maintenance cleanup (`npm run ops:maintenance`) on a periodic schedule, e.g. daily.
- Verify the auth session baseline (the check targets 24h by default; tune it via `SESSION_TTL_SECONDS`):
  - set `OPS_SESSION_CHECK_EMAIL` and `OPS_SESSION_CHECK_PASSWORD` in `.env`
  - `npm run ops:session-check`
- Run readiness:
  - `npm run ops:readiness`
- Monitor:
  - `GET /api/health` returns `ok`/`degraded`, not `down`.
  - Super Admin `alerts` and `webhook-logs` show no new critical spikes.

## 2) Incident Response

### 2.1 Severity Triage

- **P0 (Critical)**: the service cannot serve requests, `health.status=down`, or the DB is unreachable.
- **P1 (High)**: the retry worker is down or fails repeatedly, webhook failures rise sharply, or many channels are disconnected.
- **P2 (Medium)**: many notifications fail or arrive late, but the service is still functional.
- **P3 (Low)**: mild degradation that is not blocking.

### 2.2 Immediate Actions

1. Check health and readiness:
   - `npm run ops:healthcheck`
   - `npm run ops:readiness`
2. Take an incident snapshot:
   - `npm run ops:incident`
3. Check the Super Admin pages:
   - `/super-admin/alerts`
   - `/super-admin/webhook-logs`
4. Check the retry worker:
   - `GET /api/jobs/campaign-retry?token=`
5. If the retry lock is stuck (`isStaleLock` is set or `consecutiveFailures` is high), notify the on-call team manually and continue with the recovery actions below.

## 3) Recovery Actions

### Recovery: Retry Worker Stuck

- If the lock looks stale, restart the worker daemon/service (a cron-driven run recovers automatically).
- Make sure the token is still valid.
- Run:
  - `npm run job:campaign-retry` (one-shot) and watch the logs.
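
The stuck-lock check for the retry worker can be sketched as a small pure function. This is a minimal sketch, not the project's implementation: only the `isStaleLock` and `consecutiveFailures` field names come from this runbook, while the `RetryWorkerStatus` shape, the `decideRetryAction` helper, and the failure threshold of 5 are assumptions for illustration.

```typescript
// Hypothetical shape of the retry worker status. Only `isStaleLock` and
// `consecutiveFailures` are named in this runbook; the rest is assumed.
interface RetryWorkerStatus {
  isStaleLock: boolean;
  consecutiveFailures: number;
}

type RetryAction = "none" | "notify-and-recover";

// Assumed threshold: the runbook only says "high", so tune this value.
const MAX_CONSECUTIVE_FAILURES = 5;

// Returns "notify-and-recover" when the lock is stale or the failure
// count has reached the threshold, otherwise "none".
function decideRetryAction(
  status: RetryWorkerStatus,
  maxFailures: number = MAX_CONSECUTIVE_FAILURES,
): RetryAction {
  const stuck =
    status.isStaleLock || status.consecutiveFailures >= maxFailures;
  return stuck ? "notify-and-recover" : "none";
}

// Usage: a healthy worker versus one holding a stale lock.
console.log(decideRetryAction({ isStaleLock: false, consecutiveFailures: 1 })); // none
console.log(decideRetryAction({ isStaleLock: true, consecutiveFailures: 0 })); // notify-and-recover
```

Keeping the decision separate from the HTTP call makes the threshold easy to adjust and test without a running service.
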
### Recovery: Webhook Spike

- Check the webhook provider and secret.
- Confirm the webhook endpoint and signature keys:
  - `WHATSAPP_WEBHOOK_SECRET`
  - `WHATSAPP_WEBHOOK_VERIFY_TOKEN`
- Check the event backlog:
  - if `WebhookEvents` with status `failed` are increasing → pull the details of the most recent event.

### Recovery: Channel Disconnected

- Check the channel in Super Admin > Channel.
- Make sure the number/hotfix config (phone number ID, token, WABA ID) is still valid.
- Trigger the reconnect action according to the channel integration SOP.

## 4) Rollback

Roll back if a bug affects the service within 10–15 minutes after a deploy and cannot be fixed quickly.

### Steps

1. Pin the previous deploy commit.
2. Deploy the rollback image/commit.
3. Stop workers from the old/new version:
   - restart the service so that it picks up the newly deployed bundle.
4. Run:
   - `npm run db:deploy` (verify that the schema still matches; avoid rolling back migrations unless that was planned).
   - `npm run ops:readiness`
5. Monitor:
   - `GET /api/health`
   - Super Admin alerts for the first 15 minutes.

### Migration Notes

- Database rollback priority: avoid `prisma migrate reset` in production.
- If a schema change is not yet backward compatible, roll it back with a down migration (if one is available) through the DBA process.

## 5) Communication Template

- P0/P1: broadcast to the #incident channel + Slack/WhatsApp:
  - status, start time, impact, ETA, with an update every 10 minutes.
- P2: send an internal update once the cause is identified, with an estimated time to fix.
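
A P0/P1 broadcast could be assembled with a small formatter like the one below. This is a sketch under stated assumptions: the fields (status, start time, impact, ETA) and the 10-minute update interval come from the template above, but the `IncidentUpdate` type, the `formatIncidentUpdate` helper, and the message layout are hypothetical and not part of the project.

```typescript
type Severity = "P0" | "P1";

// Hypothetical container for the fields listed in the template.
interface IncidentUpdate {
  severity: Severity;
  status: string;
  startTime: Date;
  impact: string;
  eta: string;
}

// Update cadence for P0/P1 incidents, taken from the template.
const UPDATE_INTERVAL_MINUTES = 10;

// Builds a plain-text message and computes when the next update is due,
// counted from `now`.
function formatIncidentUpdate(update: IncidentUpdate, now: Date): string {
  const nextUpdate = new Date(
    now.getTime() + UPDATE_INTERVAL_MINUTES * 60 * 1000,
  );
  return [
    `[${update.severity}] Status: ${update.status}`,
    `Start time: ${update.startTime.toISOString()}`,
    `Impact: ${update.impact}`,
    `ETA: ${update.eta}`,
    `Next update: ${nextUpdate.toISOString()}`,
  ].join("\n");
}

// Usage with made-up example values.
const exampleMessage = formatIncidentUpdate(
  {
    severity: "P1",
    status: "investigating",
    startTime: new Date("2024-01-01T00:00:00.000Z"),
    impact: "webhook failures rising",
    eta: "unknown",
  },
  new Date("2024-01-01T00:05:00.000Z"),
);
console.log(exampleMessage);
```

Passing `now` explicitly keeps the function deterministic, so the "next update" time can be checked without depending on the system clock.
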