Production Runbook - WhatsApp Inbox
1) Normal Deployment
- Deploy code:
- optional pre-step on server:
git pull
npm ci
npm run build
- Run migration:
npm run db:deploy
- Optional schema safety (if DB changes):
npm run ops:readiness
- Update environment variables and secrets.
- Start or restart app service:
npm run ops:safe-restart
- Start retry worker (daemon or cron).
- Start maintenance cleanup (npm run ops:maintenance) on a periodic schedule, e.g. daily.
- Verify auth session baseline (the default 24h check target can be tuned via SESSION_TTL_SECONDS):
- set OPS_SESSION_CHECK_EMAIL and OPS_SESSION_CHECK_PASSWORD in .env
npm run ops:session-check
- Run readiness:
npm run ops:readiness
- Monitor:
- GET /api/health returns ok/degraded, not down.
- Super Admin alerts and webhook-logs show no new critical spikes.
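
The monitor step can be sketched as a small check on the health response. This is a minimal sketch, assuming GET /api/health returns JSON with a `status` field (the field name is inferred from `health.status` in the triage section; the rest of the response shape is not documented here). `parseHealth` and `passesDeployMonitor` are hypothetical helper names.

```typescript
// The three statuses come from the runbook; the JSON shape is an assumption.
type HealthStatus = "ok" | "degraded" | "down";

interface HealthResponse {
  status: HealthStatus; // assumed field name, mirrors health.status
}

// Parse a raw response body defensively: anything unparseable or
// unrecognized is treated as "down" so a broken endpoint never passes.
function parseHealth(body: string): HealthResponse {
  try {
    const parsed: unknown = JSON.parse(body);
    if (typeof parsed === "object" && parsed !== null && "status" in parsed) {
      const status = (parsed as { status: unknown }).status;
      if (status === "ok" || status === "degraded" || status === "down") {
        return { status: status as HealthStatus };
      }
    }
  } catch {
    // Fall through to the conservative default below.
  }
  return { status: "down" };
}

// The deploy passes the monitor step when the service reports ok or degraded.
function passesDeployMonitor(health: HealthResponse): boolean {
  return health.status === "ok" || health.status === "degraded";
}
```

Treating unparseable bodies as down is a deliberately conservative choice: a deploy that breaks the health endpoint itself should not be reported as healthy.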
2) Incident Response
2.1 Severity Triage
- P0 (Critical): the service cannot serve requests, health.status=down, or the database is unreachable.
- P1 (High): retry worker down or failing repeatedly, webhook failures rising sharply, many channels disconnected.
- P2 (Medium): many notifications failing or delayed, but core functionality still works.
- P3 (Low): mild degradation that is not blocking.
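
The triage levels above can be expressed as a first-match rule list. The sketch below is illustrative only: `IncidentSignals` and its field names are hypothetical and do not reflect the application's real schema. Each check maps to one bullet of the triage list.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3" | "none";

// Hypothetical snapshot of the signals an on-call engineer would collect.
interface IncidentSignals {
  healthStatus: "ok" | "degraded" | "down";
  dbReachable: boolean;
  retryWorkerFailing: boolean;
  webhookFailuresSpiking: boolean;
  manyChannelsDisconnected: boolean;
  notificationsFailingOrDelayed: boolean;
}

// Checks run from most to least severe; the first match wins.
function triage(s: IncidentSignals): Severity {
  if (s.healthStatus === "down" || !s.dbReachable) return "P0";
  if (s.retryWorkerFailing || s.webhookFailuresSpiking || s.manyChannelsDisconnected) return "P1";
  if (s.notificationsFailingOrDelayed) return "P2";
  if (s.healthStatus === "degraded") return "P3";
  return "none";
}
```

Ordering the checks by severity means an incident with both P1 and P2 symptoms is always reported at the higher level.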
2.2 Immediate Actions
- Check health and readiness:
npm run ops:healthcheck
npm run ops:readiness
- Capture an incident snapshot:
npm run ops:incident
- Check the Super Admin pages:
/super-admin/alerts
/super-admin/webhook-logs
- Check the retry worker:
GET /api/jobs/campaign-retry?token=<token>
- If the retry lock is stuck (isStaleLock set or consecutiveFailures high), manually notify the on-call team and continue with the recovery actions below.
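
The stuck-lock condition can be sketched as a single predicate. Only the names `isStaleLock` and `consecutiveFailures` come from this runbook; their types, the `RetryJobStatus` shape, and the threshold of 3 are assumptions, since the runbook does not define what counts as "high".

```typescript
// Assumed shape of the retry-job status payload.
interface RetryJobStatus {
  isStaleLock: boolean;        // assumed to be a boolean flag
  consecutiveFailures: number; // assumed to be a running counter
}

// Assumed threshold for "high"; tune to the real failure pattern.
const MAX_CONSECUTIVE_FAILURES = 3;

// True when the on-call team should be notified manually.
function needsManualEscalation(
  status: RetryJobStatus,
  maxFailures: number = MAX_CONSECUTIVE_FAILURES
): boolean {
  return status.isStaleLock || status.consecutiveFailures >= maxFailures;
}
```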
3) Recovery Actions
Recovery: Retry Worker Stuck
- If the lock looks stale, restart the worker daemon/service (an automatic cron run will recover on its own).
- Make sure the token is still valid.
- Run:
npm run job:campaign-retry (one-shot) and watch the logs.
Recovery: Webhook Spike
- Check the webhook provider and secret.
- Confirm the webhook endpoint and signature keys:
WHATSAPP_WEBHOOK_SECRET
WHATSAPP_WEBHOOK_VERIFY_TOKEN
- Check the event backlog:
If the number of WebhookEvents with status failed is rising, pull the details of the latest event.
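
Whether failed events are "rising" is a judgment call; one way to make it repeatable is to compare failed counts across two equal time windows. The rule below is an assumed heuristic, not something the application implements: `isFailureSpike`, the factor of 2, and the minimum count of 10 are all hypothetical.

```typescript
// Assumed rule: a spike means the current window has at least `minCount`
// failed events and at least `factor` times the previous window's count.
function isFailureSpike(
  previousFailed: number,
  currentFailed: number,
  factor: number = 2,
  minCount: number = 10
): boolean {
  if (currentFailed < minCount) return false; // ignore low-volume noise
  // Treat an empty previous window as 1 to avoid a zero baseline.
  return currentFailed >= Math.max(previousFailed, 1) * factor;
}
```

The minimum count keeps a jump from 1 to 3 failures from being reported as a spike.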
Recovery: Channel Disconnected
- Check the channel under Super Admin > Channel.
- Make sure the number/hotfix config (phone number ID, token, WABA ID) is still valid.
- Trigger the reconnect action per the channel integration SOP.
4) Rollback
Use rollback if a bug impacts the service within 10–15 minutes after deploy and cannot be fixed quickly.
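
The rollback criterion can be written down as a simple predicate. This is a sketch under stated assumptions: `shouldRollback` is a hypothetical helper, the upper bound of 15 minutes is used for the window, and whether a quick fix exists is left as a human judgment passed in as a flag.

```typescript
// Upper bound of the 10–15 minute window described above.
const ROLLBACK_WINDOW_MINUTES = 15;

// True when the impact started inside the post-deploy window
// and no quick fix is available.
function shouldRollback(
  deployedAt: Date,
  impactAt: Date,
  quickFixAvailable: boolean
): boolean {
  const minutesSinceDeploy = (impactAt.getTime() - deployedAt.getTime()) / 60000;
  const insideWindow = minutesSinceDeploy >= 0 && minutesSinceDeploy <= ROLLBACK_WINDOW_MINUTES;
  return insideWindow && !quickFixAvailable;
}
```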
Steps
- Pin the previous deploy commit.
- Deploy the rollback image/commit.
- Stop workers from the old/new version:
- restart the service so it picks up the newly deployed bundle.
- Run:
npm run db:deploy (verify the schema still matches; avoid rolling back migrations unless planned).
npm run ops:readiness
- Monitor:
- GET /api/health
- Super Admin alerts for the first 15 minutes.
Migration Notes
- Database rollback priority: avoid prisma migrate reset in production.
- If a schema change is not yet backward compatible, roll back through a down migration (if available) via the DBA process.
5) Communication Template
- P0/P1: broadcast to the #incident channel + Slack/WhatsApp:
- status, start time, impact, ETA, updates every 10 minutes.
- P2: internal update once the cause is identified, with an estimated fix time.
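
A P0/P1 broadcast with the fields listed above could be generated from a small formatter. This is only a sketch: `IncidentUpdate`, `formatIncidentUpdate`, and the message layout are hypothetical, and posting to Slack or WhatsApp is left out.

```typescript
interface IncidentUpdate {
  severity: "P0" | "P1";
  status: string;  // e.g. "investigating", "mitigating"
  startTime: Date;
  impact: string;
  eta: string;     // free text, since an ETA is often approximate
}

// Update cadence from the template: every 10 minutes.
const UPDATE_INTERVAL_MINUTES = 10;

// Builds a plain-text message and states when the next update is due.
function formatIncidentUpdate(u: IncidentUpdate, now: Date): string {
  const nextUpdate = new Date(now.getTime() + UPDATE_INTERVAL_MINUTES * 60000);
  return [
    `[${u.severity}] ${u.status}`,
    `Start: ${u.startTime.toISOString()}`,
    `Impact: ${u.impact}`,
    `ETA: ${u.eta}`,
    `Next update: ${nextUpdate.toISOString()}`,
  ].join("\n");
}
```

Including the next update time in each message makes the 10-minute cadence visible to readers and hard to skip.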