whatsapp-inbox-platform/ops-runbook.md
Wira Basalamah adde003fba
chore: initial project import
2026-04-21 09:29:29 +07:00

Production Runbook - WhatsApp Inbox

1) Normal Deployment

  • Deploy code.
  • Run migration:
    • npm run db:deploy
  • Update environment variables and secrets.
  • Start app service.
  • Start retry worker (daemon or cron).
  • Start maintenance cleanup (npm run ops:maintenance) on a periodic schedule, e.g. daily.
  • Run readiness:
    • npm run ops:readiness
  • Monitor:
    • GET /api/health returns ok or degraded, not down.
    • Super Admin alerts and webhook-logs show no new critical spikes.
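
The health check in the Monitor step can be sketched as a small gate. Only the `status` field and its three values (ok, degraded, down) come from this runbook; the `HealthResponse` shape and the function names are hypothetical, and the real `/api/health` payload may carry more fields.

```typescript
// Assumed response shape for GET /api/health (hypothetical; only `status` is from the runbook).
type HealthStatus = "ok" | "degraded" | "down";

interface HealthResponse {
  status: HealthStatus;
}

const KNOWN_STATUSES: HealthStatus[] = ["ok", "degraded", "down"];

// Parses a raw response body. Anything unparseable or unknown is treated as "down"
// so that a broken endpoint never passes the gate by accident.
function parseHealth(body: string): HealthResponse {
  try {
    const parsed: unknown = JSON.parse(body);
    if (typeof parsed === "object" && parsed !== null && "status" in parsed) {
      const raw = (parsed as { status: unknown }).status;
      const match = KNOWN_STATUSES.find((s) => s === raw);
      if (match !== undefined) {
        return { status: match };
      }
    }
  } catch {
    // Invalid JSON falls through to the "down" default below.
  }
  return { status: "down" };
}

// The deploy passes monitoring when health is ok or degraded, and fails when it is down.
function isDeployHealthy(health: HealthResponse): boolean {
  return health.status === "ok" || health.status === "degraded";
}
```

Treating unknown or malformed responses as down is a conservative choice made for this sketch, not something the runbook prescribes.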

2) Incident Response

2.1 Severity Triage

  • P0 (Critical): the service cannot serve requests, health.status=down, or the database is unreachable.
  • P1 (High): the retry worker is down or failing repeatedly, webhook failures rise sharply, or many channels are disconnected.
  • P2 (Medium): many notifications fail or are delayed, but core functionality still works.
  • P3 (Low): minor degradation that is not blocking.
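
The triage levels above can be read as a first-match rule, checked from most to least severe. The sketch below assumes a hypothetical `TriageSignals` object; none of these field names exist in the runbook, and deciding what counts as "many" or "sharply" is left to whoever fills in the signals.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3" | "none";

// Hypothetical signal names for illustration; the runbook defines no data model.
interface TriageSignals {
  servingRequests: boolean;
  healthStatus: "ok" | "degraded" | "down";
  dbReachable: boolean;
  retryWorkerFailing: boolean;
  webhookFailureSpike: boolean;
  manyChannelsDisconnected: boolean;
  manyNotificationsFailingOrLate: boolean;
}

// Returns the highest severity whose condition matches, or "none" when nothing matches.
function classifySeverity(s: TriageSignals): Severity {
  if (!s.servingRequests || s.healthStatus === "down" || !s.dbReachable) {
    return "P0";
  }
  if (s.retryWorkerFailing || s.webhookFailureSpike || s.manyChannelsDisconnected) {
    return "P1";
  }
  if (s.manyNotificationsFailingOrLate) {
    return "P2";
  }
  if (s.healthStatus === "degraded") {
    return "P3";
  }
  return "none";
}
```

Checking P0 first means a database outage is never reported as P1 just because the retry worker is also failing.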

2.2 Immediate Actions

  1. Check health and readiness:
    • npm run ops:healthcheck
    • npm run ops:readiness
  2. Capture an incident snapshot:
    • npm run ops:incident
  3. Check the Super Admin pages:
    • /super-admin/alerts
    • /super-admin/webhook-logs
  4. Check the retry worker:
    • GET /api/jobs/campaign-retry?token=<token>
  5. If the retry lock is stuck (isStaleLock is set or consecutiveFailures is high), manually notify the on-call team and continue with the recovery actions below.
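
The stuck-lock check in step 5 might look like the following. The field names `isStaleLock` and `consecutiveFailures` come from this runbook; the surrounding `RetryWorkerState` shape is an assumption, and the threshold of 3 is a placeholder because the runbook does not say what "high" means.

```typescript
// Assumed subset of the retry worker status (hypothetical shape; field names are from the runbook).
interface RetryWorkerState {
  isStaleLock: boolean;
  consecutiveFailures: number;
}

// Placeholder threshold, NOT from the runbook. Tune it to the worker's real behavior.
const ASSUMED_FAILURE_THRESHOLD = 3;

// The lock counts as stuck when it is flagged stale or failures reach the threshold.
function isRetryLockStuck(
  state: RetryWorkerState,
  failureThreshold: number = ASSUMED_FAILURE_THRESHOLD,
): boolean {
  return state.isStaleLock || state.consecutiveFailures >= failureThreshold;
}
```

For example, `isRetryLockStuck({ isStaleLock: false, consecutiveFailures: 5 })` is true under the assumed threshold, which would trigger the manual on-call notification.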

3) Recovery Actions

Recovery: Retry Worker Stuck

  • If the lock looks stale, restart the worker daemon/service (cron-driven runs recover automatically).
  • Make sure the token is still valid.
  • Run:
    • npm run job:campaign-retry (one-shot) and watch the logs.

Recovery: Webhook Spike

  • Check the provider and the webhook secret.
  • Confirm the webhook endpoint and signature key:
    • WHATSAPP_WEBHOOK_SECRET
    • WHATSAPP_WEBHOOK_VERIFY_TOKEN
  • Check the event backlog:
    • If the number of WebhookEvents with status failed is rising, pull the details of the most recent event.

Recovery: Channel Disconnected

  • Check the channel under Super Admin > Channel.
  • Make sure the number/hotfix config (phone number ID, token, WABA ID) is still valid.
  • Trigger the reconnect action per the channel integration SOP.

4) Rollback

Roll back if a bug impacts the service within 10 to 15 minutes after a deploy and cannot be fixed quickly.
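
As a rough sketch, the rollback rule can be written as a predicate. It assumes a 15-minute window and a hypothetical `RollbackInput` shape; whether a fix counts as "quick" remains a human judgment passed in as a flag.

```typescript
// Hypothetical input shape for illustration; none of these names appear in the runbook.
interface RollbackInput {
  deployedAt: Date;
  impactDetectedAt: Date;
  quickFixAvailable: boolean;
}

// Assumed window length in minutes.
const ASSUMED_ROLLBACK_WINDOW_MINUTES = 15;

// True when impact appeared inside the post-deploy window and no quick fix exists.
function shouldRollback(
  input: RollbackInput,
  windowMinutes: number = ASSUMED_ROLLBACK_WINDOW_MINUTES,
): boolean {
  const elapsedMs = input.impactDetectedAt.getTime() - input.deployedAt.getTime();
  if (elapsedMs < 0) {
    // Impact predates the deploy, so this deploy is unlikely to be the cause.
    return false;
  }
  const withinWindow = elapsedMs <= windowMinutes * 60 * 1000;
  return withinWindow && !input.quickFixAvailable;
}
```

Impact detected after the window is not covered by this rule; the sketch returns false there and leaves the decision to the incident process.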

Steps

  1. Pin the previous deploy commit.
  2. Deploy the rollback image/commit.
  3. Stop workers running the old/new version:
    • restart the service so that it uses the new bundle.
  4. Run:
    • npm run db:deploy (verify the schema still matches; avoid rolling back migrations unless planned).
    • npm run ops:readiness
  5. Monitor:
    • GET /api/health
    • Super Admin alerts for the first 15 minutes.

Migration Notes

  • Database rollback priority: avoid prisma migrate reset in production.
  • If a schema change is not yet backward compatible, roll back through a down migration (if available) via the DBA process.

5) Communication Template

  • P0/P1: broadcast to the #incident channel + Slack/WhatsApp:
    • status, start time, impact, ETA, updates every 10 minutes.
  • P2: send an internal update once the cause is identified, with an estimated time to fix.
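
A P0/P1 broadcast could be assembled from the template fields like this. The `IncidentUpdate` shape, the message layout, and the function name are all hypothetical; only the fields (status, start time, impact, ETA) and the 10-minute cadence come from this runbook.

```typescript
// Hypothetical structure for one broadcast; field set mirrors the template above.
interface IncidentUpdate {
  severity: "P0" | "P1";
  status: string;
  startTime: Date;
  impact: string;
  eta: string;
}

// Update cadence for P0/P1 incidents, per the runbook.
const UPDATE_INTERVAL_MINUTES = 10;

// Builds a plain-text message and states when the next update is due.
function formatIncidentUpdate(update: IncidentUpdate, now: Date): string {
  const nextUpdate = new Date(now.getTime() + UPDATE_INTERVAL_MINUTES * 60 * 1000);
  return [
    `[${update.severity}] Incident update`,
    `Status: ${update.status}`,
    `Start time: ${update.startTime.toISOString()}`,
    `Impact: ${update.impact}`,
    `ETA: ${update.eta}`,
    `Next update: ${nextUpdate.toISOString()}`,
  ].join("\n");
}
```

Timestamps are emitted in UTC via `toISOString()`; a team posting in local time would need to convert before sending.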