whatsapp-inbox-platform/ops-runbook.md
Wira Basalamah adde003fba
chore: initial project import
2026-04-21 09:29:29 +07:00


# Production Runbook - WhatsApp Inbox
## 1) Normal Deployment
- Deploy code.
- Run migration:
  - `npm run db:deploy`
- Update environment variables and secrets.
- Start app service.
- Start retry worker (`daemon` or cron).
- Start maintenance cleanup (`npm run ops:maintenance`) on a periodic schedule, e.g. daily.
- Run readiness:
  - `npm run ops:readiness`
- Monitor:
  - `GET /api/health` returns `ok` or `degraded`, not `down`.
  - Super Admin `alerts` and `webhook-logs` show no new critical spikes.
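
The health check in the last step can be turned into a small deploy gate. The sketch below is a minimal illustration, not the platform's actual code: the three status values come from this runbook, while the response shape (`status` as a top-level JSON field) and the function names are assumptions.

```typescript
// Status values are taken from the runbook; the JSON shape is an assumption.
type HealthStatus = "ok" | "degraded" | "down";

interface HealthReport {
  status: HealthStatus; // assumed top-level field in the /api/health body
}

const KNOWN_STATUSES: readonly HealthStatus[] = ["ok", "degraded", "down"];

// Parses an unknown JSON body defensively.
// Anything missing or unrecognised is treated as `down`.
function parseHealthReport(body: unknown): HealthReport {
  if (typeof body === "object" && body !== null && "status" in body) {
    const raw = (body as { status: unknown }).status;
    const match = KNOWN_STATUSES.find((s) => s === raw);
    return { status: match ?? "down" };
  }
  return { status: "down" };
}

// `ok` and `degraded` pass the gate; `down` fails it.
function passesDeployGate(report: HealthReport): boolean {
  return report.status !== "down";
}
```

Treating an unparseable body as `down` is a deliberately conservative choice: a deploy script that cannot read the health response should stop rather than continue.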
## 2) Incident Response
### 2.1 Severity Triage
- **P0 (Critical)**: the service cannot serve requests, `health.status=down`, or the database is unreachable.
- **P1 (High)**: the retry worker is down or failing repeatedly, webhook failures are rising sharply, or many channels are disconnected.
- **P2 (Medium)**: many notifications fail or arrive late, but core functionality still works.
- **P3 (Low)**: minor, non-blocking degradation.
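
The triage levels above can be read as an ordered set of checks, where the first match wins. The following is a rough sketch of that ordering. Every field name in `IncidentSignals` is hypothetical and would need to be mapped to whatever the monitoring stack actually exposes.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3" | "none";

// All field names below are hypothetical placeholders for real monitoring signals.
interface IncidentSignals {
  serviceUnavailable: boolean;          // service cannot serve requests
  healthStatus: "ok" | "degraded" | "down";
  databaseReachable: boolean;
  retryWorkerFailing: boolean;          // worker down or failing repeatedly
  webhookFailureSpike: boolean;         // webhook failures rising sharply
  manyChannelsDisconnected: boolean;
  notificationsFailingOrLate: boolean;  // many failed/late notifications
}

// Checks run from most to least severe, so the highest matching level is returned.
function classifySeverity(s: IncidentSignals): Severity {
  if (s.serviceUnavailable || s.healthStatus === "down" || !s.databaseReachable) {
    return "P0";
  }
  if (s.retryWorkerFailing || s.webhookFailureSpike || s.manyChannelsDisconnected) {
    return "P1";
  }
  if (s.notificationsFailingOrLate) {
    return "P2";
  }
  if (s.healthStatus === "degraded") {
    return "P3";
  }
  return "none";
}
```

What counts as "many" or "rising sharply" is left to the caller, since this runbook does not define numeric thresholds.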
### 2.2 Immediate Actions
1. Check health and readiness:
   - `npm run ops:healthcheck`
   - `npm run ops:readiness`
2. Capture an incident snapshot:
   - `npm run ops:incident`
3. Check the Super Admin pages:
   - `/super-admin/alerts`
   - `/super-admin/webhook-logs`
4. Check the retry worker:
   - `GET /api/jobs/campaign-retry?token=<token>`
5. If the retry lock is stuck (`isStaleLock` is set or `consecutiveFailures` is high), notify the on-call team manually and continue with the recovery actions below.
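
The stuck-lock condition in step 5 could be checked in code along these lines. `isStaleLock` and `consecutiveFailures` are the names used in this runbook, but whether they appear exactly like this in the `/api/jobs/campaign-retry` response is an assumption. The threshold of 3 is also a made-up default, since the runbook only says "high".

```typescript
// Field names follow the runbook; their exact location in the API response is assumed.
interface RetryJobStatus {
  isStaleLock: boolean;
  consecutiveFailures: number;
}

// Hypothetical default; the runbook does not define what "high" means.
const DEFAULT_MAX_CONSECUTIVE_FAILURES = 3;

// Returns true when the on-call team should be notified manually.
function needsManualEscalation(
  status: RetryJobStatus,
  maxFailures: number = DEFAULT_MAX_CONSECUTIVE_FAILURES,
): boolean {
  return status.isStaleLock || status.consecutiveFailures >= maxFailures;
}
```

Passing `maxFailures` explicitly lets each environment tune the threshold without changing the function.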
## 3) Recovery Actions
### Recovery: Retry Worker Stuck
- If the lock looks stale, restart the worker daemon/service (cron-based runs recover automatically).
- Make sure the token is still valid.
- Run:
  - `npm run job:campaign-retry` (one-shot) and watch the logs.
### Recovery: Webhook Spike
- Check the webhook provider and secret.
- Confirm the webhook endpoint and signature key:
  - `WHATSAPP_WEBHOOK_SECRET`
  - `WHATSAPP_WEBHOOK_VERIFY_TOKEN`
- Check the event backlog:
  - If the number of `WebhookEvents` with status `failed` is rising, pull the details of the most recent event.
### Recovery: Channel Disconnected
- Check the channel under Super Admin > Channel.
- Make sure the number/hotfix config (phone number ID, token, WABA ID) is still valid.
- Trigger the reconnect action according to the channel integration SOP.
## 4) Rollback
Roll back if a bug affects service within 10 to 15 minutes after a deploy and cannot be fixed quickly.
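
The rollback rule above combines three conditions: timing, service impact, and the lack of a quick fix. A minimal sketch of that decision follows. It assumes a 15-minute window as the upper bound, and all type and field names are hypothetical.

```typescript
// All names here are hypothetical; the rule itself is taken from the runbook.
interface RollbackInput {
  deployedAt: Date;           // when the deploy finished
  detectedAt: Date;           // when the bug was first observed
  serviceImpacting: boolean;  // the bug affects the running service
  quickFixAvailable: boolean; // a fast forward-fix is possible
}

// Assumed upper bound of the post-deploy window, in minutes.
const ROLLBACK_WINDOW_MINUTES = 15;

// Returns true when all three conditions of the rollback rule hold.
function shouldRollback(
  input: RollbackInput,
  windowMinutes: number = ROLLBACK_WINDOW_MINUTES,
): boolean {
  const elapsedMs = input.detectedAt.getTime() - input.deployedAt.getTime();
  const withinWindow = elapsedMs >= 0 && elapsedMs <= windowMinutes * 60 * 1000;
  return withinWindow && input.serviceImpacting && !input.quickFixAvailable;
}
```

A bug detected outside the window is not automatically safe. It just falls outside this rule and needs a separate judgment call.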
### Steps
1. Pin the previously deployed commit.
2. Deploy the rollback image/commit.
3. Stop workers from both the old and new versions:
   - restart the service so it picks up the newly deployed bundle.
4. Run:
   - `npm run db:deploy` (verify that the schema still matches; avoid rolling back migrations unless it was planned).
   - `npm run ops:readiness`
5. Monitor:
   - `GET /api/health`
   - Super Admin alerts for the first 15 minutes.
### Migration Notes
- Database rollback priority: avoid `prisma migrate reset` in production.
- If a schema change is not yet backward compatible, roll back through a down migration (if available) via the DBA process.
## 5) Communication Template
- P0/P1: broadcast to the #incident channel plus Slack/WhatsApp:
  - status, start time, impact, ETA, and an update every 10 minutes.
- P2: post an internal update once the cause is identified, with an estimated time to fix.
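
To keep P0/P1 broadcasts consistent, the fields listed above can be rendered by a small formatter. This is only a sketch: the field set (status, start time, impact, ETA) and the 10-minute cadence come from this runbook, while the type name, function name, and message layout are assumptions.

```typescript
// Field set follows the runbook; names and layout are illustrative only.
interface IncidentUpdate {
  severity: "P0" | "P1";
  status: string;    // e.g. "investigating", "mitigated"
  startTime: string; // preformatted timestamp, e.g. "2026-04-21 09:29 +07:00"
  impact: string;
  eta: string;
}

// Update cadence stated in the runbook for P0/P1 incidents.
const UPDATE_INTERVAL_MINUTES = 10;

// Builds a plain-text message suitable for pasting into a chat channel.
function formatIncidentUpdate(update: IncidentUpdate): string {
  return [
    `[${update.severity}] Status: ${update.status}`,
    `Start time: ${update.startTime}`,
    `Impact: ${update.impact}`,
    `ETA: ${update.eta}`,
    `Next update in ${UPDATE_INTERVAL_MINUTES} minutes.`,
  ].join("\n");
}
```

The output is plain text so the same message can be posted to both Slack and WhatsApp without channel-specific markup.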