# Production Runbook - WhatsApp Inbox

## 1) Normal Deployment

- Deploy code.
- Run migration:
  - `npm run db:deploy`
- Update environment variables and secrets.
- Start app service.
- Start retry worker (`daemon` or cron).
- Schedule maintenance cleanup (`npm run ops:maintenance`) to run periodically, e.g. daily.
- Run readiness check:
  - `npm run ops:readiness`
- Monitor:
  - `GET /api/health` returns `ok` or `degraded`, not `down`.
  - Super Admin `alerts` and `webhook-logs` show no new critical spikes.

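
The health gate in the monitoring step can be sketched as a small check. This is a minimal sketch, assuming `GET /api/health` returns JSON with a `status` field (suggested by `health.status` in the triage section); the helper names `parseHealth` and `isDeployHealthy` are hypothetical and not part of the app.

```typescript
// Assumed response shape: only the `status` field is implied by the runbook.
type HealthStatus = "ok" | "degraded" | "down";

interface HealthResponse {
  status: HealthStatus;
}

const KNOWN_STATUSES: readonly HealthStatus[] = ["ok", "degraded", "down"];

// Parses a raw response body and fails closed: anything that is not
// valid JSON with a known status is treated as "down".
function parseHealth(body: string): HealthResponse {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return { status: "down" };
  }
  const candidate = (parsed as { status?: unknown } | null)?.status;
  const match = KNOWN_STATUSES.find((s) => s === candidate);
  return { status: match ?? "down" };
}

// A deploy passes the health gate on "ok" or "degraded", never on "down".
function isDeployHealthy(health: HealthResponse): boolean {
  return health.status === "ok" || health.status === "degraded";
}
```

Failing closed means an unreadable health response blocks the deploy instead of passing it silently.
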
## 2) Incident Response

### 2.1 Severity Triage

- **P0 (Critical)**: the service cannot serve requests, `health.status=down`, or the database is unreachable.
- **P1 (High)**: the retry worker is down or failing repeatedly, webhook failures rise sharply, or many channels are disconnected.
- **P2 (Medium)**: many notifications fail or arrive late, but core functionality still works.
- **P3 (Low)**: mild degradation that is not blocking.

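
The triage levels above can be read as an ordered rule set where the most severe matching rule wins. The sketch below encodes that ordering; the `IncidentSignals` fields are hypothetical names chosen for illustration, and what counts as a "spike" or "many" is left to the operator.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3" | "none";

// Hypothetical signal set: field names are illustrative, not taken from the app.
interface IncidentSignals {
  serviceUnavailable: boolean; // service cannot serve requests
  healthStatus: "ok" | "degraded" | "down";
  databaseReachable: boolean;
  retryWorkerFailing: boolean; // down or failing repeatedly
  webhookFailureSpike: boolean;
  manyChannelsDisconnected: boolean;
  notificationsFailingOrLate: boolean;
}

// Checks rules from most to least severe and returns the first match.
function classifySeverity(s: IncidentSignals): Severity {
  if (s.serviceUnavailable || s.healthStatus === "down" || !s.databaseReachable) {
    return "P0";
  }
  if (s.retryWorkerFailing || s.webhookFailureSpike || s.manyChannelsDisconnected) {
    return "P1";
  }
  if (s.notificationsFailingOrLate) {
    return "P2";
  }
  if (s.healthStatus === "degraded") {
    return "P3";
  }
  return "none";
}
```
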
### 2.2 Immediate Actions

1. Check health and readiness:
   - `npm run ops:healthcheck`
   - `npm run ops:readiness`
2. Capture an incident snapshot:
   - `npm run ops:incident`
3. Check the Super Admin pages:
   - `/super-admin/alerts`
   - `/super-admin/webhook-logs`
4. Check the retry worker:
   - `GET /api/jobs/campaign-retry?token=<token>`
5. If the retry lock is stuck (`isStaleLock` is set or `consecutiveFailures` is high), manually notify the on-call team and continue with the recovery actions in section 3.

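
Step 5 depends on reading the retry job status. The sketch below shows one way to turn that into a yes/no escalation decision. It assumes the `campaign-retry` endpoint exposes `isStaleLock` and `consecutiveFailures` as top-level fields. The threshold of 3 is a hypothetical value, since the runbook only says "high".

```typescript
// Assumed subset of the retry job status; the real payload may differ.
interface RetryJobStatus {
  isStaleLock: boolean;
  consecutiveFailures: number;
}

// Hypothetical threshold: tune it to what "high" means for your workload.
const DEFAULT_FAILURE_THRESHOLD = 3;

// Returns true when on-call should be notified manually.
function needsManualEscalation(
  status: RetryJobStatus,
  failureThreshold: number = DEFAULT_FAILURE_THRESHOLD,
): boolean {
  return status.isStaleLock || status.consecutiveFailures >= failureThreshold;
}
```
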
## 3) Recovery Actions

### Recovery: Retry Worker Stuck

- If the lock looks stale, restart the worker daemon/service (cron-driven runs recover automatically).
- Make sure the token is still valid.
- Run:
  - `npm run job:campaign-retry` (one-shot) and watch the logs.

### Recovery: Webhook Spike

- Check the provider and the webhook secret.
- Confirm the webhook endpoint and signature key:
  - `WHATSAPP_WEBHOOK_SECRET`
  - `WHATSAPP_WEBHOOK_VERIFY_TOKEN`
- Check the event backlog:
  - If `WebhookEvents` with status `failed` are increasing, pull the details of the most recent event.

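
A quick first check during a webhook spike is whether both variables are set at all. The sketch below is a hypothetical helper, not part of the app. It only detects missing or blank values and cannot tell whether a secret matches what the provider expects.

```typescript
// Variable names come from the runbook; the helper itself is illustrative.
const REQUIRED_WEBHOOK_VARS = [
  "WHATSAPP_WEBHOOK_SECRET",
  "WHATSAPP_WEBHOOK_VERIFY_TOKEN",
] as const;

// Returns the names of required variables that are unset or blank.
function missingWebhookVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_WEBHOOK_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

In a Node process this could be called as `missingWebhookVars(process.env)`. An empty result means both variables are present, not that they are correct.
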
### Recovery: Channel Disconnected

- Check the channel in Super Admin > Channel.
- Make sure the number/hotfix config (phone number ID, token, WABA ID) is still valid.
- Trigger the reconnect action according to the channel integration SOP.

## 4) Rollback

Roll back if a bug affects service within 10–15 minutes of a deploy and cannot be fixed quickly.

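
The rollback criterion combines a time window with a judgment call about fix speed. A minimal sketch, assuming the upper bound of 15 minutes is used as the cutoff and that "cannot be fixed quickly" is supplied by the operator as a boolean; `shouldRollback` is a hypothetical name.

```typescript
// Upper bound of the 10–15 minute window stated in the runbook.
const ROLLBACK_WINDOW_MINUTES = 15;

// Returns true when impact appeared inside the post-deploy window
// and no quick fix is available.
function shouldRollback(
  deployedAt: Date,
  impactDetectedAt: Date,
  quickFixAvailable: boolean,
): boolean {
  const elapsedMs = impactDetectedAt.getTime() - deployedAt.getTime();
  const windowMs = ROLLBACK_WINDOW_MINUTES * 60 * 1000;
  const withinWindow = elapsedMs >= 0 && elapsedMs <= windowMs;
  return withinWindow && !quickFixAvailable;
}
```

Impact detected outside the window returns `false` here. Such cases still need a human decision rather than an automatic rollback.
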
### Steps

1. Pin the previous deploy commit.
2. Deploy the rollback image/commit.
3. Stop workers from both the old and new versions:
   - restart the service so it picks up the newly deployed (rollback) bundle.
4. Run:
   - `npm run db:deploy` (verify the schema still matches; avoid rolling back migrations unless it was planned).
   - `npm run ops:readiness`
5. Monitor:
   - `GET /api/health`
   - Super Admin alerts for the first 15 minutes.

### Migration Notes

- Database rollback priority: avoid `prisma migrate reset` in production, since it drops the database and reapplies all migrations.
- If a schema change is not yet backward compatible, roll it back with a down migration (if available) through the DBA process.

## 5) Communication Template

- P0/P1: broadcast to the #incident channel plus Slack/WhatsApp:
  - status, start time, impact, ETA, with an update every 10 minutes.
- P2: post an internal update once the cause is identified, with an estimated fix time.

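
The P0/P1 template can be sketched as a small formatter that also computes when the next update is due. The field names and message layout are assumptions for illustration. Only the listed fields and the 10-minute cadence come from this runbook.

```typescript
// Fields listed in the template; names and types are illustrative.
interface IncidentUpdate {
  severity: "P0" | "P1";
  status: string;
  startedAt: Date;
  impact: string;
  eta: string;
}

// Update cadence for P0/P1 incidents, per the runbook.
const UPDATE_INTERVAL_MINUTES = 10;

// Returns the time the next broadcast is due.
function nextUpdateAt(lastUpdate: Date): Date {
  return new Date(lastUpdate.getTime() + UPDATE_INTERVAL_MINUTES * 60 * 1000);
}

// Builds a plain-text message suitable for a chat channel.
function formatIncidentUpdate(update: IncidentUpdate, now: Date): string {
  return [
    `[${update.severity}] Status: ${update.status}`,
    `Start time: ${update.startedAt.toISOString()}`,
    `Impact: ${update.impact}`,
    `ETA: ${update.eta}`,
    `Next update: ${nextUpdateAt(now).toISOString()}`,
  ].join("\n");
}
```

Timestamps are emitted in UTC ISO 8601 format so that readers in different time zones see the same reference time.
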