chore: initial project import
ops-runbook.md
# Production Runbook - WhatsApp Inbox

## 1) Normal Deployment

- Deploy code.
- Run migration:
  - `npm run db:deploy`
- Update environment variables and secrets.
- Start app service.
- Start retry worker (`daemon` or cron).
- Start maintenance cleanup (`npm run ops:maintenance`) on a periodic schedule, e.g. daily.
- Run readiness:
  - `npm run ops:readiness`
- Monitor:
  - `GET /api/health` returns `ok` or `degraded`, not `down`.
  - Super Admin `alerts` and `webhook-logs` show no new critical spikes.

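The health check in the monitoring step can be turned into a post-deploy gate. The sketch below is a minimal example, assuming `GET /api/health` returns JSON with a `status` field whose value is `ok`, `degraded`, or `down`. The response shape and the helper names `parseHealthStatus` and `deployGatePasses` are hypothetical and not taken from the application.

```typescript
// Assumed values of the `status` field in the /api/health response.
type HealthStatus = "ok" | "degraded" | "down";

// Parse a raw response body; anything unreadable is treated as "down".
function parseHealthStatus(body: string): HealthStatus {
  try {
    const parsed: unknown = JSON.parse(body);
    if (typeof parsed === "object" && parsed !== null && "status" in parsed) {
      const status = (parsed as { status: unknown }).status;
      if (status === "ok" || status === "degraded") return status;
    }
  } catch {
    // Malformed JSON falls through to "down".
  }
  return "down";
}

// The runbook accepts `ok` or `degraded` after a deploy, but never `down`.
function deployGatePasses(body: string): boolean {
  return parseHealthStatus(body) !== "down";
}
```

Treating an unparseable body as `down` is a deliberately conservative choice: a gate that cannot read the health response should not report success.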
## 2) Incident Response

### 2.1 Severity Triage

- **P0 (Critical)**: the service cannot serve requests, `health.status=down`, or the database is unreachable.
- **P1 (High)**: the retry worker is down or failing repeatedly, webhook failures are rising sharply, or many channels are disconnected.
- **P2 (Medium)**: many notifications fail or arrive late, but core functionality still works.
- **P3 (Low)**: mild degradation that is not blocking.

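The triage rules above can be written as a small decision function. This is a minimal sketch: the `IncidentSignals` fields are hypothetical names for what the on-call engineer has observed, and the function assumes an incident is already open, so it falls back to P3 when none of the higher-severity conditions hold.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

// Hypothetical summary of observed symptoms (field names are assumptions).
interface IncidentSignals {
  healthStatus: "ok" | "degraded" | "down";
  canServeRequests: boolean;
  dbReachable: boolean;
  retryWorkerFailing: boolean;
  webhookFailuresSpiking: boolean;
  manyChannelsDisconnected: boolean;
  notificationsFailingOrLate: boolean;
}

// Check the most severe conditions first so the highest matching level wins.
function triage(s: IncidentSignals): Severity {
  if (!s.canServeRequests || s.healthStatus === "down" || !s.dbReachable) {
    return "P0";
  }
  if (s.retryWorkerFailing || s.webhookFailuresSpiking || s.manyChannelsDisconnected) {
    return "P1";
  }
  if (s.notificationsFailingOrLate) {
    return "P2";
  }
  // Mild, non-blocking degradation.
  return "P3";
}
```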
### 2.2 Immediate Actions

1. Check health and readiness:
   - `npm run ops:healthcheck`
   - `npm run ops:readiness`
2. Take an incident snapshot:
   - `npm run ops:incident`
3. Check the Super Admin pages:
   - `/super-admin/alerts`
   - `/super-admin/webhook-logs`
4. Check the retry worker:
   - `GET /api/jobs/campaign-retry?token=<token>`
5. If the retry lock is stuck (`isStaleLock` is set or `consecutiveFailures` is high), notify the on-call team manually and continue with the recovery actions that follow.

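The stuck-lock check in step 5 can be expressed as a predicate over the retry job status. This is a sketch under assumptions: only the field names `isStaleLock` and `consecutiveFailures` come from the runbook, while their types, the `RetryJobStatus` interface, and the threshold of 3 are hypothetical, since the runbook only says "high".

```typescript
// Hypothetical subset of the retry job status payload.
interface RetryJobStatus {
  isStaleLock: boolean;
  consecutiveFailures: number;
}

// Assumed threshold; tune it to what "high" means for the deployment.
const MAX_CONSECUTIVE_FAILURES = 3;

// True when the on-call team should be notified manually.
function needsManualEscalation(
  status: RetryJobStatus,
  maxFailures: number = MAX_CONSECUTIVE_FAILURES,
): boolean {
  return status.isStaleLock || status.consecutiveFailures >= maxFailures;
}
```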
## 3) Recovery Actions

### Recovery: Retry Worker Stuck

- If the lock looks stale, restart the worker daemon/service (a cron run will recover automatically).
- Make sure the token is still valid.
- Run:
  - `npm run job:campaign-retry` (one-shot) and watch the logs.

### Recovery: Webhook Spike

- Check the webhook provider and secret.
- Confirm the webhook endpoint and signature keys:
  - `WHATSAPP_WEBHOOK_SECRET`
  - `WHATSAPP_WEBHOOK_VERIFY_TOKEN`
- Check the event backlog:
  - If `WebhookEvents` with status `failed` are increasing, pull the details of the most recent event.

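When confirming the signature key, it can help to recompute a signature for a captured failed event and compare it with the header that was received. The sketch below assumes the provider signs the raw request body with HMAC-SHA256 and sends the result as `sha256=<hex>`, which is the scheme Meta documents for the `X-Hub-Signature-256` header. Whether this application uses `WHATSAPP_WEBHOOK_SECRET` in exactly this way is an assumption, and `isValidSignature` is a hypothetical helper.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true when `signatureHeader` ("sha256=<hex>") matches the
// HMAC-SHA256 of the raw request body under `secret`.
function isValidSignature(
  rawBody: string,
  signatureHeader: string,
  secret: string,
): boolean {
  const prefix = "sha256=";
  if (signatureHeader.slice(0, prefix.length) !== prefix) return false;

  const received = Buffer.from(signatureHeader.slice(prefix.length), "hex");
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();

  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The comparison must run against the raw body bytes as received. Re-serializing parsed JSON can change whitespace or key order and produce a mismatch even when the secret is correct.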
### Recovery: Channel Disconnected

- Check the channel under Super Admin > Channel.
- Make sure the number/hotfix config (phone number ID, token, WABA ID) is still valid.
- Trigger the reconnect action according to the channel integration SOP.

## 4) Rollback

Use a rollback if a bug affects the service within 10–15 minutes after a deploy and cannot be fixed quickly.

### Steps

1. Pin the previous deploy commit.
2. Deploy the rollback image/commit.
3. Stop workers running the old/new version:
   - restart the service so it picks up the newly deployed bundle.
4. Run:
   - `npm run db:deploy` (verify the schema is still compatible; avoid rolling back migrations unless that has been planned).
   - `npm run ops:readiness`
5. Monitor:
   - `GET /api/health`
   - Super Admin alerts for the first 15 minutes.

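The rollback criterion at the top of this section can be captured as a simple predicate. This is only a sketch: `RollbackInput` and its fields are hypothetical, and the 15-minute bound is taken from the upper end of the "10–15 minutes" window.

```typescript
// Upper bound of the post-deploy window stated in the runbook.
const ROLLBACK_WINDOW_MINUTES = 15;

// Hypothetical inputs gathered by the person deciding on the rollback.
interface RollbackInput {
  minutesSinceDeploy: number;
  serviceImpacted: boolean;
  quickFixAvailable: boolean;
}

// Roll back only when the service is impacted shortly after a deploy
// and no quick fix is available.
function shouldRollback(input: RollbackInput): boolean {
  return (
    input.serviceImpacted &&
    !input.quickFixAvailable &&
    input.minutesSinceDeploy <= ROLLBACK_WINDOW_MINUTES
  );
}
```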
### Migration Notes

- Database rollback priority: avoid `prisma migrate reset` in production.
- If a schema change is not yet backward compatible, roll back through a down migration (if available) via the DBA process.

## 5) Communication Template

- P0/P1: broadcast to the #incident channel + Slack/WhatsApp:
  - status, start time, impact, ETA, updates every 10 minutes.
- P2: internal update once the cause is identified, plus a repair estimate.

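A P0/P1 broadcast can be generated from the fields listed above so that every update has the same shape. The sketch below is illustrative: the `IncidentUpdate` interface, the message layout, and the `formatIncidentUpdate` helper are hypothetical, and only the field list and the 10-minute cadence come from the template.

```typescript
// Fields named in the communication template (types are assumptions).
interface IncidentUpdate {
  severity: "P0" | "P1";
  status: string;
  startTime: Date;
  impact: string;
  eta: string;
}

// Update cadence for P0/P1 incidents, per the template.
const UPDATE_INTERVAL_MINUTES = 10;

// Build a plain-text message that includes when the next update is due.
function formatIncidentUpdate(update: IncidentUpdate, now: Date): string {
  const nextUpdate = new Date(now.getTime() + UPDATE_INTERVAL_MINUTES * 60_000);
  return [
    `[${update.severity}] Incident update`,
    `Status: ${update.status}`,
    `Start time: ${update.startTime.toISOString()}`,
    `Impact: ${update.impact}`,
    `ETA: ${update.eta}`,
    `Next update: ${nextUpdate.toISOString()}`,
  ].join("\n");
}
```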