chore: initial project import

2026-04-21 09:29:29 +07:00
commit adde003fba
222 changed files with 37657 additions and 0 deletions
--- a/ops-runbook.md
+++ b/ops-runbook.md
@ -0,0 +1,91 @@
+# Production Runbook - WhatsApp Inbox
+
+## 1) Normal Deployment
+
+- Deploy code.
+- Run migration:
+  - `npm run db:deploy`
+- Update environment variables and secrets.
+- Start app service.
+- Start retry worker (`daemon` or cron).
+- Start maintenance cleanup (`npm run ops:maintenance`) on a periodic schedule, e.g. daily.
+- Run readiness:
+  - `npm run ops:readiness`
+- Monitor:
+  - `GET /api/health` returns `ok` or `degraded`, not `down`.
+  - Super Admin `alerts` and `webhook-logs` show no new critical spikes.
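The health criterion in the monitoring step can be expressed as a small check. This is a minimal sketch, assuming `/api/health` responds with JSON carrying a top-level string `status` field (hinted at by `health.status=down` in section 2.1, but not confirmed by this diff). `parseHealthStatus` and `isAcceptableHealth` are hypothetical helper names, not part of the project.

```typescript
// Hypothetical helpers: decide whether a health payload passes the post-deploy check.
// Assumption: the payload is a JSON object with a string `status` field.
type HealthStatus = "ok" | "degraded" | "down";

function parseHealthStatus(body: unknown): HealthStatus | null {
  if (typeof body !== "object" || body === null) return null;
  const status = (body as { status?: unknown }).status;
  if (status === "ok" || status === "degraded" || status === "down") {
    return status;
  }
  return null; // missing or unrecognised status
}

function isAcceptableHealth(body: unknown): boolean {
  const status = parseHealthStatus(body);
  // A missing or unknown status is treated as a failure, the same as `down`.
  return status === "ok" || status === "degraded";
}
```

Treating an unparseable response as a failure is a deliberate choice here: a deploy check that passes on garbage output would hide exactly the outages it is meant to catch.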
+
+## 2) Incident Response
+
+### 2.1 Severity Triage
+
+- **P0 (Critical)**: the service cannot serve requests, `health.status=down`, or the database is unreachable.
+- **P1 (High)**: the retry worker is down or failing repeatedly, webhook failures are rising sharply, or many channels are disconnected.
+- **P2 (Medium)**: many notifications fail or arrive late, but core functionality still works.
+- **P3 (Low)**: mild degradation that is not blocking.
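The severity levels above can be read as an ordered rule list in which the first matching rule wins. The sketch below is illustrative only: the runbook defines the levels in prose, so every field name in `IncidentSignals` is a hypothetical stand-in for whatever the on-call engineer actually observes.

```typescript
// Hypothetical signal names; the runbook describes these conditions in prose only.
interface IncidentSignals {
  healthDown: boolean; // requests cannot be served or health.status=down
  dbUnreachable: boolean;
  retryWorkerFailing: boolean; // worker down or failing repeatedly
  webhookFailureSpike: boolean;
  manyChannelsDisconnected: boolean;
  notificationsFailingOrLate: boolean;
  mildDegradation: boolean;
}

type Severity = "P0" | "P1" | "P2" | "P3" | "none";

// Rules are checked from most to least severe, so the highest level wins.
function triage(s: IncidentSignals): Severity {
  if (s.healthDown || s.dbUnreachable) return "P0";
  if (s.retryWorkerFailing || s.webhookFailureSpike || s.manyChannelsDisconnected) {
    return "P1";
  }
  if (s.notificationsFailingOrLate) return "P2";
  if (s.mildDegradation) return "P3";
  return "none";
}
```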
+
+### 2.2 Immediate Actions
+
+1. Check health and readiness:
+   - `npm run ops:healthcheck`
+   - `npm run ops:readiness`
+2. Capture an incident snapshot:
+   - `npm run ops:incident`
+3. Check the Super Admin pages:
+   - `/super-admin/alerts`
+   - `/super-admin/webhook-logs`
+4. Check the retry worker:
+   - `GET /api/jobs/campaign-retry?token=<token>`
+5. If the retry lock is stuck (`isStaleLock`, or a high `consecutiveFailures`), manually notify the on-call team and continue with the recovery actions below.
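The stuck-lock condition in step 5 can be sketched as a predicate over the retry job status. The field names `isStaleLock` and `consecutiveFailures` come from the runbook. Their types, the shape of the status object, and the threshold of 3 are assumptions made for illustration, since the runbook only says "high".

```typescript
// Assumed shape of the status returned by the campaign-retry job endpoint.
interface RetryJobStatus {
  isStaleLock: boolean;
  consecutiveFailures: number;
}

// Assumption: the runbook does not define "high"; 3 is a placeholder value.
const DEFAULT_FAILURE_THRESHOLD = 3;

// Returns true when the on-call team should be notified manually.
function needsManualEscalation(
  status: RetryJobStatus,
  failureThreshold: number = DEFAULT_FAILURE_THRESHOLD,
): boolean {
  return status.isStaleLock || status.consecutiveFailures >= failureThreshold;
}
```

Keeping the threshold as a parameter lets the team tune it once real failure patterns are known.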
+
+## 3) Recovery Actions
+
+### Recovery: Retry Worker Stuck
+
+- If the lock looks stale, restart the worker daemon/service (a cron-based run recovers automatically).
+- Make sure the token is still valid.
+- Run:
+  - `npm run job:campaign-retry` (one-shot) and watch the logs.
+
+### Recovery: Webhook Spike
+
+- Check the webhook provider/secret.
+- Confirm the webhook endpoint and signature key:
+  - `WHATSAPP_WEBHOOK_SECRET`
+  - `WHATSAPP_WEBHOOK_VERIFY_TOKEN`
+- Check the event backlog:
+  - If the number of `WebhookEvents` with status `failed` is rising, pull the details of the most recent event.
+
+### Recovery: Channel Disconnected
+
+- Check the channel in Super Admin > Channel.
+- Make sure the number/hotfix config (phone number ID, token, WABA ID) is still valid.
+- Trigger the reconnect action according to the channel integration SOP.
+
+## 4) Rollback
+
+Roll back if a bug impacts service within 10–15 minutes after a deploy and cannot be fixed quickly.
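The rollback criterion can be written as a simple time-window check. This is a sketch under stated assumptions: `shouldRollBack` is a hypothetical helper, the 15-minute default takes the upper bound of the 10–15 minute range, and whether a quick fix exists remains a human judgment passed in as a flag.

```typescript
// Upper bound of the runbook's 10-15 minute window (assumed default).
const ROLLBACK_WINDOW_MINUTES = 15;

// Hypothetical helper: true when the impact appeared inside the window
// after the deploy and no quick fix is available.
function shouldRollBack(
  deployedAt: Date,
  impactSeenAt: Date,
  quickFixAvailable: boolean,
  windowMinutes: number = ROLLBACK_WINDOW_MINUTES,
): boolean {
  const elapsedMinutes =
    (impactSeenAt.getTime() - deployedAt.getTime()) / 60_000;
  const insideWindow = elapsedMinutes >= 0 && elapsedMinutes <= windowMinutes;
  return insideWindow && !quickFixAvailable;
}
```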
+
+### Steps
+
+1. Pin the previously deployed commit.
+2. Deploy the rollback image/commit.
+3. Stop workers from both the old and new versions:
+   - restart the service so it picks up the newly deployed bundle.
+4. Run:
+   - `npm run db:deploy` (verify the schema is still compatible; avoid rolling back migrations unless it was planned).
+   - `npm run ops:readiness`
+5. Monitor:
+   - `GET /api/health`
+   - Super Admin alerts for the first 15 minutes.
+
+### Migration Notes
+
+- Database rollback priority: avoid `prisma migrate reset` in production.
+- If a schema change is not yet backward compatible, roll back through a down migration (if available) via the DBA process.
+
+## 5) Communication Template
+
+- P0/P1: broadcast to the #incident channel + Slack/WhatsApp:
+  - status, start time, impact, ETA, updates every 10 minutes.
+- P2: send an internal update once the cause is identified, with an estimated time to fix.
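The P0/P1 broadcast fields can be assembled into a consistent message. The sketch below is one possible layout, not a format the project prescribes: `IncidentUpdate` and `formatIncidentBroadcast` are hypothetical names, and the line order simply follows the field list in the template.

```typescript
// Hypothetical shape covering the fields named in the P0/P1 template.
interface IncidentUpdate {
  severity: "P0" | "P1";
  status: string;
  startTime: Date;
  impact: string;
  eta: string;
}

// Update cadence taken from the template ("every 10 minutes").
const UPDATE_INTERVAL_MINUTES = 10;

function formatIncidentBroadcast(update: IncidentUpdate): string {
  return [
    `[${update.severity}] Status: ${update.status}`,
    `Start time: ${update.startTime.toISOString()}`,
    `Impact: ${update.impact}`,
    `ETA: ${update.eta}`,
    `Next update in ${UPDATE_INTERVAL_MINUTES} minutes.`,
  ].join("\n");
}
```

Using an ISO 8601 UTC timestamp avoids ambiguity when responders sit in different time zones.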