chore: initial project import
Wira Basalamah
2026-04-21 09:29:29 +07:00
commit adde003fba
222 changed files with 37657 additions and 0 deletions

ops-runbook.md

@ -0,0 +1,91 @@
# Production Runbook - WhatsApp Inbox
## 1) Normal Deployment
- Deploy code.
- Run migration:
  - `npm run db:deploy`
- Update environment variables and secrets.
- Start app service.
- Start retry worker (`daemon` or cron).
- Start maintenance cleanup (`npm run ops:maintenance`) on a periodic schedule, e.g. daily.
- Run readiness:
  - `npm run ops:readiness`
- Monitor:
  - `GET /api/health` returns `ok` or `degraded`, not `down`.
  - Super Admin `alerts` and `webhook-logs` show no new critical spikes.
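
The monitoring rule above can be sketched as a small check. This is a minimal illustration, not the project's implementation: only the status strings `ok`, `degraded`, and `down` come from this runbook, while the function names and the idea of collecting several readings after a deploy are assumptions.

```typescript
// Hypothetical helper: a health reading is acceptable when it is `ok` or
// `degraded`. `down`, or any unrecognized value, is treated as a failure.
function isHealthAcceptable(status: string): boolean {
  return status === "ok" || status === "degraded";
}

// Hypothetical helper: given readings collected after a deploy, return the
// index of the first unacceptable one, or -1 when every reading passes.
function firstUnacceptableReading(readings: string[]): number {
  let index = 0;
  for (const reading of readings) {
    if (!isHealthAcceptable(reading)) {
      return index;
    }
    index += 1;
  }
  return -1;
}
```

Treating unknown values as failures is a deliberate choice here: a malformed health response right after a deploy is better investigated than ignored.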
## 2) Incident Response
### 2.1 Severity Triage
- **P0 (Critical)**: the service cannot serve requests, `health.status=down`, or the DB is unreachable.
- **P1 (High)**: the retry worker is down or failing repeatedly, webhook failures rise sharply, or many channels are disconnected.
- **P2 (Medium)**: many notifications fail or arrive late, but the system is still functional.
- **P3 (Low)**: mild degradation that is not blocking.
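
As a rough illustration, the triage levels above can be expressed as an ordered set of checks. The `TriageSignals` shape and its field names are assumptions made for this sketch; how each signal is actually measured is not specified in this runbook.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

// Hypothetical input shape: each flag stands for one condition named in the
// triage list. How the flags get populated is outside this sketch.
interface TriageSignals {
  healthStatus: "ok" | "degraded" | "down";
  servingRequests: boolean;
  dbReachable: boolean;
  retryWorkerFailing: boolean;
  webhookFailureSpike: boolean;
  manyChannelsDisconnected: boolean;
  notificationsFailingOrLate: boolean;
}

// Checks run from most to least severe, so the highest matching level wins.
// Returns null when no condition from the triage list applies.
function classifySeverity(s: TriageSignals): Severity | null {
  if (!s.servingRequests || s.healthStatus === "down" || !s.dbReachable) {
    return "P0";
  }
  if (s.retryWorkerFailing || s.webhookFailureSpike || s.manyChannelsDisconnected) {
    return "P1";
  }
  if (s.notificationsFailingOrLate) {
    return "P2";
  }
  if (s.healthStatus === "degraded") {
    return "P3";
  }
  return null;
}

// Example input: everything healthy, so no severity applies.
const allClear: TriageSignals = {
  healthStatus: "ok",
  servingRequests: true,
  dbReachable: true,
  retryWorkerFailing: false,
  webhookFailureSpike: false,
  manyChannelsDisconnected: false,
  notificationsFailingOrLate: false,
};
```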
### 2.2 Immediate Actions
1. Check health and readiness:
   - `npm run ops:healthcheck`
   - `npm run ops:readiness`
2. Take an incident snapshot:
   - `npm run ops:incident`
3. Check the Super Admin pages:
   - `/super-admin/alerts`
   - `/super-admin/webhook-logs`
4. Check the retry worker:
   - `GET /api/jobs/campaign-retry?token=<token>`
5. If the retry lock is stuck (`isStaleLock` is set or `consecutiveFailures` is high), manually notify the on-call team and continue with the recovery actions that follow.
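
The stuck-lock check in the last step can be sketched as below. The field names `isStaleLock` and `consecutiveFailures` come from this runbook; the rest of the response shape and the threshold of 3 are assumptions, since the runbook only says the failure count is "high".

```typescript
// Assumed subset of the retry worker status payload. Only these two field
// names appear in the runbook; the real response may carry more fields.
interface RetryWorkerStatus {
  isStaleLock: boolean;
  consecutiveFailures: number;
}

// Returns true when the on-call team should be notified manually.
// `failureThreshold` is a hypothetical cutoff for a "high" failure count.
function needsManualEscalation(
  status: RetryWorkerStatus,
  failureThreshold: number = 3
): boolean {
  return status.isStaleLock || status.consecutiveFailures >= failureThreshold;
}
```

Keeping the threshold as a parameter lets a team tune it to their retry cadence instead of hard-coding a value this runbook does not define.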
## 3) Recovery Actions
### Recovery: Retry Worker Stuck
- If the lock looks stale, restart the worker daemon/service (the automatic cron run will recover on its own).
- Make sure the token is still valid.
- Run:
  - `npm run job:campaign-retry` (one-shot) and watch the logs.
### Recovery: Webhook Spike
- Check the webhook provider and secret.
- Confirm the webhook endpoint and signature key:
  - `WHATSAPP_WEBHOOK_SECRET`
  - `WHATSAPP_WEBHOOK_VERIFY_TOKEN`
- Check the event backlog:
  - If the number of `WebhookEvents` with status `failed` is rising, pull the details of the most recent event.
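
As a hedged sketch of the backlog check, the helpers below count failed events and pick the most recent one from a list that has already been loaded. The record shape (`id`, `status`, `createdAt`) is an assumption for illustration; the actual `WebhookEvents` schema is not shown in this runbook.

```typescript
// Hypothetical record shape; field names are assumed, not taken from the schema.
interface WebhookEventRecord {
  id: string;
  status: string;
  createdAt: string; // ISO 8601 timestamp
}

// Counts events whose status is `failed`.
function countFailed(events: WebhookEventRecord[]): number {
  let count = 0;
  for (const event of events) {
    if (event.status === "failed") {
      count += 1;
    }
  }
  return count;
}

// Returns the most recent failed event, or null when none has failed.
function latestFailedEvent(events: WebhookEventRecord[]): WebhookEventRecord | null {
  let latest: WebhookEventRecord | null = null;
  for (const event of events) {
    if (event.status !== "failed") {
      continue;
    }
    if (latest === null || Date.parse(event.createdAt) > Date.parse(latest.createdAt)) {
      latest = event;
    }
  }
  return latest;
}
```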
### Recovery: Channel Disconnected
- Check the channel under Super Admin > Channel.
- Make sure the number/hotfix config (phone number ID, token, WABA ID) is still valid.
- Trigger the reconnect action according to the channel integration SOP.
## 4) Rollback
Use a rollback if a bug impacts the service within 10-15 minutes after deploy and cannot be fixed quickly.
### Steps
1. Pin the previous deploy commit.
2. Deploy the rollback image/commit.
3. Stop workers running the old/new version:
   - restart the service so that it uses the new bundle.
4. Run:
   - `npm run db:deploy` (verify that the schema still matches; avoid rolling back migrations unless that was planned).
   - `npm run ops:readiness`
5. Monitor:
   - `GET /api/health`
   - Super Admin alerts for the first 15 minutes.
### Migration Notes
- Database rollback priority: avoid `prisma migrate reset` in production.
- If a schema change is not yet backward compatible, roll back through a down migration (if available) via the DBA process.
## 5) Communication Template
- P0/P1: broadcast to the #incident channel + Slack/WhatsApp:
  - status, start time, impact, ETA, with an update every 10 minutes.
- P2: send an internal update once the cause is identified, along with an estimated fix time.
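
A P0/P1 broadcast can be assembled from the listed fields as in this minimal sketch. The field list and the 10-minute cadence come from the template above; the `IncidentUpdate` shape, the function name, and the exact message layout are assumptions.

```typescript
// Hypothetical shape holding the fields listed for P0/P1 broadcasts.
interface IncidentUpdate {
  severity: "P0" | "P1";
  status: string;
  startTime: Date;
  impact: string;
  eta: string;
}

// Update cadence for P0/P1 incidents, in minutes.
const UPDATE_INTERVAL_MINUTES = 10;

// Builds the broadcast text and states when the next update is due.
// `now` is passed in so the result is deterministic and easy to test.
function formatIncidentUpdate(update: IncidentUpdate, now: Date): string {
  const nextUpdate = new Date(now.getTime() + UPDATE_INTERVAL_MINUTES * 60 * 1000);
  return [
    `[${update.severity}] Status: ${update.status}`,
    `Start time: ${update.startTime.toISOString()}`,
    `Impact: ${update.impact}`,
    `ETA: ${update.eta}`,
    `Next update: ${nextUpdate.toISOString()}`,
  ].join("\n");
}
```

Timestamps are emitted in UTC via `toISOString()`; a team working in a single timezone may prefer to format them locally before sending.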