chore: initial project import
Some checks failed
CI - Production Readiness / Verify (push) Has been cancelled

Wira Basalamah
2026-04-21 09:29:29 +07:00
commit adde003fba
222 changed files with 37657 additions and 0 deletions

alert-policy.md Normal file

@@ -0,0 +1,60 @@
# Alert Policy - WhatsApp Inbox
## Severity Matrix
- **P0 (Critical)** System is down for all users.
  - Trigger:
    - `GET /api/health` returns status `down`.
    - DB unreachable.
    - Campaign retries cannot be sent or pulled for more than 15 minutes.
  - Response target:
    - Acknowledge: 5 minutes
    - Initial mitigation: 15 minutes
  - Owner: Platform Lead
- **P1 (High)** Core features are impaired (campaign, webhook, retry).
  - Trigger:
    - Retry worker reports status `failed` 3 or more times in a row.
    - `BackgroundJobState.consecutiveFailures` keeps increasing.
    - `campaign-retry-worker` has not run for more than 60 minutes.
    - Failed webhooks over the past hour exceed the threshold.
  - Response target:
    - Acknowledge: 15 minutes
    - Initial mitigation: 45 minutes
  - Owner: Platform + Operations
- **P2 (Medium)** Non-blocking performance degradation.
  - Trigger:
    - `GET /api/health` returns status `degraded`.
    - More than 1 channel disconnected within a single tenant.
  - Response target:
    - Acknowledge: 60 minutes
    - Initial mitigation: 4 hours
  - Owner: Platform
- **P3 (Low)** Operational information.
  - Trigger:
    - Uptick in minor events, non-urgent warnings.
  - Response target:
    - Acknowledge: next business cycle
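
The severity matrix above can be read as a first-match rule list, checked from P0 down to P3. The following is a minimal sketch under that assumption, not the project's implementation: the `Signals` field names, the `"ok"` health value, and the `classifySeverity` helper are all hypothetical. Only the thresholds, response targets, and owners come from the policy.

```typescript
// Sketch only: Signals field names are hypothetical, not taken from the codebase.
type Severity = "P0" | "P1" | "P2" | "P3";

interface Signals {
  healthStatus: "ok" | "degraded" | "down"; // "ok" is an assumed healthy value
  dbReachable: boolean;
  minutesCampaignRetryBlocked: number;
  consecutiveRetryFailures: number;
  consecutiveFailuresRising: boolean;
  minutesSinceRetryWorkerRan: number;
  failedWebhooksLastHour: number;
  webhookFailureThreshold: number;
  maxDisconnectedChannelsInOneTenant: number;
}

interface ResponseTarget {
  acknowledgeMinutes: number | null; // null: "next business cycle"
  mitigateMinutes: number | null; // null: not specified by the policy
  owner: string | null; // null: not specified by the policy
}

// Response targets and owners as listed in the severity matrix.
const RESPONSE_TARGETS: Record<Severity, ResponseTarget> = {
  P0: { acknowledgeMinutes: 5, mitigateMinutes: 15, owner: "Platform Lead" },
  P1: { acknowledgeMinutes: 15, mitigateMinutes: 45, owner: "Platform + Operations" },
  P2: { acknowledgeMinutes: 60, mitigateMinutes: 240, owner: "Platform" },
  P3: { acknowledgeMinutes: null, mitigateMinutes: null, owner: null },
};

// The highest matching severity wins; anything that matches no rule falls to P3.
function classifySeverity(s: Signals): Severity {
  if (s.healthStatus === "down" || !s.dbReachable || s.minutesCampaignRetryBlocked > 15) {
    return "P0";
  }
  if (
    s.consecutiveRetryFailures >= 3 ||
    s.consecutiveFailuresRising ||
    s.minutesSinceRetryWorkerRan > 60 ||
    s.failedWebhooksLastHour > s.webhookFailureThreshold
  ) {
    return "P1";
  }
  if (s.healthStatus === "degraded" || s.maxDisconnectedChannelsInOneTenant > 1) {
    return "P2";
  }
  return "P3";
}

const healthy: Signals = {
  healthStatus: "ok",
  dbReachable: true,
  minutesCampaignRetryBlocked: 0,
  consecutiveRetryFailures: 0,
  consecutiveFailuresRising: false,
  minutesSinceRetryWorkerRan: 0,
  failedWebhooksLastHour: 0,
  webhookFailureThreshold: 20,
  maxDisconnectedChannelsInOneTenant: 0,
};

// Three consecutive retry failures on an otherwise healthy system.
const example = classifySeverity({ ...healthy, consecutiveRetryFailures: 3 }); // "P1"
const target = RESPONSE_TARGETS[example];
```

Ordering the checks from most to least severe means an incident that matches several triggers is always handled under the strictest response target.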
## Alert Routing
- Primary: Slack/Discord webhook (`CAMPAIGN_RETRY_ALERT_WEBHOOK_URL`) for retry failure events.
- Secondary: Team channel / chat group.
- Escalation (P0/P1): paging on-call.
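
As an illustration of the routing rules above, the sketch below picks destinations for a single alert. It rests on assumptions the policy does not spell out: the team channel is notified for every alert, and the webhook is used only when a URL is configured. The `AlertEvent` shape, the `Destination` labels, and `routeAlert` are hypothetical names.

```typescript
// Sketch only: types and destination labels are illustrative, not from the codebase.
type Severity = "P0" | "P1" | "P2" | "P3";
type Destination = "retry-webhook" | "team-channel" | "on-call-pager";

interface AlertEvent {
  severity: Severity;
  isRetryFailure: boolean;
}

// webhookUrl stands in for the value of CAMPAIGN_RETRY_ALERT_WEBHOOK_URL.
function routeAlert(event: AlertEvent, webhookUrl: string | undefined): Destination[] {
  const destinations: Destination[] = [];
  // Primary: webhook, only for retry failure events and only if a URL is configured.
  if (event.isRetryFailure && webhookUrl !== undefined && webhookUrl !== "") {
    destinations.push("retry-webhook");
  }
  // Secondary: team channel (assumed here to receive every alert).
  destinations.push("team-channel");
  // Escalation: page on-call for P0 and P1.
  if (event.severity === "P0" || event.severity === "P1") {
    destinations.push("on-call-pager");
  }
  return destinations;
}

const routed = routeAlert(
  { severity: "P1", isRetryFailure: true },
  "https://example.invalid/hook", // placeholder URL
);
```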
## Tuning
- Set `CAMPAIGN_RETRY_ALERT_ON_FAILURE=false` if alert volume is too high, and fall back to manual monitoring.
- Tune:
  - `WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR` (default 20)
  - `RETRY_WORKER_STALE_MINUTES` (default 30)
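
One way these settings might be read is sketched below, with the documented defaults applied when a variable is unset or invalid. The `readAlertTuning` helper and the `AlertTuning` shape are hypothetical. Treating any value other than `false` as "alerts enabled" is also an assumption, since the policy does not state the default for `CAMPAIGN_RETRY_ALERT_ON_FAILURE`.

```typescript
// Sketch only: helper and type names are illustrative, not from the codebase.
type Env = Record<string, string | undefined>;

interface AlertTuning {
  alertOnFailure: boolean;
  webhookFailureThresholdPerHour: number;
  retryWorkerStaleMinutes: number;
}

// Returns the parsed value if it is a finite positive number, else the fallback.
function positiveNumberOr(raw: string | undefined, fallback: number): number {
  const parsed = Number(raw);
  return isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function readAlertTuning(env: Env): AlertTuning {
  const flag = (env["CAMPAIGN_RETRY_ALERT_ON_FAILURE"] ?? "true").trim().toLowerCase();
  return {
    // Assumption: alerts stay on unless the flag is explicitly "false".
    alertOnFailure: flag !== "false",
    webhookFailureThresholdPerHour: positiveNumberOr(
      env["WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR"],
      20, // documented default
    ),
    retryWorkerStaleMinutes: positiveNumberOr(
      env["RETRY_WORKER_STALE_MINUTES"],
      30, // documented default
    ),
  };
}

const defaults = readAlertTuning({});
const quieter = readAlertTuning({
  CAMPAIGN_RETRY_ALERT_ON_FAILURE: "false",
  WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR: "50",
});
```

In a Node.js process, `process.env` could be passed in place of the literal objects. Taking the environment as a parameter keeps the sketch testable.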
## Metrics reviewed in every shift
- `WebhookEvent` failure rate (1h)
- `BackgroundJobState.consecutiveFailures`
- Channels with `Channel.status` `DISCONNECTED`
- `CampaignRecipient` queue depth by `sendStatus`
- Health endpoint status
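
Two of these shift metrics can be sketched over in-memory rows, as below. The row shapes are assumptions: `failed` and `receivedAt` on the webhook rows are hypothetical fields, and only `sendStatus` is named in this policy. A real check would more likely run as database queries.

```typescript
// Sketch only: row shapes are illustrative, not the project's actual models.
interface WebhookEventRow {
  failed: boolean; // hypothetical field
  receivedAt: Date; // hypothetical field
}

interface CampaignRecipientRow {
  sendStatus: string;
}

// Counts failed webhook events received within the hour ending at `now`.
function failedWebhooksInLastHour(events: WebhookEventRow[], now: Date): number {
  const end = now.getTime();
  const start = end - 60 * 60 * 1000;
  return events.filter((e) => {
    const t = e.receivedAt.getTime();
    return e.failed && t > start && t <= end;
  }).length;
}

// Groups recipients by sendStatus and counts each group.
function queueDepthBySendStatus(rows: CampaignRecipientRow[]): Record<string, number> {
  const depth: Record<string, number> = {};
  for (const row of rows) {
    depth[row.sendStatus] = (depth[row.sendStatus] ?? 0) + 1;
  }
  return depth;
}

const now = new Date("2026-04-21T09:00:00Z");
const failures = failedWebhooksInLastHour(
  [
    { failed: true, receivedAt: new Date("2026-04-21T08:30:00Z") },
    { failed: false, receivedAt: new Date("2026-04-21T08:45:00Z") },
    { failed: true, receivedAt: new Date("2026-04-21T07:30:00Z") }, // older than 1h
  ],
  now,
);
const depth = queueDepthBySendStatus([
  { sendStatus: "PENDING" }, // status values here are made up for the example
  { sendStatus: "PENDING" },
  { sendStatus: "FAILED" },
]);
```

The failure count can then be compared against `WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR`.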