# Alert Policy - WhatsApp Inbox

## Severity Matrix

- **P0 (Critical)** – System down for all users.
  - Trigger:
    - `GET /api/health` reports status `down`.
    - Database unreachable.
    - Campaign retries cannot be sent or pulled for more than 15 minutes.
  - Response target:
    - Acknowledge: 5 minutes
    - Initial mitigation: 15 minutes
  - Owner: Platform Lead
- **P1 (High)** – Core features disrupted (campaign, webhook, retry).
  - Trigger:
    - Retry worker status is `failed` 3 or more times in a row.
    - `BackgroundJobState.consecutiveFailures` keeps increasing.
    - `campaign-retry-worker` has not run for more than 60 minutes.
    - Failed webhooks in the last hour exceed the threshold.
  - Response target:
    - Acknowledge: 15 minutes
    - Initial mitigation: 45 minutes
  - Owner: Platform + Operations
- **P2 (Medium)** – Non-blocking performance degradation.
  - Trigger:
    - `GET /api/health` reports status `degraded`.
    - More than 1 channel disconnected within a single tenant.
  - Response target:
    - Acknowledge: 60 minutes
    - Initial mitigation: 4 hours
  - Owner: Platform
- **P3 (Low)** – Operational information.
  - Trigger:
    - Increase in minor events, non-urgent warnings.
  - Response target:
    - Acknowledge: next business cycle
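
The matrix above can be read as an ordered set of checks, most severe first. The following is a minimal sketch under stated assumptions, not the project's actual implementation: the `Signals` shape, its field names, the `"ok"` health value, and the `classify` function are all hypothetical. The trend-based trigger (`BackgroundJobState.consecutiveFailures` keeps increasing) is left out because it needs history rather than a single snapshot.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

// Hypothetical snapshot of the signals named in the matrix.
// Field names are illustrative and not taken from the codebase.
interface Signals {
  healthStatus: "ok" | "degraded" | "down"; // GET /api/health ("ok" is assumed)
  dbReachable: boolean;
  minutesCampaignRetryBlocked: number; // retries could not be sent or pulled
  retryWorkerConsecutiveFailures: number;
  minutesSinceRetryWorkerRan: number;
  failedWebhooksLastHour: number;
  webhookFailureThreshold: number;
  maxDisconnectedChannelsInOneTenant: number;
}

// Response targets from the matrix, in minutes.
// null means the matrix gives no numeric target (P3: next business cycle).
const RESPONSE_TARGETS: Record<
  Severity,
  { acknowledgeMin: number | null; mitigateMin: number | null }
> = {
  P0: { acknowledgeMin: 5, mitigateMin: 15 },
  P1: { acknowledgeMin: 15, mitigateMin: 45 },
  P2: { acknowledgeMin: 60, mitigateMin: 240 },
  P3: { acknowledgeMin: null, mitigateMin: null },
};

// Checks run from most to least severe; the first match wins.
function classify(s: Signals): Severity {
  if (s.healthStatus === "down" || !s.dbReachable || s.minutesCampaignRetryBlocked > 15) {
    return "P0";
  }
  if (
    s.retryWorkerConsecutiveFailures >= 3 ||
    s.minutesSinceRetryWorkerRan > 60 ||
    s.failedWebhooksLastHour > s.webhookFailureThreshold
  ) {
    return "P1";
  }
  if (s.healthStatus === "degraded" || s.maxDisconnectedChannelsInOneTenant > 1) {
    return "P2";
  }
  // Assumption: anything that reaches this point is treated as informational.
  return "P3";
}
```

Evaluating the most severe conditions first means that an incident matching several rows is always reported at its highest severity.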

## Alert Routing

- Primary: Slack/Discord webhook (`CAMPAIGN_RETRY_ALERT_WEBHOOK_URL`) for retry failure events.
- Secondary: Team channel / chat group.
- Escalation (P0/P1): page the on-call engineer.
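
As an illustration of the routing rules above, the sketch below decides which targets an alert should reach and builds the webhook body. `planRoute`, `buildWebhookBody`, and the `RoutePlan` shape are hypothetical names introduced here. Notifying the team channel for every alert is an assumption, since the policy does not say when the secondary route applies. The only external facts relied on are that Slack incoming webhooks read the `text` field and Discord webhooks read the `content` field.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";
type WebhookKind = "slack" | "discord";

interface RoutePlan {
  webhook: boolean; // primary: Slack/Discord webhook
  teamChannel: boolean; // secondary: team channel / chat group
  pageOnCall: boolean; // escalation for P0/P1
}

function planRoute(
  severity: Severity,
  isRetryFailure: boolean,
  webhookUrl: string | undefined, // value of CAMPAIGN_RETRY_ALERT_WEBHOOK_URL
): RoutePlan {
  return {
    // The webhook is used for retry failure events, and only when a URL is configured.
    webhook: isRetryFailure && Boolean(webhookUrl),
    // Assumption: the team channel is always notified.
    teamChannel: true,
    pageOnCall: severity === "P0" || severity === "P1",
  };
}

// Builds the JSON body for the chosen webhook flavour.
function buildWebhookBody(kind: WebhookKind, severity: Severity, message: string): string {
  const line = `[${severity}] ${message}`;
  return JSON.stringify(kind === "slack" ? { text: line } : { content: line });
}
```

The resulting string could then be sent with an HTTP `POST` and a `Content-Type: application/json` header; that step is kept out of the sketch so the routing decision stays easy to test.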

## Tuning

- Set `CAMPAIGN_RETRY_ALERT_ON_FAILURE=false` if alert volume is too high, and fall back to manual monitoring.
- Tune:
  - `WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR` (default 20)
  - `RETRY_WORKER_STALE_MINUTES` (default 30)
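
One way these settings might be read at startup is sketched below. `loadAlertTuning`, `readPositiveInt`, and the `AlertTuning` shape are hypothetical. The defaults of 20 and 30 come from the list above, while treating alerting as enabled unless the variable is exactly `false` is an assumption, because the policy does not state that default.

```typescript
interface AlertTuning {
  alertOnFailure: boolean;
  webhookFailureThresholdPerHour: number;
  retryWorkerStaleMinutes: number;
}

// Accepts any env-like map so the function can be tested without touching the real environment.
type Env = Record<string, string | undefined>;

// Falls back to the default when the value is missing, non-numeric, or not positive.
function readPositiveInt(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

function loadAlertTuning(env: Env): AlertTuning {
  return {
    // Assumption: alerting stays on unless the variable is exactly "false".
    alertOnFailure: env["CAMPAIGN_RETRY_ALERT_ON_FAILURE"] !== "false",
    webhookFailureThresholdPerHour: readPositiveInt(
      env["WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR"],
      20,
    ),
    retryWorkerStaleMinutes: readPositiveInt(env["RETRY_WORKER_STALE_MINUTES"], 30),
  };
}
```

In a Node.js service this would typically be called once as `loadAlertTuning(process.env)`.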

## Metrics Reviewed in Every Shift

- `WebhookEvent` failure rate (1h)
- `BackgroundJobState.consecutiveFailures`
- `Channel.status` `DISCONNECTED`
- `CampaignRecipient` queue depth by `sendStatus`
- Health endpoint status
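
Two of the metrics above are simple aggregations, sketched here over in-memory rows. This is only an illustration: apart from `sendStatus`, the row shapes and field names (`failed`, `createdAt`) are hypothetical, and a real check would more likely run as a database query.

```typescript
// Hypothetical row shapes; only `sendStatus` is named in the policy.
interface RecipientRow {
  sendStatus: string;
}

interface WebhookEventRow {
  failed: boolean;
  createdAt: Date;
}

// Queue depth of CampaignRecipient rows, grouped by sendStatus.
function queueDepthByStatus(rows: RecipientRow[]): Record<string, number> {
  const depth: Record<string, number> = {};
  for (const row of rows) {
    depth[row.sendStatus] = (depth[row.sendStatus] ?? 0) + 1;
  }
  return depth;
}

// Number of failed WebhookEvent rows created in the hour ending at `now`.
function failedWebhooksLastHour(events: WebhookEventRow[], now: Date): number {
  const end = now.getTime();
  const start = end - 60 * 60 * 1000;
  return events.filter((e) => {
    const t = e.createdAt.getTime();
    return e.failed && t >= start && t <= end;
  }).length;
}
```

The failure count can be compared against `WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR` during the shift review.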