chore: initial project import
Some checks failed
CI - Production Readiness / Verify (push) Has been cancelled
60
alert-policy.md
Normal file
# Alert Policy - WhatsApp Inbox

## Severity Matrix

- **P0 (Critical)** – System down for all users.
  - Trigger:
    - `GET /api/health` reports status `down`.
    - DB unreachable.
    - Unable to send or pull campaign retries for more than 15 minutes.
  - Response target:
    - Acknowledge: 5 minutes
    - Initial mitigation: 15 minutes
  - Owner: Platform Lead

- **P1 (High)** – Core features disrupted (campaign, webhook, retry).
  - Trigger:
    - Retry worker status `failed` 3 or more times in a row.
    - `BackgroundJobState.consecutiveFailures` keeps increasing.
    - `campaign-retry-worker` has not run for more than 60 minutes.
    - Failed webhooks in the last hour exceed the threshold.
  - Response target:
    - Acknowledge: 15 minutes
    - Initial mitigation: 45 minutes
  - Owner: Platform + Operations

- **P2 (Medium)** – Non-blocking performance degradation.
  - Trigger:
    - `GET /api/health` reports status `degraded`.
    - More than 1 channel disconnected within a single tenant.
  - Response target:
    - Acknowledge: 60 minutes
    - Initial mitigation: 4 hours
  - Owner: Platform

- **P3 (Low)** – Operational information.
  - Trigger:
    - Increase in minor events, non-urgent warnings.
  - Response target:
    - Acknowledge: next business cycle
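The response targets above could be encoded as typed data so that tooling can compute deadlines from them. The following is only a minimal sketch: the names `ResponseTarget`, `RESPONSE_TARGETS`, and `acknowledgeDeadline` are hypothetical and do not come from the project, and `null` stands in for values the policy does not state as a number.

```typescript
// Hypothetical encoding of the severity matrix; all names here are illustrative.
type Severity = "P0" | "P1" | "P2" | "P3";

interface ResponseTarget {
  acknowledgeMinutes: number | null; // null: "next business cycle", no fixed deadline
  mitigateMinutes: number | null;    // null: the policy states no mitigation target
  owner: string | null;              // null: the policy names no owner
}

const RESPONSE_TARGETS: Record<Severity, ResponseTarget> = {
  P0: { acknowledgeMinutes: 5, mitigateMinutes: 15, owner: "Platform Lead" },
  P1: { acknowledgeMinutes: 15, mitigateMinutes: 45, owner: "Platform + Operations" },
  P2: { acknowledgeMinutes: 60, mitigateMinutes: 240, owner: "Platform" },
  P3: { acknowledgeMinutes: null, mitigateMinutes: null, owner: null },
};

// Returns the acknowledge deadline, or null when no fixed deadline applies.
function acknowledgeDeadline(severity: Severity, raisedAt: Date): Date | null {
  const minutes = RESPONSE_TARGETS[severity].acknowledgeMinutes;
  if (minutes === null) return null;
  return new Date(raisedAt.getTime() + minutes * 60000);
}
```

Keeping the matrix as plain data means a change to a target touches one table rather than scattered conditionals.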
## Alert Routing

- Primary: Slack/Discord webhook (`CAMPAIGN_RETRY_ALERT_WEBHOOK_URL`) for retry failure events.
- Secondary: Team channel / chat group.
- Escalation (P0/P1): paging on-call.
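One way to read the routing rules above is as a pure function from severity to destinations. This is a sketch under assumptions, not the project's implementation: `routesFor`, `Route`, and `buildWebhookPayload` are hypothetical names, and it assumes the primary and secondary routes apply to every severity. The payload uses a `text` field, which Slack incoming webhooks accept; Discord webhooks expect `content` instead, so the field name would need to match the actual target.

```typescript
// Hypothetical routing sketch; names and payload shape are assumptions.
type Severity = "P0" | "P1" | "P2" | "P3";
type Route = "webhook" | "team-channel" | "page-on-call";

// Primary and secondary routes always apply (assumed); P0/P1 also page on-call.
function routesFor(severity: Severity): Route[] {
  const routes: Route[] = ["webhook", "team-channel"];
  if (severity === "P0" || severity === "P1") {
    routes.push("page-on-call");
  }
  return routes;
}

// Builds a Slack-style JSON body; swap `text` for `content` when targeting Discord.
function buildWebhookPayload(severity: Severity, message: string): { text: string } {
  return { text: "[" + severity + "] " + message };
}
```

The payload would then be sent as a JSON `POST` to the URL held in `CAMPAIGN_RETRY_ALERT_WEBHOOK_URL`.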
## Tuning

- Set `CAMPAIGN_RETRY_ALERT_ON_FAILURE=false` if alert volume is too high, and rely on manual monitoring instead.
- Tune:
  - `WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR` (default 20)
  - `RETRY_WORKER_STALE_MINUTES` (default 30)
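A loader for these settings might look like the sketch below. The variable names and the defaults of 20 and 30 come from the list above; everything else is an assumption, including the helper names `readPositiveInt` and `loadAlertTuning`, the choice to fall back to the default on invalid input, and the choice to keep alerts on unless the flag is exactly `false`.

```typescript
// Hypothetical config loader; helper names and fallback behaviour are assumptions.
interface AlertTuning {
  alertOnFailure: boolean;
  webhookFailureThresholdPerHour: number;
  retryWorkerStaleMinutes: number;
}

type Env = Record<string, string | undefined>;

// Parses a positive integer, falling back when the value is missing or invalid.
function readPositiveInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  const isPositiveInt = isFinite(parsed) && Math.floor(parsed) === parsed && parsed > 0;
  return isPositiveInt ? parsed : fallback;
}

function loadAlertTuning(env: Env): AlertTuning {
  return {
    // Assumption: alerts stay enabled unless the variable is exactly "false".
    alertOnFailure: env["CAMPAIGN_RETRY_ALERT_ON_FAILURE"] !== "false",
    webhookFailureThresholdPerHour: readPositiveInt(env, "WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR", 20),
    retryWorkerStaleMinutes: readPositiveInt(env, "RETRY_WORKER_STALE_MINUTES", 30),
  };
}
```

In a Node.js service this would typically be called once at startup as `loadAlertTuning(process.env)`; taking the environment as a parameter keeps the function easy to test.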
## Metrics reviewed in every shift

- `WebhookEvent` failure rate (1h)
- `BackgroundJobState.consecutiveFailures`
- `Channel.status` `DISCONNECTED`
- Queue depth `CampaignRecipient` by `sendStatus`
- Health endpoint status
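Several of these metrics line up with triggers in the severity matrix, so a per-shift snapshot could be classified mechanically. The sketch below is illustrative only. `ShiftSnapshot` and `classify` are hypothetical names, the `ok` health value is assumed (the policy names only `down` and `degraded`), and mapping `consecutiveFailures >= 3` to the P1 retry-worker trigger is an interpretation. Queue depth and P3 events are left to human judgment, so the function returns `null` when no P0 to P2 trigger matches.

```typescript
// Hypothetical classifier over a shift snapshot; field names are assumptions.
type HealthStatus = "ok" | "degraded" | "down";

interface ShiftSnapshot {
  health: HealthStatus;                      // result of the health endpoint
  consecutiveFailures: number;               // BackgroundJobState.consecutiveFailures
  failedWebhooksLastHour: number;            // WebhookEvent failures in the last hour
  maxDisconnectedChannelsPerTenant: number;  // worst tenant by DISCONNECTED channels
}

// Checks triggers from most to least severe and returns the first match.
function classify(s: ShiftSnapshot, webhookThreshold: number): "P0" | "P1" | "P2" | null {
  if (s.health === "down") return "P0";
  if (s.consecutiveFailures >= 3 || s.failedWebhooksLastHour > webhookThreshold) return "P1";
  if (s.health === "degraded" || s.maxDisconnectedChannelsPerTenant > 1) return "P2";
  return null;
}
```

The `webhookThreshold` argument would normally carry the value of `WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR`.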