whatsapp-inbox-platform/alert-policy.md
Wira Basalamah adde003fba
chore: initial project import
2026-04-21 09:29:29 +07:00

# Alert Policy - WhatsApp Inbox
## Severity Matrix
- **P0 (Critical)** System down for all users.
  - Trigger:
    - `GET /api/health` returns status `down`.
    - Database unreachable.
    - Campaign retries cannot be sent or pulled for more than 15 minutes.
  - Response target:
    - Acknowledge: 5 minutes
    - Initial mitigation: 15 minutes
  - Owner: Platform Lead
- **P1 (High)** Core features impaired (campaign, webhook, retry).
  - Trigger:
    - Retry worker reports status `failed` 3 or more times in a row.
    - `BackgroundJobState.consecutiveFailures` keeps increasing.
    - `campaign-retry-worker` has not run for more than 60 minutes.
    - Failed webhooks in the last hour exceed the threshold.
  - Response target:
    - Acknowledge: 15 minutes
    - Initial mitigation: 45 minutes
  - Owner: Platform + Operations
- **P2 (Medium)** Non-blocking performance degradation.
  - Trigger:
    - `GET /api/health` returns status `degraded`.
    - More than 1 disconnected channel within a single tenant.
  - Response target:
    - Acknowledge: 60 minutes
    - Initial mitigation: 4 hours
  - Owner: Platform
- **P3 (Low)** Operational information.
  - Trigger:
    - Increase in minor events, non-urgent warnings.
  - Response target:
    - Acknowledge: next business cycle
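
The matrix above can be read as an ordered set of checks, highest severity first. The following is a minimal Python sketch of that reading, not the platform's actual implementation. The `Signals` fields, the `classify` and `deadlines` helpers, and the `"ok"` health value are hypothetical names introduced for illustration. The sketch reduces "keeps increasing" to the same 3-consecutive-failure cutoff, and it leaves P3 to human judgment because the matrix gives no measurable trigger for it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Response targets copied from the severity matrix: (acknowledge, initial mitigation).
# P3 has no fixed clock ("next business cycle"), so it maps to None.
RESPONSE_TARGETS = {
    "P0": (timedelta(minutes=5), timedelta(minutes=15)),
    "P1": (timedelta(minutes=15), timedelta(minutes=45)),
    "P2": (timedelta(minutes=60), timedelta(hours=4)),
    "P3": None,
}


@dataclass
class Signals:
    """Hypothetical snapshot of the inputs the matrix refers to."""
    health_status: str = "ok"              # value reported by GET /api/health ("ok" is assumed)
    db_reachable: bool = True
    minutes_retry_blocked: float = 0       # how long campaign retries could not be sent or pulled
    consecutive_retry_failures: int = 0    # BackgroundJobState.consecutiveFailures
    minutes_since_worker_run: float = 0    # idle time of campaign-retry-worker
    failed_webhooks_last_hour: int = 0
    webhook_failure_threshold: int = 20    # WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR default
    max_disconnected_channels_per_tenant: int = 0


def classify(s: Signals) -> Optional[str]:
    """Return the highest matching severity, or None if no trigger fires."""
    if s.health_status == "down" or not s.db_reachable or s.minutes_retry_blocked > 15:
        return "P0"
    if (
        s.consecutive_retry_failures >= 3
        or s.minutes_since_worker_run > 60
        or s.failed_webhooks_last_hour > s.webhook_failure_threshold
    ):
        return "P1"
    if s.health_status == "degraded" or s.max_disconnected_channels_per_tenant > 1:
        return "P2"
    return None


def deadlines(severity: str, opened_at: datetime):
    """Return (acknowledge_by, mitigate_by), or None when the severity has no fixed clock."""
    targets = RESPONSE_TARGETS[severity]
    if targets is None:
        return None
    ack, mitigate = targets
    return opened_at + ack, opened_at + mitigate
```

For example, `classify(Signals(health_status="degraded", minutes_since_worker_run=61))` yields `"P1"`, because the stale worker outranks the degraded health status.
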
## Alert Routing
- Primary: Slack/Discord webhook (`CAMPAIGN_RETRY_ALERT_WEBHOOK_URL`) for retry failure events.
- Secondary: Team channel / chat group.
- Escalation (P0/P1): paging on-call.
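
The routing rules above could be sketched roughly as follows. This is an illustration under assumptions, not the platform's code: the `route` and `build_webhook_request` helpers and the destination labels are hypothetical, and the sketch assumes the secondary channel is always notified rather than used only as a fallback. The payload uses the `text` field accepted by Slack incoming webhooks; a Discord webhook expects `content` instead.

```python
import json
import urllib.request


def route(severity: str, is_retry_failure: bool) -> list[str]:
    """Return the destinations an alert should reach, in order."""
    destinations = []
    if is_retry_failure:
        destinations.append("webhook")        # primary: CAMPAIGN_RETRY_ALERT_WEBHOOK_URL
    destinations.append("team-channel")       # secondary (assumed to always be notified)
    if severity in ("P0", "P1"):
        destinations.append("on-call-page")   # escalation for P0/P1
    return destinations


def build_webhook_request(url: str, severity: str, summary: str) -> urllib.request.Request:
    """Build, but do not send, a JSON POST request for the alert webhook."""
    body = json.dumps({"text": f"[{severity}] {summary}"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is left to the caller (for instance `urllib.request.urlopen(request, timeout=5)`), so the routing decision can be tested without network access.
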
## Tuning
- Set `CAMPAIGN_RETRY_ALERT_ON_FAILURE=false` if alert volume is too high, and rely on manual monitoring instead.
- Tune:
  - `WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR` (default 20)
  - `RETRY_WORKER_STALE_MINUTES` (default 30)
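
As a rough illustration of how these settings might be read, the sketch below loads them from the environment with the defaults listed above. `AlertConfig` and `load_alert_config` are hypothetical names. Treating `CAMPAIGN_RETRY_ALERT_ON_FAILURE` as enabled when unset is an assumption, since this policy does not state its default.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertConfig:
    alert_on_failure: bool
    webhook_failure_threshold_per_hour: int
    retry_worker_stale_minutes: int


def load_alert_config(env=None) -> AlertConfig:
    """Read the tuning variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    # ASSUMPTION: alerts stay enabled unless the flag is explicitly "false".
    raw_flag = env.get("CAMPAIGN_RETRY_ALERT_ON_FAILURE", "true")
    return AlertConfig(
        alert_on_failure=raw_flag.strip().lower() != "false",
        webhook_failure_threshold_per_hour=int(
            env.get("WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR", "20")
        ),
        retry_worker_stale_minutes=int(env.get("RETRY_WORKER_STALE_MINUTES", "30")),
    )
```

Passing a plain dictionary, as in `load_alert_config({"RETRY_WORKER_STALE_MINUTES": "45"})`, makes it easy to try different thresholds without touching the real environment.
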
## Metrics reviewed in every shift
- `WebhookEvent` failure rate (1h)
- `BackgroundJobState.consecutiveFailures`
- Channels with `Channel.status` set to `DISCONNECTED`
- `CampaignRecipient` queue depth by `sendStatus`
- Health endpoint status
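
The 1-hour webhook failure metric can be approximated with a trailing-window count. The sketch below assumes the "failure rate" is the number of failed `WebhookEvent` records in the last 60 minutes, which is an interpretation based on the name `WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR`. The helper names and the plain list of timestamps are hypothetical; a real check would more likely query the database.

```python
from datetime import datetime, timedelta


def failures_in_window(failure_times, now, window=timedelta(hours=1)):
    """Count failure timestamps inside the trailing window that ends at `now`."""
    cutoff = now - window
    return sum(1 for t in failure_times if cutoff < t <= now)


def exceeds_threshold(failure_times, now, threshold=20):
    """True when failures in the last hour are strictly above the threshold."""
    return failures_in_window(failure_times, now) > threshold
```

Whether the comparison should be strict (`>`) or inclusive (`>=`) is not specified here; the sketch follows the wording "exceed the threshold" and uses the strict form.
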