Alert Policy - WhatsApp Inbox
Severity Matrix
- P0 (Critical) - System is down for all users.
  - Trigger:
    - GET /api/health reports status down.
    - DB unreachable.
    - Campaign retries cannot be sent or fetched for more than 15 minutes.
  - Response target:
    - Acknowledge: 5 minutes
    - Initial mitigation: 15 minutes
  - Owner: Platform Lead
- P1 (High) - Core features are impaired (campaign, webhook, retry).
  - Trigger:
    - Retry worker status is failed >= 3 times in a row.
    - BackgroundJobState.consecutiveFailures keeps rising.
    - campaign-retry-worker has not run for > 60 minutes.
    - Failed webhooks in the last hour exceed the threshold.
  - Response target:
    - Acknowledge: 15 minutes
    - Initial mitigation: 45 minutes
  - Owner: Platform + Operations
- P2 (Medium) - Non-blocking performance degradation.
  - Trigger:
    - GET /api/health reports status degraded.
    - More than 1 channel disconnected within a single tenant.
  - Response target:
    - Acknowledge: 60 minutes
    - Initial mitigation: 4 hours
  - Owner: Platform
- P3 (Low) - Operational information.
  - Trigger:
    - Increase in minor events, non-urgent warnings.
  - Response target:
    - Acknowledge: next business cycle
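The matrix above can be read as an ordered set of checks, highest severity first. The sketch below is a minimal illustration of that reading, not the project's implementation: the `Signals` class, its field names, the `"ok"`/`"degraded"`/`"down"` health values, and `classify` are all hypothetical; only the thresholds and response targets come from the matrix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Hypothetical snapshot of the signals named in the severity matrix."""
    health_status: str = "ok"            # assumed values: "ok", "degraded", "down"
    db_reachable: bool = True
    retry_blocked_minutes: int = 0       # minutes campaign retries could not be sent/fetched
    retry_consecutive_failures: int = 0  # consecutive "failed" statuses of the retry worker
    failures_rising: bool = False        # BackgroundJobState.consecutiveFailures keeps rising
    worker_idle_minutes: int = 0         # minutes since campaign-retry-worker last ran
    failed_webhooks_last_hour: int = 0
    webhook_threshold: int = 20          # assumed to be WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR
    max_disconnected_per_tenant: int = 0
    minor_warnings: int = 0


# (acknowledge, initial mitigation) in minutes; None where the matrix gives no number.
RESPONSE_TARGETS = {"P0": (5, 15), "P1": (15, 45), "P2": (60, 240), "P3": (None, None)}


def classify(s: Signals) -> Optional[str]:
    """Return the highest matching severity, or None when nothing triggers."""
    if s.health_status == "down" or not s.db_reachable or s.retry_blocked_minutes > 15:
        return "P0"
    if (s.retry_consecutive_failures >= 3 or s.failures_rising
            or s.worker_idle_minutes > 60
            or s.failed_webhooks_last_hour > s.webhook_threshold):
        return "P1"
    if s.health_status == "degraded" or s.max_disconnected_per_tenant > 1:
        return "P2"
    if s.minor_warnings > 0:
        return "P3"
    return None
```

Checking P0 first means that an incident matching several rows is always reported at its most severe level, for example a degraded health endpoint combined with three consecutive retry failures is classified as P1.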
Alert Routing
- Primary: Slack/Discord webhook (CAMPAIGN_RETRY_ALERT_WEBHOOK_URL) for retry failure events.
- Secondary: Team channel / chat group.
- Escalation (P0/P1): paging on-call.
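As a rough illustration of the routing above, the sketch below picks destinations by severity and posts a message to the webhook configured in CAMPAIGN_RETRY_ALERT_WEBHOOK_URL. The function names and destination labels are hypothetical, and detecting Discord by looking for "discord" in the URL is an assumption. The payload keys follow the public webhook formats: Slack incoming webhooks read `text`, Discord webhooks read `content`.

```python
import json
import os
import urllib.request


def destinations(severity):
    """Hypothetical routing table: every alert goes to the webhook and the
    team channel; P0/P1 additionally page the on-call engineer."""
    routes = ["webhook", "team-channel"]
    if severity in ("P0", "P1"):
        routes.append("on-call-page")
    return routes


def build_payload(webhook_url, message):
    """Slack expects {"text": ...}; Discord expects {"content": ...}."""
    key = "content" if "discord" in webhook_url else "text"
    return {key: message}


def send_retry_failure_alert(message, env=os.environ):
    """POST the alert; returns False when no webhook URL is configured."""
    url = env.get("CAMPAIGN_RETRY_ALERT_WEBHOOK_URL")
    if not url:
        return False
    body = json.dumps(build_payload(url, message)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return 200 <= response.status < 300
```

Passing `env` as a parameter keeps the function testable without touching the real environment or the network.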
Tuning
- Set CAMPAIGN_RETRY_ALERT_ON_FAILURE=false if alert volume is too high, and rely on manual monitoring instead.
- Tune:
  - WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR (default 20)
  - RETRY_WORKER_STALE_MINUTES (default 30)
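A minimal sketch of how these settings might be read and applied. The defaults of 20 and 30 come from the list above. Treating CAMPAIGN_RETRY_ALERT_ON_FAILURE as enabled when it is unset is an assumption, and `load_alert_config` and `is_worker_stale` are hypothetical helper names.

```python
import os
from datetime import datetime, timedelta, timezone


def load_alert_config(env=os.environ):
    """Read the tuning variables, falling back to the documented defaults."""
    return {
        # Assumption: alerts stay on unless the variable is explicitly "false".
        "alert_on_failure": env.get("CAMPAIGN_RETRY_ALERT_ON_FAILURE", "true").strip().lower() != "false",
        "webhook_failures_per_hour": int(env.get("WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR", "20")),
        "retry_worker_stale_minutes": int(env.get("RETRY_WORKER_STALE_MINUTES", "30")),
    }


def is_worker_stale(last_run_at, now, stale_minutes):
    """True when the worker's last run is older than the stale window."""
    return now - last_run_at > timedelta(minutes=stale_minutes)


config = load_alert_config({"RETRY_WORKER_STALE_MINUTES": "45"})
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = is_worker_stale(now - timedelta(minutes=50), now, config["retry_worker_stale_minutes"])
```

Here `stale` is `True`, because 50 minutes without a run exceeds the 45-minute window set in the example environment.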
Metrics reviewed in every shift
- WebhookEvent failure rate (1h)
- BackgroundJobState.consecutiveFailures
- Channel.status DISCONNECTED
- Queue depth CampaignRecipient by sendStatus
- Health endpoint status
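Three of the metrics above reduce to simple counts. The sketch below shows one way to compute them from rows that have already been fetched. The helper names, the dict-shaped rows, the `tenantId` field, and the example `sendStatus` values are hypothetical; `sendStatus`, `status`, and `DISCONNECTED` are the only names taken from the list.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def failures_last_hour(failure_times, now):
    """Count failed WebhookEvent timestamps within the last hour."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for t in failure_times if cutoff < t <= now)


def queue_depth_by_status(recipients):
    """Group CampaignRecipient rows by sendStatus."""
    return dict(Counter(r["sendStatus"] for r in recipients))


def disconnected_per_tenant(channels):
    """Count channels with status DISCONNECTED for each tenant."""
    return dict(Counter(c["tenantId"] for c in channels if c["status"] == "DISCONNECTED"))


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = failures_last_hour(
    [now - timedelta(minutes=10), now - timedelta(minutes=90)], now
)
depth = queue_depth_by_status(
    [{"sendStatus": "PENDING"}, {"sendStatus": "PENDING"}, {"sendStatus": "FAILED"}]
)
```

The per-tenant count from `disconnected_per_tenant` is the figure a reviewer would compare against the P2 trigger of more than 1 disconnected channel in a single tenant.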