# Alert Policy - WhatsApp Inbox

## Severity Matrix

- **P0 (Critical)** – System down for all users.
  - Trigger:
    - `GET /api/health` reports status `down`.
    - DB unreachable.
    - Retry campaigns cannot be sent or pulled for > 15 minutes.
  - Response target:
    - Acknowledge: 5 minutes
    - Initial mitigation: 15 minutes
  - Owner: Platform Lead
- **P1 (High)** – Core features impaired (campaign, webhook, retry).
  - Trigger:
    - Retry worker status is `failed` >= 3 times in a row.
    - `BackgroundJobState.consecutiveFailures` keeps increasing.
    - `campaign-retry-worker` has not run for > 60 minutes.
    - Failed webhooks in the last hour exceed the threshold.
  - Response target:
    - Acknowledge: 15 minutes
    - Initial mitigation: 45 minutes
  - Owner: Platform + Operations
- **P2 (Medium)** – Non-blocking performance degradation.
  - Trigger:
    - `GET /api/health` reports `degraded`.
    - More than 1 channel disconnected within a single tenant.
  - Response target:
    - Acknowledge: 60 minutes
    - Initial mitigation: 4 hours
  - Owner: Platform
- **P3 (Low)** – Operational information.
  - Trigger:
    - Minor increase in events, non-urgent warnings.
  - Response target:
    - Acknowledge: next business cycle

## Alert Routing

- Primary: Slack/Discord webhook (`CAMPAIGN_RETRY_ALERT_WEBHOOK_URL`) for retry failure events.
- Secondary: Team channel / chat group.
- Escalation (P0/P1): page the on-call engineer.

## Tuning

- Set `CAMPAIGN_RETRY_ALERT_ON_FAILURE=false` if alert volume is too high, and rely on manual monitoring instead.
- Tune:
  - `WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR` (default 20)
  - `RETRY_WORKER_STALE_MINUTES` (default 30)

## Metrics reviewed in every shift

- `WebhookEvent` failure rate (1h)
- `BackgroundJobState.consecutiveFailures`
- `Channel.status` `DISCONNECTED`
- Queue depth of `CampaignRecipient` by `sendStatus`
- Health endpoint status
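
## Example: mapping shift metrics to a severity

The severity matrix can be read as a top-down check: the first matching tier wins. The following is a minimal sketch of that reading, not the project's actual implementation. The `Snapshot` class, its field names, the `classify` function, and the `"ok"` health value are all hypothetical; only the thresholds and the `WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR` variable (default 20) come from this policy. The "keeps increasing" trigger for `BackgroundJobState.consecutiveFailures` is simplified here to the ">= 3 in a row" rule.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical metrics snapshot; field names are illustrative only."""
    health_status: str = "ok"              # assumed healthy value of GET /api/health
    db_reachable: bool = True
    minutes_campaign_blocked: int = 0      # time retry campaigns could not be sent/pulled
    retry_consecutive_failures: int = 0    # BackgroundJobState.consecutiveFailures
    minutes_since_retry_run: int = 0       # time since campaign-retry-worker last ran
    failed_webhooks_last_hour: int = 0     # WebhookEvent failures (1h)
    max_disconnected_per_tenant: int = 0   # DISCONNECTED channels in the worst tenant
    has_minor_warning: bool = False        # any non-urgent warning


def classify(s: Snapshot, webhook_threshold: Optional[int] = None) -> Optional[str]:
    """Return "P0".."P3" for the first matching tier, or None if nothing triggers."""
    if webhook_threshold is None:
        # Default of 20 per hour is taken from the Tuning section.
        webhook_threshold = int(
            os.environ.get("WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR", "20")
        )

    if (s.health_status == "down"
            or not s.db_reachable
            or s.minutes_campaign_blocked > 15):
        return "P0"
    if (s.retry_consecutive_failures >= 3
            or s.minutes_since_retry_run > 60
            or s.failed_webhooks_last_hour > webhook_threshold):
        return "P1"
    if s.health_status == "degraded" or s.max_disconnected_per_tenant > 1:
        return "P2"
    if s.has_minor_warning:
        return "P3"
    return None


if __name__ == "__main__":
    print(classify(Snapshot(health_status="degraded")))  # P2
```

Checking tiers from P0 downward means a snapshot that satisfies several triggers is always reported at its highest severity, which matches how the acknowledge and mitigation targets are meant to be applied.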