whatsapp-inbox-platform/alert-policy.md
Wira Basalamah adde003fba
chore: initial project import
2026-04-21 09:29:29 +07:00

Alert Policy - WhatsApp Inbox

Severity Matrix

  • P0 (Critical) System down for all users.

    • Trigger:
      • GET /api/health reports status down.
      • Database unreachable.
      • Unable to send or pull campaign retries for more than 15 minutes.
    • Response target:
      • Acknowledge: 5 minutes
      • Initial mitigation: 15 minutes
    • Owner: Platform Lead
  • P1 (High) Core features disrupted (campaign, webhook, retry).

    • Trigger:
      • Retry worker status is failed 3 or more times in a row.
      • BackgroundJobState.consecutiveFailures keeps increasing.
      • campaign-retry-worker has not run for more than 60 minutes.
      • Failed webhooks in the last hour exceed the threshold.
    • Response target:
      • Acknowledge: 15 minutes
      • Initial mitigation: 45 minutes
    • Owner: Platform + Operations
  • P2 (Medium) Non-blocking performance degradation.

    • Trigger:
      • GET /api/health reports status degraded.
      • More than 1 channel disconnected within a single tenant.
    • Response target:
      • Acknowledge: 60 minutes
      • Initial mitigation: 4 hours
    • Owner: Platform
  • P3 (Low) Operational information.

    • Trigger:
      • Increase in minor events, non-urgent warnings.
    • Response target:
      • Acknowledge: next business cycle
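
The matrix above can be read as a small decision function. The following is a minimal Python sketch, not the project's implementation: the `Signals` snapshot and all of its field names are hypothetical, the health values "ok", "degraded", and "down" are assumed, and the webhook threshold of 20 is the default listed under Tuning. The "consecutiveFailures keeps increasing" trigger is a trend and is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional

# Response targets in minutes, copied from the matrix above.
# None means the matrix gives no fixed figure for that target.
RESPONSE_TARGETS = {
    "P0": {"acknowledge": 5, "mitigate": 15},
    "P1": {"acknowledge": 15, "mitigate": 45},
    "P2": {"acknowledge": 60, "mitigate": 240},
    "P3": {"acknowledge": None, "mitigate": None},  # next business cycle
}


@dataclass
class Signals:
    """Hypothetical monitoring snapshot; field names are illustrative only."""
    health_status: str = "ok"                      # assumed: "ok" | "degraded" | "down"
    db_reachable: bool = True
    retry_blocked_minutes: int = 0                 # retries could not be sent/pulled
    retry_consecutive_failures: int = 0
    retry_worker_idle_minutes: int = 0
    failed_webhooks_last_hour: int = 0
    max_disconnected_channels_per_tenant: int = 0
    minor_warnings: int = 0


def classify(s: Signals, webhook_threshold: int = 20) -> Optional[str]:
    """Return the highest matching severity, or None when nothing fires."""
    if s.health_status == "down" or not s.db_reachable or s.retry_blocked_minutes > 15:
        return "P0"
    if (
        s.retry_consecutive_failures >= 3
        or s.retry_worker_idle_minutes > 60
        or s.failed_webhooks_last_hour > webhook_threshold
    ):
        return "P1"
    if s.health_status == "degraded" or s.max_disconnected_channels_per_tenant > 1:
        return "P2"
    if s.minor_warnings > 0:
        return "P3"
    return None
```

Checking severities from most to least critical means a snapshot that matches several triggers is always reported at its highest level.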

Alert Routing

  • Primary: Slack/Discord webhook (CAMPAIGN_RETRY_ALERT_WEBHOOK_URL) for retry failure events.
  • Secondary: Team channel / chat group.
  • Escalation (P0/P1): paging on-call.
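
As an illustration of the primary route, here is a minimal Python sketch that posts a message to the URL in CAMPAIGN_RETRY_ALERT_WEBHOOK_URL using only the standard library. The function names are hypothetical, and the payload shape is an assumption: Slack incoming webhooks accept a JSON body with a `text` field, while Discord webhooks expect `content` instead, so the key may need to change depending on the target.

```python
import json
import os
import urllib.request

# Severities that also escalate to on-call paging, per the routing list above.
PAGED_SEVERITIES = {"P0", "P1"}


def needs_paging(severity: str) -> bool:
    """True when the severity should escalate to the on-call pager."""
    return severity in PAGED_SEVERITIES


def build_alert_request(severity: str, message: str, webhook_url: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the alert webhook."""
    # "text" is the Slack-style key; use "content" for a Discord webhook.
    body = json.dumps({"text": f"[{severity}] {message}"}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(severity: str, message: str) -> bool:
    """Send the alert; returns False when no webhook URL is configured."""
    url = os.environ.get("CAMPAIGN_RETRY_ALERT_WEBHOOK_URL")
    if not url:
        return False
    request = build_alert_request(severity, message, url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return 200 <= response.status < 300
```

Keeping request construction separate from sending lets the payload be inspected in tests without any network call.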

Tuning

  • Set CAMPAIGN_RETRY_ALERT_ON_FAILURE=false if alert volume is too high, and rely on manual monitoring instead.
  • Tune:
    • WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR (default 20)
    • RETRY_WORKER_STALE_MINUTES (default 30)
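
The tuning variables above could be read along these lines. This is a sketch under stated assumptions, not the application's actual loader: the function name and the returned keys are hypothetical, the numeric defaults are the ones listed above, and treating CAMPAIGN_RETRY_ALERT_ON_FAILURE as enabled when unset is an assumption the source does not confirm.

```python
import os
from typing import Mapping


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_alert_tuning(env: Mapping[str, str] = os.environ) -> dict:
    """Read alert tuning values from the environment, falling back to defaults."""
    return {
        # Assumption: alerts are on unless explicitly disabled.
        "alert_on_failure": _as_bool(env.get("CAMPAIGN_RETRY_ALERT_ON_FAILURE", "true")),
        "webhook_failure_threshold_per_hour": int(
            env.get("WEBHOOK_FAILURE_RATE_THRESHOLD_PER_HOUR", "20")
        ),
        "retry_worker_stale_minutes": int(env.get("RETRY_WORKER_STALE_MINUTES", "30")),
    }
```

For example, `load_alert_tuning({"CAMPAIGN_RETRY_ALERT_ON_FAILURE": "false"})` yields a config with alerts disabled and both numeric defaults intact.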

Metrics reviewed in every shift

  • WebhookEvent failure rate (1h)
  • BackgroundJobState.consecutiveFailures
  • Channel.status DISCONNECTED
  • CampaignRecipient queue depth by sendStatus
  • Health endpoint status