QRIS Soundbox Platform Operational Runbook

Scope

This runbook is for operators of the pilot, staging, and production environments. All commands are assumed to run from the repo root or the release directory.

Pre-Deploy

  1. Pull or build the release artifact.
  2. Fill in the production environment and make sure no secret is left at its default value.
  3. Run:
npm ci
npm run typecheck
npm audit
npm run db:migrate
npm run deploy:check-env
npm run mqtt:check-acl -- --file /etc/mosquitto/acl
  4. Create or verify the production admin and merchant users:
npm run admin:create-user -- --email <email> --name <name> --role admin --password <strong-password>
npm run merchant:create-user -- --merchant <merchant-id-or-code> --email <email> --name <name> --role owner --password <strong-password>
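
The checks in step 3 are only useful as a gate if a failure actually stops the deploy. A minimal sketch of a fail-fast wrapper follows; `run_checks` and `predeploy` are hypothetical helper names, not scripts shipped with the repo, and the command list is copied from step 3 as-is.

```shell
# Hypothetical helper: run each check in order and stop at the first failure.
# Each argument is one complete command line, executed through sh -c.
run_checks() {
  local step
  for step in "$@"; do
    echo ">> $step"
    if ! sh -c "$step"; then
      echo "FAILED: $step" >&2
      return 1
    fi
  done
  echo "all checks passed"
}

# Pre-deploy sequence from step 3; call predeploy from the repo root.
predeploy() {
  run_checks \
    "npm ci" \
    "npm run typecheck" \
    "npm audit" \
    "npm run db:migrate" \
    "npm run deploy:check-env" \
    "npm run mqtt:check-acl -- --file /etc/mosquitto/acl"
}
```

Source the file and run `predeploy`; a non-zero return means one of the checks failed and the later ones were not attempted. Note that `npm audit` exits non-zero when it finds vulnerabilities, so it acts as a gate here too.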

Deploy

  1. Run migrations before the new service receives traffic:
npm run db:migrate
  2. Start or restart the service with LOG_FORMAT=json.
  3. Check:
curl -fsS http://127.0.0.1:3000/health
curl -fsS http://127.0.0.1:3000/health/deep
  4. Check the authenticated admin health endpoint:
curl -fsS -H "Authorization: Bearer <admin-token>" http://127.0.0.1:3000/admin/health/deep
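
Right after a restart the service may need a few seconds before `/health` answers, so a single `curl` can fail spuriously. The sketch below polls the endpoint with the same `curl -fsS` flags used above; `wait_for_health`, the default of 30 attempts, and the 2-second delay are assumptions for illustration, not values from the project.

```shell
# Poll a health URL until it answers with a success status or attempts run out.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors (4xx/5xx).
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy after $i attempt(s): $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts: $url" >&2
  return 1
}
```

For example, `wait_for_health http://127.0.0.1:3000/health && wait_for_health http://127.0.0.1:3000/health/deep` waits for the shallow check before probing the deep one.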

Post-Deploy Smoke

npm run smoke:e2e
npm run ui:qa
npm run smoke:mqtt-real
MQTT_TEST_DEVICE_A_USERNAME=<device-a-id> MQTT_TEST_DEVICE_A_PASSWORD=<secret-a> MQTT_TEST_DEVICE_B_USERNAME=<device-b-id> npm run smoke:mqtt-acl

For a staging or production-like baseline:

BASE_URL=https://staging.example.com npm run load:test:staging

Save the reports/load-staging-*.json report together with the release notes.

Backup

Before major deploys, and at least daily:

npm run backup:production -- --out /var/backups/qris --include-mosquitto

Make sure the backup is copied to secure, encrypted storage. Important files:

  • Postgres dump (.dump)
  • Mosquitto passwd
  • Mosquitto ACL
  • Environment/secret references kept in a secret manager, not in plain-text files
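
A daily backup is easy to assume and hard to notice when it silently stops. The sketch below checks that the output directory holds at least one recent `.dump` file; `has_fresh_backup` is a hypothetical helper, the 1440-minute default mirrors the "at least daily" rule above, and it relies on the `-maxdepth` and `-mmin` options available in GNU, BSD, and BusyBox `find`.

```shell
# Succeeds only if DIR holds at least one *.dump modified within MAX_AGE_MIN minutes.
# Usage: has_fresh_backup DIR [MAX_AGE_MIN]
has_fresh_backup() {
  local dir="$1" max_age_min="${2:-1440}" recent
  recent="$(find "$dir" -maxdepth 1 -type f -name '*.dump' \
    -mmin "-$max_age_min" 2>/dev/null | head -n 1)"
  if [ -n "$recent" ]; then
    echo "fresh backup: $recent"
    return 0
  fi
  echo "no backup newer than $max_age_min minutes in $dir" >&2
  return 1
}
```

Running `has_fresh_backup /var/backups/qris` as part of the daily check returns non-zero when the newest dump is older than a day or when no dump exists.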

Restore Drill

  1. Prepare a disposable database.
  2. Show the plan:
npm run restore:plan -- --backup /var/backups/qris/<dump>.dump
  3. Run the restore against the disposable database only:
npm run restore:plan -- --backup /var/backups/qris/<dump>.dump -- --execute
  4. Start the service pointing at the restored DB.
  5. Validate:
npm run restore:validate
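
Because the restore must only ever hit a disposable database, it can help to check the target before running the `--execute` step. The sketch below is one possible guard and rests on two assumptions that are not stated in this runbook: that the target is given as a `postgres://` URL (for example in a `DATABASE_URL` variable), and that disposable databases follow a naming convention containing `_restore`, `_drill`, or `_disposable`.

```shell
# Extract the database name from a postgres:// URL (last path segment, query string removed).
db_name_from_url() {
  local url="${1%%[?]*}"
  echo "${url##*/}"
}

# Assumed convention (not from the runbook): disposable databases carry one of these markers.
is_disposable_db() {
  case "$(db_name_from_url "$1")" in
    *_restore*|*_drill*|*_disposable*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A wrapper could then run `is_disposable_db "$DATABASE_URL" || { echo "refusing to restore" >&2; exit 1; }` before the restore command. Adjust the patterns to whatever naming the infra team actually uses.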

Rollback

  1. Stop traffic to the new release.
  2. Roll back the service image/release to the previous version.
  3. If the new migration is purely additive, do not roll back the database.
  4. If the database must be reverted, restore the latest backup to a disposable database first, then promote it following the infra procedure.
  5. Run /health, /admin/health/deep, and a minimal smoke test.

Incident Response

API latency or error rate rising

  1. Check /admin/observability/summary.
  2. Check the logs by request_id/trace_id.
  3. Check Postgres connections and slow queries.
  4. Reduce traffic or tighten the rate limit if needed.
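
With LOG_FORMAT=json, following one request through the logs (step 2) is a matter of filtering on the ID. The sketch below assumes the logs are newline-delimited JSON objects exposing `request_id` and `trace_id` as top-level fields, and that `jq` is installed; where the logs come from (a file, `journalctl`, a container runtime) is left open, so the helper reads from stdin. `logs_for_id` is a hypothetical name.

```shell
# Filter newline-delimited JSON logs on stdin down to entries whose
# request_id or trace_id equals ID. Lines that are not JSON objects are
# skipped instead of aborting the filter.
# Usage: some-log-source | logs_for_id ID
logs_for_id() {
  jq -cR --arg id "$1" \
    'fromjson? | select(type == "object") | select(.request_id == $id or .trace_id == $id)'
}
```

Reading raw lines with `-R` and parsing them with `fromjson?` keeps the filter working when stray non-JSON lines (startup banners, stack traces) are mixed into the stream.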

MQTT publish/subscribe problems

  1. Check /admin/mqtt/status.
  2. Check the broker service, certificate, ACL, and passwd.
  3. Run npm run smoke:mqtt-real.
  4. For device credentials, rotate via the UI or npm run mqtt:provision-device.

Export stuck

  1. Check the export_jobs section of /admin/observability/summary.
  2. Make sure EXPORT_STORAGE_DIR is writable.
  3. Restart the worker/app to reset stale running jobs.
  4. If the file has expired, ask the user to create a new export.
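
For step 2, permission bits alone can be misleading (read-only mounts and full disks also block writes), so the sketch below tests writability by creating and removing a probe file. `check_export_dir` is a hypothetical helper; it falls back to the `EXPORT_STORAGE_DIR` environment variable when no argument is given and should be run as the same OS user as the app or worker.

```shell
# Verify that the export directory exists and accepts new files
# by creating and then removing a probe file.
# Usage: check_export_dir [DIR]   (defaults to $EXPORT_STORAGE_DIR)
check_export_dir() {
  local dir="${1:-${EXPORT_STORAGE_DIR:-}}" probe
  if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "export dir missing or unset: '$dir'" >&2
    return 1
  fi
  if probe="$(mktemp "$dir/.write-probe.XXXXXX" 2>/dev/null)"; then
    rm -f "$probe"
    echo "writable: $dir"
    return 0
  fi
  echo "not writable: $dir" >&2
  return 1
}
```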

Login brute force

  1. Check the audit log for the admin.login.failed and merchant.login.failed actions.
  2. Tighten RATE_LIMIT_LOGIN_MAX.
  3. Temporarily disable suspicious users via the DB or admin tooling.

Routine Operations

  • Daily: check health and deep health, backups, MQTT status, and failed notifications.
  • Weekly: run a sample restore drill, review failed-login audit entries, and review export storage usage.
  • Before piloting a new device: provision credentials, update the broker passwd, validate the ACL, and run the MQTT ACL smoke test.