QRIS Soundbox Platform Operational Runbook
Scope
This runbook is for pilot, staging, and production operators. All commands assume they are run from the repo root or the release directory.
Pre-Deploy
- Pull or build the release artifact.
- Fill in the production environment and make sure no secret is left at its default value.
- Run:
npm ci
npm run typecheck
npm audit
npm run db:migrate
npm run deploy:check-env
npm run mqtt:check-acl -- --file /etc/mosquitto/acl
- Create or verify the production admin and merchant users:
npm run admin:create-user -- --email <email> --name <name> --role admin --password <strong-password>
npm run merchant:create-user -- --merchant <merchant-id-or-code> --email <email> --name <name> --role owner --password <strong-password>
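The checks above can be chained so that nothing runs after the first failure. For example, no migration runs after a failed typecheck or audit. This is a minimal sketch. The names run_step and pre_deploy and the MOSQUITTO_ACL override are hypothetical; the wrapper calls only the npm scripts listed above.

```shell
# Hypothetical helper: announce a step, run it, and name the step if it fails.
run_step() {
  echo "==> $*"
  if ! "$@"; then
    echo "FAILED: $*" >&2
    return 1
  fi
}

# Hypothetical wrapper around the pre-deploy checks in this runbook.
# The && chain stops at the first failing step, so db:migrate never runs
# after a failed install, typecheck, or audit.
pre_deploy() {
  run_step npm ci &&
    run_step npm run typecheck &&
    run_step npm audit &&
    run_step npm run db:migrate &&
    run_step npm run deploy:check-env &&
    run_step npm run mqtt:check-acl -- --file "${MOSQUITTO_ACL:-/etc/mosquitto/acl}"
}
```

npm audit exits non-zero when it finds vulnerabilities at or above its audit level, so this chain treats such findings as blocking.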
Deploy
- Run migrations before the new service receives traffic:
npm run db:migrate
- Start or restart the service with LOG_FORMAT=json.
- Check:
curl -fsS http://127.0.0.1:3000/health
curl -fsS http://127.0.0.1:3000/health/deep
- Check the admin-authenticated deep health endpoint:
curl -fsS -H "Authorization: Bearer <admin-token>" http://127.0.0.1:3000/admin/health/deep
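Right after a restart, the service may not be listening yet. A single curl can then fail for reasons unrelated to the release. Below is a minimal retry sketch. The name wait_for_health and its defaults (30 attempts, 2 seconds apart) are hypothetical choices, not values taken from the platform.

```shell
# Hypothetical helper: poll a health endpoint until curl -fsS succeeds
# (no connection error and no HTTP status >= 400) or the attempts run out.
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy: $url"
      return 0
    fi
    # Sleep only if another attempt follows.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts: $url" >&2
  return 1
}
```

For example:
wait_for_health http://127.0.0.1:3000/health && wait_for_health http://127.0.0.1:3000/health/deep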
Post-Deploy Smoke
npm run smoke:e2e
npm run ui:qa
npm run smoke:mqtt-real
MQTT_TEST_DEVICE_A_USERNAME=<device-a-id> MQTT_TEST_DEVICE_A_PASSWORD=<secret-a> MQTT_TEST_DEVICE_B_USERNAME=<device-b-id> npm run smoke:mqtt-acl
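smoke:mqtt-acl needs device credentials in the environment. If they are missing, that step fails only after the earlier smoke tests have already run. The sketch below checks the credentials first. The names require_env and post_deploy_smoke are hypothetical. The three variables must be exported so that both printenv and the npm child process see them.

```shell
# Hypothetical helper: report every listed variable that is missing or empty
# in the exported environment, and fail if any is.
require_env() {
  local missing=0 name
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing env: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical wrapper: the post-deploy smoke suite in runbook order,
# with the MQTT ACL credentials checked before anything runs.
post_deploy_smoke() {
  require_env MQTT_TEST_DEVICE_A_USERNAME MQTT_TEST_DEVICE_A_PASSWORD \
    MQTT_TEST_DEVICE_B_USERNAME || return 1
  npm run smoke:e2e &&
    npm run ui:qa &&
    npm run smoke:mqtt-real &&
    npm run smoke:mqtt-acl
}
```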
For a staging or production-like baseline:
BASE_URL=https://staging.example.com npm run load:test:staging
Store the reports/load-staging-*.json report together with the release notes.
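When several load runs exist, it is easy to pick the wrong report by hand. The sketch below copies the most recently modified reports/load-staging-*.json into a release-notes directory. The name archive_load_report and the destination directory are hypothetical, because the runbook does not say where release notes are kept.

```shell
# Hypothetical helper: copy the most recently modified
# load-staging-*.json report into a release-notes directory.
# Usage: archive_load_report DEST_DIR [REPORTS_DIR]
archive_load_report() {
  local dest_dir="$1" src_dir="${2:-reports}" newest="" f
  for f in "$src_dir"/load-staging-*.json; do
    [ -e "$f" ] || continue   # the glob stays literal when nothing matches
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  if [ -z "$newest" ]; then
    echo "no load-staging report in $src_dir" >&2
    return 1
  fi
  mkdir -p "$dest_dir" && cp "$newest" "$dest_dir"/ && echo "$newest"
}
```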
Backup
Before major deploys, and at least daily:
npm run backup:production -- --out /var/backups/qris --include-mosquitto
Make sure backups are copied to secure, encrypted storage. Important files:
- Postgres dump (.dump)
- Mosquitto passwd
- Mosquitto ACL
- Environment/secret references in the secret manager, not in plain-text files
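A checksum manifest makes it possible to confirm that the off-host copy matches what backup:production wrote. The minimal sketch below uses GNU sha256sum. The names backup_manifest and backup_verify and the SHA256SUMS file name are hypothetical. The sketch does not encrypt anything; encryption remains the job of the storage target or a separate tool.

```shell
# Hypothetical helper: write SHA256SUMS listing every file in the backup
# directory (relative paths), excluding the manifest itself.
backup_manifest() {
  local dir="$1"
  (cd "$dir" && find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS)
}

# Hypothetical helper: verify a copied backup directory against its manifest.
# Fails if any file is missing or its content changed.
backup_verify() {
  local dir="$1"
  (cd "$dir" && sha256sum --quiet -c SHA256SUMS)
}
```

Run backup_manifest on the backup directory before copying it, and backup_verify on the copy once it arrives.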
Restore Drill
- Prepare a disposable database.
- Show the plan:
npm run restore:plan -- --backup /var/backups/qris/<dump>.dump
- Run the restore against the disposable database only:
npm run restore:plan -- --backup /var/backups/qris/<dump>.dump -- --execute
- Start the service pointing at the restored database.
- Validate:
npm run restore:validate
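The executing restore command is dangerous if it ever points at a live database. The sketch below always shows the plan first and passes --execute only when the target looks disposable. Several parts are assumptions: the *_restore / *_drill naming convention, the use of DATABASE_URL as the restore target, and the function names. The runbook does not say how restore:plan chooses its target.

```shell
# Hypothetical guard: accept a Postgres URL only when its database name
# follows an assumed disposable convention (*_restore or *_drill).
assert_disposable_db() {
  local url="$1" db
  db="${url##*/}"     # keep what follows the last "/": the database name
  db="${db%%[?]*}"    # drop a query string such as ?sslmode=require
  case "$db" in
    *_restore | *_drill) return 0 ;;
    *)
      echo "refusing to restore into '$db': not a disposable database" >&2
      return 1
      ;;
  esac
}

# Hypothetical wrapper: always show the plan; run the executing form of the
# runbook's restore command only when DATABASE_URL (assumed to be the
# restore target) passes the guard.
restore_drill() {
  local dump="$1"
  npm run restore:plan -- --backup "$dump" || return 1
  assert_disposable_db "${DATABASE_URL:-}" || return 1
  npm run restore:plan -- --backup "$dump" -- --execute
}
```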
Rollback
- Stop traffic to the new release.
- Roll the service image/release back to the previous version.
- If the new migrations are purely additive, do not roll back the database.
- If the database must be restored, first restore the latest backup into a disposable database, then promote it according to the infrastructure procedure.
- Check /health and /admin/health/deep, then run a minimal smoke test.
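Deciding whether a migration is purely additive can start from a quick scan. Below is a rough heuristic sketch. It assumes plain .sql migration files; the runbook does not say where migrations live or what format they use. The scan flags DROP, TRUNCATE, RENAME, and DELETE FROM. It misses other risky changes, such as a narrowing ALTER COLUMN ... TYPE, so a clean result means the files still need review, not that they are safe. The name is_additive_migration is hypothetical.

```shell
# Hypothetical heuristic: succeed only if none of the given SQL files contains
# an obviously destructive keyword (matched case-insensitively as whole words).
# Comments that mention these words also count, which errs on the side of
# "not additive"; so does an unreadable file.
is_additive_migration() {
  [ "$#" -gt 0 ] || return 1   # no files given: refuse rather than read stdin
  ! grep -Eiqw 'drop|truncate|rename|delete[[:space:]]+from' "$@"
}
```

For example: is_additive_migration <new-migration-files>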
Incident Response
API latency/errors rising
- Check /admin/observability/summary.
- Check the logs by request_id/trace_id.
- Check Postgres connections and slow queries.
- Reduce traffic or apply rate limiting if needed.
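With LOG_FORMAT=json, a single request or trace can be pulled out of the logs with a fixed-string match. The sketch assumes each log line is a compact JSON object in which request_id or trace_id appears as "key":"value", with no space after the colon. The name logs_for_id is hypothetical, and the log source depends on the deployment.

```shell
# Hypothetical helper: print the log lines that belong to one request or trace.
# Reads the given files, or stdin when no file is given.
logs_for_id() {
  local id="$1"
  shift
  # Fixed-string match on key plus quoted value, so "abc" does not also
  # match "abcd". -h drops file-name prefixes when several files are given.
  grep -hF -e "\"request_id\":\"$id\"" -e "\"trace_id\":\"$id\"" "$@"
}
```

For example: logs_for_id <request-id> <log-file>, or pipe the log stream into logs_for_id <request-id>.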
MQTT publish/subscribe problems
- Check /admin/mqtt/status.
- Check the broker service, certificates, ACL, and passwd file.
- Run npm run smoke:mqtt-real.
- To rotate a device credential, use the UI or npm run mqtt:provision-device.
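Before or after rotating a device credential, a file-level check can catch a device that is missing from the broker's password file or ACL. The sketch assumes the standard Mosquitto formats: username:hash lines in the password file and user <name> blocks in the ACL file. The passwd path and the function names are hypothetical. ACLs that rely only on pattern rules (such as %u) have no per-device user block, so device_in_acl does not apply to them.

```shell
# Hypothetical check: does the Mosquitto password file have a line for this user?
device_in_passwd() {
  awk -F: -v u="$2" '$1 == u { found = 1 } END { exit !found }' "$1"
}

# Hypothetical check: does the ACL file contain a "user <name>" block for it?
device_in_acl() {
  awk -v u="$2" '$1 == "user" && $2 == u { found = 1 } END { exit !found }' "$1"
}

# Report every gap for one device and fail if either file lacks it.
check_device() {
  local passwd="$1" acl="$2" device="$3" ok=0
  device_in_passwd "$passwd" "$device" || { echo "missing in passwd: $device" >&2; ok=1; }
  device_in_acl "$acl" "$device" || { echo "missing in ACL: $device" >&2; ok=1; }
  return "$ok"
}
```

For example: check_device /etc/mosquitto/passwd /etc/mosquitto/acl <device-id>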
Exports stuck
- Check the export_jobs section of /admin/observability/summary.
- Make sure EXPORT_STORAGE_DIR is writable.
- Restart the worker/app to reset stale running jobs.
- If the file has expired, ask the user to create a new export.
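The most direct way to confirm that EXPORT_STORAGE_DIR is writable is to create and delete a file in it, ideally as the same user the app runs as. Below is a minimal sketch; check_export_dir and the probe file name are hypothetical. It ends with df -h so free space can be checked at the same time.

```shell
# Hypothetical check: EXPORT_STORAGE_DIR is set, is a directory, and accepts
# a real write (a probe file is created and removed again).
check_export_dir() {
  local dir="${EXPORT_STORAGE_DIR:-}" probe
  if [ -z "$dir" ]; then
    echo "EXPORT_STORAGE_DIR is not set" >&2
    return 1
  fi
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  probe=$(mktemp "$dir/.write-probe.XXXXXX" 2> /dev/null) || {
    echo "not writable: $dir" >&2
    return 1
  }
  rm -f "$probe"
  df -h "$dir"   # show free space on the export volume
}
```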
Login brute force
- Check the audit log for the admin.login.failed and merchant.login.failed actions.
- Make RATE_LIMIT_LOGIN_MAX stricter.
- Temporarily disable suspicious users via the DB or admin tooling.
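To judge the scale of an attack, the two failed-login actions can be counted in an audit-log export. The sketch assumes a JSON-lines export in which each entry contains "action":"<action>" in compact form. The name count_failed_logins is hypothetical, and this runbook does not cover how to export the audit log.

```shell
# Hypothetical helper: count failed-login audit entries per action in a
# JSON-lines audit export (one compact JSON object per line).
count_failed_logins() {
  local file="$1" action n
  if [ ! -r "$file" ]; then
    echo "cannot read audit export: $file" >&2
    return 1
  fi
  for action in admin.login.failed merchant.login.failed; do
    # grep -c prints 0 (and exits 1) when nothing matches; only the count is used.
    n=$(grep -cF "\"action\":\"$action\"" "$file")
    printf '%s %s\n' "$action" "$n"
  done
}
```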
Routine Operations
- Daily: check health and deep health, backups, MQTT status, and failed notifications.
- Weekly: run a sample restore drill, review failed-login audit entries, and review export storage usage.
- Before piloting a new device: provision its credentials, update the broker passwd file, validate the ACL, and run the MQTT ACL smoke test.
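The daily checks can be scripted so that one failure does not hide the rest. This is a minimal sketch. The names daily_checks, check, QRIS_BASE_URL, and ADMIN_TOKEN are hypothetical; the endpoints and the backup command come from this runbook. The failed-notification review is left out because the runbook names no endpoint or command for it.

```shell
# Hypothetical helper: run one labelled check and report ok/FAIL.
check() {
  local label="$1"
  shift
  if "$@"; then
    echo "ok   $label"
  else
    echo "FAIL $label" >&2
    return 1
  fi
}

# Hypothetical daily routine: every check runs even if an earlier one fails;
# the function returns non-zero if any check failed.
daily_checks() {
  local base="${QRIS_BASE_URL:-http://127.0.0.1:3000}" failed=0
  check health curl -fsS -o /dev/null "$base/health" || failed=1
  check deep-health curl -fsS -o /dev/null "$base/health/deep" || failed=1
  check mqtt-status curl -fsS -o /dev/null \
    -H "Authorization: Bearer ${ADMIN_TOKEN:-}" "$base/admin/mqtt/status" || failed=1
  check backup npm run backup:production -- --out /var/backups/qris --include-mosquitto || failed=1
  return "$failed"
}
```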