# QRIS Soundbox Platform Operational Runbook
## Scope
This runbook is for operators of the pilot, staging, and production environments. All commands are assumed to run from the repository root or the release directory.
## Pre-Deploy
1. Pull or build the release artifact.
2. Fill in the production environment and make sure no secret is left at its default value.
3. Run:
```bash
npm ci
npm run typecheck
npm audit
npm run db:migrate
npm run deploy:check-env
npm run mqtt:check-acl -- --file /etc/mosquitto/acl
```
4. Create or verify the production admin and merchant users:
```bash
npm run admin:create-user -- --email <email> --name <name> --role admin --password <strong-password>
npm run merchant:create-user -- --merchant <merchant-id-or-code> --email <email> --name <name> --role owner --password <strong-password>
```
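The `<strong-password>` placeholder has to be filled from somewhere. A minimal sketch for generating a value, assuming `/dev/urandom` and `base64` are available on the operator host; `gen_password` and `ADMIN_PASSWORD` are hypothetical names, not part of the repo:

```bash
# Hypothetical helper, not part of the repo: print a random password.
# 24 random bytes encode to exactly 32 base64 characters.
gen_password() {
  head -c 24 /dev/urandom | base64 | tr -d '\n'
}

# Example: keep the value in a variable instead of typing it by hand.
# ADMIN_PASSWORD="$(gen_password)"
```

A password passed as a command-line argument can show up in shell history and process listings, so store the generated value in the secret manager and clear it from the shell afterwards.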
## Deploy
1. Run migrations before the new service receives traffic:
```bash
npm run db:migrate
```
2. Start or restart the service with `LOG_FORMAT=json`.
3. Check the health endpoints:
```bash
curl -fsS http://127.0.0.1:3000/health
curl -fsS http://127.0.0.1:3000/health/deep
```
4. Check the authenticated admin health endpoint:
```bash
curl -fsS -H "Authorization: Bearer <admin-token>" http://127.0.0.1:3000/admin/health/deep
```
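A freshly restarted service may need a few seconds before the health endpoints respond, so a single `curl` can fail spuriously. One possible approach is a small retry wrapper; `retry_until_ok` is a hypothetical helper, and the attempt count and delay in the example are arbitrary:

```bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry_until_ok <attempts> <delay-seconds> <command> [args...]
retry_until_ok() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "gave up after ${attempts} attempts: $*" >&2
  return 1
}

# Example:
# retry_until_ok 30 2 curl -fsS http://127.0.0.1:3000/health
```

The wrapper returns non-zero when every attempt fails, so a deploy script can stop instead of continuing to the smoke tests.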
## Post-Deploy Smoke
```bash
npm run smoke:e2e
npm run ui:qa
npm run smoke:mqtt-real
MQTT_TEST_DEVICE_A_USERNAME=<device-a-id> \
MQTT_TEST_DEVICE_A_PASSWORD=<secret-a> \
MQTT_TEST_DEVICE_B_USERNAME=<device-b-id> \
npm run smoke:mqtt-acl
```
For a staging or production-like baseline:
```bash
BASE_URL=https://staging.example.com npm run load:test:staging
```
Save the `reports/load-staging-*.json` report together with the release notes.
## Backup
Run before every major deploy and at least once a day:
```bash
npm run backup:production -- --out /var/backups/qris --include-mosquitto
```
Make sure the backup is copied to secure, encrypted storage. Important files:
- Postgres dump `.dump`
- Mosquitto passwd
- Mosquitto ACL
- Environment/secret references kept in the secret manager, not in plain-text files
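When a backup is copied to other storage, a checksum manifest makes it possible to confirm the copy is intact. A minimal sketch, assuming the backup files sit directly in the output directory and `sha256sum` is installed; `write_checksums`, `verify_checksums`, and the `SHA256SUMS` file name are hypothetical and not produced by `backup:production`:

```bash
# Hypothetical helpers: record and verify SHA-256 checksums for a backup directory.
write_checksums() {
  # Run in a subshell so the caller's working directory is unchanged.
  (
    cd "$1" || exit 1
    # Hash every regular file except the manifest itself.
    find . -maxdepth 1 -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS
  )
}

verify_checksums() {
  # Exits non-zero if any file is missing or its hash differs.
  ( cd "$1" && sha256sum -c SHA256SUMS )
}

# Example:
# write_checksums /var/backups/qris      # on the source host
# verify_checksums /mnt/offsite/qris     # on the copy (path is illustrative)
```

Checksums only detect corruption or tampering; they do not replace encrypting the backup.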
## Restore Drill
1. Prepare a disposable database.
2. Show the plan:
```bash
npm run restore:plan -- --backup /var/backups/qris/<dump>.dump
```
3. Run the restore against the disposable database only:
```bash
npm run restore:plan -- --backup /var/backups/qris/<dump>.dump -- --execute
```
4. Start the service pointing at the restored DB.
5. Validate:
```bash
npm run restore:validate
```
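Because the restore must never touch a live database, it can help to check the target before executing. A sketch of such a guard, assuming a naming convention where drill databases end in `_restore_drill` and the connection string lives in `DATABASE_URL`; both the convention and the variable name are assumptions, not something the restore scripts are known to enforce:

```bash
# Hypothetical guard: succeed only if the database name in a Postgres URL
# ends with "_restore_drill" (a naming convention assumed for this sketch).
is_disposable_db() {
  local url="$1"
  local name="${url##*/}"   # text after the last slash
  name="${name%%[?]*}"      # drop any ?query parameters
  case "$name" in
    *_restore_drill) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: refuse to continue when the target looks like a real database.
# is_disposable_db "$DATABASE_URL" || { echo "refusing: not a drill database" >&2; exit 1; }
```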
## Rollback
1. Stop traffic to the new release.
2. Roll back the service image/release to the previous version.
3. If the new migration is additive only, do not roll back the database.
4. If the database must be reverted, restore the latest backup to a disposable database first, then promote it according to the infra procedure.
5. Check `/health` and `/admin/health/deep`, then run a minimal smoke test.
## Incident Response
### API latency or error rate rising
1. Check `/admin/observability/summary`.
2. Search the logs by `request_id`/`trace_id`.
3. Check Postgres connections and slow queries.
4. Reduce traffic or tighten the rate limit if needed.
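With `LOG_FORMAT=json`, following one request through the logs can be as simple as a fixed-string search. A minimal sketch, assuming each log line is a compact JSON object with a `request_id` key and no spaces around the colon; the exact log layout and the `logs_for_request` helper are assumptions:

```bash
# Hypothetical helper: print every log line that carries a given request_id.
# Assumes one compact JSON object per line, e.g. {"request_id":"abc",...}.
logs_for_request() {
  local id="$1"
  local file="$2"
  grep -F "\"request_id\":\"${id}\"" "$file"
}

# Example (the log path is illustrative):
# logs_for_request "<request-id>" /var/log/qris/app.log
```

If `jq` is installed, `jq -c 'select(.request_id == "<request-id>")'` does the same without depending on whitespace.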
### MQTT publish/subscribe problems
1. Check `/admin/mqtt/status`.
2. Check the broker service, certificate, ACL, and passwd file.
3. Run `npm run smoke:mqtt-real`.
4. For device credentials, rotate them via the UI or `npm run mqtt:provision-device`.
### Export stuck
1. Check the `export_jobs` section of `/admin/observability/summary`.
2. Make sure `EXPORT_STORAGE_DIR` is writable.
3. Restart the worker/app to reset stale running jobs.
4. If the file has expired, ask the user to create a new export.
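The writability check in step 2 is most reliable when it actually creates a file as the same OS user the app runs under. A minimal sketch; `check_export_dir` is a hypothetical helper, not part of the repo:

```bash
# Hypothetical check: confirm a directory exists and accepts new files.
check_export_dir() {
  local dir="$1"
  local probe
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  # Create and remove a probe file; this also catches a read-only filesystem.
  probe="$(mktemp "$dir/.write-probe.XXXXXX")" || { echo "not writable: $dir" >&2; return 1; }
  rm -f "$probe"
}

# Example: run as the same OS user as the app.
# check_export_dir "$EXPORT_STORAGE_DIR"
```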
### Login brute force
1. Check the audit log for the actions `admin.login.failed` and `merchant.login.failed`.
2. Make `RATE_LIMIT_LOGIN_MAX` stricter.
3. Temporarily disable suspicious users via the DB or admin tooling.
## Routine Operations
- Daily: check health and deep health, backups, MQTT status, and failed notifications.
- Weekly: run a sample restore drill, review failed-login audit entries, and review export storage usage.
- Before piloting a new device: provision credentials, update the broker passwd file, validate the ACL, and run the MQTT ACL smoke test.