# QRIS Soundbox Platform Operational Runbook

## Scope

This runbook is for operators of the pilot, staging, and production environments. All commands are assumed to run from the repo root or the release directory.

## Pre-Deploy

1. Pull or build the release artifact.
2. Fill in the production environment and make sure no secret is left at its default value.
3. Run:

```bash
npm ci
npm run typecheck
npm audit
npm run db:migrate
npm run deploy:check-env
npm run mqtt:check-acl -- --file /etc/mosquitto/acl
```

4. Create or verify the production admin and merchant users:

```bash
npm run admin:create-user -- --email <email> --name <name> --role admin --password <strong-password>
npm run merchant:create-user -- --merchant <merchant-id-or-code> --email <email> --name <name> --role owner --password <strong-password>
```

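The pre-deploy checks are only useful if a failing step actually stops the release. One way to chain them is sketched below; `run_steps` is a hypothetical helper (not a script shipped with the platform) that takes each command as one quoted string and aborts at the first failure.

```bash
#!/usr/bin/env bash
# Sketch only: run_steps is a hypothetical helper, not part of the repo.
# It runs each command string in order and stops at the first one that fails.
run_steps() {
  local step
  for step in "$@"; do
    echo "==> $step"
    # bash -c runs the step in its own shell; a non-zero exit aborts the sequence.
    if ! bash -c "$step"; then
      echo "FAILED: $step" >&2
      return 1
    fi
  done
  echo "All pre-deploy steps passed"
}

# Example call (commented out so that sourcing this file runs nothing):
# run_steps "npm ci" "npm run typecheck" "npm audit" \
#   "npm run db:migrate" "npm run deploy:check-env" \
#   "npm run mqtt:check-acl -- --file /etc/mosquitto/acl"
```

Passing the steps as arguments keeps the order visible in one place, and the non-zero return value lets a CI job or wrapper script halt the deploy.
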
## Deploy

1. Run the migration before the new service receives traffic:

```bash
npm run db:migrate
```

2. Start or restart the service with `LOG_FORMAT=json`.
3. Check the health endpoints:

```bash
curl -fsS http://127.0.0.1:3000/health
curl -fsS http://127.0.0.1:3000/health/deep
```

4. Check the authenticated admin health endpoint:

```bash
curl -fsS -H "Authorization: Bearer <admin-token>" http://127.0.0.1:3000/admin/health/deep
```

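Right after a restart the service may need a few seconds before `/health` answers, so a single `curl` can fail spuriously. A minimal retry sketch follows; `wait_until` is a hypothetical helper, and the attempt count and delay are placeholder values to tune per environment.

```bash
#!/usr/bin/env bash
# Sketch only: wait_until is a hypothetical helper, not part of the repo.
# Usage: wait_until <max_attempts> <delay_seconds> <command...>
wait_until() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  # Re-run the probe command until it succeeds or the attempts run out.
  while ! "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "ok after $attempt attempt(s): $*"
}

# Example call (commented out so that sourcing this file runs nothing):
# wait_until 10 3 curl -fsS http://127.0.0.1:3000/health
```

Because the probe is passed as an ordinary command, the same helper works for `/health/deep` and for the authenticated admin endpoint.
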
## Post-Deploy Smoke

```bash
npm run smoke:e2e
npm run ui:qa
npm run smoke:mqtt-real
MQTT_TEST_DEVICE_A_USERNAME=<device-a-id> MQTT_TEST_DEVICE_A_PASSWORD=<secret-a> MQTT_TEST_DEVICE_B_USERNAME=<device-b-id> npm run smoke:mqtt-acl
```

For a staging or production-like baseline:

```bash
BASE_URL=https://staging.example.com npm run load:test:staging
```

Save the `reports/load-staging-*.json` report alongside the release notes.

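Copying the load reports next to the release notes can be scripted so it is not forgotten. The sketch below assumes a hypothetical helper `archive_load_reports` and a destination directory of your choosing; only the `load-staging-*.json` file pattern comes from this runbook.

```bash
#!/usr/bin/env bash
# Sketch only: archive_load_reports and the destination layout are hypothetical.
# Usage: archive_load_reports <reports_dir> <release_notes_dir>
# Prints the number of reports copied; fails when none were found.
archive_load_reports() {
  local src=$1 dest=$2 copied=0 report
  mkdir -p "$dest"
  for report in "$src"/load-staging-*.json; do
    # When nothing matches, the glob stays literal, so skip non-existent paths.
    [ -e "$report" ] || continue
    cp "$report" "$dest"/
    copied=$((copied + 1))
  done
  echo "$copied"
  [ "$copied" -gt 0 ]
}

# Example call (commented out; the destination path is a placeholder):
# archive_load_reports reports "release-notes/<release-tag>"
```
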
## Backup

Run a backup before every major deploy, and at least daily:

```bash
npm run backup:production -- --out /var/backups/qris --include-mosquitto
```

Make sure the backup is copied to secure, encrypted storage. Important files:

- Postgres dump (`.dump`)
- Mosquitto passwd
- Mosquitto ACL
- Environment/secret references in the secret manager, not in plain-text files

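The "at least daily" rule is easy to verify mechanically. A minimal sketch, assuming the dumps land directly in the `--out` directory with a `.dump` extension; `has_fresh_backup` is a hypothetical helper and the 1440-minute default is simply 24 hours.

```bash
#!/usr/bin/env bash
# Sketch only: has_fresh_backup is a hypothetical helper, not part of the repo.
# Usage: has_fresh_backup <backup_dir> [max_age_minutes]
# Succeeds when the directory holds at least one .dump file modified
# within the last max_age_minutes (default 1440, i.e. 24 hours).
has_fresh_backup() {
  local dir=$1 max_age_minutes=${2:-1440} fresh
  fresh=$(find "$dir" -maxdepth 1 -type f -name '*.dump' \
    -mmin "-$max_age_minutes" | head -n 1)
  [ -n "$fresh" ]
}

# Example call (commented out so that sourcing this file runs nothing):
# has_fresh_backup /var/backups/qris || echo "no backup in the last 24h" >&2
```
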
## Restore Drill

1. Prepare a disposable database.
2. Print the restore plan:

```bash
npm run restore:plan -- --backup /var/backups/qris/<dump>.dump
```

3. Execute the restore, and only against the disposable database:

```bash
npm run restore:plan -- --backup /var/backups/qris/<dump>.dump -- --execute
```

4. Start the service pointed at the restored database.
5. Validate:

```bash
npm run restore:validate
```

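Since `--execute` must only ever target a disposable database, a guard in front of it can catch a wrong connection string. The sketch below assumes the target is given as a Postgres URL (for example in a `DATABASE_URL` variable) and that disposable databases are named with a `_restore` or `_disposable` suffix. Both the variable and the naming convention are assumptions, not something this runbook defines.

```bash
#!/usr/bin/env bash
# Sketch only: is_disposable_db and the suffix convention are hypothetical.
# Usage: is_disposable_db <postgres_url>
is_disposable_db() {
  local url=$1 name
  # Drop any query string, then keep everything after the last slash.
  name=${url%%\?*}
  name=${name##*/}
  case "$name" in
    *_restore|*_disposable) return 0 ;;
    *) return 1 ;;
  esac
}

# Example call (commented out; DATABASE_URL is an assumed variable name):
# is_disposable_db "$DATABASE_URL" || { echo "refusing to restore" >&2; exit 1; }
```
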
## Rollback

1. Stop traffic to the new release.
2. Roll the service image/release back to the previous version.
3. If the new migration is additive only, do not roll back the database.
4. If the database must be reverted, first restore the latest backup to a disposable database, then promote it following the infra procedure.
5. Check `/health` and `/admin/health/deep`, then run a minimal smoke test.

## Incident Response

### API latency or error rate rising

1. Check `/admin/observability/summary`.
2. Search the logs by `request_id`/`trace_id`.
3. Check Postgres connections and slow queries.
4. Reduce traffic or tighten rate limits if needed.

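With `LOG_FORMAT=json`, pulling every log line for one request is a fixed-string search. A minimal sketch, assuming compact JSON logs with one object per line and a string field named `request_id`; the helper name and the log file path are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: logs_for_request is a hypothetical helper. It assumes compact
# JSON logs (no space after the colon), one object per line.
# Usage: logs_for_request <request_id> <log_file>
logs_for_request() {
  local id=$1 file=$2
  # -F matches a fixed string, so characters in the id are never read as regex.
  grep -F "\"request_id\":\"$id\"" "$file"
}

# Example call (commented out; the path is a placeholder):
# logs_for_request "<request-id>" /var/log/qris/app.log
```

If the logger emits spaces after colons or reorders fields unpredictably, a JSON-aware tool such as `jq` is the sturdier choice.
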
### MQTT publish/subscribe problems

1. Check `/admin/mqtt/status`.
2. Check the broker service, certificate, ACL, and passwd file.
3. Run `npm run smoke:mqtt-real`.
4. For device credentials, rotate them via the UI or `npm run mqtt:provision-device`.

### Exports stuck

1. Check the `export_jobs` section of `/admin/observability/summary`.
2. Make sure `EXPORT_STORAGE_DIR` is writable.
3. Restart the worker/app to reset stale running jobs.
4. If the file has expired, ask the user to create a new export.

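The writability check in step 2 can be done by creating and removing a probe file. A minimal sketch; `check_export_dir` is a hypothetical helper that falls back to the `EXPORT_STORAGE_DIR` environment variable when no argument is given.

```bash
#!/usr/bin/env bash
# Sketch only: check_export_dir is a hypothetical helper, not part of the repo.
# Usage: check_export_dir [dir]   (defaults to $EXPORT_STORAGE_DIR)
# Returns 0 if writable, 1 if not writable, 2 if no directory was given.
check_export_dir() {
  local dir=${1:-${EXPORT_STORAGE_DIR:-}} probe
  if [ -z "$dir" ]; then
    echo "EXPORT_STORAGE_DIR is not set" >&2
    return 2
  fi
  # Creating and removing a real file exercises the write path the export job needs.
  probe=$(mktemp "$dir/.write-probe.XXXXXX" 2>/dev/null) || {
    echo "not writable: $dir" >&2
    return 1
  }
  rm -f "$probe"
  echo "writable: $dir"
}
```

Run it as the same OS user as the app or worker process, since the result depends on that user's permissions.
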
### Login brute force

1. Check the audit log for the actions `admin.login.failed` and `merchant.login.failed`.
2. Tighten the login rate limit via `RATE_LIMIT_LOGIN_MAX`.
3. Temporarily disable suspicious users via the DB or admin tooling.

## Routine Operations

- Daily: check health and deep health, backups, MQTT status, and failed notifications.
- Weekly: run a sample restore drill, review failed-login audit entries, and review export storage usage.
- Before piloting a new device: provision credentials, update the broker passwd file, validate the ACL, and run the MQTT ACL smoke test.