Production readiness hardening and ops tooling
OPERATIONAL_RUNBOOK.md
Normal file
@ -0,0 +1,146 @@
# QRIS Soundbox Platform Operational Runbook

## Scope

This runbook is for operators of the pilot, staging, and production environments. All commands are assumed to run from the repository root or the release directory.

## Pre-Deploy

1. Pull or build the release artifact.
2. Fill in the production environment and make sure no secret is left at its default value.
3. Run:

```bash
npm ci
npm run typecheck
npm audit
npm run db:migrate
npm run deploy:check-env
npm run mqtt:check-acl -- --file /etc/mosquitto/acl
```

4. Create or verify the production admin and merchant users:

```bash
npm run admin:create-user -- --email <email> --name <name> --role admin --password <strong-password>
npm run merchant:create-user -- --merchant <merchant-id-or-code> --email <email> --name <name> --role owner --password <strong-password>
```
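
Step 2 asks that no secret keeps a default value. The sketch below shows one way an operator could spot-check this by hand. It is not the implementation of `npm run deploy:check-env`; the placeholder list and the variable names in the usage line (`JWT_SECRET`, `MQTT_PASSWORD`) are assumptions for illustration only.

```bash
# Sketch: reject a secret that is empty or still a well-known placeholder.
# The placeholder list below is an assumption; extend it for your setup.

is_placeholder() {
  # Succeeds (exit 0) when the value looks like a default or placeholder.
  case "$1" in
    ""|changeme|change-me|secret|password|default|example) return 0 ;;
    *) return 1 ;;
  esac
}

require_secret() {
  # Usage: require_secret NAME VALUE
  # Prints a warning and returns non-zero when VALUE is unsafe.
  if is_placeholder "$2"; then
    echo "unsafe value for $1" >&2
    return 1
  fi
}
```

A possible call, with hypothetical variable names: `require_secret JWT_SECRET "${JWT_SECRET-}" && require_secret MQTT_PASSWORD "${MQTT_PASSWORD-}"`. Passing the value explicitly keeps the helper free of indirection, so it works in any POSIX shell.
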
## Deploy

1. Run the migrations before the new service receives traffic:

```bash
npm run db:migrate
```

2. Start or restart the service with `LOG_FORMAT=json`.
3. Check the health endpoints:

```bash
curl -fsS http://127.0.0.1:3000/health
curl -fsS http://127.0.0.1:3000/health/deep
```

4. Check the authenticated admin health endpoint:

```bash
curl -fsS -H "Authorization: Bearer <admin-token>" http://127.0.0.1:3000/admin/health/deep
```
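
A freshly restarted service may need a few seconds before `/health` answers, so a single `curl` can fail spuriously. The helper below is a minimal sketch that polls an endpoint with the same `curl -fsS` flags used above; the function name, attempt count, and delay are assumptions, not part of the repository tooling.

```bash
# Sketch: poll a health endpoint until it responds or the attempts run out.

wait_for_health() {
  # Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f turns HTTP errors into a non-zero exit; --max-time bounds each try.
    if curl -fsS --max-time 5 "$url" >/dev/null; then
      echo "healthy after ${i} attempt(s): ${url}"
      return 0
    fi
    # Wait between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still unhealthy after ${attempts} attempt(s): ${url}" >&2
  return 1
}
```

For example, `wait_for_health http://127.0.0.1:3000/health && wait_for_health http://127.0.0.1:3000/health/deep` could gate the step that shifts traffic to the new release.
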
## Post-Deploy Smoke

```bash
npm run smoke:e2e
npm run ui:qa
npm run smoke:mqtt-real
MQTT_TEST_DEVICE_A_USERNAME=<device-a-id> MQTT_TEST_DEVICE_A_PASSWORD=<secret-a> MQTT_TEST_DEVICE_B_USERNAME=<device-b-id> npm run smoke:mqtt-acl
```

For a staging or production-like baseline:

```bash
BASE_URL=https://staging.example.com npm run load:test:staging
```

Keep the `reports/load-staging-*.json` reports with the release notes.
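
Keeping the load reports next to the release notes can be scripted. The sketch below copies every `reports/load-staging-*.json` file into a destination directory; the function name and the destination layout are hypothetical, since the runbook does not say where release notes live.

```bash
# Sketch: copy load-test reports into a per-release directory.

archive_load_reports() {
  # Usage: archive_load_reports RELEASE_DIR [REPORT_DIR]
  local dest="$1" src="${2:-reports}" copied=0 f
  mkdir -p "$dest" || return 1
  for f in "$src"/load-staging-*.json; do
    [ -e "$f" ] || continue        # the glob matched nothing
    cp -p "$f" "$dest"/ || return 1
    copied=$((copied + 1))
  done
  echo "copied ${copied} report(s) to ${dest}"
  # Return non-zero when there was nothing to archive.
  [ "$copied" -gt 0 ]
}
```

A call such as `archive_load_reports release-notes/<version>` (path assumed) fails when no report exists, which is a cheap reminder that the load test was skipped.
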
## Backup

Before a major deploy, and at least daily:

```bash
npm run backup:production -- --out /var/backups/qris --include-mosquitto
```

Make sure the backup is copied to secure, encrypted storage. Important files:

- Postgres dump (`.dump`)
- Mosquitto passwd
- Mosquitto ACL
- Environment/secret references kept in a secret manager, not in plain-text files
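
When a backup is copied to another location, a checksum manifest is one way to confirm the copy is intact. This is a sketch of an optional extra step, not something `npm run backup:production` is known to do; it assumes GNU `sha256sum` is installed and that the backup files sit at the top level of the directory.

```bash
# Sketch: record and verify SHA-256 checksums for the files in a backup directory.

write_manifest() {
  # Usage: write_manifest BACKUP_DIR
  # Writes BACKUP_DIR/SHA256SUMS covering every regular top-level file.
  ( cd "$1" &&
    find . -maxdepth 1 -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS )
}

verify_manifest() {
  # Usage: verify_manifest BACKUP_DIR
  # Returns non-zero when any file is missing or its checksum differs.
  ( cd "$1" && sha256sum -c SHA256SUMS >/dev/null )
}
```

Run `write_manifest /var/backups/qris` before the copy and `verify_manifest` on the destination afterwards. The manifest only detects corruption or tampering; it does not replace encryption.
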
## Restore Drill

1. Prepare a disposable database.
2. Show the plan:

```bash
npm run restore:plan -- --backup /var/backups/qris/<dump>.dump
```

3. Run the restore against the disposable database only:

```bash
npm run restore:plan -- --backup /var/backups/qris/<dump>.dump -- --execute
```

4. Start the service pointed at the restored database.
5. Validate:

```bash
npm run restore:validate
```
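
Because step 3 must never touch a live database, a guard in front of the `--execute` call can help. The sketch below assumes two things the runbook does not state: that the target is given as a `postgres://` URL (for example in a `DATABASE_URL` variable), and that disposable databases follow a `*_restore_drill` naming convention.

```bash
# Sketch: refuse to continue unless the target database looks disposable.

db_name_from_url() {
  # Prints the database name from postgres://user:pass@host:port/dbname?opts
  local url="${1%%\?*}"          # drop any ?query suffix first
  printf '%s\n' "${url##*/}"     # keep the text after the last slash
}

assert_disposable_db() {
  # Usage: assert_disposable_db DATABASE_URL
  # Assumed convention: disposable databases are named *_restore_drill.
  local name
  name=$(db_name_from_url "$1")
  case "$name" in
    *_restore_drill)
      echo "ok: ${name} is a disposable drill database" ;;
    *)
      echo "refusing: ${name} is not a *_restore_drill database" >&2
      return 1 ;;
  esac
}
```

Used as `assert_disposable_db "$DATABASE_URL" && npm run restore:plan -- ...`, the restore command only runs when the name check passes. A name check is a safety net, not proof; confirm the host as well.
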
## Rollback

1. Stop traffic to the new release.
2. Roll the service image/release back to the previous version.
3. If the new migrations are purely additive, do not roll back the database.
4. If the database must be reverted, first restore the latest backup to a disposable database, then promote it following the infrastructure procedure.
5. Check `/health` and `/admin/health/deep`, then run a minimal smoke test.

## Incident Response

### API latency or error rate rising

1. Check `/admin/observability/summary`.
2. Search the logs by `request_id`/`trace_id`.
3. Check Postgres connections and slow queries.
4. Reduce traffic or lower the rate limit if needed.
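
For step 2, a fixed-string `grep` is often enough to pull every line for one request out of JSON logs. The sketch assumes the service writes one compact JSON object per line (as `LOG_FORMAT=json` suggests) with the field serialized as `"request_id":"<id>"` and no spaces; adjust the pattern if the actual serialization differs.

```bash
# Sketch: print all log lines that belong to one request.

logs_for_request() {
  # Usage: logs_for_request REQUEST_ID [LOG_FILE...]
  # Reads standard input when no file is given.
  local id="$1"
  shift
  # -F: match the text literally; -h: omit file names from the output.
  grep -hF "\"request_id\":\"${id}\"" "$@"
}
```

With no file argument the function filters a stream, so it can sit at the end of whatever command prints the service logs.
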

### MQTT publish/subscribe problems

1. Check `/admin/mqtt/status`.
2. Check the broker service, certificates, ACL, and passwd file.
3. Run `npm run smoke:mqtt-real`.
4. For device credentials, rotate them via the UI or `npm run mqtt:provision-device`.
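
When checking the ACL in step 2, it helps to see exactly which rules apply to one device. The sketch below reads a Mosquitto ACL file, where a `user <name>` line opens a section and the `topic ...` lines after it belong to that user. The function name is hypothetical, and it deliberately ignores `pattern` lines, which apply to every user.

```bash
# Sketch: list the topic rules that a Mosquitto ACL file grants to one user.

acl_rules_for_user() {
  # Usage: acl_rules_for_user ACL_FILE USERNAME
  awk -v u="$2" '
    $1 == "user"            { active = ($2 == u); next }  # a new user section starts
    $1 == "pattern"         { next }                      # global rules, not per user
    active && $1 == "topic" { print }                     # rule inside the wanted section
  ' "$1"
}
```

For instance, `acl_rules_for_user /etc/mosquitto/acl <device-a-id>` prints nothing when the device has no section at all, which would explain a device that connects but cannot publish or subscribe.
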

### Export stuck

1. Check the `export_jobs` section of `/admin/observability/summary`.
2. Make sure `EXPORT_STORAGE_DIR` is writable.
3. Restart the worker/app to reset stale running jobs.
4. If the file has expired, ask the user to create a new export.
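
For step 2, creating a real file is a more reliable test than checking permission bits, because it also fails on a read-only mount. The helper below is a sketch with a hypothetical name; run it as the same OS user the service runs as, otherwise the result says little.

```bash
# Sketch: confirm a directory accepts new files by creating and removing a probe.

check_writable_dir() {
  # Usage: check_writable_dir DIR
  local dir="$1" probe
  if [ ! -d "$dir" ]; then
    echo "not a directory: ${dir}" >&2
    return 1
  fi
  # mktemp creates the file atomically under a unique name.
  if probe=$(mktemp "${dir}/.write-probe.XXXXXX" 2>/dev/null); then
    rm -f "$probe"
    echo "writable: ${dir}"
  else
    echo "not writable: ${dir}" >&2
    return 1
  fi
}
```

A typical call would be `check_writable_dir "${EXPORT_STORAGE_DIR:?EXPORT_STORAGE_DIR is not set}"`, which also stops early when the variable is missing.
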

### Login brute force

1. Check the audit log for the `admin.login.failed` and `merchant.login.failed` actions.
2. Make `RATE_LIMIT_LOGIN_MAX` stricter.
3. Temporarily disable suspicious users via the DB or admin tooling.

## Routine Operations

- Daily: check health and deep health, backups, MQTT status, and failed notifications.
- Weekly: run a sample restore drill, review failed logins in the audit log, and review export storage usage.
- Before piloting a new device: provision credentials, update the broker passwd file, validate the ACL, and run the MQTT ACL smoke test.