Upgrades and Rollbacks
This guide defines safe upgrade and rollback procedures for the Orloj server and workers.
Principles
- prefer staged rollouts over full replacement
- take Postgres backups before upgrades
- validate reliability gates before production promotion
- couple release behavior with contract documentation
Pre-Upgrade Checklist
- Read release notes and migration notes.
- Take a full Postgres backup per the Backup and Restore guide.
- Record the current ORLOJ_SECRET_ENCRYPTION_KEY.
- Verify baseline health (/healthz, workers, task flow).
- Run smoke checks in staging.
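Some of these checklist items can be verified mechanically before an upgrade starts. The following is a minimal sketch, not part of Orloj: the function name preupgrade_gate and its backup-file argument are hypothetical, and it assumes the backup taken per the Backup and Restore guide is a single file on disk. It only confirms that ORLOJ_SECRET_ENCRYPTION_KEY is set in the environment and that the backup file is non-empty; health and smoke checks still need to be run separately.

```shell
# Hypothetical pre-upgrade gate; not an Orloj command.
# Usage: preupgrade_gate /path/to/backup-file
preupgrade_gate() {
  local backup_file="$1"

  # The key must be available so it can be recorded before upgrading.
  if [ -z "${ORLOJ_SECRET_ENCRYPTION_KEY:-}" ]; then
    echo "ORLOJ_SECRET_ENCRYPTION_KEY is not set" >&2
    return 1
  fi

  # Assumption: the Postgres backup is a single non-empty file.
  if [ ! -s "$backup_file" ]; then
    echo "backup file missing or empty: $backup_file" >&2
    return 1
  fi

  echo "pre-upgrade gate passed"
}
```

Running the gate as the first step of an upgrade script lets the script abort before any binary is replaced.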
Upgrade Procedure
- Upgrade orlojd in staging.
- Verify API health and resource status.
- Upgrade one worker (canary).
- Validate task execution paths used by your deployment.
- Upgrade remaining workers.
- Run reliability checks:
  - orloj-loadtest
  - orloj-alertcheck
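The order of the steps above matters because each one acts as a gate for the next. As a rough sketch of that idea, the helper below runs a list of steps in order and stops at the first failure. The name run_gated and the step names in the usage note are hypothetical; how orlojd and the workers are actually upgraded depends on your deployment.

```shell
# Hypothetical helper: run each named step in order, stop at the first failure.
# Each argument is the name of a command or shell function that takes no arguments.
run_gated() {
  local n=0 step
  for step in "$@"; do
    n=$((n + 1))
    echo "step ${n}: ${step}"
    if ! "$step"; then
      echo "step ${n} (${step}) failed; stopping rollout" >&2
      return 1
    fi
  done
}
```

A caller might wrap each stage in its own shell function (for example, hypothetical functions upgrade_staging_server, check_health, and upgrade_canary_worker) and invoke run_gated with those names, so that a failed health check prevents the canary upgrade from starting.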
Production Rollout
- canary one server instance and one worker first
- monitor task success/dead-letter ratio, retry volume, p95 latency, heartbeat stability
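Two of these signals can be computed from raw counts and latency samples if your metrics stack does not already expose them. This is a sketch under assumptions: the function names are illustrative, the dead-letter ratio is taken to be dead-lettered tasks divided by total tasks, latencies arrive one per line on standard input, and p95 uses the nearest-rank method. Your monitoring system may define these quantities differently.

```shell
# dead_letter_ratio DEAD TOTAL
# Prints DEAD/TOTAL with four decimals; prints 0.0000 when TOTAL is 0.
dead_letter_ratio() {
  awk -v dead="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "0.0000"; exit }
    printf "%.4f\n", dead / total
  }'
}

# p95 reads one numeric latency per line on stdin and prints the
# nearest-rank 95th percentile. Exits non-zero if there is no input.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int((95 * NR + 99) / 100)   # integer form of ceil(0.95 * NR)
      print v[idx]
    }'
}
```

Comparing these values for the canary against the same values for the not-yet-upgraded instances gives a like-for-like baseline during the rollout.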
Rollback Triggers
- server health degradation
- retry/dead-letter rates exceed SLO thresholds
- unexpected increase in non-retryable runtime/policy failures
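The rate-based trigger can be turned into an automated check. The sketch below is illustrative only: should_rollback is a hypothetical helper, and the thresholds are arguments you supply from your own SLOs, not values defined by this guide. Server health degradation and unexpected failure types still need their own signals or operator judgment.

```shell
# should_rollback RETRY_RATE DEAD_LETTER_RATE MAX_RETRY_RATE MAX_DEAD_LETTER_RATE
# Returns 0 (roll back) if either observed rate exceeds its threshold, else 1.
should_rollback() {
  awk -v r="$1" -v d="$2" -v rmax="$3" -v dmax="$4" 'BEGIN {
    if (r > rmax || d > dmax) exit 0
    exit 1
  }'
}
```

For example, with placeholder thresholds of 0.05 for retries and 0.02 for dead letters, `should_rollback 0.08 0.01 0.05 0.02 && echo "rollback"` prints rollback because the observed retry rate is above its limit.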
Rollback Procedure
- Revert server and worker binaries to the previous release.
- Restore previous configuration values.
- Restore Postgres from backup if required (see Backup and Restore).
- Re-run smoke checks before resuming rollout.
Compatibility Guidance
- keep compatibility checks green for pinned downstream consumers
- avoid unversioned breaking changes on public contracts
- treat contract graduation and lifecycle changes as release events
Validation Commands
curl -s http://127.0.0.1:8080/healthz | jq .
go run ./cmd/orlojctl get workers
go run ./cmd/orlojctl get tasks
go run ./cmd/orloj-loadtest --quality-profile=monitoring/loadtest/quality-default.json --tasks=50
go run ./cmd/orloj-alertcheck --profile=monitoring/alerts/retry-deadletter-default.json