Operations Runbook
Use this runbook for baseline production operation and incident response.
Reference Topology
- orlojd server
- Postgres state backend
- NATS JetStream for message-driven execution
- multiple orlojworker instances
Startup Procedure
- Start Postgres and NATS.
- Start orlojd with --storage-backend=postgres and --task-execution-mode=message-driven.
- Start at least two workers with --agent-message-consume.
- Configure model provider and credentials.
- Apply required resources (ModelEndpoint, Tool, Agent, AgentSystem, Task, governance resources).
Verification
curl -s http://127.0.0.1:8080/healthz | jq .
go run ./cmd/orlojctl get workers
go run ./cmd/orlojctl get tasks
curl -s http://127.0.0.1:8080/metrics | head -20
Expected result:
- API health endpoint reports healthy.
- Workers report Ready and heartbeat updates.
- Tasks transition through expected lifecycle.
- /metrics returns Prometheus text output with orloj_* metrics.
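
The health and metrics checks above can be wrapped in a small script for use after a deploy or restart. This is a minimal sketch, not part of Orloj: the helper names wait_healthy and has_orloj_metrics, the 30-attempt default, and the one-second poll interval are assumptions. The URLs are the ones used in the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for post-start verification (not shipped with Orloj).

# wait_healthy URL [ATTEMPTS]
# Polls URL once per second until it answers with an HTTP success status.
# Only the status code is checked; the response body is not inspected.
wait_healthy() {
  local url="$1" attempts="${2:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf "$url" >/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# has_orloj_metrics
# Reads Prometheus text output on stdin and succeeds if at least one
# sample line starts with "orloj_". "# HELP" and "# TYPE" comment lines
# do not count, because they start with "#".
has_orloj_metrics() {
  grep -q '^orloj_'
}
```

A possible invocation is: wait_healthy http://127.0.0.1:8080/healthz && curl -s http://127.0.0.1:8080/metrics | has_orloj_metrics. It exits non-zero if the API never becomes healthy or if no orloj_* metric is exposed.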
Failure and Recovery Expectations
- Worker crash: lease expires and another worker can claim.
- Retry behavior: delayed requeue until success or dead-letter.
- Policy/graph validation failures: non-retryable, deterministic dead-letter.
- Tool runtime denials/errors: normalized metadata in trace/log paths.
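
The retry and dead-letter expectations above can be read as a small decision table. The sketch below models that table for drills and alert tuning; it is not Orloj's implementation. The function name next_action, the attempt limit passed as an argument, and the backoff formula (2^attempt seconds, capped at 60) are all assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical model of the retry decision (not Orloj's actual code).

# next_action ATTEMPT MAX_ATTEMPTS RETRYABLE
#   ATTEMPT      number of attempts that have already failed
#   MAX_ATTEMPTS assumed attempt limit before dead-lettering
#   RETRYABLE    "yes" for transient failures, "no" for policy/graph
#                validation failures, which are never retried
next_action() {
  local attempt="$1" max="$2" retryable="$3" delay
  if [ "$retryable" != "yes" ] || [ "$attempt" -ge "$max" ]; then
    echo "dead-letter"
    return 0
  fi
  # Assumed backoff: 2^attempt seconds, capped at 60 seconds.
  delay=$((1 << attempt))
  if [ "$delay" -gt 60 ]; then
    delay=60
  fi
  echo "requeue ${delay}s"
}
```

Under these assumptions, a retryable failure is requeued with a growing delay until the attempt limit is reached, while a non-retryable failure goes to dead-letter on the first attempt regardless of the limit.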
Observability
- Configure OTEL_EXPORTER_OTLP_ENDPOINT on both orlojd and orlojworker for distributed tracing.
- Prometheus scrapes /metrics on the orlojd HTTP port; add the target to your Prometheus scrape config.
- Logs are structured JSON by default (ORLOJ_LOG_FORMAT=json) with request_id and trace_id fields.
- The web console Trace tab shows task execution waterfall without any external backend.
- See Observability for full setup details.
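
Because the logs are JSON with request_id and trace_id fields, jq can correlate log lines with traces during an incident. The following is a minimal sketch: the helper name trace_ids is hypothetical, and it assumes one JSON object per log line with both fields at the top level.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Orloj).

# trace_ids
# Reads log lines on stdin and prints "request_id<TAB>trace_id" for every
# JSON object that carries a trace_id. Lines that are not valid JSON, or
# that have no trace_id, are skipped silently.
trace_ids() {
  jq -rR 'fromjson?
          | select(type == "object" and .trace_id != null)
          | [.request_id, .trace_id]
          | @tsv'
}
```

Piping the orlojd or orlojworker log stream through trace_ids gives a compact list of identifiers to look up in the tracing backend or in the web console Trace tab.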
Reliability Operations
- Run go run ./cmd/orloj-loadtest for repeatable load/failure validation.
- Run go run ./cmd/orloj-alertcheck to validate retry/dead-letter thresholds.
- Keep alert and load profile thresholds aligned with SLO targets.