# Troubleshooting
Use this page for deterministic diagnosis and remediation of common failures.
## First Checks

```sh
curl -s http://127.0.0.1:8080/healthz | jq .
go run ./cmd/orlojctl get workers
go run ./cmd/orlojctl get tasks
```

If these checks fail, inspect orlojd and orlojworker logs first.
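When scripting these checks, the health response can be gated on without `jq`. A minimal sketch, assuming a `/healthz` payload shaped like `{"status":"ok",...}` (the real response fields may differ):

```sh
# Hypothetical /healthz payload; the real response shape may differ.
resp='{"status":"ok","storage":"postgres"}'

# Extract the status field with portable sed (no jq dependency).
status=$(printf '%s' "$resp" | sed -n 's/.*"status":"\([^"]*\)".*/\1/p')

if [ "$status" = "ok" ]; then
  echo "control plane healthy"
else
  echo "unhealthy status: $status" >&2
fi
```

In practice `resp` would come from `curl -s http://127.0.0.1:8080/healthz`.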
## Common Issues
### postgres backend selected but `--postgres-dsn` is empty

Cause:

- `--storage-backend=postgres` is set without a DSN.

Fix:

```sh
export ORLOJ_POSTGRES_DSN='postgres://orloj:orloj@127.0.0.1:5432/orloj?sslmode=disable'
```

### Unsupported backend values
Cause:

- invalid value for a storage, event, message, or tool-isolation backend flag.

Fix:

- storage: `memory|postgres`
- event bus (`orlojd`): `memory|nats`
- runtime message bus: `none|memory|nats-jetstream`
- tool isolation: `none|container|wasm`
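The allowed values above can be mirrored in a small pre-flight validator. This is an illustrative sketch (the `validate_backend` helper and the flag names used as keys are not part of orlojctl):

```sh
# Illustrative validator for the allowed backend values listed above.
validate_backend() {
  flag="$1"; value="$2"
  case "$flag:$value" in
    storage:memory|storage:postgres) return 0 ;;
    event-bus:memory|event-bus:nats) return 0 ;;
    message-bus:none|message-bus:memory|message-bus:nats-jetstream) return 0 ;;
    tool-isolation:none|tool-isolation:container|tool-isolation:wasm) return 0 ;;
    *) echo "unsupported $flag backend: $value" >&2; return 1 ;;
  esac
}

validate_backend storage postgres && echo "storage ok"
```

Running it against a bad value (e.g. `validate_backend storage sqlite`) prints the error and returns non-zero, which makes it usable as a CI guard.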
### Workers never claim tasks

Checks:

- worker is `Ready` and heartbeating
- execution mode matches deployment mode
- model provider/auth is valid
- task requirements (`region`, `gpu`, `model`) match worker capabilities

Commands:

```sh
go run ./cmd/orlojctl get workers
go run ./cmd/orlojctl get tasks
go run ./cmd/orlojctl trace task <task-name>
```

### Message-driven flow not progressing
Cause:

- worker consumer is not enabled.

Fix:

- set `--agent-message-consume`
- set `--agent-message-bus-backend` to a non-`none` value
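The failure mode is easiest to see as the conjunction of the two flags: consumption must be enabled *and* a real bus backend selected. A hedged sketch of that gating logic (the evaluation below is illustrative, not orlojworker source):

```sh
# Flag spellings come from this page; values here model a stalled deployment.
consume="true"      # --agent-message-consume is set
bus_backend="none"  # --agent-message-bus-backend left at its default

if [ "$consume" = "true" ] && [ "$bus_backend" != "none" ]; then
  flow="active"
else
  flow="stalled"
fi
echo "message-driven flow: $flow (consume=$consume, bus=$bus_backend)"
```

Either flag alone is not enough; both conditions must hold before messages are consumed.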
### Tool calls fail with permission denials

Cause:

- governance policy denies the requested action.

Fix:

- validate `Agent.spec.roles`, `AgentRole`, and `ToolPermission`.
- inspect `tool_code`, `tool_reason`, and `retryable` in trace metadata.
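When reading trace metadata, `retryable` tells you whether a denial is worth retrying. A triage sketch over a hypothetical metadata entry (the `PERMISSION_DENIED` code and reason text are invented for illustration; only the field names come from this page):

```sh
# Hypothetical trace metadata; field names come from the list above.
meta='{"tool_code":"PERMISSION_DENIED","tool_reason":"role lacks tool access","retryable":false}'

code=$(printf '%s' "$meta" | sed -n 's/.*"tool_code":"\([^"]*\)".*/\1/p')
case "$meta" in
  *'"retryable":true'*) verdict="transient; retry may succeed" ;;
  *)                    verdict="policy denial; fix roles/permissions" ;;
esac
echo "$code: $verdict"
```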
### Model provider auth failures

Cause:

- missing or invalid provider key.

Fix:

- set `--model-gateway-api-key` or a provider env var (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `AZURE_OPENAI_API_KEY`).
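A quick way to recall which env var a given provider expects. The `key_var_for` helper below is a hypothetical convenience, not part of orlojctl; the variable names are the ones listed above:

```sh
# Map a provider name to the env var it reads (names from the list above).
key_var_for() {
  case "$1" in
    openai)       echo OPENAI_API_KEY ;;
    anthropic)    echo ANTHROPIC_API_KEY ;;
    azure-openai) echo AZURE_OPENAI_API_KEY ;;
    *) echo "unknown provider: $1" >&2; return 1 ;;
  esac
}

key_var_for anthropic
```

Combined with a presence check such as `test -n "${ANTHROPIC_API_KEY:-}"`, this catches missing keys before the worker starts.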
### Wasm/container runtime errors

Cause:

- missing runtime binary/module path or invalid runtime configuration.

Fix:

- verify backend-specific settings (container runtime settings or wasm module/runtime configuration).
## Observability Diagnostics
### Logs are unstructured or missing request IDs

Cause:

- `ORLOJ_LOG_FORMAT` is not set, or the binary predates the structured logging migration.

Fix:

- Set `ORLOJ_LOG_FORMAT=json` (the default) to emit structured JSON logs with `request_id`, `trace_id`, and `span_id` fields.
- Set `ORLOJ_LOG_FORMAT=text` for human-readable output during local development.
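For reference, a JSON-mode log line carries the correlation fields inline, and each can be pulled out without `jq`. The sample line below is hypothetical; only the field names come from this page:

```sh
# Hypothetical JSON log line with the correlation fields named above.
line='{"level":"info","msg":"task claimed","request_id":"req-123","trace_id":"abc123","span_id":"def456"}'

# Extract request_id with portable sed.
req=$(printf '%s' "$line" | sed -n 's/.*"request_id":"\([^"]*\)".*/\1/p')
echo "request_id=$req"
```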
### Traces not appearing in Jaeger/Tempo

Cause:

- `OTEL_EXPORTER_OTLP_ENDPOINT` is not set or the backend is unreachable.

Fix:

```sh
export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
export OTEL_EXPORTER_OTLP_INSECURE=true  # for non-TLS dev backends
```

Restart orlojd and orlojworker, then verify spans appear in the backend UI.
### Prometheus /metrics returning 404

Cause:

- Running a build that predates the metrics endpoint addition.

Fix:

- Rebuild from the latest source and verify that `curl http://127.0.0.1:8080/metrics` returns Prometheus text output.
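Once the endpoint responds, filtering for the `orloj_` prefix separates service metrics from Go runtime metrics. The excerpt below is a hypothetical sample; real metric names may differ:

```sh
# Hypothetical /metrics excerpt; metric names are illustrative.
metrics='# HELP orloj_tasks_total Total tasks processed.
# TYPE orloj_tasks_total counter
orloj_tasks_total{status="done"} 42
go_goroutines 8'

# Count samples exposed under the orloj_ prefix (HELP/TYPE lines start with #).
orloj_samples=$(printf '%s\n' "$metrics" | grep -c '^orloj_')
echo "orloj_ samples: $orloj_samples"
```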
### Correlating a log entry with a trace

Use the `trace_id` field from a JSON log entry to search in your tracing backend:

```sh
# Find trace ID in logs
grep '"trace_id"' /var/log/orlojd.log | head -5
```

Then search for that trace ID in Jaeger, Tempo, or the web console Trace tab.
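To go from raw grep hits to a deduplicated list of IDs you can paste into Jaeger/Tempo search, extract the field value and sort. A sketch over an inline sample log (line shape as described above; real logs would be read from `/var/log/orlojd.log`):

```sh
# Sample structured log lines; in practice read them from the log file.
log='{"msg":"claim","trace_id":"t1"}
{"msg":"run","trace_id":"t1"}
{"msg":"done","trace_id":"t2"}'

# Extract trace_id values and deduplicate.
ids=$(printf '%s\n' "$log" | sed -n 's/.*"trace_id":"\([^"]*\)".*/\1/p' | sort -u)
printf '%s\n' "$ids"
```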
## Escalation Workflow

- Capture the failing command and exact error text.
- Capture the task trace: `go run ./cmd/orlojctl trace task <task-name>`
- Capture recent events: `go run ./cmd/orlojctl events --once --timeout=30s --raw`
- Capture relevant Prometheus metrics (if applicable): `curl -s http://127.0.0.1:8080/metrics | grep orloj_`
- File an issue with logs, trace, metrics, and manifest snippets.