Monitoring and Alerts
Use orloj-alertcheck and dashboard contracts to validate runtime reliability signals.
For Prometheus metrics, OpenTelemetry tracing, structured logging, and trace visualization, see Observability.
Purpose
This guide defines repeatable checks for retry storms, dead-letter growth, and latency saturation.
Artifacts
- Alert profile (default): monitoring/alerts/retry-deadletter-default.json
- Alert profile (CI): monitoring/alerts/retry-deadletter-ci.json
- Dashboard contract: monitoring/dashboards/retry-deadletter-overview.json
- Alert check command: cmd/orloj-alertcheck
The CI profile uses a lower min_tasks floor and a higher latency ceiling to accommodate CI runner variability. It is used by the reliability job in .github/workflows/ci.yml.
Alert Check Command
go run ./cmd/orloj-alertcheck \
--base-url=http://127.0.0.1:8080 \
--namespace=default \
--profile=monitoring/alerts/retry-deadletter-default.json \
--json=true
orloj-alertcheck Flags
| Flag | Default | Description |
|---|---|---|
| --base-url | http://127.0.0.1:8080 | Orloj API base URL. |
| --namespace | default | Target namespace for task queries. |
| --api-token | empty | Optional bearer token for API auth (env fallback: ORLOJ_API_TOKEN). |
| --profile | monitoring/alerts/retry-deadletter-default.json | Alert threshold profile JSON file. |
| --task-name-prefix | empty | Optional filter by task name prefix. |
| --task-system | empty | Optional filter by Task.spec.system. |
| --poll-concurrency | 20 | Concurrent task metrics fetch workers. |
| --timeout | 2m | Global command timeout. |
| --json | true | Emit machine-readable JSON output. |
| --verbose | false | Emit verbose progress logs. |
For authoritative defaults and full CLI context, see CLI reference.
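The --api-token row describes a flag with an environment fallback. A minimal Go sketch of that pattern follows, assuming the flag value takes precedence and ORLOJ_API_TOKEN is consulted only when the flag is empty. The helper names `resolveToken` and `newRequest` are hypothetical, and the request targets the bare base URL because this guide does not list API paths.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// resolveToken returns the flag value when set, otherwise the value of
// ORLOJ_API_TOKEN. The precedence order is an assumption based on the
// phrase "env fallback"; getenv is injected so the logic is easy to test.
func resolveToken(flagValue string, getenv func(string) string) string {
	if flagValue != "" {
		return flagValue
	}
	return getenv("ORLOJ_API_TOKEN")
}

// newRequest builds a GET request and attaches a bearer token if one is set.
// It does not send the request.
func newRequest(baseURL, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL, nil)
	if err != nil {
		return nil, err
	}
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	token := resolveToken("", os.Getenv)
	req, err := newRequest("http://127.0.0.1:8080", token)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("auth header set:", req.Header.Get("Authorization") != "")
}
```

Leaving the token empty sends no Authorization header, which matches the flag being optional.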
Loadtest Reliability Gates and Injection Controls
Use orloj-loadtest to validate system behavior under expected and fault-injected load patterns.
Key reliability gate controls:
- --quality-profile
- --min-success-rate
- --max-deadletter-rate
- --max-failed-rate
- --max-timed-out
- --min-retry-total
- --min-takeover-events
Key injection controls:
- --inject-invalid-system-rate, --invalid-system-name
- --inject-timeout-system-rate, --timeout-system-name, --timeout-agent-name, --timeout-agent-duration
- --inject-expired-lease-rate, --expired-lease-owner
Worker readiness and pacing controls:
- --min-ready-workers, --worker-ready-timeout
- --poll-concurrency, --poll-interval, --run-timeout
For exhaustive loadtest flags and defaults, see CLI reference.
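The gate flags mix ceilings (--max-*) and floors (--min-*). The Go sketch below illustrates how such gates could be evaluated against a run summary. It is not the orloj-loadtest implementation: the `Gates` and `Result` types, their fields, and the reading of rate gates as fractions of total tasks are all assumptions made for illustration.

```go
package main

import "fmt"

// Gates holds hypothetical values for the loadtest gate flags.
// Treating the rate gates as fractions of total tasks is an assumption.
type Gates struct {
	MinSuccessRate    float64 // --min-success-rate
	MaxDeadLetterRate float64 // --max-deadletter-rate
	MaxFailedRate     float64 // --max-failed-rate
	MaxTimedOut       int     // --max-timed-out
	MinRetryTotal     int     // --min-retry-total
	MinTakeoverEvents int     // --min-takeover-events
}

// Result is a hypothetical summary of one loadtest run.
type Result struct {
	Total, Succeeded, DeadLettered, Failed, TimedOut int
	RetryTotal, TakeoverEvents                       int
}

// failedGates returns the flag name of every gate the run did not satisfy.
func failedGates(g Gates, r Result) []string {
	if r.Total == 0 {
		return []string{"no tasks"}
	}
	rate := func(n int) float64 { return float64(n) / float64(r.Total) }
	var failed []string
	if rate(r.Succeeded) < g.MinSuccessRate {
		failed = append(failed, "--min-success-rate")
	}
	if rate(r.DeadLettered) > g.MaxDeadLetterRate {
		failed = append(failed, "--max-deadletter-rate")
	}
	if rate(r.Failed) > g.MaxFailedRate {
		failed = append(failed, "--max-failed-rate")
	}
	if r.TimedOut > g.MaxTimedOut {
		failed = append(failed, "--max-timed-out")
	}
	if r.RetryTotal < g.MinRetryTotal {
		failed = append(failed, "--min-retry-total")
	}
	if r.TakeoverEvents < g.MinTakeoverEvents {
		failed = append(failed, "--min-takeover-events")
	}
	return failed
}

func main() {
	g := Gates{MinSuccessRate: 0.95, MaxDeadLetterRate: 0.02, MaxFailedRate: 0.05,
		MaxTimedOut: 0, MinRetryTotal: 1, MinTakeoverEvents: 1}
	r := Result{Total: 200, Succeeded: 192, DeadLettered: 3, Failed: 5,
		RetryTotal: 12, TakeoverEvents: 0}
	fmt.Println(failedGates(g, r))
}
```

One plausible use of the floor gates is to confirm that an injected fault actually exercised the retry or lease-takeover path; a fault-injection run that records zero takeovers would fail the gate even if every task succeeded.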
Exit Behavior
- 0: no violations
- 2: one or more alert violations found
- 1: command/config/API failure
Default Threshold Profile
The default profile checks:
- retry storm absolute total and per-task rate
- dead-letter absolute total and dead-letter task rate
- in-flight saturation ceiling
- max p95 latency ceiling (complement with the orloj_agent_step_duration_seconds Prometheus histogram for live percentile queries)
- optional require_any_task_succeeded
Dashboard Contract
monitoring/dashboards/retry-deadletter-overview.json defines backend-agnostic panel expectations for:
- retry totals
- dead-letter totals
- dead-letter task rate
- in-flight totals
- max p95 latency