Monitoring and Alerts

Use orloj-alertcheck and dashboard contracts to validate runtime reliability signals.

For Prometheus metrics, OpenTelemetry tracing, structured logging, and trace visualization, see Observability.

Purpose

This guide defines repeatable checks for retry storms, dead-letter growth, and latency saturation.

Artifacts

  • Alert profile (default): monitoring/alerts/retry-deadletter-default.json
  • Alert profile (CI): monitoring/alerts/retry-deadletter-ci.json
  • Dashboard contract: monitoring/dashboards/retry-deadletter-overview.json
  • Alert check command: cmd/orloj-alertcheck

The CI profile uses a lower min_tasks floor and a higher latency ceiling to accommodate CI runner variability. It is used by the reliability job in .github/workflows/ci.yml.

Alert Check Command

go run ./cmd/orloj-alertcheck \
  --base-url=http://127.0.0.1:8080 \
  --namespace=default \
  --profile=monitoring/alerts/retry-deadletter-default.json \
  --json=true

orloj-alertcheck Flags

Flag                Default                                          Description
--base-url          http://127.0.0.1:8080                            Orloj API base URL.
--namespace         default                                          Target namespace for task queries.
--api-token         empty                                            Optional bearer token for API auth (env fallback: ORLOJ_API_TOKEN).
--profile           monitoring/alerts/retry-deadletter-default.json  Alert threshold profile JSON file.
--task-name-prefix  empty                                            Optional filter by task name prefix.
--task-system       empty                                            Optional filter by Task.spec.system.
--poll-concurrency  20                                               Concurrent task metrics fetch workers.
--timeout           2m                                               Global command timeout.
--json              true                                             Emit machine-readable JSON output.
--verbose           false                                            Emit verbose progress logs.

For authoritative defaults and full CLI context, see CLI reference.
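
As a rough sketch of how the flags above combine, the helper below assembles an orloj-alertcheck command line for a chosen profile and an optional task name prefix. The function name alertcheck_cmd and the example prefix are hypothetical; the flags and profile paths are the ones listed in this guide. The helper only prints the command, so it can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an orloj-alertcheck command line for a given
# profile file and an optional task name prefix. Nothing is executed here.
alertcheck_cmd() {
  local profile="$1"
  local prefix="${2:-}"
  local cmd="go run ./cmd/orloj-alertcheck --namespace=default --profile=${profile} --json=true"
  # Only add the filter when a prefix was supplied.
  if [ -n "$prefix" ]; then
    cmd="${cmd} --task-name-prefix=${prefix}"
  fi
  printf '%s\n' "$cmd"
}
```

For example, alertcheck_cmd monitoring/alerts/retry-deadletter-ci.json would print an invocation that uses the CI profile with no task name filter.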

Loadtest Reliability Gates and Injection Controls

Use orloj-loadtest to validate system behavior under expected and fault-injected load patterns.

Key reliability gate controls:

  • --quality-profile
  • --min-success-rate
  • --max-deadletter-rate
  • --max-failed-rate
  • --max-timed-out
  • --min-retry-total
  • --min-takeover-events

Key injection controls:

  • --inject-invalid-system-rate, --invalid-system-name
  • --inject-timeout-system-rate, --timeout-system-name, --timeout-agent-name, --timeout-agent-duration
  • --inject-expired-lease-rate, --expired-lease-owner

Worker readiness and pacing controls:

  • --min-ready-workers, --worker-ready-timeout
  • --poll-concurrency, --poll-interval, --run-timeout

For exhaustive loadtest flags and defaults, see CLI reference.

Exit Behavior

  • 0: no violations
  • 2: one or more alert violations found
  • 1: command/config/API failure
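
A wrapper script may want to treat threshold violations differently from tool failures. The sketch below maps the three exit codes listed above to short labels, assuming they are the codes returned by orloj-alertcheck. The function name classify_alertcheck_exit and the labels are hypothetical, not part of the tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an exit code to a short label, following the
# exit behavior listed above (0 = no violations, 2 = alert violations,
# 1 = command/config/API failure). Any other code is reported as unknown.
classify_alertcheck_exit() {
  case "$1" in
    0) echo "ok" ;;
    2) echo "violations" ;;
    1) echo "error" ;;
    *) echo "unknown" ;;
  esac
}

# Example use in a CI step (left commented so the sketch has no side effects):
#   go run ./cmd/orloj-alertcheck --profile=monitoring/alerts/retry-deadletter-ci.json
#   status="$(classify_alertcheck_exit "$?")"
```

Keeping "violations" and "error" apart lets a pipeline fail on breached thresholds while retrying or flagging infrastructure problems separately.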

Default Threshold Profile

The default profile checks:

  • retry storm absolute total and per-task rate
  • dead-letter absolute total and dead-letter task rate
  • in-flight saturation ceiling
  • max p95 latency ceiling (for live percentile queries, complement this check with the orloj_agent_step_duration_seconds Prometheus histogram)
  • optional require_any_task_succeeded

Dashboard Contract

monitoring/dashboards/retry-deadletter-overview.json defines backend-agnostic panel expectations for:

  • retry totals
  • dead-letter totals
  • dead-letter task rate
  • in-flight totals
  • max p95 latency