
Observability

Orloj provides built-in observability through OpenTelemetry tracing, Prometheus metrics, structured logging, and an in-app trace visualization UI. These features work out of the box in OSS deployments and integrate with standard observability backends.

Trace Visualization (Web Console)

The web console includes a Trace tab on every task detail page. It renders the TaskTraceEvent data that the runtime already records during execution.

To view a task trace:

  1. Open the web console at http://<orlojd-address>/ui/.
  2. Navigate to a task and click into its detail page.
  3. Click the Trace tab.

The trace view shows:

  • Summary bar -- total events, cumulative latency, token count, tool calls, and error count.
  • Waterfall timeline -- each row is one trace event (agent start/end, tool call, model call, error, dead-letter). The horizontal bar shows time offset from task start and duration.
  • Filters -- filter by agent or branch when the task fans out across multiple agents.
  • Expandable detail rows -- click any row to see step ID, attempt, branch, tool name, tokens, error code/reason, and the full message.

The trace data comes from GET /v1/tasks/{name} (the status.trace field). No additional backend is required -- trace events are stored alongside the task resource.
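The shape of that response is easiest to see with a short script. Below is a minimal sketch that computes the same totals the summary bar shows, run against a sample response body; the trace event field names used here (`type`, `latency_ms`, `tokens`) are illustrative assumptions, not the documented schema:

```python
import json

# Sample response body from GET /v1/tasks/{name}; the trace event field
# names (type, latency_ms, tokens) are assumptions for illustration.
body = json.loads("""
{
  "name": "my-task",
  "status": {
    "trace": [
      {"type": "agent_start", "agent": "planner"},
      {"type": "model_call", "agent": "planner", "latency_ms": 820, "tokens": 312},
      {"type": "tool_call", "agent": "planner", "tool": "search", "latency_ms": 140},
      {"type": "agent_end", "agent": "planner"}
    ]
  }
}
""")

events = body["status"]["trace"]
summary = {
    "events": len(events),
    "latency_ms": sum(e.get("latency_ms", 0) for e in events),
    "tokens": sum(e.get("tokens", 0) for e in events),
    "tool_calls": sum(1 for e in events if e["type"] == "tool_call"),
}
print(summary)  # → {'events': 4, 'latency_ms': 960, 'tokens': 312, 'tool_calls': 1}
```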

OpenTelemetry Tracing

Orloj emits OpenTelemetry spans for task execution, agent steps, and message processing. Spans are exported via OTLP gRPC to any compatible backend (Jaeger, Grafana Tempo, Datadog, Honeycomb, etc.).

Enabling OTel Export

Set the OTLP endpoint via environment variable:

export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317

Or for non-TLS backends in development:

export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
export OTEL_EXPORTER_OTLP_INSECURE=true

Both orlojd and orlojworker initialize the OTel trace provider on startup. When no endpoint is configured, a no-op provider is installed and tracing has zero overhead.

Span Hierarchy

Spans follow the task execution structure:

task.execute (root span)
├── agent.execute (one per agent step)
│   ├── model.call (model gateway invocations)
│   └── tool.execute (tool runtime calls)
└── ...

For message-driven execution, each message consumption creates a message.process span with a nested agent.execute span.

Span Attributes

All spans carry orloj.* attributes:

| Attribute | Description |
| --- | --- |
| orloj.task | Task resource name |
| orloj.system | AgentSystem resource name |
| orloj.namespace | Resource namespace |
| orloj.agent | Agent resource name |
| orloj.step_id | Step identifier (e.g. a1.s3) |
| orloj.attempt | Current attempt number |
| orloj.tokens.used | Tokens consumed by this step |
| orloj.tokens.estimated | Estimated tokens (when exact count unavailable) |
| orloj.tool_calls | Number of tool invocations |
| orloj.latency_ms | Step duration in milliseconds |
| orloj.message_id | Message ID (message-driven mode) |
| orloj.from_agent | Source agent for message handoff |
| orloj.to_agent | Destination agent for message handoff |
| orloj.branch_id | Branch ID for fan-out tracking |
| orloj.tool | Tool name |
| orloj.tool.attempt | Tool retry attempt |
| orloj.model | Model identifier |

W3C Trace Context

Orloj propagates traceparent and tracestate headers using the W3C Trace Context standard. This means external tools that support W3C propagation will automatically appear as child spans in your traces.
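The traceparent format is fixed by the W3C spec: a version byte, a 16-byte trace ID, an 8-byte parent span ID, and a flags byte, hex-encoded and dash-separated. A stdlib sketch of building a valid header by hand (useful when injecting trace context into a request from a script; any W3C-compliant tracer will accept this format):

```python
import re
import secrets

def make_traceparent() -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes  -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"  # version 00, sampled flag 01

header = make_traceparent()
# Verify the header matches the W3C Trace Context grammar.
assert re.fullmatch(r"00-[0-9a-f]{32}-[0-9a-f]{16}-01", header)
print(header)
```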

Dual Write

OTel spans are emitted in parallel with the internal Task.status.trace events. The internal trace powers the web console trace tab, while OTel spans flow to your external tracing backend. Both views are consistent.

Prometheus Metrics

Orloj exposes a standard Prometheus scrape endpoint at /metrics on the orlojd HTTP server. The endpoint is unauthenticated (like /healthz) so Prometheus can scrape it without API tokens.
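A quick sanity check is to scrape the endpoint and pick out the orloj_* series. The sketch below parses Prometheus text exposition format; the sample payload stands in for a live response from http://orlojd:8080/metrics, and the label values shown are illustrative:

```python
# Parse Prometheus text exposition format and collect orloj_* metric values.
# The sample payload stands in for a live GET /metrics response.
sample = """\
# HELP orloj_task_duration_seconds End-to-end task duration
# TYPE orloj_task_duration_seconds histogram
orloj_task_duration_seconds_count{namespace="default",system="demo",status="succeeded"} 12
orloj_inflight_messages{agent="planner"} 3
go_goroutines 42
"""

metrics = {}
for line in sample.splitlines():
    if line.startswith("#") or not line.strip():
        continue  # skip comments and blank lines
    name = line.split("{")[0].split(" ")[0]  # metric name, with or without labels
    if name.startswith("orloj_"):
        metrics[name] = float(line.rsplit(" ", 1)[1])

print(metrics)
# → {'orloj_task_duration_seconds_count': 12.0, 'orloj_inflight_messages': 3.0}
```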

Available Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| orloj_task_duration_seconds | histogram | namespace, system, status | End-to-end task duration |
| orloj_agent_step_duration_seconds | histogram | agent, step_type | Duration of a single agent step |
| orloj_tokens_used_total | counter | agent, model, type | Tokens consumed (type = used or estimated) |
| orloj_messages_total | counter | phase, agent | Message lifecycle transitions |
| orloj_deadletters_total | counter | agent | Messages moved to dead-letter |
| orloj_retries_total | counter | agent | Message retry count |
| orloj_inflight_messages | gauge | agent | Currently in-flight messages |

Prometheus Scrape Configuration

scrape_configs:
  - job_name: orloj
    static_configs:
      - targets: ['orlojd:8080']
    metrics_path: /metrics
    scrape_interval: 15s

Example Queries

Task success rate over the last hour:

sum(rate(orloj_task_duration_seconds_count{status="succeeded"}[1h]))
/
sum(rate(orloj_task_duration_seconds_count[1h]))

Token consumption by agent:

sum by (agent) (rate(orloj_tokens_used_total{type="used"}[5m]))

Dead-letter rate by agent:

sum by (agent) (rate(orloj_deadletters_total[5m]))

P95 agent step latency:

histogram_quantile(0.95, sum by (le, agent) (rate(orloj_agent_step_duration_seconds_bucket[5m])))
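These queries translate directly into alerting rules. A sketch of a Prometheus rule file built on the dead-letter query above; the alert name, threshold, duration, and labels are illustrative choices, not shipped defaults:

```yaml
groups:
  - name: orloj
    rules:
      - alert: OrlojDeadLetterSpike
        # Fire when any agent has been dead-lettering messages for 10 minutes.
        expr: sum by (agent) (rate(orloj_deadletters_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} is dead-lettering messages"
```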

Structured Logging

Both orlojd and orlojworker emit structured JSON logs by default. Log output can be configured via the ORLOJ_LOG_FORMAT environment variable.

Configuration

| Variable | Values | Default | Description |
| --- | --- | --- | --- |
| ORLOJ_LOG_FORMAT | json, text | json | Log output format. Use text for local development. |

Log Fields

All log entries include a service field (orlojd or orlojworker). When processing HTTP requests, entries also include:

  • request_id -- unique ID for the request (propagated from X-Request-ID header or auto-generated)

When OpenTelemetry is enabled, log entries from traced code paths include:

  • trace_id -- OTel trace ID for correlation with spans
  • span_id -- OTel span ID

Request ID Propagation

The HTTP server automatically generates a request ID for each incoming request and returns it in the X-Request-ID response header. If the client sends an X-Request-ID header, it is reused. This enables end-to-end request correlation across services.
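The reuse-or-generate behavior can be sketched in a few lines. This mirrors the server's propagation logic as described above; using uuid4 as the generated ID format is an assumption, since the actual generator is not documented here:

```python
import uuid

def resolve_request_id(headers: dict) -> str:
    """Reuse the client's X-Request-ID if present, otherwise generate one.
    Mirrors orlojd's propagation behavior; uuid4 format is an assumption."""
    return headers.get("X-Request-ID") or str(uuid.uuid4())

# A client-supplied ID is echoed back unchanged...
assert resolve_request_id({"X-Request-ID": "req-123"}) == "req-123"
# ...and a fresh ID is generated for each request otherwise.
assert resolve_request_id({}) != resolve_request_id({})
```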

Correlating Logs with Traces

In Grafana, you can use the trace_id field to link from a log entry directly to the corresponding trace in Tempo or Jaeger. The trace ID in logs matches the OTel trace ID in exported spans.
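Outside Grafana, the same correlation works with any JSON-aware tool. A sketch that groups sample log lines by trace_id — the trace_id, span_id, and service fields are documented above, while the msg field name is an assumption for illustration:

```python
import json
from collections import defaultdict

# Sample structured log lines as orlojd/orlojworker might emit them.
log_lines = [
    '{"service":"orlojd","trace_id":"abc123","span_id":"s1","msg":"task started"}',
    '{"service":"orlojworker","trace_id":"abc123","span_id":"s2","msg":"agent step"}',
    '{"service":"orlojd","trace_id":"def456","span_id":"s3","msg":"task started"}',
]

# Group messages by trace_id to see every service's view of one trace.
by_trace = defaultdict(list)
for line in log_lines:
    entry = json.loads(line)
    by_trace[entry["trace_id"]].append(entry["msg"])

print(dict(by_trace))
# → {'abc123': ['task started', 'agent step'], 'def456': ['task started']}
```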

Docker Compose Example

To run Orloj with Jaeger and Prometheus in a local development stack:

services:
  jaeger:
    image: jaegertracing/jaeger:2
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC
 
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
 
  orlojd:
    image: orloj:latest
    command: >
      orlojd --embedded-worker
      --model-gateway-provider=openai
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: jaeger:4317
      OTEL_EXPORTER_OTLP_INSECURE: "true"
      ORLOJ_LOG_FORMAT: json
    ports:
      - "8080:8080"

With the corresponding prometheus.yml:

scrape_configs:
  - job_name: orloj
    static_configs:
      - targets: ['orlojd:8080']
    scrape_interval: 15s

CLI Trace Inspection

For operators who prefer the CLI, orlojctl trace task prints the full trace timeline:

go run ./cmd/orlojctl trace task my-task

This is useful for quick debugging without opening the web console or an external tracing backend.

Related Docs