# Orloj Docs

> Runtime, governance, and orchestration for agent systems.

## API Reference

> **Stability: beta** -- This API surface ships with `orloj.dev/v1` and is suitable for production use, but may evolve with migration guidance in future minor releases.

This page summarizes key HTTP endpoints and behavior contracts.

### Resource CRUD

`/v1/{resource}` supports list/create and `/v1/{resource}/{name}` supports get/update/delete for:

* agents
* agent-systems
* model-endpoints
* tools
* secrets
* memories
* agent-policies
* agent-roles
* tool-permissions
* tasks
* task-schedules
* task-webhooks
* workers

Namespace defaults to `default` and can be overridden with `?namespace={namespace}`.

### Capabilities

* `GET /v1/capabilities`
  * returns deployment capability flags for feature discovery in UI/CLI integrations
  * extension providers may add capabilities without changing core API shape

### Authentication Endpoints

* `GET /v1/auth/config`
  * returns auth mode and login/setup requirements
* `POST /v1/auth/setup`
  * one-time native admin bootstrap when auth mode is `native`
* `POST /v1/auth/login`
  * local username/password login; sets session cookie
* `POST /v1/auth/logout`
  * clears local session cookie
* `GET /v1/auth/me`
  * returns current auth state for UI bootstrap
* `POST /v1/auth/admin/reset-password`
  * admin-authenticated local password rotation endpoint

### Status and Logs

* `GET|PUT /v1/{resource}/{name}/status`
* `GET /v1/agents/{name}/logs`
* `GET /v1/tasks/{name}/logs`

### Watches and Events

* `GET /v1/agents/watch`
* `GET /v1/tasks/watch`
* `GET /v1/task-schedules/watch`
* `GET /v1/task-webhooks/watch`
* `GET /v1/events/watch`

### Webhook Delivery

* `POST /v1/webhook-deliveries/{endpoint_id}`
  * public ingress for `TaskWebhook` delivery
  * returns `202 Accepted` for accepted or duplicate deliveries
  * relies on webhook auth configuration for signature and idempotency validation

#### Signature Profiles

* `generic`
  * signature: HMAC-SHA256 over `timestamp + "." + rawBody`
  * headers: `X-Signature: sha256=<hex>`, `X-Timestamp`, `X-Event-Id`
* `github`
  * signature: HMAC-SHA256 over raw body
  * headers: `X-Hub-Signature-256: sha256=<hex>`, `X-GitHub-Delivery`

Both profiles support replay protection through timestamp skew and/or event-id dedupe checks.

### Memory Entries

* `GET /v1/memories/{name}/entries`
  * query parameters:
    * `q` (string): search query. When provided, searches entries by keyword match (or vector similarity if the backend supports it).
    * `prefix` (string): filter entries by key prefix. Ignored when `q` is set.
    * `limit` (int): maximum number of entries to return. Defaults to `100`.
    * `namespace` (string): resource namespace. Defaults to `default`.
  * returns `{"entries": [{"key": "...", "value": "...", "score": 0.95}], "count": N}`
  * returns `404` if the Memory resource does not exist
  * returns an empty list if no persistent backend is registered for the Memory resource

### Task Observability Endpoints

* `GET /v1/tasks/{name}/messages`
  * filters: `phase`, `from_agent`, `to_agent`, `branch_id`, `trace_id`, `limit`
* `GET /v1/tasks/{name}/metrics`
  * includes totals and `per_agent`/`per_edge` rollups

`Task.status.trace[]` may include normalized tool metadata:

* `tool_contract_version`
* `tool_request_id`
* `tool_attempt`
* `error_code`
* `error_reason`
* `retryable`

### Request and Response Examples

#### Create a Resource

```
POST /v1/agents
Content-Type: application/json
```

```json
{
  "apiVersion": "orloj.dev/v1",
  "kind": "Agent",
  "metadata": { "name": "research-agent", "namespace": "default" },
  "spec": {
    "model_ref": "openai-default",
    "prompt": "You are a research assistant.",
    "tools": ["web_search"],
    "limits": { "max_steps": 6, "timeout": "30s" }
  }
}
```

Response (`201 Created`):

```json
{
  "apiVersion": "orloj.dev/v1",
  "kind": "Agent",
  "metadata": { "name": "research-agent", "namespace": "default", "resourceVersion": "1" },
  "spec": { "...": "..."
  },
  "status": { "phase": "Pending" }
}
```

#### Get a Resource

```
GET /v1/agents/research-agent?namespace=default
```

Returns the full resource including `metadata`, `spec`, and `status`.

#### Update a Resource

```
PUT /v1/agents/research-agent
Content-Type: application/json
If-Match: "1"
```

The request body must include the full resource. The `resourceVersion` (or `If-Match` header) must match the current version. Stale updates return `409 Conflict`.

#### Delete a Resource

```
DELETE /v1/agents/research-agent?namespace=default
```

Returns `200 OK` on success.

#### List Resources

```
GET /v1/agents?namespace=default
```

Returns an array of all resources of that type in the specified namespace.

##### Pagination

All list endpoints support cursor-based pagination via query parameters:

| Parameter   | Description |
| ----------- | ----------- |
| `limit`     | Maximum number of items to return (1–1000). |
| `after`     | Cursor token -- the `metadata.name` of the last item from the previous page. Returns items with names lexicographically after this value. |
| `namespace` | Filter by namespace. |

When more results are available, the response includes a `continue` field:

```json
{
  "continue": "task-00042",
  "items": [ ... ]
}
```

Pass `continue` as the `after` parameter in the next request to fetch the next page. When `continue` is absent or empty, there are no more results.

Offset-based pagination (`?offset=N`) is supported for backward compatibility on the Tasks endpoint but is deprecated in favor of `?after=`.

#### Watch Resources

```
GET /v1/agents/watch
```

Returns a server-sent event stream of resource changes. Events include the resource kind, name, and the change type (created, updated, deleted).
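A client can drain a paginated list endpoint by feeding each response's `continue` value back as the next request's `after` parameter. The following Go sketch illustrates that loop; the `page` struct and the stubbed `fetch` function are illustrative stand-ins, not part of any Orloj client library:

```go
package main

import "fmt"

// page mirrors the list response shape described above: a slice of
// items plus an optional continue cursor. Field names are illustrative.
type page struct {
	Items    []string
	Continue string
}

// collectAll drains a paginated list endpoint by passing each
// response's continue cursor back as the next `after` parameter,
// stopping when the cursor is absent or empty.
func collectAll(fetch func(after string) (page, error)) ([]string, error) {
	var all []string
	after := ""
	for {
		p, err := fetch(after)
		if err != nil {
			return nil, err
		}
		all = append(all, p.Items...)
		if p.Continue == "" {
			return all, nil // no more results
		}
		after = p.Continue
	}
}

func main() {
	// Stub fetch simulating two pages of tasks from GET /v1/tasks.
	pages := map[string]page{
		"":           {Items: []string{"task-00001", "task-00002"}, Continue: "task-00002"},
		"task-00002": {Items: []string{"task-00003"}},
	}
	fetch := func(after string) (page, error) { return pages[after], nil }
	all, _ := collectAll(fetch)
	fmt.Println(all) // prints: [task-00001 task-00002 task-00003]
}
```

In a real client, `fetch` would issue `GET /v1/tasks?after=...&limit=...` and decode the JSON body shown above.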
### Concurrency Semantics

* `PUT` requires `metadata.resourceVersion` or `If-Match`
* stale updates return `409 Conflict`

### Related Docs

* [Resource Reference](./resources.md)
* [CLI Reference](./cli.md)
* [Tool Contract v1](./tool-contract-v1.md)
* [Glossary](./glossary.md)

## CLI Reference

This page documents command-line interfaces for operating Orloj.

### Binaries

* `orlojctl`: resource CRUD, observability views, event stream tooling
* `orlojd`: API server, controllers, scheduler, optional embedded worker
* `orlojworker`: task worker and optional runtime inbox consumer
* `orloj-loadtest`: reliability/load harness
* `orloj-alertcheck`: alert profile evaluator

### `orlojctl`

Usage patterns:

```text
orlojctl apply -f <file>
orlojctl create secret <name> --from-literal key=value [...]
orlojctl get <kind> [<name>] [-w]
orlojctl delete <kind> <name>
orlojctl run --system <name> [key=value ...]
orlojctl init --blueprint pipeline|hierarchical|swarm-loop [--name <name>] [--dir <dir>]
orlojctl logs <agent>|task/<task>
orlojctl trace task <name>
orlojctl graph system|task <name>
orlojctl events [filters...]
orlojctl admin reset-password --new-password <password> [--username <name>]
orlojctl config path|get|use <profile>|set-profile <profile> [--server URL] [--token value] [--token-env NAME]
```

#### Remote API and authentication

For **hosted / remote** control planes: operator-generated bearer tokens, env defaults, `orlojctl config` profiles, when `config.json` is created, and full precedence rules -- see **[Remote CLI and API access](../deployment/remote-cli-access.md)**.

Quick reference:

* **Token order:** `--api-token`, then `ORLOJCTL_API_TOKEN`, then `ORLOJ_API_TOKEN`, then active profile `token` / `token_env`.
* **Default `--server`:** `ORLOJCTL_SERVER`, `ORLOJ_SERVER`, active profile `server`, else `http://127.0.0.1:8080`.
* Token generation and server configuration: [Control plane API tokens](../operations/security.md#control-plane-api-tokens).
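The token precedence above can be expressed as a small resolution function. This Go sketch is illustrative only -- the `profileAuth` type stands in for the active profile's `token` / `token_env` fields in `config.json`, and the function is not Orloj source code:

```go
package main

import "fmt"

// profileAuth is an illustrative stand-in for the active profile's
// token settings (`token` / `token_env`).
type profileAuth struct {
	Token    string
	TokenEnv string
}

// resolveToken applies the documented precedence: the --api-token flag,
// then ORLOJCTL_API_TOKEN, then ORLOJ_API_TOKEN, then the active
// profile's token (or the env var named by token_env).
func resolveToken(flagToken string, getenv func(string) string, prof profileAuth) string {
	if flagToken != "" {
		return flagToken
	}
	if v := getenv("ORLOJCTL_API_TOKEN"); v != "" {
		return v
	}
	if v := getenv("ORLOJ_API_TOKEN"); v != "" {
		return v
	}
	if prof.Token != "" {
		return prof.Token
	}
	if prof.TokenEnv != "" {
		return getenv(prof.TokenEnv)
	}
	return ""
}

func main() {
	// With no flag set, an exported ORLOJ_API_TOKEN wins over the profile.
	env := map[string]string{"ORLOJ_API_TOKEN": "env-token"}
	getenv := func(k string) string { return env[k] }
	fmt.Println(resolveToken("", getenv, profileAuth{Token: "profile-token"})) // prints: env-token
}
```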
```bash
orlojctl config path
orlojctl config set-profile production --server https://orloj.example.com --token-env ORLOJ_PROD_TOKEN
orlojctl config use production
```

#### `orlojctl create secret`

Imperative secret creation without a YAML file. Builds a `Secret` resource from `--from-literal` flags and applies it to the server.

```bash
orlojctl create secret openai-api-key --from-literal value=sk-proj-abc123
```

Multiple keys:

```bash
orlojctl create secret provider-keys \
  --from-literal openai=sk-proj-abc123 \
  --from-literal anthropic=sk-ant-xyz789
```

Flags:

* `--from-literal` (required, repeatable): `key=value` pair to include in the secret.
* `--namespace` (default `default`): namespace for the secret.
* `--server` (same default resolution as [Remote CLI and API access](../deployment/remote-cli-access.md#precedence)): Orloj server URL.

Values are automatically base64-encoded via `stringData` semantics. The plaintext never touches disk.

#### `orlojctl run`

Imperative task execution. Creates a task targeting the specified AgentSystem, polls until completion, and prints the result.

```bash
orlojctl run --system report-system topic="AI copilots" depth=detailed
```

Flags:

* `--system` (required): AgentSystem to execute.
* `--namespace` (default `default`): namespace for the task.
* `--poll` (default `2s`): status polling interval.
* `--timeout` (default `5m`): maximum wait time.

Positional arguments after flags are parsed as `key=value` input pairs.

#### `orlojctl init`

Scaffold a new agent system from a blueprint template. Generates agent manifests, an agent-system graph, and a task file in the target directory.

```bash
orlojctl init --blueprint pipeline --name my-project --dir ./agents
```

Flags:

* `--blueprint` (required): Blueprint to scaffold (`pipeline`, `hierarchical`, `swarm-loop`).
* `--name` (default: blueprint name): Prefix for generated resource names.
* `--dir` (default: `.`): Output directory. Creates an `agents/` subdirectory for agent manifests.
Generated files:

* `agents/<agent>_agent.yaml` -- one per agent in the topology
* `agent-system.yaml` -- the agent graph with edges
* `task.yaml` -- a starter task targeting the system

Supported resources:

* `agents`
* `agent-systems`
* `model-endpoints`
* `tools`
* `secrets`
* `memories`
* `agent-policies`
* `agent-roles`
* `tool-permissions`
* `tasks`
* `task-schedules`
* `task-webhooks`
* `workers`

Common flags:

* `--server` (default: `ORLOJCTL_SERVER`, `ORLOJ_SERVER`, active profile, then `http://127.0.0.1:8080`)
* `--api-token` (global; also via `ORLOJCTL_API_TOKEN`, `ORLOJ_API_TOKEN`, or profile `token` / `token_env`)
* `--namespace` on namespaced operations
* `-w` for watch mode on supported `get` commands
* `events` filters: `--source`, `--type`, `--kind`, `--name`, `--namespace`, `--since`, `--once`, `--timeout`, `--raw`

### `orlojd`

Print full flags:

```bash
go run ./cmd/orlojd -h
```

Critical flags:

* `--addr`
* `--auth-mode` (`off|native|sso`; OSS supports `off` and `native`)
* `--auth-session-ttl`
* `--api-key` (enable bearer token auth; env fallback: `ORLOJ_API_TOKEN`)
* `--auth-reset-admin-username`, `--auth-reset-admin-password` (one-shot local admin password reset and exit)
* `--secret-encryption-key` (256-bit AES key for encrypting Secret data at rest; env fallback: `ORLOJ_SECRET_ENCRYPTION_KEY`)
* `--storage-backend` (`memory|postgres`)
* `--postgres-dsn`
* `--task-execution-mode` (`sequential|message-driven`)
* `--embedded-worker` (run a built-in worker in the server process)
* `--embedded-worker-max-concurrent-tasks` (capacity registered for the embedded worker; env `ORLOJ_EMBEDDED_WORKER_MAX_CONCURRENT_TASKS`; default `1`, same idea as `orlojworker --max-concurrent-tasks`)
* `--event-bus-backend` (`memory|nats`)
* `--agent-message-bus-backend` (`none|memory|nats-jetstream`)
* `--model-gateway-provider` (`mock|openai|anthropic|azure-openai|ollama`)
* `--tool-isolation-backend` (`none|container|wasm`)

### `orlojworker`

Print full flags:

```bash
go run ./cmd/orlojworker -h
```

Critical flags:

* `--worker-id`
* `--secret-encryption-key` (must match the key used by `orlojd`)
* `--storage-backend` (`memory|postgres`)
* `--postgres-dsn`
* `--task-execution-mode` (`sequential|message-driven`)
* `--agent-message-consume`
* `--agent-message-bus-backend` (`none|memory|nats-jetstream`)
* `--model-gateway-provider`
* `--tool-isolation-backend`

### `orloj-loadtest`

Print full flags:

```bash
go run ./cmd/orloj-loadtest -h
```

Primary controls:

* `--tasks`
* `--create-concurrency`
* `--poll-concurrency`
* `--quality-profile`
* `--inject-invalid-system-rate`
* `--inject-timeout-system-rate`
* `--inject-expired-lease-rate`

### `orloj-alertcheck`

Print full flags:

```bash
go run ./cmd/orloj-alertcheck -h
```

Primary controls:

* `--profile`
* `--namespace`
* `--task-system`
* `--task-name-prefix`
* `--api-token`

### Command Discovery

Use binary help output as the authoritative source for your current build:

```bash
go run ./cmd/orlojctl help
go run ./cmd/orlojd -h
go run ./cmd/orlojworker -h
```

## Extension Contracts

> **Stability: beta** -- Extension interfaces are functional and in use, but may evolve additively in future releases.

Orloj core exposes optional extension hooks for additive cloud and enterprise integrations. Extensions allow commercial or custom features to plug into the runtime without modifying the open-source core.

### Design Rules

* OSS defaults remain functional with no extension hooks configured.
* Extension behavior is additive and must not alter baseline OSS semantics.
* Consumers should target public interfaces instead of patching core internals.

### Runtime Interfaces

#### MeteringSink

Records usage and billing events for external consumption.
```go
type MeteringSink interface {
	RecordMetering(ctx context.Context, event MeteringEvent) error
}

type MeteringEvent struct {
	Timestamp  time.Time
	Namespace  string
	Task       string
	Agent      string
	Model      string
	TokensIn   int
	TokensOut  int
	ToolCalls  int
	DurationMs int64
}
```

Use cases: usage-based billing, cost attribution per team/system, token consumption dashboards.

#### AuditSink

Records audit events for compliance and observability pipelines.

```go
type AuditSink interface {
	RecordAudit(ctx context.Context, event AuditEvent) error
}

type AuditEvent struct {
	Timestamp time.Time
	Action    string // "tool_invoke", "policy_deny", "task_create", etc.
	Actor     string // agent or user identifier
	Resource  string // resource kind and name
	Namespace string
	Outcome   string // "allowed", "denied", "error"
	Details   map[string]string
}
```

Use cases: compliance logging, security audits, governance event streams.

#### CapabilityProvider

Exposes deployment capabilities for feature discovery in UI and CLI integrations.

```go
type CapabilityProvider interface {
	Capabilities(ctx context.Context) (CapabilitySnapshot, error)
}

type CapabilitySnapshot struct {
	Features map[string]bool
	Metadata map[string]string
}
```

The snapshot is served at `GET /v1/capabilities`. Extension providers may add capabilities without changing the core API shape. The UI and CLI use this endpoint to enable or disable features based on what the deployment supports.

### Implementing an Extension

1. Implement one or more of the interfaces above.
2. Register the implementation with the runtime at startup (via configuration or plugin loading).
3. The runtime calls your implementation at the appropriate hook points during execution.

Extensions run in-process with the server or worker. They should be fast and non-blocking -- the runtime does not isolate extension failures from core execution.

### Compatibility Expectations

* Interfaces evolve additively by default.
* Breaking changes require versioning and migration guidance.
* Compatibility checks should run against pinned consumer references before release.

### Related Docs

* [Observability](../operations/observability.md) -- OSS tracing, metrics, and logging

## Glossary

Canonical definitions for terms used throughout Orloj documentation.

### A

**Agent**
A declarative unit of work backed by a language model. Defined as a resource with a prompt, model configuration, tool bindings, role assignments, and execution limits. See [Agents and Agent Systems](../concepts/agents-and-systems.md).

**Agent System**
A composition of multiple agents wired into a directed graph. The graph defines how messages flow between agents during task execution. Supports pipeline, hierarchical, and swarm-loop topologies. See [Agents and Agent Systems](../concepts/agents-and-systems.md).

**Agent Policy**
A governance resource that constrains agent execution. Can restrict allowed models, block specific tools, and cap token usage. Policies may be scoped to specific systems/tasks or applied globally. See [Governance and Policies](../concepts/governance.md).

**Agent Role**
A named set of permission strings that can be bound to agents. Agents accumulate the union of permissions from all bound roles. See [Governance and Policies](../concepts/governance.md).

### B

**Blueprint**
A ready-to-use template combining agents, an agent system, and a task for a specific orchestration pattern (pipeline, hierarchical, or swarm-loop). Available in `examples/blueprints/`. See [Starter Blueprints](../architecture/starter-blueprints.md).

### C

**Server**
The management layer of Orloj, running as `orlojd`. Includes the API server, resource store, background services, and task scheduler. See [Architecture Overview](../architecture/overview.md).

**Resource Definition**
A typed, declarative schema. Orloj defines 13 resource types with standard `apiVersion`, `kind`, `metadata`, `spec`, and `status` fields. See [Resource Reference](./resources.md).
### D

**Dead Letter**
A terminal state for tasks or messages that have exhausted all retry attempts. Dead-lettered items require manual investigation. Tasks transition `Failed -> DeadLetter` after all retries are consumed.

### E

**Edge**
A directional connection between two agents in an AgentSystem graph. Edges define message routing. The `edges[]` field supports fan-out (multiple targets) and metadata annotations via labels and policy.

### F

**Fan-in**
A graph pattern where multiple upstream branches converge on a single downstream node. Controlled by join gates with `wait_for_all` or `quorum` modes. See [Execution and Messaging](../architecture/execution-model.md).

**Fan-out**
A graph pattern where a single node routes messages to multiple downstream targets simultaneously.

### G

**Governance**
The authorization and policy enforcement layer. Composed of AgentPolicy, AgentRole, and ToolPermission resources. Governance is fail-closed: unauthorized actions are denied, not silently ignored. See [Governance and Policies](../concepts/governance.md).

### J

**Join Gate**
A fan-in mechanism on an AgentSystem graph node. Modes: `wait_for_all` (wait for every upstream branch) or `quorum` (wait for a count/percentage). Configurable failure policy: `deadletter`, `skip`, or `continue_partial`.

### L

**Lease**
A time-bounded claim on a task held by a worker. Workers renew leases via heartbeats during execution. If a lease expires (worker crash, network partition), another worker may safely take over the task.

### M

**Memory**
A resource that configures a persistent memory backend for agents. Agents attach a Memory resource via `spec.memory.ref`, and may explicitly grant built-in memory operations with `spec.memory.allow` (`read`, `write`, `search`, `list`, `ingest`). Configured with a type, provider (e.g. `in-memory`, `pgvector`), and optional embedding model.
Memory operates in three layers: conversation history (per-activation), task-scoped shared store (per-task), and persistent backends (cross-task). See [Memory](../concepts/memory/index.md).

**Memory Tool**
One of five built-in runtime tools that can be exposed when an agent both references a Memory resource and explicitly allows the corresponding operation: `memory.read`, `memory.write`, `memory.search`, `memory.list`, and `memory.ingest`. These are handled internally by the runtime without network calls. See [Memory](../concepts/memory/index.md).

**Message Bus**
The transport layer for agent-to-agent communication within a task. Implementations: `memory` (in-process) and `nats-jetstream` (durable). Messages carry lifecycle phase, retry state, and routing metadata.

**Model Endpoint**
A resource that configures a connection to a model provider. Declares the provider type, base URL, default model, provider-specific options, and auth credentials. Agents reference endpoints by name via `model_ref`. See [Model Routing](../concepts/model-routing.md).

**Model Gateway**
The worker component that routes model requests to the appropriate provider based on agent configuration. Handles provider-specific request formatting and response parsing.

### N

**Namespace**
A scope for resource names. Defaults to `default`. Resources can reference cross-namespace targets using `namespace/name` syntax.

### R

**Reconciliation**
The process by which background services observe the current state of a resource and take actions to move it toward the desired state declared in `spec`.

### S

**Secret**
A resource for storing sensitive values (API keys, tokens). `stringData` values are base64-encoded into `data` during normalization and then cleared. The runtime reads from `data` at execution time.

### T

**Task**
A request to execute an AgentSystem with specific input. Tasks move through phases: `Pending -> Running -> Succeeded | Failed | DeadLetter`.
See [Tasks and Scheduling](../concepts/tasks-and-scheduling.md).

**Task Schedule**
A resource that creates tasks on a cron-based schedule from a template task. Supports timezone configuration, concurrency policy, and history limits. See [Tasks and Scheduling](../concepts/tasks-and-scheduling.md).

**Task Webhook**
A resource that creates tasks in response to external HTTP events. Supports signature verification (generic and GitHub profiles) and idempotency-based deduplication. See [Tasks and Scheduling](../concepts/tasks-and-scheduling.md).

**Tool**
An external capability that agents can invoke during execution. Defined as a resource with endpoint, auth, risk level, and runtime configuration (isolation, timeout, retry). See [Tools and Isolation](../concepts/tools-and-isolation.md).

**Tool Contract v1**
The standardized JSON request/response envelope that all tools must implement. Defines the error taxonomy (`tool_code`, `tool_reason`, `retryable`) used by the runtime for retry decisions. See [Tool Contract v1](./tool-contract-v1.md).

**Tool Permission**
A governance resource that defines what permissions are required to invoke a specific tool. Checked against the agent's accumulated role permissions at execution time. See [Governance and Policies](../concepts/governance.md).

### W

**Worker**
An execution unit that claims and runs tasks. Workers register capabilities (region, GPU, supported models) and the scheduler uses these for task matching. Runs as `orlojworker`. See [Tasks and Scheduling](../concepts/tasks-and-scheduling.md) and [Architecture Overview](../architecture/overview.md).

## Reference

Detailed contracts and schemas for API consumers, runtime integrators, and platform engineers.
### Core Interfaces

* [CLI Reference](./cli.md)
* [API Reference](./api.md)
* [Resource Reference](./resources.md)
* [Extension Contracts](./extensions.md)

### Runtime Contracts

* [Tool Contract v1](./tool-contract-v1.md)
* [WASM Tool Module Contract v1](./wasm-tool-module-contract-v1.md)

### Related Operator Guidance

* [Tool Runtime Conformance](../operations/tool-runtime-conformance.md)

## Resource Reference

> **Stability: beta** -- All resource kinds under `orloj.dev/v1` are suitable for production use, but their schemas may evolve with migration guidance in future minor releases.

This document describes the current resource schemas in `orloj.dev/v1`, based on the runtime types and normalization logic in:

* `resources/agent.go`
* `resources/model_endpoint.go`
* `resources/resource_types.go`
* `resources/graph.go`

### Common Conventions

* Every resource uses standard top-level fields: `apiVersion`, `kind`, `metadata`, `spec`, `status`.
* `metadata.name` is required for all resources.
* `metadata.namespace` defaults to `default` when omitted.
* Most resources default `status.phase` to `Pending` during normalization.

### Resource Kinds

* `Agent`
* `AgentSystem`
* `ModelEndpoint`
* `Tool`
* `Secret`
* `Memory`
* `AgentPolicy`
* `AgentRole`
* `ToolPermission`
* `ToolApproval`
* `Task`
* `TaskSchedule`
* `TaskWebhook`
* `Worker`

### Agent

#### `spec`

* `model_ref` (string): reference to a `ModelEndpoint` (`name` or `namespace/name`).
* `prompt` (string): agent instruction prompt.
* `tools` (\[]string): tool names available to the agent.
* `allowed_tools` (\[]string): tools pre-authorized without RBAC. Bypasses AgentRole/ToolPermission checks for listed tools.
* `roles` (\[]string): bound `AgentRole` names.
* `memory` (object):
  * `ref` (string): reference to a `Memory` resource. This attaches the memory backend to the agent. See [Memory](../concepts/memory/index.md).
  * `allow` (\[]string): explicit built-in memory operations allowed for the agent: `read`, `write`, `search`, `list`, `ingest`.
  * `type` (string)
  * `provider` (string)
* `limits` (object):
  * `max_steps` (int)
  * `timeout` (string duration)
* `execution` (object): optional per-agent execution contract.
  * `profile` (string): `dynamic` (default) or `contract`.
  * `tool_sequence` (\[]string): required tool names when `profile=contract`. Tracked as a set (order-independent).
  * `required_output_markers` (\[]string): strings that should appear in final model output when `profile=contract`. Treated as best-effort: missing markers at `max_steps` produce a warning, not a hard failure, when all tools completed.
  * `duplicate_tool_call_policy` (string): `short_circuit` (default) or `deny`. In `short_circuit` mode, duplicate tool calls reuse cached results and inject a completion hint. This applies to **all profiles**, not just `contract`.
  * `on_contract_violation` (string): `observe` or `non_retryable_error` (default). In `observe` mode, violations are logged as telemetry events but do not stop execution or deadletter the task.
  * `tool_use_behavior` (string): Controls what happens after a tool call succeeds. See [Tool Use Behavior](#tool-use-behavior) below.

##### Tool Use Behavior

The `tool_use_behavior` field controls whether the model gets another turn after a successful tool call. This is the primary lever for optimizing token usage in tool-calling agents.

| Value | Model calls | When to use |
| --- | --- | --- |
| `run_llm_again` (default) | Tool call + follow-up model call to process the result | The agent needs to **interpret, format, or synthesize** the tool output before handing off. Most agents need this. |
| `stop_on_first_tool` | Tool call only -- tool output becomes the agent's final output directly | The agent is a **relay** that calls a tool and passes raw data to the next agent in the pipeline. No interpretation needed. |

**Example: `run_llm_again` (default)**

An analyst agent calls an API tool, then needs to produce labeled output from the raw response:

```yaml
kind: Agent
metadata:
  name: analyst-agent
spec:
  prompt: "Call the API, then return SUMMARY: and EVIDENCE: labels."
  tools:
    - external-api-tool
  # tool_use_behavior defaults to run_llm_again -- agent gets a
  # second model call to read the tool result and produce labels.
```

Step 1: model calls `external-api-tool` → Step 2: model reads tool result, produces labeled output → done (2 model calls).

**Example: `stop_on_first_tool`**

A fetcher agent's only job is to call a tool and pass the raw result downstream:

```yaml
kind: Agent
metadata:
  name: fetcher-agent
spec:
  prompt: "Fetch the latest data from the API."
  tools:
    - external-api-tool
  execution:
    tool_use_behavior: stop_on_first_tool
    # Agent exits immediately after the tool returns.
    # Raw tool output becomes the agent's output -- no extra model call.
```

Step 1: model calls `external-api-tool` → done (1 model call). The next agent in the pipeline receives the raw tool response as context.

**When NOT to use `stop_on_first_tool`:**

* The agent needs to produce structured/labeled output from the tool result.
* The agent has multiple tools and may need to call more than one.
* The agent needs to reason about the tool result before responding.

#### Defaults and Validation

* `model_ref` is required.
* `roles` are trimmed and deduplicated (case-insensitive).
* `memory.allow` is trimmed, normalized, and deduplicated. It requires `memory.ref`.
* `limits.max_steps` defaults to `10` when `<= 0`.
* `execution.profile` defaults to `dynamic`.
* `execution.duplicate_tool_call_policy` defaults to `short_circuit`. Applies to all profiles.
* `execution.on_contract_violation` defaults to `non_retryable_error`. Set to `observe` for safe production rollout.
* `execution.tool_use_behavior` defaults to `run_llm_again`.
* `execution.tool_sequence` and `execution.required_output_markers` are trimmed and deduplicated.
* When `execution.profile=contract`, `execution.tool_sequence` is required.
* Tool sequence is tracked as a set: tools may be called in any order.
* When all tools in `tool_sequence` complete but `required_output_markers` are not satisfied at `max_steps`, the task completes with a `contract_warning` event instead of deadlettering.

**Structured tool protocol:** Tool results are sent to the model using the provider's native structured tool calling protocol (OpenAI `role: "tool"` with `tool_call_id`, Anthropic `tool_result` content blocks). This gives the model structured evidence that a tool was already called, preventing unnecessary repeat calls.

**Scaling ladder for cost control:**

1. `profile: dynamic` (default): structured tool protocol prevents repeat calls. Succeeded tools are filtered from the available tools list. No YAML changes needed.
2. `tool_use_behavior: stop_on_first_tool`: for pipeline stages that pass raw data, eliminates all extra model calls (1 model call + 1 tool call total).
3. `profile: contract` + `on_contract_violation: observe`: adds guaranteed early completion when all tools succeed plus telemetry for contract deviations.
4. `profile: contract` + `on_contract_violation: non_retryable_error`: hard enforcement for critical pipeline stages. Violations deadletter the task.

#### `status`

* `phase`, `lastError`, `observedGeneration`

Example: `examples/resources/agents/*.yaml`

### AgentSystem

#### `spec`

* `agents` (\[]string): participating agent names.
* `graph` (map\[string]GraphEdge): per-node routing. `GraphEdge` fields:
  * `next` (string): legacy single-hop route.
  * `edges` (\[]GraphRoute): fan-out routes.
    * `to` (string)
    * `labels` (map\[string]string)
    * `policy` (map\[string]string)
  * `join` (GraphJoin): fan-in behavior.
    * `mode`: `wait_for_all` or `quorum`
    * `quorum_count` (int, >= 0)
    * `quorum_percent` (int, 0-100)
    * `on_failure`: `deadletter`, `skip`, `continue_partial`

#### Defaults and Validation

* `graph[*].next` and `graph[*].edges[].to` are trimmed.
* Route targets are normalized/deduplicated for execution.
* `join` normalization defaults:
  * `mode` -> `wait_for_all`
  * `on_failure` -> `deadletter`
  * `quorum_percent` clamped to `0..100`
  * invalid values are coerced to safe defaults in graph normalization.
* Runtime task validation additionally checks:
  * graph nodes/edges must reference agents in `spec.agents`
  * cyclic graphs require `Task.spec.max_turns > 0`
  * non-cyclic graphs require at least one entrypoint (zero indegree node)

#### `status`

* `phase`, `lastError`, `observedGeneration`

Example: `examples/resources/agent-systems/*.yaml`

### ModelEndpoint

#### `spec`

* `provider` (string): provider id (`openai`, `anthropic`, `azure-openai`, `ollama`, `mock`, or registry-added providers).
* `base_url` (string)
* `default_model` (string)
* `options` (map\[string]string): provider-specific options.
* `auth.secretRef` (string): namespaced reference to a `Secret`.

#### Defaults and Validation

* `provider` defaults to `openai` and is normalized to lowercase.
* `base_url` defaults by provider:
  * `openai` -> `https://api.openai.com/v1`
  * `anthropic` -> `https://api.anthropic.com/v1`
  * `ollama` -> `http://127.0.0.1:11434`
* `options` keys are normalized to lowercase; keys/values are trimmed.

#### `status`

* `phase`, `lastError`, `observedGeneration`

Example: `examples/resources/model-endpoints/*.yaml`

### Tool

#### `spec`

* `type` (string): tool type. Allowed values: `http`, `external`, `grpc`, `webhook-callback`, `queue`, `mcp`. Unknown values are rejected at apply time.
* `endpoint` (string): tool endpoint URL (or `host:port` for gRPC).
* `description` (string): human-readable description of the tool. Passed to model gateways for richer tool definitions. Auto-populated for MCP-generated tools. * `input_schema` (object): JSON Schema for tool parameters. Passed to model gateways for structured parameter definitions. Auto-populated for MCP-generated tools. * `mcp_server_ref` (string): name of the McpServer that provides this tool. Required when `type=mcp`. * `mcp_tool_name` (string): the tool name as reported by the MCP server's `tools/list`. Required when `type=mcp`. * `capabilities` (\[]string): declared operations. * `operation_classes` (\[]string): operation class annotations. Allowed values: `read`, `write`, `delete`, `admin`. Used by `ToolPermission.operation_rules` for per-class policy verdicts. * `risk_level` (string): `low`, `medium`, `high`, `critical`. * `runtime` (object): * `timeout` (duration string) * `isolation_mode`: `none`, `sandboxed`, `container`, `wasm` * `retry.max_attempts` (int) * `retry.backoff` (duration string) * `retry.max_backoff` (duration string) * `retry.jitter`: `none`, `full`, `equal` * `auth` (object): * `profile` (string): auth profile. Allowed values: `bearer`, `api_key_header`, `basic`, `oauth2_client_credentials`. Defaults to `bearer` when `secretRef` is set. * `secretRef` (string): namespaced secret reference. Required when `profile` is set. * `headerName` (string): custom header name. Required when `profile=api_key_header`. * `tokenURL` (string): OAuth2 token endpoint. Required when `profile=oauth2_client_credentials`. * `scopes` (\[]string): OAuth2 scopes. #### Defaults and Validation * `type` defaults to `http`. Unknown types are rejected with a validation error. `mcp` type tools are typically auto-generated by the McpServer controller; see [Connect an MCP Server](../guides/connect-mcp-server.md). * `auth.profile` defaults to `bearer` when `secretRef` is set. Unknown profiles are rejected. * `auth.headerName` is required when `profile=api_key_header`. 
* `auth.tokenURL` is required when `profile=oauth2_client_credentials`. * `capabilities` are trimmed and deduplicated (case-insensitive). * `operation_classes` are trimmed, lowercased, and deduplicated. Invalid values are rejected. Defaults to `["read"]` for `low`/`medium` risk, `["write"]` for `high`/`critical` risk. * `risk_level` defaults to `low`. * `runtime.timeout` defaults to `30s` and must parse as duration. * `runtime.isolation_mode` defaults to: * `sandboxed` for `high`/`critical` risk * `none` for `low`/`medium` risk * `runtime.retry` defaults: * `max_attempts` -> `1` * `backoff` -> `0s` * `max_backoff` -> `30s` * `jitter` -> `none` #### `status` * `phase`, `lastError`, `observedGeneration` Examples: * `examples/resources/tools/*.yaml` * `examples/resources/tools/wasm-reference/wasm_echo_tool.yaml` ### Secret #### `spec` * `data` (map\[string]string): base64-encoded values. * `stringData` (map\[string]string): write-only plaintext convenience input. #### Defaults and Validation * `stringData` entries are merged into `data` as base64 during normalization. * Every `data` value must be non-empty valid base64. * `stringData` is cleared after normalization (write-only behavior). #### `status` * `phase`, `lastError`, `observedGeneration` Examples: `examples/resources/secrets/*.yaml` ### Memory A Memory resource configures a persistent memory backend that agents can read from and write to using built-in memory tools. See [Memory Concepts](../concepts/memory/index.md) for a full overview. #### `spec` * `type` (string): categorization of the memory use case (e.g. `vector`, `kv`). Informational in v1. * `provider` (string): backend implementation. Built-in values: * `in-memory` (default): in-process key-value store. No endpoint needed. Data is lost on restart. * `pgvector`: PostgreSQL with the pgvector extension. Full vector-similarity search. Requires `endpoint` (Postgres DSN) and `embedding_model` (ModelEndpoint reference). 
See [pgvector](../concepts/memory/providers.md#pgvector). * `http`: delegates to an external HTTP service. Requires `endpoint`. See [HTTP Adapter](../concepts/memory/providers.md#http-adapter). * **Coming soon:** Qdrant, Pinecone, Weaviate, Chroma, Milvus. Custom providers can also be registered via the Go provider registry. * `embedding_model` (string): reference to a ModelEndpoint resource that provides an OpenAI-compatible `/embeddings` API. Required for vector providers like `pgvector`. The endpoint's `base_url`, `auth`, and `default_model` are used to generate embeddings. Resolved in the same namespace by default; use `namespace/name` for cross-namespace references. * `endpoint` (string): connection string or URL. For `pgvector`, a Postgres DSN (e.g. `postgres://user@host:5432/db`). For `http`, the adapter service URL. Not needed for `in-memory`. * `auth` (object): * `secretRef` (string): reference to a Secret resource containing credentials. For `http`, used as a bearer token. For `pgvector`, injected as the Postgres password into the DSN. #### Defaults and Validation * `provider` defaults to `in-memory` when omitted or empty. * `endpoint` is required when `provider` is `pgvector`, `http`, or any cloud-hosted built-in provider. * `embedding_model` is required when `provider` is `pgvector`. It must reference a valid ModelEndpoint. * When `auth.secretRef` is set, the controller resolves the Secret and passes the token to the provider. * The Memory controller validates the provider, resolves auth, and performs a connectivity check (`Ping`). Unsupported providers, missing secrets, or failed connectivity move the resource to `Error` phase. 
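Putting the `spec` fields above together, a hypothetical `pgvector` Memory resource could look like the following (the resource names and DSN are placeholders; the envelope shape is assumed to mirror the Agent create example):

```json
{
  "apiVersion": "orloj.dev/v1",
  "kind": "Memory",
  "metadata": { "name": "research-memory", "namespace": "default" },
  "spec": {
    "type": "vector",
    "provider": "pgvector",
    "endpoint": "postgres://orloj@db.internal:5432/orloj_memory",
    "embedding_model": "openai-default",
    "auth": { "secretRef": "pg-memory-password" }
  }
}
```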
#### Built-in Memory Tools When an Agent references a Memory resource via `spec.memory.ref` and explicitly grants operations with `spec.memory.allow`, the runtime exposes the following built-in tools: | Tool | Description | | --------------- | ---------------------------------------------------------- | | `memory.read` | Retrieve a value by key. | | `memory.write` | Store a key-value pair. | | `memory.search` | Search entries by keyword (or vector similarity). | | `memory.list` | List entries, optionally filtered by key prefix. | | `memory.ingest` | Chunk a document into overlapping segments and store them. | These tools do not need to be listed in the agent's `spec.tools` -- they are injected automatically. #### `status` * `phase`: `Pending`, `Ready`, or `Error`. * `lastError`: description of the most recent error (e.g. unsupported provider, connectivity failure). * `observedGeneration` Example: `examples/resources/memories/research_memory.yaml` ### AgentPolicy #### `spec` * `max_tokens_per_run` (int) * `allowed_models` (\[]string) * `blocked_tools` (\[]string) * `apply_mode` (string): `scoped` or `global` * `target_systems` (\[]string) * `target_tasks` (\[]string) #### Defaults and Validation * `apply_mode` defaults to `scoped`. * `apply_mode` must be `scoped` or `global`. #### `status` * `phase`, `lastError`, `observedGeneration` Example: `examples/resources/agent-policies/cost_policy.yaml` ### AgentRole #### `spec` * `description` (string) * `permissions` (\[]string): normalized permission strings. #### Defaults and Validation * `permissions` are trimmed and deduplicated (case-insensitive). #### `status` * `phase`, `lastError`, `observedGeneration` Examples: `examples/resources/agent-roles/*.yaml` ### ToolPermission #### `spec` * `tool_ref` (string): tool name reference. * `action` (string): action name (commonly `invoke`). 
* `required_permissions` (\[]string) * `match_mode` (string): `all` or `any` * `apply_mode` (string): `global` or `scoped` * `target_agents` (\[]string): required when `apply_mode=scoped` * `operation_rules` (\[]object): per-operation-class policy verdicts. * `operation_class` (string): `read`, `write`, `delete`, `admin`, or `*` (wildcard). Defaults to `*`. * `verdict` (string): `allow`, `deny`, or `approval_required`. Defaults to `allow`. #### Defaults and Validation * `tool_ref` defaults to `metadata.name` when omitted. * `action` defaults to `invoke`. * `match_mode` defaults to `all`. * `apply_mode` defaults to `global`. * `required_permissions` and `target_agents` are trimmed and deduplicated. * `target_agents` must be non-empty when `apply_mode=scoped`. * `operation_rules` values are trimmed and lowercased. Invalid `operation_class` or `verdict` values are rejected. * When `operation_rules` is present, the authorizer evaluates the tool's `operation_classes` against the rules. The most restrictive matching verdict wins (`deny` > `approval_required` > `allow`). * When `operation_rules` is empty, behavior is unchanged (backward-compatible binary allow/deny). #### `status` * `phase`, `lastError`, `observedGeneration` Examples: `examples/resources/tool-permissions/*.yaml` ### ToolApproval Captures a pending human/system approval request for a tool invocation that was flagged by a `ToolPermission` `operation_rules` verdict of `approval_required`. #### `spec` * `task_ref` (string, required): name of the Task resource waiting for approval. * `tool` (string, required): tool name that triggered the approval request. * `operation_class` (string): the operation class that requires approval. * `agent` (string): agent that attempted the tool call. * `input` (string): tool input payload (for audit context). * `reason` (string): human-readable reason for the approval request. * `ttl` (duration string): time-to-live before auto-expiry. Defaults to `10m`. 
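A `ToolApproval` is created when a `ToolPermission` `operation_rules` evaluation yields `approval_required`. The most-restrictive-verdict rule described above (`deny` > `approval_required` > `allow`) can be sketched as follows — a hypothetical helper for illustration, not the runtime's actual implementation:

```go
package main

import "fmt"

// Verdict severity ordering: deny > approval_required > allow.
var severity = map[string]int{"allow": 0, "approval_required": 1, "deny": 2}

type OperationRule struct {
	OperationClass string // "read", "write", "delete", "admin", or "*"
	Verdict        string // "allow", "deny", or "approval_required"
}

// evaluate returns the most restrictive verdict among rules matching any of
// the tool's operation classes. With no matching rule, the default is allow.
func evaluate(toolClasses []string, rules []OperationRule) string {
	verdict := "allow"
	for _, class := range toolClasses {
		for _, r := range rules {
			if r.OperationClass == "*" || r.OperationClass == class {
				if severity[r.Verdict] > severity[verdict] {
					verdict = r.Verdict
				}
			}
		}
	}
	return verdict
}

func main() {
	rules := []OperationRule{
		{OperationClass: "*", Verdict: "allow"},
		{OperationClass: "write", Verdict: "approval_required"},
		{OperationClass: "admin", Verdict: "deny"},
	}
	fmt.Println(evaluate([]string{"read"}, rules))           // allow
	fmt.Println(evaluate([]string{"read", "write"}, rules))  // approval_required
	fmt.Println(evaluate([]string{"write", "admin"}, rules)) // deny
}
```

A tool labeled with both `write` and `admin` classes is denied here because `deny` on any matching class outranks the other verdicts.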
#### `status` * `phase` (string): `Pending`, `Approved`, `Denied`, `Expired`. Defaults to `Pending`. * `decision` (string): `approved` or `denied`. * `decided_by` (string): identity of the approver/denier. * `decided_at` (string): RFC3339 timestamp of the decision. * `expires_at` (string): RFC3339 timestamp when the approval expires. #### API Endpoints * `POST /v1/tool-approvals` -- create an approval request. * `GET /v1/tool-approvals` -- list approval requests (supports namespace and label filters). * `GET /v1/tool-approvals/{name}` -- get a specific approval. * `DELETE /v1/tool-approvals/{name}` -- delete an approval. * `POST /v1/tool-approvals/{name}/approve` -- approve a pending request. Body: `{"decided_by": "..."}`. * `POST /v1/tool-approvals/{name}/deny` -- deny a pending request. Body: `{"decided_by": "..."}`. ### Task #### `spec` * `system` (string): target `AgentSystem` name. * `mode` (string): `run` (default) or `template`. * `input` (map\[string]string): task payload. * `priority` (string) * `max_turns` (int, >= 0): required for cyclic graph traversal. * `retry` (object): * `max_attempts` (int) * `backoff` (duration string) * `message_retry` (object): * `max_attempts` (int) * `backoff` (duration string) * `max_backoff` (duration string) * `jitter`: `none`, `full`, `equal` * `non_retryable` (\[]string) * `requirements` (object): * `region` (string) * `gpu` (bool) * `model` (string) #### Defaults and Validation * `input` defaults to `{}`. * `priority` defaults to `normal`. * `mode` defaults to `run`. * `mode=template` marks a task as a non-executable template for schedules. * `max_turns` must be `>= 0`. * `retry` defaults: * `max_attempts` -> `1` * `backoff` -> `0s` * `message_retry` defaults: * `max_attempts` -> `retry.max_attempts` * `backoff` -> `retry.backoff` * `max_backoff` -> `24h` * `jitter` -> `full` * `retry.backoff`, `message_retry.backoff`, and `message_retry.max_backoff` must parse as durations. 
#### `status` Primary fields: * `phase`: `Pending`, `Running`, `WaitingApproval`, `Succeeded`, `Failed`, `DeadLetter`. * `lastError`, `startedAt`, `completedAt`, `nextAttemptAt`, `attempts` * `output`, `assignedWorker`, `claimedBy`, `leaseUntil`, `lastHeartbeat` * `observedGeneration` The `WaitingApproval` phase indicates the task is paused pending a `ToolApproval` decision. When the linked `ToolApproval` is approved, the task transitions back to `Running`. When denied or expired, the task transitions to `Failed` with an `approval_denied` or `approval_timeout` reason. Observability arrays: * `trace[]`: detailed execution/tool-call events. * `history[]`: lifecycle transitions. * `messages[]`: message bus records. * `message_idempotency[]`: message idempotency state. * `join_states[]`: fan-in join activation state. Example: `examples/resources/tasks/*.yaml` ### TaskSchedule #### `spec` * `task_ref` (string): task template reference (`name` or `namespace/name`). * `schedule` (string): 5-field cron expression. * `time_zone` (string): IANA timezone. * `suspend` (bool): stop triggering when `true`. * `starting_deadline_seconds` (int): max lateness window for catch-up. * `concurrency_policy` (string): `forbid` (v1). * `successful_history_limit` (int): retained successful run count. * `failed_history_limit` (int): retained failed/deadletter run count. #### Defaults and Validation * `task_ref` is required and must be `name` or `namespace/name`. * `schedule` is required and must be a valid 5-field cron. * `time_zone` defaults to `UTC`. * `starting_deadline_seconds` defaults to `300`. * `concurrency_policy` defaults to `forbid`. * `successful_history_limit` defaults to `10`. * `failed_history_limit` defaults to `3`. 
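As a worked example of these defaults, a hypothetical schedule that runs a template task nightly at 02:00 Berlin time (resource names are placeholders; the envelope shape is assumed to mirror the Agent create example):

```json
{
  "apiVersion": "orloj.dev/v1",
  "kind": "TaskSchedule",
  "metadata": { "name": "nightly-report", "namespace": "default" },
  "spec": {
    "task_ref": "default/weekly-report-template",
    "schedule": "0 2 * * *",
    "time_zone": "Europe/Berlin",
    "starting_deadline_seconds": 300,
    "successful_history_limit": 10,
    "failed_history_limit": 3
  }
}
```

The referenced task must have `spec.mode: template` so it is never executed directly, only cloned per trigger.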
#### `status` * `phase`, `lastError`, `observedGeneration` * `lastScheduleTime`, `lastSuccessfulTime`, `nextScheduleTime` * `lastTriggeredTask`, `activeRuns` Example: `examples/resources/task-schedules/*.yaml` ### TaskWebhook #### `spec` * `task_ref` (string): template task reference (`name` or `namespace/name`). * `suspend` (bool): rejects deliveries when `true`. * `auth` (object): * `profile` (string): `generic` (default) or `github`. * `secret_ref` (string): required secret reference (`name` or `namespace/name`). * `signature_header` (string) * `signature_prefix` (string) * `timestamp_header` (string): used by `generic`. * `max_skew_seconds` (int): timestamp tolerance for `generic`. * `idempotency` (object): * `event_id_header` (string): header containing unique delivery id. * `dedupe_window_seconds` (int): dedupe TTL. * `payload` (object): * `mode` (string): `raw` (v1 only). * `input_key` (string): generated task input key for raw payload. #### Defaults and Validation * `task_ref` is required and must be `name` or `namespace/name`. * `auth.secret_ref` is required. * `auth.profile` defaults to `generic`; supported values: `generic`, `github`. * profile defaults: * `generic`: * `signature_header` -> `X-Signature` * `signature_prefix` -> `sha256=` * `timestamp_header` -> `X-Timestamp` * `idempotency.event_id_header` -> `X-Event-Id` * `github`: * `signature_header` -> `X-Hub-Signature-256` * `signature_prefix` -> `sha256=` * `idempotency.event_id_header` -> `X-GitHub-Delivery` * `auth.max_skew_seconds` defaults to `300` and must be `>= 0`. * `idempotency.dedupe_window_seconds` must be `>= 0`. Defaults to `259200` (72 hours) for `github` profile or `86400` (24 hours) for `generic` profile. * `payload.mode` defaults to `raw` and only `raw` is allowed in v1. * `payload.input_key` defaults to `webhook_payload`. 
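The `generic` profile signature described above (HMAC-SHA256 over `timestamp + "." + rawBody`, `sha256=`-prefixed) can be sketched from the sender's side as follows. This is a minimal illustration — secret values and header plumbing are placeholders, and the server-side implementation is not shown:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signGeneric computes the generic-profile signature: HMAC-SHA256 over
// timestamp + "." + rawBody, hex-encoded with a "sha256=" prefix, suitable
// for the X-Signature header.
func signGeneric(secret, timestamp string, rawBody []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(timestamp))
	mac.Write([]byte("."))
	mac.Write(rawBody)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifyGeneric recomputes the signature and compares in constant time.
func verifyGeneric(secret, timestamp string, rawBody []byte, received string) bool {
	return hmac.Equal([]byte(signGeneric(secret, timestamp, rawBody)), []byte(received))
}

func main() {
	body := []byte(`{"event":"build.finished"}`)
	sig := signGeneric("webhook-secret", "1735689600", body)
	fmt.Println(sig)
	fmt.Println(verifyGeneric("webhook-secret", "1735689600", body, sig)) // true
}
```

A receiver would additionally reject deliveries whose `X-Timestamp` falls outside `max_skew_seconds` and deduplicate on `X-Event-Id`, per the validation rules above.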
#### `status` * `phase`, `lastError`, `observedGeneration` * `endpointID`, `endpointPath` * `lastDeliveryTime`, `lastEventID`, `lastTriggeredTask` * `acceptedCount`, `duplicateCount`, `rejectedCount` Examples: `examples/resources/task-webhooks/*.yaml` ### McpServer Represents a connection to an external MCP (Model Context Protocol) server. The McpServer controller discovers tools via `tools/list` and auto-generates `Tool` resources (type=mcp) for each. #### `spec` * `transport` (string): **required**. `stdio` or `http`. * `command` (string): stdio transport: command to spawn the MCP server process. * `args` (\[]string): stdio transport: command arguments. * `env` (\[]object): stdio transport: environment variables for the child process. Each entry has: * `name` (string): environment variable name. * `value` (string): literal value. * `secretRef` (string): resolve value from a Secret resource. Mutually exclusive with `value`. * `endpoint` (string): http transport: the MCP server URL. * `auth` (object): http transport: authentication configuration. * `secretRef` (string): secret reference for auth. * `profile` (string): `bearer` or `api_key_header`. Defaults to `bearer`. * `tool_filter` (object): optional tool import filtering. * `include` (\[]string): allowlist of MCP tool names. When set, only listed tools are generated. When empty, all discovered tools are generated. * `reconnect` (object): reconnection policy. * `max_attempts` (int): max reconnection attempts. Defaults to 3. * `backoff` (duration string): backoff between attempts. Defaults to `2s`. #### Defaults and Validation * `transport` is required. Must be `stdio` or `http`. * `command` is required when `transport=stdio`. * `endpoint` is required when `transport=http`. * `env[].secretRef` and `env[].value` are mutually exclusive. * `reconnect.max_attempts` defaults to `3`. * `reconnect.backoff` defaults to `2s`. #### `status` * `phase`: `Pending`, `Connecting`, `Ready`, `Error`. 
* `discoveredTools` (\[]string): all tool names from the MCP server's `tools/list` response. * `generatedTools` (\[]string): names of the `Tool` resources actually created. * `lastSyncedAt` (timestamp): last successful tool sync. * `lastError` (string): last error message. Guide: [Connect an MCP Server](../guides/connect-mcp-server.md) Examples: [`examples/resources/mcp-servers/mcp_server_everything_stdio.yaml`](../../../examples/resources/mcp-servers/mcp_server_everything_stdio.yaml), [`examples/resources/mcp-servers/README.md`](../../../examples/resources/mcp-servers/README.md) ### Worker #### `spec` * `region` (string) * `capabilities.gpu` (bool) * `capabilities.supported_models` (\[]string) * `max_concurrent_tasks` (int) #### Defaults and Validation * `max_concurrent_tasks` defaults to `1` when `<= 0`. #### `status` * `phase`, `lastError`, `lastHeartbeat`, `observedGeneration`, `currentTasks` Example: `examples/resources/workers/worker_a.yaml` ### Related References * [API Reference](./api.md) * [Task Scheduling (Cron)](../operations/task-scheduling.md) * [Webhook Triggers](../operations/webhooks.md) * [Tool Contract v1](./tool-contract-v1.md) * [WASM Tool Module Contract v1](./wasm-tool-module-contract-v1.md) * [Tool Runtime Conformance](../operations/tool-runtime-conformance.md) ## Tool Contract v1 Status: release-candidate contract targeted for Gate 0 stabilization. ### Purpose Define a consistent execution contract across runtime backends (`http`, `container`, `wasm`, `external`, `grpc`, `webhook-callback`) so policy, retries, auth, and observability behave deterministically. ### Scope This document defines: * request/response envelope * canonical error taxonomy * capability and risk labels * auth and redaction requirements * timeout/cancel semantics * compatibility expectations Implemented contract structs are in `runtime/tool_contract.go`. 
### Contract Version * `tool_contract_version`: `v1` * runtimes must report contract version in execution telemetry * unknown major versions must be rejected as deterministic non-retryable errors ### Execution Request ```json { "tool_contract_version": "v1", "request_id": "req-123", "task_id": "default/weekly-report", "namespace": "default", "agent": "research-agent", "tool": { "name": "web_search", "operation": "invoke", "capabilities": ["network.read", "data.read"], "risk_level": "medium" }, "input": { "query": "latest market analysis" }, "input_raw": "", "runtime": { "mode": "container", "timeout_ms": 15000, "max_attempts": 3, "backoff": "exponential", "max_backoff_ms": 30000, "jitter": true }, "auth": { "profile": "api_key_header", "secret_ref": "search-api-key", "scopes": ["search.read"] }, "trace": { "trace_id": "trace-abc", "span_id": "span-xyz" } } ``` Runtime-required request fields: * `tool_contract_version` (defaults to `v1` when omitted) * `request_id` * `tool.name` ### Execution Response ```json { "request_id": "req-123", "status": "ok", "output": { "summary": "..." 
}, "usage": { "duration_ms": 182, "attempt": 1 }, "trace": { "trace_id": "trace-abc", "span_id": "span-xyz" } } ``` `status` values: * `ok` * `error` * `denied` ### Error Model All failures must include canonical fields: ```json { "status": "error", "error": { "code": "timeout", "reason": "tool_execution_timeout", "retryable": true, "message": "execution exceeded timeout_ms", "details": { "timeout_ms": 15000 } } } ``` Required error fields: * `code` * `reason` * `retryable` * `message` * `details` Canonical `code` values: * `invalid_input` * `unsupported_tool` * `runtime_policy_invalid` * `isolation_unavailable` * `permission_denied` * `secret_resolution_failed` * `timeout` * `canceled` * `execution_failed` Canonical `reason` values: * `tool_invalid_input` * `tool_unsupported` * `tool_runtime_policy_invalid` * `tool_isolation_unavailable` * `tool_permission_denied` * `tool_secret_resolution_failed` * `tool_execution_timeout` * `tool_execution_canceled` * `tool_backend_failure` Runtime failures must emit deterministic metadata fields: * `tool_status` * `tool_code` * `tool_reason` * `retryable` ### Denial Semantics Policy/permission denials must: * return `status=denied` * include normalized reason * include policy/permission reference when available * be non-retryable unless explicitly overridden by policy ### Capability Taxonomy Capability labels are lowercase and dot-delimited: * `data.read` * `data.write` * `network.read` * `network.write` * `filesystem.read` * `filesystem.write` * `exec.command` * `external.side_effect` ### Risk Levels * `low` * `medium` * `high` * `critical` ### Auth Binding Auth is declarative and secret-referenced via `Tool.spec.auth`: * `auth.profile`: `bearer` (default), `api_key_header`, `basic`, `oauth2_client_credentials` * `auth.secretRef`: reference to a `Secret` resource (required when profile is set) * `auth.headerName`: custom header name (required for `api_key_header`) * `auth.tokenURL`: OAuth2 token endpoint (required for 
`oauth2_client_credentials`) * `auth.scopes[]`: OAuth2 scopes

#### Auth Profiles

| Profile | Secret format | Injection |
| --- | --- | --- |
| `bearer` | Single value (token) | `Authorization: Bearer <token>` |
| `api_key_header` | Single value (key) | `<headerName>: <key>` |
| `basic` | `username:password` | `Authorization: Basic <base64(username:password)>` |
| `oauth2_client_credentials` | Multi-key: `client_id`, `client_secret` | Token exchange, then `Authorization: Bearer <access_token>` |

#### Auth Error Codes

| HTTP Status / gRPC Code | `tool_code` | `tool_reason` | Retryable |
| --- | --- | --- | --- |
| 401 / `Unauthenticated` | `auth_invalid` | `tool_auth_invalid` | false |
| 403 / `PermissionDenied` | `auth_forbidden` | `tool_auth_forbidden` | false |
| Token expired (OAuth2) | `auth_expired` | `tool_auth_expired` | true (one retry) |
| Secret not found | `secret_resolution_failed` | `tool_secret_resolution_failed` | false |

#### Approval Error Codes

| Condition | `tool_code` | `tool_reason` | Retryable |
| --- | --- | --- | --- |
| Tool call awaiting approval | `approval_pending` | `tool_approval_pending` | false (pause) |
| Approval denied | `approval_denied` | `tool_approval_denied` | false |
| Approval TTL expired | `approval_timeout` | `tool_approval_timeout` | false |

All approval-related error codes are non-retryable and do not consume retry budget. 
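A hedged sketch of how an HTTP transport's status-code mapping could combine these tables with the canonical taxonomy. The 401, 403, 429, and 5xx rows follow the tables above; the choice of `invalid_input` for remaining 4xx responses is illustrative, not the runtime's exact mapping:

```go
package main

import "fmt"

type toolError struct {
	Code      string
	Reason    string
	Retryable bool
}

// classify maps an HTTP status code to canonical error metadata:
// 401/403 to auth codes, 429 and 5xx to retryable backend failures,
// other 4xx to non-retryable errors. Status < 400 means success.
func classify(status int) *toolError {
	switch {
	case status < 400:
		return nil // success, no error envelope
	case status == 401:
		return &toolError{"auth_invalid", "tool_auth_invalid", false}
	case status == 403:
		return &toolError{"auth_forbidden", "tool_auth_forbidden", false}
	case status == 429:
		return &toolError{"execution_failed", "tool_backend_failure", true}
	case status < 500:
		return &toolError{"invalid_input", "tool_invalid_input", false}
	default:
		return &toolError{"execution_failed", "tool_backend_failure", true}
	}
}

func main() {
	for _, s := range []int{200, 401, 404, 429, 503} {
		fmt.Printf("%d -> %+v\n", s, classify(s))
	}
}
```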
#### Rules * Do not persist resolved secrets in status/logs * Redact auth values in logs and traces * Auth resolution failures must map to canonical error fields * Traces record `tool_auth_profile` and `tool_auth_secret_ref` (secret name, not value) #### Secret Rotation Semantics * Secret resolution is performed fresh per tool invocation (no caching of raw secret values) * If a secret is rotated between invocations, the new value is used on the next call without restart * For `oauth2_client_credentials`: access tokens are cached with TTL derived from the token response's `expires_in` field (minus a 30-second safety margin). Cache eviction occurs automatically on expiry or on HTTP 401 from the tool endpoint, triggering a fresh token exchange. * Long-running tasks get the latest secret value on each step's tool calls ### Runtime Semantics All runtimes must honor: * `timeout_ms` * cancellation propagation * retry attempt index (`usage.attempt`) * deterministic error mapping * bounded return on timeout/cancel #### Transport-Specific Behavior **`http`** -- Sends raw tool input as the HTTP POST body. Accepts both raw text and `ToolExecutionResponse` JSON in the response. Maps HTTP status codes to canonical errors (429/5xx retryable, 4xx non-retryable). **`external`** -- Sends the full `ToolExecutionRequest` as the POST body with `Content-Type: application/json` and `X-Tool-Contract-Version: v1`. Requires a `ToolExecutionResponse` JSON response. Non-JSON responses are rejected. **`grpc`** -- Invokes `orloj.tool.v1.ToolService/Execute` as a unary RPC using a JSON codec. Request and response are `ToolExecutionRequest` and `ToolExecutionResponse` marshaled as JSON. gRPC status codes are mapped to the canonical error taxonomy. **`webhook-callback`** -- Sends `ToolExecutionRequest` via HTTP POST. A `200` response is treated as immediate completion. 
A `202` triggers asynchronous polling at `{endpoint}/{request_id}` until a terminal `ToolExecutionResponse` arrives or timeout expires. The runtime also accepts push-based delivery via the callback API. ### Observability Requirements Each execution must emit: * start/end timestamps * terminal status (`ok|error|denied`) * `error.reason` when failed/denied * duration and attempt count * trace correlation (`trace_id`, `span_id`) Task/message traces should include when available: * `tool_contract_version` * `tool_request_id` * `tool_attempt` * `error_code` * `error_reason` * `retryable` ### Compatibility Policy * additive fields are preferred * no unversioned breaking changes on stable contract surfaces * breaking changes require explicit major versioning and migration guidance ## WASM Tool Module Contract v1 Status: release-candidate contract targeted for Gate 0 stabilization. This contract defines host-to-guest payloads for wasm-isolated tool execution. ### Scope * Host runtime: `runtime/tool_runtime_wasm_command_executor.go` * Guest module: WASI-compatible module that reads request JSON from stdin and writes response JSON to stdout * Contract version: `v1` ### Request Envelope (stdin) ```json { "contract_version": "v1", "namespace": "default", "tool": "wasm_echo", "input": "{\"query\":\"hello\"}", "capabilities": ["wasm.echo.invoke"], "risk_level": "high", "runtime": { "entrypoint": "run", "max_memory_bytes": 67108864, "fuel": 0, "enable_wasi": true } } ``` ### Response Envelope (stdout) #### Success ```json { "contract_version": "v1", "status": "ok", "output": "result payload" } ``` #### Error ```json { "contract_version": "v1", "status": "error", "error": { "code": "execution_failed", "reason": "tool_backend_failure", "retryable": true, "message": "guest execution failed", "details": { "guest_error": "example" } } } ``` #### Denied ```json { "contract_version": "v1", "status": "denied", "error": { "code": "permission_denied", "reason": "tool_permission_denied", 
"retryable": false, "message": "blocked by policy" } } ``` ### Validation Rules * `contract_version` is required and must be `v1` * `status` must be one of `ok`, `error`, `denied` * invalid/missing fields are classified as deterministic non-retryable runtime policy errors * `error` fields map directly into canonical tool error metadata (`tool_code`, `tool_reason`, `retryable`) ### Reference Guest Module * `examples/resources/tools/wasm-reference/echo_guest.wat` * `examples/resources/tools/wasm-reference/README.md` ### Related Docs * [Tool Contract v1](./tool-contract-v1.md) * [Tool Runtime Conformance](../operations/tool-runtime-conformance.md) ## Backup and Restore This guide covers backup and restore procedures for Orloj deployments using the Postgres storage backend. Memory-backend deployments are ephemeral and do not require backup. ### What to Back Up | Component | Location | Required | | --------------------------- | --------------------------------------------- | ---------------------- | | Postgres database | `ORLOJ_POSTGRES_DSN` target | Yes | | Secret encryption key | `ORLOJ_SECRET_ENCRYPTION_KEY` env var or flag | Yes (if secrets exist) | | Server/worker configuration | Flags, env vars, Kubernetes manifests | Recommended | | Monitoring profiles | `monitoring/` directory | Recommended | The secret encryption key is critical. Without it, encrypted `Secret` resource values cannot be decrypted after restore. Store it separately from the database backup in a secure vault. ### Postgres Backup #### Full Dump ```bash pg_dump "$ORLOJ_POSTGRES_DSN" \ --format=custom \ --file=orloj-backup-$(date +%Y%m%d-%H%M%S).dump ``` `--format=custom` produces a compressed archive that supports selective restore and parallel jobs. 
#### Automated Scheduled Backup For production, schedule backups with cron or your orchestrator's job scheduler: ```bash # Daily backup with 7-day retention 0 2 * * * pg_dump "$ORLOJ_POSTGRES_DSN" --format=custom \ --file=/backups/orloj-$(date +\%Y\%m\%d).dump \ && find /backups -name "orloj-*.dump" -mtime +7 -delete ``` #### Cloud-Managed Databases If using a managed Postgres service (RDS, Cloud SQL, Azure Database), use the provider's automated backup and point-in-time recovery features instead of `pg_dump`. Ensure the retention window meets your recovery objectives. ### Restore Procedure #### 1. Stop Orloj Services Stop `orlojd` and all `orlojworker` instances to prevent writes during restore. #### 2. Restore the Database Restore to a fresh database or the existing one: ```bash # Create fresh database (recommended) createdb orloj_restored # Restore from dump pg_restore --dbname=orloj_restored \ --clean --if-exists \ --no-owner \ orloj-backup-20260317-020000.dump ``` If restoring to the existing database: ```bash pg_restore --dbname="$ORLOJ_POSTGRES_DSN" \ --clean --if-exists \ --no-owner \ orloj-backup-20260317-020000.dump ``` #### 3. Update DSN (if restored to a new database) Point `ORLOJ_POSTGRES_DSN` to the restored database before restarting services. #### 4. Verify the Encryption Key Ensure `ORLOJ_SECRET_ENCRYPTION_KEY` matches the key that was active when the backup was taken. A mismatched key will cause Secret resource decryption failures at runtime. #### 5. Restart and Validate ```bash # Start orlojd ./orlojd --storage-backend=postgres ... # Verify health curl -sf http://127.0.0.1:8080/healthz | jq . 
# Verify resources are accessible go run ./cmd/orlojctl get agents go run ./cmd/orlojctl get tasks go run ./cmd/orlojctl get workers # Run a smoke load test go run ./cmd/orloj-loadtest \ --base-url=http://127.0.0.1:8080 \ --tasks=10 \ --quality-profile=monitoring/loadtest/quality-default.json ``` ### Point-in-Time Recovery For Postgres deployments with WAL archiving enabled, you can recover to a specific point in time. This requires: 1. A base backup taken before the target recovery point. 2. Continuous WAL archiving to a durable location. 3. Postgres `recovery_target_time` configuration. Refer to the [PostgreSQL PITR documentation](https://www.postgresql.org/docs/current/continuous-archiving.html) for setup details. Cloud-managed databases typically expose PITR as a built-in feature. ### Upgrade Safety Before any Orloj version upgrade: 1. Take a full Postgres backup. 2. Record the current `ORLOJ_SECRET_ENCRYPTION_KEY`. 3. Record the current binary versions and configuration. 4. Proceed with the upgrade per the [Upgrades and Rollbacks](upgrades.md) guide. If the upgrade fails, restore from the backup and revert to the previous binary version. ### Disaster Recovery Checklist * [ ] Postgres backups run on a schedule and are verified periodically. * [ ] Secret encryption key is stored in a secure vault, separate from backups. * [ ] Backup retention meets your recovery point objective (RPO). * [ ] Restore procedure has been tested in a non-production environment. * [ ] Monitoring alerts cover backup job failures. ## Configuration This page defines runtime configuration for `orlojd`, `orlojworker`, and client-side defaults for `orlojctl` (see also [CLI reference](../reference/cli.md)). ### Precedence 1. CLI flags 2. Environment variable fallback 3. Code defaults Example: * `--model-gateway-provider` overrides `ORLOJ_MODEL_GATEWAY_PROVIDER`. * If neither is set, default is `mock`. 
### Core Environment Variables

| Variable | Used By | Purpose |
| --- | --- | --- |
| `ORLOJ_POSTGRES_DSN` | `orlojd`, `orlojworker` | Postgres DSN when `--storage-backend=postgres`. |
| `ORLOJ_TASK_EXECUTION_MODE` | `orlojd`, `orlojworker` | `sequential` or `message-driven`. |
| `ORLOJ_EMBEDDED_WORKER_MAX_CONCURRENT_TASKS` | `orlojd` | Default for `--embedded-worker-max-concurrent-tasks` when the embedded worker is enabled (`<= 0` normalized to `1` on upsert). |
| `ORLOJ_MODEL_GATEWAY_PROVIDER` | `orlojd`, `orlojworker` | `mock`, `openai`, `anthropic`, `azure-openai`, `ollama`. |
| `ORLOJ_MODEL_GATEWAY_API_KEY` | `orlojd`, `orlojworker` | Explicit model API key. |
| `OPENAI_API_KEY` | `orlojd`, `orlojworker` | Fallback key for OpenAI. |
| `ANTHROPIC_API_KEY` | `orlojd`, `orlojworker` | Fallback key for Anthropic. |
| `AZURE_OPENAI_API_KEY` | `orlojd`, `orlojworker` | Fallback key for Azure OpenAI. |
| `ORLOJ_EVENT_BUS_BACKEND` | `orlojd` | Server event bus (`memory` or `nats`). |
| `ORLOJ_NATS_URL` | `orlojd`, `orlojworker` | NATS URL and fallback for runtime message bus URL. |
| `ORLOJ_AGENT_MESSAGE_BUS_BACKEND` | `orlojd`, `orlojworker` | Runtime message bus (`none`, `memory`, or `nats-jetstream`). |
| `ORLOJ_AUTH_MODE` | `orlojd` | API auth mode (`off`, `native`, or `sso`). OSS supports `off` and `native`; `sso` requires an enterprise adapter. |
| `ORLOJ_AUTH_SESSION_TTL` | `orlojd` | Session TTL for native auth mode (example: `24h`). |
| `ORLOJ_SETUP_TOKEN` | `orlojd` | When set, `/v1/auth/setup` requires a matching `setup_token` in the request body. Prevents unauthorized admin account creation on exposed instances. |
| `ORLOJ_SECRET_ENCRYPTION_KEY` | `orlojd`, `orlojworker` | 256-bit AES key (hex or base64) for encrypting Secret resource data at rest. |
| `ORLOJ_TOOL_ISOLATION_BACKEND` | `orlojd`, `orlojworker` | Tool isolation (`none`, `container`, or `wasm`). |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `orlojd`, `orlojworker` | OTLP gRPC endpoint for OpenTelemetry trace export. Empty disables export. |
| `OTEL_EXPORTER_OTLP_INSECURE` | `orlojd`, `orlojworker` | Set to `true` for non-TLS OTLP connections (development). |
| `ORLOJ_LOG_FORMAT` | `orlojd`, `orlojworker` | Log output format: `json` (default) or `text`. |
| `ORLOJ_SERVER` | `orlojctl` | Default API base URL when `--server` is omitted (after `ORLOJCTL_SERVER`). |
| `ORLOJCTL_SERVER` | `orlojctl` | Default API base URL when `--server` is omitted (highest precedence among env defaults). |
| `ORLOJCTL_API_TOKEN` | `orlojctl` | Bearer token for API calls (same semantics as `ORLOJ_API_TOKEN` for the client). |

### Server Flags

Print full options:

```bash
go run ./cmd/orlojd -h
```

High-impact groups:

* API/server: `--addr`
* storage: `--storage-backend`, `--postgres-dsn`, pool sizing flags
* execution: `--task-execution-mode`, embedded worker/lease controls, `--embedded-worker-max-concurrent-tasks`
* model gateway: provider, API key, timeout, base URL, default model
* tool runtime: isolation mode, container and wasm controls
* buses: server event bus and runtime message bus flags

### Worker Flags

Print full options:

```bash
go run ./cmd/orlojworker -h
```

High-impact groups:

* identity/capacity: `--worker-id`, `--region`, `--gpu`, `--supported-models`, `--max-concurrent-tasks`
* storage: same postgres flags as server
* execution: `--task-execution-mode`, `--agent-message-consume`, runtime consumer controls
* model/tool runtime: provider and isolation flags

### Secret Resolution

Model endpoints and tools reference secrets via `secretRef` fields.
The runtime resolves secrets using a chain of resolvers:

1. **Resource store** -- looks up a `Secret` resource by name.
2. **Environment variables** -- looks up an environment variable with the `ORLOJ_SECRET_` prefix (the prefix is configurable with `--model-secret-env-prefix` and `--tool-secret-env-prefix`).

#### Encryption at Rest

Pass `--secret-encryption-key` (or set `ORLOJ_SECRET_ENCRYPTION_KEY`) on both `orlojd` and `orlojworker` to encrypt `Secret.spec.data` values in the database using AES-256-GCM. The same key must be used by all processes sharing the database.

See [Security and Isolation -- Encryption at Rest](./security.md#encryption-at-rest) for key generation and usage.

### Postgres Tuning

#### Connection Pool (main store)

The main Postgres pool is configured via CLI flags:

| Flag | Default | Description |
| --- | --- | --- |
| `--postgres-max-open-conns` | 20 | Maximum open connections |
| `--postgres-max-idle-conns` | 10 | Maximum idle connections kept warm |
| `--postgres-conn-max-lifetime` | 30m | Maximum lifetime of a connection before recycling |

Idle connections are evicted after 5 minutes to avoid stale TCP connections behind firewalls or load balancers.

#### Connection Pool (pgvector memory backend)

The pgvector memory backend uses a separate `pgxpool` connection pool created from the Memory resource's `spec.endpoint` DSN. Pool behavior can be tuned by appending query parameters to the endpoint URL:

```
postgres://user:pass@host:5432/db?pool_max_conns=10&pool_min_conns=2&pool_max_conn_idle_time=5m&pool_health_check_period=1m
```

| Parameter | Default | Description |
| --- | --- | --- |
| `pool_max_conns` | max(4, NumCPU) | Maximum pool size |
| `pool_min_conns` | 0 | Minimum warm connections |
| `pool_max_conn_lifetime` | 1h | Recycle connections after this duration |
| `pool_max_conn_idle_time` | 30m | Close idle connections after this duration |
| `pool_health_check_period` | 1m | How often to ping idle connections |

#### Statement Timeout

Neither the main store nor the pgvector backend sets a `statement_timeout` by default. To protect against runaway queries, add it to the DSN:

```bash
# Main store (30-second statement timeout)
--postgres-dsn="postgres://user:pass@host:5432/db?options=-c%20statement_timeout%3D30000"

# pgvector memory endpoint (in the Memory resource spec.endpoint)
postgres://user:pass@host:5432/db?options=-c%20statement_timeout%3D30000&pool_max_conns=10
```

The `statement_timeout` value is in milliseconds. Postgres cancels any single statement that exceeds this limit.

### Recommended Production Baseline

* `orlojd`: `--storage-backend=postgres`, `--task-execution-mode=message-driven`, `--agent-message-bus-backend=nats-jetstream`
* `orlojworker`: `--storage-backend=postgres`, `--task-execution-mode=message-driven`, `--agent-message-consume`
* Enable `--secret-encryption-key` on all processes if using `Secret` resources
* Set provider keys via `ORLOJ_SECRET_*` environment variables or an external secret manager
* Set `OTEL_EXPORTER_OTLP_ENDPOINT` to your tracing backend (Jaeger, Tempo, etc.) for distributed trace collection
* See [Observability](./observability.md) for the full tracing, metrics, and logging setup

### Verification

```bash
curl -s http://127.0.0.1:8080/healthz | jq .
go run ./cmd/orlojctl get workers
go run ./cmd/orlojctl get tasks
```

## Operations

Use this section to run, secure, troubleshoot, and validate Orloj in production-like environments.

### Core Operator Guides

* [Runbook](./runbook.md)
* [Deployment Overview](../deployment/index.md)
* [Remote CLI and API access](../deployment/remote-cli-access.md)
* [VPS Deployment](../deployment/vps.md)
* [Kubernetes Deployment](../deployment/kubernetes.md)
* [Configuration](./configuration.md)
* [Troubleshooting](./troubleshooting.md)
* [Upgrades and Rollbacks](./upgrades.md)
* [Task Scheduling (Cron)](./task-scheduling.md)
* [Webhook Triggers](./webhooks.md)
* [Security and Isolation](./security.md)

### Observability

* [Observability](./observability.md)
* [Monitoring and Alerts](./monitoring-alerts.md)

### Reliability and Validation

* [Load Testing](./load-testing.md)
* [Tool Runtime Conformance](./tool-runtime-conformance.md)
* [Real Tool Validation](./real-tool-validation.md)
* [Live Validation Matrix](./live-validation-matrix.md)

## Live Validation Matrix

Use this runbook to exercise Orloj with real model providers and a deterministic local tool stub before an open-source release.

### Purpose

The automated Go test suite proves core correctness, but the live-validation matrix is where we check:

* real provider behavior
* message-driven execution
* tool isolation with real HTTP calls
* memory-backed agent workflows
* governance deny paths
* trigger paths through webhooks and schedules

### Before You Start

1. Run the automated baseline:

   ```bash
   go test ./...
   ```

2. Start `orlojd`:

   ```bash
   go run ./cmd/orlojd --task-execution-mode=message-driven --agent-message-bus-backend=memory
   ```

3. Start the worker for your lane.

   Anthropic, model-only:

   ```bash
   go run ./cmd/orlojworker \
     --task-execution-mode=message-driven \
     --agent-message-bus-backend=memory \
     --agent-message-consume \
     --model-gateway-provider=anthropic
   ```

   Anthropic, tool-backed:

   ```bash
   go run ./cmd/orlojworker \
     --task-execution-mode=message-driven \
     --agent-message-bus-backend=memory \
     --agent-message-consume \
     --model-gateway-provider=anthropic \
     --tool-isolation-backend=container \
     --tool-container-network=bridge
   ```

4. Start the deterministic stub service:

   ```bash
   make real-tool-stub
   ```

5. Replace all `replace-me` provider secrets in `testing/scenarios-real/`.

Important readiness rule:

* Keep `orlojd` and the matching `orlojworker` running before any `make real-apply-*` or `make real-gate-*` command. If they are not up, tasks can fail immediately or stall.
* Quick check: `curl -sf http://localhost:8080/healthz >/dev/null` should exit 0 before running gates.

### Matrix Overview

#### Wave 0

* `make real-gate-pipeline`
* `make real-gate-hier`
* `make real-gate-loop`
* `make real-gate-tool`
* `make real-gate-tool-decision`

#### Wave 1

* `make real-gate-memory-shared`
* `make real-gate-memory-reuse`

#### Wave 2

* `make real-gate-tool-auth`
* `make real-gate-governance-deny`
* `make real-gate-tool-retry`

#### Wave 3

* `make real-gate-webhook`
* `make real-gate-schedule`

### Contract Enforcement Notes

Scenario `08-tool-auth-and-contract` uses `execution.profile: contract` with `on_contract_violation: observe`. This means:

* The agent's tool sequence is tracked and violations are logged as `agent_contract_violation` events in the task trace.
* Violations do **not** deadletter the task; the agent continues to completion.
* Duplicate tool calls are short-circuited (cached result reused) in all scenarios, including `04-tool-call-smoke`, which uses `profile: dynamic`.
* Tool results use the provider's native structured tool calling protocol (`role: "tool"` with `tool_call_id` for OpenAI, `tool_result` content blocks for Anthropic), preventing models from re-calling tools.
* Pipeline stages can use `tool_use_behavior: stop_on_first_tool` to exit immediately after the first successful tool call (1 model call + 1 tool call total).

If a gate deadletters unexpectedly, check whether `on_contract_violation` is set to `non_retryable_error`. Switch to `observe` to collect telemetry without disrupting the flow.

### Acceptance Targets

* Run every Wave 0 and Wave 1 scenario 3 times:

  ```bash
  make real-repeat TARGET=real-gate-pipeline COUNT=3
  make real-repeat TARGET=real-gate-memory-shared COUNT=3
  ```

* Run governance and tool-decision scenarios 5 times:

  ```bash
  make real-repeat TARGET=real-gate-tool-decision COUNT=5
  make real-repeat TARGET=real-gate-governance-deny COUNT=5
  ```

### Deterministic Tool Stub

The local stub service lives at:

* host: `http://127.0.0.1:18080`
* container-accessible: `http://host.docker.internal:18080`

Supported paths:

* `/tool/smoke`
* `/tool/decision`
* `/tool/auth`
* `/tool/retry-once`

This avoids public echo services and gives stable auth/retry assertions.

### Artifact Convention

Every gate captures artifacts under:

```text
testing/artifacts/real////
```

Files:

* `task.json`
* `messages.json`
* `metrics.json`
* `memory-.json` for memory-backed scenarios
* `verdict.txt`

### UI Validation Checklist

After a gate passes, inspect `/ui/` and confirm:

* task trace is readable and includes the expected step sequence
* tool calls are visible for tool-backed scenarios
* memory entries are visible on the Memory detail page
* deny/failure paths are understandable without reading source code

### Troubleshooting

* `secret placeholder detected`: replace `replace-me` in the scenario secret.
* `tool container cannot reach stub`: start the worker with `--tool-container-network=bridge` and keep the stub on port `18080`.
* `webhook has not created a task yet`: check the signing secret and confirm the delivery returned HTTP `202`.
* `schedule has not created a task yet`: give the minute-level schedule up to 120 seconds and confirm `orlojd` is reconciling schedules.

### Related

* [Real-Model Scenario README](../../../testing/scenarios-real/README.md)
* [Webhook Triggers](./webhooks.md)
* [Task Scheduling (Cron)](./task-scheduling.md)
* [Real Tool Validation](./real-tool-validation.md)

## Load Testing

Use `orloj-loadtest` to run repeatable reliability scenarios and enforce non-zero quality gates.

### Purpose

The load harness validates message-driven reliability behavior, including retries, dead-letter handling, and lease takeover events.

### Command Reference

```bash
go run ./cmd/orloj-loadtest --help
```

Default behavior:

* applies baseline manifests from `examples/`
* waits for at least two ready workers
* creates tasks concurrently
* waits for terminal state (`succeeded|failed|deadletter`) or timeout
* enforces quality gates (`exit 2` when a gate fails)

### Baseline Run

```bash
go run ./cmd/orloj-loadtest \
  --base-url=http://127.0.0.1:8080 \
  --namespace=default \
  --tasks=200 \
  --create-concurrency=25 \
  --poll-concurrency=50 \
  --run-timeout=10m \
  --quality-profile=monitoring/loadtest/quality-default.json
```

### Failure Injection Scenarios

Invalid system dead-letter injection:

```bash
go run ./cmd/orloj-loadtest \
  --tasks=200 \
  --inject-invalid-system-rate=0.10 \
  --invalid-system-name=missing-system-loadtest
```

Retry stress scenario:

```bash
go run ./cmd/orloj-loadtest \
  --tasks=200 \
  --inject-timeout-system-rate=0.20 \
  --timeout-system-name=loadtest-timeout-system \
  --message-retry-attempts=6 \
  --message-retry-backoff=100ms \
  --message-retry-max-backoff=1s \
  --min-retry-total=50
```

Expired lease takeover simulation:

```bash
go run ./cmd/orloj-loadtest \
  --tasks=200 \
  --inject-expired-lease-rate=0.15 \
  --expired-lease-owner=worker-crashed-simulated \
  --min-takeover-events=20
```

### JSON Reporting and Exit Codes

```bash
go run ./cmd/orloj-loadtest --tasks=100 --json=true
```

* `0`: gates passed
* `2`: quality gates failed
* `1`: command/config/runtime failure

### Quality Profiles

#### Default (manual validation)

* `monitoring/loadtest/quality-default.json`

For interactive or staging validation with full failure injection.

#### CI (automated quality gates)

* `monitoring/loadtest/quality-ci.json`

Used by the `reliability` CI job. Runs 30 tasks with relaxed thresholds suitable for memory-backend, mock-model CI environments. No failure injection; validates baseline task flow health.

#### Profile fields

* `min_success_rate`
* `max_deadletter_rate`
* `max_failed_rate`
* `max_timed_out`
* `min_retry_total`
* `min_takeover_events`

### Notes

This harness is for reliability and failure validation, not peak-throughput microbenchmarking.

## Monitoring and Alerts

Use `orloj-alertcheck` and dashboard contracts to validate runtime reliability signals.

> For Prometheus metrics, OpenTelemetry tracing, structured logging, and the trace visualization UI, see [Observability](./observability.md).

### Purpose

This guide defines repeatable operational checks for retry storms, dead-letter growth, and latency saturation.

### Artifacts

* Alert profile (default): `monitoring/alerts/retry-deadletter-default.json`
* Alert profile (CI): `monitoring/alerts/retry-deadletter-ci.json`
* Dashboard contract: `monitoring/dashboards/retry-deadletter-overview.json`
* Alert check command: `cmd/orloj-alertcheck`

The CI profile uses a lower `min_tasks` floor and a higher latency ceiling to accommodate CI runner variability. It is used by the `reliability` job in `.github/workflows/ci.yml`.
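To make the profile shape concrete, here is a hedged sketch of what a threshold profile might contain. Only `min_tasks` and `require_any_task_succeeded` are named elsewhere in these docs; every other field name and value below is an illustrative assumption, not the actual schema. Check the shipped profiles under `monitoring/alerts/` for the authoritative field names.

```json
{
  "min_tasks": 30,
  "require_any_task_succeeded": true,
  "max_retry_total": 200,
  "max_retry_rate_per_task": 2.0,
  "max_deadletter_total": 10,
  "max_deadletter_task_rate": 0.05,
  "max_inflight": 100,
  "max_p95_latency_ms": 30000
}
```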
### Alert Check Command

```bash
go run ./cmd/orloj-alertcheck \
  --base-url=http://127.0.0.1:8080 \
  --namespace=default \
  --profile=monitoring/alerts/retry-deadletter-default.json \
  --json=true
```

Optional filters:

* `--task-name-prefix`
* `--task-system`

Auth:

* `--api-token=<token>` or `ORLOJ_API_TOKEN=<token>`

### Exit Behavior

* `0`: no violations
* `2`: one or more alert violations found
* `1`: command/config/API failure

### Default Threshold Profile

The default profile checks:

* retry storm absolute total and per-task rate
* dead-letter absolute total and dead-letter task rate
* in-flight saturation ceiling
* max p95 latency ceiling (complement with the `orloj_agent_step_duration_seconds` Prometheus histogram for live percentile queries)
* optional `require_any_task_succeeded`

### Dashboard Contract

`monitoring/dashboards/retry-deadletter-overview.json` defines backend-agnostic panel expectations for:

* retry totals
* dead-letter totals
* dead-letter task rate
* in-flight totals
* max p95 latency

## Observability

Orloj provides built-in observability through OpenTelemetry tracing, Prometheus metrics, structured logging, and an in-app trace visualization UI. These features work out of the box in OSS deployments and integrate with standard observability backends.

### Trace Visualization (Web Console)

The web console includes a **Trace** tab on every task detail page. It renders the `TaskTraceEvent` data that the runtime already records during execution.

To view a task trace:

1. Open the web console at `http://<host>/ui/`.
2. Navigate to a task and click into its detail page.
3. Click the **Trace** tab.

The trace view shows:

* **Summary bar** -- total events, cumulative latency, token count, tool calls, and error count.
* **Waterfall timeline** -- each row is one trace event (agent start/end, tool call, model call, error, dead-letter). The horizontal bar shows time offset from task start and duration.
* **Filters** -- filter by agent or branch when the task fans out across multiple agents.
* **Expandable detail rows** -- click any row to see step ID, attempt, branch, tool name, tokens, error code/reason, and the full message.

The trace data comes from `GET /v1/tasks/{name}` (the `status.trace` field). No additional backend is required -- trace events are stored alongside the task resource.

### OpenTelemetry Tracing

Orloj emits OpenTelemetry spans for task execution, agent steps, and message processing. Spans are exported via OTLP gRPC to any compatible backend (Jaeger, Grafana Tempo, Datadog, Honeycomb, etc.).

#### Enabling OTel Export

Set the OTLP endpoint via environment variable:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
```

Or for non-TLS backends in development:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
export OTEL_EXPORTER_OTLP_INSECURE=true
```

Both `orlojd` and `orlojworker` initialize the OTel trace provider on startup. When no endpoint is configured, a no-op provider is installed and tracing has zero overhead.

#### Span Hierarchy

Spans follow the task execution structure:

```
task.execute (root span)
├── agent.execute (one per agent step)
│   ├── model.call (model gateway invocations)
│   └── tool.execute (tool runtime calls)
└── ...
```

For message-driven execution, each message consumption creates a `message.process` span with a nested `agent.execute` span.

#### Span Attributes

All spans carry `orloj.*` attributes:

| Attribute | Description |
| --- | --- |
| `orloj.task` | Task resource name |
| `orloj.system` | AgentSystem resource name |
| `orloj.namespace` | Resource namespace |
| `orloj.agent` | Agent resource name |
| `orloj.step_id` | Step identifier (e.g. `a1.s3`) |
| `orloj.attempt` | Current attempt number |
| `orloj.tokens.used` | Tokens consumed by this step |
| `orloj.tokens.estimated` | Estimated tokens (when exact count unavailable) |
| `orloj.tool_calls` | Number of tool invocations |
| `orloj.latency_ms` | Step duration in milliseconds |
| `orloj.message_id` | Message ID (message-driven mode) |
| `orloj.from_agent` | Source agent for message handoff |
| `orloj.to_agent` | Destination agent for message handoff |
| `orloj.branch_id` | Branch ID for fan-out tracking |
| `orloj.tool` | Tool name |
| `orloj.tool.attempt` | Tool retry attempt |
| `orloj.model` | Model identifier |

#### W3C Trace Context

Orloj propagates `traceparent` and `tracestate` headers using the W3C Trace Context standard. This means external tools that support W3C propagation will automatically appear as child spans in your traces.

#### Dual Write

OTel spans are emitted in parallel with the internal `Task.status.trace` events. The internal trace powers the web console trace tab, while OTel spans flow to your external tracing backend. Both views are consistent.

### Prometheus Metrics

Orloj exposes a standard Prometheus scrape endpoint at `/metrics` on the `orlojd` HTTP server. The endpoint is unauthenticated (like `/healthz`) so Prometheus can scrape it without API tokens.
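All Orloj series share the `orloj_` prefix, which makes it easy to separate them from Go runtime and process metrics in a raw scrape. A hedged sketch against a saved scrape file (the sample series below are illustrative, not real output):

```shell
# Filter Orloj series from a saved /metrics scrape.
# The file content here is a hand-written illustration of the text format.
cat > /tmp/scrape.txt <<'EOF'
go_goroutines 42
orloj_retries_total{agent="research-agent"} 3
orloj_inflight_messages{agent="research-agent"} 1
EOF
grep -c '^orloj_' /tmp/scrape.txt
```

Against a live server, the same filter applies to `curl -s http://<host>:8080/metrics`.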
#### Available Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `orloj_task_duration_seconds` | histogram | `namespace`, `system`, `status` | End-to-end task duration |
| `orloj_agent_step_duration_seconds` | histogram | `agent`, `step_type` | Duration of a single agent step |
| `orloj_tokens_used_total` | counter | `agent`, `model`, `type` | Tokens consumed (`type` = `used` or `estimated`) |
| `orloj_messages_total` | counter | `phase`, `agent` | Message lifecycle transitions |
| `orloj_deadletters_total` | counter | `agent` | Messages moved to dead-letter |
| `orloj_retries_total` | counter | `agent` | Message retry count |
| `orloj_inflight_messages` | gauge | `agent` | Currently in-flight messages |

#### Prometheus Scrape Configuration

```yaml
scrape_configs:
  - job_name: orloj
    static_configs:
      - targets: ['orlojd:8080']
    metrics_path: /metrics
    scrape_interval: 15s
```

#### Example Queries

Task success rate over the last hour:

```promql
sum(rate(orloj_task_duration_seconds_count{status="succeeded"}[1h]))
/
sum(rate(orloj_task_duration_seconds_count[1h]))
```

Token consumption by agent:

```promql
sum by (agent) (rate(orloj_tokens_used_total{type="used"}[5m]))
```

Dead-letter rate by agent:

```promql
sum by (agent) (rate(orloj_deadletters_total[5m]))
```

P95 agent step latency:

```promql
histogram_quantile(0.95, sum by (le, agent) (rate(orloj_agent_step_duration_seconds_bucket[5m])))
```

### Structured Logging

Both `orlojd` and `orlojworker` emit structured JSON logs by default. Log output can be configured via the `ORLOJ_LOG_FORMAT` environment variable.

#### Configuration

| Variable | Values | Default | Description |
| --- | --- | --- | --- |
| `ORLOJ_LOG_FORMAT` | `json`, `text` | `json` | Log output format. Use `text` for local development. |

#### Log Fields

All log entries include a `service` field (`orlojd` or `orlojworker`). When processing HTTP requests, entries also include:

* `request_id` -- unique ID for the request (propagated from the `X-Request-ID` header or auto-generated)

When OpenTelemetry is enabled, log entries from traced code paths include:

* `trace_id` -- OTel trace ID for correlation with spans
* `span_id` -- OTel span ID

#### Request ID Propagation

The HTTP server automatically generates a request ID for each incoming request and returns it in the `X-Request-ID` response header. If the client sends an `X-Request-ID` header, it is reused. This enables end-to-end request correlation across services.

#### Correlating Logs with Traces

In Grafana, you can use the `trace_id` field to link from a log entry directly to the corresponding trace in Tempo or Jaeger. The trace ID in logs matches the OTel trace ID in exported spans.

### Docker Compose Example

To run Orloj with Jaeger and Prometheus in a local development stack:

```yaml
services:
  jaeger:
    image: jaegertracing/jaeger:2
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  orlojd:
    image: orloj:latest
    command: >
      orlojd
      --embedded-worker
      --model-gateway-provider=openai
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: jaeger:4317
      OTEL_EXPORTER_OTLP_INSECURE: "true"
      ORLOJ_LOG_FORMAT: json
    ports:
      - "8080:8080"
```

With the corresponding `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: orloj
    static_configs:
      - targets: ['orlojd:8080']
    scrape_interval: 15s
```

### CLI Trace Inspection

For operators who prefer the CLI, `orlojctl trace task` prints the full trace timeline:

```bash
go run ./cmd/orlojctl trace task my-task
```

This is useful for quick debugging without opening the web console or an external tracing backend.
### Related Docs

* [Monitoring and Alerts](./monitoring-alerts.md) -- `orloj-alertcheck` threshold profiles and dashboard contracts
* [Configuration](./configuration.md) -- all environment variables and CLI flags
* [Troubleshooting](./troubleshooting.md) -- diagnosis workflows
* [Runbook](./runbook.md) -- production operations
* [Extension Contracts](../reference/extensions.md) -- `MeteringSink` and `AuditSink` for custom integrations

## Real Tool Validation (Model Decision Gate)

Use this runbook to validate model-selected tool usage in an Anthropic-backed A/B scenario.

### Goal

* Task A should use a tool.
* Task B should not use a tool.

Scenario path:

* `testing/scenarios-real/05-tool-decision`

### Before You Begin

1. Add a valid Anthropic key to:
   * `testing/scenarios-real/05-tool-decision/secret.yaml`
2. Ensure Docker is available for containerized tools.
3. Ensure the API server is reachable at `http://localhost:8080` (or override `API_BASE`).
4. Start the local deterministic stub tool service:

   ```bash
   make real-tool-stub
   ```

### Runtime Startup

Terminal 1 (server):

```bash
go run ./cmd/orlojd --task-execution-mode=message-driven --agent-message-bus-backend=memory
```

Terminal 2 (worker with Anthropic + containerized tools):

```bash
go run ./cmd/orlojworker \
  --task-execution-mode=message-driven \
  --agent-message-bus-backend=memory \
  --agent-message-consume \
  --model-gateway-provider=anthropic \
  --tool-isolation-backend=container \
  --tool-container-network=bridge
```

### Apply Scenario

```bash
make real-apply-tool-decision
```

This applies:

* a `ModelEndpoint` pinned to `claude-sonnet-4-20250514`
* one HTTP tool with `spec.runtime.isolation_mode=container`
* one decision agent
* two tasks:
  * `rr-tool-use-task`
  * `rr-tool-no-use-task`

The tool points at the local stub service via `http://host.docker.internal:18080/tool/decision`.
### Run Gate

```bash
make real-gate-tool-decision
```

### Pass/Fail Criteria

#### `rr-tool-use-task`

Pass requires all of:

* task phase is `Succeeded`
* `status.output["agent.1.tool_calls"] >= 1`
* at least one `tool_call` event in `status.trace[]`
* `status.output["agent.1.last_event"]` contains `TOOL_USED: yes` and `EVIDENCE:`

#### `rr-tool-no-use-task`

Pass requires all of:

* task phase is `Succeeded`
* `status.output["agent.1.tool_calls"] == 0`
* zero `tool_call` events in `status.trace[]`
* `status.output["agent.1.last_event"]` contains `TOOL_USED: no` and `EVIDENCE: self-contained-input`

### Reliability Target

Pass `make real-gate-tool-decision` five consecutive times.

### Manual Inspection

```bash
make real-check NS=rr-real-tool-decision TASK=rr-tool-use-task
make real-check NS=rr-real-tool-decision TASK=rr-tool-no-use-task
```

## Operations Runbook

Use this runbook for baseline production operation and incident response.

### Reference Topology

1. `orlojd` server
2. Postgres state backend
3. NATS JetStream for message-driven execution
4. multiple `orlojworker` instances

### Startup Procedure

1. Start Postgres and NATS.
2. Start `orlojd` with `--storage-backend=postgres` and `--task-execution-mode=message-driven`.
3. Start at least two workers with `--agent-message-consume`.
4. Configure the model provider and credentials.
5. Apply required resources (`ModelEndpoint`, `Tool`, `Agent`, `AgentSystem`, `Task`, governance resources).

### Verification

```bash
curl -s http://127.0.0.1:8080/healthz | jq .
go run ./cmd/orlojctl get workers
go run ./cmd/orlojctl get tasks
curl -s http://127.0.0.1:8080/metrics | head -20
```

Expected result:

* API health endpoint reports healthy.
* Workers report `Ready` and heartbeat updates.
* Tasks transition through the expected lifecycle.
* `/metrics` returns Prometheus text output with `orloj_*` metrics.

### Failure and Recovery Expectations

* Worker crash: lease expires and another worker can claim.
* Retry behavior: delayed requeue until success or dead-letter.
* Policy/graph validation failures: non-retryable, deterministic dead-letter.
* Tool runtime denials/errors: normalized metadata in trace/log paths.

### Observability

* Configure `OTEL_EXPORTER_OTLP_ENDPOINT` on both `orlojd` and `orlojworker` for distributed tracing.
* Prometheus scrapes `/metrics` on the `orlojd` HTTP port -- add the target to your Prometheus scrape config.
* Logs are structured JSON by default (`ORLOJ_LOG_FORMAT=json`) with `request_id` and `trace_id` fields.
* The web console Trace tab shows the task execution waterfall without any external backend.
* See [Observability](./observability.md) for full setup details.

### Reliability Operations

* Run `go run ./cmd/orloj-loadtest` for repeatable load/failure validation.
* Run `go run ./cmd/orloj-alertcheck` to validate retry/dead-letter thresholds.
* Keep alert and load profile thresholds aligned with SLO targets.

### Related Docs

* [Observability](./observability.md)
* [Deployment Overview](../deployment/index.md)
* [VPS Deployment](../deployment/vps.md)
* [Kubernetes Deployment](../deployment/kubernetes.md)
* [Configuration](./configuration.md)
* [Troubleshooting](./troubleshooting.md)
* [Upgrades and Rollbacks](./upgrades.md)

## Security and Isolation

This page describes current runtime security controls and expected operator practices.

### Current Controls

* `AgentPolicy` gates model/tool/token usage.
* `AgentRole` and `ToolPermission` enforce per-tool authorization.
* The tool runtime enforces timeout/retry/isolation policy from `Tool.spec.runtime`.
* Unsupported tools and disallowed runtime requests fail closed.
* Permission denials are terminal for the current execution path.

### Namespace Isolation

Namespaces are an **organizational boundary**, not a security boundary. Any authenticated user with the correct role (e.g., `reader`, `writer`, `admin`) can access resources in any namespace.
There is no per-namespace access control in the OSS build. For deployments that require per-namespace or per-resource authorization, the server exposes a `ResourceAuthorizer` extension point (see `ServerOptions.ResourceAuthorizer` in `api/auth_context.go`). An enterprise RBAC layer can implement this interface to enforce fine-grained policies based on the caller's identity, the target namespace, resource type, and HTTP method. In the OSS build this hook is nil and all requests that pass the role check are permitted. ### Control plane API tokens The HTTP API (including `orlojctl`) authenticates automation with **`Authorization: Bearer `** when you enable token validation on the server. Orloj **does not** mint or email API keys: the **operator** chooses a secret string, configures it on `orlojd`, and distributes the **same** value to people and CI that need API access. **See also:** [Remote CLI and API access](../deployment/remote-cli-access.md) — end-to-end flow for self-hosters (env vars, `orlojctl config`, `config.json` lifecycle). This is separate from **native UI sign-in** (`--auth-mode=native`), which uses an admin username/password and **session cookies** in the browser. The CLI does not use that password for API calls; use a bearer token as below (or run with auth disabled in trusted dev environments only). #### 1. Generate a token Use a cryptographically random value (length is flexible; treat it like a password): ```bash # Hex (64 characters); easy to paste into env files openssl rand -hex 32 # Or base64 (~44 characters) openssl rand -base64 32 ``` Store the output in your secrets manager, Kubernetes Secret, or password manager—**not** in git. #### 2. 
Configure the server Pick **one** of these (same token string you generated): * **Flag:** `orlojd --api-key='<token>'` * **Environment:** `ORLOJ_API_TOKEN='<token>'` (also read when `--api-key` is unset; see server help) For **multiple** distinct tokens with different roles (reader vs admin-style access), use: ```bash export ORLOJ_API_TOKENS='reader-token-here:reader,automation-token-here:admin' ``` Format is comma-separated `token:role` pairs. When `ORLOJ_API_TOKENS` is set, it populates the token map and a single `ORLOJ_API_TOKEN` is only used if that list is empty (see `loadAuthConfig` in `api/authz.go`). #### 3. Configure clients (`orlojctl` and automation) Use the **same** token the server expects: * **Environment:** `ORLOJ_API_TOKEN` or `ORLOJCTL_API_TOKEN` * **Flag:** `orlojctl --api-token '<token>' ...` * **Profile:** `orlojctl config set-profile ... --token-env VAR` so the token stays in the environment, not on disk See [Remote CLI and API access](../deployment/remote-cli-access.md) for client precedence, default `--server` resolution, and profiles. #### 4. Native auth mode and APIs If you use `--auth-mode=native`, protected API routes still require a bearer token (or session cookie). Configure `ORLOJ_API_TOKEN` / `--api-key` on the server so `orlojctl` and other API clients can authenticate with `Authorization: Bearer` -- the admin password alone is not used for programmatic access. #### 5. Initial setup protection When deploying with `--auth-mode=native` on a network-exposed instance, set `ORLOJ_SETUP_TOKEN` to prevent unauthorized admin account creation. When this variable is set, the `/v1/auth/setup` endpoint requires a matching `setup_token` field in the JSON request body: ```json { "username": "admin", "password": "...", "setup_token": "your-setup-token-here" } ``` The comparison is constant-time to prevent timing side-channels. Without `ORLOJ_SETUP_TOKEN`, the setup endpoint is open to the first caller (protected only by rate limiting). #### 6.
Authentication rate limiting Authentication endpoints (`/v1/auth/login`, `/v1/auth/setup`, `/v1/auth/change-password`, `/v1/auth/admin/reset-password`) are rate-limited per client IP address. The default policy allows 10 requests per minute sustained with a burst of 20 to accommodate legitimate multi-step flows. Requests that exceed the limit receive HTTP 429. ### Tool Types All tool types (`http`, `external`, `grpc`, `webhook-callback`) flow through the governed runtime pipeline, so policy enforcement, retry, auth injection, and error handling behave identically regardless of transport. See [Tools and Isolation](../concepts/tools-and-isolation.md) for type details. #### gRPC TLS gRPC tool connections require TLS (minimum TLS 1.2) by default. Plaintext gRPC is available as an opt-in for development environments only. Production deployments should always use the default TLS transport. #### SSRF Protection Outbound HTTP, gRPC, and MCP connections validate the target endpoint before connecting. Requests to the following IP ranges are blocked by default: * Loopback addresses (`127.0.0.0/8`, `::1`) * Link-local addresses (`169.254.0.0/16`, `fe80::/10`) * Cloud metadata endpoints (`169.254.169.254`) Private network addresses (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `fc00::/7`) are also blocked unless explicitly allowed. These checks apply when the host is a literal IP address; hostname-based endpoints are validated at the network dialer level. ### Isolation Modes * `none` -- direct execution with real HTTP/gRPC calls (no isolation boundary) * `sandboxed` -- restricted container with secure defaults (see below) * `container` -- per-invocation isolated container * `wasm` -- WebAssembly module with host-guest stdin/stdout boundary The container backend supports constrained execution for high-risk paths. The WASM backend uses executor-factory boundaries and command-backed runtime execution (default runtime binary `wasmtime`).
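The WASM host-guest boundary above is a plain JSON exchange over stdin/stdout. The following Python sketch illustrates only that I/O pattern -- the guest here is a stand-in subprocess rather than a real `wasmtime` module, and the envelope fields are simplified assumptions, not the full module contract:

```python
import json
import subprocess
import sys

# Stand-in "guest": reads one request envelope on stdin, writes one response
# envelope to stdout. A real module would run under wasmtime; this only
# demonstrates the host-guest stdin/stdout boundary.
guest = (
    "import json,sys;"
    "req=json.load(sys.stdin);"
    "json.dump({'request_id':req['request_id'],'status':'ok'},sys.stdout)"
)

request = {"request_id": "req-1", "tool": "demo", "parameters": {}}
proc = subprocess.run(
    [sys.executable, "-c", guest],
    input=json.dumps(request),
    capture_output=True,
    text=True,
)

# The host parses the guest's stdout as the response envelope.
response = json.loads(proc.stdout)
print(response["status"])  # -> ok
```

The `request_id` round-trip shown here mirrors the conformance requirement that backends preserve `request_id` in responses.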
Invalid wasm runtime configuration fails closed with deterministic non-retryable policy errors. #### Sandboxed Container Defaults When `isolation_mode=sandboxed` (the default for `high`/`critical` risk tools), the container backend enforces these security constraints: | Control | Value | | -------------------- | ---------------------------------- | | Filesystem | `--read-only` | | Linux capabilities | `--cap-drop=ALL` | | Privilege escalation | `--security-opt no-new-privileges` | | Network | `--network none` | | User | `65532:65532` (non-root) | | Memory | `128m` | | CPU | `0.50` cores | | Process limit | `64` PIDs | These defaults are enforced by `SandboxedContainerDefaults()` in the runtime and validated by conformance tests. Override with `--tool-container-*` flags only when necessary. ### Secret Handling Orloj resolves secrets referenced by `secretRef` fields (on ModelEndpoint and Tool resources) using a chain of resolvers, tried in order: 1. **Resource Store** -- looks up a `Secret` resource by name and reads the base64-encoded value from `spec.data`. 2. **Environment Variables** -- looks up `ORLOJ_SECRET_<NAME>` (configurable prefix via `--model-secret-env-prefix` / `--tool-secret-env-prefix`). The first resolver that returns a value wins. #### Development Use `Secret` resources for local development. The fastest way is the imperative CLI command -- no YAML file needed: ```bash orlojctl create secret openai-api-key --from-literal value=sk-your-key-here ``` Or with a YAML manifest: ```yaml apiVersion: orloj.dev/v1 kind: Secret metadata: name: openai-api-key spec: stringData: value: sk-your-key-here ``` #### Encryption at Rest When using the Postgres storage backend, enable encryption at rest for `Secret` resources by passing a 256-bit AES key to both `orlojd` and `orlojworker`: ```bash # Generate a key (hex-encoded, 64 characters) openssl rand -hex 32 # Pass via flag orlojd --secret-encryption-key=<key> ... orlojworker --secret-encryption-key=<key> ...
# Or via environment variable export ORLOJ_SECRET_ENCRYPTION_KEY=<key> ``` When enabled, all `Secret.spec.data` values are encrypted with AES-256-GCM before being written to the database and decrypted transparently on read. This protects secrets against direct database access, backup exposure, and log/dump leaks. The key must be identical across all server and worker processes that share the same database. Both hex-encoded (64 characters) and base64-encoded (44 characters) formats are accepted. **Without** an encryption key, `Secret` data is stored as base64-encoded plaintext in the JSONB payload -- suitable for development but not for production. #### Production For production, choose one or more of the following approaches: **1. Encrypted Secret resources** -- enable `--secret-encryption-key` and continue using `Secret` resources as in development. This is the simplest upgrade path. **2. Environment variables** -- bypass `Secret` resources entirely by injecting provider keys into the runtime environment: ```bash export ORLOJ_SECRET_openai_api_key="sk-prod-key" ``` The resolver normalizes the secret name: a `secretRef: openai-api-key` looks up `ORLOJ_SECRET_openai_api_key` (hyphens become underscores). **3. External secret managers** -- inject secrets as environment variables using your platform's native mechanism: * **Kubernetes**: Use [external-secrets-operator](https://external-secrets.io/) or the CSI secrets driver to sync Vault/AWS Secrets Manager/GCP Secret Manager values into pod env vars. * **HashiCorp Vault**: Use a [Vault Agent](https://developer.hashicorp.com/vault/docs/agent-and-proxy/agent) sidecar to render secrets into env or files. * **Cloud providers**: Use AWS Secrets Manager, GCP Secret Manager, or Azure Key Vault with their respective injection mechanisms. Approaches 2 and 3 do not require `Secret` resources -- the env-var resolver handles resolution directly. #### API Redaction The REST API never returns plaintext secret data.
All `GET` responses for `Secret` resources replace every value in `spec.data` with `"***"`. This applies to both individual resource fetches and list responses. Secret data is write-only through the API; to verify a secret value, exercise a resource that references it (e.g., test a model endpoint or tool that depends on it). Event bus messages for secret create/update operations are also redacted before publication. #### Security Requirements * Raw secret values must not appear in logs or trace payloads. * Store the encryption key itself in a secure location (e.g., a KMS, Vault, or hardware security module). Do not commit it to version control. * Validate redaction behavior during incident drills. ### Tool Auth Profiles Tools can authenticate using one of four profiles via `spec.auth.profile`: | Profile | Suitable for | Notes | | --------------------------- | ------------------------------------------------- | ------------------------------------------------------------------------------------ | | `bearer` (default) | API tokens, service keys | Injected as `Authorization: Bearer <token>` | | `api_key_header` | APIs using custom header auth (e.g., `X-Api-Key`) | Requires `auth.headerName` | | `basic` | Legacy HTTP basic auth | Secret must be `username:password` | | `oauth2_client_credentials` | Machine-to-machine OAuth2 | Requires `auth.tokenURL`; uses multi-key secret with `client_id` and `client_secret` | #### Auth in Container Isolation For container-isolated tools, auth is injected as environment variables rather than HTTP headers.
The container's entrypoint script maps these to the appropriate `curl` headers: | Env Var | Auth Profile | | -------------------------------------------------- | ------------------------------------- | | `TOOL_AUTH_BEARER` | `bearer`, `oauth2_client_credentials` | | `TOOL_AUTH_BASIC` | `basic` | | `TOOL_AUTH_HEADER_NAME` + `TOOL_AUTH_HEADER_VALUE` | `api_key_header` | #### Auth Error Handling Auth failures produce distinct error codes (`auth_invalid` for HTTP 401, `auth_forbidden` for HTTP 403) that are non-retryable. For `oauth2_client_credentials`, a 401 triggers automatic token cache eviction and one retry with a fresh token. #### Auth Audit Trail Every tool invocation records `tool_auth_profile` and `tool_auth_secret_ref` (the secret name, not the resolved value) in the task trace. Use these fields for audit queries and compliance reporting. ### Risk-Tier Routing and Approvals Tools can declare operation classes (`read`, `write`, `delete`, `admin`) via `spec.operation_classes`. Policy rules in `ToolPermission.spec.operation_rules` define per-class verdicts: `allow`, `deny`, or `approval_required`. When a tool call triggers `approval_required`: * The task enters `WaitingApproval` phase. * A `ToolApproval` resource is created for the pending decision. * An operator approves or denies via the REST API. * Approval outcomes produce `approval_pending`, `approval_denied`, or `approval_timeout` error codes. All approval-related outcomes are non-retryable and do not consume retry budget. #### Operational Guidance * Use `operation_rules` with `verdict: approval_required` for destructive operations (`delete`, `admin`) in production environments. * Set appropriate TTLs on `ToolApproval` resources (default: 10 minutes) to prevent tasks from waiting indefinitely. * Monitor `WaitingApproval` task counts and approval latencies to detect bottlenecks. ### Operational Requirements * Enforce least-privilege tool permissions. * Monitor denial and runtime policy error trends. 
* Monitor auth failure rates by profile for early detection of expired credentials. * Monitor approval request volume and response latency for `WaitingApproval` tasks. ### Related Docs * [Tool Contract v1](../reference/tool-contract-v1.md) * [Tool Runtime Conformance](./tool-runtime-conformance.md) ## Task Scheduling (Cron) Use `TaskSchedule` to create recurring run tasks from a task template. ### Purpose `TaskSchedule` evaluates a 5-field cron expression and creates a new `Task` run from a template task (`spec.mode=template`). ### Before You Begin * `orlojd` is running (scheduler/controller active). * At least one worker is available for execution. * The target `Task` template exists and sets `spec.mode: template`. ### 1. Apply a Task Template ```bash go run ./cmd/orlojctl apply -f examples/resources/tasks/weekly_report_template_task.yaml ``` Template reference used by schedules: * `metadata.name`: `weekly-report-template` * `spec.mode`: `template` ### 2. Apply a Schedule Example schedule resource: * [`examples/resources/task-schedules/weekly_report_schedule.yaml`](../../../examples/resources/task-schedules/weekly_report_schedule.yaml) Apply it: ```bash go run ./cmd/orlojctl apply -f examples/resources/task-schedules/weekly_report_schedule.yaml ``` Key fields: * `spec.task_ref`: template task name (`name` or `namespace/name`) * `spec.schedule`: 5-field cron expression (for example, `0 9 * * 1`) * `spec.time_zone`: IANA timezone (for example, `America/Chicago`) * `spec.concurrency_policy`: v1 supports `forbid` * `spec.starting_deadline_seconds`: lateness window before a missed slot is skipped ### 3. Verify Schedule State List schedules: ```bash go run ./cmd/orlojctl get task-schedules ``` Inspect schedule status directly: ```bash curl -s "http://127.0.0.1:8080/v1/task-schedules/weekly-report?namespace=default" | jq .status ``` Important status fields: * `nextScheduleTime` * `lastScheduleTime` * `lastTriggeredTask` * `activeRuns` ### 4. 
Verify Triggered Run Tasks ```bash go run ./cmd/orlojctl get tasks ``` Generated run tasks are labeled with schedule metadata: * `orloj.dev/task-schedule` * `orloj.dev/task-schedule-namespace` * `orloj.dev/task-schedule-slot` ### Common Controls * Pause scheduling: set `spec.suspend: true` * Resume scheduling: set `spec.suspend: false` * Retention: tune `successful_history_limit` and `failed_history_limit` ### Troubleshooting * If no tasks are created, verify `spec.task_ref` points to an existing template task. * If schedule status is `Error`, inspect `.status.lastError` and timezone/cron syntax. * If runs are skipped, check `starting_deadline_seconds` and `concurrency_policy` behavior. ### Related Docs * [Resource Reference (`TaskSchedule`)](../reference/resources.md) * [API Reference](../reference/api.md) * [Troubleshooting](./troubleshooting.md) ## Tool Runtime Conformance Status: release-candidate specification for Gate 0 contract stabilization. ### Purpose Define pass/fail criteria each tool runtime backend must satisfy before production use. Backends in scope: * `native` * `container` * `wasm` * `external` ### Required Test Groups #### 1. Contract Compliance * accepts `tool_contract_version=v1` * rejects unknown major versions with deterministic non-retryable error * preserves `request_id` in response * returns required fields for `ok`, `error`, and `denied` * emits canonical failure metadata: `tool_status`, `tool_code`, `tool_reason`, `retryable` #### 2. Timeout and Cancellation * enforces `timeout_ms` * timeout maps to `code=timeout`, `reason=tool_execution_timeout`, `retryable=true` * honors cancellation signals * cancellation maps to `code=canceled`, `reason=tool_execution_canceled` #### 3. Retry Semantics * deterministic retryable/non-retryable mapping * preserves `usage.attempt` * policy/permission denials are non-retryable by default * preserves explicit `retryable` metadata through wrappers/controllers #### 4. 
Auth and Secret Handling * resolves `auth.secret_ref` from namespace scope * never logs raw secret values * emits deterministic auth error code/reason for missing/invalid secrets * supports auth profiles (`api_key_header`, `bearer_token`, and others) #### 5. Policy and Permission Hooks * capability and risk metadata are available for policy checks * denied calls return `status=denied` with normalized reason * permission denials map to `permission_denied`, `tool_permission_denied`, `retryable=false` #### 6. Isolation and Resource Boundaries * honors configured memory/cpu/pids/network constraints where applicable * blocks runtime escape attempts with deterministic classification * documents and tests backend-specific isolation behavior #### 7. Observability * emits start/end lifecycle records * includes trace correlation fields * records duration and terminal status * records normalized failure reason for failed/denied calls #### 8. Determinism and Replay Safety * idempotent behavior for repeated request delivery with same idempotency key * consistent failure mapping across repeated runs * no hidden mutable global state affecting contract semantics ### Exit Criteria A backend is conformance-ready when: * all required test groups pass * known limitations are documented * policy/auth/observability hooks are verified in integration tests ### Current Harness Implementation * shared harness: `runtime/conformance/harness.go` * canonical case catalog: `runtime/conformance/cases/catalog.go` * base tests: `runtime/conformance/harness_test.go` * backend registration hooks: `runtime/tool_isolation_backend_registry.go` * backend suites: * `TestGovernedToolRuntimeConformanceSuite` * `TestContainerToolRuntimeConformanceSuite` * `TestWASMStubRuntimeFailsClosed` * `TestWASMRuntimeScaffoldConformanceSuite` * `runtime/tool_runtime_wasm_command_executor_test.go` Run: ```bash GOCACHE=/tmp/go-build go test ./runtime ./runtime/conformance/... 
``` ## Troubleshooting Use this page for deterministic diagnosis and remediation of common failures. ### First Checks ```bash curl -s http://127.0.0.1:8080/healthz | jq . go run ./cmd/orlojctl get workers go run ./cmd/orlojctl get tasks ``` If these checks fail, inspect `orlojd` and `orlojworker` logs first. ### Common Issues #### `postgres backend selected but --postgres-dsn is empty` Cause: * `--storage-backend=postgres` is set without DSN. Fix: ```bash export ORLOJ_POSTGRES_DSN='postgres://orloj:orloj@127.0.0.1:5432/orloj?sslmode=disable' ``` #### Unsupported backend values Cause: * invalid value for storage/event/message/tool-isolation backend flags. Fix: * storage: `memory|postgres` * event bus (`orlojd`): `memory|nats` * runtime message bus: `none|memory|nats-jetstream` * tool isolation: `none|container|wasm` #### Workers never claim tasks Checks: * worker is `Ready` and heartbeating * execution mode matches deployment mode * model provider/auth is valid * task requirements (`region`, `gpu`, `model`) match worker capabilities Commands: ```bash go run ./cmd/orlojctl get workers go run ./cmd/orlojctl get tasks go run ./cmd/orlojctl trace task <task-name> ``` #### Message-driven flow not progressing Cause: * worker consumer is not enabled. Fix: * set `--agent-message-consume` * set non-`none` `--agent-message-bus-backend` #### Tool calls fail with permission denials Cause: * governance policy denies requested action. Fix: * validate `Agent.spec.roles`, `AgentRole`, and `ToolPermission`. * inspect `tool_code`, `tool_reason`, and `retryable` in trace metadata. #### Model provider auth failures Cause: * missing or invalid provider key. Fix: * set `--model-gateway-api-key` or provider env var (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `AZURE_OPENAI_API_KEY`). #### Wasm/container runtime errors Cause: * missing runtime binary/module path or invalid runtime configuration. Fix: * verify backend-specific settings (container runtime settings or wasm module/runtime configuration).
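When triaging tool-call failures, the normalized trace metadata (`tool_code`, `tool_reason`, `retryable`) is usually enough to separate terminal governance denials from transient faults. A minimal Python sketch over a hypothetical trace excerpt (the entries are illustrative, not real output):

```python
# Hypothetical Task.status.trace excerpt using the normalized tool metadata
# fields documented on this page: tool_code, tool_reason, retryable.
trace = [
    {"tool_code": "rate_limited", "tool_reason": "upstream rate limit", "retryable": True},
    {"tool_code": "permission_denied", "tool_reason": "tool_permission_denied", "retryable": False},
]

# Non-retryable entries point at governance/configuration problems rather
# than transient faults, so surface them first during triage.
terminal = [e for e in trace if not e["retryable"]]
for e in terminal:
    print(f"{e['tool_code']}: {e['tool_reason']}")
```

Retryable entries, by contrast, are already handled by the delayed-requeue retry path and only need attention when they exhaust the retry budget.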
### Observability Diagnostics #### Logs are unstructured or missing request IDs Cause: * `ORLOJ_LOG_FORMAT` is not set or the binary predates the structured logging migration. Fix: * Set `ORLOJ_LOG_FORMAT=json` (default) to emit structured JSON logs with `request_id`, `trace_id`, and `span_id` fields. * Set `ORLOJ_LOG_FORMAT=text` for human-readable output during local development. #### Traces not appearing in Jaeger/Tempo Cause: * `OTEL_EXPORTER_OTLP_ENDPOINT` is not set or the backend is unreachable. Fix: ```bash export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317 export OTEL_EXPORTER_OTLP_INSECURE=true # for non-TLS dev backends ``` Restart `orlojd` and `orlojworker`. Verify spans appear in the backend UI. #### Prometheus `/metrics` returning 404 Cause: * Running a build that predates the metrics endpoint addition. Fix: * Rebuild from the latest source and verify `curl http://127.0.0.1:8080/metrics` returns Prometheus text output. #### Correlating a log entry with a trace Use the `trace_id` field from a JSON log entry to search in your tracing backend: ```bash # Find trace ID in logs grep '"trace_id"' /var/log/orlojd.log | head -5 ``` Then search for that trace ID in Jaeger, Tempo, or the web console Trace tab. ### Escalation Workflow 1. Capture the failing command and exact error text. 2. Capture the task trace: ```bash go run ./cmd/orlojctl trace task <task-name> ``` 3. Capture recent events: ```bash go run ./cmd/orlojctl events --once --timeout=30s --raw ``` 4. Capture relevant Prometheus metrics (if applicable): ```bash curl -s http://127.0.0.1:8080/metrics | grep orloj_ ``` 5. File an issue with logs, trace, metrics, and manifest snippets. ## Upgrades and Rollbacks This guide defines safe upgrade and rollback procedures for the Orloj server and workers.
### Principles * prefer staged rollouts over full replacement * take Postgres backups before upgrades * validate reliability gates before production promotion * couple release behavior with contract documentation ### Pre-Upgrade Checklist * [ ] Read release notes and migration notes. * [ ] Take a full Postgres backup per the [Backup and Restore](backup-restore.md) guide. * [ ] Record the current `ORLOJ_SECRET_ENCRYPTION_KEY`. * [ ] Verify baseline health (`/healthz`, workers, task flow). * [ ] Run smoke checks in staging. ### Upgrade Procedure 1. Upgrade `orlojd` in staging. 2. Verify API health and resource status. 3. Upgrade one worker (canary). 4. Validate task execution paths used by your deployment. 5. Upgrade remaining workers. 6. Run reliability checks: * `orloj-loadtest` * `orloj-alertcheck` ### Production Rollout * canary one server instance and one worker first * monitor task success/dead-letter ratio, retry volume, p95 latency, heartbeat stability ### Rollback Triggers * server health degradation * retry/dead-letter rates exceed SLO thresholds * unexpected increase in non-retryable runtime/policy failures ### Rollback Procedure 1. Revert server and worker binaries to previous release. 2. Restore previous configuration values. 3. Restore Postgres from backup if required (see [Backup and Restore](backup-restore.md)). 4. Re-run smoke checks before resuming rollout. ### Compatibility Guidance * keep compatibility checks green for pinned downstream consumers * avoid unversioned breaking changes on public contracts * treat contract graduation and lifecycle changes as release events ### Validation Commands ```bash curl -s http://127.0.0.1:8080/healthz | jq . 
go run ./cmd/orlojctl get workers go run ./cmd/orlojctl get tasks go run ./cmd/orloj-loadtest --quality-profile=monitoring/loadtest/quality-default.json --tasks=50 go run ./cmd/orloj-alertcheck --profile=monitoring/alerts/retry-deadletter-default.json ``` ## Webhook Triggers Use `TaskWebhook` to trigger task runs from signed external HTTP events. ### Purpose `TaskWebhook` receives inbound deliveries on a generated endpoint, validates signature/auth and idempotency, and creates a run task from a template task. ### Before You Begin * `orlojd` is running. * A template task exists with `spec.mode: template`. * A secret exists for HMAC signing. ### 1. Apply Prerequisites Apply template task: ```bash go run ./cmd/orlojctl apply -f examples/resources/tasks/weekly_report_template_task.yaml ``` Apply webhook signing secret: ```bash go run ./cmd/orlojctl apply -f examples/resources/secrets/webhook_shared_secret.yaml ``` ### 2. Apply a Webhook Resource Generic profile example: * [`examples/resources/task-webhooks/generic_webhook.yaml`](../../../examples/resources/task-webhooks/generic_webhook.yaml) GitHub profile example: * [`examples/resources/task-webhooks/github_push_webhook.yaml`](../../../examples/resources/task-webhooks/github_push_webhook.yaml) Apply one: ```bash go run ./cmd/orlojctl apply -f examples/resources/task-webhooks/generic_webhook.yaml ``` ### 3. Get the Delivery Endpoint ```bash curl -s "http://127.0.0.1:8080/v1/task-webhooks/report-generic-webhook?namespace=default" | jq -r '.status.endpointPath' ``` `endpointPath` maps to: * `POST /v1/webhook-deliveries/{endpoint_id}` ### 4. 
Send a Signed Test Delivery (Generic Profile) ```bash BODY='{"event":"report.trigger","topic":"AI startups"}' TS="$(date +%s)" SECRET='replace-me' SIG_HEX="$(printf '%s' "${TS}.${BODY}" | openssl dgst -sha256 -hmac "$SECRET" -binary | xxd -p -c 256)" curl -i -X POST "http://127.0.0.1:8080$(curl -s "http://127.0.0.1:8080/v1/task-webhooks/report-generic-webhook?namespace=default" | jq -r '.status.endpointPath')" \ -H "Content-Type: application/json" \ -H "X-Timestamp: ${TS}" \ -H "X-Event-Id: evt-001" \ -H "X-Signature: sha256=${SIG_HEX}" \ --data "$BODY" ``` Expected response: * HTTP `202 Accepted` * JSON with `accepted: true` * `duplicate: false` on first delivery ### 4b. Send a Signed Test Delivery (GitHub Profile) ```bash BODY='{"ref":"refs/heads/main","repository":{"full_name":"acme/repo"}}' SECRET='replace-me' SIG_HEX="$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$SECRET" -binary | xxd -p -c 256)" curl -i -X POST "http://127.0.0.1:8080$(curl -s "http://127.0.0.1:8080/v1/task-webhooks/report-github-push?namespace=default" | jq -r '.status.endpointPath')" \ -H "Content-Type: application/json" \ -H "X-GitHub-Delivery: gh-evt-001" \ -H "X-Hub-Signature-256: sha256=${SIG_HEX}" \ --data "$BODY" ``` ### 5. Verify Task Creation ```bash go run ./cmd/orlojctl get tasks ``` Webhook-triggered run tasks include: * `webhook_payload` * `webhook_event_id` * `webhook_received_at` * `webhook_source` ### Profile Notes * `generic`: signs `timestamp + "." + rawBody` and checks timestamp skew. Default dedup window is 24 hours. * `github`: signs raw body and uses GitHub delivery id header defaults. Default dedup window is **72 hours** (vs 24h for generic) because GitHub webhooks do not include a timestamp in the HMAC payload, so replay protection relies entirely on event ID deduplication. The 72-hour window matches GitHub's maximum retry window. ### Rotation and Operations * Secret rotation: update the referenced `Secret`; keep the webhook resource unchanged.
* Endpoint rotation: recreate the webhook with a new `metadata.name` and update the sender URL. * Duplicate deliveries return `202` with `duplicate: true`. ### Troubleshooting * `401 signature verification failed`: verify the signature algorithm, prefix (`sha256=`), and secret. * `400 missing event id`: include the configured event id header (`X-Event-Id` or `X-GitHub-Delivery`). * `400 webhook task creation failed`: the webhook was authenticated and deduplicated successfully, but task creation failed (e.g., the referenced task is not a template, or validation failed). The HTTP response returns a generic message; the detailed error is recorded in `status.lastError` on the `TaskWebhook` resource. Inspect it with `orlojctl get task-webhook <name>`. * `404 webhook endpoint not found`: verify the current `.status.endpointPath`. ### Related Docs * [Task Webhook Examples](../../../examples/resources/task-webhooks/README.md) * [Resource Reference (`TaskWebhook`)](../reference/resources.md) * [API Reference (Webhook Delivery)](../reference/api.md) ## Build a Custom Tool This guide is for developers who need to extend agent capabilities by implementing a custom tool. You will implement the Tool Contract v1, register the tool as a resource, configure isolation and retry, and validate it with the conformance harness. ### Prerequisites * Orloj server (`orlojd`) and at least one worker running * `orlojctl` available * Familiarity with the [Tools and Isolation](../concepts/tools-and-isolation.md) concepts ### What You Will Build A custom HTTP tool that agents can invoke during execution, registered with Orloj and configured with appropriate runtime controls. ### Step 1: Implement the Tool Contract Every tool must accept a JSON request envelope and return a JSON response envelope. This is the [Tool Contract v1](../reference/tool-contract-v1.md).
**Request** (sent by the Orloj runtime to your tool): ```json { "request_id": "req-abc-123", "tool": "my-custom-tool", "action": "invoke", "parameters": { "query": "example input" }, "auth": { "type": "bearer", "token": "sk-..." }, "context": { "task": "weekly-report", "agent": "research-agent", "attempt": 1 } } ``` **Success response** (returned by your tool): ```json { "request_id": "req-abc-123", "status": "success", "result": { "data": "your tool output here" } } ``` **Error response** (for retryable failures): ```json { "request_id": "req-abc-123", "status": "error", "error": { "tool_code": "rate_limited", "tool_reason": "API rate limit exceeded", "retryable": true } } ``` The error taxonomy includes `tool_code` (machine-readable), `tool_reason` (human-readable), and `retryable` (boolean). The runtime uses `retryable` to decide whether to retry or move to dead-letter. ### Step 2: Register the Tool Create a Tool resource manifest: ```yaml apiVersion: orloj.dev/v1 kind: Tool metadata: name: my-custom-tool spec: type: http endpoint: https://your-tool-service.internal/invoke capabilities: - custom.query.invoke operation_classes: - read - write risk_level: medium runtime: timeout: 10s retry: max_attempts: 3 backoff: 1s max_backoff: 10s jitter: full auth: secretRef: my-tool-api-key ``` Apply: ```bash orlojctl apply -f my-custom-tool.yaml ``` #### Field Choices **`risk_level`** -- Determines the default isolation mode: * `low` / `medium`: defaults to `none` (direct execution) * `high` / `critical`: defaults to `sandboxed` **`operation_classes`** -- Declares the types of operations this tool performs. Valid values: `read`, `write`, `delete`, `admin`. Policy rules in `ToolPermission.operation_rules` can define per-class verdicts (`allow`, `deny`, `approval_required`). When omitted, defaults to `["read"]` for low/medium risk or `["write"]` for high/critical risk. 
**`runtime.timeout`** -- How long the runtime waits for your tool to respond before treating the invocation as failed. Choose based on your tool's expected latency. **`runtime.retry`** -- Configure retry behavior for transient failures. The `jitter: full` setting randomizes backoff intervals to prevent thundering herd effects when multiple agents hit the same tool. ### Step 3: Create a Secret and Configure Auth If your tool requires authentication, create a Secret and set the auth profile on the Tool. Orloj supports four auth profiles: #### Bearer token (default) ```yaml apiVersion: orloj.dev/v1 kind: Secret metadata: name: my-tool-api-key spec: stringData: value: your-api-key-here ``` ```yaml spec: auth: secretRef: my-tool-api-key ``` When `profile` is omitted, it defaults to `bearer`. The runtime injects `Authorization: Bearer <token>`. #### API key header ```yaml spec: auth: profile: api_key_header secretRef: my-tool-api-key headerName: X-Api-Key ``` The runtime injects the secret value as `X-Api-Key: <value>`. #### Basic auth ```yaml apiVersion: orloj.dev/v1 kind: Secret metadata: name: my-basic-creds spec: stringData: value: "username:password" ``` ```yaml spec: auth: profile: basic secretRef: my-basic-creds ``` The secret must contain `username:password`. The runtime base64-encodes it and injects `Authorization: Basic <credentials>`. #### OAuth2 client credentials ```yaml apiVersion: orloj.dev/v1 kind: Secret metadata: name: my-oauth-creds spec: stringData: client_id: your-client-id client_secret: your-client-secret ``` ```yaml spec: auth: profile: oauth2_client_credentials secretRef: my-oauth-creds tokenURL: https://auth.provider.com/oauth/token scopes: - read - write ``` The runtime exchanges client credentials for an access token, caches it with a TTL, and injects `Authorization: Bearer <access-token>`. Tokens are refreshed automatically on expiry or HTTP 401.
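The cache-then-evict behavior described for `oauth2_client_credentials` can be sketched as follows (illustrative Python, not the runtime's implementation; `fetch_token` stands in for the actual token-URL exchange):

```python
import time

# Illustrative token cache with TTL expiry and 401-driven eviction, mirroring
# the behavior described above for oauth2_client_credentials.
class TokenCache:
    def __init__(self, fetch_token):
        self.fetch_token = fetch_token  # performs the client-credentials exchange
        self.token = None
        self.expires_at = 0.0

    def get(self):
        # Exchange only when there is no cached token or the TTL has lapsed.
        if self.token is None or time.time() >= self.expires_at:
            self.token, ttl = self.fetch_token()
            self.expires_at = time.time() + ttl
        return self.token

    def evict(self):
        # Called when the tool endpoint returns HTTP 401: drop the cached
        # token so the single retry performs a fresh exchange.
        self.token = None

calls = []
def fetch_token():
    calls.append(1)
    return f"token-{len(calls)}", 3600.0

cache = TokenCache(fetch_token)
assert cache.get() == "token-1"
assert cache.get() == "token-1"   # served from cache, no new exchange
cache.evict()                     # simulate a 401 from the tool
assert cache.get() == "token-2"   # fresh exchange on the retry
```

The asserts trace the documented flow: one exchange up front, cache hits until expiry, and exactly one re-exchange after a 401 eviction.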
Apply: ```bash orlojctl apply -f my-tool-secret.yaml orlojctl apply -f my-custom-tool.yaml ``` ### Step 4: Grant Agent Access Add the tool to an agent's `tools` list: ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: research-agent spec: model_ref: openai-default tools: - web_search - my-custom-tool limits: max_steps: 6 timeout: 30s ``` If governance is enabled, you also need a ToolPermission and an AgentRole that grants the required permissions. See the [governance guide](./setup-governance.md) for details. ### Step 5: Choose a Tool Type The examples above use `type: http` (the default). If your tool uses a different transport or execution model, set `spec.type` accordingly. The tool type and isolation mode are independent -- any type can run under any isolation mode. #### External (standalone service) For tools that need the full Orloj execution context (task, agent, namespace, attempt): ```yaml spec: type: external endpoint: https://your-tool-service.internal/execute runtime: timeout: 30s ``` Your service receives the complete `ToolExecutionRequest` JSON envelope and must return a `ToolExecutionResponse`. This is the right choice when your tool is a dedicated microservice that makes decisions based on who called it and why. #### gRPC For tools that expose a gRPC service: ```yaml spec: type: grpc endpoint: your-grpc-service:50051 runtime: timeout: 15s ``` Implement the `orloj.tool.v1.ToolService/Execute` unary method. Payloads are the same `ToolExecutionRequest` / `ToolExecutionResponse` envelopes as `external`, marshaled as JSON over gRPC (no protobuf compilation needed). #### Webhook-Callback (async / long-running) For tools that take seconds-to-minutes to complete: ```yaml spec: type: webhook-callback endpoint: https://your-async-tool.internal/submit runtime: timeout: 120s ``` Execution flow: 1. Orloj POSTs a `ToolExecutionRequest` to your endpoint. 2. Your tool returns `202 Accepted` to acknowledge receipt. 3. 
Orloj polls `{endpoint}/{request_id}` at intervals until your tool returns a `ToolExecutionResponse` with a terminal status, or the timeout expires. 4. Alternatively, your tool can push the result to Orloj's callback delivery API instead of waiting for a poll. Use this for batch processing, CI triggers, human approval workflows, or any tool where the response isn't immediate. ### Step 6: Configure Isolation (Optional) For tools that run untrusted code or interact with sensitive resources, set an explicit isolation mode. This is independent of tool type. **Container isolation:** ```yaml spec: runtime: isolation_mode: container timeout: 15s ``` **WASM isolation** (for tools compiled to WebAssembly): ```yaml spec: type: wasm runtime: isolation_mode: wasm timeout: 5s ``` WASM tools communicate over stdin/stdout using the same JSON envelope. See the [WASM Tool Module Contract v1](../reference/wasm-tool-module-contract-v1.md) for the host-guest communication specification. **Sandboxed isolation** (secure-by-default container): ```yaml spec: risk_level: high runtime: isolation_mode: sandboxed ``` Sandboxed mode runs tools in a locked-down container: read-only filesystem, no capabilities, no privilege escalation, no network, non-root user, and strict memory/CPU/pids limits. This is the default for `high` and `critical` risk tools. ### Step 7: Validate with the Conformance Harness Orloj provides a tool runtime conformance harness that tests your tool against the contract specification. The harness covers eight test groups: 1. **Contract** -- request/response envelope validation 2. **Timeout** -- tool respects configured timeouts 3. **Retry** -- retryable errors trigger retry; non-retryable errors do not 4. **Auth** -- credentials are passed correctly 5. **Policy** -- governance denials are handled properly 6. **Isolation** -- isolation backends enforce boundaries 7. **Observability** -- trace metadata is propagated 8. 
**Determinism** -- identical inputs produce consistent outputs See [Tool Runtime Conformance](../operations/tool-runtime-conformance.md) for detailed instructions on running the harness. ### Next Steps * [Tool Contract v1](../reference/tool-contract-v1.md) -- full contract specification * [WASM Tool Module Contract v1](../reference/wasm-tool-module-contract-v1.md) -- WASM-specific contract * [Tools and Isolation](../concepts/tools-and-isolation.md) -- concept deep-dive * [Tool Runtime Conformance](../operations/tool-runtime-conformance.md) -- running the test harness * [Connect an MCP Server](./connect-mcp-server.md) -- for MCP-compatible tool servers instead of custom implementations ## Configure Model Routing This guide is for platform engineers who need to route agents to different model providers. You will set up ModelEndpoints for multiple providers, bind agents to endpoints by reference, and verify that requests route correctly. ### Prerequisites * Orloj server (`orlojd`) and at least one worker running * API keys for the providers you want to configure * `orlojctl` available ### What You Will Build A multi-provider setup where different agents route to different model providers: * A research agent using OpenAI's GPT-4o * A writer agent using Anthropic's Claude ### Step 1: Create Secrets for API Keys Each provider needs a Secret resource to hold its API key. 
The fastest way is the CLI -- no YAML file needed: ```bash orlojctl create secret openai-api-key --from-literal value=sk-your-openai-key-here orlojctl create secret anthropic-api-key --from-literal value=sk-ant-your-anthropic-key-here ``` Alternatively, use YAML manifests with `orlojctl apply -f`: ```yaml apiVersion: orloj.dev/v1 kind: Secret metadata: name: openai-api-key spec: stringData: value: sk-your-openai-key-here ``` > **Production note:** Enable `--secret-encryption-key` on `orlojd` and `orlojworker` to encrypt secret data at rest in the database, or use environment variables (`ORLOJ_SECRET_openai_api_key`) / an external secret manager. See [Security and Isolation](../operations/security.md#secret-handling) for details. ### Step 2: Create Model Endpoints **OpenAI endpoint:** ```yaml apiVersion: orloj.dev/v1 kind: ModelEndpoint metadata: name: openai-default spec: provider: openai base_url: https://api.openai.com/v1 default_model: gpt-4o-mini auth: secretRef: openai-api-key ``` **Anthropic endpoint:** ```yaml apiVersion: orloj.dev/v1 kind: ModelEndpoint metadata: name: anthropic-default spec: provider: anthropic base_url: https://api.anthropic.com/v1 default_model: claude-3-5-sonnet-latest options: anthropic_version: "2023-06-01" max_tokens: "1024" auth: secretRef: anthropic-api-key ``` Apply both: ```bash orlojctl apply -f openai_default.yaml orlojctl apply -f anthropic_default.yaml ``` Verify they are ready: ```bash orlojctl get model-endpoints ``` ### Step 3: Bind Agents to Endpoints Use `spec.model_ref` to point each agent at its ModelEndpoint: ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: research-agent spec: model_ref: openai-default prompt: | You are a research assistant. Produce concise evidence-backed answers. limits: max_steps: 6 timeout: 30s ``` ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: writer-agent spec: model_ref: anthropic-default prompt: | You are a writing agent. 
Produce clear, concise final output from provided research. limits: max_steps: 4 timeout: 20s ``` Apply: ```bash orlojctl apply -f research-agent.yaml orlojctl apply -f writer-agent.yaml ``` When these agents execute, the model gateway resolves their `model_ref` to the corresponding ModelEndpoint, then constructs provider-specific API requests using the endpoint's `base_url`, `default_model`, `options`, and auth credentials. ### Step 4: Verify Routing Submit a task that uses these agents and check the logs: ```bash orlojctl apply -f task.yaml orlojctl logs task/your-task-name ``` In the task trace, you should see model requests routing to the appropriate providers based on each agent's `model_ref`. ### Adding Azure OpenAI Azure OpenAI requires an explicit `base_url` and an `api_version` option: ```yaml apiVersion: orloj.dev/v1 kind: ModelEndpoint metadata: name: azure-openai-default spec: provider: azure-openai base_url: https://YOUR_RESOURCE_NAME.openai.azure.com default_model: gpt-4o-deployment options: api_version: "2024-10-21" auth: secretRef: azure-openai-api-key ``` ### Adding Ollama (Local Models) For local model inference with no API key required: ```yaml apiVersion: orloj.dev/v1 kind: ModelEndpoint metadata: name: ollama-default spec: provider: ollama base_url: http://127.0.0.1:11434 default_model: llama3.1 ``` ### Agent Requirement Agents must set `spec.model_ref` to a valid ModelEndpoint. ### Constraining Models with Policy To restrict which models agents can use, create an AgentPolicy with `allowed_models`: ```yaml apiVersion: orloj.dev/v1 kind: AgentPolicy metadata: name: cost-policy spec: allowed_models: - gpt-4o - claude-3-5-sonnet-latest max_tokens_per_run: 50000 ``` Agents configured with models not on this list will be denied at execution time. 
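A minimal sketch of how such an allowlist check could behave. This is illustrative only: treating an empty or absent `allowed_models` list as allow-all is an assumption on my part, not documented Orloj behavior.

```python
def model_allowed(model, allowed_models):
    """Return True if `model` passes the policy's allowlist.

    Assumption: no allowlist means no restriction; otherwise the
    resolved model name must appear verbatim in the list.
    """
    if not allowed_models:
        return True
    return model in allowed_models
```

With the `cost-policy` above, an agent resolving to `gpt-4o` would pass, while one resolving to `gpt-4o-mini` would be denied at execution time.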
### Next Steps * [Model Routing](../concepts/model-routing.md) -- deeper dive into ModelEndpoint configuration * [Configuration](../operations/configuration.md) -- environment variables and flags for model gateway setup * [Build a Custom Tool](./build-custom-tool.md) -- extend agent capabilities with external tools ## Connect an MCP Server This guide is for platform engineers who want to connect external MCP (Model Context Protocol) servers to Orloj. You will register an MCP server, verify tool discovery, selectively import tools, and assign them to agents. ### Prerequisites * Orloj server (`orlojd`) running with `--embedded-worker` * `orlojctl` available (or `go run ./cmd/orlojctl`) * An MCP server to connect (stdio-based or remote HTTP) If you have not set up Orloj yet, follow the [Install](../getting-started/install.md) and [Quickstart](../getting-started/quickstart.md) guides first. ### Background MCP servers are external processes or services that expose tools via the [Model Context Protocol](https://modelcontextprotocol.io/). Unlike regular Orloj tools (which are 1:1 -- one resource, one capability), a single MCP server can provide many tools. Orloj bridges this gap with the `McpServer` resource kind. When you register an MCP server, the controller: 1. Connects using the configured transport (stdio or Streamable HTTP) 2. Calls `tools/list` to discover available tools 3. Auto-generates a `Tool` resource (type=mcp) for each discovered tool 4. Keeps tools in sync on every reconcile cycle Generated tools are first-class `Tool` resources. Agents reference them by name just like any other tool. ### Step 1: Register an MCP Server (stdio) Stdio MCP servers run as child processes. Orloj spawns the process, communicates via stdin/stdout using JSON-RPC 2.0, and manages its lifecycle. 
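To make the wire format concrete, here is a sketch of the `tools/list` exchange a stdio client performs. The newline-delimited framing and the exact envelope fields shown are assumptions based on plain JSON-RPC 2.0, not a description of Orloj's internals:

```python
import json

def tools_list_request(request_id):
    """Build the JSON-RPC 2.0 `tools/list` request a client would write
    to the MCP server's stdin (newline-delimited framing assumed)."""
    msg = {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}
    return (json.dumps(msg) + "\n").encode()

def parse_tool_names(raw):
    """Extract tool names from a `tools/list` response envelope."""
    resp = json.loads(raw)
    return [t["name"] for t in resp["result"]["tools"]]
```

Each discovered name then becomes a `Tool` resource, as described in the next steps.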
Create a manifest (`github-mcp.yaml`): ```yaml apiVersion: orloj.dev/v1 kind: McpServer metadata: name: github-mcp spec: transport: stdio command: npx @github/mcp-server args: - "--token-from-env" env: - name: GITHUB_TOKEN secretRef: github-token ``` Create the secret for the token: ```yaml apiVersion: orloj.dev/v1 kind: Secret metadata: name: github-token spec: stringData: value: ghp_your_github_token_here ``` Apply both: ```bash orlojctl apply -f github-token-secret.yaml orlojctl apply -f github-mcp.yaml ``` ### Step 2: Register an MCP Server (HTTP) Remote MCP servers communicate over HTTP using the Streamable HTTP transport. Use this for MCP servers running as hosted services. ```yaml apiVersion: orloj.dev/v1 kind: McpServer metadata: name: remote-mcp spec: transport: http endpoint: https://mcp.example.com/rpc auth: secretRef: mcp-api-key profile: bearer ``` Apply: ```bash orlojctl apply -f remote-mcp.yaml ``` ### Step 3: Verify Tool Discovery After applying, check the McpServer status: ```bash orlojctl get mcp-servers ``` ``` NAME TRANSPORT STATUS TOOLS LAST_SYNCED github-mcp stdio Ready 12 2025-03-18T14:30:00Z remote-mcp http Ready 5 2025-03-18T14:30:05Z ``` List the auto-generated tools: ```bash orlojctl get tools ``` Each generated tool follows the naming convention `{server}--{mcp-tool-name}`: ``` NAME TYPE STATUS github-mcp--create-issue mcp Ready github-mcp--search-repos mcp Ready github-mcp--list-prs mcp Ready ... ``` Inspect a specific generated tool to see its rich schema: ```bash orlojctl get tools github-mcp--create-issue -o json ``` The tool's `spec.input_schema` is populated directly from the MCP server's `tools/list` response. The model gateway uses this schema when formatting tool definitions for the LLM, giving it structured parameter information instead of the generic `{input: string}` fallback. ### Step 4: Filter Tools (Optional) By default, all tools discovered from an MCP server are imported. 
If a server exposes many tools and you only need a subset, use `spec.tool_filter.include`: ```yaml apiVersion: orloj.dev/v1 kind: McpServer metadata: name: github-mcp spec: transport: stdio command: npx @github/mcp-server args: - "--token-from-env" env: - name: GITHUB_TOKEN secretRef: github-token tool_filter: include: - create_issue - search_repos ``` Only the listed tools will be generated as `Tool` resources. Tools not in the allowlist are still discovered (visible in `status.discoveredTools`) but are not imported. Re-apply the manifest to update: ```bash orlojctl apply -f github-mcp.yaml ``` Tools that were previously generated but are no longer in the allowlist are automatically deleted on the next reconcile cycle. ### Step 5: Assign Tools to an Agent Generated MCP tools are referenced by name in `agent.spec.tools`, exactly like any other tool: ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: github-agent spec: model_ref: openai-default prompt: | You are a GitHub assistant. Use your tools to help the user manage issues and search repositories. tools: - github-mcp--create-issue - github-mcp--search-repos limits: max_steps: 8 timeout: 60s ``` Apply and submit a task: ```bash orlojctl apply -f github-agent.yaml ``` When the agent runs, the LLM sees the rich tool schemas from the MCP server and can call tools with structured arguments. The `GovernedToolRuntime` automatically routes `type=mcp` tools through the `MCPToolRuntime`, which sends `tools/call` to the MCP server via the session manager. ### Step 6: Configure Reconnection (Optional) MCP server connections can drop. The `reconnect` policy controls how aggressively Orloj retries: ```yaml spec: reconnect: max_attempts: 5 backoff: 2s ``` Defaults: 3 attempts with 2s backoff. If all attempts fail, the McpServer enters the `Error` phase. The controller retries on the next reconcile cycle. 
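The reconcile behavior described above (allowlist filtering plus garbage collection of de-listed tools) can be sketched as a pure function. The underscore-to-hyphen normalization is inferred from the example names earlier in this guide, and is an assumption:

```python
def plan_tool_sync(server, discovered, include, existing):
    """Decide which Tool resources to create and which generated ones to delete.

    `discovered` are raw MCP tool names, `include` is the optional allowlist
    (empty means import everything), `existing` are currently generated
    resource names following the `{server}--{tool}` convention.
    """
    # Filter on raw MCP names first, then derive resource names.
    kept = [t for t in discovered if not include or t in include]
    wanted = {f"{server}--{t.replace('_', '-')}" for t in kept}
    to_create = sorted(wanted - set(existing))
    to_delete = sorted(set(existing) - wanted)
    return to_create, to_delete
```

Running this against the `github-mcp` example would keep the two allowlisted tools and schedule any previously generated, now de-listed tool for deletion on the next reconcile.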
### How It Works The data flow for an MCP tool call: ``` Agent step → GovernedToolRuntime (policy, timeout, retry) → MCPToolRuntime (resolves mcp_server_ref) → McpSessionManager (connection pool) → McpTransport (stdio or HTTP) → MCP Server (tools/call JSON-RPC 2.0) ``` Key implementation details: * **Session pooling**: One session per McpServer. Sessions are reused across tool calls and reconcile cycles. * **Schema propagation**: `spec.input_schema` and `spec.description` from tool discovery flow through to the model gateway, so the LLM gets rich parameter definitions. * **Garbage collection**: Generated tools carry an `orloj.dev/mcp-server` label. When an MCP server is deleted, all its generated tools are cleaned up. * **Governance**: MCP tools participate in the full governance pipeline. You can create `ToolPermission` and `AgentRole` resources for them, same as any other tool. ### McpServer Spec Reference | Field | Description | | ------------------------ | ----------------------------------------------------------------------------------------------------------------------- | | `transport` | **Required**. `stdio` or `http`. | | `command` | stdio: command to spawn the MCP server process. | | `args` | stdio: command arguments. | | `env` | stdio: environment variables. Each entry has `name`, `value` (literal), or `secretRef` (resolved from Secret resource). | | `endpoint` | http: the MCP server URL. | | `auth.secretRef` | http: secret for authentication. | | `auth.profile` | http: auth profile (`bearer`, `api_key_header`). Defaults to `bearer`. | | `tool_filter.include` | Optional allowlist of MCP tool names to import. When empty, all tools are imported. | | `reconnect.max_attempts` | Max reconnection attempts. Defaults to 3. | | `reconnect.backoff` | Backoff duration between attempts. Defaults to `2s`. 
| #### Status Fields | Field | Description | | ----------------- | ------------------------------------------------------- | | `phase` | `Pending`, `Connecting`, `Ready`, or `Error`. | | `discoveredTools` | All tool names from `tools/list`, regardless of filter. | | `generatedTools` | Tool resource names actually created. | | `lastSyncedAt` | Timestamp of last successful reconcile. | | `lastError` | Last error message, if any. | ### Next Steps * [Tools and Isolation](../concepts/tools-and-isolation.md) -- concept deep-dive on tool types and isolation modes * [Build a Custom Tool](./build-custom-tool.md) -- for non-MCP tools that need custom implementation * [Set Up Multi-Agent Governance](./setup-governance.md) -- enforce authorization on MCP tools * [Resource Reference](../reference/resources.md) -- full spec for all resource kinds ## Deploy Your First Pipeline This guide is for platform engineers who want to run a multi-agent pipeline end-to-end. You will define three agents, wire them into a sequential graph, submit a task, and inspect the results. ### Prerequisites * Orloj server (`orlojd`) running (sequential mode with `--embedded-worker` is fine for this guide) * `orlojctl` available (or `go run ./cmd/orlojctl`) If you have not set up Orloj yet, follow the [Install](../getting-started/install.md) and [Quickstart](../getting-started/quickstart.md) guides first. ### What You Will Build A three-stage pipeline where each agent hands off to the next: ``` planner ──► researcher ──► writer ``` The planner breaks the task into research requirements, the researcher gathers evidence, and the writer produces the final output. ### Step 1: Define the Agents Create three agent manifests. Each agent has a model, a system prompt, and execution limits. **Planner agent** (`planner-agent.yaml`): ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: bp-pipeline-planner-agent spec: model_ref: openai-default prompt: | You are the planning stage. 
Break the task into concrete research and writing requirements. limits: max_steps: 4 timeout: 20s ``` **Research agent** (`research-agent.yaml`): ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: bp-pipeline-research-agent spec: model_ref: openai-default prompt: | You are the research stage. Produce concise, verifiable findings for the writer. limits: max_steps: 6 timeout: 30s ``` **Writer agent** (`writer-agent.yaml`): ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: bp-pipeline-writer-agent spec: model_ref: openai-default prompt: | You are the writing stage. Synthesize prior handoffs into a polished final output. limits: max_steps: 4 timeout: 20s ``` Apply all three: ```bash orlojctl apply -f planner-agent.yaml orlojctl apply -f research-agent.yaml orlojctl apply -f writer-agent.yaml ``` ### Step 2: Define the Agent System The AgentSystem wires the agents into a pipeline graph: ```yaml apiVersion: orloj.dev/v1 kind: AgentSystem metadata: name: bp-pipeline-system labels: orloj.dev/pattern: pipeline spec: agents: - bp-pipeline-planner-agent - bp-pipeline-research-agent - bp-pipeline-writer-agent graph: bp-pipeline-planner-agent: edges: - to: bp-pipeline-research-agent bp-pipeline-research-agent: edges: - to: bp-pipeline-writer-agent ``` The `graph` field defines a directed acyclic graph. Each node lists its outbound edges. The planner routes to the researcher, who routes to the writer. The writer has no outbound edges, making it the terminal node. 
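The graph semantics can be checked with a few lines: entry nodes are nodes no edge points to, and terminal nodes have no outbound edges. A sketch (shortened node names; not runtime code):

```python
def entry_and_terminal(graph, agents):
    """Classify nodes of a DAG given as {node: [outbound targets]}.

    Nodes with no outbound edges may be absent from `graph`,
    as the writer is in the manifest above.
    """
    targets = {t for outs in graph.values() for t in outs}
    entries = [a for a in agents if a not in targets]      # zero indegree
    terminals = [a for a in agents if not graph.get(a)]    # no outbound edges
    return entries, terminals
```

For the pipeline above this yields the planner as the sole entry node and the writer as the sole terminal node.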
Apply the system: ```bash orlojctl apply -f agent-system.yaml ``` ### Step 3: Submit a Task Create a task that targets the pipeline system: ```yaml apiVersion: orloj.dev/v1 kind: Task metadata: name: bp-pipeline-task spec: system: bp-pipeline-system input: topic: state of enterprise AI copilots priority: high retry: max_attempts: 2 backoff: 2s message_retry: max_attempts: 2 backoff: 250ms max_backoff: 2s jitter: full ``` Apply the task: ```bash orlojctl apply -f task.yaml ``` ### Step 4: Monitor Execution Watch the task progress: ```bash orlojctl get tasks -w ``` View agent logs: ```bash orlojctl logs task/bp-pipeline-task ``` Trace the execution path through the graph: ```bash orlojctl trace task bp-pipeline-task ``` Visualize the system graph: ```bash orlojctl graph system bp-pipeline-system ``` ### What Happens at Runtime 1. The scheduler assigns `bp-pipeline-task` to an available worker. 2. The worker claims the task and acquires a lease. 3. The planner agent runs first (entry node -- zero indegree in the graph). 4. The planner's output is routed as a message to the research agent. 5. The research agent processes the message and routes its output to the writer. 6. The writer produces the final output. With no further edges, the task transitions to `Succeeded`. If any agent fails, the message-level retry configuration kicks in. After `max_attempts` exhaustion, the message moves to `deadletter`. If the task-level retry is also exhausted, the task itself transitions to `DeadLetter`. 
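The `jitter: full` strategy in `message_retry` above is commonly implemented as AWS-style full jitter: sample uniformly between zero and the capped exponential bound. Whether Orloj uses exactly this formula is an assumption; here is a sketch using the task's `250ms` base and `2s` cap:

```python
import random

def retry_delay(attempt, base=0.25, cap=2.0, rng=random.random):
    """Full-jitter backoff delay in seconds for a 1-based attempt number:
    uniform over [0, min(cap, base * 2**(attempt - 1)))."""
    bound = min(cap, base * (2 ** (attempt - 1)))
    return rng() * bound
```

Attempt 1 draws from up to 250ms, attempt 2 from up to 500ms, and from attempt 4 onward the bound is capped at the 2s `max_backoff`.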
### Using the Pre-Built Blueprint The complete pipeline blueprint is available in the repository: ```bash orlojctl apply -f examples/blueprints/pipeline/agents/ orlojctl apply -f examples/blueprints/pipeline/agent-system.yaml orlojctl apply -f examples/blueprints/pipeline/task.yaml ``` ### Next Steps * [Starter Blueprints](../architecture/starter-blueprints.md) -- explore hierarchical and swarm-loop topologies * [Set Up Multi-Agent Governance](./setup-governance.md) -- add policies and permissions to your pipeline * [Tasks and Scheduling](../concepts/tasks-and-scheduling.md) -- understand the full task lifecycle ## Guides Step-by-step tutorials for common Orloj workflows. Each guide walks through a complete use case from start to finish, using real manifests from the `examples/` directory. For **ready-made scenario folders** (full YAML sets you can copy into your environment), see [examples/use-cases/](../../../examples/use-cases/README.md). If you have not installed Orloj yet, start with the [Install](../getting-started/install.md) and [Quickstart](../getting-started/quickstart.md) pages first. ### Available Guides **[Deploy Your First Pipeline](./deploy-pipeline.md)** *For platform engineers who want to run a multi-agent pipeline end-to-end.* Walk through the pipeline blueprint: define three agents (planner, researcher, writer), wire them into a sequential graph, submit a task, and inspect the results. **[Set Up Multi-Agent Governance](./setup-governance.md)** *For platform engineers who need to enforce tool authorization and model constraints.* Create policies, roles, and tool permissions. Deploy a governed agent system and verify that unauthorized tool calls are denied. **[Configure Model Routing](./configure-model-routing.md)** *For platform engineers who need to route agents to different model providers.* Set up ModelEndpoints for OpenAI and Anthropic, bind agents to endpoints by reference, and verify that requests route correctly. 
**[Connect an MCP Server](./connect-mcp-server.md)** *For platform engineers who want to integrate MCP-compatible tool servers.* Register an MCP server (stdio or HTTP), verify tool discovery, filter imported tools, and assign them to agents. **[Build a Custom Tool](./build-custom-tool.md)** *For developers who need to extend agent capabilities with external tools.* Implement the Tool Contract v1, register the tool as a resource, configure isolation and retry, and validate with the conformance harness. ## Set Up Multi-Agent Governance This guide is for platform engineers who need to enforce tool authorization and model constraints on their agent systems. You will create policies, roles, and tool permissions, deploy a governed agent system, and verify that unauthorized tool calls are denied. ### Prerequisites * Orloj server (`orlojd`) and at least one worker running * `orlojctl` available * Familiarity with the [Governance and Policies](../concepts/governance.md) concepts ### What You Will Build A governed version of the report pipeline where: * An AgentPolicy restricts model usage and blocks dangerous tools * AgentRoles grant specific tool permissions to agents * ToolPermissions define authorization requirements for each tool * An agent that attempts an unauthorized tool call is denied ### Step 1: Define the Tools Start by creating the tools your agents will reference: ```yaml apiVersion: orloj.dev/v1 kind: Tool metadata: name: web_search spec: type: http endpoint: https://api.search.com auth: secretRef: search-api-key ``` ```yaml apiVersion: orloj.dev/v1 kind: Tool metadata: name: vector_db spec: type: http endpoint: https://api.vector-db.local/query ``` Apply both: ```bash orlojctl apply -f web_search_tool.yaml orlojctl apply -f vector_db_tool.yaml ``` ### Step 2: Create Tool Permissions Define what permissions are required to invoke each tool: ```yaml apiVersion: orloj.dev/v1 kind: ToolPermission metadata: name: web-search-invoke spec: tool_ref: web_search 
action: invoke match_mode: all apply_mode: global required_permissions: - tool:web_search:invoke - capability:web.read ``` ```yaml apiVersion: orloj.dev/v1 kind: ToolPermission metadata: name: vector-db-invoke spec: tool_ref: vector_db action: invoke match_mode: all apply_mode: global required_permissions: - tool:vector_db:invoke ``` Apply: ```bash orlojctl apply -f web_search_invoke_permission.yaml orlojctl apply -f vector_db_invoke_permission.yaml ``` ### Step 3: Create Agent Roles Roles bundle permissions that can be bound to agents: ```yaml apiVersion: orloj.dev/v1 kind: AgentRole metadata: name: analyst-role spec: description: Can call web search style tools. permissions: - tool:web_search:invoke - capability:web.read ``` ```yaml apiVersion: orloj.dev/v1 kind: AgentRole metadata: name: vector-reader-role spec: description: Can query vector knowledge tools. permissions: - tool:vector_db:invoke ``` Apply: ```bash orlojctl apply -f analyst_role.yaml orlojctl apply -f vector_reader_role.yaml ``` ### Step 4: Create an Agent Policy The policy scopes model and tool constraints to a specific agent system: ```yaml apiVersion: orloj.dev/v1 kind: AgentPolicy metadata: name: cost-policy spec: apply_mode: scoped target_systems: - report-system-governed max_tokens_per_run: 50000 allowed_models: - gpt-4o blocked_tools: - filesystem_delete ``` Apply: ```bash orlojctl apply -f cost_policy.yaml ``` ### Step 5: Deploy the Governed Agent Create an agent that binds the `analyst-role` but **not** `vector-reader-role`. This means it will be authorized for `web_search` but denied for `vector_db`: ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: research-agent-governed spec: model_ref: openai-default prompt: | You are a research assistant. Produce concise evidence-backed answers. 
roles: - analyst-role tools: - web_search - vector_db memory: ref: research-memory limits: max_steps: 6 timeout: 30s ``` Even though `vector_db` is listed in `tools`, the agent lacks the `tool:vector_db:invoke` permission, so any attempt to call it will be denied. ### Step 6: Wire the System and Submit a Task ```yaml apiVersion: orloj.dev/v1 kind: AgentSystem metadata: name: report-system-governed spec: agents: - planner-agent - research-agent-governed - writer-agent graph: planner-agent: next: research-agent-governed research-agent-governed: next: writer-agent ``` ```bash orlojctl apply -f report_system_governed.yaml orlojctl apply -f weekly_report_governed_task.yaml ``` ### Step 7: Verify Governance Enforcement Check the task trace for authorization events: ```bash orlojctl trace task weekly-report-governed ``` In the trace output, look for: * Successful `web_search` invocations (agent holds required permissions) * `tool_permission_denied` errors for any `vector_db` attempts (agent lacks `tool:vector_db:invoke`) ### Allowing Both Tools To grant the agent access to both tools, add `vector-reader-role` to its roles: ```yaml spec: roles: - analyst-role - vector-reader-role ``` The pre-built example for this is `examples/resources/agents/research_agent_governed_allow.yaml` with `examples/resources/agent-systems/report_system_governed_allow.yaml`. ### Next Steps * [Governance and Policies](../concepts/governance.md) -- deeper dive into the authorization model * [Security and Isolation](../operations/security.md) -- operational security controls * [Configure Model Routing](./configure-model-routing.md) -- set up provider-specific model endpoints ## Getting Started Do these in order: install the binaries or run with Docker, then run the quickstart so a real pipeline executes. 1. **[Install](./install.md)** — from source or Docker Compose. 2. **[Quickstart](./quickstart.md)** — one process, embedded worker; you’ll submit a task and see a three-agent pipeline run. 
When you’re ready to scale out, the quickstart’s [Scaling to production](./quickstart.md#scaling-to-production) section covers message-driven mode and distributed workers. For more topologies (hierarchical, swarm), see [Starter Blueprints](../architecture/starter-blueprints.md). **Prerequisites:** Go 1.24+; Docker only if you use container-isolated tools. ## Install Orloj This guide covers how to install Orloj for local evaluation and production-like use: from source (clone and run or build), from **release binaries** (GitHub Releases), or from **container images** (GitHub Container Registry). Use release artifacts when you want a tagged, published build instead of building from source. ### Before You Begin * **From source:** Go `1.24+`, optionally Bun `1.3+` for docs/frontend * **Containers:** Docker * **API checks:** `curl` and `jq` *** ### From source Clone the repo, then either run in place or build binaries. ```bash git clone https://github.com/OrlojHQ/orloj.git && cd orloj ``` #### Run from source (no build) Single process with embedded worker: ```bash go run ./cmd/orlojd \ --storage-backend=memory \ --task-execution-mode=sequential \ --embedded-worker \ --model-gateway-provider=mock ``` #### Build local binaries ```bash go build -o ./bin/orlojd ./cmd/orlojd go build -o ./bin/orlojworker ./cmd/orlojworker go build -o ./bin/orlojctl ./cmd/orlojctl ``` Run the server: ```bash ./bin/orlojd --storage-backend=memory --task-execution-mode=sequential --embedded-worker --model-gateway-provider=mock ``` *** ### From release binaries (GitHub Releases) Download the server, worker, and CLI for your platform from [GitHub Releases](https://github.com/OrlojHQ/orloj/releases). Artifacts are named by binary, git tag, OS, and arch (e.g. `orlojd_v0.1.0_linux_amd64.tar.gz`, `orlojctl_v0.1.0_darwin_arm64.tar.gz`). Verify with `checksums.txt` on the same release. 
Extract and run:

```bash
# Example: after downloading and extracting orlojd, orlojworker, orlojctl for your OS/arch
./orlojd --storage-backend=memory --task-execution-mode=sequential --embedded-worker --model-gateway-provider=mock
```

Use a specific version tag (e.g. `v0.1.0`) for production.

#### CLI only for hosted deployments

If `orlojd` and workers run elsewhere—Docker Compose on a VPS, Kubernetes, GHCR images, or a managed host—you **do not** need the full repo on your laptop. Download only the **`orlojctl_*_<os>_<arch>`** archive for your platform from the same [GitHub Releases](https://github.com/OrlojHQ/orloj/releases) page, verify it with `checksums.txt`, extract the binary, and put it on your `PATH` (or run it by full path). Container images ship the server and worker binaries only, not the CLI.

Point `orlojctl` at your API with `--server` and authenticate with a bearer token; see [Remote CLI and API access](../deployment/remote-cli-access.md). Prefer a **CLI version that matches your server's release tag** when possible.

***

### From container images (GHCR)

Published releases are pushed to GitHub Container Registry. Pull and run the server and worker without building from source:

```bash
docker pull ghcr.io/orlojhq/orloj-orlojd:latest
docker pull ghcr.io/orlojhq/orloj-orlojworker:latest
```

Use a version tag instead of `latest` for production (e.g. `ghcr.io/orlojhq/orloj-orlojd:v0.1.0`). You still need Postgres and optionally NATS for persistence and message-driven mode; see [Deployment](../deployment/index.md) for full-stack options.
Example: server only, with in-memory storage:

```bash
docker run --rm -p 8080:8080 ghcr.io/orlojhq/orloj-orlojd:latest \
  --addr=:8080 \
  --storage-backend=memory \
  --task-execution-mode=sequential \
  --embedded-worker \
  --model-gateway-provider=mock
```

For a full stack (Postgres, NATS, server, workers), use the [VPS](../deployment/vps.md) or [Kubernetes](../deployment/kubernetes.md) deployment guides with `image: ghcr.io/orlojhq/orloj-orlojd:<tag>` (and the worker image) instead of building from the repo.

***

### Docker Compose (from source)

To run the full stack from the repo (Postgres, NATS, `orlojd`, two workers) with a local build:

```bash
git clone https://github.com/OrlojHQ/orloj.git && cd orloj
docker compose up --build
```

This builds the server and worker images from the Dockerfile. To use release images instead, override the service images to `ghcr.io/orlojhq/orloj-orlojd:<tag>` and `ghcr.io/orlojhq/orloj-orlojworker:<tag>` (see [Deployment](../deployment/index.md)).

### Verify Installation

```bash
curl -s http://127.0.0.1:8080/healthz | jq .
go run ./cmd/orlojctl get workers
```

Expected result:

* `healthz` returns healthy status.
* At least one worker is `Ready`.

### Next Steps

* [Deployment Overview](../deployment/index.md)
* [Local Deployment](../deployment/local.md)
* [VPS Deployment](../deployment/vps.md)
* [Kubernetes Deployment](../deployment/kubernetes.md)
* [Quickstart](./quickstart.md)
* [Configuration](../operations/configuration.md)

## Quickstart

Get a multi-agent pipeline running in under five minutes. This quickstart uses sequential execution mode -- the simplest way to run Orloj with a single process and no external dependencies.

### Before You Begin

* Go `1.24+` is installed.
* You are in the repository root.

### 1.
Start the Server Start `orlojd` with an embedded worker in sequential mode: ```bash go run ./cmd/orlojd \ --storage-backend=memory \ --task-execution-mode=sequential \ --embedded-worker \ --model-gateway-provider=mock ``` This runs the server and a built-in worker in a single process. No separate worker needed. **Web console:** Open [http://127.0.0.1:8080/ui/](http://127.0.0.1:8080/ui/) in your browser to view agents, systems, tasks, and the task trace. You can use it to inspect the pipeline and task status as you run the steps below. ### 2. Apply a Starter Blueprint ```bash go run ./cmd/orlojctl apply -f examples/blueprints/pipeline/ ``` This creates agents, an agent system (the pipeline graph), and a task in one command. ### 3. Verify Execution ```bash go run ./cmd/orlojctl get task bp-pipeline-task ``` Expected result: task reaches `Succeeded`. ### Scaling to Production When you are ready to run multi-worker, distributed workloads, switch to **message-driven** mode. This unlocks parallel fan-out, durable message delivery, and horizontal scaling. Start the server: ```bash go run ./cmd/orlojd \ --storage-backend=postgres \ --task-execution-mode=message-driven \ --agent-message-bus-backend=nats-jetstream ``` Start one or more workers: ```bash go run ./cmd/orlojworker \ --storage-backend=postgres \ --task-execution-mode=message-driven \ --agent-message-bus-backend=nats-jetstream \ --agent-message-consume \ --model-gateway-provider=openai ``` See [Execution and Messaging](../architecture/execution-model.md) for details on the message lifecycle, ownership guarantees, and retry behavior. ### Try with a Real Model The quickstart above uses the mock gateway, which returns placeholder output. To use a real provider (OpenAI, Anthropic, Ollama, etc.), create a **Secret** resource for your API key, create a **ModelEndpoint** that references it via `auth.secretRef`, and point your agents at that endpoint with `model_ref`. 
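The Secret-plus-ModelEndpoint setup above can be sketched as a single manifest file. This is a hedged sketch: only `auth.secretRef` and `model_ref` are documented on this page, so the other `spec` field names below are assumptions; check the resource reference before use.

```shell
# Hedged sketch -- Secret/ModelEndpoint field names marked "hypothetical"
# are assumptions, not documented API; adapt to the resource reference.
cat > real-model.yaml <<'EOF'
apiVersion: orloj.dev/v1
kind: Secret
metadata:
  name: openai-api-key
spec:
  value: <your-api-key>   # hypothetical field; keep real keys out of git
---
apiVersion: orloj.dev/v1
kind: ModelEndpoint
metadata:
  name: openai-default
spec:
  provider: openai        # hypothetical field
  model: gpt-4o           # hypothetical field
  auth:
    secretRef: openai-api-key
EOF
# Then apply it, and reference it from agents via `model_ref: openai-default`:
#   go run ./cmd/orlojctl apply -f real-model.yaml
```

Agents that already declare `model_ref: openai-default` (as in the examples on this page) pick up the new endpoint without manifest changes.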
See [Configure Model Routing](../guides/configure-model-routing.md) for the full steps. ### Next Steps * [Starter Blueprints](../architecture/starter-blueprints.md) -- pipeline, hierarchical, and swarm-loop topologies * [Configuration](../operations/configuration.md) -- all flags and environment variables ## Deployment Overview This section provides setup runbooks by deployment target. ### Purpose Choose a deployment path based on your environment and required operational behavior. ### Deployment Targets | Target | Best For | Persistence | Process Management | Scope | | ---------- | ----------------------------------------------- | ------------------------------- | -------------------------- | ----------------------------- | | Local | Development and rapid iteration | Optional (`memory` or Postgres) | terminal or Docker Compose | single operator machine | | VPS | Single-node production-style self-hosting | Postgres volume | systemd + Docker Compose | small internal workloads | | Kubernetes | Cluster-based operations and lifecycle controls | PVC-backed Postgres | Kubernetes deployments | platform-managed environments | ### Runbooks 1. [Local Deployment](./local.md) 2. [VPS Deployment (Compose + systemd)](./vps.md) 3. [Kubernetes Deployment (Helm + Manifest Fallback)](./kubernetes.md) 4. [Remote CLI and API access](./remote-cli-access.md) — tokens, `orlojctl` profiles, and `config.json` after you expose the control plane ### Hosted stack, local CLI When the control plane runs in Compose, Kubernetes, or GHCR images, install **`orlojctl` alone** on the machine you use to operate the cluster (laptop, bastion, or CI): download the `orlojctl_*` archive for your OS and arch from [GitHub Releases](https://github.com/OrlojHQ/orloj/releases) (see [Install: CLI only for hosted deployments](../getting-started/install.md#cli-only-for-hosted-deployments)). Then follow [Remote CLI and API access](./remote-cli-access.md) for `--server`, tokens, and optional profiles. 
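The download step above pairs with the `checksums.txt` file published on each release. A small verification sketch (the helper name and archive name are illustrative, not part of any shipped tooling; assumes GNU `sha256sum`):

```shell
# Sketch: verify a downloaded release archive against checksums.txt before
# extracting it. Prints "ok: <archive>" on success, fails otherwise.
verify_release_asset() {
  local archive="$1" sums="${2:-checksums.txt}"
  # Keep only the checksum line for this asset, then verify it.
  grep " ${archive}\$" "$sums" | sha256sum --check --status - \
    && echo "ok: ${archive}" \
    || { echo "checksum mismatch: ${archive}"; return 1; }
}
# Usage after downloading, e.g.:
#   verify_release_asset orlojctl_linux_amd64.tar.gz
#   tar -xzf orlojctl_linux_amd64.tar.gz && install -m 0755 orlojctl ~/.local/bin/
```

Verifying before extraction catches truncated downloads as well as tampering, which matters when the binary will carry an API token for your control plane.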
### Security Defaults * Rotate default secrets before non-local use. * Restrict network exposure to required interfaces. * Keep API auth strategy explicit for each target. * After deployment, configure [remote CLI access](./remote-cli-access.md) (API tokens, env vars, optional `orlojctl config` profiles). ### Related Docs * [Install](../getting-started/install.md) * [Operations Runbook](../operations/runbook.md) ## Kubernetes Deployment (Helm + Manifest Fallback) ### Purpose Deploy Orloj on Kubernetes with a Helm chart (recommended) or with raw manifests (fallback). ### Prerequisites * Kubernetes cluster access (`kubectl` context configured) * container registry you can push to * Docker (or compatible image builder) * Helm 3 (`helm`) * `curl`, `jq`, and `go` for CLI verification from the operator workstation ### Install #### 1. Build and Push Images ```bash export REGISTRY=ghcr.io/<your-org> export TAG=v0.1.0 docker build -t "${REGISTRY}/orloj-orlojd:${TAG}" --target orlojd \ --build-arg "VERSION=${TAG}" --build-arg "COMMIT=$(git rev-parse HEAD)" --build-arg "DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)" . docker build -t "${REGISTRY}/orloj-orlojworker:${TAG}" --target orlojworker \ --build-arg "VERSION=${TAG}" --build-arg "COMMIT=$(git rev-parse HEAD)" --build-arg "DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)" . docker push "${REGISTRY}/orloj-orlojd:${TAG}" docker push "${REGISTRY}/orloj-orlojworker:${TAG}" ``` #### 2. Install with Helm (Recommended) ```bash helm upgrade --install orloj ./charts/orloj \ --namespace orloj \ --create-namespace \ --set orlojd.image.repository="${REGISTRY}/orloj-orlojd" \ --set orlojd.image.tag="${TAG}" \ --set orlojworker.image.repository="${REGISTRY}/orloj-orlojworker" \ --set orlojworker.image.tag="${TAG}" \ --set postgres.auth.password='<postgres-password>' \ --set runtimeSecret.modelGatewayApiKey='<model-gateway-api-key>' ``` To inspect effective values: ```bash helm get values orloj --namespace orloj ``` #### 3. Manifest Fallback (No Helm) If you cannot use Helm, apply the baseline manifest set: 1.
Edit `docs/deploy/kubernetes/orloj-stack.yaml` image references. 2. Rotate baseline secrets (`postgres-password`, DSN password, model API key). 3. Apply manifests: ```bash kubectl apply -f docs/deploy/kubernetes/orloj-stack.yaml ``` ### Verify Wait for rollouts: ```bash kubectl -n orloj rollout status deploy/orloj-postgres kubectl -n orloj rollout status deploy/orloj-nats kubectl -n orloj rollout status deploy/orloj-orlojd kubectl -n orloj rollout status deploy/orloj-orlojworker ``` If you used manifest fallback instead of Helm, use: ```bash kubectl -n orloj rollout status deploy/postgres kubectl -n orloj rollout status deploy/nats kubectl -n orloj rollout status deploy/orlojd kubectl -n orloj rollout status deploy/orlojworker ``` Port-forward API service: ```bash kubectl -n orloj port-forward svc/orloj-orlojd 8080:8080 ``` For manifest fallback, port-forward `svc/orlojd` instead. In another terminal: ```bash curl -s http://127.0.0.1:8080/healthz | jq . go run ./cmd/orlojctl --server http://127.0.0.1:8080 get workers go run ./cmd/orlojctl --server http://127.0.0.1:8080 apply -f examples/blueprints/pipeline/ go run ./cmd/orlojctl --server http://127.0.0.1:8080 get task bp-pipeline-task ``` Done means: * all deployments are successfully rolled out. * API service is reachable through port-forward. * at least one worker is `Ready`. * sample task reaches `Succeeded`. ### Operate Scale workers: ```bash kubectl -n orloj scale deploy/orloj-orlojworker --replicas=3 kubectl -n orloj rollout status deploy/orloj-orlojworker ``` Restart control plane: ```bash kubectl -n orloj rollout restart deploy/orloj-orlojd kubectl -n orloj rollout status deploy/orloj-orlojd ``` View logs: ```bash kubectl -n orloj logs deploy/orloj-orlojd --tail=200 kubectl -n orloj logs deploy/orloj-orlojworker --tail=200 ``` Upgrade chart release: ```bash helm upgrade orloj ./charts/orloj --namespace orloj ``` ### Troubleshoot * pods in `ImagePullBackOff`: verify image names/tags and registry access. 
* workers not processing: verify `ORLOJ_AGENT_MESSAGE_CONSUME=true` and message-bus env values. * tasks not created: verify port-forward is active and API endpoint is reachable. * Helm rollback: `helm rollback orloj --namespace orloj`. ### Security Defaults * This baseline is not HA. * Rotate secrets before non-test use. * `ORLOJ_AUTH_MODE` defaults to `native` in chart runtime config. * Set and rotate `runtimeSecret.apiToken` for CLI/automation bearer auth. * Restrict namespace and service exposure based on cluster policy. ### Related Docs * [Deployment Assets (`docs/deploy/kubernetes`)](../../deploy/kubernetes/README.md) * [Configuration](../operations/configuration.md) * [Operations Runbook](../operations/runbook.md) ## Local Deployment ### Purpose Run Orloj locally for development and deterministic feature validation. ### Prerequisites * Go `1.24+` * Docker * `curl` and `jq` * repository checked out locally ### Install #### Option A: Run from Source Terminal 1: ```bash go run ./cmd/orlojd \ --storage-backend=memory \ --task-execution-mode=message-driven \ --agent-message-bus-backend=memory ``` Terminal 2: ```bash go run ./cmd/orlojworker \ --storage-backend=memory \ --task-execution-mode=message-driven \ --agent-message-bus-backend=memory \ --agent-message-consume \ --model-gateway-provider=mock ``` #### Option B: Docker Compose Stack ```bash docker compose up --build -d ``` Uses [`docker-compose.yml`](../../../docker-compose.yml). ### Verify Health and worker readiness: ```bash curl -s http://127.0.0.1:8080/healthz | jq . go run ./cmd/orlojctl get workers ``` Sample task execution: ```bash go run ./cmd/orlojctl apply -f examples/blueprints/pipeline/ go run ./cmd/orlojctl get task bp-pipeline-task ``` Done means: * `/healthz` returns healthy status. * at least one worker is `Ready`. * `bp-pipeline-task` reaches `Succeeded`. ### Operate Source-mode stop: terminate both processes. 
Compose-mode stop: ```bash docker compose down ``` Compose logs: ```bash docker compose logs -f orlojd orlojworker-a orlojworker-b ``` ### Troubleshoot * If workers do not appear, check worker process logs for DSN/backend mismatch. * If tasks remain pending, verify execution mode and message-consumer flags match. * If `orlojctl` calls fail, confirm `--server` points to the active API address. ### Security Defaults For local-only use, API auth may remain disabled. Do not expose local ports publicly. ### Related Docs * [Quickstart](../getting-started/quickstart.md) * [Configuration](../operations/configuration.md) * [Troubleshooting](../operations/troubleshooting.md) ## Remote CLI and API access This guide is for **operators and users** who already have `orlojd` reachable on a network (self-hosted, VPS, Kubernetes, or internal URL) and need to call the API from **`orlojctl`**, scripts, or CI. It complements the [quickstart](../getting-started/quickstart.md), which focuses on a single-machine dev loop. For deeper security context (generation, rotation, threat model), see [Control plane API tokens](../operations/security.md#control-plane-api-tokens). ### Install `orlojctl` locally You need the CLI on **your** machine (or in CI), not inside the server container. The easiest path is the standalone binary from [GitHub Releases](https://github.com/OrlojHQ/orloj/releases): download **`orlojctl_<version>_<os>_<arch>`**, verify with `checksums.txt` on that release, extract, and add the binary to your `PATH`. Details and naming conventions are in [Install: CLI only for hosted deployments](../getting-started/install.md#cli-only-for-hosted-deployments). If you already cloned the repo with Go installed, `go run ./cmd/orlojctl` works the same way against a remote `--server`. ### API tokens (shared secret) Orloj **does not** issue API tokens from the web console.
The **operator** generates a random string and configures **the same value** on the server and on every client that uses `Authorization: Bearer <token>`. ```bash openssl rand -hex 32 ``` Store the value in your secrets manager or deployment environment—**not** in git. On the server, set **`orlojd --api-key=...`** or **`ORLOJ_API_TOKEN=...`** (or **`ORLOJ_API_TOKENS`** for multiple `token:role` pairs). See [Control plane API tokens](../operations/security.md#control-plane-api-tokens) for details. ### Server-side wiring Where you set `ORLOJ_API_TOKEN` depends on how you run `orlojd`: * **Docker Compose / systemd** — env var or secret in the service definition (e.g. [VPS deployment](./vps.md)). * **Kubernetes / Helm** — `runtimeSecret` or equivalent env injection (see [Kubernetes deployment](./kubernetes.md)). ### Client-side: environment and flags From any machine that should talk to the API: | Mechanism | Purpose | | ------------------------------ | ------------------------------------------------------- | | `ORLOJ_SERVER` | Default API base URL when `--server` is omitted | | `ORLOJCTL_SERVER` | Same default; **takes precedence** over `ORLOJ_SERVER` | | `ORLOJ_API_TOKEN` | Bearer token | | `ORLOJCTL_API_TOKEN` | Same token; checked before `ORLOJ_API_TOKEN` by the CLI | | `orlojctl --api-token <token>` | Overrides env for that process | | `orlojctl --server <url>` | Overrides per-command default server | ### Precedence **Token** (first match wins): 1. `orlojctl --api-token ...` 2. `ORLOJCTL_API_TOKEN` 3. `ORLOJ_API_TOKEN` 4. Active profile: `token` field, else value of the env var named by `token_env` **Default `--server`** when the flag is omitted (first match wins): 1. `ORLOJCTL_SERVER` 2. `ORLOJ_SERVER` 3. Active profile `server` 4. `http://127.0.0.1:8080` Explicit `--server` on a subcommand always overrides the default above.
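The two precedence chains above can be expressed as a small sketch. The function names are illustrative (not part of `orlojctl`), the `--api-token` flag is modeled as a plain variable, and the profile lookup steps are omitted:

```shell
# Sketch of the documented client-side precedence; profile fields
# (token / token_env / server) are left out for brevity.
resolve_token() {
  # --api-token flag > ORLOJCTL_API_TOKEN > ORLOJ_API_TOKEN
  echo "${FLAG_API_TOKEN:-${ORLOJCTL_API_TOKEN:-${ORLOJ_API_TOKEN:-}}}"
}
resolve_server() {
  # ORLOJCTL_SERVER > ORLOJ_SERVER > built-in default
  echo "${ORLOJCTL_SERVER:-${ORLOJ_SERVER:-http://127.0.0.1:8080}}"
}
```

For example, with both token variables set, `ORLOJCTL_API_TOKEN` wins; with neither server variable set, the local default `http://127.0.0.1:8080` is used.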
### `orlojctl config` and `config.json` Named **profiles** are stored as JSON: * **Path:** `orlojctl config path` (typically `~/.config/orlojctl/config.json` on Unix). * **Permissions:** file is written with mode `0600` when created or updated. **The file does not exist until the first successful save** (for example `orlojctl config set-profile ...`). Until then, only environment variables and flags apply—if you open the path early, an empty or missing file is normal. Commands: ```bash orlojctl config path orlojctl config set-profile production --server https://orloj.example.com --token-env ORLOJ_PROD_TOKEN orlojctl config use production orlojctl config get ``` `set-profile` creates or updates a profile. The first profile you create also becomes **`current_profile`** if none was set. Prefer **`--token-env`** so the token is not stored in the JSON file. #### Example `config.json` Shape matches the CLI (field names are JSON): ```json { "current_profile": "production", "profiles": { "local": { "server": "http://127.0.0.1:8080" }, "production": { "server": "https://orloj.example.com", "token_env": "ORLOJ_PROD_TOKEN" } } } ``` You can hand-edit this file if you prefer; invalid JSON will cause `orlojctl` to error on load. ### Local UI auth vs API tokens If you use **`--auth-mode=native`**, the web UI uses an **admin username/password** and **session cookies**. That is separate from API access: **`orlojctl` and automation should use the bearer token** configured with `ORLOJ_API_TOKEN` / `--api-key` on the server, not the UI password. See [Control plane API tokens](../operations/security.md#control-plane-api-tokens) and [CLI reference: orlojctl](../reference/cli.md#orlojctl). 
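Because profiles are plain JSON, automation can also read them directly, for example to discover which server a script will target. A sketch (the helper name is illustrative; assumes `jq` is installed and the path reported by `orlojctl config path`):

```shell
# Sketch: print the `server` of the currently selected profile
# from config.json; pass an explicit path to override the default.
profile_server() {
  jq -r '.profiles[.current_profile].server' \
    "${1:-$HOME/.config/orlojctl/config.json}"
}
# Example against the sample config.json shown above:
#   profile_server            ->  https://orloj.example.com
```

Remember the file is `0600` and may not exist yet; guard automation with a `[ -f ... ]` check before reading it.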
### Related docs * [CLI reference](../reference/cli.md) — full command list and flags * [Configuration](../operations/configuration.md) — `orlojd` / `orlojworker` environment variables * [VPS deployment](./vps.md) — single-node Compose + systemd * [Kubernetes deployment](./kubernetes.md) — Helm and manifests ## VPS Deployment (Compose + systemd) ### Purpose Run Orloj on a single VPS with Docker Compose managed by systemd for automatic restart and reboot recovery. ### Prerequisites * Linux VPS with systemd (for example Ubuntu 22.04+) * Docker Engine with Compose plugin * `git`, `curl`, and `jq` * sudo access ### Install #### 1. Place Repository on Host ```bash sudo mkdir -p /opt/orloj sudo chown "$USER":"$USER" /opt/orloj git clone https://github.com/OrlojHQ/orloj.git /opt/orloj cd /opt/orloj ``` #### 2. Configure Runtime Variables ```bash cp docs/deploy/vps/.env.vps.example docs/deploy/vps/.env.vps ``` Edit `docs/deploy/vps/.env.vps` and rotate at minimum: * `POSTGRES_PASSWORD` * `ORLOJ_POSTGRES_DSN` password component * `ORLOJ_MODEL_GATEWAY_PROVIDER` and key if not using mock #### 3. Validate Compose Config ```bash docker compose --env-file docs/deploy/vps/.env.vps -f docs/deploy/vps/docker-compose.vps.yml config ``` #### 4. Install systemd Unit ```bash sudo cp docs/deploy/vps/orloj-compose.service /etc/systemd/system/orloj.service sudo systemctl daemon-reload sudo systemctl enable --now orloj ``` ### Verify Service status: ```bash sudo systemctl status orloj --no-pager ``` Stack and health checks: ```bash docker compose --env-file docs/deploy/vps/.env.vps -f docs/deploy/vps/docker-compose.vps.yml ps curl -s http://127.0.0.1:8080/healthz | jq . go run ./cmd/orlojctl get workers ``` Sample task execution: ```bash go run ./cmd/orlojctl apply -f examples/blueprints/pipeline/ go run ./cmd/orlojctl get task bp-pipeline-task ``` Done means: * `orloj` systemd unit is active. * stack survives restart (`sudo systemctl restart orloj`). * health and worker checks pass. 
* sample task reaches `Succeeded`. ### Operate Restart stack: ```bash sudo systemctl restart orloj ``` Tail service logs: ```bash sudo journalctl -u orloj -f ``` Tail compose logs: ```bash docker compose --env-file docs/deploy/vps/.env.vps -f docs/deploy/vps/docker-compose.vps.yml logs -f ``` Upgrade flow: 1. `git pull` in `/opt/orloj`. 2. `sudo systemctl reload orloj`. 3. rerun verification checks. ### Troubleshoot * `docker compose ... config` fails: fix missing/invalid `.env.vps` values. * systemd start fails: verify docker binary path and service logs (`journalctl -u orloj`). * workers absent: verify `ORLOJ_AGENT_MESSAGE_CONSUME=true` and message-bus settings. ### Security Defaults * This is a single-node baseline, not HA. * Bind or firewall `8080` to trusted networks only. * API auth defaults to `ORLOJ_AUTH_MODE=native`; complete `/ui/setup` on first boot. * Generate and rotate an API token (`openssl rand -hex 32`), set `ORLOJ_API_TOKEN` on the server, and reuse the same value for CLI/automation—see [Control plane API tokens](../operations/security.md#control-plane-api-tokens). ### Related Docs * [Deployment Assets (`docs/deploy/vps`)](../../deploy/vps/README.md) * [Operations Runbook](../operations/runbook.md) ## Agents and Agent Systems An **Agent** is a declarative unit of work backed by a language model. An **AgentSystem** composes multiple agents into a directed graph that Orloj executes as a coordinated workflow. ### Agents An Agent manifest defines what the agent does (its prompt), what model powers it, what tools it can call, and what constraints bound its execution. ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: research-agent spec: model_ref: openai-default prompt: | You are a research assistant. Produce concise evidence-backed answers. 
tools: - web_search - vector_db memory: ref: research-memory roles: - analyst-role limits: max_steps: 6 timeout: 30s ``` #### Key Fields | Field | Description | | ------------------ | ---------------------------------------------------------------------------------------------------------------- | | `model_ref` | Required reference to a [ModelEndpoint](./model-routing.md) resource for provider-aware routing. | | `prompt` | The system instruction that defines the agent's behavior. | | `tools` | List of [Tool](./tools-and-isolation.md) names this agent may call. Tool calls are subject to governance checks. | | `roles` | Bound [AgentRole](./governance.md) names. Roles carry permissions that authorize tool usage. | | `memory.ref` | Reference to a [Memory](./memory/index.md) resource. This attaches the memory backend to the agent. | | `memory.allow` | Explicit list of built-in memory operations the agent may use: `read`, `write`, `search`, `list`, `ingest`. | | `limits.max_steps` | Maximum execution steps per task turn. Defaults to `10`. | | `limits.timeout` | Maximum wall-clock time per task turn. | #### How an Agent Executes When the runtime activates an agent during a task, it: 1. Initializes the agent's conversation history with the system prompt and current task context. 2. If `memory.ref` is set, wires the backing memory store into the runtime. If `memory.allow` is also set, the runtime exposes only those built-in memory operations as available tools. 3. Routes the request to the configured model via the model gateway, sending the full conversation history. 4. If the model selects tool calls, the runtime checks governance (AgentPolicy, AgentRole, ToolPermission) and executes authorized tools. Memory tool calls are handled internally without network calls. Tool results are sent back using the provider's native structured tool protocol (`role: "tool"` with `tool_call_id` for OpenAI, `tool_result` content blocks for Anthropic). 5. 
Results are appended to the conversation history and sent back to the model for the next step. The agent completes when the model produces text output without requesting further tools, or when `max_steps` / `timeout` is reached. Already-called tools are removed from the available list to prevent duplicate calls. Conversation history is maintained for the full duration of the agent's activation, giving the model continuity across reasoning and tool-use steps. See [Memory](./memory/index.md) for details on memory layers and built-in tools. ### Agent Systems An AgentSystem wires agents into a directed graph. The graph defines how messages flow between agents during task execution. ```yaml apiVersion: orloj.dev/v1 kind: AgentSystem metadata: name: report-system labels: orloj.dev/domain: reporting orloj.dev/usecase: weekly-report spec: agents: - planner-agent - research-agent - writer-agent graph: planner-agent: next: research-agent research-agent: next: writer-agent ``` #### Graph Topologies The `graph` field supports three fundamental patterns: **Pipeline** -- sequential stage-by-stage execution where each agent hands off to the next. ```yaml graph: planner-agent: edges: - to: research-agent research-agent: edges: - to: writer-agent ``` **Hierarchical** -- a manager delegates to leads, who delegate to workers, with a join gate that waits for all branches before proceeding. ```yaml graph: manager-agent: edges: - to: research-lead-agent - to: social-lead-agent research-lead-agent: edges: - to: research-worker-agent social-lead-agent: edges: - to: social-worker-agent research-worker-agent: edges: - to: editor-agent social-worker-agent: edges: - to: editor-agent editor-agent: join: mode: wait_for_all ``` **Swarm with loop** -- parallel scouts report back to a coordinator in iterative cycles, bounded by `Task.spec.max_turns`. 
```yaml graph: coordinator-agent: edges: - to: scout-alpha-agent - to: scout-beta-agent - to: synthesizer-agent scout-alpha-agent: edges: - to: coordinator-agent scout-beta-agent: edges: - to: coordinator-agent ``` #### Fan-out and Fan-in When a graph node has multiple outbound edges, messages fan out to all targets in parallel. Fan-in is handled through join gates: | Join Mode | Behavior | | -------------- | --------------------------------------------------------------------------------- | | `wait_for_all` | Waits for every upstream branch to complete before activating the join node. | | `quorum` | Activates after `quorum_count` or `quorum_percent` of upstream branches complete. | If an upstream branch fails, the `on_failure` policy determines behavior: `deadletter` (default), `skip`, or `continue_partial`. #### Labels Labels on AgentSystem metadata follow Kubernetes conventions and are useful for filtering, governance scoping, and operational grouping: ```yaml metadata: labels: orloj.dev/domain: reporting orloj.dev/usecase: weekly-report orloj.dev/env: dev ``` ### Related Resources * [Resource Reference: Agent and AgentSystem](../reference/resources.md) * [Memory](./memory/index.md) * [Execution and Messaging](../architecture/execution-model.md) * [Starter Blueprints](../architecture/starter-blueprints.md) * [Guide: Deploy Your First Pipeline](../guides/deploy-pipeline.md) ## Governance and Policies Orloj provides a built-in governance layer that controls what agents can do at runtime. Three resource types work together to enforce authorization: **AgentPolicy** constrains execution parameters, **AgentRole** grants named permissions to agents, and **ToolPermission** defines what permissions are required to invoke a tool. Governance is fail-closed: if an agent uses roles and lacks the required permissions for a tool call, the call is denied with a `tool_permission_denied` error. 
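The fail-closed evaluation described below (policy check first, then permission matching) can be sketched in miniature. Function and variable names here are illustrative, not part of the runtime API; permissions and tool lists are modeled as space-separated strings:

```shell
# Illustrative sketch of the fail-closed check: AgentPolicy blocked_tools is
# consulted first, then required_permissions with match_mode=all.
authorize_tool_call() {
  local tool="$1" agent_perms="$2" blocked_tools="$3" required="$4"
  case " ${blocked_tools} " in
    *" ${tool} "*) echo "denied: blocked by policy"; return 1 ;;
  esac
  local perm
  for perm in ${required}; do
    case " ${agent_perms} " in
      *" ${perm} "*) ;;                             # permission held
      *) echo "tool_permission_denied"; return 1 ;; # fail closed
    esac
  done
  echo "allowed"
}
```

For example, an agent holding `tool:web_search:invoke capability:web.read` is allowed to call `web_search`, but a call to a tool on the policy's `blocked_tools` list is denied regardless of permissions.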
For simpler use cases, agents can use `spec.allowed_tools` to pre-authorize tools without needing roles or ToolPermission resources. This is the recommended starting point. The full RBAC model (AgentRole + ToolPermission) is available for advanced governance needs. ### AgentPolicy An AgentPolicy sets execution constraints on agent systems or tasks. Policies can restrict model usage, block specific tools, and cap token consumption. ```yaml apiVersion: orloj.dev/v1 kind: AgentPolicy metadata: name: cost-policy spec: apply_mode: scoped target_systems: - report-system max_tokens_per_run: 50000 allowed_models: - gpt-4o blocked_tools: - filesystem_delete ``` #### Key Fields | Field | Description | | -------------------- | -------------------------------------------------------------------------------------------- | | `apply_mode` | `scoped` (default) applies only to listed targets. `global` applies to all systems/tasks. | | `target_systems` | AgentSystem names this policy applies to (when `scoped`). | | `target_tasks` | Task names this policy applies to (when `scoped`). | | `allowed_models` | Whitelist of permitted model identifiers. Agents configured with unlisted models are denied. | | `blocked_tools` | Tools that may not be invoked under this policy, regardless of agent permissions. | | `max_tokens_per_run` | Maximum token budget for a single task execution. | ### Simple Path: `allowed_tools` For most agents, you can skip roles and ToolPermission entirely by listing tools in the agent's `spec.allowed_tools` field. Tools in this list are pre-authorized and bypass RBAC checks: ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: research-agent spec: model_ref: openai-default tools: - web_search - vector_db allowed_tools: - web_search - vector_db prompt: | You are a research assistant. ``` This agent can invoke both `web_search` and `vector_db` without any AgentRole or ToolPermission resources. 
`spec.tools` declares which tools the agent can select during execution; `spec.allowed_tools` declares which of those tools are pre-authorized. AgentPolicy constraints (like `blocked_tools` and `max_tokens_per_run`) still apply. `allowed_tools` only bypasses the role-based permission check. ### Advanced Path: Roles and ToolPermission For fine-grained access control, use AgentRole and ToolPermission resources. This is recommended when you need per-tool permission auditing, scoped tool access across teams, or separation of duties between agent authors and platform operators. ### AgentRole An AgentRole is a named set of permission strings. Agents bind to roles through their `spec.roles` field, which grants them the associated permissions. ```yaml apiVersion: orloj.dev/v1 kind: AgentRole metadata: name: analyst-role spec: description: Can call web search style tools. permissions: - tool:web_search:invoke - capability:web.read ``` Permission strings follow a hierarchical convention: | Pattern | Meaning | | ------------------------------ | ---------------------------------------- | | `tool:<name>:invoke` | Permission to invoke a specific tool. | | `capability:<name>` | Permission to use a declared capability. | An agent that binds multiple roles accumulates the union of all granted permissions: ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: research-agent-governed-allow spec: model_ref: openai-default roles: - analyst-role - vector-reader-role tools: - web_search - vector_db ``` ### ToolPermission A ToolPermission defines what permissions are required to invoke a specific tool. When an agent attempts to call a tool, the runtime checks the agent's accumulated permissions against the tool's ToolPermission.
```yaml apiVersion: orloj.dev/v1 kind: ToolPermission metadata: name: web-search-invoke spec: tool_ref: web_search action: invoke match_mode: all apply_mode: global required_permissions: - tool:web_search:invoke - capability:web.read ``` #### Key Fields | Field | Description | | ---------------------- | ------------------------------------------------------------------------- | | `tool_ref` | The tool this permission gate applies to. Defaults to `metadata.name`. | | `action` | The action being gated. Defaults to `invoke`. | | `match_mode` | `all` requires every listed permission. `any` requires at least one. | | `apply_mode` | `global` applies to all agents. `scoped` applies only to `target_agents`. | | `required_permissions` | Permission strings the agent must hold. | ### How Authorization Works When an agent selects a tool call during execution, the runtime evaluates authorization in this order: 1. **AgentPolicy check** -- Is the tool in the policy's `blocked_tools` list? If yes, deny. 2. **ToolPermission lookup** -- Find the ToolPermission for this tool and action. 3. **Permission matching** -- Collect the agent's permissions from all bound AgentRoles. Check them against `required_permissions` using the configured `match_mode`. 4. **Decision** -- If all checks pass, the tool is invoked. If any check fails, the call returns `tool_permission_denied`. ``` Agent selects tool call │ ▼ AgentPolicy check (blocked_tools?) │ ┌───┴───┐ │blocked │──► Denied └───┬───┘ │ allowed ▼ ToolPermission lookup │ ▼ Permission matching (agent roles vs required) │ ┌───┴───┐ │ fail │──► Denied (tool_permission_denied) └───┬───┘ │ pass ▼ Tool invoked ``` ### End-to-End Example To set up a governed agent that can search the web but not access the filesystem: **1. Define the role:** ```yaml apiVersion: orloj.dev/v1 kind: AgentRole metadata: name: analyst-role spec: description: Can call web search style tools. permissions: - tool:web_search:invoke - capability:web.read ``` **2. 
Define the tool permission:** ```yaml apiVersion: orloj.dev/v1 kind: ToolPermission metadata: name: web-search-invoke spec: tool_ref: web_search action: invoke match_mode: all required_permissions: - tool:web_search:invoke - capability:web.read ``` **3. Define the policy:** ```yaml apiVersion: orloj.dev/v1 kind: AgentPolicy metadata: name: cost-policy spec: apply_mode: scoped target_systems: - report-system-governed allowed_models: - gpt-4o blocked_tools: - filesystem_delete ``` **4. Bind the role to the agent:** ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: research-agent-governed spec: model_ref: openai-default roles: - analyst-role tools: - web_search - vector_db ``` In this configuration, `research-agent-governed` can invoke `web_search` (it holds the required permissions) but cannot invoke `vector_db` (it lacks `tool:vector_db:invoke`). Any attempt to call `filesystem_delete` is blocked by the policy regardless of permissions. ### Related Resources * [Resource Reference: AgentPolicy, AgentRole, ToolPermission](../reference/resources.md) * [Security and Isolation](../operations/security.md) * [Guide: Set Up Multi-Agent Governance](../guides/setup-governance.md) ## Concepts This section explains the core building blocks of Orloj and how they fit together. Each concept page covers what a resource is, why it exists, how to configure it, and how it interacts with the rest of the system. If you are new to Orloj, start with the [Architecture Overview](../architecture/overview.md) to understand the system's layers, then work through the concepts below. 
### At a Glance ``` TaskSchedule ──creates──▶ Task ◀──creates── TaskWebhook │ triggers ▼ AgentSystem ╱ ╲ composes composes ╱ ╲ Agent A ─────────── Agent B ╱ │ ╲ ╱ │ calls invokes reads calls invokes ╱ │ ╲ ╱ │ ModelEndpoint Tool Memory │ │ │ │ │ │ resolves resolves │ │ auth via auth via │ │ ╲ ╱ │ │ Secret │ │ │ │ ┄┄┄┄┄┄┄┄ Governance ┄┄┄┄┄┄┄┄┄┄┄┄┄┤┄┄┄┄┄┄┤ ┆ ┆ ┆ AgentPolicy ┄┄ constrains ┄┄▶ Agent A, Agent B AgentRole ┄┄ grants permissions to ┄▶ Agents ToolPermission ┄ controls access to ┄▶ Tools Worker ──claims and executes──▶ Task ``` ### Core Resources **[Agents and Agent Systems](./agents-and-systems.md)** -- Agents are declarative units of work backed by language models. Agent Systems compose agents into directed graphs (pipelines, hierarchies, swarm loops) that Orloj executes as coordinated workflows. **[Tasks and Scheduling](./tasks-and-scheduling.md)** -- Tasks are requests to execute an Agent System. They carry input, track execution state through a well-defined lifecycle, and support cron-based scheduling and webhook-triggered creation. **[Tools and Isolation](./tools-and-isolation.md)** -- Tools are external capabilities that agents invoke during execution. Orloj provides a standardized tool contract, four isolation backends (none, sandboxed, container, WASM), and configurable timeout and retry. **[Model Routing](./model-routing.md)** -- ModelEndpoints decouple agents from specific model providers. Configure connections to OpenAI, Anthropic, Azure OpenAI, or Ollama, and bind agents to endpoints by reference. **[Memory](./memory/index.md)** -- Memory gives agents the ability to store, retrieve, and search information using built-in tools. Operates in three layers: conversation history, task-scoped shared state, and persistent backends. **[Governance and Policies](./governance.md)** -- AgentPolicy, AgentRole, and ToolPermission resources enforce authorization at the execution layer. 
The governance model is fail-closed: unauthorized tool calls are denied, not silently ignored. ### Architecture and Execution **[Architecture Overview](../architecture/overview.md)** -- The three-layer architecture: server, workers, and governance. **[Execution and Messaging](../architecture/execution-model.md)** -- Graph routing, fan-out/fan-in, message lifecycle, ownership guarantees, and tool selection. **[Starter Blueprints](../architecture/starter-blueprints.md)** -- Ready-to-run pipeline, hierarchical, and swarm-loop topologies with example manifests. ## Model Routing Orloj decouples agents from specific model providers through **ModelEndpoint** resources. A ModelEndpoint declares a provider, base URL, default model, and authentication -- and agents reference it by name. This lets you swap providers, manage credentials centrally, and route different agents to different models without modifying agent manifests. ### Model Endpoints A ModelEndpoint resource configures a connection to a model provider. 
```yaml apiVersion: orloj.dev/v1 kind: ModelEndpoint metadata: name: openai-default spec: provider: openai base_url: https://api.openai.com/v1 default_model: gpt-4o-mini auth: secretRef: openai-api-key ``` #### Supported Providers | Provider | `provider` value | Default `base_url` | | ------------ | ---------------- | ------------------------------ | | OpenAI | `openai` | `https://api.openai.com/v1` | | Anthropic | `anthropic` | `https://api.anthropic.com/v1` | | Azure OpenAI | `azure-openai` | (must be set explicitly) | | Ollama | `ollama` | `http://127.0.0.1:11434` | | Mock | `mock` | (no network calls) | #### Provider-Specific Options Some providers require additional configuration via the `options` field: **Anthropic:** ```yaml spec: provider: anthropic base_url: https://api.anthropic.com/v1 default_model: claude-3-5-sonnet-latest options: anthropic_version: "2023-06-01" max_tokens: "1024" auth: secretRef: anthropic-api-key ``` **Azure OpenAI:** ```yaml spec: provider: azure-openai base_url: https://YOUR_RESOURCE_NAME.openai.azure.com default_model: gpt-4o-deployment options: api_version: "2024-10-21" auth: secretRef: azure-openai-api-key ``` **Ollama** (local, no auth required): ```yaml spec: provider: ollama base_url: http://127.0.0.1:11434 default_model: llama3.1 ``` ### Binding Agents to Models Agents configure model routing through `spec.model_ref`, which points to a ModelEndpoint: ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: writer-agent spec: model_ref: openai-default prompt: | You are a writing agent. ``` ### How Routing Works When a worker executes an agent turn: 1. The runtime resolves the agent's referenced ModelEndpoint from `model_ref`. 2. The model gateway constructs a provider-specific API request using the endpoint's `base_url`, `default_model`, `options`, and auth credentials. 3. The request is sent to the provider and the response is returned to the agent execution loop. 
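The resolution and request-assembly steps above can be sketched in a few lines. This is a hedged illustration, not Orloj's implementation: the helper names, the in-memory registry, and the OpenAI-style `/chat/completions` path are all assumptions.

```python
# Illustrative sketch of model routing resolution. Names are hypothetical.

def resolve_endpoint(model_ref: str, namespace: str, registry: dict) -> dict:
    """Resolve `model_ref` to a ModelEndpoint: bare names resolve in the
    agent's namespace; 'namespace/name' resolves across namespaces."""
    if "/" in model_ref:
        ns, name = model_ref.split("/", 1)
    else:
        ns, name = namespace, model_ref
    return registry[(ns, name)]

def build_request(endpoint: dict, prompt: str) -> dict:
    """Assemble a provider request from the endpoint's declared fields.
    The /chat/completions suffix is an OpenAI-style assumption."""
    return {
        "url": endpoint["base_url"].rstrip("/") + "/chat/completions",
        "model": endpoint["default_model"],
        "options": endpoint.get("options", {}),
        "prompt": prompt,
    }

registry = {("default", "openai-default"): {
    "provider": "openai",
    "base_url": "https://api.openai.com/v1",
    "default_model": "gpt-4o-mini",
}}

ep = resolve_endpoint("openai-default", "default", registry)
req = build_request(ep, "Summarize the findings.")
```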
ModelEndpoint references are resolved by name within the same namespace, or by `namespace/name` for cross-namespace references. ### Authentication Model authentication is managed through Secret resources referenced by `auth.secretRef`. The simplest way to create one is the imperative CLI command: ```bash orlojctl create secret openai-api-key --from-literal value=sk-your-api-key-here ``` Or with a YAML manifest via `orlojctl apply -f`: ```yaml apiVersion: orloj.dev/v1 kind: Secret metadata: name: openai-api-key spec: stringData: value: sk-your-api-key-here ``` `stringData` values are base64-encoded into `data` during normalization and then cleared (write-only semantics). The runtime reads from `data` at execution time. In production, you can also skip `Secret` resources entirely and inject values via environment variables (`ORLOJ_SECRET_`). See [Secret Handling](../operations/security.md#secret-handling) for details. ### Governance Integration AgentPolicy resources can restrict which models an agent is allowed to use via the `allowed_models` field: ```yaml apiVersion: orloj.dev/v1 kind: AgentPolicy metadata: name: cost-policy spec: allowed_models: - gpt-4o max_tokens_per_run: 50000 ``` If an agent's resolved endpoint `default_model` is not in the policy's `allowed_models` list, execution is denied. ### Related Resources * [Resource Reference: ModelEndpoint, Secret](../reference/resources.md) * [Configuration](../operations/configuration.md) * [Guide: Configure Model Routing](../guides/configure-model-routing.md) ## Tasks and Scheduling A **Task** is a request to execute an AgentSystem. Tasks are the unit of work in Orloj -- they carry input, track execution state, and produce output. **TaskSchedules** and **TaskWebhooks** automate task creation from cron expressions and external events. ### Tasks A Task binds an AgentSystem to a specific input and execution configuration. 
```yaml
apiVersion: orloj.dev/v1
kind: Task
metadata:
  name: weekly-report
spec:
  system: report-system
  input:
    topic: AI startups
  priority: high
  retry:
    max_attempts: 3
    backoff: 5s
  message_retry:
    max_attempts: 2
    backoff: 250ms
    max_backoff: 2s
    jitter: full
  requirements:
    region: default
    model: gpt-4o
```

#### Task Lifecycle

Every task moves through a well-defined set of phases:

```
Pending ──► Running ──► Succeeded
                   └──► Failed
                          └──► DeadLetter
```

| Phase        | Meaning                                                                           |
| ------------ | --------------------------------------------------------------------------------- |
| `Pending`    | Task is created and waiting for a worker to claim it.                             |
| `Running`    | A worker has claimed the task and is executing the agent graph.                   |
| `Succeeded`  | All agents in the graph completed successfully.                                   |
| `Failed`     | Execution failed but retry attempts remain. The task may transition back to `Pending` for another attempt. |
| `DeadLetter` | All retry attempts exhausted. Terminal state requiring manual investigation.      |

#### Worker Assignment and Leases

The scheduler assigns tasks to workers based on `requirements` (region, GPU, model). Workers claim tasks through a lease mechanism:

1. Scheduler matches task requirements to worker capabilities.
2. Worker claims the task and acquires a time-bounded lease.
3. Worker renews the lease via heartbeats during execution.
4. If the lease expires (worker crash, network partition), another worker may safely take over.

This guarantees that at most one worker owns a task at any time, so execution remains single-owner even under failure.

#### Retry Configuration

Tasks support two levels of retry:

**Task-level retry** (`spec.retry`) -- retries the entire task from the beginning if it fails.

```yaml
retry:
  max_attempts: 3
  backoff: 5s
```

**Message-level retry** (`spec.message_retry`) -- retries individual agent-to-agent messages within the graph without restarting the full task.
```yaml message_retry: max_attempts: 2 backoff: 250ms max_backoff: 2s jitter: full ``` Retry uses capped exponential backoff with configurable jitter (`none`, `full`, `equal`). Messages that exhaust retries transition to `deadletter` phase. #### Cyclic Graphs For AgentSystems with cycles (loops), `spec.max_turns` bounds the number of iterations to prevent infinite execution: ```yaml spec: system: manager-research-loop-system input: topic: AI coding assistants max_turns: 6 ``` #### Task Templates Tasks with `mode: template` serve as templates for TaskSchedules and TaskWebhooks. They are not executed directly. ```yaml spec: mode: template system: report-system input: topic: AI startups ``` ### Task Schedules A TaskSchedule creates tasks on a cron-based schedule from a template task. ```yaml apiVersion: orloj.dev/v1 kind: TaskSchedule metadata: name: weekly-report spec: task_ref: weekly-report-template schedule: "0 9 * * 1" time_zone: America/Chicago suspend: false starting_deadline_seconds: 300 concurrency_policy: forbid successful_history_limit: 10 failed_history_limit: 3 ``` | Field | Description | | --------------------------- | ---------------------------------------------------------------- | | `schedule` | Standard 5-field cron expression. | | `time_zone` | IANA timezone (defaults to `UTC`). | | `concurrency_policy` | `forbid` prevents overlapping runs. | | `starting_deadline_seconds` | Maximum lateness before a missed trigger is skipped. | | `suspend` | Set to `true` to pause scheduling without deleting the resource. | ### Task Webhooks A TaskWebhook creates tasks in response to external HTTP events, with built-in signature verification and idempotency. 
```yaml apiVersion: orloj.dev/v1 kind: TaskWebhook metadata: name: report-github-push spec: task_ref: weekly-report-template auth: profile: github secret_ref: webhook-shared-secret idempotency: event_id_header: X-GitHub-Delivery dedupe_window_seconds: 86400 payload: mode: raw input_key: webhook_payload ``` Supported auth profiles: | Profile | Signature Method | Headers | | --------- | -------------------------------------------- | ------------------------------------------ | | `generic` | HMAC-SHA256 over `timestamp + "." + rawBody` | `X-Signature`, `X-Timestamp`, `X-Event-Id` | | `github` | HMAC-SHA256 over raw body | `X-Hub-Signature-256`, `X-GitHub-Delivery` | ### Workers Workers are the execution units that claim and run tasks. They register capabilities and the scheduler uses these for task matching. ```yaml apiVersion: orloj.dev/v1 kind: Worker metadata: name: worker-a spec: region: default max_concurrent_tasks: 1 capabilities: gpu: false supported_models: - gpt-4o ``` ### Related Resources * [Resource Reference: Task, TaskSchedule, TaskWebhook, Worker](../reference/resources.md) * [Execution and Messaging](../architecture/execution-model.md) * [Troubleshooting](../operations/troubleshooting.md) ## Tools and Isolation A **Tool** is an external capability that agents can invoke during execution. Orloj provides a standardized tool contract, multiple isolation backends, and runtime controls for timeout, retry, and risk classification. ### Defining a Tool Tools are declared as resources that describe the tool's endpoint, auth requirements, risk level, and runtime configuration. 
```yaml apiVersion: orloj.dev/v1 kind: Tool metadata: name: web_search spec: type: http endpoint: https://api.search.com auth: secretRef: search-api-key ``` For tools that require isolation and runtime controls: ```yaml apiVersion: orloj.dev/v1 kind: Tool metadata: name: wasm_echo spec: type: wasm capabilities: - wasm.echo.invoke risk_level: high runtime: isolation_mode: wasm timeout: 5s retry: max_attempts: 1 backoff: 0s max_backoff: 1s jitter: none ``` #### Key Fields | Field | Description | | ------------------------ | ----------------------------------------------------------------------- | | `type` | Tool type. Determines the transport and execution model. See below. | | `endpoint` | The tool's network endpoint. | | `capabilities` | Declared operations this tool provides. Used for permission matching. | | `risk_level` | `low`, `medium`, `high`, or `critical`. Affects default isolation mode. | | `runtime.isolation_mode` | Execution isolation backend (see below). | | `runtime.timeout` | Maximum execution time. Defaults to `30s`. | | `runtime.retry` | Retry policy for failed invocations. | | `auth.secretRef` | Reference to a Secret resource for tool authentication. | ### Tool Types `Tool.spec.type` determines how the runtime communicates with the tool. Six types are supported: | Type | Transport | Contract | Use case | | ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- | | `http` | HTTP POST to `endpoint` | Raw body or `ToolExecutionResponse` | Simple API integrations. Default when omitted. | | `external` | HTTP POST to `endpoint` | Strict `ToolExecutionRequest` / `ToolExecutionResponse` | Tools running as standalone microservices that need the full execution context. 
| | `grpc` | Unary gRPC call to `endpoint` | `ToolExecutionRequest` / `ToolExecutionResponse` as JSON over `orloj.tool.v1.ToolService/Execute` | Teams that prefer gRPC for tool communication. | | `webhook-callback` | HTTP POST to `endpoint`, then poll `{endpoint}/{request_id}` | `ToolExecutionRequest` / `ToolExecutionResponse` | Long-running tools, batch jobs, or tools that require human-in-the-loop steps. | | `mcp` | JSON-RPC 2.0 via stdio or HTTP | MCP `tools/call` / `tools/list` | Tools exposed by MCP servers. Auto-generated by the McpServer controller. | | `queue` | Reserved for future use | -- | Planned for message-queue-backed tools. | All types flow through the same governed runtime pipeline -- policy enforcement, retry, timeout, auth injection, and error taxonomy behave identically regardless of tool type. Unknown type values are rejected at apply time. #### MCP The `mcp` type represents tools provided by an MCP (Model Context Protocol) server. These tools are auto-generated by the `McpServer` controller -- you do not create them manually. Each `type=mcp` tool carries `mcp_server_ref` (the McpServer that owns it) and `mcp_tool_name` (the tool name on the MCP server). At invocation time, the `MCPToolRuntime` resolves the server reference, obtains a session from the `McpSessionManager`, and sends a `tools/call` JSON-RPC 2.0 request through the appropriate transport (stdio or Streamable HTTP). MCP tools also carry `description` and `input_schema` from the MCP server's `tools/list` response. These are propagated to the model gateway so the LLM receives rich, structured tool definitions instead of generic parameter schemas. See the [Connect an MCP Server](../guides/connect-mcp-server.md) guide for a complete walkthrough. #### HTTP (default) The `http` type sends the agent's tool input as an HTTP POST body to `spec.endpoint`. The runtime accepts both raw text responses and structured `ToolExecutionResponse` JSON envelopes. 
Auth is injected as an `Authorization: Bearer` header when `auth.secretRef` is configured. #### External The `external` type sends the full `ToolExecutionRequest` contract envelope as JSON to `spec.endpoint` and expects a `ToolExecutionResponse` back. This gives the external service access to the full execution context (task ID, agent, namespace, trace IDs, attempt number). Use this when your tool needs to be aware of the Orloj execution context. #### gRPC The `grpc` type calls `orloj.tool.v1.ToolService/Execute` as a unary gRPC method on `spec.endpoint`, using a JSON codec. The request and response payloads are the same `ToolExecutionRequest` / `ToolExecutionResponse` envelopes as `external`. Use this when your tool infrastructure is gRPC-native. #### Webhook-Callback The `webhook-callback` type supports asynchronous tool execution: 1. The runtime POSTs a `ToolExecutionRequest` to `spec.endpoint`. 2. The tool returns `202 Accepted` (or `200 OK` with an immediate result). 3. If `202`: the runtime polls `{endpoint}/{request_id}` at regular intervals until a `ToolExecutionResponse` with a terminal status arrives, or the configured timeout expires. This is useful for tools that take minutes to complete (e.g., batch processing, code review, CI pipeline triggers) or that require external approval before returning a result. ### Isolation Modes Isolation modes control the execution boundary of a tool, independent of tool type. | Mode | Description | Default for | | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- | | `none` | Direct execution in the worker process. The `http` type makes real HTTP calls; other types use their respective transports. 
| `low` and `medium` risk tools | | `sandboxed` | Restricted container execution with secure defaults: read-only filesystem, no capabilities, no privilege escalation, no network, non-root user, memory/CPU/pids limits. | `high` and `critical` risk tools | | `container` | Each tool invocation runs in an isolated container. Full filesystem and network isolation. | Explicitly configured | | `wasm` | Tool runs as a WebAssembly module with a host-guest stdin/stdout contract. Memory-safe and deterministic. | Explicitly configured | The isolation mode defaults are based on `risk_level`: * `low` or `medium` risk: defaults to `none` * `high` or `critical` risk: defaults to `sandboxed` You can always override the default by setting `runtime.isolation_mode` explicitly. #### Sandboxed Defaults When `isolation_mode` is `sandboxed`, the container backend enforces these secure defaults: * `--read-only` filesystem * `--cap-drop=ALL` (no Linux capabilities) * `--security-opt no-new-privileges` * `--network none` (no network access) * `--user 65532:65532` (non-root) * `--memory 128m` * `--cpus 0.50` * `--pids-limit 64` These defaults can be overridden with `--tool-container-*` flags on `orlojd` and `orlojworker`, but the default posture is restrictive. ### Tool Contract v1 Every tool interaction follows a standardized request/response envelope. This contract ensures tools are portable, testable, and observable regardless of the isolation backend. **Request envelope** (sent to the tool): ```json { "request_id": "req-abc-123", "tool": "web_search", "action": "invoke", "parameters": { "query": "enterprise AI adoption trends" }, "auth": { "type": "bearer", "token": "sk-..." }, "context": { "task": "weekly-report", "agent": "research-agent", "attempt": 1 } } ``` **Response envelope** (returned from the tool): ```json { "request_id": "req-abc-123", "status": "success", "result": { "data": "..." 
  }
}
```

**Error response:**

```json
{
  "request_id": "req-abc-123",
  "status": "error",
  "error": {
    "tool_code": "rate_limited",
    "tool_reason": "API rate limit exceeded",
    "retryable": true
  }
}
```

The tool contract defines a canonical error taxonomy with `tool_code`, `tool_reason`, and `retryable` fields, enabling the runtime to make intelligent retry decisions.

### WASM Tools

WASM tools communicate over stdin/stdout using the same JSON envelope contract. The host writes the request to the module's stdin and reads the response from stdout. This provides memory-safe, deterministic execution with no filesystem or network access unless explicitly granted.

See the [WASM Tool Module Contract v1](../reference/wasm-tool-module-contract-v1.md) for the full specification.

### Error Taxonomy

Tool failures use a canonical error taxonomy with three fields:

| Field         | Purpose                                                                                           |
| ------------- | ------------------------------------------------------------------------------------------------- |
| `tool_code`   | Machine-readable error code (e.g. `rate_limited`, `unsupported_tool`, `secret_resolution_failed`) |
| `tool_reason` | Human-readable explanation                                                                        |
| `retryable`   | Whether the runtime should retry the invocation                                                   |

HTTP status codes are mapped automatically: `429` and `5xx` responses are retryable, while other `4xx` responses are not. HTTP `401` maps to `auth_invalid` and `403` maps to `auth_forbidden` -- both non-retryable. All tool types share the same taxonomy, so policy and observability behave consistently.
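As a hedged sketch, the status mapping above can be expressed as a small classifier. The `401`, `403`, and `429` mappings follow the text; the `tool_code` names used for generic `5xx` and `4xx` failures (`upstream_error`, `invalid_request`) are placeholders, not codes confirmed by the contract.

```python
# Sketch of HTTP-status-to-taxonomy mapping. Not the actual Orloj code.

def classify_http_status(status: int) -> dict:
    """Map an HTTP status to canonical error taxonomy fields."""
    if status == 401:
        return {"tool_code": "auth_invalid", "retryable": False}
    if status == 403:
        return {"tool_code": "auth_forbidden", "retryable": False}
    if status == 429:
        return {"tool_code": "rate_limited", "retryable": True}
    if 500 <= status <= 599:
        # placeholder code name: 5xx responses are retryable
        return {"tool_code": "upstream_error", "retryable": True}
    if 400 <= status <= 499:
        # placeholder code name: other 4xx responses are not retryable
        return {"tool_code": "invalid_request", "retryable": False}
    return {"tool_code": "none", "retryable": False}
```

The runtime can then consult the `retryable` field when deciding whether to schedule another attempt.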
### Auth Profiles

Tools support four authentication profiles via `spec.auth.profile`:

| Profile                     | Secret format                                         | Injection                                                     |
| --------------------------- | ----------------------------------------------------- | ------------------------------------------------------------- |
| `bearer` (default)          | Single token value                                    | `Authorization: Bearer <token>`                               |
| `api_key_header`            | Single key value                                      | Custom header via `spec.auth.headerName`                      |
| `basic`                     | `username:password`                                   | `Authorization: Basic <base64(username:password)>`            |
| `oauth2_client_credentials` | Multi-key secret with `client_id` and `client_secret` | Token exchange at `spec.auth.tokenURL`, then bearer injection |

When `spec.auth.secretRef` is set without an explicit profile, the default is `bearer` for backward compatibility. See the [Tool Contract v1](../reference/tool-contract-v1.md) for full auth binding details.

#### Secret Rotation

Secret resolution is performed fresh per tool invocation -- there is no caching of raw secret values. If a secret is rotated between invocations, the new value takes effect on the next call without requiring a restart.

For `oauth2_client_credentials`, access tokens are cached with a TTL derived from the token endpoint's `expires_in` response. Tokens are evicted on expiry or when the tool endpoint returns HTTP 401, triggering a fresh token exchange.

### Retry and Timeout

Each tool can configure its own retry policy independently of the task-level retry:

```yaml
runtime:
  timeout: 5s
  retry:
    max_attempts: 3
    backoff: 1s
    max_backoff: 30s
    jitter: full
```

Retry uses capped exponential backoff. The `jitter` field controls randomization: `none` (deterministic), `full` (random between 0 and backoff), or `equal` (half deterministic, half random).

### Governance Integration

Tool invocations are gated by the [governance layer](./governance.md). An agent must have the required permissions (via AgentRole) to invoke a tool, and the tool must not be blocked by any applicable AgentPolicy.
Unauthorized calls fail closed with a `tool_permission_denied` error. ### Operation Classes Tools can declare operation classes via `spec.operation_classes` (e.g. `read`, `write`, `delete`, `admin`). When omitted, the default is `["read"]` for low/medium risk tools and `["write"]` for high/critical risk tools. Operation classes are used by `ToolPermission.spec.operation_rules` to define per-class policy verdicts: * **allow**: proceed with the tool call (default). * **deny**: block the call with a `permission_denied` error. * **approval\_required**: pause the task and create a `ToolApproval` resource. An external actor (human or system) must approve or deny the request before the task can continue. When multiple rules match, the most restrictive verdict wins: `deny` > `approval_required` > `allow`. ### Approval Workflow When a tool call is flagged as `approval_required`, the following happens: 1. The `GovernedToolRuntime` returns an `ErrToolApprovalRequired` sentinel error. 2. The task controller transitions the task to the `WaitingApproval` phase. 3. A `ToolApproval` resource is created with details about the pending call. 4. An external actor approves or denies the request via the API (`POST /v1/tool-approvals/{name}/approve` or `/deny`). 5. The task controller reconciles the approval status: * **Approved**: task resumes to `Running`. * **Denied**: task transitions to `Failed` with `approval_denied`. * **Expired** (TTL elapsed): task transitions to `Failed` with `approval_timeout`. Approval-related outcomes are non-retryable and do not consume retry budget. 
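The "most restrictive verdict wins" rule above can be sketched as a small precedence fold: collect the verdicts of all matching operation rules and keep the strictest one. The function name is illustrative.

```python
# Sketch of operation-rule verdict precedence: deny > approval_required > allow.
SEVERITY = {"allow": 0, "approval_required": 1, "deny": 2}

def combine_verdicts(verdicts):
    """Most restrictive verdict wins; no matching rules defaults to allow."""
    return max(verdicts, key=SEVERITY.get, default="allow")
```

For example, a rule set yielding `["allow", "approval_required"]` pauses the task for approval, while any `deny` short-circuits everything else.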
### Related Resources * [Resource Reference: Tool](../reference/resources.md) * [Resource Reference: McpServer](../reference/resources.md#mcpserver) * [Tool Contract v1](../reference/tool-contract-v1.md) * [WASM Tool Module Contract v1](../reference/wasm-tool-module-contract-v1.md) * [Tool Runtime Conformance](../operations/tool-runtime-conformance.md) * [Guide: Connect an MCP Server](../guides/connect-mcp-server.md) * [Guide: Build a Custom Tool](../guides/build-custom-tool.md) ## Memory Memory gives agents the ability to store, retrieve, and search information across execution steps and across tasks. Orloj implements memory as a layered system: conversation history provides short-term context within a single task turn, a task-scoped shared store lets agents in the same task exchange state, and persistent backends retain knowledge across task runs. ### How Memory Works When an agent has `spec.memory.ref` set to a Memory resource, the runtime attaches that memory backend to the agent. Built-in memory operations are granted explicitly through `spec.memory.allow`, and the runtime exposes only those allowed operations as callable built-in tools. They behave like tools during execution, but are handled internally by the runtime without network calls. ```yaml apiVersion: orloj.dev/v1 kind: Agent metadata: name: research-agent spec: model_ref: openai-default prompt: | You are a research assistant. Use memory tools to store and retrieve findings. 
tools: - web_search memory: ref: research-memory allow: - read - write - search limits: max_steps: 10 ``` The `memory.ref` field points to a Memory resource that configures the backing store: ```yaml apiVersion: orloj.dev/v1 kind: Memory metadata: name: research-memory spec: type: vector provider: in-memory ``` For vector-similarity search with PostgreSQL and pgvector: ```yaml apiVersion: orloj.dev/v1 kind: Memory metadata: name: production-memory spec: type: vector provider: pgvector endpoint: postgres://orloj@pgvector-host:5432/memories embedding_model: openai-embeddings # references a ModelEndpoint auth: secretRef: pg-password ``` For a custom vector database via the HTTP adapter: ```yaml apiVersion: orloj.dev/v1 kind: Memory metadata: name: custom-vectordb spec: type: vector provider: http endpoint: https://my-vector-adapter.example.com auth: secretRef: vector-db-api-key ``` See [Memory Providers](./providers.md) for the full list of supported providers. ### Memory Layers #### Conversation History Every agent accumulates a message history during multi-turn execution within a single task turn. The system prompt, user context, model responses, and tool results are all appended to the conversation and sent to the model on each step. This gives the model continuity across reasoning steps without explicit memory tool calls. Conversation history is ephemeral -- it exists only for the duration of the agent's current activation and is not shared between agents. #### Task-Scoped Shared Memory When no persistent backend is configured (or as a fallback), memory tools operate on an in-process key-value store scoped to the current task. All agents within the same task share this store, enabling coordination: * Agent A writes `memory.write({"key": "findings", "value": "..."})` * Agent B reads `memory.read({"key": "findings"})` This store is ephemeral and cleared when the task completes. 
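The coordination pattern above can be sketched as a per-task dictionary. The class and method names are illustrative, not the runtime's actual types.

```python
# Sketch of a task-scoped shared store: one dict per task, shared by all
# agents in that task, dropped when the task completes.

class TaskScopedMemory:
    """In-process key-value store scoped to a single task."""
    def __init__(self):
        self._stores = {}  # task_id -> {key: value}

    def write(self, task_id, key, value):
        self._stores.setdefault(task_id, {})[key] = value

    def read(self, task_id, key):
        return self._stores.get(task_id, {}).get(key)

    def clear(self, task_id):
        # called when the task completes; the shared state is discarded
        self._stores.pop(task_id, None)

# Agent A writes findings; Agent B, in the same task, reads them.
mem = TaskScopedMemory()
mem.write("task-1", "findings", "Seed-stage AI funding rose in Q3.")
value = mem.read("task-1", "findings")
```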
#### Persistent Backends When a Memory resource specifies a persistent provider, memory tools delegate to the configured backend. Data written by one task is available to future tasks that reference the same Memory resource. There are two ways to connect a vector database: **Built-in providers** -- Orloj ships Go implementations that connect directly to popular databases. Users configure `spec.endpoint` and `spec.auth.secretRef` on the Memory CRD and Orloj handles the rest. No extra infrastructure needed. **HTTP adapter** -- For databases without a built-in provider, users deploy a lightweight adapter service that speaks a simple JSON contract and set `provider: http`. The adapter can be written in any language. See [Memory Providers](./providers.md) for full details on each provider, configuration examples, and how to build custom providers. ### Built-in Memory Tools When `spec.memory.ref` is set and `spec.memory.allow` grants the corresponding operations, the runtime exposes the following built-in tools. They do not need to be listed in `spec.tools`. #### `memory.read` Retrieve a value by key. ```json {"key": "research-findings"} ``` Returns `{"found": true, "key": "research-findings", "value": "..."}` or `{"found": false, "key": "research-findings"}`. #### `memory.write` Store a value under a key. Overwrites any existing value. ```json {"key": "research-findings", "value": "The study shows..."} ``` Returns `{"status": "ok", "key": "research-findings"}`. #### `memory.search` Search stored entries by keyword (or vector similarity when a persistent backend with embeddings is configured). ```json {"query": "climate data", "top_k": 5} ``` Returns `{"results": [{"key": "...", "value": "...", "score": 1.0}], "count": 3}`. #### `memory.list` List stored entries, optionally filtered by key prefix. ```json {"prefix": "research/"} ``` Returns `{"entries": [{"key": "...", "value": "..."}], "count": 5}`. 
#### `memory.ingest` Chunk a document and store the pieces for later search. Useful for loading text files, reports, or other documents into memory. ```json { "source": "quarterly-report", "content": "Full text of the document...", "chunk_size": 1000, "overlap": 200 } ``` The tool splits the content into overlapping windows and stores each chunk under `{source}/chunk-{NNNN}`. Returns `{"status": "ok", "source": "quarterly-report", "chunks_stored": 12}`. `chunk_size` and `overlap` are optional and default to 1000 and 200 characters respectively. ### Memory in Agent Systems In a multi-agent system, memory enables coordination between agents without requiring direct message passing for every piece of state: * A **research agent** writes findings to memory. * A **writer agent** reads those findings and produces a report. * A **coordinator agent** lists memory entries to track overall progress. All agents that reference the same Memory resource (via `spec.memory.ref`) and execute within the same task share the same backing store. ### Memory Resource Configuration The Memory resource is a declarative configuration. It tells the runtime which backend to use and how to configure it. #### `spec` Fields | Field | Description | | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `type` | Categorization of the memory use case (e.g. `vector`, `kv`). Informational; does not affect runtime behavior in v1. | | `provider` | Backend implementation. `in-memory` (default), `pgvector`, `http` (external adapter), or a registered built-in provider name. See [Memory Providers](./providers.md). | | `embedding_model` | Reference to a ModelEndpoint resource that provides an embeddings API. Required for vector providers like `pgvector`. 
The endpoint's `base_url`, `auth`, and `default_model` are used to generate embeddings. | | `endpoint` | URL or connection string for the database or adapter service. Required for `pgvector`, `http`, and cloud-hosted built-in providers. Not needed for `in-memory`. | | `auth.secretRef` | Reference to a Secret resource containing credentials (API key, password, bearer token). | #### Status The controller reconciles Memory resources and reports backend health: * **Ready** -- backend is configured and reachable. * **Error** -- provider is unsupported or connectivity check failed. See `status.lastError` for details. ### Frontend The Memory detail page in the UI includes an **Entries** tab that displays stored memory entries. You can search entries by keyword and browse keys and values. This is useful for debugging agent behavior and inspecting what data has been stored. ### Related Resources * [Memory Providers](./providers.md) * [Resource Reference: Memory](../../reference/resources.md#memory) * [Agents and Agent Systems](../agents-and-systems.md) * [Architecture Overview](../../architecture/overview.md) * [API Reference](../../reference/api.md) ## Memory Providers Memory providers are the backends that store and retrieve data for Orloj's built-in memory tools. There are two paths to connect a vector database, both coexisting: **Built-in providers** -- Orloj ships Go implementations that connect directly to popular databases. Users configure `spec.endpoint` and `spec.auth.secretRef` on the Memory CRD and Orloj handles the rest. No extra infrastructure needed. **HTTP adapter** -- For databases without a built-in provider, users deploy a lightweight adapter service that speaks a simple JSON contract and set `provider: http`. The adapter can be written in any language. Both paths use the same CRD fields: `spec.endpoint` for the database URL and `spec.auth.secretRef` for credentials. 
### Built-in Providers

| Provider    | Description |
| ----------- | ----------- |
| `in-memory` | In-process map. No endpoint needed. Useful for testing and single-instance deployments. Data is lost on restart. |
| `pgvector`  | PostgreSQL with the pgvector extension. Full vector-similarity search via embeddings. Requires `endpoint` (Postgres DSN), `embedding_model` (ModelEndpoint reference), and optionally `auth.secretRef` (Postgres password). See [pgvector](#pgvector). |
| `http`      | Delegates to an external HTTP service at `spec.endpoint`. See [HTTP Adapter](#http-adapter). |

### Coming Soon

The following built-in providers are planned. Each will connect directly to the database using `spec.endpoint` and `spec.auth.secretRef` -- no adapter service required. In the meantime, any of these can be used today via the `http` adapter.

| Provider | Status  |
| -------- | ------- |
| Qdrant   | Planned |
| Pinecone | Planned |
| Weaviate | Planned |
| Chroma   | Planned |
| Milvus   | Planned |

### pgvector

The `pgvector` provider stores memory entries in PostgreSQL using the [pgvector](https://github.com/pgvector/pgvector) extension. Every write generates a vector embedding, enabling true cosine-similarity search via `memory.search`.

#### Requirements

* A PostgreSQL instance with the `vector` extension installed (pgvector).
* A ModelEndpoint that serves an OpenAI-compatible `/embeddings` API (OpenAI, Azure OpenAI, Ollama, or any compatible provider).
#### Configuration

```yaml
apiVersion: orloj.dev/v1
kind: Memory
metadata:
  name: team-knowledge
  namespace: production
spec:
  type: vector
  provider: pgvector
  endpoint: postgres://orloj@pgvector-host:5432/memories
  embedding_model: openai-embeddings
  auth:
    secretRef: pg-password
```

The `embedding_model` field references a ModelEndpoint by name:

```yaml
apiVersion: orloj.dev/v1
kind: ModelEndpoint
metadata:
  name: openai-embeddings
  namespace: production
spec:
  provider: openai
  default_model: text-embedding-3-small
  auth:
    secretRef: openai-api-key
```

| Field             | Description |
| ----------------- | ----------- |
| `endpoint`        | Full Postgres connection string (DSN). Example: `postgres://user@host:5432/dbname`. |
| `embedding_model` | Name of a ModelEndpoint in the same namespace (or `namespace/name` for cross-namespace). The endpoint's `base_url` and `auth` are used to call the embeddings API, and `default_model` selects the model. |
| `auth.secretRef`  | Optional. Reference to a Secret containing the Postgres password. Injected into the DSN if the connection string doesn't already include one. |

#### How It Works

On creation, the memory controller:

1. Resolves the `embedding_model` ModelEndpoint and builds an embedding provider.
2. Connects to PostgreSQL using the DSN from `endpoint`.
3. Generates a test embedding to auto-detect the vector dimension.
4. Creates the `vector` extension, table, and HNSW index if they don't exist.
5. Runs `Ping` to verify connectivity.
The table schema (default table name `orloj_memory`, overridable via `spec.options.table`):

```sql
CREATE TABLE orloj_memory (
    key        TEXT PRIMARY KEY,
    value      TEXT NOT NULL,
    embedding  vector(<dim>),  -- dimension auto-detected from the embedding model
    created_at TIMESTAMPTZ DEFAULT now(),
    updated_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX orloj_memory_embedding_idx
    ON orloj_memory USING hnsw (embedding vector_cosine_ops);
```

* **`memory.write`** embeds the value and upserts the row.
* **`memory.search`** embeds the query and performs cosine-similarity search.
* **`memory.read`** and **`memory.list`** operate on key/prefix without embeddings.
* **`memory.ingest`** chunks the document and stores each chunk with its embedding.

### HTTP Adapter

When `provider: http` is set, Orloj delegates all memory operations to an external service at `spec.endpoint`. This is the escape hatch for vector databases that don't have a built-in provider yet. The adapter can be written in any language and deployed anywhere Orloj can reach over HTTP.

#### Contract

The service must implement five endpoints. All POST endpoints accept and return `application/json`.

**`POST /put`** -- Store a key-value pair.

```json
// Request
{"key": "findings/chunk-0001", "value": "The quarterly report shows..."}
// Response
{"status": "ok"}
```

**`POST /get`** -- Retrieve a value by key.

```json
// Request
{"key": "findings/chunk-0001"}
// Response
{"found": true, "key": "findings/chunk-0001", "value": "The quarterly report shows..."}
```

**`POST /search`** -- Search entries by keyword or vector similarity.

```json
// Request
{"query": "quarterly revenue", "top_k": 5}
// Response
{"results": [{"key": "...", "value": "...", "score": 0.92}]}
```

**`POST /list`** -- List entries by key prefix.

```json
// Request
{"prefix": "findings/"}
// Response
{"entries": [{"key": "...", "value": "..."}]}
```

**`GET /ping`** -- Health check.

```json
// Response
{"status": "ok"}
```

Errors are signaled via HTTP status codes (4xx/5xx) with an optional `{"error": "message"}` body.
#### Authentication

When `spec.auth.secretRef` is set, Orloj sends an `Authorization: Bearer <token>` header on every request. The token is resolved from the referenced Secret resource.

#### Example

```yaml
apiVersion: orloj.dev/v1
kind: Memory
metadata:
  name: custom-vectordb
spec:
  provider: http
  endpoint: https://my-adapter.example.com
  auth:
    secretRef: adapter-api-key
```

### Custom Providers

For contributors adding first-party vector database support, or users building custom Orloj binaries, providers can be registered directly in Go. Implement the `PersistentMemoryBackend` interface and register a factory at startup:

```go
import agentruntime "github.com/OrlojHQ/orloj/runtime"

func init() {
	agentruntime.DefaultMemoryProviderRegistry().Register("qdrant",
		func(cfg agentruntime.MemoryProviderConfig) (agentruntime.PersistentMemoryBackend, error) {
			// cfg.Endpoint, cfg.AuthToken, cfg.Embedder are available
			return NewQdrantBackend(cfg)
		})
}
```

The `MemoryProviderConfig` passed to the factory contains:

| Field            | Description |
| ---------------- | ----------- |
| `Type`           | The `spec.type` from the Memory CRD (e.g. `vector`, `kv`). |
| `Provider`       | The `spec.provider` value that matched the registration. |
| `EmbeddingModel` | The raw `spec.embedding_model` string from the Memory CRD. |
| `Endpoint`       | The `spec.endpoint` URL or connection string. |
| `AuthToken`      | Resolved bearer token from `spec.auth.secretRef`. |
| `Options`        | Reserved for future provider-specific configuration. |
| `Embedder`       | An `EmbeddingProvider` interface (with `Embed` and `Dimensions` methods). Non-nil when `spec.embedding_model` references a valid ModelEndpoint. Vector providers should use this for generating embeddings. |

The Memory controller calls the factory, runs `Ping` to verify connectivity, and moves the resource to `Ready` if successful.

## Execution and Messaging

This page documents task routing, message lifecycle, and ownership guarantees.

### Graph Routing

`AgentSystem.spec.graph` supports two edge styles:

* `next`: legacy single edge
* `edges[]`: preferred route list with labels/policy metadata

### Fan-out and Fan-in

* Fan-out: one node routes to multiple downstream edges.
* Fan-in: downstream join gate with:
  * `wait_for_all`
  * `quorum` (`quorum_count` or `quorum_percent`)

Join state persists in `Task.status.join_states`.

### Message Lifecycle

`Task.status.messages` includes:

* lifecycle phase: `queued|running|retrypending|succeeded|deadletter`
* retry fields: `attempts`, `max_attempts`, `next_attempt_at`
* worker ownership fields: `worker`, `processed_at`, `last_error`
* routing/tracing fields: `branch_id`, `parent_branch_id`, `trace_id`, `parent_id`

### Tool Selection Model

* `Agent.spec.tools[]` defines candidate tools.
* Model responses select specific tool calls for each step.
* Only selected and authorized tools are executed.
* Unauthorized tool selections fail closed as `tool_permission_denied`.

### Ownership and Safety Guarantees

* only the `Task.status.claimedBy` worker may process messages
* leases are renewed during active processing
* lease expiry allows safe takeover by another worker
* idempotency keys protect replay and crash recovery

### Choosing an Execution Mode

Orloj supports two execution modes that share the same resource model and graph definitions.

**Sequential mode** (`--task-execution-mode=sequential`) runs the entire graph in-process on the server or embedded worker. Best for getting started, development, and single-agent systems. No message bus required.

**Message-driven mode** (`--task-execution-mode=message-driven`) distributes execution across workers via the message bus.
Each agent step is a queued message with durable delivery, retry, and dead-letter guarantees. Best for production, parallel fan-out, and horizontal scaling.

Both modes produce the same task trace, history, and output. You can develop in sequential mode and deploy to production in message-driven mode without changing your resource definitions.

See [Configuration](../operations/configuration.md) for the full set of flags.

### Related Docs

* [Tool Contract v1](../reference/tool-contract-v1.md)
* [Tool Runtime Conformance](../operations/tool-runtime-conformance.md)

## Architecture Overview

Orloj is organized into three layers: a **server** that manages resources and scheduling, **workers** that execute agent workflows, and a **governance layer** that enforces policies and permissions at runtime.

```
┌─────────────────────────────────────────────────────┐
│                   Server (orlojd)                   │
│                                                     │
│   ┌──────────────┐     ┌────────────────┐           │
│   │  API Server  │────►│ Resource Store │           │
│   │    (REST)    │     │  mem/postgres  │           │
│   └──────┬───────┘     └────────────────┘           │
│          │                                          │
│          ▼                                          │
│   ┌──────────────┐     ┌────────────────┐           │
│   │   Services   │────►│ Task Scheduler │           │
│   └──────────────┘     └───────┬────────┘           │
└────────────────────────────────┼────────────────────┘
                                 │ assign tasks
                                 ▼
┌─────────────────────────────────────────────────────┐
│                Workers (orlojworker)                │
│                                                     │
│   ┌──────────────┐     ┌───────────────┐            │
│   │  Task Worker │────►│ Model Gateway │            │
│   │              │     └───────────────┘            │
│   │              │     ┌───────────────┐            │
│   │              │────►│ Tool Runtime  │            │
│   │              │     └───────────────┘            │
│   │              │     ┌───────────────┐            │
│   │              │◄───►│  Message Bus  │            │
│   └──────────────┘     │  mem/nats-js  │            │
│          ▲             └───────────────┘            │
└──────────┼──────────────────────────────────────────┘
           │ enforced at runtime
┌──────────┴──────────────────────────────────────────┐
│                     Governance                      │
│                                                     │
│  ┌─────────────┐  ┌───────────┐  ┌────────────────┐ │
│  │ AgentPolicy │  │ AgentRole │  │ ToolPermission │ │
│  └─────────────┘  └───────────┘  └────────────────┘ │
└─────────────────────────────────────────────────────┘
```

### Server

The server runs as `orlojd` and provides:

**API Server** -- HTTP REST API for creating, reading, updating, and deleting all 13 resource types. Supports watch endpoints for real-time event streaming, optimistic concurrency via `resourceVersion` / `If-Match`, and namespace scoping. Also serves the built-in web console at `/ui/`.

**Resource Store** -- Pluggable storage backend for all resources. Two implementations:

* `memory` -- in-memory store for local development and testing. Fast, no dependencies, data is lost on restart.
* `postgres` -- PostgreSQL-backed store for production. Uses `FOR UPDATE SKIP LOCKED` for safe concurrent task claiming.

**Services** -- Background processes for each resource type: Agent, AgentSystem, ModelEndpoint, Tool, Memory, AgentPolicy, Task, TaskScheduler, TaskSchedule, and Worker. Services drive resources toward their desired state and update status fields.

**Task Scheduler** -- Matches pending tasks to available workers based on requirements (region, GPU, model), respects TaskSchedule cron triggers, and manages the assignment lifecycle.

### Workers

Workers run as `orlojworker` and execute the actual agent workflows:

**Task Worker** -- Claims assigned tasks via the lease mechanism, executes the agent graph step by step, and reports results back through status updates. Supports concurrent task execution up to `max_concurrent_tasks`.

**Model Gateway** -- Routes model requests to the appropriate provider based on the agent's `model_ref` configuration. Handles provider-specific request formatting, authentication, and response parsing for OpenAI, Anthropic, Azure OpenAI, Ollama, and mock backends.

**Tool Runtime** -- Executes tool invocations with the configured isolation backend (none, sandboxed, container, or WASM). Enforces timeouts, manages retries with capped exponential backoff and jitter, and normalizes responses into the standard tool contract envelope.
**Message Bus** -- Handles agent-to-agent communication within a task's graph. Two implementations:

* `memory` -- in-memory bus for local development.
* `nats-jetstream` -- NATS JetStream for production with durable delivery guarantees.

### Governance Layer

The governance layer is not a separate process -- it is enforced inline during worker execution:

**AgentPolicy** -- Evaluated before each agent turn. Checks `allowed_models`, `blocked_tools`, and `max_tokens_per_run`. Policies can be scoped to specific systems/tasks or applied globally.

**AgentRole + ToolPermission** -- Evaluated before each tool invocation. The worker collects the agent's permissions from all bound roles and checks them against the tool's ToolPermission requirements. Unauthorized calls return `tool_permission_denied`.

All governance decisions are deterministic and fail-closed. Denied actions produce structured errors that flow into task trace and history for auditability.

### Execution Modes

Orloj supports two execution modes. Start with sequential for development, then graduate to message-driven for production.

| Mode             | How it works | When to use |
| ---------------- | ------------ | ----------- |
| `sequential`     | The server drives execution directly in a single process. Simpler, lower latency, easy to debug. | Getting started, development, single-agent systems |
| `message-driven` | Workers consume from the message bus. Agents hand off via durable queued messages. Enables parallel fan-out and horizontal scaling. | Production, multi-agent systems, distributed workloads |

**Sequential** is the default and requires no external dependencies. Use `--embedded-worker` to run everything in one process.
**Message-driven** requires `--task-execution-mode=message-driven` and a message bus backend (`memory` for local testing, `nats-jetstream` for production). This mode provides lease-based ownership, idempotent replay, and dead-letter handling.

### Reliability Characteristics

Orloj's runtime provides several reliability guarantees:

* **Lease-based task ownership** -- Workers hold time-bounded leases on tasks. If a worker crashes, the lease expires and another worker can safely take over.
* **Owner-only message execution** -- Only the worker that holds the task lease can process messages for that task, preventing duplicate execution.
* **Idempotency tracking** -- Message idempotency keys prevent duplicate processing during replay and crash recovery.
* **Capped exponential retry with jitter** -- Both task-level and message-level retries use bounded backoff with configurable jitter to avoid thundering herds.
* **Dead-letter transitions** -- Messages and tasks that exhaust all retries move to a terminal `DeadLetter` phase for manual investigation rather than being silently dropped.

### Related Docs

* [Execution and Messaging](./execution-model.md)
* [Concepts](../concepts/)
* [Runbook](../operations/runbook.md)
* [Configuration](../operations/configuration.md)
* [Resource Reference](../reference/resources.md)

## Starter Blueprints

Blueprints are ready-to-run templates that combine agents, an agent system (the graph), and a task into a single directory. They are the fastest way to see Orloj in action and to understand each orchestration pattern.

For copy-paste **use case** bundles (YAML + README per scenario) that map these patterns to novice and enterprise-style problems, see [examples/use-cases/](../../../examples/use-cases/README.md).

### Available Patterns

#### Pipeline

Predictable stage-by-stage execution: `planner -> research -> writer`.
```bash
orlojctl apply -f examples/blueprints/pipeline/
```

#### Hierarchical

Manager-led delegation: `manager -> leads -> workers -> editor`.

```bash
orlojctl apply -f examples/blueprints/hierarchical/
```

#### Swarm and Loop

Parallel exploration with iterative coordination: `coordinator <-> scouts -> synthesizer`. Safety-bounded by `Task.spec.max_turns`.

```bash
orlojctl apply -f examples/blueprints/swarm-loop/
```

### Runtime Compatibility

Blueprints work in both execution modes:

* **Sequential** -- run with `--embedded-worker` for single-process development. Good for getting started.
* **Message-driven** -- run with `--agent-message-bus-backend=memory` (or `nats-jetstream`) and `--agent-message-consume` for distributed execution. Required for parallel fan-out in the swarm-loop pattern.

### What is Inside a Blueprint

Each blueprint directory contains:

* `agents/*.yaml` -- individual Agent resources with prompts, model config, and tool bindings.
* `agent-system.yaml` -- the AgentSystem resource defining the graph topology (nodes and edges).
* `task.yaml` -- a Task resource that targets the agent system with sample input.

Apply the entire directory with `orlojctl apply -f <blueprint-dir>/` to create all resources at once.
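As a concrete illustration, the pipeline blueprint's directory might be laid out like this (file names are hypothetical; check the repository for the actual layout):

```
examples/blueprints/pipeline/
├── agents/
│   ├── planner.yaml
│   ├── research.yaml
│   └── writer.yaml
├── agent-system.yaml
└── task.yaml
```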