Live Validation Matrix
Use this runbook to exercise Orloj with real model providers and a deterministic local tool stub before the open-source release.
Purpose
The automated Go test suite proves core correctness, but the live-validation matrix is where we check:
- real provider behavior
- message-driven execution
- tool isolation with real HTTP calls
- memory-backed agent workflows
- governance deny paths
- trigger paths through webhooks and schedules
Before You Start
- Run the automated baseline:
go test ./...
- Start orlojd:
go run ./cmd/orlojd --task-execution-mode=message-driven --agent-message-bus-backend=memory
- Start the worker for your lane:
Anthropic, model-only:
go run ./cmd/orlojworker \
--task-execution-mode=message-driven \
--agent-message-bus-backend=memory \
--agent-message-consume \
--model-gateway-provider=anthropic
Anthropic, tool-backed:
go run ./cmd/orlojworker \
--task-execution-mode=message-driven \
--agent-message-bus-backend=memory \
--agent-message-consume \
--model-gateway-provider=anthropic \
--tool-isolation-backend=container \
--tool-container-network=bridge
- Start the deterministic stub service:
make real-tool-stub
- Replace all replace-me provider secrets in testing/scenarios-real/.
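A leftover placeholder is easy to miss across several scenario files. The following is a minimal sketch of a pre-flight scan, assuming the placeholder appears as the literal string replace-me somewhere under the scenario directory; the function name check_no_placeholders is hypothetical and not part of the repository.

```shell
# check_no_placeholders DIR   (hypothetical helper, not shipped with Orloj)
# Returns 0 when no file under DIR contains the literal "replace-me",
# 1 when at least one does (file names go to stderr), 2 if DIR is missing.
check_no_placeholders() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 2
  fi
  # grep -r recurses into DIR, -l prints only the names of matching files.
  if matches=$(grep -rl -- 'replace-me' "$dir"); then
    printf 'placeholder secrets remain in:\n%s\n' "$matches" >&2
    return 1
  fi
  return 0
}
```

It could be chained in front of a gate, for example: check_no_placeholders testing/scenarios-real/ && make real-gate-pipeline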
Important readiness rule:
- Keep orlojd and the matching orlojworker running before any make real-apply-* or make real-gate-* command. If they are not up, tasks can fail immediately or stall.
- Quick check: the following should exit 0 before running gates:
curl -sf http://localhost:8080/healthz >/dev/null
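When gates are launched from a script, the quick check can be wrapped in a bounded wait so the script does not race a process that is still starting. This is a sketch only: wait_ready, its default of 30 attempts, and the 2-second per-probe timeout are assumptions chosen for illustration. It probes the health endpoint only and says nothing about whether the worker is consuming messages.

```shell
# wait_ready URL [ATTEMPTS] [DELAY_SECONDS]   (hypothetical helper)
# Polls URL with the same curl flags as the quick check and returns 0 on the
# first success, or 1 once ATTEMPTS probes have all failed.
wait_ready() {
  url=$1
  attempts=${2:-30}
  delay=${3:-1}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # --max-time bounds each probe so a hung connection cannot stall the loop.
    if curl -sf --max-time 2 "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}
```

Example: wait_ready http://localhost:8080/healthz && make real-gate-pipeline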
Matrix Overview
Wave 0
- make real-gate-pipeline
- make real-gate-hier
- make real-gate-loop
- make real-gate-tool
- make real-gate-tool-decision
Wave 1
- make real-gate-memory-shared
- make real-gate-memory-reuse
Wave 2
- make real-gate-tool-auth
- make real-gate-governance-deny
- make real-gate-tool-retry
Wave 3
- make real-gate-webhook
- make real-gate-schedule
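The waves can also be driven from a small wrapper. The sketch below assumes the gates in a wave run sequentially and that the run should stop at the first failure; the runbook does not state either requirement, so treat both as choices. run_wave and the GATE_RUNNER override are hypothetical names introduced here for illustration.

```shell
# Gate targets per wave, copied from the matrix.
WAVE0="real-gate-pipeline real-gate-hier real-gate-loop real-gate-tool real-gate-tool-decision"
WAVE1="real-gate-memory-shared real-gate-memory-reuse"
WAVE2="real-gate-tool-auth real-gate-governance-deny real-gate-tool-retry"
WAVE3="real-gate-webhook real-gate-schedule"

# run_wave NAME TARGET...   (hypothetical helper)
# Runs each target with make, or with $GATE_RUNNER if set (an assumed
# override that is handy for dry runs), and stops at the first failing gate.
run_wave() {
  wave=$1
  shift
  for target in "$@"; do
    echo "[$wave] $target"
    if ! "${GATE_RUNNER:-make}" "$target"; then
      echo "[$wave] $target failed; stopping" >&2
      return 1
    fi
  done
}
```

Example, with the wave variables left unquoted on purpose so sh or bash splits them into separate targets: run_wave "wave 0" $WAVE0 && run_wave "wave 1" $WAVE1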
Contract Enforcement Notes
Scenario 08-tool-auth-and-contract uses execution.profile: contract with on_contract_violation: observe. This means:
- The agent's tool sequence is tracked and violations are logged as agent_contract_violation events in the task trace.
- Violations do not deadletter the task; the agent continues to completion.
- Duplicate tool calls are short-circuited (cached result reused) in all scenarios, including 04-tool-call-smoke, which uses profile: dynamic.
- Tool results use the provider's native structured tool calling protocol (role: "tool" with tool_call_id for OpenAI, tool_result content blocks for Anthropic), preventing models from re-calling tools.
- Pipeline stages can use tool_use_behavior: stop_on_first_tool to exit immediately after the first successful tool call (1 model call + 1 tool call total).
If a gate deadletters unexpectedly, check whether on_contract_violation is set to non_retryable_error. Switch to observe to collect telemetry without disrupting the flow.
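Once a run has completed in observe mode, a quick count of logged violations helps decide whether the contract or the agent needs attention. The sketch below assumes the task trace, including agent_contract_violation events, is serialized into a captured file such as task.json; the runbook does not spell out that file's schema, so the helper only counts occurrences of the event name rather than parsing JSON. The name count_contract_violations is hypothetical.

```shell
# count_contract_violations FILE   (hypothetical helper)
# Prints how many times "agent_contract_violation" occurs in FILE.
count_contract_violations() {
  # grep -o emits one line per occurrence; "|| true" keeps a zero-match
  # result from being treated as a failure under "set -e" or pipefail.
  { grep -o 'agent_contract_violation' "$1" || true; } | wc -l | tr -d ' '
}
```

A non-zero count on a gate that passed under observe suggests the same scenario would deadletter under non_retryable_error.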
Acceptance Targets
- Run every Wave 0 and Wave 1 scenario 3 times:
make real-repeat TARGET=real-gate-pipeline COUNT=3
make real-repeat TARGET=real-gate-memory-shared COUNT=3
- Run governance and tool-decision scenarios 5 times:
make real-repeat TARGET=real-gate-tool-decision COUNT=5
make real-repeat TARGET=real-gate-governance-deny COUNT=5
Deterministic Tool Stub
The local stub service lives at:
- host: http://127.0.0.1:18080
- container-accessible: http://host.docker.internal:18080
Supported paths:
- /tool/smoke
- /tool/decision
- /tool/auth
- /tool/retry-once
This avoids public echo services and gives stable auth/retry assertions.
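Before running tool-backed gates, it can help to confirm the stub is listening. The sketch below prints the HTTP status for each path it is given. The runbook does not say which HTTP method or headers the stub expects, so this uses a plain GET and only reports the status code: any response other than 000 shows the port is answering, even if the stub rejects the request. The name probe_stub is hypothetical.

```shell
# probe_stub BASE_URL PATH...   (hypothetical helper)
# Prints "<path> <http status>" per path; 000 means no HTTP response at all.
probe_stub() {
  base=$1
  shift
  for p in "$@"; do
    # -o /dev/null drops the body, -w prints only the status code.
    code=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$base$p" || true)
    echo "$p ${code:-000}"
  done
}
```

Example: probe_stub http://127.0.0.1:18080 /tool/smoke /tool/decision. If /tool/retry-once keeps per-request state, as its name suggests, probing it by hand could change what the next gate observes, so it is left out of the example.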
Artifact Convention
Every gate captures artifacts under:
testing/artifacts/real/<namespace>/<task>/<timestamp>/
Files:
- task.json
- messages.json
- metrics.json
- memory-<name>.json for memory-backed scenarios
- verdict.txt
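A capture directory can be checked for completeness with a few lines of shell. This is a sketch with a hypothetical helper name, check_artifacts. It checks only the four files listed for every gate and deliberately skips memory-<name>.json, which applies to memory-backed scenarios only.

```shell
# check_artifacts DIR   (hypothetical helper)
# Returns 0 if DIR holds every file listed for all gates; otherwise names
# the missing files on stderr and returns 1.
check_artifacts() {
  dir=$1
  missing=0
  for f in task.json messages.json metrics.json verdict.txt; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Pass it one timestamped directory, that is, a concrete path of the form testing/artifacts/real/<namespace>/<task>/<timestamp>/.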
UI Validation Checklist
After a gate passes, inspect /ui/ and confirm:
- task trace is readable and includes the expected step sequence
- tool calls are visible for tool-backed scenarios
- memory entries are visible on the Memory detail page
- deny/failure paths are understandable without reading source code
Troubleshooting
- secret placeholder detected: replace replace-me in the scenario secret.
- tool container cannot reach stub: start the worker with --tool-container-network=bridge and keep the stub on port 18080.
- webhook has not created a task yet: check the signing secret and confirm the delivery returned HTTP 202.
- schedule has not created a task yet: give the minute-level schedule up to 120 seconds and confirm orlojd is reconciling schedules.