Real Tool Validation (Model Decision Gate)
Use this runbook to validate model-selected tool usage in an Anthropic-backed A/B scenario.
Goal
- Task A should use a tool.
- Task B should not use a tool.
Scenario path:
testing/scenarios-real/05-tool-decision
Before You Begin
- Add a valid Anthropic key to:
testing/scenarios-real/05-tool-decision/secret.yaml
- Ensure Docker is available for containerized tools.
- Ensure the API server is reachable at
http://localhost:8080 (or override API_BASE).
- Start the local deterministic stub tool service:
make real-tool-stub
Runtime Startup
Terminal 1 (server):
go run ./cmd/orlojd --task-execution-mode=message-driven --agent-message-bus-backend=memory
Terminal 2 (worker with Anthropic + containerized tools):
go run ./cmd/orlojworker \
--task-execution-mode=message-driven \
--agent-message-bus-backend=memory \
--agent-message-consume \
--model-gateway-provider=anthropic \
--tool-isolation-backend=container \
--tool-container-network=bridge
Apply Scenario
make real-apply-tool-decision
This applies:
- ModelEndpoint pinned to claude-sonnet-4-20250514
- one HTTP tool with spec.runtime.isolation_mode=container
- one decision agent
- two tasks:
  - rr-tool-use-task
  - rr-tool-no-use-task
The tool points at the local stub service via http://host.docker.internal:18080/tool/decision.
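Because the tool runs in a container and reaches the stub through host.docker.internal, it can help to confirm from the host that both services accept connections before running the gate. The sketch below is only an illustration: the helper name port_is_open is hypothetical, it assumes the stub is published on host port 18080 (as the URL above suggests) and that the API server uses the default localhost:8080, and it checks TCP reachability only, not that either service answers HTTP correctly.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Ports come from this runbook; using "localhost" for the stub is an
# assumption about how the stub is exposed on the host.
CHECKS = {
    "API server": ("localhost", 8080),
    "stub tool service": ("localhost", 18080),
}

for name, (host, port) in CHECKS.items():
    state = "reachable" if port_is_open(host, port) else "NOT reachable"
    print(f"{name} at {host}:{port}: {state}")
```

If API_BASE is overridden, the host and port in CHECKS would need to change to match.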
Run Gate
make real-gate-tool-decision
Pass/Fail Criteria
rr-tool-use-task
Pass requires all:
- task phase is Succeeded
- status.output["agent.1.tool_calls"] >= 1
- at least one tool_call event in status.trace[]
- status.output["agent.1.last_event"] contains TOOL_USED: yes and EVIDENCE:
rr-tool-no-use-task
Pass requires all:
- task phase is Succeeded
- status.output["agent.1.tool_calls"] == 0
- zero tool_call events in status.trace[]
- status.output["agent.1.last_event"] contains TOOL_USED: no and EVIDENCE: self-contained-input
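The two criteria sets above can be expressed as a single check over a task's status. This is a minimal sketch, not the gate's actual implementation: the function names are hypothetical, and it assumes the status is available as a dictionary with phase, output, and trace keys, that each trace entry names its event under a "type" key, and that the tool-call counter may arrive as a string. The sample statuses are hand-written for illustration and are not real output.

```python
from typing import Any


def count_tool_call_events(trace: list[dict[str, Any]]) -> int:
    # Assumption: each trace entry carries its event name under "type".
    return sum(1 for event in trace if event.get("type") == "tool_call")


def check_task(status: dict[str, Any], expect_tool: bool) -> list[str]:
    """Return the failed criteria; an empty list means the task passes."""
    failures: list[str] = []
    output = status.get("output", {})
    # The counter may be serialized as a string, so coerce before comparing.
    # A missing counter is treated as 0 in this sketch.
    tool_calls = int(output.get("agent.1.tool_calls", 0))
    trace_calls = count_tool_call_events(status.get("trace", []))
    last_event = str(output.get("agent.1.last_event", ""))

    if status.get("phase") != "Succeeded":
        failures.append("phase is not Succeeded")

    if expect_tool:
        if tool_calls < 1:
            failures.append("agent.1.tool_calls is below 1")
        if trace_calls < 1:
            failures.append("no tool_call event in trace")
        if "TOOL_USED: yes" not in last_event or "EVIDENCE:" not in last_event:
            failures.append("last_event lacks TOOL_USED: yes / EVIDENCE:")
    else:
        if tool_calls != 0:
            failures.append("agent.1.tool_calls is not 0")
        if trace_calls != 0:
            failures.append("tool_call event found in trace")
        if ("TOOL_USED: no" not in last_event
                or "EVIDENCE: self-contained-input" not in last_event):
            failures.append(
                "last_event lacks TOOL_USED: no / EVIDENCE: self-contained-input"
            )
    return failures


# Hand-written sample statuses (illustrative only).
sample_use = {
    "phase": "Succeeded",
    "output": {
        "agent.1.tool_calls": "1",
        "agent.1.last_event": "TOOL_USED: yes\nEVIDENCE: stub response",
    },
    "trace": [{"type": "tool_call"}],
}
sample_no_use = {
    "phase": "Succeeded",
    "output": {
        "agent.1.tool_calls": "0",
        "agent.1.last_event": "TOOL_USED: no\nEVIDENCE: self-contained-input",
    },
    "trace": [],
}

print("rr-tool-use-task failures:", check_task(sample_use, expect_tool=True))
print("rr-tool-no-use-task failures:", check_task(sample_no_use, expect_tool=False))
```

Returning the list of failed criteria, rather than a single boolean, makes it easier to see which condition a flaky run violated.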
Reliability Target
Pass make real-gate-tool-decision five consecutive times.
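One way to automate the five-run target is a small wrapper that stops at the first failure. This is a sketch under one assumption the runbook does not state: that the gate target exits with a non-zero status when any criterion fails. The helper name run_consecutive is hypothetical.

```python
import subprocess
from typing import Sequence


def run_consecutive(command: Sequence[str], required_passes: int = 5) -> bool:
    """Run command repeatedly; return True only if every run exits with 0."""
    for attempt in range(1, required_passes + 1):
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"run {attempt}/{required_passes} failed "
                  f"(exit code {result.returncode})")
            return False
        print(f"run {attempt}/{required_passes} passed")
    return True


# Example call for this runbook (not executed here):
#   run_consecutive(["make", "real-gate-tool-decision"], required_passes=5)
```

Stopping at the first failure keeps the streak honest: a failed run means the count starts over.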
Manual Inspection
make real-check NS=rr-real-tool-decision TASK=rr-tool-use-task
make real-check NS=rr-real-tool-decision TASK=rr-tool-no-use-task