Are you an LLM? Read llms.txt for a summary of the docs, or llms-full.txt for the full context.
Skip to content

Real Tool Validation (Model Decision Gate)

Use this runbook to validate model-selected tool usage in an Anthropic-backed A/B scenario.

Goal

  • Task A should use a tool.
  • Task B should not use a tool.

Scenario path:

  • testing/scenarios-real/05-tool-decision

Before You Begin

  1. Add a valid Anthropic key to:
  • testing/scenarios-real/05-tool-decision/secret.yaml
  1. Ensure Docker is available for containerized tools.
  2. Ensure API server is reachable at http://localhost:8080 (or override API_BASE).
  3. Start the local deterministic stub tool service:
make real-tool-stub

Runtime Startup

Terminal 1 (server):

go run ./cmd/orlojd --task-execution-mode=message-driven --agent-message-bus-backend=memory

Terminal 2 (worker with Anthropic + containerized tools):

go run ./cmd/orlojworker \
  --task-execution-mode=message-driven \
  --agent-message-bus-backend=memory \
  --agent-message-consume \
  --model-gateway-provider=anthropic \
  --tool-isolation-backend=container \
  --tool-container-network=bridge

Apply Scenario

make real-apply-tool-decision

This applies:

  • ModelEndpoint pinned to claude-sonnet-4-20250514
  • one HTTP tool with spec.runtime.isolation_mode=container
  • one decision agent
  • two tasks:
    • rr-tool-use-task
    • rr-tool-no-use-task

The tool points at the local stub service via http://host.docker.internal:18080/tool/decision.

Run Gate

make real-gate-tool-decision

Pass/Fail Criteria

rr-tool-use-task

Pass requires all:

  • task phase is Succeeded
  • status.output["agent.1.tool_calls"] >= 1
  • at least one tool_call event in status.trace[]
  • status.output["agent.1.last_event"] contains TOOL_USED: yes and EVIDENCE:

rr-tool-no-use-task

Pass requires all:

  • task phase is Succeeded
  • status.output["agent.1.tool_calls"] == 0
  • zero tool_call events in status.trace[]
  • status.output["agent.1.last_event"] contains TOOL_USED: no and EVIDENCE: self-contained-input

Reliability Target

Pass make real-gate-tool-decision five consecutive times.

Manual Inspection

make real-check NS=rr-real-tool-decision TASK=rr-tool-use-task
make real-check NS=rr-real-tool-decision TASK=rr-tool-no-use-task