# Run Your First Agent Evaluation
This guide walks you through creating a golden dataset, running an evaluation against an agent system, and comparing results across different configurations. By the end, you will have a repeatable evaluation workflow for measuring agent quality.
## Prerequisites
- Orloj server (`orlojd`) running (sequential mode with `--embedded-worker` is fine for this guide)
- `orlojctl` available (or `go run ./cmd/orlojctl`)
- At least one AgentSystem and ModelEndpoint already applied
If you have not set up Orloj yet, follow the Install and Quickstart guides first.
## What You Will Build
A golden dataset that tests a support triage agent, two eval runs (one per model), and a comparison showing which model performs better.
```
Dataset ──► EvalRun (gpt-4o) ──► ┐
                                 ├──► Compare
Dataset ──► EvalRun (claude) ──► ┘
```

## Step 1: Define a Golden Dataset
A dataset contains sample inputs and expected outputs. Create `triage-dataset.yaml`:
```yaml
apiVersion: orloj.dev/v1
kind: EvalDataset
metadata:
  name: triage-golden
spec:
  description: "Golden set for the support triage agent"
  samples:
    - name: billing-question
      input:
        prompt: "I was charged twice for my subscription last month"
      expected:
        output_contains: "billing"
    - name: password-reset
      input:
        prompt: "I can't log in, I forgot my password"
      expected:
        output_contains: "password"
        output_not_contains: "billing"
    - name: refund-request
      input:
        prompt: "I want a refund for order #12345, the item arrived broken"
      expected:
        output_contains: "refund"
    - name: feature-request
      input:
        prompt: "It would be great if you could add dark mode"
      expected:
        output_contains: "feature"
```

Apply it:

```bash
orlojctl apply -f triage-dataset.yaml
```

Verify:

```bash
orlojctl eval datasets
```

```
NAME            NAMESPACE   SAMPLES   DESCRIPTION
triage-golden   default     4         Golden set for the support triage agent
```

## Step 2: Run an Evaluation with exact_match
The simplest scoring strategy is `exact_match`, which checks expected output fields against the agent's actual output. Start a run:
```bash
orlojctl eval run \
  --dataset triage-golden \
  --system support-triage-system
```

The CLI creates an EvalRun resource and polls until completion. You will see output like:

```
EvalRun triage-golden-run-a1b2c3 created (Pending)
Phase: Running (2/4 samples completed)
Phase: Scoring...
EvalRun triage-golden-run-a1b2c3: Succeeded

Pass Rate: 75.0%   Mean Score: 0.750   Tokens: 2340   Latency: 1.8s

SAMPLE             PASS   SCORE   LATENCY
billing-question   ✓      1.000   1.2s
password-reset     ✓      1.000   1.5s
refund-request     ✓      1.000   2.1s
feature-request    ✗      0.000   2.4s
```

You can also create runs declaratively:
```yaml
apiVersion: orloj.dev/v1
kind: EvalRun
metadata:
  name: triage-eval-exact
spec:
  dataset_ref: triage-golden
  system: support-triage-system
  scoring:
    strategy: exact_match
  concurrency: 2
  timeout: 60s
```

```bash
orlojctl apply -f eval-run.yaml        # creates the run in suspended state
orlojctl eval start triage-eval-exact  # start it when ready

# or apply and start immediately:
orlojctl apply -f eval-run.yaml --run
```

## Step 3: Run with LLM-as-Judge
For subjective quality assessment, use `llm_judge`. The judge model evaluates each sample against a rubric:
```bash
orlojctl eval run \
  --dataset triage-golden \
  --system support-triage-system \
  --scoring llm_judge \
  --model-ref gpt-4o-judge \
  --rubric "Rate accuracy and helpfulness of the triage classification (0-1)."
```

Or declaratively:
```yaml
apiVersion: orloj.dev/v1
kind: EvalRun
metadata:
  name: triage-eval-llm
spec:
  dataset_ref: triage-golden
  system: support-triage-system
  scoring:
    strategy: llm_judge
    model_ref: gpt-4o-judge
    rubric: "Rate accuracy and helpfulness of the triage classification (0-1)."
  concurrency: 4
  timeout: 120s
```

The judge model must be configured as a ModelEndpoint like any other model.
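Conceptually, an `llm_judge` run asks the judge model to grade each sample's output against the rubric and then extracts a numeric score from the reply. The sketch below shows that loop in plain Python; `call_judge`, the prompt layout, and the reply format are illustrative assumptions, not Orloj's actual internals:

```python
import re

def build_judge_prompt(rubric: str, prompt: str, output: str) -> str:
    """Assemble one judging prompt from the rubric and a single sample."""
    return (
        f"{rubric}\n\n"
        f"User input: {prompt}\n"
        f"Agent output: {output}\n"
        "Reply with a single score between 0 and 1."
    )

def parse_score(reply: str) -> float:
    """Pull the first number in [0, 1] out of the judge's free-text reply."""
    match = re.search(r"\b(?:0(?:\.\d+)?|1(?:\.0+)?)\b", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return float(match.group(0))

def judge_sample(rubric: str, sample: dict, call_judge) -> float:
    """Score one sample; call_judge is any callable mapping prompt text -> reply text."""
    reply = call_judge(build_judge_prompt(rubric, sample["prompt"], sample["output"]))
    return parse_score(reply)

# Demo with a canned reply standing in for a real model call:
score = judge_sample(
    "Rate accuracy and helpfulness of the triage classification (0-1).",
    {"prompt": "I was charged twice last month", "output": "Routing you to billing."},
    lambda _prompt: "Score: 0.9 -- correct routing, concise answer.",
)
print(score)  # 0.9
```

A real judge integration would swap the lambda for a ModelEndpoint call, but the prompt-then-parse shape stays the same.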
## Step 4: A/B Test with Agent Overrides
Compare two models without modifying your agents. Use `agent_overrides` to swap the model for a specific agent:
```yaml
apiVersion: orloj.dev/v1
kind: EvalRun
metadata:
  name: triage-eval-claude
spec:
  dataset_ref: triage-golden
  system: support-triage-system
  scoring:
    strategy: llm_judge
    model_ref: gpt-4o-judge
    rubric: "Rate accuracy and helpfulness of the triage classification (0-1)."
  concurrency: 4
  agent_overrides:
    triage-agent:
      model_ref: claude-sonnet
```

Apply and start:

```bash
orlojctl apply -f eval-claude.yaml --run
orlojctl eval get triage-eval-claude -w
```

## Step 5: Compare Runs
Compare any number of completed runs side-by-side:
```bash
orlojctl eval compare triage-eval-llm triage-eval-claude
```

```
METRIC         triage-eval-llm   triage-eval-claude
Pass Rate      100.0%            75.0%
Mean Score     0.920             0.850
Tokens         4200              3800
Mean Latency   2.1s              1.8s
```

The comparison also returns per-sample deltas, accessible via the REST API:
```bash
curl "http://localhost:8080/v1/eval-runs/compare?names=triage-eval-llm,triage-eval-claude"
```

## Step 6: Manual Review (Optional)
For subjective tasks where automated scoring is not appropriate, use `manual` scoring:
```bash
orlojctl eval run \
  --dataset triage-golden \
  --system support-triage-system \
  --scoring manual
```

Once the tasks complete, the run transitions to `PendingReview`. Export the results:
```bash
orlojctl eval export triage-golden-run-xyz --format csv > review.csv
```

The CSV contains the columns `sample_name`, `input`, `output`, `score`, `pass`, and `reasoning`. Fill in the `score`, `pass`, and `reasoning` columns, then import:
```bash
orlojctl eval import triage-golden-run-xyz -f review.csv
```

Finalize to compute aggregate metrics:
```bash
orlojctl eval finalize triage-golden-run-xyz
```

## Step 7: Inspect Results
Get full details for any run:
```bash
orlojctl eval get triage-eval-llm -o yaml
```

List all runs:
```bash
orlojctl eval list
```

```
NAME                 DATASET         SYSTEM                  PHASE       PASS RATE   SAMPLES
triage-eval-exact    triage-golden   support-triage-system   Succeeded   75.0%       4
triage-eval-llm      triage-golden   support-triage-system   Succeeded   100.0%      4
triage-eval-claude   triage-golden   support-triage-system   Succeeded   75.0%       4
```

## Tips
- Start small: begin with `exact_match` on 5–10 samples to validate the workflow before scaling up.
- Use concurrency wisely: higher concurrency runs faster but consumes more model quota. Start with 2–4.
- Version your datasets: use descriptive names like `triage-golden-v2` to track dataset evolution.
- Automate with TaskSchedule: create a TaskSchedule or webhook to run evaluations on every deployment.
- Export for dashboards: pipe `orlojctl eval export` output to your observability stack for trend tracking.
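As a sketch of the dashboard tip: once a run is exported to CSV with the columns described in Step 6 (`sample_name`, `input`, `output`, `score`, `pass`, `reasoning`), aggregate metrics are easy to recompute in your own tooling. This uses only the Python standard library; the `true`/`false` encoding of the `pass` column is an assumption about the export format, not a documented guarantee:

```python
import csv
import io

def summarize(csv_text: str) -> dict:
    """Recompute pass rate and mean score from an exported eval CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    passed = sum(1 for r in rows if r["pass"].strip().lower() == "true")
    mean_score = sum(float(r["score"]) for r in rows) / len(rows)
    return {
        "samples": len(rows),
        "pass_rate": passed / len(rows),
        "mean_score": mean_score,
    }

# Example with two hand-written rows mimicking an export:
export = """sample_name,input,output,score,pass,reasoning
billing-question,"charged twice","Routing to billing.",1.0,true,matched
feature-request,"add dark mode","Logged a bug.",0.0,false,missed
"""
print(summarize(export))  # {'samples': 2, 'pass_rate': 0.5, 'mean_score': 0.5}
```

The same function works on `review.csv` from the manual-review flow, which is handy for sanity-checking scores before you run `orlojctl eval finalize`.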
## Related
- Agent Evaluation (concept) -- framework overview
- EvalDataset reference -- full spec documentation
- EvalRun reference -- full spec documentation
- CLI Reference: `orlojctl eval` -- all eval subcommands