Agent Evaluation
Orloj includes a built-in evaluation framework for measuring and comparing agent system quality. Define golden datasets as declarative YAML, run them against your agent systems, score the results with multiple strategies (programmatic, LLM-as-judge, or human review), and compare runs side-by-side.
Why Evaluate?
Model changes, prompt edits, tool updates, and graph topology changes can all silently degrade agent behavior. An evaluation framework lets you:
- Detect regressions before they reach production by running golden datasets against every change.
- Compare configurations (models, prompts, topologies) with objective metrics.
- Involve humans in scoring subjective output quality via an export/review/import workflow.
- Track quality over time with pass rates, mean scores, and latency trends.
How It Works
The evaluation framework introduces two resource kinds:
| Resource | Purpose |
|---|---|
| EvalDataset | A list of (input, expected output) pairs with optional per-sample scoring rubrics. |
| EvalRun | A single evaluation execution: run a dataset against an agent system and collect scores. |
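For orientation, here is a rough sketch of what a small EvalDataset might look like. The top-level layout and field names below (samples, input, expected, scoring) are illustrative assumptions, not the confirmed schema; see the EvalDataset reference for the exact spec.

```yaml
# Illustrative sketch only -- structure and field names are assumptions.
kind: EvalDataset
metadata:
  name: golden
spec:
  samples:
    - name: billing-q
      # Hypothetical input; use real prompts from your domain.
      input: "I was charged twice for my subscription this month."
      expected:
        output_contains: "billing"
      scoring:
        strategy: exact_match
```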
The workflow is:
1. Define a dataset ──► orlojctl apply -f dataset.yaml
2. Start an eval run ──► orlojctl eval run --dataset golden --system my-system
3. Controller creates ──► one Task per sample, respecting concurrency limits
4. Workers execute tasks ──► normal agent system execution
5. Scoring pipeline runs ──► exact_match, llm_judge, manual, or custom
6. Results aggregated ──► pass rate, mean score, latency, tokens
7. Compare runs ──► orlojctl eval compare run-a run-b
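Condensed into commands (all shown in the steps above), a typical cycle looks like this; the run names passed to compare are placeholders for whatever your runs end up being called:

```bash
# Register the golden dataset, kick off an eval run, then compare two finished runs.
orlojctl apply -f dataset.yaml
orlojctl eval run --dataset golden --system my-system
orlojctl eval compare run-a run-b
```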
Scoring Strategies
Each sample can be scored with one of four strategies:
exact_match
Compares agent output against expected fields using the same matching logic as graph edge conditions: output_contains, output_matches (regex), and output_json_path with comparison operators. The result is a binary 0/1 score.
```yaml
expected:
  output_contains: "billing"
  output_json_path: "$.category"
  equals: "billing"
```

llm_judge
Sends the input, agent output, and a rubric to a judge model. The judge returns a score (0.0–1.0), pass/fail verdict, and reasoning. This is a single model call per sample, not a full agent execution.
```yaml
scoring:
  strategy: llm_judge
  model_ref: gpt-4o-judge
  rubric: "The response should correctly identify the customer's intent."
```

manual
Tasks execute and outputs are collected, but no automated scoring occurs. The run transitions to PendingReview and results can be exported for human review (CSV or JSON), annotated via CLI or API, and then finalized.
```yaml
scoring:
  strategy: manual
```

custom
Invokes an external Tool with the sample input, expected output, and actual output. The tool returns a JSON score object. This lets you plug in any custom scoring logic.
scoring:
strategy: custom
tool_ref: my-custom-scorerEvalRun Lifecycle
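The exact shape of the returned score object is defined by the scoring pipeline; as a rough sketch, assuming it carries a score, a pass/fail flag, and a reason (field names not confirmed here), a custom scorer might return:

```json
{
  "score": 0.8,
  "pass": true,
  "reason": "Output identifies the billing category and proposes a correct fix."
}
```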
EvalRun Lifecycle

```
Pending ──► Running ──► Scoring ──► Succeeded
            └──► PendingReview ──► Succeeded (after finalize)
            └──► Failed
            └──► Cancelled
```

| Phase | Meaning |
|---|---|
| Pending | Run is created, waiting for the controller to start task creation. |
| Running | Tasks are being created and executed. |
| Scoring | All tasks completed; the scoring pipeline is running. |
| PendingReview | Manual scoring: waiting for human annotations. |
| Succeeded | All scoring complete; summary metrics computed. |
| Failed | The eval run encountered a fatal error (e.g., missing dataset). |
| Cancelled | Cancelled by user; in-flight tasks are also cancelled. |
Manual Review Workflow
When using manual scoring (or a mix where some samples use it):
- Tasks execute normally and outputs are collected.
- The run enters PendingReview once all tasks complete.
- Export results for review: `orlojctl eval export my-run --format csv > results.csv`
- Annotate individual samples: `orlojctl eval annotate my-run --sample billing-q --score 0.8 --pass --comment "Good"`
- Bulk import from a reviewed CSV (a sketch of the file layout follows this list): `orlojctl eval import my-run -f reviewed.csv`
- Finalize to compute aggregates and transition to Succeeded: `orlojctl eval finalize my-run`
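The exported file layout is determined by the export command; as an illustrative sketch, assuming columns for the sample name, collected output, score, pass verdict, and reviewer comment, a reviewed CSV might look like:

```csv
sample,output,score,pass,comment
billing-q,"Routed the request to the billing queue",0.8,true,Good
```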
Comparing Runs
The comparison API and CLI show side-by-side metrics across multiple runs:
```
orlojctl eval compare run-gpt4o run-claude run-gemini

METRIC       run-gpt4o   run-claude   run-gemini
Pass Rate    88.0%       85.0%        82.0%
Mean Score   0.910       0.890        0.850
Tokens       13100       11800        12200
```

This is the primary tool for A/B testing model changes, prompt revisions, or topology experiments. Use `agent_overrides` in the EvalRun spec to swap prompts or models without modifying the base agent resources.
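As a hedged sketch of such an override (the nesting and field names here are assumptions; see the EvalRun reference for the exact spec), a run that swaps the model on one agent might look like:

```yaml
# Illustrative sketch -- agent name and override fields are assumptions.
kind: EvalRun
metadata:
  name: run-claude
spec:
  dataset_ref: golden
  system_ref: my-system
  agent_overrides:
    classifier:                  # hypothetical agent within my-system
      model_ref: claude-classifier-v2   # hypothetical alternative model resource
```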
Related
- EvalDataset reference -- full spec documentation
- EvalRun reference -- full spec documentation
- Guide: Run Your First Agent Evaluation -- step-by-step tutorial
- CLI: `orlojctl eval` -- command reference