EvalRun
Stability: beta -- This resource kind ships with `orloj.dev/v1` and is suitable for production use, but its schema may evolve with migration guidance in future minor releases.
An EvalRun executes all samples in an EvalDataset against an AgentSystem, scores the results, and produces aggregate metrics. Runs track per-sample scores, latency, token usage, and pass/fail verdicts.
Manifest
```yaml
apiVersion: orloj.dev/v1
kind: EvalRun
metadata:
  name: triage-gpt4o-20260510
  namespace: default
spec:
  dataset_ref: support-triage-golden
  system: support-triage-system
  scoring:
    strategy: llm_judge
    model_ref: gpt-4o-judge
    rubric: "Rate the accuracy and helpfulness of the triage response."
  concurrency: 5
  timeout: 120s
  agent_overrides:
    triage-agent:
      prompt: "You are a support triage bot. Classify and route."
      model_ref: gpt-4o-mini
```
spec
| Field | Type | Description |
|---|---|---|
| dataset_ref | string, required | Name of the EvalDataset to evaluate against. |
| system | string, required | Name of the AgentSystem to evaluate. |
| scoring | EvalScoringConfig | Default scoring strategy for all samples. Per-sample overrides in the dataset take precedence. |
| concurrency | int | Maximum parallel tasks (samples) to execute. Defaults to 5. Minimum 1. |
| timeout | duration string | Per-sample task timeout (e.g. 120s, 5m). Must be a valid Go time.Duration. |
| agent_overrides | map[string]AgentOverride | Ephemeral overrides for agent configuration within this run. Keys are agent names. |
| labels | map[string]string | Arbitrary labels for filtering and comparison. |
| suspended | bool | When true, the controller will not execute this run. Defaults to true when created via apply (use --run to override or orlojctl eval start to trigger later). Defaults to false when created via orlojctl eval run. |
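For example, labels and suspended can be set alongside the fields shown in the manifest above; the label keys and values below are illustrative, not a fixed schema:

```yaml
spec:
  dataset_ref: support-triage-golden
  system: support-triage-system
  labels:
    model: gpt-4o          # illustrative label keys for filtering and comparison
    experiment: prompt-v2
  suspended: true          # controller will not execute until the run is started
```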
AgentOverride
Used for A/B testing models, prompts, or parameters without modifying the base agent resource. The map key is the name of the agent to override.
| Field | Type | Description |
|---|---|---|
| prompt | string | Override the agent's system prompt. |
| model_ref | string | Override the agent's model endpoint. |
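As a sketch, an A/B comparison can pin two runs to the same dataset and system and vary only the override for the triage-agent; the alternate model name and prompt below are assumptions for illustration:

```yaml
# Variant run: same dataset and system as the baseline, different triage-agent config.
spec:
  dataset_ref: support-triage-golden
  system: support-triage-system
  agent_overrides:
    triage-agent:
      model_ref: gpt-4o-mini        # assumed alternate model endpoint for this run only
      prompt: "You are a support triage bot. Reply with category and route only."
```

Because overrides are ephemeral, the base agent resource is untouched, and the baseline and variant runs can then be compared with `orlojctl eval compare`.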
EvalScoringConfig
See EvalDataset for the full field list.
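The manifest above shows an `llm_judge` configuration; per the validation rules below, a `custom` strategy instead references a scoring tool via `tool_ref`. A minimal sketch, with a hypothetical tool name:

```yaml
scoring:
  strategy: custom
  tool_ref: triage-accuracy-scorer   # hypothetical scorer reference; custom scoring requires tool_ref
```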
Defaults and Validation
- `apiVersion` defaults to `orloj.dev/v1`.
- `kind` defaults to `EvalRun`.
- `metadata.namespace` defaults to `default`.
- `status.phase` defaults to `Pending`.
- `spec.suspended` defaults to `true` when created via `POST /v1/eval-runs` (unless `?run=true` is set). `orlojctl eval run` sets `?run=true` automatically.
- `spec.concurrency` defaults to 5; must be >= 1.
- `spec.dataset_ref` and `spec.system` are required.
- `spec.timeout` must be a valid Go duration string when set.
- `llm_judge` scoring requires `model_ref`; `custom` scoring requires `tool_ref`.
- Agent override names must be unique.
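As a sketch, a run can be created and started in one request by passing `?run=true`; the host/port are placeholders, and the JSON body shape (mirroring the manifest's metadata and spec) is an assumption:

```bash
# Create and immediately start a run (equivalent to `orlojctl eval run`).
curl -X POST "http://localhost:8080/v1/eval-runs?run=true" \
  -H "Content-Type: application/json" \
  -d '{
        "metadata": {"name": "triage-gpt4o-20260510"},
        "spec": {
          "dataset_ref": "support-triage-golden",
          "system": "support-triage-system"
        }
      }'
```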
status
| Field | Type | Description |
|---|---|---|
| phase | string | Current lifecycle phase (see below). |
| message | string | Human-readable status message. |
| results | []EvalSampleResult | Per-sample results. |
| summary | EvalSummary | Aggregate metrics computed after scoring. |
| total_samples | int | Total number of samples in the dataset. |
| completed_samples | int | Number of samples with completed tasks. |
| scored_samples | int | Number of samples that have been scored. |
| errored_samples | int | Number of samples that encountered errors. |
EvalSampleResult
| Field | Type | Description |
|---|---|---|
| sample_name | string | Name matching the dataset sample. |
| task_name | string | Name of the task created for this sample. |
| output | string | Raw output from the agent system. |
| score | *float64 | Numeric score (0.0–1.0). Nil for unscored/manual. |
| pass | *bool | Pass/fail verdict. Nil for unscored. |
| reasoning | string | Explanation from the scorer (e.g. LLM judge reasoning). |
| latency_ms | int64 | Execution time in milliseconds. |
| tokens_used | int | Total tokens consumed by the task. |
| error | string | Non-empty if the sample errored. |
EvalSummary
| Field | Type | Description |
|---|---|---|
| pass_rate | float64 | Fraction of scored samples that passed (0.0–1.0). |
| mean_score | float64 | Average score across all scored samples. |
| total_tokens | int | Sum of tokens across all samples. |
| mean_latency_ms | int64 | Average latency across all completed samples. |
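Putting these fields together, the status of a completed run might look like the following; all values and sample names are illustrative, and the task naming convention shown is an assumption:

```yaml
status:
  phase: Succeeded
  message: "Scoring complete."
  total_samples: 50
  completed_samples: 50
  scored_samples: 50
  errored_samples: 0
  results:
    - sample_name: refund-request-01
      task_name: triage-gpt4o-20260510-refund-request-01   # naming convention is an assumption
      output: "Category: billing. Routed to Tier 2."
      score: 0.9
      pass: true
      reasoning: "Correct category and routing."
      latency_ms: 2140
      tokens_used: 812
  summary:
    pass_rate: 0.92
    mean_score: 0.87
    total_tokens: 41230
    mean_latency_ms: 2310
```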
Lifecycle Phases
```
Pending ──► Running ──► Scoring ──► Succeeded
                           └──► PendingReview ──► Succeeded
               └──► Failed
               └──► Cancelled
```

| Phase | Meaning |
|---|---|
| Pending | Created, waiting for the controller. If spec.suspended is true, the controller skips the run until it is started. |
| Running | Tasks are being created and executed (up to concurrency in parallel). |
| Scoring | All tasks completed; scoring pipeline is evaluating results. |
| PendingReview | Manual scoring; awaiting human annotations before finalization. |
| Succeeded | Scoring complete; status.summary is final. |
| Failed | Fatal error during the run (e.g. missing dataset or system). |
| Cancelled | User-initiated cancellation. In-flight tasks are also cancelled. |
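For automation (e.g. in CI), the phase can be polled until the run reaches a terminal state. This sketch assumes the GET endpoint returns the resource as JSON with a `status.phase` field and that jq is available; the host is a placeholder:

```bash
# Poll the run until it leaves its active phases.
RUN=triage-gpt4o-20260510
while true; do
  PHASE=$(curl -s "http://localhost:8080/v1/eval-runs/$RUN" | jq -r '.status.phase')
  echo "phase: $PHASE"
  case "$PHASE" in
    Succeeded|Failed|Cancelled|PendingReview) break ;;
  esac
  sleep 10
done
```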
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /v1/eval-runs | List all runs. |
| POST | /v1/eval-runs | Create a new eval run. |
| GET | /v1/eval-runs/{name} | Get a run by name. |
| PUT | /v1/eval-runs/{name} | Update a run. |
| DELETE | /v1/eval-runs/{name} | Delete a run. |
| GET | /v1/eval-runs/{name}/export | Export results as JSON (or CSV with ?format=csv). |
| PUT | /v1/eval-runs/{name}/results/{sample} | Annotate a single sample (manual review). |
| POST | /v1/eval-runs/{name}/results | Bulk import sample annotations. |
| POST | /v1/eval-runs/{name}/start | Start a suspended eval run. |
| POST | /v1/eval-runs/{name}/finalize | Finalize a PendingReview run (computes summary). |
| POST | /v1/eval-runs/{name}/cancel | Cancel a running eval. |
| GET | /v1/eval-runs/compare?names=a,b,c | Compare multiple runs side-by-side. |
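A typical manual-review pass uses the export, annotate, and finalize endpoints. The host is a placeholder, and the annotation body shape (mirroring EvalSampleResult fields) is an assumption:

```bash
# Export results for human review.
curl -s "http://localhost:8080/v1/eval-runs/triage-gpt4o-20260510/export?format=csv" -o review.csv

# Annotate one sample (field names mirror EvalSampleResult; exact body shape is an assumption).
curl -X PUT "http://localhost:8080/v1/eval-runs/triage-gpt4o-20260510/results/refund-request-01" \
  -H "Content-Type: application/json" \
  -d '{"score": 0.9, "pass": true, "reasoning": "Correct routing."}'

# Finalize the PendingReview run so status.summary is computed.
curl -X POST "http://localhost:8080/v1/eval-runs/triage-gpt4o-20260510/finalize"
```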
CLI Quick Reference
```bash
orlojctl eval run --dataset golden --system my-system   # Create and start a run
orlojctl eval start my-run                              # Start a suspended run
orlojctl eval list                                      # List all runs
orlojctl eval get my-run                                # Get run detail
orlojctl eval export my-run --format csv                # Export for review
orlojctl eval annotate my-run --sample s1 --score 0.9   # Annotate sample
orlojctl eval import my-run -f reviewed.csv             # Bulk import
orlojctl eval finalize my-run                           # Finalize manual run
orlojctl eval compare run-a run-b                       # Compare runs
orlojctl eval datasets                                  # List datasets
orlojctl apply -f eval-run.yaml                         # Apply (suspended by default)
orlojctl apply -f eval-run.yaml --run                   # Apply and start immediately
```
Related
- EvalDataset -- define golden test data
- Agent Evaluation (concept) -- overview and workflow
- Guide: Run Your First Agent Evaluation
- AgentSystem -- the systems being evaluated