
EvalRun

Stability: beta -- This resource kind ships with orloj.dev/v1 and is suitable for production use, but its schema may evolve with migration guidance in future minor releases.

An EvalRun executes all samples in an EvalDataset against an AgentSystem, scores the results, and produces aggregate metrics. Runs track per-sample scores, latency, token usage, and pass/fail verdicts.

Manifest

```yaml
apiVersion: orloj.dev/v1
kind: EvalRun
metadata:
  name: triage-gpt4o-20260510
  namespace: default
spec:
  dataset_ref: support-triage-golden
  system: support-triage-system
  scoring:
    strategy: llm_judge
    model_ref: gpt-4o-judge
    rubric: "Rate the accuracy and helpfulness of the triage response."
  concurrency: 5
  timeout: 120s
  agent_overrides:
    triage-agent:
      prompt: "You are a support triage bot. Classify and route."
      model_ref: gpt-4o-mini
```

spec

| Field | Type | Description |
|---|---|---|
| `dataset_ref` | string, required | Name of the EvalDataset to evaluate against. |
| `system` | string, required | Name of the AgentSystem to evaluate. |
| `scoring` | EvalScoringConfig | Default scoring strategy for all samples. Per-sample overrides in the dataset take precedence. |
| `concurrency` | int | Maximum parallel tasks (samples) to execute. Defaults to 5. Minimum 1. |
| `timeout` | duration string | Per-sample task timeout (e.g. `120s`, `5m`). Must be a valid Go `time.Duration`. |
| `agent_overrides` | map[string]AgentOverride | Ephemeral overrides for agent configuration within this run. Keys are agent names. |
| `labels` | map[string]string | Arbitrary labels for filtering and comparison. |
| `suspended` | bool | When true, the controller will not execute this run. Defaults to true when created via `apply` (use `--run` to override, or `orlojctl eval start` to trigger later); defaults to false when created via `orlojctl eval run`. |

AgentOverride

Used for A/B testing models, prompts, or parameters without modifying the base agent resource. The map key is the name of the agent to override.

| Field | Type | Description |
|---|---|---|
| `prompt` | string | Override the agent's system prompt. |
| `model_ref` | string | Override the agent's model endpoint. |
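For example, a second run could pin the same dataset and system against a cheaper model by overriding a single agent. A sketch (the run name and prompt text here are illustrative):

```yaml
apiVersion: orloj.dev/v1
kind: EvalRun
metadata:
  name: triage-mini-variant   # hypothetical comparison run
spec:
  dataset_ref: support-triage-golden
  system: support-triage-system
  agent_overrides:
    triage-agent:
      model_ref: gpt-4o-mini                              # swap only the model
      prompt: "You are a support triage bot. Be concise."  # and the prompt
```

Comparing this run against the baseline with `orlojctl eval compare` then isolates the effect of the override.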

EvalScoringConfig

See EvalDataset for the full field list.

Defaults and Validation

  • apiVersion defaults to orloj.dev/v1.
  • kind defaults to EvalRun.
  • metadata.namespace defaults to default.
  • status.phase defaults to Pending.
  • spec.suspended defaults to true when created via POST /v1/eval-runs (unless ?run=true is set). orlojctl eval run sets ?run=true automatically.
  • spec.concurrency defaults to 5; must be >= 1.
  • spec.dataset_ref and spec.system are required.
  • spec.timeout must be a valid Go duration string when set.
  • llm_judge scoring requires model_ref.
  • custom scoring requires tool_ref.
  • Agent override names must be unique.
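To illustrate the `spec.timeout` rule, here is a minimal sketch of Go-style duration validation. It is not the controller's implementation: Go's `time.ParseDuration` also accepts `ns`/`us`/`ms` and fractional values, while this version handles only compound `s`/`m`/`h` strings such as `1m30s`.

```python
import re

# Seconds per supported unit; a deliberate subset of Go's duration units.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_TOKEN = re.compile(r"(\d+)([smh])")

def parse_duration(value: str) -> int:
    """Return total seconds, or raise ValueError for an invalid string."""
    tokens = _TOKEN.findall(value)
    # Reject strings with leftover characters (e.g. "5x", "12", "5m ").
    if not tokens or "".join(n + u for n, u in tokens) != value:
        raise ValueError(f"invalid duration: {value!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in tokens)

print(parse_duration("120s"))   # 120
print(parse_duration("5m"))     # 300
print(parse_duration("1m30s"))  # 90
```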

status

| Field | Type | Description |
|---|---|---|
| `phase` | string | Current lifecycle phase (see below). |
| `message` | string | Human-readable status message. |
| `results` | []EvalSampleResult | Per-sample results. |
| `summary` | EvalSummary | Aggregate metrics computed after scoring. |
| `total_samples` | int | Total number of samples in the dataset. |
| `completed_samples` | int | Number of samples with completed tasks. |
| `scored_samples` | int | Number of samples that have been scored. |
| `errored_samples` | int | Number of samples that encountered errors. |

EvalSampleResult

| Field | Type | Description |
|---|---|---|
| `sample_name` | string | Name matching the dataset sample. |
| `task_name` | string | Name of the task created for this sample. |
| `output` | string | Raw output from the agent system. |
| `score` | *float64 | Numeric score (0.0–1.0). Nil for unscored/manual. |
| `pass` | *bool | Pass/fail verdict. Nil for unscored. |
| `reasoning` | string | Explanation from the scorer (e.g. LLM judge reasoning). |
| `latency_ms` | int64 | Execution time in milliseconds. |
| `tokens_used` | int | Total tokens consumed by the task. |
| `error` | string | Non-empty if the sample errored. |

EvalSummary

| Field | Type | Description |
|---|---|---|
| `pass_rate` | float64 | Fraction of scored samples that passed (0.0–1.0). |
| `mean_score` | float64 | Average score across all scored samples. |
| `total_tokens` | int | Sum of tokens across all samples. |
| `mean_latency_ms` | int64 | Average latency across all completed samples. |
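The aggregate fields can be read as follows. This is a sketch of the relationships the descriptions above imply (pass rate and mean score over scored samples only, mean latency over completed samples), not the controller's actual code:

```python
# Field names mirror EvalSampleResult; dicts stand in for the real structs.
def summarize(results: list[dict]) -> dict:
    scored = [r for r in results if r.get("score") is not None]
    completed = [r for r in results if not r.get("error")]
    passed = [r for r in scored if r.get("pass")]
    return {
        "pass_rate": len(passed) / len(scored) if scored else 0.0,
        "mean_score": sum(r["score"] for r in scored) / len(scored) if scored else 0.0,
        "total_tokens": sum(r.get("tokens_used", 0) for r in results),
        "mean_latency_ms": (
            sum(r["latency_ms"] for r in completed) // len(completed) if completed else 0
        ),
    }

results = [
    {"score": 1.0, "pass": True,  "latency_ms": 800,  "tokens_used": 350, "error": ""},
    {"score": 0.5, "pass": False, "latency_ms": 1200, "tokens_used": 410, "error": ""},
    {"score": None, "pass": None, "latency_ms": 0,    "tokens_used": 0,   "error": "timeout"},
]
print(summarize(results))
# {'pass_rate': 0.5, 'mean_score': 0.75, 'total_tokens': 760, 'mean_latency_ms': 1000}
```

Note how the errored sample is excluded from both the score averages and the latency mean, but its (zero) token count still flows into `total_tokens`.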

Lifecycle Phases

```
Pending ──► Running ──► Scoring ──► Succeeded
               │           └──► PendingReview ──► Succeeded
               ├──► Failed
               └──► Cancelled
```
| Phase | Meaning |
|---|---|
| Pending | Created, waiting for the controller. If `spec.suspended` is true, the controller skips the run until it is started. |
| Running | Tasks are being created and executed (up to `concurrency` in parallel). |
| Scoring | All tasks completed; the scoring pipeline is evaluating results. |
| PendingReview | Manual scoring; awaiting human annotations before finalization. |
| Succeeded | Scoring complete; `status.summary` is final. |
| Failed | Fatal error during the run (e.g. missing dataset or system). |
| Cancelled | User-initiated cancellation. In-flight tasks are also cancelled. |
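The diagram can also be expressed as an adjacency map, which is convenient when checking whether a status update is legal. A sketch based only on the transitions shown above; the controller may permit others:

```python
# Lifecycle phases as an adjacency map; terminal phases map to an empty set.
TRANSITIONS: dict[str, set[str]] = {
    "Pending":       {"Running"},
    "Running":       {"Scoring", "Failed", "Cancelled"},
    "Scoring":       {"Succeeded", "PendingReview"},
    "PendingReview": {"Succeeded"},
    "Succeeded":     set(),
    "Failed":        set(),
    "Cancelled":     set(),
}

def can_transition(current: str, target: str) -> bool:
    """True if the diagram allows moving from `current` to `target`."""
    return target in TRANSITIONS.get(current, set())

print(can_transition("Running", "Scoring"))    # True
print(can_transition("Succeeded", "Running"))  # False
```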

API Endpoints

| Method | Path | Description |
|---|---|---|
| GET | `/v1/eval-runs` | List all runs. |
| POST | `/v1/eval-runs` | Create a new eval run. |
| GET | `/v1/eval-runs/{name}` | Get a run by name. |
| PUT | `/v1/eval-runs/{name}` | Update a run. |
| DELETE | `/v1/eval-runs/{name}` | Delete a run. |
| GET | `/v1/eval-runs/{name}/export` | Export results as JSON (or CSV with `?format=csv`). |
| PUT | `/v1/eval-runs/{name}/results/{sample}` | Annotate a single sample (manual review). |
| POST | `/v1/eval-runs/{name}/results` | Bulk import sample annotations. |
| POST | `/v1/eval-runs/{name}/start` | Start a suspended eval run. |
| POST | `/v1/eval-runs/{name}/finalize` | Finalize a PendingReview run (computes summary). |
| POST | `/v1/eval-runs/{name}/cancel` | Cancel a running eval. |
| GET | `/v1/eval-runs/compare?names=a,b,c` | Compare multiple runs side-by-side. |

CLI Quick Reference

```shell
orlojctl eval run --dataset golden --system my-system   # Create and start a run
orlojctl eval start my-run                              # Start a suspended run
orlojctl eval list                                      # List all runs
orlojctl eval get my-run                                # Get run detail
orlojctl eval export my-run --format csv                # Export for review
orlojctl eval annotate my-run --sample s1 --score 0.9   # Annotate a sample
orlojctl eval import my-run -f reviewed.csv             # Bulk import annotations
orlojctl eval finalize my-run                           # Finalize a manual run
orlojctl eval compare run-a run-b                       # Compare runs
orlojctl eval datasets                                  # List datasets
orlojctl apply -f eval-run.yaml                         # Apply (suspended by default)
orlojctl apply -f eval-run.yaml --run                   # Apply and start immediately
```

Related