EvalDataset
Stability: beta -- This resource kind ships with
orloj.dev/v1and is suitable for production use, but its schema may evolve with migration guidance in future minor releases.
An EvalDataset is a declarative list of (input, expected output) pairs, optionally with per-sample scoring rubrics. Datasets are applied like any other resource and referenced by EvalRun resources to drive evaluations.
Manifest
apiVersion: orloj.dev/v1
kind: EvalDataset
metadata:
name: support-triage-golden
namespace: default
spec:
description: "Golden set for the support triage agent"
samples:
- name: billing-question
input:
prompt: "I was charged twice for my subscription"
expected:
output_contains: "billing"
output_json_path: "$.category"
equals: "billing"
- name: refund-request
input:
prompt: "I want a refund for order #12345"
expected:
output_contains: "refund"
scoring:
strategy: llm_judge
model_ref: gpt-4o-judge
rubric: "The response should identify this as a refund request and extract the order number."spec
description(string, optional): human-readable description of the dataset.samples([]EvalSample, required, min 1): the evaluation cases.
EvalSample
| Field | Type | Description |
|---|---|---|
name | string, required | Unique name within the dataset (case-insensitive uniqueness). |
input | map[string]string, required | Input passed to the agent system as task.spec.input. Must be non-empty. |
expected | EvalExpected, optional | Expected output criteria for exact_match scoring. |
scoring | EvalScoringConfig, optional | Per-sample scoring override. When set, takes precedence over the run-level default. |
EvalExpected
Mirrors EdgeCondition semantics. All non-empty fields must match (logical AND).
| Field | Type | Description |
|---|---|---|
output_contains | string | Output must contain this substring (case-insensitive). |
output_not_contains | string | Output must NOT contain this substring (case-insensitive). |
output_matches | string | Output must match this regex. Validated during normalization. |
output_json_path | string | Dot-notation JSON path (e.g. $.category). Requires a comparison operator. |
equals | string | JSON path value must equal this string. |
not_equals | string | JSON path value must NOT equal this string. |
contains | string | JSON path value (string or array) must contain this value. |
greater_than | string | JSON path numeric value must be greater than this threshold. |
less_than | string | JSON path numeric value must be less than this threshold. |
EvalScoringConfig
| Field | Type | Description |
|---|---|---|
strategy | string | One of: exact_match, llm_judge, manual, custom. |
model_ref | string | Required for llm_judge. References a ModelEndpoint. |
rubric | string | Evaluation rubric for llm_judge. |
tool_ref | string | Required for custom. References a Tool. |
Defaults and Validation
apiVersiondefaults toorloj.dev/v1.kinddefaults toEvalDataset.metadata.namespacedefaults todefault.status.phasedefaults toReady.- Sample names must be unique within a dataset (case-insensitive).
- Each sample must have at least one input entry.
output_matchesis validated as a valid regex during normalization.llm_judgestrategy requiresmodel_ref.customstrategy requirestool_ref.
status
phase:Ready(datasets are config-only; no controller moves them through phases).
API Endpoints
| Method | Path | Description |
|---|---|---|
GET | /v1/eval-datasets | List all datasets (supports namespace, limit, continue query params). |
POST | /v1/eval-datasets | Create or update a dataset. |
GET | /v1/eval-datasets/{name} | Get a dataset by name. |
PUT | /v1/eval-datasets/{name} | Update a dataset. |
DELETE | /v1/eval-datasets/{name} | Delete a dataset. |
Related
- EvalRun -- runs a dataset against an agent system
- Agent Evaluation (concept) -- overview of the evaluation framework
- Guide: Run Your First Agent Evaluation