EvalDataset

Stability: beta. This resource kind ships with orloj.dev/v1 and is suitable for production use, but its schema may evolve in future minor releases, with migration guidance provided.

An EvalDataset is a declarative list of (input, expected output) pairs, optionally with per-sample scoring rubrics. Datasets are applied like any other resource and referenced by EvalRun resources to drive evaluations.

Manifest

apiVersion: orloj.dev/v1
kind: EvalDataset
metadata:
  name: support-triage-golden
  namespace: default
spec:
  description: "Golden set for the support triage agent"
  samples:
    - name: billing-question
      input:
        prompt: "I was charged twice for my subscription"
      expected:
        output_contains: "billing"
        output_json_path: "$.category"
        equals: "billing"
    - name: refund-request
      input:
        prompt: "I want a refund for order #12345"
      expected:
        output_contains: "refund"
      scoring:
        strategy: llm_judge
        model_ref: gpt-4o-judge
        rubric: "The response should identify this as a refund request and extract the order number."

spec

  • description (string, optional): human-readable description of the dataset.
  • samples ([]EvalSample, required, min 1): the evaluation cases.

EvalSample

| Field | Type | Description |
| --- | --- | --- |
| name | string, required | Unique name within the dataset (case-insensitive uniqueness). |
| input | map[string]string, required | Input passed to the agent system as task.spec.input. Must be non-empty. |
| expected | EvalExpected, optional | Expected output criteria for exact_match scoring. |
| scoring | EvalScoringConfig, optional | Per-sample scoring override. When set, takes precedence over the run-level default. |
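
The precedence rule for scoring can be sketched in a few lines. This is an illustration only: how the run-level default is declared on an EvalRun is not covered on this page, so run_default is a hypothetical dict, and treating an empty scoring block as "not set" is an assumption.

```python
from typing import Optional


def effective_scoring(sample: dict, run_default: Optional[dict]) -> Optional[dict]:
    """Pick the scoring config that applies to one sample.

    A per-sample `scoring` block wins over the run-level default.
    Assumption: an absent or empty `scoring` block counts as "not set".
    """
    override = sample.get("scoring")
    return override if override else run_default
```

For the manifest above, refund-request would be scored with llm_judge regardless of the run-level default, while billing-question would fall back to whatever the run specifies.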

EvalExpected

Mirrors EdgeCondition semantics. All non-empty fields must match (logical AND).

| Field | Type | Description |
| --- | --- | --- |
| output_contains | string | Output must contain this substring (case-insensitive). |
| output_not_contains | string | Output must NOT contain this substring (case-insensitive). |
| output_matches | string | Output must match this regex. Validated during normalization. |
| output_json_path | string | Dot-notation JSON path (e.g. $.category). Requires one of the comparison fields below (equals, not_equals, contains, greater_than, less_than). |
| equals | string | JSON path value must equal this string. |
| not_equals | string | JSON path value must NOT equal this string. |
| contains | string | JSON path value (string or array) must contain this value. |
| greater_than | string | JSON path numeric value must be greater than this threshold. |
| less_than | string | JSON path numeric value must be less than this threshold. |
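
To make the logical-AND semantics concrete, here is a minimal Python sketch of how such a check could be evaluated. It is not Orloj's implementation. Several details are assumptions because this page does not specify them: the regex is applied as a search rather than a full match, equals is compared case-sensitively against the stringified value, the path resolver only handles nested object keys, and a missing path or non-JSON output fails the sample.

```python
import json
import re


def resolve_path(doc, path):
    """Walk a dot-notation path such as "$.a.b" through nested dicts."""
    current = doc
    for key in path.lstrip("$").strip(".").split("."):
        if key == "":
            continue  # the path was just "$"
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def as_number(value):
    """Return value as a float, or None when it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def matches_expected(output, expected):
    """Return True only if every non-empty field in `expected` matches."""
    text = output.lower()
    if expected.get("output_contains") and expected["output_contains"].lower() not in text:
        return False
    if expected.get("output_not_contains") and expected["output_not_contains"].lower() in text:
        return False
    # Assumption: regex is applied with search semantics.
    if expected.get("output_matches") and not re.search(expected["output_matches"], output):
        return False

    path = expected.get("output_json_path")
    if not path:
        return True
    try:
        value = resolve_path(json.loads(output), path)
    except json.JSONDecodeError:
        return False  # assumption: non-JSON output fails a JSON path check
    if value is None:
        return False  # assumption: a missing path fails the sample

    if expected.get("equals") and str(value) != expected["equals"]:
        return False
    if expected.get("not_equals") and str(value) == expected["not_equals"]:
        return False
    if expected.get("contains"):
        items = [str(v) for v in value] if isinstance(value, list) else value
        if not isinstance(items, (str, list)) or expected["contains"] not in items:
            return False
    comparisons = (
        ("greater_than", lambda v, t: v > t),
        ("less_than", lambda v, t: v < t),
    )
    for field, passes in comparisons:
        if expected.get(field):
            number, threshold = as_number(value), as_number(expected[field])
            if number is None or threshold is None or not passes(number, threshold):
                return False
    return True
```

With the billing-question sample from the manifest, an output of {"category": "billing"} passes because both the substring check and the JSON path comparison hold; changing equals to "refund" would fail the sample even though output_contains still matches.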

EvalScoringConfig

| Field | Type | Description |
| --- | --- | --- |
| strategy | string | One of: exact_match, llm_judge, manual, custom. |
| model_ref | string | Required for llm_judge. References a ModelEndpoint. |
| rubric | string | Evaluation rubric for llm_judge. |
| tool_ref | string | Required for custom. References a Tool. |

Defaults and Validation

  • apiVersion defaults to orloj.dev/v1.
  • kind defaults to EvalDataset.
  • metadata.namespace defaults to default.
  • status.phase defaults to Ready.
  • Sample names must be unique within a dataset (case-insensitive).
  • Each sample must have at least one input entry.
  • output_matches is validated as a valid regex during normalization.
  • llm_judge strategy requires model_ref.
  • custom strategy requires tool_ref.
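
The defaults and checks listed above can be approximated client-side, for example to lint a manifest before applying it. The following Python sketch assumes the manifest has already been parsed into a dict. The normalize function, its error messages, and the rejection of unknown strategy names are illustrative choices, not part of Orloj.

```python
import copy
import re

VALID_STRATEGIES = {"exact_match", "llm_judge", "manual", "custom"}


def normalize(manifest: dict) -> dict:
    """Return a copy of the manifest with defaults applied, or raise ValueError."""
    doc = copy.deepcopy(manifest)
    doc.setdefault("apiVersion", "orloj.dev/v1")
    doc.setdefault("kind", "EvalDataset")
    doc.setdefault("metadata", {}).setdefault("namespace", "default")
    doc.setdefault("status", {}).setdefault("phase", "Ready")

    samples = doc.get("spec", {}).get("samples") or []
    if not samples:
        raise ValueError("spec.samples must contain at least one sample")

    seen = set()
    for sample in samples:
        name = sample.get("name", "")
        if not name:
            raise ValueError("every sample needs a name")
        if name.lower() in seen:  # uniqueness is case-insensitive
            raise ValueError(f"duplicate sample name: {name}")
        seen.add(name.lower())

        if not sample.get("input"):
            raise ValueError(f"sample {name}: input must be non-empty")

        pattern = (sample.get("expected") or {}).get("output_matches")
        if pattern:
            try:
                re.compile(pattern)
            except re.error as exc:
                raise ValueError(f"sample {name}: invalid output_matches regex") from exc

        scoring = sample.get("scoring") or {}
        strategy = scoring.get("strategy")
        if strategy is not None and strategy not in VALID_STRATEGIES:
            raise ValueError(f"sample {name}: unknown strategy {strategy}")
        if strategy == "llm_judge" and not scoring.get("model_ref"):
            raise ValueError(f"sample {name}: llm_judge requires model_ref")
        if strategy == "custom" and not scoring.get("tool_ref"):
            raise ValueError(f"sample {name}: custom requires tool_ref")
    return doc
```

Note that Python's regex dialect may differ from the one the server uses, so a pattern that compiles here is not guaranteed to be accepted during normalization, and vice versa.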

status

  • phase: Ready (datasets are config-only; no controller moves them through phases).

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /v1/eval-datasets | List all datasets (supports namespace, limit, continue query params). |
| POST | /v1/eval-datasets | Create or update a dataset. |
| GET | /v1/eval-datasets/{name} | Get a dataset by name. |
| PUT | /v1/eval-datasets/{name} | Update a dataset. |
| DELETE | /v1/eval-datasets/{name} | Delete a dataset. |
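
As a rough illustration of calling these endpoints, the sketch below builds requests with Python's standard library. The base URL is a placeholder, and the JSON request body and Content-Type header are assumptions, since this page does not state the wire format or any authentication requirements.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"  # hypothetical server address


def list_request(namespace="default", limit=None, continue_token=None):
    """Build a GET request that lists datasets, optionally paginated."""
    params = {"namespace": namespace}
    if limit is not None:
        params["limit"] = str(limit)
    if continue_token:
        params["continue"] = continue_token
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(f"{BASE_URL}/v1/eval-datasets?{query}", method="GET")


def put_request(manifest):
    """Build a PUT request that updates the dataset named in the manifest."""
    name = urllib.parse.quote(manifest["metadata"]["name"], safe="")
    body = json.dumps(manifest).encode("utf-8")  # assumption: JSON body
    return urllib.request.Request(
        f"{BASE_URL}/v1/eval-datasets/{name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def delete_request(name):
    """Build a DELETE request for a dataset by name."""
    quoted = urllib.parse.quote(name, safe="")
    return urllib.request.Request(f"{BASE_URL}/v1/eval-datasets/{quoted}", method="DELETE")
```

Each function only constructs the request; passing the result to urllib.request.urlopen would send it to a running server.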
