Eval Suites Quickstart

Use eval suites when a workflow is important enough to protect with repeatable cases. Each case gets a persisted run record, stable case metadata, and a JSON report that can be checked in CI.

API reference: Scorers lists every scorer and its options, and links to source and tests.

You need a workflow file (for example, workflow.tsx) in a project scaffolded with bunx smithers-orchestrator init. Each eval case runs that workflow with the case's input and checks the result against the case's expected checks.

1. Create cases

Create evals/smoke.jsonl:

{"id":"simple-request","input":{"prompt":"Summarize the repository"},"expected":{"status":"finished"}}
{"id":"structured-output","input":{"prompt":"Return the key risks"},"expected":{"status":"finished","outputContains":{"analysis":[{"riskLevel":"low"}]}}}

Case files can be JSONL, a JSON array, or an object with a cases array:

{
  "cases": [
    {
      "id": "happy-path",
      "input": { "prompt": "Draft a release note" },
      "annotations": { "area": "release" },
      "expected": {
        "status": "finished",
        "outputContains": { "analysis": [{ "riskLevel": "low" }] }
      }
    }
  ]
}
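To illustrate the three accepted layouts, here is a minimal TypeScript sketch that normalizes any of them into a flat list of cases. It is not the orchestrator's own loader: the EvalCase type and the parseCases helper are hypothetical names, and the detection rule (try the whole file as one JSON document, then fall back to one JSON object per line) is an assumption made for this example.

```typescript
// Hypothetical shape of one eval case, based on the fields shown above.
type EvalCase = {
  id: string;
  input: unknown;
  annotations?: Record<string, unknown>;
  expected?: Record<string, unknown>;
};

// Normalize JSONL, a JSON array, or { "cases": [...] } into EvalCase[].
// Assumption: try whole-document JSON first, then fall back to JSONL.
function parseCases(text: string): EvalCase[] {
  const trimmed = text.trim();
  let doc: unknown;
  try {
    doc = JSON.parse(trimmed);
  } catch {
    // Not a single JSON document: treat each non-empty line as one case.
    return trimmed
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as EvalCase);
  }
  if (Array.isArray(doc)) return doc as EvalCase[];
  if (doc !== null && typeof doc === "object") {
    const cases = (doc as { cases?: unknown }).cases;
    if (Array.isArray(cases)) return cases as EvalCase[];
    // A one-line JSONL file parses as a single object: treat it as one case.
    return [doc as EvalCase];
  }
  throw new Error("case file must contain JSON objects");
}
```

A helper like this can be handy in a pre-commit check that confirms every case has a unique id before the suite runs.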

Supported expected checks:

status: one of finished, continued, failed, cancelled, waiting-approval, waiting-event, or waiting-timer
output: exact JSON match against the workflow result output
outputContains: recursive partial JSON match (array elements are matched by containment, so at least one array element must match the partial object)
errorContains: substring match against thrown errors

For standard workflows, output assertions match the persisted output snapshot, keyed by output name.
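The outputContains rule can be easier to reason about with a small model of it. The TypeScript sketch below is one plausible reading of "recursive partial JSON match", not the orchestrator's actual matcher: containsPartial is a hypothetical helper, and the choice to compare primitives with strict equality is an assumption.

```typescript
// Returns true when `actual` contains everything described by `expected`.
// Hypothetical helper: a model of the documented behavior, not the real matcher.
function containsPartial(actual: unknown, expected: unknown): boolean {
  if (Array.isArray(expected)) {
    if (!Array.isArray(actual)) return false;
    const haveList: unknown[] = actual;
    // Each expected element must be contained in at least one actual element.
    return expected.every((want) =>
      haveList.some((have) => containsPartial(have, want)),
    );
  }
  if (expected !== null && typeof expected === "object") {
    if (actual === null || typeof actual !== "object" || Array.isArray(actual)) {
      return false;
    }
    const have = actual as Record<string, unknown>;
    // Extra keys in `actual` are ignored; every expected key must match.
    return Object.entries(expected as Record<string, unknown>).every(
      ([key, want]) => key in have && containsPartial(have[key], want),
    );
  }
  // Assumption: primitives (and null) must be strictly equal.
  return actual === expected;
}

const sampleOutput = {
  analysis: [
    { riskLevel: "high", note: "auth" },
    { riskLevel: "low", note: "docs" },
  ],
};
console.log(containsPartial(sampleOutput, { analysis: [{ riskLevel: "low" }] })); // true
console.log(containsPartial(sampleOutput, { analysis: [{ riskLevel: "medium" }] })); // false
```

Under this reading, the structured-output case above passes as long as any entry under analysis has riskLevel set to low, regardless of the other entries or extra fields.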

2. Dry-run the plan

bunx smithers-orchestrator eval workflow.tsx --cases evals/smoke.jsonl --suite smoke --dry-run

Dry-run mode prints the planned case IDs and run IDs without touching the database. Pass --run-label <label> when you want the dry-run and execution to use the same generated IDs.

3. Execute the suite

bunx smithers-orchestrator eval workflow.tsx --cases evals/smoke.jsonl --suite smoke --force

By default, the report is written to .smithers/evals/smoke.json. Use --report path/to/report.json to choose a different location. The command exits with 0 when all cases pass, 1 when any case fails, and 4 when the case file is invalid.
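A CI wrapper may want to tell a failing suite apart from a broken case file. The TypeScript sketch below maps the three exit codes documented above to labels. The classifyExit helper and the label strings are hypothetical, and any other exit code is reported as unknown because this page does not describe it.

```typescript
// Labels are illustrative; only codes 0, 1, and 4 are documented on this page.
type EvalOutcome = "passed" | "cases-failed" | "invalid-case-file" | "unknown";

// Hypothetical helper: translate the eval command's exit code into a label.
function classifyExit(code: number): EvalOutcome {
  switch (code) {
    case 0:
      return "passed"; // all cases passed
    case 1:
      return "cases-failed"; // at least one case failed
    case 4:
      return "invalid-case-file"; // the case file could not be validated
    default:
      return "unknown"; // not covered by this quickstart
  }
}

console.log(classifyExit(4)); // invalid-case-file
```

In a pipeline, a wrapper could treat invalid-case-file as a configuration error to fix immediately, and cases-failed as a regression to triage from the report.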

Tip: When running workflows discovered from .smithers/workflows, you can pass a workflow ID (such as implement) instead of a file path. The same suite then runs on every checkout without hard-coding entry file paths.

4. Use structured output in CI

bunx smithers-orchestrator eval workflow.tsx \
  --cases evals/smoke.jsonl \
  --suite smoke \
  --report artifacts/smoke-eval.json \
  --force \
  --format json

The JSON payload includes the suite summary, per-case assertions, run IDs, inputs, outputs, errors, and report path.

Options that matter in production

--concurrency N: run multiple cases at once; keep this low for stateful or expensive workflows.
--run-label LABEL: append a stable label to run IDs, useful for CI build IDs or benchmark names.
--max-concurrency N: pass a per-workflow task concurrency cap to each case.
--max-cases N: shard or sample a large suite.
--no-include-output: omit workflow outputs from the report when outputs are too large or sensitive.
--allow-network: enable network access for bash tools in cases that need it.
--root PATH: set the sandbox root for tool execution.
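When these options are set from a CI script, it can help to build the argument list in one place. The TypeScript sketch below assembles an argv array using only the flags listed on this page. The EvalOptions type and buildEvalArgs helper are hypothetical names, the flag order is arbitrary, and the format field is typed as "json" only because that is the one value shown here.

```typescript
// Hypothetical options bag; each field maps to a flag documented above.
type EvalOptions = {
  workflow: string; // file path or workflow ID
  cases: string;
  suite: string;
  report?: string;
  runLabel?: string;
  concurrency?: number;
  maxConcurrency?: number;
  maxCases?: number;
  includeOutput?: boolean; // false adds --no-include-output
  allowNetwork?: boolean;
  root?: string;
  force?: boolean;
  format?: "json";
};

// Build the argument list to pass to bunx (hypothetical helper).
function buildEvalArgs(o: EvalOptions): string[] {
  const args = [
    "smithers-orchestrator", "eval", o.workflow,
    "--cases", o.cases,
    "--suite", o.suite,
  ];
  if (o.report) args.push("--report", o.report);
  if (o.runLabel) args.push("--run-label", o.runLabel);
  if (o.concurrency !== undefined) args.push("--concurrency", String(o.concurrency));
  if (o.maxConcurrency !== undefined) args.push("--max-concurrency", String(o.maxConcurrency));
  if (o.maxCases !== undefined) args.push("--max-cases", String(o.maxCases));
  if (o.includeOutput === false) args.push("--no-include-output");
  if (o.allowNetwork) args.push("--allow-network");
  if (o.root) args.push("--root", o.root);
  if (o.force) args.push("--force");
  if (o.format) args.push("--format", o.format);
  return args;
}

console.log(
  buildEvalArgs({
    workflow: "workflow.tsx",
    cases: "evals/smoke.jsonl",
    suite: "smoke",
    force: true,
  }).join(" "),
);
// smithers-orchestrator eval workflow.tsx --cases evals/smoke.jsonl --suite smoke --force
```

Passing the result as an array to a process spawner, with bunx as the command, avoids shell quoting problems when labels or paths contain spaces.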
