API reference: Scorers lists every scorer and its options, and links to source and tests.
workflow.tsx) and a working bunx smithers-orchestrator init scaffold. Each eval case runs that workflow with the given input and checks the result.
1. Create cases
Create evals/smoke.jsonl:
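The exact case schema is not reproduced on this page. As a rough sketch only, assuming each JSONL line is one JSON object with a hypothetical `id` field, an `input` field, and an `expected` object holding the checks described in this section, a case file could be generated like this (the `EvalCase` type and `toJsonl` helper are illustrative names, not part of the tool):

```typescript
// Hypothetical case shape: `id` and `input` are assumed field names;
// the keys of `expected` follow the checks described in this guide.
interface EvalCase {
  id: string;
  input: unknown;
  expected: {
    status?: string;
    output?: unknown;
    outputContains?: unknown;
    errorContains?: string;
  };
}

// JSONL means one JSON document per line, each terminated by a newline.
function toJsonl(cases: EvalCase[]): string {
  return cases.map((c) => JSON.stringify(c)).join("\n") + "\n";
}

const smokeCases: EvalCase[] = [
  { id: "happy-path", input: { topic: "hello" }, expected: { status: "finished" } },
  { id: "bad-input", input: {}, expected: { status: "failed", errorContains: "topic" } },
];

const jsonl = toJsonl(smokeCases);
```

The resulting string can be written to evals/smoke.jsonl with any file API. Check the field names against the real case schema before relying on them.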
cases array:
expected checks:
- status: one of finished, continued, failed, cancelled, waiting-approval, waiting-event, or waiting-timer
- output: exact JSON match against the workflow result output
- outputContains: recursive partial JSON match (array elements are matched by containment, so at least one array element must match the partial object)
- errorContains: substring match against thrown errors
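To make the outputContains semantics concrete, here is a minimal sketch of a recursive partial match with array containment. It illustrates the rule as described above and is not the tool's actual implementation; `Json`, `isObject`, and `partialMatch` are hypothetical names.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isObject(v: Json): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns true when everything in `expected` can be found in `actual`.
function partialMatch(actual: Json, expected: Json): boolean {
  if (Array.isArray(expected)) {
    if (!Array.isArray(actual)) return false;
    const actualItems: Json[] = actual;
    // Containment: every expected element must match at least one actual element.
    return expected.every((e) => actualItems.some((a) => partialMatch(a, e)));
  }
  if (isObject(expected)) {
    if (!isObject(actual)) return false;
    const actualObj = actual;
    // Partial: only the keys named in `expected` are checked; extra keys are ignored.
    return Object.entries(expected).every(
      ([key, value]) =>
        Object.prototype.hasOwnProperty.call(actualObj, key) &&
        partialMatch(actualObj[key] as Json, value),
    );
  }
  // Primitives and null must be strictly equal.
  return actual === expected;
}

const workflowOutput: Json = {
  summary: "done",
  files: [
    { path: "a.ts", changed: true },
    { path: "b.ts", changed: false },
  ],
};

const matches = partialMatch(workflowOutput, { files: [{ path: "b.ts" }] });
const misses = partialMatch(workflowOutput, { files: [{ path: "c.ts" }] });
```

In this sketch `matches` is true because one element of `files` contains `path: "b.ts"`, while `misses` is false because no element has `path: "c.ts"`.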
2. Dry-run the plan
Use --run-label <label> when you want the dry run and the execution to use the same generated IDs.
3. Execute the suite
.smithers/evals/smoke.json. Use --report path/to/report.json to choose a different location.
The command exits 0 when all cases pass and 1 when any case fails. Invalid case files exit with 4.
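A CI wrapper can branch on these exit codes to tell a failing suite apart from a broken case file. The sketch below covers only the three codes documented above and treats anything else as unknown; `SuiteOutcome` and `classifyExitCode` are hypothetical names, and obtaining the exit code is left to whatever process runner you use.

```typescript
type SuiteOutcome = "passed" | "failed" | "invalid-cases" | "unknown";

// Maps the documented exit codes to an outcome label.
function classifyExitCode(code: number): SuiteOutcome {
  switch (code) {
    case 0:
      return "passed"; // all cases passed
    case 1:
      return "failed"; // at least one case failed
    case 4:
      return "invalid-cases"; // the case file could not be validated
    default:
      return "unknown"; // not documented in this guide
  }
}
```

Distinguishing `invalid-cases` from `failed` lets a pipeline report a configuration problem instead of a regression.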
Tip: You can use a workflow ID (implement) instead of a file path when running discovered workflows from .smithers/workflows. This lets the same suite run on every checkout without hard-coding entry file paths.
4. Use structured output in CI
Options that matter in production
- --concurrency N: run multiple cases at once; keep this low for stateful or expensive workflows.
- --run-label LABEL: append a stable label to run IDs, useful for CI build IDs or benchmark names.
- --max-concurrency N: pass a per-workflow task concurrency cap to each case.
- --max-cases N: shard or sample a large suite.
- --no-include-output: omit workflow outputs from the report when outputs are too large or sensitive.
- --allow-network: enable network access for bash tools in cases that need it.
- --root PATH: set the sandbox root for tool execution.