Run State - Smithers

Every Smithers surface (CLI, Gateway, and programmatic clients) answers “what is this run doing right now?” by reading the same RunStateView, computed server-side from persisted state plus liveness signals. Surfaces never infer status from ps, event absence, or partial table reads. They call computeRunState.

RunState

type RunState =
  | "running"           // owner alive, work progressing
  | "waiting-approval"  // blocked on a human decision
  | "waiting-event"     // blocked on an external signal
  | "waiting-timer"     // blocked on a scheduled wakeup
  | "recovering"        // supervisor replaying / resuming
  | "stale"             // owner heartbeat expired, not yet recovered
  | "orphaned"          // owner gone, no supervisor candidate
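  | "failed"            // terminal: unhandled task failure
  | "cancelled"         // terminal: run was cancelled
  | "succeeded"         // terminal: legacy status finished or continued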
  | "unknown"           // telemetry gap; never invent a state

When state cannot be determined (e.g. a missing heartbeat with ambiguous DB status), the runtime returns unknown rather than inventing a more specific state.

The legacy run-row status column maps to RunState like this:

`_smithers_runs.status`	`RunState`
`running` (fresh)	`running`
`running` (stale)	`stale` or `orphaned`
`waiting-approval`	`waiting-approval`
`waiting-event`	`waiting-event`
`waiting-timer`	`waiting-timer`
`finished`	`succeeded`
`continued`	`succeeded`
`failed`	`failed`
`cancelled`	`cancelled`
missing / unknown text	`unknown`
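The mapping table can be sketched as a small function. This is only an illustration under stated assumptions, not the runtime's derivation: mapLegacyStatus, heartbeatFresh, and hasSupervisorCandidate are hypothetical names, and the stale / orphaned split follows the comments on the RunState type rather than any documented rule.

```typescript
// Local copy of the RunState union from this page.
type RunState =
  | "running" | "waiting-approval" | "waiting-event" | "waiting-timer"
  | "recovering" | "stale" | "orphaned"
  | "failed" | "cancelled" | "succeeded" | "unknown";

// Hypothetical inputs for this sketch; the real derivation reads the run row.
type LegacyInput = {
  status: string | null | undefined; // value of _smithers_runs.status
  heartbeatFresh: boolean;           // assumed precomputed freshness flag
  hasSupervisorCandidate: boolean;   // assumed; separates stale from orphaned
};

function mapLegacyStatus(input: LegacyInput): RunState {
  switch (input.status) {
    case "running":
      if (input.heartbeatFresh) return "running";
      return input.hasSupervisorCandidate ? "stale" : "orphaned";
    case "waiting-approval":
      return "waiting-approval";
    case "waiting-event":
      return "waiting-event";
    case "waiting-timer":
      return "waiting-timer";
    case "finished":
    case "continued":
      return "succeeded";
    case "failed":
      return "failed";
    case "cancelled":
      return "cancelled";
    default:
      // Missing or unrecognized text never maps to a more specific state.
      return "unknown";
  }
}
```

Note that recovering never appears on the right-hand side, which matches the table: no legacy status maps to it.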

recovering is reserved for the supervisor takeover window. As of this writing, it is defined but not yet emitted by the runtime.

waiting-event is overloaded; do not assume it always means “send a signal.” The same status covers two situations that need opposite recovery:

A genuine pause on a <Signal> or <WaitForEvent>. Recover by delivering the awaited signal: bunx smithers-orchestrator signal RUN_ID SIGNAL_NAME --data '{}'.
A run parked with no live event node: an external-trigger / hot-reload / orphan-recovery wait, an aspect or alert human-request, or an approval decided while the run was detached. Here a signal does nothing; the run needs a resume (or retry-task / fork / answering the human request) after you fix the underlying cause.

Do not guess between them. Ask the runtime, which already disambiguates on RunStateView.blocked.kind:

bunx smithers-orchestrator why RUN_ID     # names the real blocker and prints the exact unblock command
bunx smithers-orchestrator inspect RUN_ID # includes runState.blocked.kind

runState.blocked.kind === "event" means a node is truly awaiting an event. When the run row says waiting-event but no node is awaiting one, runState.blocked.kind is approval-decided-resume-required for a detached approval continuation, or external-trigger for a generic parked external wait. why still provides the most detailed unblock command and may report other blockers such as retries-exhausted or dependency-failed.
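For a programmatic client, the same disambiguation might look like the sketch below. It assumes you already hold a RunStateView-shaped object; nextStepForWaitingEvent and the NextStep labels are hypothetical and exist only for this example. Anything the sketch cannot classify falls back to running why.

```typescript
// Kinds copied from the ReasonBlocked union on this page.
type BlockedKind =
  | "approval" | "event" | "timer"
  | "approval-decided-resume-required" | "external-trigger"
  | "provider" | "tool";

// Only the RunStateView fields this sketch needs.
type WaitingView = { state: string; blocked?: { kind: BlockedKind } };

// Hypothetical labels for this sketch; not part of the Smithers API.
type NextStep = "deliver-signal" | "resume-after-fix" | "run-why";

function nextStepForWaitingEvent(view: WaitingView): NextStep {
  if (view.state !== "waiting-event") return "run-why";
  switch (view.blocked?.kind) {
    case "event":
      // A node is genuinely awaiting an event: a signal can unblock it.
      return "deliver-signal";
    case "approval-decided-resume-required":
    case "external-trigger":
      // No live event node: a signal does nothing, the run needs a resume.
      return "resume-after-fix";
    default:
      // blocked is absent or carries another kind: ask the runtime.
      return "run-why";
  }
}
```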

succeeded/finished can mask failed child agents. Run-level status is binary (finished or failed), and a run is failed only when it has an unhandled task failure. Two large classes of failed children are deliberately not treated as run-level failures: tasks marked continueOnFail, and agent tasks that fail with a transient error such as a rate limit, timeout, or abort (SESSION_ERROR, TASK_TIMEOUT, TASK_HEARTBEAT_TIMEOUT, TASK_ABORTED, or failureRetryable). A fan-out where most agents hit provider rate limits can therefore still report finished → succeeded.

Never trust the top-level status alone to conclude a run truly succeeded. Inspect per-node/agent outcomes:

bunx smithers-orchestrator inspect RUN_ID   # top-level runState plus every node's state (look for failed nodes)
bunx smithers-orchestrator node NODE_ID      # one node's failure/error, retries, and tool calls
bunx smithers-orchestrator events --run RUN_ID  # the full event history, including per-task failures
bunx smithers-orchestrator scores RUN_ID     # scorer results when you gate on judged output

You don’t have to re-derive the count by eye. When a finished run tolerated at least one failed child, the masking is surfaced as a first-class failedChildren signal in three places:

inspect adds failedChildren (a count) and failedChildKeys (the failed task state keys, nodeId::iteration) to its JSON next to runState, plus a node … call-to-action. They are absent when nothing failed, so failedChildren > 0 is the degraded-run gate.
The RunResult returned by a programmatic run carries the same failedChildren / failedChildKeys fields (omitted when zero).
The RunFinished event row records failedChildren / failedChildKeys, so the CLI, Gateway, and DevTools can flag the degraded outcome from the event stream without re-reading every node row.

The run-row status stays finished (and RunState stays succeeded) so existing finished === success callers and continueOnFail semantics are unchanged. The count is the signal, not a new terminal status.

To avoid the rate-limit failures in the first place, bound fan-out concurrency with <Parallel maxConcurrency={N}> so you don’t burst past the provider’s limit.
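The failedChildren > 0 gate can be sketched as follows. Only failedChildren, failedChildKeys, and the nodeId::iteration key format come from this page; the RunOutcome shape, its status field, and the helper names are assumptions made for the example.

```typescript
// Hypothetical shape: only failedChildren / failedChildKeys are documented.
type RunOutcome = {
  status: string;             // assumed field holding "finished" or "failed"
  failedChildren?: number;    // omitted when nothing failed
  failedChildKeys?: string[]; // task state keys, "nodeId::iteration"
};

type OutcomeGrade = "clean" | "degraded" | "failed" | "not-finished";

// Grade a run without trusting the top-level status alone.
function gradeOutcome(outcome: RunOutcome): OutcomeGrade {
  if (outcome.status === "failed") return "failed";
  if (outcome.status !== "finished") return "not-finished";
  return (outcome.failedChildren ?? 0) > 0 ? "degraded" : "clean";
}

// Split "nodeId::iteration" on the last "::" so a node id may contain "::".
function splitChildKey(key: string): { nodeId: string; iteration: string } {
  const at = key.lastIndexOf("::");
  if (at < 0) return { nodeId: key, iteration: "" };
  return { nodeId: key.slice(0, at), iteration: key.slice(at + 2) };
}
```

A caller that previously treated finished as success keeps working unchanged; one that cares about tolerated failures checks for "degraded" and then inspects each key's node.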

ReasonBlocked / ReasonUnhealthy

ReasonBlocked and ReasonUnhealthy are optional reason payloads for waiting and unhealthy states. The type unions are wider than the variants currently derived from the DB rows.

type ReasonBlocked =
  | { kind: "approval"; nodeId: string; requestedAt: string }
  | { kind: "event";    nodeId: string; correlationKey: string }
  | { kind: "timer";    nodeId: string; wakeAt: string }
  | { kind: "approval-decided-resume-required"; nodeId: string }
  | { kind: "external-trigger" }
  | { kind: "provider"; nodeId: string; code: "rate-limit" | "auth" | "timeout" }
  | { kind: "tool";     nodeId: string; toolName: string; code: string }

type ReasonUnhealthy =
  | { kind: "engine-heartbeat-stale"; lastHeartbeatAt: string }
  | { kind: "ui-heartbeat-stale";     lastSeenAt: string }
  | { kind: "db-lock" }
  | { kind: "sandbox-unreachable" }
  | { kind: "supervisor-backoff"; attempt: number; nextAt: string }

Timestamps are ISO-8601 strings. Current computeRunState / deriveRunState emits approval, event, timer, approval-decided-resume-required, and external-trigger blocked reasons, plus engine-heartbeat-stale unhealthy reasons. The other variants are reserved by the public type and may appear on future run-state surfaces.
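Because the union is wider than what is emitted today, a consumer may want to handle every variant and let the compiler flag any it misses. The sketch below copies ReasonBlocked from this page; describeBlocked and its message wording are hypothetical.

```typescript
// Copied from the ReasonBlocked definition on this page.
type ReasonBlocked =
  | { kind: "approval"; nodeId: string; requestedAt: string }
  | { kind: "event"; nodeId: string; correlationKey: string }
  | { kind: "timer"; nodeId: string; wakeAt: string }
  | { kind: "approval-decided-resume-required"; nodeId: string }
  | { kind: "external-trigger" }
  | { kind: "provider"; nodeId: string; code: "rate-limit" | "auth" | "timeout" }
  | { kind: "tool"; nodeId: string; toolName: string; code: string };

// Hypothetical helper: one human-readable line per blocker variant.
function describeBlocked(reason: ReasonBlocked): string {
  switch (reason.kind) {
    case "approval":
      return `approval pending on ${reason.nodeId} since ${reason.requestedAt}`;
    case "event":
      return `node ${reason.nodeId} awaits event ${reason.correlationKey}`;
    case "timer":
      return `node ${reason.nodeId} wakes at ${reason.wakeAt}`;
    case "approval-decided-resume-required":
      return `approval on ${reason.nodeId} decided; resume required`;
    case "external-trigger":
      return "parked on an external trigger";
    case "provider": // reserved: not emitted by the current derivation
      return `provider ${reason.code} on ${reason.nodeId}`;
    case "tool": // reserved: not emitted by the current derivation
      return `tool ${reason.toolName} failed on ${reason.nodeId} (${reason.code})`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a variant is added.
      const unreachable: never = reason;
      return `unrecognized blocker: ${JSON.stringify(unreachable)}`;
    }
  }
}
```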

RunStateView

type RunStateView = {
  runId: string;
  state: RunState;
  blocked?: ReasonBlocked;
  unhealthy?: ReasonUnhealthy;
  computedAt: string;        // ISO-8601
};

blocked is present for a waiting-* state when computeRunState can load matching pending approval/timer/event context, when a parked waiting-event run has no event-waiting node, or when that context is supplied to deriveRunState. A waiting-* state can be returned without blocked when the supporting row is unavailable. unhealthy is present for current stale and orphaned results. recovering is reserved by the type but is not currently emitted by the DB derivation. Terminal states (succeeded, failed, cancelled) carry neither.
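These presence rules lend themselves to a test-time check. The sketch below encodes only the two rules stated outright: terminal states carry neither field, and stale / orphaned results carry unhealthy. viewViolations is a hypothetical helper, and a waiting-* view without blocked is deliberately accepted.

```typescript
// Only the RunStateView fields the check needs.
type ViewShape = { state: string; blocked?: unknown; unhealthy?: unknown };

const TERMINAL_STATES = new Set(["succeeded", "failed", "cancelled"]);
const HEARTBEAT_EXPIRED_STATES = new Set(["stale", "orphaned"]);

// Hypothetical helper: returns one message per violated rule, empty if none.
function viewViolations(view: ViewShape): string[] {
  const problems: string[] = [];
  if (TERMINAL_STATES.has(view.state)) {
    if (view.blocked !== undefined) problems.push("terminal state carries blocked");
    if (view.unhealthy !== undefined) problems.push("terminal state carries unhealthy");
  }
  if (HEARTBEAT_EXPIRED_STATES.has(view.state) && view.unhealthy === undefined) {
    problems.push(`${view.state} state is missing unhealthy`);
  }
  return problems;
}
```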

computeRunState

import { computeRunState } from "@smithers-orchestrator/db/runState";

const view = await computeRunState(adapter, runId);
view.state;       // "running" | ...
view.blocked;     // present for waiting-* only when backing context is found
view.unhealthy;   // present for stale/orphaned heartbeat expiry

computeRunState is pure over the DB plus the heartbeat / lease signals on the run row. It does not call ps, does not probe sockets, and does not run heuristics. deriveRunState is the underlying pure function, useful in tests or when you already have the rows in memory:

import { deriveRunState } from "@smithers-orchestrator/db/runState";

const view = deriveRunState({
  run,
  pendingApproval,
  pendingTimer,
  pendingEvent,
  now: 1_700_000_000_000,
  staleThresholdMs: 30_000,
});

The default staleThresholdMs is 30_000, the same threshold the engine uses for isRunHeartbeatFresh.
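As a rough illustration of what a threshold check of this kind involves, consider the sketch below. It is not the engine's isRunHeartbeatFresh: classifyHeartbeat is a hypothetical helper, and both the inclusive boundary and the separate "missing" result are assumptions.

```typescript
const DEFAULT_STALE_THRESHOLD_MS = 30_000; // default documented on this page

type HeartbeatAge = "fresh" | "expired" | "missing";

// Hypothetical helper: compares the last heartbeat against the threshold.
// A missing heartbeat is reported separately so the caller can decide
// whether the run state is ambiguous.
function classifyHeartbeat(
  lastHeartbeatAtMs: number | null,
  nowMs: number,
  staleThresholdMs: number = DEFAULT_STALE_THRESHOLD_MS,
): HeartbeatAge {
  if (lastHeartbeatAtMs === null) return "missing";
  // Assumption: a heartbeat exactly at the threshold still counts as fresh.
  return nowMs - lastHeartbeatAtMs <= staleThresholdMs ? "fresh" : "expired";
}
```

Passing now explicitly, as deriveRunState does, keeps the check deterministic in tests.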

Where it shows up

RunStateView is the wire format on every read surface:

bunx smithers-orchestrator inspect RUN_ID: top-level runState field on the JSON output (and rendered in the human view).
Gateway RPC getRun: runState field on the response.
DevTools snapshot header: runState?: RunStateView field.
Event stream: RunStatusChanged records persisted status transitions. RunStateChanged is a typed/reserved event variant, but the current runtime does not emit it; call getRun or computeRunState when you need the derived RunStateView.

A run id that does not exist is not a RunState. It’s an error (RUN_NOT_FOUND). unknown is for ambiguity, not for “doesn’t exist.”
