# Run State

> The single typed model for "what is this run doing right now?"

Every Smithers surface (CLI, Gateway, and programmatic clients) answers
"what is this run doing right now?" by reading the same `RunStateView`,
computed server-side from persisted state plus liveness signals.

Surfaces never infer status from `ps`, event absence, or partial table
reads. They call `computeRunState`.

## RunState

```ts theme={null}
type RunState =
  | "running"           // owner alive, work progressing
  | "waiting-approval"  // blocked on a human decision
  | "waiting-event"     // blocked on an external signal
  | "waiting-timer"     // blocked on a scheduled wakeup
  | "recovering"        // supervisor replaying / resuming
  | "stale"             // owner heartbeat expired, not yet recovered
  | "orphaned"          // owner gone, no supervisor candidate
  | "failed"
  | "cancelled"
  | "succeeded"
  | "unknown"           // telemetry gap; never invent a state
```

When state cannot be determined (e.g. a missing heartbeat with ambiguous DB status), the runtime returns `unknown`. Never invent a more specific state.
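
Because `RunState` is a closed union, consumers can handle it with an exhaustive `switch`. The sketch below is illustrative only: `RunStateGroup` and `groupRunState` are hypothetical names, not part of Smithers, and placing `recovering` in the "active" group is an assumption.

```ts
// Redeclared so the sketch is self-contained; mirrors the RunState union above.
type RunState =
  | "running" | "waiting-approval" | "waiting-event" | "waiting-timer"
  | "recovering" | "stale" | "orphaned"
  | "failed" | "cancelled" | "succeeded" | "unknown";

// Hypothetical coarse grouping for display purposes (not a Smithers type).
type RunStateGroup = "active" | "waiting" | "unhealthy" | "terminal" | "unknown";

function groupRunState(state: RunState): RunStateGroup {
  switch (state) {
    case "running":
    case "recovering": // assumption: treat the takeover window as active
      return "active";
    case "waiting-approval":
    case "waiting-event":
    case "waiting-timer":
      return "waiting";
    case "stale":
    case "orphaned":
      return "unhealthy";
    case "failed":
    case "cancelled":
    case "succeeded":
      return "terminal";
    case "unknown":
      // Keep "unknown" distinct; never fold it into a more specific group.
      return "unknown";
    default: {
      // Compile-time exhaustiveness check: a new RunState variant fails to type-check here.
      const unreachable: never = state;
      return unreachable;
    }
  }
}
```

The `never` assignment in the `default` branch means a future `RunState` variant becomes a compile error rather than a silently mishandled state.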

The legacy run-row `status` column maps to `RunState` like this:

| `_smithers_runs.status` | `RunState`            |
| ----------------------- | --------------------- |
| `running` (fresh)       | `running`             |
| `running` (stale)       | `stale` or `orphaned` |
| `waiting-approval`      | `waiting-approval`    |
| `waiting-event`         | `waiting-event`       |
| `waiting-timer`         | `waiting-timer`       |
| `finished`              | `succeeded`           |
| `continued`             | `succeeded`           |
| `failed`                | `failed`              |
| `cancelled`             | `cancelled`           |
| missing / unknown text  | `unknown`             |

`recovering` is reserved for the supervisor takeover window. As of this writing, it is defined but not yet emitted by the runtime.
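
The table above can be read as a small function. This is a minimal sketch, not the runtime's implementation: `mapLegacyStatus` and the `Liveness` inputs are hypothetical, and choosing between `stale` and `orphaned` from a single `supervisorCandidate` flag is an assumption based on the comments in the `RunState` union.

```ts
type RunState =
  | "running" | "waiting-approval" | "waiting-event" | "waiting-timer"
  | "recovering" | "stale" | "orphaned"
  | "failed" | "cancelled" | "succeeded" | "unknown";

// Hypothetical liveness inputs; the real runtime derives these from
// heartbeat / lease signals on the run row.
type Liveness = { heartbeatFresh: boolean; supervisorCandidate: boolean };

function mapLegacyStatus(
  status: string | null | undefined,
  liveness: Liveness,
): RunState {
  switch (status) {
    case "running":
      if (liveness.heartbeatFresh) return "running";
      // Assumption: a stale run with no supervisor candidate is orphaned.
      return liveness.supervisorCandidate ? "stale" : "orphaned";
    case "waiting-approval":
      return "waiting-approval";
    case "waiting-event":
      return "waiting-event";
    case "waiting-timer":
      return "waiting-timer";
    case "finished":
    case "continued":
      return "succeeded";
    case "failed":
      return "failed";
    case "cancelled":
      return "cancelled";
    default:
      // Missing or unrecognized text: report "unknown" rather than guess.
      return "unknown";
  }
}
```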

<Warning>
  **`waiting-event` is overloaded; do not assume it always means "send a signal."** The same status covers two situations that need opposite recovery:

  1. A genuine pause on a [`<Signal>`](/components/signal) or [`<WaitForEvent>`](/components/wait-for-event). Recover by delivering the awaited signal: `bunx smithers-orchestrator signal RUN_ID SIGNAL_NAME --data '{}'`.
  2. A run parked with no live event node: an external-trigger / hot-reload / orphan-recovery wait, an aspect or alert human-request, or an approval decided while the run was detached. Here a signal does nothing; the run needs a resume (or `retry-task` / `fork` / answering the human request) after you fix the underlying cause.

  Do not guess between them. Ask the runtime, which already disambiguates on
  `RunStateView.blocked.kind`:

  ```bash theme={null}
  bunx smithers-orchestrator why RUN_ID     # names the real blocker and prints the exact unblock command
  bunx smithers-orchestrator inspect RUN_ID # includes runState.blocked.kind
  ```

  `runState.blocked.kind === "event"` means a node is truly awaiting an event.
  When the run row says `waiting-event` but no node is, `runState.blocked.kind`
  is `approval-decided-resume-required` for a detached approval continuation, or
  `external-trigger` for a generic parked external wait. `why` still provides the
  most detailed unblock command and may report other blockers such as
  `retries-exhausted` or `dependency-failed`.
</Warning>
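
For programmatic clients, the same disambiguation can be sketched as a branch on `blocked.kind`. The `RecoveryAction` labels and `recoveryForWaitingEvent` are hypothetical and deliberately coarse: "resume" stands in for the whole family of resume, `retry-task`, `fork`, or answering the human request, and anything unrecognized defers to `why`.

```ts
// Mirrors the ReasonBlocked union documented on this page.
type ReasonBlocked =
  | { kind: "approval"; nodeId: string; requestedAt: string }
  | { kind: "event"; nodeId: string; correlationKey: string }
  | { kind: "timer"; nodeId: string; wakeAt: string }
  | { kind: "approval-decided-resume-required"; nodeId: string }
  | { kind: "external-trigger" }
  | { kind: "provider"; nodeId: string; code: "rate-limit" | "auth" | "timeout" }
  | { kind: "tool"; nodeId: string; toolName: string; code: string };

// Hypothetical labels for illustration; not a Smithers type.
type RecoveryAction = "send-signal" | "resume" | "run-why";

// Intended for a view whose state is "waiting-event".
function recoveryForWaitingEvent(blocked: ReasonBlocked | undefined): RecoveryAction {
  if (!blocked) return "run-why"; // no backing context: ask the runtime
  switch (blocked.kind) {
    case "event":
      return "send-signal"; // a node is truly awaiting an event
    case "approval-decided-resume-required":
    case "external-trigger":
      return "resume"; // a signal does nothing here
    default:
      return "run-why"; // unexpected for waiting-event: let `why` name the blocker
  }
}
```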

<Warning>
  **`succeeded`/`finished` can mask failed child agents.** Run-level status is binary (`finished` or `failed`), and a run is `failed` only when it has an *unhandled* task failure. Two large classes of failed children are deliberately not treated as run-level failures: tasks marked [`continueOnFail`](/components/task), and agent tasks that fail with a *transient* error (rate limits, timeouts, and aborts: `SESSION_ERROR`, `TASK_TIMEOUT`, `TASK_HEARTBEAT_TIMEOUT`, `TASK_ABORTED`, or `failureRetryable`). A fan-out where most agents hit provider rate limits can therefore still report `finished` → `succeeded`.

  Never trust the top-level status alone to conclude a run truly succeeded. Inspect per-node/agent outcomes:

  ```bash theme={null}
  bunx smithers-orchestrator inspect RUN_ID   # top-level runState plus every node's state (look for failed nodes)
  bunx smithers-orchestrator node NODE_ID      # one node's failure/error, retries, and tool calls
  bunx smithers-orchestrator events --run RUN_ID  # the full event history, including per-task failures
  bunx smithers-orchestrator scores RUN_ID     # scorer results when you gate on judged output
  ```

  You don't have to re-derive the count by eye. When a `finished` run tolerated at
  least one failed child, the masking is surfaced as a first-class **`failedChildren`**
  signal in three places:

  * **`inspect`** adds `failedChildren` (a count) and `failedChildKeys` (the failed
    task state keys, `nodeId::iteration`) to its JSON next to `runState`, plus a
    `node …` call-to-action. They are absent when nothing failed, so
    `failedChildren > 0` is the degraded-run gate.
  * The **`RunResult`** returned by a programmatic run carries the same
    `failedChildren` / `failedChildKeys` fields (omitted when zero).
  * The **`RunFinished`** event row records `failedChildren` / `failedChildKeys`, so the
    CLI, Gateway, and DevTools can flag the degraded outcome from the event stream
    without re-reading every node row.

  The run-row `status` stays `finished` (and `RunState` stays `succeeded`) so existing
  `finished === success` callers and `continueOnFail` semantics are unchanged. The
  count is the signal, not a new terminal status.

  To avoid the rate-limit failures in the first place, bound fan-out concurrency with [`<Parallel maxConcurrency={N}>`](/components/parallel) so you don't burst past the provider's limit.
</Warning>
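
A degraded-run gate over these fields might look like the sketch below. The `FailedChildInfo` shape is an assumption that covers only the two fields named above (with `failedChildKeys` assumed to be an array of strings), and both helper names are hypothetical.

```ts
// Assumed subset of the `inspect` JSON / `RunResult`; both fields are
// omitted when nothing failed.
type FailedChildInfo = {
  failedChildren?: number;
  failedChildKeys?: string[]; // assumed: "nodeId::iteration" strings
};

// True when a finished run tolerated at least one failed child.
function isDegraded(result: FailedChildInfo): boolean {
  return (result.failedChildren ?? 0) > 0;
}

// Splits a task state key on its last "::" separator. The iteration is
// kept as a string because its exact format is not specified here.
function parseFailedChildKey(key: string): { nodeId: string; iteration: string } {
  const idx = key.lastIndexOf("::");
  if (idx < 0) return { nodeId: key, iteration: "" };
  return { nodeId: key.slice(0, idx), iteration: key.slice(idx + 2) };
}
```

Treating an absent `failedChildren` as zero matches the "omitted when zero" behavior described above, so the gate works for both clean and degraded runs.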

## ReasonBlocked / ReasonUnhealthy

`ReasonBlocked` and `ReasonUnhealthy` are optional reason payloads for
waiting and unhealthy states. The type unions are wider than the variants
currently derived from the DB rows.

```ts theme={null}
type ReasonBlocked =
  | { kind: "approval"; nodeId: string; requestedAt: string }
  | { kind: "event";    nodeId: string; correlationKey: string }
  | { kind: "timer";    nodeId: string; wakeAt: string }
  | { kind: "approval-decided-resume-required"; nodeId: string }
  | { kind: "external-trigger" }
  | { kind: "provider"; nodeId: string; code: "rate-limit" | "auth" | "timeout" }
  | { kind: "tool";     nodeId: string; toolName: string; code: string }

type ReasonUnhealthy =
  | { kind: "engine-heartbeat-stale"; lastHeartbeatAt: string }
  | { kind: "ui-heartbeat-stale";     lastSeenAt: string }
  | { kind: "db-lock" }
  | { kind: "sandbox-unreachable" }
  | { kind: "supervisor-backoff"; attempt: number; nextAt: string }
```

Timestamps are ISO-8601 strings. `computeRunState` / `deriveRunState` currently
emit the `approval`, `event`, `timer`, `approval-decided-resume-required`, and
`external-trigger` blocked reasons, plus the `engine-heartbeat-stale` unhealthy
reason. The other variants are reserved by the public type and may appear on
future run-state surfaces.

## RunStateView

```ts theme={null}
type RunStateView = {
  runId: string;
  state: RunState;
  blocked?: ReasonBlocked;
  unhealthy?: ReasonUnhealthy;
  computedAt: string;        // ISO-8601
};
```

`blocked` is present for a `waiting-*` state in three cases: `computeRunState`
can load the matching pending approval, timer, or event context; a parked
`waiting-event` run has no event-waiting node; or the context is supplied
directly to `deriveRunState`. A `waiting-*` state can be returned without
`blocked` when the supporting row is unavailable. `unhealthy` is currently
present on `stale` and `orphaned` results. `recovering` is reserved by the type
but is not currently emitted by the DB derivation. Terminal states (`succeeded`,
`failed`, `cancelled`) carry neither.
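
In tests, these presence rules can be checked with a small helper. This is a sketch under stated assumptions: `RunStateViewLike` simplifies the reason payloads to `{ kind: string }`, `viewViolations` is a hypothetical name, and only the two rules stated above are checked.

```ts
type RunState =
  | "running" | "waiting-approval" | "waiting-event" | "waiting-timer"
  | "recovering" | "stale" | "orphaned"
  | "failed" | "cancelled" | "succeeded" | "unknown";

// Simplified stand-in for RunStateView; reason payloads reduced to their kind.
type RunStateViewLike = {
  runId: string;
  state: RunState;
  blocked?: { kind: string };
  unhealthy?: { kind: string };
  computedAt: string; // ISO-8601
};

// Returns a human-readable list of rule violations (empty when the view is consistent).
function viewViolations(view: RunStateViewLike): string[] {
  const problems: string[] = [];
  const terminal =
    view.state === "succeeded" || view.state === "failed" || view.state === "cancelled";
  if (terminal && (view.blocked !== undefined || view.unhealthy !== undefined)) {
    problems.push(`terminal state "${view.state}" must not carry blocked/unhealthy`);
  }
  const heartbeatExpired = view.state === "stale" || view.state === "orphaned";
  if (heartbeatExpired && view.unhealthy === undefined) {
    problems.push(`state "${view.state}" is expected to carry unhealthy`);
  }
  // A waiting-* state without `blocked` is allowed, so it is not flagged.
  return problems;
}
```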

## computeRunState

```ts theme={null}
import { computeRunState } from "@smithers-orchestrator/db/runState";

const view = await computeRunState(adapter, runId);
view.state;       // "running" | ...
view.blocked;     // present for waiting-* only when backing context is found
view.unhealthy;   // present for stale/orphaned heartbeat expiry
```

`computeRunState` derives its result only from the DB plus the heartbeat /
lease signals on the run row. It does not call `ps`, probe sockets, or run
heuristics.

`deriveRunState` is the underlying pure function, useful in tests or
when you already have the rows in memory:

```ts theme={null}
import { deriveRunState } from "@smithers-orchestrator/db/runState";

const view = deriveRunState({
  run,
  pendingApproval,
  pendingTimer,
  pendingEvent,
  now: 1_700_000_000_000,
  staleThresholdMs: 30_000,
});
```

The default `staleThresholdMs` is `30_000`, the same threshold the
engine uses for `isRunHeartbeatFresh`.
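
To make the threshold concrete, here is a minimal sketch of a freshness check. It is not the engine's `isRunHeartbeatFresh`: the function name, the millisecond-epoch inputs, and the inclusive boundary are all assumptions.

```ts
const DEFAULT_STALE_THRESHOLD_MS = 30_000;

// Hypothetical helper: a heartbeat counts as fresh when it is no older than
// the threshold. Whether the real check is inclusive at the boundary is an assumption.
function isHeartbeatFreshSketch(
  lastHeartbeatAtMs: number,
  nowMs: number,
  staleThresholdMs: number = DEFAULT_STALE_THRESHOLD_MS,
): boolean {
  return nowMs - lastHeartbeatAtMs <= staleThresholdMs;
}

const now = 1_700_000_000_000;
const freshExample = isHeartbeatFreshSketch(now - 10_000, now); // 10 s old
const staleExample = isHeartbeatFreshSketch(now - 45_000, now); // 45 s old
```

With the default threshold, a heartbeat 10 seconds old counts as fresh and one 45 seconds old does not. Passing `now` explicitly, as `deriveRunState` does, keeps the check deterministic in tests.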

## Where it shows up

`RunStateView` is the wire format on every read surface:

* `bunx smithers-orchestrator inspect RUN_ID`: top-level `runState` field on the JSON
  output (and rendered in the human view).
* Gateway RPC `getRun`: `runState` field on the response.
* DevTools snapshot header: `runState?: RunStateView` field.
* Event stream: `RunStatusChanged` records persisted status transitions.
  `RunStateChanged` is a typed/reserved event variant, but the current runtime
  does not emit it; call `getRun` or `computeRunState` when you need the
  derived `RunStateView`.

A run id that does not exist is not a `RunState`. It's an error
(`RUN_NOT_FOUND`). `unknown` is for ambiguity, not for "doesn't exist."
