operations · lash docs

What This Is

The runtime state that operational policy acts on — session-execution leases, queued-work and turn-input claims, durable waits, cached provider transports, and trace buffers — lives inside lash. lash's obligation is that every reasonable host policy is implementable through explicit, lash-owned levers over that state.

Deployment Topologies

The same runtime runs in three shapes. Pick the smallest one that meets your durability and failover needs; the levers on this page apply to all three.

single process

One host, embedded state. Use the local-file SQLite store for durable single-host runtimes, or omit a store for throwaway demos. Failover is process restart on the same box; local-process lease liveness makes it fast. Start at persistence.

multi-worker + shared Postgres

Identical stateless workers routed by session_id, sharing one PostgresStorage and S3-compatible attachment bytes. The session-execution lease enforces one mutating writer per session across the fleet. The full model is in scaling.

Restate-backed durable tier

Turns run under a workflow engine with RestateRuntimeEffectController, so a crash reruns the handler and replays effect outcomes before the final commit. Long turns and background processes survive worker loss. See durability and replay.

Worker Identity

By default every opened session gets a fresh owner id and incarnation id, so two workers that open the same session are arbitrated at the lease boundary. A host that wants sub-TTL reclaim of a crashed owner sets an explicit identity with SessionBuilder::session_execution_owner.

An identity is a stable owner_id (one per replica, stable across restarts) plus an incarnation_id the process bumps each boot. Reentry requires the same owner id and the same incarnation; a new incarnation must claim or reclaim through the fenced lease path rather than clearing the old owner. What differs between deployments is the liveness metadata that identity carries.

Liveness	Construct with	What it buys
`LocalProcess`	`LeaseOwnerIdentity::local_process(owner_id, incarnation, host_id)`	Attaches this host's kernel boot id, pid, and process-start time. A same-host peer can prove a crashed holder definitely dead and reclaim its lease before the TTL. Requires a matching host id and kernel boot id; a machine reboot changes the boot id and falls back to TTL. Reads `/proc`, so it degrades to opaque off Linux.
`Opaque`	`LeaseOwnerIdentity::opaque(owner_id, incarnation)`	No liveness proof. A crashed holder's lease is recovered only when its TTL expires. This is the correct choice for cross-host and distributed holders, where no peer can inspect another's process table.

// A stable owner id per replica plus a per-boot incarnation. `local_process`
// attaches this host's kernel boot id and pid, so a same-host peer can prove
// a crashed holder dead and reclaim its lease before the TTL. On a non-Linux
// host, or across a machine reboot (the boot id changes), it degrades to
// opaque, TTL-only reclaim.
let owner = LeaseOwnerIdentity::local_process(
    std::env::var("WORKER_ID").unwrap_or_else(|_| "worker-1".to_string()),
    std::env::var("AGENT_SERVICE_INCARNATION").unwrap_or_else(|_| boot_incarnation()),
    std::env::var("HOSTNAME").unwrap_or_else(|_| "host-1".to_string()),
);

// Cross-host / opaque holders (the common distributed case) get TTL-only
// reclaim; build them with `LeaseOwnerIdentity::opaque(owner_id, incarnation)`.
let session = core
    .session(chat_id)
    .session_execution_owner(owner)
    .open()
    .await?;
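
The reentry rule described above can be modelled as a small pure decision. This is a minimal sketch, not lash's API: `Identity`, `Admission`, and `admit` are hypothetical names introduced only for illustration, and the sketch assumes that a free lease is also taken through the ordinary claim path.

```rust
/// Hypothetical model of a lease owner identity; not lash's own type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Identity {
    pub owner_id: String,
    pub incarnation_id: String,
}

/// What a caller may do when it meets the current lease holder.
#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    /// Same owner id and same incarnation: the holder re-enters its own lease.
    Reenter,
    /// Anything else, including a new incarnation of the same owner, must
    /// claim or reclaim through the fenced lease path.
    MustClaim,
}

/// Decide how `caller` relates to the current `holder`, if any.
pub fn admit(holder: Option<&Identity>, caller: &Identity) -> Admission {
    match holder {
        Some(h) if h == caller => Admission::Reenter,
        // Assumption: a free lease (no holder) also goes through a claim.
        _ => Admission::MustClaim,
    }
}
```

A restarted replica keeps its `owner_id` but bumps `incarnation_id`, so in this model it lands in `MustClaim` rather than silently inheriting the old lease.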

Lease Timings

One host-configurable LeaseTimings on the core builder governs every durable single-writer lane. It is the failover-latency vs false-takeover-risk knob, moved out of lash and into host hands.

Lane	What the timing governs
session execution	How long a mutating session runner holds its single-writer lease and how often it renews.
effect replay	The durable effect-replay leases; effect hosts (SQLite/Postgres replay options) accept the same type, so one decision spans both boundaries.
queued work	The claim TTL for durable queued-work batches picked up by any lease-holding runner.
turn input	The claim TTL for pending turn-input dispatch.
process leases	How long a durable process worker holds a process lease and how often it renews.

The default is a 30 second TTL with a 10 second renew interval. The constructor enforces ttl >= 3 * renew_interval, so a healthy owner can miss two consecutive renewals — a scheduler stall, a transient store error — before a peer may treat the lease as expired. A shorter TTL reclaims a crashed owner's work sooner but widens the window in which a slow-but-alive owner is falsely taken over; a longer TTL does the reverse. Choosing that number is the trade, and it is now a host decision instead of a constant baked into lash.

// One timing decision governs every durable single-writer lane lash claims:
// session-execution leases, effect replay, queued-work and turn-input
// claims, and process leases. `new` enforces `ttl >= 3 * renew_interval`, so
// a live owner can miss two renewals before a peer may treat the lease as
// expired. A shorter TTL reclaims a crashed owner's work sooner; a longer
// one tolerates slower renewal and lowers false-takeover risk.
let lease_timings = LeaseTimings::new(
    Duration::from_secs(15), // ttl
    Duration::from_secs(5),  // renew_interval
)
.expect("ttl >= 3 * renew_interval");

let core = LashCore::rlm_builder(factory)
    .provider(provider)
    .model(
        lash::ModelSpec::from_token_limits("anthropic/claude-sonnet-4.6", None, 200_000, None)
            .expect("valid model metadata"),
    )
    .store_factory(store_factory)
    .effect_host(Arc::new(InlineEffectHost::default()))
    .lease_timings(lease_timings) // omit to keep the 30s ttl / 10s renew default
    .build()?;
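
To make the `ttl >= 3 * renew_interval` rule concrete, here is a self-contained sketch of the same check. `Timings` is a hypothetical stand-in, not lash's `LeaseTimings`; rejecting a zero renew interval and the `renewal_ticks_before_expiry` helper are assumptions of this sketch rather than documented behaviour.

```rust
use std::time::Duration;

/// Hypothetical stand-in for a validated timing pair.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Timings {
    pub ttl: Duration,
    pub renew_interval: Duration,
}

impl Timings {
    /// Accepts the pair only when `ttl >= 3 * renew_interval`.
    pub fn new(ttl: Duration, renew_interval: Duration) -> Result<Self, String> {
        // Assumption of this sketch: a zero interval is never meaningful.
        if renew_interval.is_zero() {
            return Err("renew_interval must be non-zero".to_string());
        }
        let floor = renew_interval
            .checked_mul(3)
            .ok_or_else(|| "renew_interval is too large".to_string())?;
        if ttl < floor {
            return Err(format!("ttl {ttl:?} is below 3 * renew_interval ({floor:?})"));
        }
        Ok(Self { ttl, renew_interval })
    }

    /// Renewal ticks that fall strictly before expiry, counted from the last
    /// successful renewal. Each can be missed while the lease is still valid.
    pub fn renewal_ticks_before_expiry(&self) -> u128 {
        (self.ttl.as_nanos() - 1) / self.renew_interval.as_nanos()
    }
}
```

With the 30 second / 10 second default, two renewal ticks fall before expiry, which matches the "miss two renewals" margin described above. Widening the TTL to 40 seconds at the same interval raises that to three.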

Graceful Drain

There is no LashCore::shutdown(). A host composes its own drain from explicit levers, in host-owned order, with host-owned deadlines. Each numbered step below is a policy decision the host makes; the calls are the levers lash provides to carry it out.

// lash ships no drain orchestrator (ADR-0014): each step is an explicit,
// host-owned lever. The order below and every deadline are host policy.

// 1. Stop admitting new turns. A host-layer decision — flip a readiness
//    flag, drain the load balancer. lash cannot see your ingress.

// 2. Finish or cancel in-flight turns. A live turn shares the session and
//    makes park/close fail with `SessionStillInUse` until it ends.
for session in &idle_sessions {
    session.cancel_running_turns();
}

// 3. Park resumable sessions (flush dirty state through a fresh-lease commit,
//    release the lease, keep a cheap handle) or `close()` ephemeral ones.
//    Both consume the session and need exclusive ownership.
for session in idle_sessions {
    let parked = session.park().await?;
    // Cache `parked` keyed by `parked.session_id()` and rebuild it later
    // with `LashCore::resume(parked)`; drop it instead to fully close.
    let _ = parked.session_id();
}

// 4. If you stopped an external queued-work or turn-input driver mid-claim,
//    hand its claims back so peers take the work now instead of waiting out
//    the claim TTL — `session.abandon_queued_work_claim(&claim)` and
//    `session.abandon_turn_input_claim(&claim)` — and resolve outstanding
//    durable waits as `Cancelled` with `session.revoke_durable_waits()`.

// 5. Release provider transports. The default `close()` is a no-op; the
//    Codex provider sends WebSocket Close frames on its cached sessions.
let _ = provider.close().await;

// 6. Flush the trace sink (fsync for JSONL). OTel span-export durability is
//    the host's duty: `force_flush()`/`shutdown()` your own TracerProvider.
core.flush_trace_sink()?;

// 7. Exit. Any lease this process still holds now expires on its TTL.
Ok(())

Steps 1 and 2 are pure policy: lash cannot see your ingress or your grace budget, so stopping admission and deciding whether to await or cancel in-flight turns stays with the host. Steps 3 through 6 are the levers — park/close, the claim and wait handbacks, provider close, trace flush — each an explicit call with no hidden ordering. Anything lash cannot expose as a lever without becoming an orchestrator (signal handling, drain deadlines, readiness endpoints) is host territory by design.

The agent service example wires the subset a stateless request/response host owns: an axum graceful-shutdown signal stops admission, then the process closes its retained provider handle and flushes its trace sink before exit.

A host that runs its own durable process worker has one more terminal-writing step, orthogonal to the session drain above: DurableProcessWorker::drain_owner_bound_work(). lash does not fold it into a facade drain, because only a host that constructs the worker (directly through lash::durability::DurableProcessWorker::new, or reached through RestateProcessDeployment::worker()) owns its lifecycle. Run it inside the worker's own shutdown, after stopping admission and releasing in-flight run leases.

The call terminalizes every non-terminal OwnerBound row this worker started (the row's first_started.owner equals the worker's lease owner) as Abandoned{OwnerDrain} under a fresh drain lease. The owner completing its own work is the ordinary graceful path, and shell.start processes carry OwnerBound. Rerunnable in-flight work (lashlang engine and subagent turns) takes the opposite contract: the worker just stops the local run task and writes no terminal, leaving the row non-terminal so the next worker re-runs it. Rows another owner started, not-yet-started OwnerBound rows (still claimable by any peer), and ExternallyOwned rows are left untouched.

What happens to a host's OwnerBound work when it crashes instead of draining is the subject of background process recovery.

Background Process Recovery

When a host crashes instead of draining, its non-terminal background processes are recovered by the next durable-process-worker sweep any peer drives — on session open, after a start, or on lease expiry. The sweep is not a re-run-everything loop. It obeys each row's declared Recovery Disposition (docs/adr/0019-process-recovery-obeys-declared-disposition.md): the required contract, with no default, that a producer stamps at registration — shell.start declares OwnerBound, the lashlang engine and subagent turns declare Rerunnable, and external placeholders declare ExternallyOwned.

The pivotal distinction the sweep draws is provably dead vs. merely lease-expired. Claiming a row's lease, it separates a fenced reclaim of a holder proven dead (is_definitely_dead_for_claimant — the same-host liveness proof LocalProcess identity carries, described under worker identity) from simply acquiring a free or TTL-expired lease with no death evidence. Elapsed time alone never terminalizes anything: a slow-but-alive owner whose lease lapsed is silent, not dead, and the sweep leaves its started work exactly where it is. Only a positive death proof — or an operator's explicit authorization (below) — converts that uncertainty into a terminal fact.

The verdict each disposition yields after owner loss:

Row	Recovery verdict
`Rerunnable`	Claimed and re-executed under a fresh lease, exactly as before this contract existed. The right policy for journaled, idempotent inputs (engine rows, session-turn rows).
`OwnerBound`, never started	Any peer may claim and run it: first execution is not re-execution. The runner records the durable `first_started` fact under its lease immediately before executing.
`OwnerBound`, started, holder provably dead	Terminalized `Abandoned{Sweep}`, never re-run. A replacement execution could duplicate non-idempotent side effects, so the only sound recovery is to stop.
`OwnerBound`, started, holder silent (lease lapsed, no death evidence)	Left non-terminal. Lease expiry is not death. The row waits until a death proof appears or an Abandon Request is reconciled — see detecting a stuck process.
`ExternallyOwned`	Never claimed, never executed. Closes only through an external `complete_process` call or a reconciled Abandon Request. Detached commands (`shell.start` with `detach: true`) are born here, already terminal.
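
The verdict table can be read as a total function over three facts. The sketch below restates it with hypothetical types (`Disposition`, `HolderEvidence`, `Verdict`, `verdict`); it is not the sweep's implementation and it deliberately leaves out Abandon Requests, which are covered separately.

```rust
/// Declared recovery contract of a row (hypothetical mirror of the table).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Disposition {
    Rerunnable,
    OwnerBound,
    ExternallyOwned,
}

/// What the sweep knows about the previous lease holder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HolderEvidence {
    /// A positive death proof exists.
    ProvablyDead,
    /// The lease lapsed, but nothing proves the holder died.
    Silent,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Rerun,
    ClaimAndRun,
    AbandonSweep,
    LeaveNonTerminal,
    NeverClaim,
}

/// One row of the table per match arm, in the same order.
pub fn verdict(disposition: Disposition, started: bool, evidence: HolderEvidence) -> Verdict {
    match (disposition, started, evidence) {
        (Disposition::Rerunnable, _, _) => Verdict::Rerun,
        // First execution is not re-execution.
        (Disposition::OwnerBound, false, _) => Verdict::ClaimAndRun,
        (Disposition::OwnerBound, true, HolderEvidence::ProvablyDead) => Verdict::AbandonSweep,
        // Lease expiry is not death.
        (Disposition::OwnerBound, true, HolderEvidence::Silent) => Verdict::LeaveNonTerminal,
        (Disposition::ExternallyOwned, _, _) => Verdict::NeverClaim,
    }
}
```

Writing it as an exhaustive `match` makes the key property visible: no arm turns elapsed time alone into a terminal state.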

Every Abandoned write — at drain, at the sweep, or from a reconciled request — goes through the sweep's own fenced lease, so the terminal has exactly one legitimate writer and a revenant owner that reappears is rejected by its stale lease token rather than healed back to running. Abandoned is a fourth terminal state peer to Completed | Failed | Cancelled; it rides await_output and reconcile like any terminal, and (per docs/adr/0017-process-observation-is-best-effort-push-over-state-truth.md) it does not ride the best-effort event sink. On a Restate deployment the same rules hold at the engine seam: ingress skips ExternallyOwned submission, the run handler completes a re-invoked started OwnerBound row as Abandoned{Sweep} instead of re-running it, and pending abandon requests are reconciled there too.

Detecting A Stuck Process

lash ships no stuck-process daemon and writes no "stuck" verdict. Staleness is a host-built read-side classification over raw facts (ADR 0019): the runtime exposes the lease and start facts, and the host decides what its own timeouts mean. The silent, not dead case above is exactly the ambiguity this read-side check surfaces.

The recipe: list the non-terminal rows, then read four raw fields off each ObservedProcess and classify.

// List the live rows (host-wide, or `list_granted_to`/`list_originated_by`
// for a session lens), then classify each from raw facts — no derived verdict.
let live = core
    .processes()
    .list(&ProcessListFilter {
        status: ProcessStatusFilter::Running,
        ..ProcessListFilter::default()
    })
    .await?;
for p in live {
    let lease_valid = p.lease_expires_at_ms.is_some_and(|ms| ms > now_ms);
    match (&p.disposition, p.first_started.is_some(), lease_valid) {
        // Lease still in the future: the owner is renewing. Not stuck.
        (_, _, true) => {}
        // Rerunnable / not-yet-started OwnerBound: the sweep re-runs or claims
        // it; a lapsed lease here is just work awaiting a peer.
        (RecoveryDisposition::Rerunnable, _, false) => {}
        (RecoveryDisposition::OwnerBound, false, false) => {}
        // Started OwnerBound, lease lapsed: silent, not dead. The sweep will
        // NOT terminalize it without death evidence. This is the stuck case
        // a host classifies from `lease_holder` + your own timeout budget.
        (RecoveryDisposition::OwnerBound, true, false) => {
            let _ = (&p.lease_holder, p.abandon_request.as_ref());
            // ...decide, per host policy, whether to authorize abandonment:
            // core.processes().request_abandon(&p.process_id, "operator", reason)
        }
        _ => {}
    }
}

the four facts

disposition (the declared contract), first_started (present once the row has begun executing), lease_holder (the current lease owner's identity), and lease_expires_at_ms (its expiry). A pending abandon_request is a fifth, visible while it awaits reconciliation. These are facts, not a status: the host applies its own timeout budget.

silent, not dead

An expired lease on a started OwnerBound row with no death evidence is the one truly ambiguous state: the work may have succeeded a millisecond before the owner went silent. The sweep refuses to guess, so the row stays non-terminal indefinitely unless a host acts. That is the deliberate cost of never fabricating an outcome.

the escape hatch

core.processes().request_abandon(process_id, requested_by, reason) writes a durable Abandon Request — the operator's recorded authorization to accept uncertainty (who, when, why). It authorizes exactly one thing: the sweep reconciling the row into Abandoned{ReconciledRequest}, and only once the owner's lease has lapsed. It never terminalizes anything itself, never touches the owner's OS resources, and never fences the owner off — a still-live owner keeps its lease until it expires. The marker is visible to observers while pending and is returned on the ObservedProcess the call yields.
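
The gating condition on an Abandon Request can be sketched as a single predicate. `may_reconcile_abandon` is a hypothetical helper, not a lash call, and it assumes (as the classification snippet earlier on this page does) that a missing expiry means no valid lease.

```rust
/// True when a pending Abandon Request could be reconciled: the request
/// exists and the owner's lease is no longer valid at `now_ms`.
pub fn may_reconcile_abandon(
    abandon_requested: bool,
    lease_expires_at_ms: Option<u64>,
    now_ms: u64,
) -> bool {
    // Assumption: an absent expiry is treated as "no live lease".
    let lease_valid = lease_expires_at_ms.is_some_and(|ms| ms > now_ms);
    abandon_requested && !lease_valid
}
```

A request against a still-renewing owner therefore stays pending in this model, which mirrors the rule that the request never fences a live owner.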

Process Retention

Terminal process rows accumulate. The registry never garbage-collects itself — only the host knows how long its consumers still need to await or reconcile a finished process — so retention is one more explicit lever.

core.processes().prune(cutoff_epoch_ms) physically deletes terminal rows — with their events, wakes, handle grants, and leases — older than the cutoff, returning a ProcessPruneReport of what it reclaimed. Non-terminal rows are never touched. Run it on the same maintenance cadence as the session store's vacuum / gc_unreachable reclamation (see persistence).

The agent service example wires this exactly: it captures a core.processes() handle before the core moves into app state and drives prune from a maintenance task on a fixed cadence with a retention window longer than any wait, in both the inline and Restate durability modes.
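
A maintenance task needs a cutoff to pass to `prune`. The sketch below shows one way to derive it from a retention window; `prune_cutoff_epoch_ms` is a hypothetical helper, and representing the cutoff as `u64` epoch milliseconds is an assumption about the parameter type.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Epoch-millisecond cutoff: terminal rows older than this are prunable.
/// Saturates at zero when the retention window is longer than the epoch offset.
pub fn prune_cutoff_epoch_ms(now: SystemTime, retention: Duration) -> u64 {
    let now_ms = now
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis())
        .unwrap_or(0); // a clock before the epoch prunes nothing
    let cutoff = now_ms.saturating_sub(retention.as_millis());
    u64::try_from(cutoff).unwrap_or(u64::MAX)
}
```

Keeping the retention window longer than the longest wait a consumer can hold means no caller is left awaiting a row that was already deleted.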

Failure Classification

Retry loops need to tell "try again" from "this will never work" from "unknown". lash carries typed signals for exactly that split, so a host never scrapes error strings.

loop {
    match session.turn(TurnInput::text(text)).run().await {
        Ok(output) => {
            // A failed LLM call finishes the turn instead of erroring; read
            // the typed provider signal off the turn's issues.
            for issue in &output.result.errors {
                if issue.retryable == Some(true) {
                    // Transient provider/transport failure — safe to re-run.
                }
                if let Some(kind) = issue.provider_failure_kind {
                    let _ = kind; // Timeout, Http, Quota, Auth, Stream, ...
                }
            }
            return Ok(output);
        }
        // SessionExecutionBusy / SessionExecutionLeaseLost: another owner
        // holds or fenced the lease, so the attempt committed nothing.
        // Back off before retrying; after LeaseLost, reopen the session first.
        Err(err) if err.is_retryable() => continue,
        // Wiring/config a retry can never repair (missing facet, provider
        // unconfigured). Surface it to an operator.
        Err(err) if err.is_terminal() => return Err(err),
        // Neither typed signal: unknown. Apply your own bounded policy.
        Err(err) => return Err(err),
    }
}

is_retryable

True only for a typed retryable signal: session_execution_busy (another owner holds the lease; nothing changed) and session_execution_lease_lost (the lease was fenced away mid-turn; the fenced commit means the attempt committed nothing and its claims were released). Both are safe to re-run after a backoff.

is_terminal

True only when a retry can never succeed without host changes: builder and wiring errors (missing facet, handler context) and provider-configuration errors. The same call fails identically until the host changes its wiring.

busy vs lease-lost

Busy means a live owner holds the lane; back off and retry the same session. LeaseLost means ownership moved; reopen the session before retrying. store_commit_failed is deliberately neither retryable nor terminal — the code cannot distinguish transient store I/O from a real conflict — so reload and apply your own bounded policy. The full conflict-retry shape is on persistence.

provider failures

A failed LLM call finishes the turn rather than erroring, so its typed retryability rides on TurnIssue.retryable and TurnIssue.provider_failure_kind (Transport, Timeout, Http, Stream, Auth, Validation, Quota, Unsupported, Unknown) — read them off the finished turn's issues, not off an EmbedError.
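
The three-way split maps naturally onto a bounded backoff policy. The sketch below is one possible host policy, not something lash ships: `Class`, `Step`, `backoff`, and `next_step` are hypothetical names, the 100 ms base and 5 s cap are arbitrary, and giving up immediately on the unknown class is only the most conservative choice.

```rust
use std::time::Duration;

/// Host-side view of the typed signals (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Class {
    Retryable,
    Terminal,
    Unknown,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Step {
    RetryAfter(Duration),
    GiveUp,
}

/// Exponential backoff: `base * 2^attempt`, clamped to `cap`.
pub fn backoff(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(cap, |d| d.min(cap))
}

/// Retry only typed-retryable failures, and only within the attempt budget.
pub fn next_step(class: Class, attempt: u32, max_attempts: u32) -> Step {
    match class {
        Class::Retryable if attempt < max_attempts => Step::RetryAfter(backoff(
            attempt,
            Duration::from_millis(100),
            Duration::from_secs(5),
        )),
        // Terminal, unknown, or budget exhausted: surface the error.
        _ => Step::GiveUp,
    }
}
```

A host would map `is_retryable()` to `Class::Retryable` and `is_terminal()` to `Class::Terminal`, and could give the unknown class its own smaller budget instead of giving up at once.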

Observation And Backpressure

Live observation is best-effort and bounded so a slow observer can never stall a turn or grow memory without limit. Know the numbers and the fallbacks.

live-replay bound

The default in-memory live-replay buffer keeps at most 2048 events or 120 seconds per session. It is not durable history and is not required to survive process loss. Deployments that need a different window pass a custom store to LashCoreBuilder::live_replay_store.
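
For a sense of what a count-and-age bound looks like, here is a minimal sketch of a buffer trimmed by both limits. `ReplayBuffer` is hypothetical and unrelated to the trait a custom `live_replay_store` would implement; it also assumes both bounds apply at once, which is one reading of "2048 events or 120 seconds".

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// In-memory buffer bounded by event count and by event age.
pub struct ReplayBuffer<T> {
    max_events: usize,
    max_age: Duration,
    events: VecDeque<(Instant, T)>,
}

impl<T> ReplayBuffer<T> {
    pub fn new(max_events: usize, max_age: Duration) -> Self {
        Self { max_events, max_age, events: VecDeque::new() }
    }

    /// Append an event stamped `now`, then drop whatever exceeds either bound.
    pub fn push(&mut self, now: Instant, event: T) {
        self.events.push_back((now, event));
        while self.events.len() > self.max_events {
            self.events.pop_front();
        }
        while self
            .events
            .front()
            .is_some_and(|(at, _)| now.duration_since(*at) > self.max_age)
        {
            self.events.pop_front();
        }
    }

    pub fn len(&self) -> usize {
        self.events.len()
    }
}
```

The default described above would correspond to `ReplayBuffer::new(2048, Duration::from_secs(120))` in this model.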

slow-observer gap

A stale, trimmed, or unavailable cursor yields a recoverable gap rather than a guess: the observer discards missed assumptions, replaces state from the fresh observation's read view, stores gap.latest_cursor, and keeps folding. Live-replay append failures are logged and never fail a turn or a commit. Details on streaming.
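
The gap-recovery steps above can be sketched as a fold. The `Observation` and `Observer` types here are hypothetical simplifications (state is just a list of lines); only the field name `latest_cursor` comes from the text.

```rust
/// Hypothetical observation shape; the real types carry more than this.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Observation {
    /// An in-order event and the cursor that follows it.
    Event { cursor: u64, line: String },
    /// The requested cursor was stale, trimmed, or unavailable.
    Gap { latest_cursor: u64, read_view: Vec<String> },
}

#[derive(Debug, Default)]
pub struct Observer {
    pub cursor: u64,
    pub lines: Vec<String>,
}

impl Observer {
    /// Fold one observation into local state.
    pub fn fold(&mut self, observation: Observation) {
        match observation {
            Observation::Event { cursor, line } => {
                self.lines.push(line);
                self.cursor = cursor;
            }
            Observation::Gap { latest_cursor, read_view } => {
                // Discard what was assumed so far, adopt the fresh read
                // view, and resume folding from the latest cursor.
                self.lines = read_view;
                self.cursor = latest_cursor;
            }
        }
    }
}
```

The point of the shape is that a gap is an ordinary input to the fold, not an error path: the observer never tries to reconstruct what it missed.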

turn-stream backpressure

stream_to awaits the sink's emit(). If a UI transport can block, push into a bounded channel and drain it from a separate task; the channel bound becomes your backpressure policy, keeping a slow client from holding the turn open.
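
One way to realise the bounded-channel pattern with only the standard library is shown below. `BoundedSink` is a hypothetical wrapper, not lash's sink trait, and dropping events when the channel is full is just one policy; a host could equally coalesce or disconnect the slow client.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Non-blocking front for a slow consumer: `emit` never waits.
pub struct BoundedSink<T> {
    tx: SyncSender<T>,
    dropped: usize,
}

impl<T> BoundedSink<T> {
    /// `bound` is the number of events buffered before drops begin.
    pub fn new(bound: usize) -> (Self, Receiver<T>) {
        let (tx, rx) = sync_channel(bound);
        (Self { tx, dropped: 0 }, rx)
    }

    /// Returns true if the event was queued, false if it was dropped
    /// because the channel was full or the receiver had gone away.
    pub fn emit(&mut self, event: T) -> bool {
        match self.tx.try_send(event) {
            Ok(()) => true,
            Err(_) => {
                self.dropped += 1;
                false
            }
        }
    }

    pub fn dropped(&self) -> usize {
        self.dropped
    }
}
```

The receiver half would be moved into a separate thread or task that writes to the UI transport, for example `std::thread::spawn(move || for event in rx { /* write to client */ })`, so a stalled client only fills the channel instead of holding the turn open.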

Monitoring

Per-turn timing, cumulative token usage, and traces come off the runtime directly — wire them into whatever metrics and tracing stack the host already runs.

// Per-turn timing, straight off the runtime clock.
let started_at = output.result.started_at(); // SystemTime the turn was claimed
let elapsed = output.result.duration(); // claim -> commit + post-persist hooks
let _ = (started_at, elapsed);

// Cumulative token usage for the session, split by source and by model.
let usage = session.usage_report();
let _ = (usage.entry_count, usage.usage);

timing

TurnResult::started_at is the wall-clock instant the runtime claimed the turn; TurnResult::duration is the whole-turn window, from claim through final commit and post-persist hooks, measured on the runtime clock's monotonic source. Export the duration as request latency and the start instant as its timestamp.

usage

session.usage_report() aggregates per-turn token deltas into totals split by source and by model, including child sessions. Emit it as cost and rate-limit telemetry.

traces

Attach a trace sink and flush it before exit. JSONL flush fsyncs the file; for OTel, span-export durability stays the host's duty — flush your own TracerProvider. See tracing.

admission

lash ships no admission control in the dispatch path: in-flight windows, priority lanes, breakers, and backpressure metrics are host policy, installed by wrapping the provider. The pattern and its disciplines are on providers.
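
As an illustration of the simplest of those policies, an in-flight window can be sketched with an atomic counter and a drop guard. `InFlightWindow` and `Permit` are hypothetical names; how the permit is threaded through a wrapped provider is left to the host and is not shown here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Caps the number of concurrently admitted calls.
pub struct InFlightWindow {
    limit: usize,
    current: AtomicUsize,
}

/// Held for the duration of one admitted call; releases its slot on drop.
pub struct Permit {
    window: Arc<InFlightWindow>,
}

impl InFlightWindow {
    pub fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self { limit, current: AtomicUsize::new(0) })
    }

    /// Take a slot if one is free; never blocks.
    pub fn try_admit(window: &Arc<Self>) -> Option<Permit> {
        let mut seen = window.current.load(Ordering::Acquire);
        loop {
            if seen >= window.limit {
                return None;
            }
            match window.current.compare_exchange_weak(
                seen,
                seen + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(Permit { window: Arc::clone(window) }),
                Err(actual) => seen = actual,
            }
        }
    }

    pub fn in_flight(&self) -> usize {
        self.current.load(Ordering::Acquire)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.window.current.fetch_sub(1, Ordering::AcqRel);
    }
}
```

A wrapping provider would call `InFlightWindow::try_admit` before forwarding a request and keep the `Permit` alive until the response completes; `in_flight()` doubles as a backpressure gauge.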

Where Next

This page composes the levers into policy. The scaling and durability pages own the fleet and replay mechanics those levers act on.