Architecture chapter 11

durability/workflows

Agent turns cross process boundaries. Tool calls may run for minutes, exec blocks may hand off to other workers, and the orchestrator on top might be a durable workflow runtime that expects deterministic replay. lash-sansio is the pure state machine that makes this seam explicit — every side effect is a yielded value, every result is a fed-back response, and a serialised checkpoint can be replayed by a different process to land in the same place.

The Sansio Seam

lash-sansio::TurnMachine is a sans-IO state machine. It does not call providers, run tools, or execute code on its own. Instead, it yields effects — LlmCall, ToolCalls, ExecCode, Checkpoint — and waits for the host to feed results back. The whole machine is Serialize / Deserialize; the host can snapshot it at any point, store the bytes anywhere, and resume on a different worker.
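
The yield-and-feed-back shape described above can be sketched with a toy machine. Everything here (ToyMachine, ToyEffect, ToyResponse, ToyState, drive) is a hypothetical stand-in written for illustration and is not the lash-sansio API; the sketch assumes a turn that makes one model call and then finishes.

```rust
// Hypothetical toy types for illustration only; not the lash-sansio API.
#[derive(Debug, Clone, PartialEq)]
enum ToyEffect {
    LlmCall { id: u64, prompt: String },
    Done { answer: String },
}

#[derive(Debug, Clone, PartialEq)]
enum ToyResponse {
    LlmComplete { id: u64, text: String },
}

#[derive(Debug, Clone, PartialEq)]
enum ToyState {
    Start,
    WaitingLlm { id: u64 },
    Finishing { answer: String },
    Finished,
}

#[derive(Debug, Clone, PartialEq)]
struct ToyMachine {
    state: ToyState,
    next_effect_id: u64,
    prompt: String,
}

impl ToyMachine {
    fn new(prompt: &str) -> Self {
        ToyMachine { state: ToyState::Start, next_effect_id: 1, prompt: prompt.to_string() }
    }

    fn is_done(&self) -> bool {
        self.state == ToyState::Finished
    }

    /// Yields the next effect, if any. The machine performs no I/O itself.
    fn poll_effect(&mut self) -> Option<ToyEffect> {
        match std::mem::replace(&mut self.state, ToyState::Finished) {
            ToyState::Start => {
                let id = self.next_effect_id;
                self.next_effect_id += 1;
                self.state = ToyState::WaitingLlm { id };
                Some(ToyEffect::LlmCall { id, prompt: self.prompt.clone() })
            }
            ToyState::Finishing { answer } => Some(ToyEffect::Done { answer }),
            other => {
                // Parked or finished: put the state back and yield nothing.
                self.state = other;
                None
            }
        }
    }

    /// Accepts a response only if it matches the outstanding effect id.
    fn handle_response(&mut self, response: ToyResponse) {
        let ToyResponse::LlmComplete { id, text } = response;
        if self.state == (ToyState::WaitingLlm { id }) {
            self.state = ToyState::Finishing { answer: text };
        }
    }
}

/// Host loop: all side effects happen here, outside the machine.
fn drive(machine: &mut ToyMachine, mut llm: impl FnMut(&str) -> String) -> Option<String> {
    let mut answer = None;
    while !machine.is_done() {
        match machine.poll_effect() {
            Some(ToyEffect::LlmCall { id, prompt }) => {
                let text = llm(&prompt);
                machine.handle_response(ToyResponse::LlmComplete { id, text });
            }
            Some(ToyEffect::Done { answer: a }) => answer = Some(a),
            None => break,
        }
    }
    answer
}
```

Because the toy machine is plain data, cloning it at any point captures everything needed to continue; the real machine gets the same property from Serialize / Deserialize.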

That shape is exactly what durable workflow runtimes (Temporal, Restate, similar) want from a library they're embedding. The workflow's own ledger records that effect E was started, persists the result, and replays the workflow code with E's output already in hand. A lash turn driven from inside such a workflow gets the same property: the turn's state-machine progress survives crashes, redeployments, and worker migrations because the workflow runtime owns the persistence and lash owns nothing the workflow can't replay.

The standard LashRuntime wraps the sansio machine with its own async I/O loop — it calls providers, schedules tools, runs RLM exec. Embedders who only need in-process durability through session persistence never see sansio. Embedders integrating with an external orchestrator drive sansio directly.

Effects And Responses

The Effect enum is the full surface of side effects a turn can request. The host fulfils each effect and replies with a matching Response indexed by EffectId.

Effect | Payload | Response
SyncExecutionSurface | update_machine_config flag | ExecutionSurfaceSynced with refreshed prompt + tool specs
LlmCall | request: Arc<LlmRequest> | LlmComplete with Result<LlmResponse, LlmCallError>
ToolCalls | calls: Vec<PendingToolCall> | ToolResults with Vec<CompletedToolCall>
ExecCode | code: String | ExecResult with Result<ExecResponse, String>
Checkpoint | checkpoint: CheckpointKind | Checkpoint with plugin-injected messages
Sleep | duration: Duration | Timeout
CancelLlm | id of the call to cancel | none; fire-and-forget
Log / Emit | structured event | none; fire-and-forget
Progress | committed messages + events at this iteration | none; durable persistence signal
Done | final protocol messages + events | none; turn terminal. The high-level runtime still materializes prose AssistantMessage output during its shared final commit

Progress is the host's hook for durable persistence: every time the machine commits new messages or mode-step events, it emits a Progress effect before continuing. Workflow runtimes treat each Progress as an activity boundary; in-process embedders fold it into their own persistence loop. Prose finalization is an outcome, not a low-level SessionEvent::Message { kind: "final" } side channel.
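
One way a host might keep track of which effects block the machine is a small lookup that mirrors the table above. This is a sketch only: EffectKind and its methods are hypothetical names introduced here, and the response names are carried as plain strings rather than real lash types.

```rust
// Hypothetical mirror of the effect table; not a type from lash-sansio.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EffectKind {
    SyncExecutionSurface,
    LlmCall,
    ToolCalls,
    ExecCode,
    Checkpoint,
    Sleep,
    CancelLlm,
    Log,
    Emit,
    Progress,
    Done,
}

impl EffectKind {
    /// Name of the response the host must feed back, or None when the
    /// effect is fire-and-forget (or terminal, in the case of Done).
    fn expected_response(self) -> Option<&'static str> {
        match self {
            EffectKind::SyncExecutionSurface => Some("ExecutionSurfaceSynced"),
            EffectKind::LlmCall => Some("LlmComplete"),
            EffectKind::ToolCalls => Some("ToolResults"),
            EffectKind::ExecCode => Some("ExecResult"),
            EffectKind::Checkpoint => Some("Checkpoint"),
            EffectKind::Sleep => Some("Timeout"),
            EffectKind::CancelLlm
            | EffectKind::Log
            | EffectKind::Emit
            | EffectKind::Progress
            | EffectKind::Done => None,
        }
    }

    /// True when the machine stays parked until the host answers.
    fn needs_response(self) -> bool {
        self.expected_response().is_some()
    }
}
```

A host loop that silently drops an effect for which needs_response is true would leave the machine parked, so a check like this is useful as a guard in catch-all match arms.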

Waiting States

When the machine yields an effect that needs a response, it parks in a MachineState variant that retains exactly the data needed to re-emit the same effect identically after restore. Five variants matter for durability:

WaitingLlm

effect_id, the original Arc<LlmRequest>, and optional mode driver_state. Restore re-issues the same request — same model, same messages, same tool spec — under the same EffectId.

WaitingTools

effect_id plus the full Vec<PendingToolCall> batch the model emitted. On restore, the host sees the same list with the same call ids, so it can skip tools that already completed in a partial batch instead of rerunning them (see Replay Safety below).

WaitingExec

effect_id, the original code block, and the mode's driver_state. RLM, for example, persists its trajectory cursor and projected bindings here so the same exec re-runs in the same context.

WaitingCheckpoint

effect_id, the CheckpointKind being run, and an on_empty resume action — whether to prepare another mode iteration or finish the turn if the checkpoint plugin contributes no messages.

WaitingExecutionSurface

effect_id plus update_machine_config flag. The least state-bearing of the waiting variants — restore simply re-asks the host for the live tool surface.

Non-waiting

PreparingMode, PrepareIteration, and Finished hold no outstanding effect. Restoring into these is a no-op for re-emission; the machine just resumes from the persisted message and event log.
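
The idea that a waiting variant retains exactly what is needed to re-emit its effect can be sketched as a pure function from state to effect. ToyState, ToyEffect, and outstanding_effect are hypothetical names for illustration; the real MachineState variants hold richer data (for example Arc<LlmRequest> rather than a String), and only two waiting variants are modelled here.

```rust
// Hypothetical toy types; the real MachineState and Effect are richer.
#[derive(Debug, Clone, PartialEq)]
enum ToyEffect {
    LlmCall { id: u64, request: String },
    ExecCode { id: u64, code: String },
}

#[derive(Debug, Clone, PartialEq)]
enum ToyState {
    WaitingLlm { effect_id: u64, request: String },
    WaitingExec { effect_id: u64, code: String, driver_state: String },
    PrepareIteration,
    Finished,
}

impl ToyState {
    /// Rebuilds the outstanding effect purely from retained data, so the
    /// effect after restore is identical to the one yielded before the crash.
    fn outstanding_effect(&self) -> Option<ToyEffect> {
        match self {
            ToyState::WaitingLlm { effect_id, request } => Some(ToyEffect::LlmCall {
                id: *effect_id,
                request: request.clone(),
            }),
            ToyState::WaitingExec { effect_id, code, .. } => Some(ToyEffect::ExecCode {
                id: *effect_id,
                code: code.clone(),
            }),
            // Non-waiting states hold no outstanding effect.
            ToyState::PrepareIteration | ToyState::Finished => None,
        }
    }
}
```

Since the effect is derived from the state alone, a state that round-trips through storage unchanged necessarily re-emits the same id and payload.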

EffectId Stability

EffectId(u64) is a monotonic counter held in the machine. The next id is allocated on each yielded effect; the counter is part of TurnCheckpoint and is restored on resume. The first effect of a fresh turn is always EffectId(1); the n-th effect is always EffectId(n) regardless of how many times the workflow crashed and replayed in between.

Use the id as your workflow's idempotency key. A Temporal activity keyed by (session_id, turn_id, effect_id) never runs twice for the same logical effect: the first attempt persists its result, replay reads from history. The same property holds for Restate's invocation keys and any other workflow runtime that supports keyed deduplication.

EffectId is local to one turn. Across turns, the counter resets to 1; pair it with the enclosing turn id to construct globally unique keys.
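
A minimal sketch of both properties follows: a counter that continues from its persisted value, and a ledger that runs an activity at most once per key. EffectCounter, ActivityLedger, and ActivityKey are hypothetical names, the local EffectId is a stand-in for the real type, and the in-memory HashMap stands in for a workflow runtime's history. The u64 turn id is also an assumption.

```rust
use std::collections::HashMap;

// Local stand-in for the real EffectId(u64).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct EffectId(u64);

// Hypothetical counter; in lash the counter lives inside the machine.
struct EffectCounter {
    next: u64,
}

impl EffectCounter {
    fn fresh() -> Self {
        EffectCounter { next: 1 }
    }

    /// Resume from the next_effect_id stored in a checkpoint.
    fn restored(next_effect_id: u64) -> Self {
        EffectCounter { next: next_effect_id }
    }

    fn allocate(&mut self) -> EffectId {
        let id = EffectId(self.next);
        self.next += 1;
        id
    }

    fn next_effect_id(&self) -> u64 {
        self.next
    }
}

/// (session_id, turn_id, effect_id): unique across turns and sessions.
type ActivityKey = (String, u64, EffectId);

// Toy stand-in for a workflow runtime's activity history.
struct ActivityLedger {
    history: HashMap<ActivityKey, String>,
    executions: usize,
}

impl ActivityLedger {
    fn new() -> Self {
        ActivityLedger { history: HashMap::new(), executions: 0 }
    }

    /// Runs `work` only if no result is recorded for `key`; otherwise
    /// returns the recorded result without executing anything.
    fn run_activity(&mut self, key: ActivityKey, work: impl FnOnce() -> String) -> String {
        if let Some(recorded) = self.history.get(&key) {
            return recorded.clone();
        }
        self.executions += 1;
        let result = work();
        self.history.insert(key, result.clone());
        result
    }
}
```

In this sketch a replayed effect with the same key never reaches `work` a second time, which is the behaviour a keyed workflow activity is expected to provide.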

Checkpoint And Restore

Two methods bracket every durable boundary:

impl<M: ModeProtocol> TurnMachine<M> {
    pub fn checkpoint(&self) -> TurnCheckpoint<M>;

    pub fn restore_from_checkpoint(
        config: TurnMachineConfig<M>,
        checkpoint: TurnCheckpoint<M>,
    ) -> Self;
}

TurnCheckpoint<M> is fully serde-compatible. Fields:

  • state: MachineState<M> — the waiting variant, with the data each retains.
  • pending_effects: Vec<CheckpointEffect<M>> — fire-and-forget effects queued but not yet drained (Emit, Log, Progress).
  • next_effect_id: u64 — counter, preserves id determinism.
  • messages: Vec<Message> — prompt history committed so far.
  • events: Vec<SessionEventRecord<M::Event>> — semantic event log committed so far.
  • mode_iteration, mode_run_offset — turn position.
  • cumulative_usage: TokenUsage — running token totals for this turn.
  • termination: TurnTerminationPolicyState — max-turns / forced-final state.
  • synced_mode_iteration: Option<usize> — last iteration the execution surface was synced.

The TurnMachineConfig passed to restore_from_checkpoint is not serialised — it carries the protocol driver, projector, tool registry, system prompt builder, and other host-owned plugins. The host reconstructs it the same way it built it for new(), then hands it back alongside the deserialised checkpoint. This is intentional: the config is what makes the abstract machine concrete, and it lives at the same layer as the workflow runtime that's driving the machine.
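
The split between serialised checkpoint and host-rebuilt config can be illustrated with a toy. All names below (ToyConfig, ToyCheckpoint, ToyMachine, build_config, commit) are hypothetical, and the line-oriented encode / decode pair is only a stand-in for serde that assumes messages contain no newlines.

```rust
// Hypothetical toy types; the real TurnMachineConfig and TurnCheckpoint
// carry far more than this.
struct ToyConfig {
    // Host-owned behaviour. A closure cannot be serialised, so the host
    // rebuilds the config on whichever worker resumes the turn.
    system_prompt: Box<dyn Fn() -> String>,
}

#[derive(Debug, Clone, PartialEq)]
struct ToyCheckpoint {
    next_effect_id: u64,
    messages: Vec<String>,
}

impl ToyCheckpoint {
    /// First line is the counter, each following line is one message.
    fn encode(&self) -> String {
        let mut out = self.next_effect_id.to_string();
        for message in &self.messages {
            out.push('\n');
            out.push_str(message);
        }
        out
    }

    fn decode(text: &str) -> Option<Self> {
        let mut lines = text.split('\n');
        let next_effect_id = lines.next()?.parse::<u64>().ok()?;
        let messages = lines.map(str::to_string).collect();
        Some(ToyCheckpoint { next_effect_id, messages })
    }
}

struct ToyMachine {
    config: ToyConfig,
    next_effect_id: u64,
    messages: Vec<String>,
}

impl ToyMachine {
    fn new(config: ToyConfig) -> Self {
        ToyMachine { config, next_effect_id: 1, messages: Vec::new() }
    }

    /// Allocates an effect id and commits one message; returns the id.
    fn commit(&mut self, message: &str) -> u64 {
        let id = self.next_effect_id;
        self.next_effect_id += 1;
        self.messages.push(message.to_string());
        id
    }

    fn checkpoint(&self) -> ToyCheckpoint {
        ToyCheckpoint { next_effect_id: self.next_effect_id, messages: self.messages.clone() }
    }

    fn restore_from_checkpoint(config: ToyConfig, checkpoint: ToyCheckpoint) -> Self {
        ToyMachine {
            config,
            next_effect_id: checkpoint.next_effect_id,
            messages: checkpoint.messages,
        }
    }

    fn prompt(&self) -> String {
        format!("{}\n{}", (self.config.system_prompt)(), self.messages.join("\n"))
    }
}

/// The host builds the config the same way for new() and for restore.
fn build_config() -> ToyConfig {
    ToyConfig { system_prompt: Box::new(|| "You are a helpful agent.".to_string()) }
}
```

The design point the toy tries to show is that only plain data crosses the storage boundary, while behaviour is reattached by calling the same builder on the resuming worker.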

Replay Safety

Restore reissues the outstanding effect, not the whole effect history. The event log persisted in TurnCheckpoint.events is reloaded as-is — past ToolCallStarted, ToolCallCompleted, AssistantProseDelta records do not re-emit. The same is true of past mode-step events. From the workflow's perspective, only the unresolved effect needs activity coverage; everything before it is fact.

  • Mid-LLM crash → restore yields a new LlmCall with the same EffectId and identical request. The workflow's activity for that id is the same lookup and produces the same response.
  • Post-LLM crash, pre-checkpoint → restore yields the pending Checkpoint effect, not another LlmCall. The model is not re-invoked.
  • Mid-tool-batch crash → restore yields a ToolCalls effect carrying every original call. Tools that completed in the partial batch are surfaced to the host via the response shape (each CompletedToolCall includes its outcome). Hosts can short-circuit completed calls by id and only re-invoke the pending ones.
  • RLM exec crash → restore yields an ExecCode effect with the original code block and the driver's trajectory state. The exec result contains the parallel tool outputs in their structured form (success / failure / cancellation); no replayed ToolCallStarted events leak from the previous attempt.
  • Plugin checkpoint crash → restore yields the Checkpoint effect again. The on_empty resume action is preserved, so a checkpoint that contributed no messages still finishes the turn the same way.
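
The mid-tool-batch case above can be sketched as follows, assuming the host keeps its own record of finished calls keyed by call id (for example in workflow history). PendingCall, CompletedCall, and redrive_batch are hypothetical simplifications; the fields of the real PendingToolCall and CompletedToolCall are not shown in this chapter.

```rust
use std::collections::HashMap;

// Hypothetical simplified shapes of a pending and a completed tool call.
#[derive(Debug, Clone, PartialEq)]
struct PendingCall {
    call_id: String,
    tool: String,
    args: String,
}

#[derive(Debug, Clone, PartialEq)]
struct CompletedCall {
    call_id: String,
    output: String,
}

/// Re-drives a restored batch. Calls with a recorded result are answered
/// from `recorded`; only the remaining calls reach `invoke`. Results come
/// back in the original batch order.
fn redrive_batch(
    calls: &[PendingCall],
    recorded: &mut HashMap<String, CompletedCall>,
    mut invoke: impl FnMut(&PendingCall) -> String,
) -> Vec<CompletedCall> {
    calls
        .iter()
        .map(|call| {
            if let Some(done) = recorded.get(&call.call_id) {
                return done.clone();
            }
            let completed = CompletedCall {
                call_id: call.call_id.clone(),
                output: invoke(call),
            };
            // Record the result so a later replay skips this call too.
            recorded.insert(call.call_id.clone(), completed.clone());
            completed
        })
        .collect()
}
```

Keying the record by call id works here because the restored batch carries the same call ids as the original one.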

Driving The Machine From A Workflow

The basic loop a workflow runtime wraps around the sansio machine:

// `workflow.run_activity(...)` is whatever the runtime exposes for
// durable side effects — keyed by effect_id for idempotency.

let mut machine = TurnMachine::new(config, messages, events, mode_run_offset);

while !machine.is_done() {
    while let Some(effect) = machine.poll_effect() {
        match effect {
            Effect::LlmCall { id, request } => {
                let response = workflow
                    .run_activity(("llm", turn_id, id), || invoke_llm(request))
                    .await;
                machine.handle_response(Response::LlmComplete {
                    id,
                    result: response,
                    text_streamed: false,
                });
            }
            Effect::ToolCalls { id, calls } => {
                let results = workflow
                    .run_activity(("tools", turn_id, id), || run_tool_batch(calls))
                    .await;
                machine.handle_response(Response::ToolResults { id, results });
            }
            Effect::Progress { messages, events, .. } => {
                workflow.persist(turn_id, messages, events).await;
            }
            Effect::Done { messages, events, .. } => {
                workflow.finalize(turn_id, messages, events).await;
            }
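            // … the remaining variants follow the same pattern. Effects that
            // expect a response (ExecCode, Checkpoint, Sleep,
            // SyncExecutionSurface) must be answered here or the machine
            // stays parked; fire-and-forget effects (Log, Emit, CancelLlm)
            // need no reply.
            _ => {}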
        }
    }
    // Snapshot the machine at every iteration so a worker crash
    // resumes against the latest waiting state, not the turn start.
    workflow.persist_checkpoint(turn_id, machine.checkpoint()).await;
}

The shape matches Temporal's workflow + activity split and Restate's handler + side-effect split with no glue layer: each effect becomes one durable activity, the machine snapshot becomes one workflow-state write, and the EffectId is the key that joins them.

What's Covered, What Isn't

The durability seam covers the in-turn effects the machine yields. Some workloads sit outside it:

Inside the seam

LLM calls, native tool batches, RLM exec, plugin checkpoints, execution-surface sync, retry sleeps. Each gets an EffectId, each round-trips through the host, each survives crashes deterministically.

Outside the seam

Fire-and-forget Emit and Log effects, plugin background tasks spawned through ToolContext::tasks(), anything a tool implementation does that's not its execute() return value. Workflows that need these durable should fold them into their own activities.

Sessions opened through the regular LashCore / LashSession path also get durability through the session store, but at a coarser grain: a turn either commits to the store atomically or fails as a whole. Use sansio directly when you need crash recovery inside a turn — when a single LLM call is expensive enough to replay, or when a tool batch can run for minutes and you can't afford to lose its partial progress.

Test Coverage

The durability properties above are pinned by integration tests in two places:

Test | Location | Asserts
checkpoint_before_llm_completion_reissues_same_logical_llm_call | crates/lash-sansio/src/sansio/tests.rs:502 | Restore from a checkpoint taken mid-LLM yields a fresh LlmCall effect with the same EffectId and identical request payload.
checkpoint_after_llm_result_replays_checkpoint_without_second_llm | :521 | Restore after the LLM completed but before the checkpoint plugin ran yields the pending Checkpoint effect, with no duplicate model call.
checkpoint_preserves_parallel_tool_batch_before_any_result | :706 | Round-tripping a WaitingTools state preserves every PendingToolCall with its id and arguments intact.
checkpoint_after_mixed_tool_batch_results_replays_model_feedback_once | :757 | Mixed success / failure / cancellation tool outcomes round-trip without re-emitting per-tool events to the stream.
checkpoint_round_trips_waiting_exec_driver_state | :871 | RLM exec waits round-trip with their driver state intact; the same code block and trajectory cursor resume after restore.
standard_checkpoint_redrives_parallel_tool_batch_without_losing_calls | crates/lash-mode-rlm/tests/protocol_drivers.rs:254 | End-to-end through a protocol driver: standard-mode parallel tool batches resume identically after restore.
standard_checkpoint_after_tool_control_finish_preserves_terminal_outcome | :311 | A tool that authored a terminal value via tool-control finishes the same way on replay.
rlm_checkpoint_redrives_pending_exec_code_with_driver_state | :716 | RLM pending ExecCode re-runs with the same driver state, projected bindings intact.
rlm_checkpoint_after_exec_parallel_tool_outputs_preserves_structured_outcomes | :786 | RLM exec results carrying parallel tool outcomes preserve success / failure / cancellation structure across restore.
rlm_exec_result_stores_tool_call_ids_without_replayed_tool_events | :882 | RLM exec outcomes carry tool call ids without leaking replayed ToolCallStarted events to the activity log.