The other in-memory fields on LashRuntime are either ephemeral per-turn accumulators (shared_token_ledger, drained into state at commit) or per-worker coordination (managed_sessions and managed_turns, which track what this instance is currently running). None of them is durable truth.
Durable truth lives behind a single trait, RuntimePersistence (crates/lash-core/src/store/mod.rs), whose surface is exactly a durable-object store:
- load_session, load_node: rehydrate working state.
- try_claim_session_execution_lease, renew_session_execution_lease, release_session_execution_lease: the durable single-writer execution lane for a session. Foreground turns, pending user input dispatch, wake delivery, and mutating session commands must hold this lease before touching session work.
- commit_runtime_state: fenced session-head commit. The live session execution lease is verified in the same transaction as the head revision CAS; CAS is the backstop, not the normal concurrency mechanism.
- enqueue_queued_work, claim_ready_queued_work, renew_/abandon_/cancel_*: a durable work queue with claim_id, owner identity plus incarnation, claim_token, and claim_fencing_token. Claims are also fenced by the current session execution lease.
- save_/load_session_meta, tombstone_nodes, vacuum, gc_unreachable.
Any backend that satisfies this trait (and the ProcessRegistry / LashlangArtifactStore traits) is a valid store. The backend-agnostic conformance suite (lash_core::testing::conformance) is what guarantees a new backend behaves identically.
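To illustrate what "backend-agnostic conformance" means in practice, here is a minimal sketch: one check written against a trait, then run against any implementation. ToyPersistence, MemoryBackend, and check_single_writer are hypothetical names invented for this example; they are not the RuntimePersistence API or the real lash_core::testing::conformance suite, and the real lease semantics are richer than this toy.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily reduced stand-in for a persistence seam.
/// Method names and signatures are illustrative only.
pub trait ToyPersistence {
    /// Returns a fencing token if the session lease was free.
    fn try_claim(&mut self, session: &str, owner: &str) -> Option<u64>;
    /// Releases the lease if `owner` currently holds it.
    fn release(&mut self, session: &str, owner: &str);
}

/// In-memory backend used only to exercise the check below.
#[derive(Default)]
pub struct MemoryBackend {
    leases: HashMap<String, (String, u64)>, // session -> (owner, fencing token)
    next_token: u64,
}

impl ToyPersistence for MemoryBackend {
    fn try_claim(&mut self, session: &str, owner: &str) -> Option<u64> {
        if self.leases.contains_key(session) {
            return None; // lease is held; this toy has no expiry or reentry
        }
        self.next_token += 1;
        self.leases
            .insert(session.to_string(), (owner.to_string(), self.next_token));
        Some(self.next_token)
    }

    fn release(&mut self, session: &str, owner: &str) {
        let held_by_owner = self
            .leases
            .get(session)
            .map(|(holder, _)| holder.as_str() == owner)
            .unwrap_or(false);
        if held_by_owner {
            self.leases.remove(session);
        }
    }
}

/// Written once against the trait, then run against every backend.
pub fn check_single_writer<S: ToyPersistence>(store: &mut S) -> Result<(), String> {
    let first = store
        .try_claim("s1", "worker-a")
        .ok_or("first claim must succeed")?;
    if store.try_claim("s1", "worker-b").is_some() {
        return Err("second owner claimed a held lease".into());
    }
    store.release("s1", "worker-a");
    let second = store
        .try_claim("s1", "worker-b")
        .ok_or("claim after release must succeed")?;
    if second <= first {
        return Err("fencing token did not advance".into());
    }
    Ok(())
}
```

A second backend would implement the same trait and be passed to the same check function unchanged, which is the property the conformance suite relies on.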
Horizontal topology
flowchart TB
Clients["clients · external signals"] --> LB["load balancer
route by session_id"]
LB --> WA["worker A"]
LB --> WB["worker B"]
LB --> WN["worker N
(identical, stateless)"]
WA --> P["shared session store
session execution leases · head CAS · durable work queue"]
WB --> P
WN --> P
WA -.-> H["networked TriggerStore
occurrences · subscriptions · deliveries"]
WB -.-> H
WN -.-> H
R["Restate
durable execution + dispatch"] -.-> WA
R -.-> WB
R -.-> WN
The store box is shared by the whole fleet, so it must be a networked backend every worker can reach. lash-postgres-store supplies the distributed RuntimePersistence, ProcessRegistry, TriggerStore, and LashlangArtifactStore; lash-s3-store supplies the shared attachment byte store. The local-file SQLite store remains right for embedded and single-host deployments.
Concurrent writers are prevented before effects: the session execution lease lets exactly one runtime owner execute mutating work for a session at a time. commit_runtime_state still performs head CAS in the same transaction as lease-fence verification, so stale writers fail safely if a lease is lost or a backend violates the single-lane contract.
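The interaction between the lease fence and head CAS can be sketched with a toy in-memory store. ToyStore, CommitError, and every method below are assumptions made for illustration, not the lash-core types; a single `&mut self` call stands in for "the same transaction" in a real backend.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum CommitError {
    StaleFence,   // the caller's lease is no longer the live one
    HeadConflict, // the head revision moved since the caller loaded it
}

#[derive(Default)]
struct SessionRow {
    live_fence: u64, // fencing token of the current lease holder (0 = none)
    head_rev: u64,
    head: String,
}

/// Toy store; names and shapes are hypothetical.
#[derive(Default)]
pub struct ToyStore {
    sessions: HashMap<String, SessionRow>,
}

impl ToyStore {
    /// Grants a new lease and returns its fencing token. For brevity this toy
    /// always supersedes the previous holder, as if its lease had expired.
    pub fn claim(&mut self, session: &str) -> u64 {
        let row = self.sessions.entry(session.to_string()).or_default();
        row.live_fence += 1;
        row.live_fence
    }

    pub fn head_rev(&self, session: &str) -> u64 {
        self.sessions.get(session).map(|r| r.head_rev).unwrap_or(0)
    }

    pub fn head(&self, session: &str) -> Option<&str> {
        self.sessions.get(session).map(|r| r.head.as_str())
    }

    /// Fence verification first, then the head CAS, in one critical section.
    pub fn commit(
        &mut self,
        session: &str,
        fence: u64,
        expected_rev: u64,
        new_head: &str,
    ) -> Result<u64, CommitError> {
        let row = self
            .sessions
            .get_mut(session)
            .ok_or(CommitError::StaleFence)?;
        if row.live_fence != fence {
            return Err(CommitError::StaleFence);
        }
        if row.head_rev != expected_rev {
            return Err(CommitError::HeadConflict);
        }
        row.head_rev += 1;
        row.head = new_head.to_string();
        Ok(row.head_rev)
    }
}
```

In this sketch a writer whose lease was superseded fails on the fence check before the CAS is even consulted, while a live lease holder with an outdated revision still fails on the CAS. That mirrors the description of CAS as a backstop rather than the primary mechanism.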
By default every opened runtime gets a fresh owner id and incarnation id. A durable workflow engine that already serializes retries for one invocation can set SessionBuilder::session_execution_owner(LeaseOwnerIdentity::opaque(...)) with a stable owner plus incarnation when it intentionally wants reentry. A new incarnation must not reenter by owner id alone; local-process liveness permits fenced fast reclaim only when the previous holder is definitely dead. Choosing owner identity and lease timings for failover latency is host policy; see running in production.
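The reentry rule can be modeled as a small decision function. This is a sketch under stated assumptions: Owner, LeaseSlot, and ClaimDecision are hypothetical types, and the `holder_definitely_dead` flag stands in for whatever local-process liveness check the host supplies. It is not the LeaseOwnerIdentity API.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Owner {
    pub owner_id: String,
    pub incarnation: String,
}

#[derive(Debug, PartialEq)]
pub enum ClaimDecision {
    Granted { fence: u64 },   // lease was free
    Reentered { fence: u64 }, // same owner id AND same incarnation
    Reclaimed { fence: u64 }, // previous holder proven dead; fence advanced
    Denied,
}

#[derive(Default)]
pub struct LeaseSlot {
    holder: Option<Owner>,
    fence: u64,
}

impl LeaseSlot {
    pub fn claim(&mut self, who: &Owner, holder_definitely_dead: bool) -> ClaimDecision {
        // Compare the full identity: owner id alone is never enough to reenter.
        let same_identity = self.holder.as_ref().map(|h| h == who);
        match same_identity {
            None => {
                self.fence += 1;
                self.holder = Some(who.clone());
                ClaimDecision::Granted { fence: self.fence }
            }
            Some(true) => ClaimDecision::Reentered { fence: self.fence },
            Some(false) if holder_definitely_dead => {
                // Fast reclaim bumps the fence so the dead holder's token is stale.
                self.fence += 1;
                self.holder = Some(who.clone());
                ClaimDecision::Reclaimed { fence: self.fence }
            }
            Some(false) => ClaimDecision::Denied,
        }
    }
}
```

Note that a new incarnation of the same owner id is denied while the previous holder might still be alive, and only gets in through the fenced reclaim path once it is known to be dead.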
Per-session request lifecycle
flowchart LR
Req["request for
session_id"] --> Lease["claim session
execution lease"]
Lease --> Load["load_session
rehydrate working copy"]
Load --> Turn["run turn
sans-io compute + replayed effects"]
Turn --> Commit{"commit_runtime_state
lease fence + CAS"}
Commit -->|wins| Done["committed
release lease"]
Commit -->|stale fence / loses CAS| Stop["runtime error
reload before retry"]
You scale across sessions (shard/route by session_id), not within one (see Inherent Limits).
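As one possible reading of "shard/route by session_id", the sketch below maps a session id to a worker with a stable hash. The function names and the choice of FNV-1a with modulo placement are assumptions for illustration; the document does not prescribe a routing algorithm.

```rust
/// FNV-1a (64-bit), used here only because it is tiny and gives the same
/// result in every process.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Picks a worker for a session. Returns None when the fleet is empty.
pub fn route<'a>(session_id: &str, workers: &'a [&'a str]) -> Option<&'a str> {
    if workers.is_empty() {
        return None;
    }
    let idx = (fnv1a_64(session_id.as_bytes()) % workers.len() as u64) as usize;
    Some(workers[idx])
}
```

Modulo placement reshuffles sessions when the fleet size changes. Because the session execution lease, not the router, enforces single-writer behavior, a changed route in this model costs a reload rather than correctness.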
| Primitive | Role | Where |
| --- | --- | --- |
| lash-sansio pure TurnMachine | compute is deterministic (state, input) → (state, effects) | crates/lash-sansio |
| RuntimePersistence | the single durable store seam; backend-swappable | store/mod.rs, conformance suite |
| session execution lease | single durable execution lane for every mutating session path | store/mod.rs, session_execution_leases |
| commit_runtime_state fence + CAS on session_head | atomic session-head write and stale-writer backstop | store/mod.rs |
| queue with claim + fencing tokens | exactly-once work pickup by any lease-holding session runner | queued_work_batches |
| effect replay keys + lash-restate | re-run a turn on another worker, replay effects | crates/lash-restate |
| lash-postgres-store + lash-s3-store | networked runtime/process/trigger/artifact state plus content-addressed attachment bytes | crates/lash-postgres-store, crates/lash-s3-store |
| rehydratable state + Residency | no durable-only memory; reload anywhere | runtime/session_ops.rs |
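The "(state, input) → (state, effects)" shape in the table can be illustrated with a toy turn machine. TurnState, Input, Effect, and step are hypothetical and far simpler than the real TurnMachine in crates/lash-sansio; the point is only that the function performs no I/O and describes effects as data.

```rust
#[derive(Clone, Debug, Default, PartialEq)]
pub struct TurnState {
    pub transcript: Vec<String>,
    pub pending: u32, // model calls issued but not yet answered
}

#[derive(Clone, Debug, PartialEq)]
pub enum Input {
    UserMessage(String),
    ModelReply(String),
}

/// Effects are descriptions of I/O for a host to perform, not the I/O itself.
#[derive(Clone, Debug, PartialEq)]
pub enum Effect {
    CallModel { prompt: String },
    Finish,
}

/// Pure transition: no clock, no network, no randomness.
pub fn step(mut state: TurnState, input: Input) -> (TurnState, Vec<Effect>) {
    match input {
        Input::UserMessage(text) => {
            state.transcript.push(format!("user: {}", text));
            state.pending += 1;
            (state, vec![Effect::CallModel { prompt: text }])
        }
        Input::ModelReply(text) => {
            state.transcript.push(format!("model: {}", text));
            state.pending = state.pending.saturating_sub(1);
            let effects = if state.pending == 0 {
                vec![Effect::Finish]
            } else {
                Vec::new()
            };
            (state, effects)
        }
    }
}
```

Because the same state and input always yield the same output, any worker that reloads the state and feeds the recorded inputs reaches the same result, which is what makes re-running a turn elsewhere plausible.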
| Choice | What it needs |
| --- | --- |
| Networked store | Use PostgresStorage for sessions, pending turn inputs, queued work, process registry, triggers, and Lashlang artifacts; use S3AttachmentStore for attachment bytes. The backend-agnostic conformance suites run against Postgres. |
| Turn durability across workers | Deploy turns under Restate with RestateRuntimeEffectController; a crash reruns the handler and replays effect outcomes before the final session commit. |
| Fleet work dispatch | Restate can push handlers for turn and process workflows, while shared queue and process rows retain claim/fencing semantics for recoverable work. The distributed worker E2E exercises two workers behind an h2c proxy. |
| Session-agnostic ingress | External signals must route to whichever worker holds the session, so ingress must not be bound to a single session. See Triggers below. |
| Affinity vs pure reload | A routing policy choice: session_id affinity (warm working-copy cache) vs reload-per-request (simpler, more store reads). The session execution lease supports either; Residency tunes reload cost. |
| Operations envelope | Choose Postgres and object-store sizing, Restate retention/ingress topology, trace collection, idempotency-key policy, and replay observability for the host product. |
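The "crash reruns the handler and replays effect outcomes" behavior can be sketched with a keyed journal. EffectJournal, run_turn, and the replay key strings are invented for this example; they do not describe RestateRuntimeEffectController or the Restate SDK, and a real journal would be durable rather than an in-memory map.

```rust
use std::collections::HashMap;

/// Toy journal: replay key -> recorded outcome.
#[derive(Default)]
pub struct EffectJournal {
    outcomes: HashMap<String, String>,
}

impl EffectJournal {
    /// Executes `effect` only if nothing is recorded under `replay_key`;
    /// otherwise returns the recorded outcome without re-executing.
    pub fn run<F: FnOnce() -> String>(&mut self, replay_key: &str, effect: F) -> String {
        if let Some(previous) = self.outcomes.get(replay_key) {
            return previous.clone();
        }
        let outcome = effect();
        self.outcomes.insert(replay_key.to_string(), outcome.clone());
        outcome
    }
}

/// A handler with two effects. `calls` counts real executions so the replay
/// is observable. Returning None models a crash before the final commit.
pub fn run_turn(
    journal: &mut EffectJournal,
    calls: &mut u32,
    crash_after_first: bool,
) -> Option<String> {
    let reply = journal.run("turn-1/effect-0", || {
        *calls += 1;
        "model reply".to_string()
    });
    if crash_after_first {
        return None;
    }
    let tool = journal.run("turn-1/effect-1", || {
        *calls += 1;
        "tool output".to_string()
    });
    Some(format!("{} + {}", reply, tool))
}
```

Running the handler once with a simulated crash and then again to completion executes each effect exactly once in total: the second run replays the first effect from the journal and only performs the second.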
Single-writer-per-session
A session is a serial conversation; the session execution lease enforces one active mutating runner. You scale across many sessions, not within one. That suits multi-tenant workloads, but a single session's throughput is bounded by the serial nature of its conversation.
The store is the ceiling
Whatever networked backend is chosen must sustain lease claims, renewals, fenced commits, and queue claims at fleet scale. The store choice is the scaling story.
Long LLM turns
A long turn either holds one worker for its whole duration or must migrate between workers. A durable effect host is the answer, at the cost of replay on recovery. Restate is the first-party adapter, not the only possible workflow runtime.
Not a blocker: tokio
Statelessness is about where durable truth lives, not the async runtime. lash-core staying on tokio is orthogonal: each worker is a tokio service; the durable truth is already 100% external to the process.
Ingress: emit → match → idempotent delivery → wake
flowchart LR
Sig["external signal
button · mail · cron · webhook"] --> Emit["router.emit
record occurrence"]
Emit --> Match["reserve_matching_deliveries
match by source_type + source_key"]
Match --> Deliver["idempotent effect delivery
deterministic occurrence / delivery id"]
Deliver --> Proc["start trigger process
with stored registrant + env_ref"]
Proc --> Wake["wake → queued work
EarliestSafeBoundary"]
- A second store seam. Subscriptions, occurrences, and delivery reservations live behind a TriggerStore trait, parallel to RuntimePersistence. The first-party shared-worker implementation is PostgresTriggerStore; another fleet can use a different backend with the same matching and reservation semantics.
- The event carries no session; the subscription carries process edges. An emitted event is matched by source_type and source_key against registered subscriptions. Each subscription stores the registrant, captured execution environment, and optional wake target for the process it will start (a wake target also earns that session a handle grant at delivery). A single event fans out to 0..N subscribers, including subscriptions with no session edges. Session identity enters only through those optional edges, never through the event.
- Delivery is an idempotent effect. Each matched delivery runs through the effect controller with a deterministic id (occurrence_id from source_type + source_key + idempotency key; delivery process id from occurrence_id + subscription_id). Delivery is exactly-once and replay-safe: any worker can run it, and replay under a durable effect host cannot double-start a trigger process.
- Only a wake target is session-coupled. When a delivered process wakes an agent and has a live wake target, that wake lands in the target session's queued work at EarliestSafeBoundary: the sole point where session identity and turn-ordering apply to trigger delivery.
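The deterministic-id idea from the list above can be sketched as follows. The hash function, id formats, and ProcessTable are assumptions made for this example; the document states which inputs feed each id but not how they are combined, so the FNV-1a scheme here is a stand-in.

```rust
use std::collections::HashSet;

/// Stable hash over ordered parts (FNV-1a with a separator byte after each
/// part, so ("ab", "c") and ("a", "bc") hash differently).
fn stable_hash(parts: &[&str]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for part in parts {
        for b in part.as_bytes() {
            hash ^= *b as u64;
            hash = hash.wrapping_mul(0x100000001b3);
        }
        hash ^= 0xff;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Same inputs always produce the same occurrence id.
pub fn occurrence_id(source_type: &str, source_key: &str, idempotency_key: &str) -> String {
    format!(
        "occ-{:016x}",
        stable_hash(&[source_type, source_key, idempotency_key])
    )
}

/// One process id per (occurrence, subscription) pair.
pub fn delivery_process_id(occurrence_id: &str, subscription_id: &str) -> String {
    format!("proc-{:016x}", stable_hash(&[occurrence_id, subscription_id]))
}

/// Toy stand-in for the table of started trigger processes.
#[derive(Default)]
pub struct ProcessTable {
    started: HashSet<String>,
}

impl ProcessTable {
    /// Returns true only the first time a given process id is started.
    pub fn start_once(&mut self, process_id: &str) -> bool {
        self.started.insert(process_id.to_string())
    }
}
```

Because a redelivered signal with the same idempotency key derives the same process id, a replayed delivery in this model finds the id already present and does not start a second process.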
Net for the fleet: route external signals into a runtime-level pipeline of emit → match → idempotent effect delivery → wake. Trigger delivery reuses the existing claim and replay machinery; wake processing then enters the target session through the session execution lease. The Postgres-backed implementation is exercised by the distributed Restate/Postgres/MinIO E2E, but the contract is the TriggerStore, session-store, and effect-host boundary.