providers - lash architecture

Provider Boundary

The runtime hands every turn's LlmRequest to whichever ProviderHandle the session was opened with and expects back a normalized LlmResponse stream. Anything wire-specific stays inside the provider crate:

Schema normalization (Anthropic message blocks → unified content shape, OpenAI tool-call ids → canonical record).
Reasoning-detail replay (OpenAI Responses reasoning items, Anthropic thinking blocks).
Cache-marker placement (Anthropic cache_control, OpenAI prompt_cache_key).
Token-field interpretation (cached-read vs cache-write deltas, reasoning tokens, audio tokens).
Auth flow (API key, OAuth PKCE, device code).
Model policy (variant aliases, structured-output capability detection, thinking exposure).

Modes (standard, rlm) and canonical tool definitions never see provider-specific JSON. They operate on LlmRequest, LlmResponse, and ToolDefinition only.

First-Party Providers

Three provider crates ship with the workspace, contributing five provider kinds: lash-provider-openai supplies both the direct Responses path and the generic Chat Completions path. The CLI compiles in all five and materializes each from its provider factory keyed on the spec kind. App hosts pick the provider factories they need.

Provider	Kind	Transport / Auth
OpenAI API (direct)	`openai`	Bearer API-key auth against `https://api.openai.com/v1/responses`. Responses-only path; handles Responses reasoning replay and prompt-cache fields.
OpenAI-compatible	`openai-compatible`	Bearer API-key auth against a caller-supplied `base_url`; posts Chat Completions to `{base_url}/chat/completions`. Used for OpenRouter, Together, vLLM, etc.
Codex subscription	`codex`	OAuth device-code flow against the ChatGPT Codex Responses backend with Codex-specific headers.
Anthropic	`anthropic`	API-key auth against `/v1/messages` with Anthropic version header and beta flags.
Google Gemini / Code Assist	`google_oauth`	Google OAuth PKCE / manual-code flow against Code Assist `generateContent` / `streamGenerateContent`.

Provider specs remain host config data. Runtime sessions persist only provider ids and bind them to the single live ProviderHandle supplied by the host when the session opens.

OpenAI Provider Split

OpenAI ships as two distinct provider kinds because the Responses API and the Chat Completions API are different enough to deserve different code paths.

`openai`

Direct OpenAI Responses. Posts to https://api.openai.com/v1/responses, keeps Responses reasoning replay, and maps shared ProviderOptions.cache_retention to a prompt_cache_key derived from the Lash session id. Long retention adds prompt_cache_retention where the API supports it. This kind does not accept a base_url.

`openai-compatible`

Generic Chat Completions. Requires base_url. Converts LlmRequest to a messages array, emits Chat Completions tools, maps structured output to response_format, and preserves OpenRouter reasoning effort through the reasoning.effort request field. Used for OpenRouter, vLLM, Together, Groq, etc.
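
The kind-keyed factory lookup and the base_url rules for the two OpenAI kinds can be pictured roughly as follows. This is a minimal sketch under assumptions: SpecSketch, FactoryRegistry, the two factory functions, and the String stand-in for a built provider are hypothetical names for illustration, not the lash API. Only the kind strings, the Responses URL, and the `{base_url}/chat/completions` shape come from the text above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the host config data describing one provider.
#[derive(Debug, Clone)]
pub struct SpecSketch {
    pub kind: String,
    pub base_url: Option<String>,
}

/// Hypothetical factory signature: spec in, description of the target out.
type Factory = fn(&SpecSketch) -> Result<String, String>;

/// Registry keyed on the spec kind; a host registers only the kinds it needs.
#[derive(Default)]
pub struct FactoryRegistry {
    factories: HashMap<&'static str, Factory>,
}

impl FactoryRegistry {
    pub fn register(&mut self, kind: &'static str, factory: Factory) {
        self.factories.insert(kind, factory);
    }

    /// Looks up the factory for `spec.kind` and runs it.
    pub fn materialize(&self, spec: &SpecSketch) -> Result<String, String> {
        let factory = self
            .factories
            .get(spec.kind.as_str())
            .ok_or_else(|| format!("no factory registered for kind `{}`", spec.kind))?;
        factory(spec)
    }
}

/// Direct Responses path: fixed endpoint, base_url rejected.
pub fn openai(spec: &SpecSketch) -> Result<String, String> {
    if spec.base_url.is_some() {
        return Err("`openai` does not accept base_url".to_string());
    }
    Ok("https://api.openai.com/v1/responses".to_string())
}

/// Generic Chat Completions path: base_url required.
pub fn openai_compatible(spec: &SpecSketch) -> Result<String, String> {
    let base = spec
        .base_url
        .as_deref()
        .ok_or("`openai-compatible` requires base_url")?;
    // Trimming the trailing slash is an assumption of this sketch.
    Ok(format!("{}/chat/completions", base.trim_end_matches('/')))
}
```

A host that only talks to an OpenAI-compatible gateway would register just that one kind; a spec with any other kind then fails at materialization instead of at request time.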

Claude Cache Markers

For Anthropic and OpenRouter Claude on Chat Completions, shared ProviderOptions.cache_retention controls Anthropic-style cache_control markers in the request:

`none`

No markers emitted. Each request is treated as fresh.

`short`

Emits {"type":"ephemeral"} at the canonical breakpoints. Default 5-minute cache lifetime per Anthropic semantics.

`long`

Adds "ttl":"1h" to the ephemeral marker for longer-lived caching.

Breakpoints are placed at:

The first system/developer text message.
The last tool definition in the request.
Any explicit LlmContentBlock::Text.cache_breakpoint the runtime asks for.

When no explicit breakpoint is set, the provider falls back to the last user/assistant text content so prompt caching still works for sessions without explicit cache instrumentation.
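
The retention levels and the fallback rule above can be sketched as two small functions. This is an illustration, not the provider's code: CacheRetention, Role, TextBlock, and both function names are local stand-ins, the system/developer roles are collapsed into one variant, and the tool-definition breakpoint is left out to keep the sketch short.

```rust
/// Local stand-in for the three retention levels described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CacheRetention {
    None,
    Short,
    Long,
}

/// Returns the cache_control JSON for one breakpoint, or None when
/// markers are disabled.
pub fn cache_control_marker(retention: CacheRetention) -> Option<&'static str> {
    match retention {
        CacheRetention::None => None,
        CacheRetention::Short => Some(r#"{"type":"ephemeral"}"#),
        CacheRetention::Long => Some(r#"{"type":"ephemeral","ttl":"1h"}"#),
    }
}

/// Simplified roles; `System` stands for system/developer messages.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    System,
    User,
    Assistant,
}

/// Hypothetical text block carrying only what breakpoint placement needs.
#[derive(Debug, Clone, Copy)]
pub struct TextBlock {
    pub role: Role,
    pub cache_breakpoint: bool,
}

/// Picks the indices of text blocks that receive a marker: the first system
/// block, every explicit breakpoint, and, only when no explicit breakpoint
/// exists, the last user/assistant block as a fallback.
pub fn breakpoint_indices(blocks: &[TextBlock]) -> Vec<usize> {
    let mut out = Vec::new();
    if let Some(i) = blocks.iter().position(|b| b.role == Role::System) {
        out.push(i);
    }
    let explicit: Vec<usize> = blocks
        .iter()
        .enumerate()
        .filter(|(_, b)| b.cache_breakpoint)
        .map(|(i, _)| i)
        .collect();
    if explicit.is_empty() {
        if let Some(i) = blocks.iter().rposition(|b| b.role != Role::System) {
            out.push(i);
        }
    } else {
        out.extend(explicit);
    }
    out.sort_unstable();
    out.dedup();
    out
}
```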

RLM Prompt Caching

RLM projects chronological history as append-only chat-shaped messages: user inputs remain user messages, prior RLM steps become assistant messages, tool observations become user messages, and the mutable current-iteration/finalization prompt stays as the final user message. The rolling cache_breakpoint is placed on the last stable history text block, so OpenRouter Claude caches a real prefix instead of a rewritten history blob each turn.

The result: long RLM sessions get the same cache-hit rate as native multi-turn chats, even though Lashlang-driven reasoning regenerates a fresh prompt on every iteration.
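
To make the append-only projection concrete, here is a minimal sketch. HistoryEvent, ChatMessage, and project are hypothetical names, and plain role strings stand in for the real request types; the role mapping and the breakpoint position follow the description above.

```rust
/// Hypothetical history events; the real RLM history types are not shown here.
#[derive(Debug, Clone)]
pub enum HistoryEvent {
    UserInput(String),
    RlmStep(String),
    ToolObservation(String),
}

/// Simplified chat-shaped message with an optional rolling breakpoint.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChatMessage {
    pub role: &'static str,
    pub text: String,
    pub cache_breakpoint: bool,
}

/// Projects stable history plus the mutable current prompt into messages.
/// History maps one-to-one onto messages, so a later turn only appends.
pub fn project(history: &[HistoryEvent], current_prompt: &str) -> Vec<ChatMessage> {
    let mut messages: Vec<ChatMessage> = history
        .iter()
        .map(|event| {
            let (role, text) = match event {
                HistoryEvent::UserInput(t) => ("user", t),
                HistoryEvent::RlmStep(t) => ("assistant", t),
                HistoryEvent::ToolObservation(t) => ("user", t),
            };
            ChatMessage { role, text: text.clone(), cache_breakpoint: false }
        })
        .collect();
    // The rolling breakpoint sits on the last stable history block.
    if let Some(last_stable) = messages.last_mut() {
        last_stable.cache_breakpoint = true;
    }
    // The mutable iteration/finalization prompt is always the final user
    // message and never carries the breakpoint.
    messages.push(ChatMessage {
        role: "user",
        text: current_prompt.to_string(),
        cache_breakpoint: false,
    });
    messages
}
```

Because each history event maps to exactly one message, the roles and text of earlier messages are identical from one turn to the next; only the breakpoint flag moves forward and the final prompt changes.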

Reliability Policy

ProviderOptions.reliability configures request timeouts, retry behavior, and local rate limiting for a ProviderHandle. The facade reexports the knobs under lash::provider.

timeouts

ProviderReliability.request_timeout and chunk_timeout map to LlmTimeouts. By default, requests have a 300-second overall timeout and streams have a 120-second chunk timeout. Set RequestTimeout::Disabled only when an outer workflow or transport layer owns cancellation.

retries

ProviderRetryPolicy defaults to four attempts with exponential delay capped at 10 seconds and Retry-After capped at 60 seconds. ProviderReliability::disabled() turns retries off for hosts that already retry at a higher layer.

throttles

A retryable quota failure carrying Retry-After (an HTTP 429) is a throttle, not a failure: the ladder waits as instructed without consuming a retry attempt. The deference is bounded by the cumulative throttle_wait_budget_ms (default 90 seconds per call; 0 disables it, and each deferred wait charges at least one second), after which throttles count as ordinary retryable failures. A 429 without Retry-After follows the normal backoff-and-count ladder.
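
The interaction between the attempt counter and the throttle budget can be sketched as a small decision function. Everything named here is hypothetical (LadderPolicy, LadderState, Decision, on_429, backoff_ms). The defaults mirror the numbers quoted above, while the 500 ms backoff base, the exact budget comparison, and the choice to back off rather than honor Retry-After once the budget is spent are assumptions of this sketch.

```rust
/// Hypothetical mirror of the retry knobs; defaults follow the prose above.
#[derive(Debug, Clone, Copy)]
pub struct LadderPolicy {
    pub max_attempts: u32,
    pub backoff_cap_ms: u64,
    pub retry_after_cap_ms: u64,
    pub throttle_wait_budget_ms: u64,
}

impl Default for LadderPolicy {
    fn default() -> Self {
        Self {
            max_attempts: 4,
            backoff_cap_ms: 10_000,
            retry_after_cap_ms: 60_000,
            throttle_wait_budget_ms: 90_000,
        }
    }
}

/// Per-call bookkeeping for the ladder.
#[derive(Debug, Default)]
pub struct LadderState {
    pub failed_attempts: u32,
    pub throttle_charged_ms: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    /// Throttle: wait as instructed without consuming an attempt.
    Defer { wait_ms: u64 },
    /// Ordinary retryable failure: an attempt is consumed, then back off.
    Retry { wait_ms: u64 },
    /// Attempts exhausted.
    GiveUp,
}

/// Exponential backoff capped by the policy. The base is an assumed value.
pub fn backoff_ms(policy: &LadderPolicy, failed_attempts: u32) -> u64 {
    const ASSUMED_BASE_MS: u64 = 500;
    let exponent = failed_attempts.saturating_sub(1).min(20);
    (ASSUMED_BASE_MS << exponent).min(policy.backoff_cap_ms)
}

/// Decides what to do after an HTTP 429.
pub fn on_429(
    policy: &LadderPolicy,
    state: &mut LadderState,
    retry_after_ms: Option<u64>,
) -> Decision {
    if let Some(requested) = retry_after_ms {
        let wait_ms = requested.min(policy.retry_after_cap_ms);
        // Each deferred wait charges at least one second against the budget.
        let charge = wait_ms.max(1_000);
        let within_budget = policy.throttle_wait_budget_ms > 0
            && state.throttle_charged_ms + charge <= policy.throttle_wait_budget_ms;
        if within_budget {
            state.throttle_charged_ms += charge;
            return Decision::Defer { wait_ms };
        }
    }
    // No Retry-After, or the budget is spent or disabled: count the failure.
    state.failed_attempts += 1;
    if state.failed_attempts >= policy.max_attempts {
        Decision::GiveUp
    } else {
        Decision::Retry { wait_ms: backoff_ms(policy, state.failed_attempts) }
    }
}
```

With the defaults, a first 429 asking for two minutes is deferred for the capped 60 seconds at no attempt cost; a second 60-second request would exceed the 90-second budget, so it is counted like any other retryable failure.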

rate limits

ProviderRateLimitPolicy can cap max concurrency, requests per window, and tokens per window before a provider request is admitted.

use lash::provider::{ProviderReliability, RequestTimeout};

let reliability = ProviderReliability::default()
    .request_timeout(Some(RequestTimeout::Millis(120_000)))
    .stream_chunk_timeout_ms(Some(30_000))
    .max_attempts(3)
    .max_concurrency(Some(8))
    .requests_per_window(Some(120), Some(60_000));
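
For intuition about what a requests-per-window cap does before a request is admitted, here is a minimal fixed-window sketch. It is not lash's limiter: WindowLimiter and try_admit are hypothetical, the fixed-window strategy is an assumption (the text above does not say which windowing scheme ProviderRateLimitPolicy uses), and the clock is passed in by the caller so the behavior is deterministic.

```rust
/// Hypothetical fixed-window request limiter driven by a caller-supplied clock.
#[derive(Debug)]
pub struct WindowLimiter {
    limit: u32,
    window_ms: u64,
    window_start_ms: u64,
    used: u32,
}

impl WindowLimiter {
    pub fn new(limit: u32, window_ms: u64) -> Self {
        Self { limit, window_ms, window_start_ms: 0, used: 0 }
    }

    /// Returns true when the request is admitted at `now_ms`.
    pub fn try_admit(&mut self, now_ms: u64) -> bool {
        // Start a fresh window once the current one has fully elapsed.
        if now_ms.saturating_sub(self.window_start_ms) >= self.window_ms {
            self.window_start_ms = now_ms;
            self.used = 0;
        }
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false
        }
    }
}
```

With `WindowLimiter::new(120, 60_000)`, the 121st request inside the same minute is refused and the next window admits again; a real limiter would typically queue or sleep rather than refuse.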

Wrapping Providers

lash ships no admission-control machinery in the dispatch path (docs/adr/0015-admission-control-lives-in-provider-decorators.md). The Provider trait is deliberately dyn-compatible and decorator-friendly: hosts own admission — in-flight windows, priority lanes, circuit breakers, backpressure metrics — by wrapping the provider they install. A decorator's complete() sees every provider call, including each attempt of the retry ladder.

use std::sync::Arc;

/// Host-owned admission: bounded in-flight windows per traffic class,
/// wrapped around the provider the host installs. Breakers, AIMD windows,
/// and backpressure metrics slot into the same `complete()` seam.
#[derive(Debug)]
struct AdmissionGate {
    inner: Box<dyn Provider>,
    interactive_slots: Arc<tokio::sync::Semaphore>,
    batch_slots: Arc<tokio::sync::Semaphore>,
}

impl Clone for AdmissionGate {
    fn clone(&self) -> Self {
        Self {
            inner: self.inner.clone_boxed(),
            // Shared handles: every clone admits through the same windows.
            interactive_slots: Arc::clone(&self.interactive_slots),
            batch_slots: Arc::clone(&self.batch_slots),
        }
    }
}

#[async_trait::async_trait]
impl Provider for AdmissionGate {
    async fn complete(&mut self, request: LlmRequest) -> Result<LlmResponse, LlmTransportError> {
        // Class traffic by the session identity the host already owns:
        // deliberate ids for its own pipelines, and a default lane for ids
        // it did not mint (lash-spawned child sessions).
        let lane = if request.scope.session_id.starts_with("batch:") {
            &self.batch_slots
        } else {
            &self.interactive_slots
        };
        // The permit drops on every exit path — success, failure, or a
        // cancelled turn — so an aborted call never leaks a slot.
        let _slot = lane.acquire().await.expect("admission gate closed");
        self.inner.complete(request).await
    }

    // Forward `close()` explicitly: the default impl is a no-op and would
    // silently skip the inner provider's transport shutdown.
    async fn close(&self) -> Result<(), LlmTransportError> {
        self.inner.close().await
    }

    fn kind(&self) -> &'static str {
        self.inner.kind()
    }
    fn options(&self) -> ProviderOptions {
        self.inner.options()
    }
    fn set_options(&mut self, options: ProviderOptions) {
        self.inner.set_options(options);
    }
    fn serialize_config(&self) -> serde_json::Value {
        self.inner.serialize_config()
    }
    fn requires_streaming(&self) -> bool {
        self.inner.requires_streaming()
    }
    fn clone_boxed(&self) -> Box<dyn Provider> {
        Box::new(self.clone())
    }
}

Install the wrapper with ProviderComponents::map_provider wherever the handle is constructed:

let interactive_slots = Arc::new(tokio::sync::Semaphore::new(8));
let batch_slots = Arc::new(tokio::sync::Semaphore::new(2));
let handle = ProviderHandle::new(components.map_provider(|inner| {
    Box::new(AdmissionGate {
        inner,
        interactive_slots,
        batch_slots,
    })
}));

forward close()

The trait's default close() is a no-op. A wrapper that omits the forward silently skips the inner provider's transport shutdown (Codex sends WebSocket Close frames on its cached sessions). Forward it explicitly.

re-wrap on rebuild

serialize_config forwards to the inner provider, so a ProviderSpec round-trip rebuilds the bare provider — the wrapper does not survive persistence. Apply map_provider at every construction site (factory resolution, resume paths), not once at startup.

cancellation-safe admission

The runtime aborts the in-flight provider call when a turn is cancelled, dropping the complete() future at an await point. Anything acquired before the inner call must release on drop — semaphore permits do this naturally; a hand-rolled counter needs a drop guard.
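
For a hand-rolled counter, the drop guard mentioned above might look like this. It is a minimal sketch using only the standard library; InFlightGuard and try_enter are hypothetical names, and a real gate would wait for a slot rather than return None.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Holds one in-flight slot for as long as the guard is alive.
pub struct InFlightGuard {
    counter: Arc<AtomicUsize>,
}

impl InFlightGuard {
    /// Claims a slot if fewer than `limit` calls are in flight.
    pub fn try_enter(counter: &Arc<AtomicUsize>, limit: usize) -> Option<Self> {
        let mut current = counter.load(Ordering::Acquire);
        loop {
            if current >= limit {
                return None;
            }
            // Increment only if nobody raced us; otherwise retry with the
            // freshly observed value.
            match counter.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(Self { counter: Arc::clone(counter) }),
                Err(observed) => current = observed,
            }
        }
    }
}

impl Drop for InFlightGuard {
    /// Runs on every exit path, including a future dropped at an await point.
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Binding the guard to a local (`let _guard = ...`) before awaiting the inner call ties the slot to the future's lifetime, so a cancelled turn releases it without any explicit cleanup code.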

class by session id

scope.session_id is contractually present on every provider request. Mint deliberate ids for pipelines you want to class (direct completions accept a host-supplied session id and otherwise scope as direct:{uuid}), and give ids you did not mint — lash-spawned child sessions — a default lane rather than rejecting them.

The retry ladder wraps the decorator, so core stays honest about throttling on its own (see the throttles entry above): an admission gate cannot compensate for a ladder that would turn a 429 storm into attempt exhaustion. The admission use case pairs with the metrics guidance on operations.

Usage And Cost Inputs

Every provider returns a normalized LlmUsage to the runtime usage ledger after each completion. Chat-parsing handles both streaming and non-streaming usage chunks, including OpenRouter cache fields. Cache-write tokens are tracked separately from cached reads, and reasoning tokens are treated as an output subset rather than an additive total, so downstream cost/export code can price each bucket without double-counting.

pub struct LlmUsage {
    pub input_tokens: i64,              // Uncached ordinary input
    pub cache_read_input_tokens: i64,   // Prompt input read from cache
    pub cache_write_input_tokens: i64,  // Prompt input written to cache
    pub output_tokens: i64,             // Total generated output
    pub reasoning_output_tokens: i64,   // Subset of output tokens
    // …provider-specific extras flow through the extended trace
}
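
As an illustration of pricing each bucket without double-counting, consider the sketch below. UsageSketch copies the field names of LlmUsage so the block stands alone; PriceSketch, the nano-USD unit, and the helper functions are assumptions of this example, and any prices passed in are the caller's own numbers rather than real vendor rates.

```rust
/// Local mirror of the LlmUsage fields shown above.
#[derive(Debug, Clone, Copy)]
pub struct UsageSketch {
    pub input_tokens: i64,
    pub cache_read_input_tokens: i64,
    pub cache_write_input_tokens: i64,
    pub output_tokens: i64,
    pub reasoning_output_tokens: i64,
}

/// Hypothetical per-token prices in nano-USD (1 USD = 1_000_000_000).
#[derive(Debug, Clone, Copy)]
pub struct PriceSketch {
    pub input: i64,
    pub cache_read: i64,
    pub cache_write: i64,
    pub output: i64,
}

/// Prices the three disjoint input buckets plus total output.
/// Reasoning tokens are already inside `output_tokens`, so they are
/// deliberately not added a second time.
pub fn cost_nano_usd(usage: &UsageSketch, price: &PriceSketch) -> i64 {
    usage.input_tokens * price.input
        + usage.cache_read_input_tokens * price.cache_read
        + usage.cache_write_input_tokens * price.cache_write
        + usage.output_tokens * price.output
}

/// Total prompt size: the three input buckets are disjoint.
pub fn total_prompt_tokens(usage: &UsageSketch) -> i64 {
    usage.input_tokens + usage.cache_read_input_tokens + usage.cache_write_input_tokens
}

/// Output tokens that were not reasoning.
pub fn non_reasoning_output_tokens(usage: &UsageSketch) -> i64 {
    usage.output_tokens - usage.reasoning_output_tokens
}
```

Integer nano-USD avoids float rounding in ledgers; an exporter that treated reasoning tokens as additive would overstate output by exactly reasoning_output_tokens.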

Adding A Provider

New providers implement five focused components and a factory:

State

Per-handle config: API key, base URL, default model, options like thinking exposure.

Auth

Bearer header, OAuth refresh, device-code flow, whatever the vendor requires. Auth state is opaque to the runtime. OAuth backends (Codex, Google) share token storage and refresh primitives from lash-provider-auth; vendor-specific endpoints, device-code, and PKCE helpers live in each provider's OAuth module: oauth.rs for Google, codex/oauth.rs for Codex.

Readiness

Optional pre-flight check (token refresh, capability probe) that runs once per session.

Transport

The actual HTTP call. Translates LlmRequest to the wire format, streams the response, normalizes back to LlmResponse chunks.

Model policy

Maps user-facing model + variant names to model-native ids, declares structured-output / tool-call / thinking capabilities per model.

Factory

Materializes host config into provider components; ProviderHandle::new(components) assembles the five pieces into the live handle the runtime can use.

See lash-provider-openai as the most general template: it handles both the direct Responses path and the generic Chat Completions path in one crate.
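
The five-components-plus-factory shape can be pictured with a deliberately simplified sketch. None of these traits or signatures are the real lash ones: StateSketch, the four traits, ComponentsSketch, example_factory, and the `model:variant` naming rule are all hypothetical and exist only to show how the pieces divide the work.

```rust
/// State: per-handle config.
pub struct StateSketch {
    pub api_key: String,
    pub base_url: String,
    pub default_model: String,
}

/// Auth: produces whatever headers the vendor requires.
pub trait AuthSketch {
    fn headers(&self, state: &StateSketch) -> Vec<(String, String)>;
}

/// Readiness: optional pre-flight check.
pub trait ReadinessSketch {
    fn check(&self, state: &StateSketch) -> Result<(), String>;
}

/// Transport: owns the wire format; only the endpoint is modelled here.
pub trait TransportSketch {
    fn endpoint(&self, state: &StateSketch) -> String;
}

/// Model policy: maps user-facing names to model-native ids.
pub trait ModelPolicySketch {
    fn native_id(&self, model: &str, variant: Option<&str>) -> String;
}

/// The five pieces a factory hands over for assembly into a handle.
pub struct ComponentsSketch {
    pub state: StateSketch,
    pub auth: Box<dyn AuthSketch>,
    pub readiness: Box<dyn ReadinessSketch>,
    pub transport: Box<dyn TransportSketch>,
    pub model_policy: Box<dyn ModelPolicySketch>,
}

struct BearerAuth;
impl AuthSketch for BearerAuth {
    fn headers(&self, state: &StateSketch) -> Vec<(String, String)> {
        vec![("Authorization".to_string(), format!("Bearer {}", state.api_key))]
    }
}

struct NonEmptyKey;
impl ReadinessSketch for NonEmptyKey {
    fn check(&self, state: &StateSketch) -> Result<(), String> {
        if state.api_key.is_empty() {
            Err("missing API key".to_string())
        } else {
            Ok(())
        }
    }
}

struct ChatCompletions;
impl TransportSketch for ChatCompletions {
    fn endpoint(&self, state: &StateSketch) -> String {
        format!("{}/chat/completions", state.base_url.trim_end_matches('/'))
    }
}

struct SuffixVariants;
impl ModelPolicySketch for SuffixVariants {
    // Joining model and variant with ':' is an invented rule for this sketch.
    fn native_id(&self, model: &str, variant: Option<&str>) -> String {
        match variant {
            Some(v) => format!("{model}:{v}"),
            None => model.to_string(),
        }
    }
}

/// Factory: materializes host config into the five components.
pub fn example_factory(api_key: &str, base_url: &str, default_model: &str) -> ComponentsSketch {
    ComponentsSketch {
        state: StateSketch {
            api_key: api_key.to_string(),
            base_url: base_url.to_string(),
            default_model: default_model.to_string(),
        },
        auth: Box::new(BearerAuth),
        readiness: Box::new(NonEmptyKey),
        transport: Box::new(ChatCompletions),
        model_policy: Box::new(SuffixVariants),
    }
}
```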
