- Schema normalization (Anthropic message blocks → unified content shape, OpenAI tool-call ids → canonical record).
- Reasoning-detail replay (OpenAI Responses reasoning items, Anthropic thinking blocks).
- Cache-marker placement (Anthropic cache_control, OpenAI prompt_cache_key).
- Token-field interpretation (cached-read vs cache-write deltas, reasoning tokens, audio tokens).
- Auth flow (API key, OAuth PKCE, device code).
- Model policy (variant aliases, structured-output capability detection, thinking exposure).
Modes (standard, rlm) and canonical tool definitions never see provider-specific JSON. They operate on LlmRequest, LlmResponse, and ToolDefinition only.
| Provider | Kind | Transport / Auth |
| --- | --- | --- |
| OpenAI API (direct) | openai | Bearer API-key auth against https://api.openai.com/v1/responses. Responses-only path; handles Responses reasoning replay and prompt-cache fields. |
| OpenAI-compatible | openai-compatible | Bearer API-key auth against a caller-supplied base_url; posts Chat Completions to {base_url}/chat/completions. Used for OpenRouter, Together, vLLM, etc. |
| Codex subscription | codex | OAuth device-code flow against the ChatGPT Codex Responses backend, with Codex-specific headers. |
| Anthropic | anthropic | API-key auth against /v1/messages with Anthropic version header and beta flags. |
| Google Gemini / Code Assist | google_oauth | Google OAuth PKCE / manual-code flow against Code Assist generateContent / streamGenerateContent. |
Provider specs remain host config data. Runtime sessions persist only provider ids and bind them to the single live ProviderHandle supplied by the host when the session opens.
openai
Direct OpenAI Responses. Posts to https://api.openai.com/v1/responses, keeps Responses reasoning replay, and maps the shared ProviderOptions.cache_retention to a prompt_cache_key derived from the Lash session id. Long retention adds prompt_cache_retention where the API supports it. Does not accept a base_url.
openai-compatible
Generic Chat Completions. Requires base_url. Converts LlmRequest to a messages array, emits Chat Completions tools, maps structured output to response_format, and preserves OpenRouter reasoning effort through the reasoning.effort request field. Used for OpenRouter, vLLM, Together, Groq, etc.
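The base_url rules for the two kinds above can be sketched as a small resolver. This is an illustration only: `resolve_endpoint` and `EndpointError` are hypothetical names, not Lash APIs, and trimming a trailing slash from base_url is an assumption.

```rust
/// Errors for the hypothetical endpoint resolver below.
#[derive(Debug, PartialEq)]
enum EndpointError {
    BaseUrlNotAccepted,
    BaseUrlRequired,
    UnsupportedKind,
}

/// Hypothetical helper (not part of Lash): picks the request URL for the
/// `openai` and `openai-compatible` kinds described above.
fn resolve_endpoint(kind: &str, base_url: Option<&str>) -> Result<String, EndpointError> {
    match (kind, base_url) {
        // Direct OpenAI: fixed Responses endpoint, no base_url accepted.
        ("openai", None) => Ok("https://api.openai.com/v1/responses".to_string()),
        ("openai", Some(_)) => Err(EndpointError::BaseUrlNotAccepted),
        // Generic Chat Completions: base_url is required.
        // Assumption: a trailing slash on base_url is tolerated and trimmed.
        ("openai-compatible", Some(base)) => {
            Ok(format!("{}/chat/completions", base.trim_end_matches('/')))
        }
        ("openai-compatible", None) => Err(EndpointError::BaseUrlRequired),
        _ => Err(EndpointError::UnsupportedKind),
    }
}
```

For example, `resolve_endpoint("openai-compatible", Some("http://localhost:8000/v1"))` yields `http://localhost:8000/v1/chat/completions`.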
none
No markers emitted. Each request is treated as fresh.
short
Emits {"type":"ephemeral"} at the canonical breakpoints. Default 5-minute cache lifetime per Anthropic semantics.
long
Adds "ttl":"1h" to the ephemeral marker for longer-lived caching.
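A minimal sketch of how the three retention levels above could map to an Anthropic cache marker. `CacheRetention` and `cache_control_marker` are hypothetical names for illustration; only the JSON shapes come from the descriptions above.

```rust
/// Hypothetical mirror of the three retention levels (none / short / long).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CacheRetention {
    Off,   // "none": no markers emitted
    Short, // ephemeral marker, provider-default lifetime
    Long,  // ephemeral marker with a one-hour ttl
}

/// Returns the cache_control JSON to attach at each breakpoint,
/// or `None` when caching is off.
fn cache_control_marker(retention: CacheRetention) -> Option<String> {
    match retention {
        CacheRetention::Off => None,
        CacheRetention::Short => Some(r#"{"type":"ephemeral"}"#.to_string()),
        CacheRetention::Long => Some(r#"{"type":"ephemeral","ttl":"1h"}"#.to_string()),
    }
}
```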
Breakpoints are placed at:
- The first system/developer text message.
- The last tool definition in the request.
- Any explicit LlmContentBlock::Text.cache_breakpoint the runtime asks for.
When no explicit breakpoint is set, the provider falls back to the last user/assistant text content so prompt caching still works for sessions without explicit cache instrumentation.
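The placement rules above, including the fallback, can be sketched as follows. The types here (`Role`, `TextMessage`, `Breakpoint`) and the function `place_breakpoints` are hypothetical stand-ins, not the real LlmRequest shape. Skipping duplicate positions is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role { System, Developer, User, Assistant }

/// Hypothetical, simplified text message: a role plus an explicit-breakpoint flag.
struct TextMessage { role: Role, explicit_breakpoint: bool }

impl TextMessage {
    fn new(role: Role, explicit_breakpoint: bool) -> Self {
        TextMessage { role, explicit_breakpoint }
    }
}

/// Where a cache marker goes: a message index or a tool-definition index.
#[derive(Debug, PartialEq, Eq)]
enum Breakpoint { Message(usize), Tool(usize) }

fn push_unique(out: &mut Vec<Breakpoint>, bp: Breakpoint) {
    // Assumption: the same position is never marked twice.
    if !out.contains(&bp) { out.push(bp); }
}

fn place_breakpoints(messages: &[TextMessage], tool_count: usize) -> Vec<Breakpoint> {
    let mut out = Vec::new();
    // 1. The first system/developer text message.
    if let Some(i) = messages.iter().position(|m| matches!(m.role, Role::System | Role::Developer)) {
        push_unique(&mut out, Breakpoint::Message(i));
    }
    // 2. The last tool definition in the request.
    if tool_count > 0 {
        push_unique(&mut out, Breakpoint::Tool(tool_count - 1));
    }
    // 3. Any explicit breakpoints the runtime asked for.
    let explicit: Vec<usize> = messages.iter().enumerate()
        .filter(|(_, m)| m.explicit_breakpoint)
        .map(|(i, _)| i)
        .collect();
    if explicit.is_empty() {
        // Fallback: the last user/assistant text content.
        if let Some(i) = messages.iter().rposition(|m| matches!(m.role, Role::User | Role::Assistant)) {
            push_unique(&mut out, Breakpoint::Message(i));
        }
    } else {
        for i in explicit {
            push_unique(&mut out, Breakpoint::Message(i));
        }
    }
    out
}
```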
The result: long RLM sessions get the same cache-hit rate as native multi-turn chats, even though Lashlang-driven reasoning regenerates a fresh prompt on every iteration.
timeouts
ProviderReliability.request_timeout and chunk_timeout map to LlmTimeouts. By default, requests have a 300-second overall timeout and streams have a 120-second chunk timeout. Set RequestTimeout::Disabled only when an outer workflow or transport layer owns cancellation.
retries
ProviderRetryPolicy defaults to four attempts with exponential delay capped at 10 seconds and Retry-After capped at 60 seconds. ProviderReliability::disabled() turns retries off for hosts that already retry at a higher layer.
throttles
A retryable quota failure carrying Retry-After (an HTTP 429) is a throttle, not a failure: the ladder waits as instructed without consuming a retry attempt. The deference is bounded by the cumulative throttle_wait_budget_ms (default 90 seconds per call; 0 disables it, and each deferred wait charges at least one second), after which throttles count as ordinary retryable failures. A 429 without Retry-After follows the normal backoff-and-count ladder.
rate limits
ProviderRateLimitPolicy can cap max concurrency, requests per window, and tokens per window before a provider request is admitted.
use lash::provider::{ProviderReliability, RequestTimeout};

let reliability = ProviderReliability::default()
    .request_timeout(Some(RequestTimeout::Millis(120_000)))
    .stream_chunk_timeout_ms(Some(30_000))
    .max_attempts(3)
    .max_concurrency(Some(8))
    .requests_per_window(Some(120), Some(60_000));
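The interaction between the retries and throttles rows can be sketched as a small state machine. This is not the Lash implementation: `Ladder`, `Step`, and the 500 ms base delay are hypothetical. It also assumes that a wait is deferred only when its whole charge fits in the remaining budget, and that a counted failure still honours a capped Retry-After.

```rust
const MAX_ATTEMPTS: u32 = 4; // default attempts from the retries row
const BACKOFF_CAP_MS: u64 = 10_000; // exponential delay cap
const RETRY_AFTER_CAP_MS: u64 = 60_000; // Retry-After cap
const BASE_DELAY_MS: u64 = 500; // assumption: the base delay is not stated above
const MIN_THROTTLE_CHARGE_MS: u64 = 1_000; // each deferred wait charges at least 1 s

#[derive(Debug, PartialEq, Eq)]
enum Step {
    /// Sleep `ms`, then retry; `counted` says whether an attempt was consumed.
    Wait { ms: u64, counted: bool },
    GiveUp,
}

struct Ladder {
    attempts_used: u32, // starts at 1: the first attempt has already failed
    throttle_budget_ms: u64,
}

impl Ladder {
    fn new(throttle_wait_budget_ms: u64) -> Self {
        Ladder { attempts_used: 1, throttle_budget_ms: throttle_wait_budget_ms }
    }

    fn on_retryable_failure(&mut self, retry_after_ms: Option<u64>) -> Step {
        let capped = retry_after_ms.map(|ms| ms.min(RETRY_AFTER_CAP_MS));
        // Throttle path: defer to Retry-After without consuming an attempt,
        // as long as the cumulative budget can pay for the wait.
        if let Some(wait) = capped {
            let charge = wait.max(MIN_THROTTLE_CHARGE_MS);
            if self.throttle_budget_ms >= charge {
                self.throttle_budget_ms -= charge;
                return Step::Wait { ms: wait, counted: false };
            }
        }
        // Ordinary path: count the failure and back off exponentially.
        if self.attempts_used >= MAX_ATTEMPTS {
            return Step::GiveUp;
        }
        let backoff = BASE_DELAY_MS
            .saturating_mul(1u64 << (self.attempts_used - 1))
            .min(BACKOFF_CAP_MS);
        self.attempts_used += 1;
        Step::Wait { ms: capped.unwrap_or(backoff), counted: true }
    }
}
```

With the default 90-second budget, three consecutive 30-second throttles are absorbed without touching the attempt count; the fourth is counted like any other retryable failure. A budget of 0 sends every throttle straight to the counted path.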
use std::sync::Arc;
// Lash types (Provider, LlmRequest, LlmResponse, LlmTransportError,
// ProviderOptions) are assumed to be in scope.

/// Host-owned admission: bounded in-flight windows per traffic class,
/// wrapped around the provider the host installs. Breakers, AIMD windows,
/// and backpressure metrics slot into the same `complete()` seam.
#[derive(Debug)]
struct AdmissionGate {
    inner: Box<dyn Provider>,
    interactive_slots: Arc<tokio::sync::Semaphore>,
    batch_slots: Arc<tokio::sync::Semaphore>,
}

impl Clone for AdmissionGate {
    fn clone(&self) -> Self {
        Self {
            inner: self.inner.clone_boxed(),
            // Shared handles: every clone admits through the same windows.
            interactive_slots: Arc::clone(&self.interactive_slots),
            batch_slots: Arc::clone(&self.batch_slots),
        }
    }
}

#[async_trait::async_trait]
impl Provider for AdmissionGate {
    async fn complete(&mut self, request: LlmRequest) -> Result<LlmResponse, LlmTransportError> {
        // Class traffic by the session identity the host already owns:
        // deliberate ids for its own pipelines, and a default lane for ids
        // it did not mint (lash-spawned child sessions).
        let lane = if request.scope.session_id.starts_with("batch:") {
            &self.batch_slots
        } else {
            &self.interactive_slots
        };
        // The permit drops on every exit path (success, failure, or a
        // cancelled turn), so an aborted call never leaks a slot.
        let _slot = lane.acquire().await.expect("admission gate closed");
        self.inner.complete(request).await
    }

    // Forward `close()` explicitly: the default impl is a no-op and would
    // silently skip the inner provider's transport shutdown.
    async fn close(&self) -> Result<(), LlmTransportError> {
        self.inner.close().await
    }

    fn kind(&self) -> &'static str {
        self.inner.kind()
    }

    fn options(&self) -> ProviderOptions {
        self.inner.options()
    }

    fn set_options(&mut self, options: ProviderOptions) {
        self.inner.set_options(options);
    }

    fn serialize_config(&self) -> serde_json::Value {
        self.inner.serialize_config()
    }

    fn requires_streaming(&self) -> bool {
        self.inner.requires_streaming()
    }

    fn clone_boxed(&self) -> Box<dyn Provider> {
        Box::new(self.clone())
    }
}
Install the wrapper with ProviderComponents::map_provider wherever the handle is constructed:
let interactive_slots = Arc::new(tokio::sync::Semaphore::new(8));
let batch_slots = Arc::new(tokio::sync::Semaphore::new(2));

let handle = ProviderHandle::new(components.map_provider(|inner| {
    Box::new(AdmissionGate {
        inner,
        interactive_slots,
        batch_slots,
    })
}));
forward close()
The trait's default close() is a no-op. A wrapper that omits the forward silently skips the inner provider's transport shutdown (Codex sends WebSocket Close frames on its cached sessions). Forward it explicitly.
re-wrap on rebuild
serialize_config forwards to the inner provider, so a ProviderSpec round-trip rebuilds the bare provider — the wrapper does not survive persistence. Apply map_provider at every construction site (factory resolution, resume paths), not once at startup.
cancellation-safe admission
The runtime aborts the in-flight provider call when a turn is cancelled, dropping the complete() future at an await point. Anything acquired before the inner call must release on drop — semaphore permits do this naturally; a hand-rolled counter needs a drop guard.
class by session id
scope.session_id is contractually present on every provider request. Mint deliberate ids for pipelines you want to class (direct completions accept a host-supplied session id and otherwise scope as direct:{uuid}), and give ids you did not mint — lash-spawned child sessions — a default lane rather than rejecting them.
The retry ladder wraps the decorator, so core has to handle throttling correctly on its own (see the throttles row above): an admission gate cannot compensate for a ladder that would turn a 429 storm into attempt exhaustion. The admission use-case pairs with the metrics guidance on operations.
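For the hand-rolled counter mentioned under cancellation-safe admission, a drop guard might look like the sketch below. `InFlightGuard` and `try_acquire` are hypothetical names. It uses only the standard library and rejects instead of waiting, which is a simplification of what a semaphore does.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical guard: holds one in-flight slot and releases it on drop.
struct InFlightGuard {
    counter: Arc<AtomicUsize>,
}

impl InFlightGuard {
    /// Claims a slot if fewer than `limit` calls are in flight.
    fn try_acquire(counter: &Arc<AtomicUsize>, limit: usize) -> Option<Self> {
        let mut current = counter.load(Ordering::Acquire);
        loop {
            if current >= limit {
                return None;
            }
            // Compare-and-swap so two concurrent callers cannot both take the last slot.
            match counter.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(InFlightGuard { counter: Arc::clone(counter) }),
                Err(observed) => current = observed,
            }
        }
    }
}

impl Drop for InFlightGuard {
    // Runs on success, on error, and when the owning future is dropped at an
    // await point, so a cancelled turn gives its slot back.
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Binding the guard to a local before awaiting the inner call (`let _guard = ...;`) ties the slot to the lifetime of the `complete()` future.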
pub struct LlmUsage {
    pub input_tokens: i64,             // Uncached ordinary input
    pub cache_read_input_tokens: i64,  // Prompt input read from cache
    pub cache_write_input_tokens: i64, // Prompt input written to cache
    pub output_tokens: i64,            // Total generated output
    pub reasoning_output_tokens: i64,  // Subset of output tokens
    // …provider-specific extras flow through the extended trace
}
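A sketch of how a host might derive aggregates from these fields. The struct below is a local mirror for illustration, and the helper functions are hypothetical. It assumes, per the field comments, that input_tokens excludes cached input and that reasoning tokens are already included in output_tokens.

```rust
/// Local mirror of the usage fields above, for illustration only.
#[derive(Debug, Clone, Copy)]
pub struct LlmUsage {
    pub input_tokens: i64,
    pub cache_read_input_tokens: i64,
    pub cache_write_input_tokens: i64,
    pub output_tokens: i64,
    pub reasoning_output_tokens: i64,
}

/// Total prompt size: uncached input plus cache reads plus cache writes.
fn total_prompt_tokens(u: &LlmUsage) -> i64 {
    u.input_tokens + u.cache_read_input_tokens + u.cache_write_input_tokens
}

/// Share of the prompt served from cache; `None` for an empty prompt.
fn cache_hit_ratio(u: &LlmUsage) -> Option<f64> {
    let total = total_prompt_tokens(u);
    if total <= 0 {
        None
    } else {
        Some(u.cache_read_input_tokens as f64 / total as f64)
    }
}

/// Output tokens that are not reasoning tokens.
fn visible_output_tokens(u: &LlmUsage) -> i64 {
    u.output_tokens - u.reasoning_output_tokens
}
```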
State
Per-handle config: API key, base URL, default model, options like thinking exposure.
Auth
Bearer header, OAuth refresh, device-code flow, whatever the vendor requires. Auth state is opaque to the runtime. OAuth backends (Codex, Google) share token storage and refresh primitives from lash-provider-auth; vendor-specific endpoints, device-code, and PKCE helpers live in each provider's OAuth module: oauth.rs for Google, codex/oauth.rs for Codex.
Readiness
Optional pre-flight check (token refresh, capability probe) that runs once per session.
Transport
The actual HTTP call. Translates LlmRequest to the wire format, streams the response, normalizes back to LlmResponse chunks.
Model policy
Maps user-facing model + variant names to model-native ids, declares structured-output / tool-call / thinking capabilities per model.
Factory
Materializes host config into provider components; ProviderHandle::new(components) assembles the five pieces into the live handle the runtime can use.
Use lash-provider-openai as the most general template: it handles both the direct Responses path and the generic Chat Completions path in one crate.