Internals
Silo is built on top of SlateDB, a low-level key-value store. Silo implements and maintains higher-level data structures within this KV store.
Entities
Job
One unit of work that should be attempted until completion or until retries are exhausted. Moves through states of Scheduled, Waiting, Running, Succeeded, Retrying, Failed, or Cancelled.
Attempt
One attempt to run a unit of work. Moves through states of Running, Succeeded, Failed, or Cancelled.
Task
One item that we need to dequeue and run. Usually it's a task for a worker to run an attempt, but there are other task types that represent internal operations that should happen after some point in the future. Most tasks are processed by the external worker polling Silo for what to do next, but some are handled purely internally.
Task lease
One task that has been leased by a worker. Upon leasing, a task is moved from the task queue into the task leases. A worker must complete the task or heartbeat for it before the lease expires, or else the system will assume the worker has crashed.
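The lease-expiry contract above can be sketched with a couple of illustrative types (these are not Silo's actual structures, and the TTLs are made up):

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch of task-lease bookkeeping.
struct TaskLease {
    expires_at: Instant,
}

impl TaskLease {
    fn new(ttl: Duration) -> Self {
        TaskLease { expires_at: Instant::now() + ttl }
    }

    /// A heartbeat extends the lease; the worker must call this (or complete
    /// the task) before `expires_at`, or the system assumes it crashed.
    fn heartbeat(&mut self, ttl: Duration) {
        self.expires_at = Instant::now() + ttl;
    }

    fn is_expired(&self) -> bool {
        Instant::now() >= self.expires_at
    }
}

fn main() {
    let mut lease = TaskLease::new(Duration::from_secs(60));
    assert!(!lease.is_expired());
    lease.heartbeat(Duration::from_secs(60)); // still alive after a heartbeat
    assert!(!lease.is_expired());
}
```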
Tenancy
All of Silo is built to house job data for multiple tenants, if necessary. All data for a single tenant lives in the same SlateDB database, which ensures fast transactions for every hot-path operation. An individual tenant’s data can’t be split across shards.
Key/value layout
Silo uses the storekey crate for binary key encoding. This provides lexicographically-sortable binary keys that maintain correct ordering when compared as byte strings. Each key type has a unique single-byte prefix to ensure namespace isolation.
Key types and their prefixes
| Prefix | Name | Description |
|---|---|---|
| 0x01 | Job info | Stores job payloads: (tenant, job_id) |
| 0x02 | Job status | Stores job status: (tenant, job_id) |
| 0x03 | Status/time index | Secondary index for jobs by status, newest-first: (tenant, status, inverted_timestamp, job_id) |
| 0x04 | Metadata index | Secondary index for jobs by metadata key/value: (tenant, key, value, job_id) |
| 0x05 | Task | Work items in task order: (task_group, start_time_ms, priority, job_id, attempt) |
| 0x06 | Lease | Leased tasks: (task_id) |
| 0x07 | Attempt | Attempt records: (tenant, job_id, attempt_number) |
| 0x08 | Concurrency request | Jobs waiting for concurrency slots: (tenant, queue, start_time_ms, priority, request_id) |
| 0x09 | Concurrency holder | Tasks holding concurrency slots: (tenant, queue, task_id) |
| 0x0A | Job cancelled | Cancellation flags: (tenant, job_id) |
| 0x0B | Floating limit | Dynamic concurrency limit state: (tenant, queue_key) |
| 0xF0-0xFF | System | Internal counters, cleanup state, and metadata |
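A minimal sketch of why this layout works, assuming (not from the source) a hand-rolled encoding rather than the actual storekey output: big-endian integers and an inverted timestamp keep byte-wise comparison aligned with the intended order, so a plain ascending scan of the 0x03 index returns the newest jobs first.

```rust
// Illustrative encoding only — not Silo's actual storekey output.
fn status_time_index_key(tenant: &str, status: u8, timestamp_ms: u64, job_id: u32) -> Vec<u8> {
    let mut key = vec![0x03]; // single-byte prefix for the status/time index
    key.extend_from_slice(tenant.as_bytes());
    key.push(0x00); // terminator so shorter tenant names sort before extensions
    key.push(status);
    // Inverting the timestamp makes byte-wise ascending scans return newest first.
    key.extend_from_slice(&(u64::MAX - timestamp_ms).to_be_bytes());
    key.extend_from_slice(&job_id.to_be_bytes());
    key
}

fn main() {
    let newer = status_time_index_key("acme", 2, 2_000, 1);
    let older = status_time_index_key("acme", 2, 1_000, 2);
    // Byte-wise comparison: the newer job sorts first.
    assert!(newer < older);
}
```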
Clustering approach
Silo works by separating the compute and storage tiers. Compute is provided by stateless or semi-stateless cloud compute, and storage is provided by cloud object storage. Compute can be deployed, rescheduled, or scaled at will. Data durability comes from object storage, not from the compute nodes themselves.
Within a cluster, all data is sharded. At any one time, each shard is opened and exclusively read or written to by a single compute node. The number of shards can grow over time by splitting shards. These shard leases are permanent — they aren’t locks with liveness checking. Instead, they persist through node crashes to prevent WAL data loss when using the split WAL mode. See Permanent shard leases below for details.
Shard splitting
When a shard becomes overloaded, it can be split into two child shards to distribute the load. Splits divide the tenant ID keyspace at a chosen split point:
- Left child: owns tenant IDs [parent_start, split_point)
- Right child: owns tenant IDs [split_point, parent_end)
The split point is always a tenant ID boundary, ensuring that all data for any individual tenant remains together in a single shard. This preserves tenant atomicity—Silo’s concurrency queues and transactions operate within a single shard, so a tenant’s data can never be divided.
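The half-open ranges can be illustrated with a tiny routing function (tenant IDs shown as integers for simplicity; the ranges and function name are hypothetical):

```rust
// Sketch of post-split routing: which child shard owns a given tenant.
fn owning_child(tenant_id: u64, parent_start: u64, split_point: u64, parent_end: u64) -> &'static str {
    assert!(parent_start <= tenant_id && tenant_id < parent_end);
    if tenant_id < split_point {
        "left" // [parent_start, split_point)
    } else {
        "right" // [split_point, parent_end)
    }
}

fn main() {
    assert_eq!(owning_child(10, 0, 100, 200), "left");
    assert_eq!(owning_child(100, 0, 100, 200), "right"); // the boundary goes right
    assert_eq!(owning_child(199, 0, 100, 200), "right");
}
```

Because the split point is a tenant ID boundary, every key for a given tenant lands on exactly one side.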
The split operation proceeds through several phases:
- SplitRequested — The split is initiated on the node owning the shard.
- SplitPausing — Traffic to the parent shard is paused (clients receive retryable errors).
- SplitCloning — The underlying SlateDB database is cloned for both child shards, then the shard map is atomically updated to replace the parent with the two children. The shard map update is the commit point.
- SplitComplete — Children are active and ready to serve traffic. The split record is cleaned up.
After a split completes, each child shard contains cloned data from the parent. A background cleanup process then removes any data that falls outside each child’s new range (defunct keys), followed by a compaction to reclaim storage.
For crash recovery, the shard map is the source of truth. If a crash occurs before the shard map update (commit point), the split is abandoned and the parent remains intact. Any orphaned clone databases are left behind but don’t affect correctness—they’re just wasted storage that can be garbage collected later. If a crash occurs after the shard map update, the split is committed and children exist in the shard map.
The clustering approach is modelled on Google Cloud Bigtable, where individual compute nodes each hold leases over some subset of the overall shards. Unlike Bigtable, there’s no central coordinator managing which shards are assigned to which nodes. Instead, we maintain an internal hash ring that assigns the entire shard space across all the nodes, shifting around as few shards as possible when cluster membership changes.
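A minimal consistent-hash-ring sketch in that spirit, assuming a Ketama-style sorted map of virtual-node points (names and parameters are illustrative, not Silo's implementation):

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash point -> member node_id. Each member contributes several virtual nodes;
// a shard is owned by the first point at or after its hash, wrapping around.
struct Ring {
    points: BTreeMap<u64, String>,
}

fn hash<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

impl Ring {
    fn new(members: &[&str], vnodes: u32) -> Self {
        let mut points = BTreeMap::new();
        for m in members {
            for v in 0..vnodes {
                points.insert(hash(&(m, v)), m.to_string());
            }
        }
        Ring { points }
    }

    fn owner(&self, shard_id: u64) -> &str {
        let h = hash(&shard_id);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next()) // wrap around the ring
            .map(|(_, id)| id.as_str())
            .unwrap()
    }
}

fn main() {
    let ring = Ring::new(&["node-a", "node-b", "node-c"], 32);
    // Ownership is deterministic: every node computes the same assignment.
    assert_eq!(ring.owner(7), ring.owner(7));
}
```

Because only points belonging to the joining or leaving member move, most shards keep their existing owner across a membership change.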
Shard coordination plan
Goals:
- Elastic compute membership with fast join/leave
- Shards are the atomic unit of load; clients specify a shard on each request
- Minimal downtime on ownership changes and no split-brain
- etcd is used only for control plane (membership, locks, handover), never per-request on the data plane
Key ideas:
- Membership (ephemeral): Each node registers under `coord/members/<node_id>` with a membership lease (TTL, e.g. 10s). Value includes `{ addr, started_at, node_id, capabilities }`.
- Ring computation: All nodes watch `coord/members/` and deterministically build a consistent hash ring (e.g. Ketama with N virtual nodes per member).
- Ownership: The active owner of shard `<S>` is the node that holds the etcd key `coord/shards/<S>/owner`. Value: `{ owner_id, owner_addr, acquired_at }`. Unlike membership, shard ownership keys are not tied to any TTL-based lease — they persist until explicitly released (graceful shutdown) or force-released (operator action).
- Data plane independence: Nodes serve a shard only if they currently hold its `owner` key and are active members. Requests never consult etcd; they rely on local in-memory ownership state maintained via watches/keepalives.
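The put-if-absent and delete-if-owner operations can be sketched against an in-memory stand-in for the etcd control plane (the real system would use etcd transactions; this mock only illustrates the semantics):

```rust
use std::collections::HashMap;

// Stand-in for the etcd control plane; not a real client.
struct ControlPlane {
    kv: HashMap<String, String>,
}

impl ControlPlane {
    /// "Create only if not exists": at most one node wins the owner key.
    fn put_if_absent(&mut self, key: &str, value: &str) -> bool {
        if self.kv.contains_key(key) {
            false // someone else owns the shard; back off and retry later
        } else {
            self.kv.insert(key.to_string(), value.to_string());
            true
        }
    }

    /// Graceful release: delete only if we are still the recorded owner.
    fn delete_if_owner(&mut self, key: &str, owner: &str) -> bool {
        if self.kv.get(key).map(|v| v.as_str()) == Some(owner) {
            self.kv.remove(key);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut cp = ControlPlane { kv: HashMap::new() };
    assert!(cp.put_if_absent("coord/shards/42/owner", "node-a")); // winner
    assert!(!cp.put_if_absent("coord/shards/42/owner", "node-b")); // loser retries
    assert!(!cp.delete_if_owner("coord/shards/42/owner", "node-b")); // not the owner
    assert!(cp.delete_if_owner("coord/shards/42/owner", "node-a")); // graceful release
    assert!(cp.put_if_absent("coord/shards/42/owner", "node-b")); // next owner wins
}
```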
Node lifecycle:
- Join
  - Create a membership lease (TTL ~10s, keepalive ~2s).
  - Put `coord/members/<node_id>` with the membership lease.
  - Scan for any shard leases already held by this node from a previous run (reclaim existing leases for WAL recovery).
  - Build ring from current `coord/members/*` and compute desired shard set.
  - Reconcile: for each desired shard, try to acquire `coord/shards/<S>/owner` via a put-if-absent transaction. Only start serving a shard after this key has been written.
  - Start watches on `coord/members/*` and `coord/shards/<S>/owner` for shards you own or are next-in-line for.
- Steady state
  - Maintain keepalives for the membership lease. If keepalives fail (channel closes or prolonged error), immediately pause serving all owned shards to avoid split-brain.
  - On membership change events, recompute the ring and run reconciliation: acquire new shards, release shards no longer owned.
- Graceful leave
  - Delete `coord/members/<node_id>` (using the membership lease), triggering ring recompute so contenders begin acquisition retries.
  - Quiesce briefly per shard (1–2s) to drain in-flight work.
  - Explicitly release each `coord/shards/*/owner` key for this node (delete-if-owner); new owners acquire on their next retry.
Safe ring changes
On ring change, we must carefully change ownership of shards with minimal downtime. To do this, we ensure that the new owners start trying to acquire the shard lock before the old owners release it.
The flow:
- A node either adds or removes its membership details under `coord/members/<node_id>`.
- All nodes discover this membership change and update their in-memory hash rings, which shifts which shards they are responsible for.
- New shard owners immediately start trying to acquire the lease for the shards they want to own. Each runs a jittered loop: try a `Put` of `coord/shards/<S>/owner` using “create only if not exists” semantics (i.e., CAS on a missing key). If the key exists, back off and retry; never overwrite.
- Old shard owners wait a little while, ensuring that the new shard owners discover the change and begin their retry loops in advance.
- Old shard owners then relinquish their shard locks (explicit delete-if-owner) and stop serving requests for those shards once that’s happened.
Crash case: the previous owner’s shard lease persists (see Permanent shard leases). An operator must force-release the lease via `siloctl shard force-release` before another node can acquire it.
Correctness and split-brain avoidance
Only the node holding `coord/shards/<S>/owner` and with active membership may serve `<S>`. All others must return `NOT_OWNER` immediately. Nodes must pause serving a shard if their membership keepalive fails, even if they still hold the shard lease, because another node might need to take over.
Failure modes:
- Crash: shard leases persist (permanent). The crashed node retains its shard leases, preventing another node from accidentally opening the shard and losing WAL data. When the node restarts with the same `node_id`, it reclaims its leases and recovers the WAL. If the node is permanently lost, an operator must force-release the shard leases via `siloctl shard force-release`.
- Network partition: the owner loses its membership keepalive and must cease serving. Unlike a crash, the shard lease is still held, so no other node can acquire it until it’s released or force-released.
- Simultaneous join: consistent hashing minimizes movement; CAS on `owner` guarantees a single winner.
Permanent shard leases
Shard leases are permanent — they are not tied to any TTL-based lease and survive node crashes. This is critical for preventing WAL data loss when using a local filesystem for the write-ahead log.
Why permanent leases? When a node writes to a shard, data first goes to the local WAL before being flushed to object storage. If a node crashes before the WAL is flushed, that data only exists on the crashed node’s local disk. If another node were to immediately acquire the shard (as would happen with TTL-based ephemeral leases), it would open the shard from object storage and miss the unflushed WAL data, causing data loss.
With permanent leases:
- Node crashes — the shard lease persists, blocking other nodes from acquiring the shard.
- Node restarts — if the node restarts with the same `node_id` (e.g., a Kubernetes StatefulSet pod), it finds its existing shard leases, reclaims them, opens the shard databases, and recovers the WAL data from local disk. No data is lost.
- Node permanently lost — if a node will never come back, an operator uses `siloctl shard force-release <shard-id>` to release the lease. The operator accepts that any unflushed WAL data on that node is lost. Another node then acquires the shard and starts serving it from the last state in object storage.
Configuration for permanent leases: For permanent leases to work correctly across restarts, the `node_id` must be stable. In Kubernetes, set `node_id = "${POD_NAME}"` so that a restarted StatefulSet pod re-acquires its own leases. If `node_id` is not configured, a random UUID is generated on each startup, meaning a restarted node won’t reclaim its previous leases.
When local WAL is not used: If the WAL is stored in object storage (same backend as the main data), permanent leases still apply but are less critical — there’s no local data to lose on crash. An operator could force-release shards from a crashed node without risking data loss in this configuration.
Request handling and redirection
On each RPC, the server checks a local map of owned shards. If it is not the owner, it returns `NOT_OWNER` with an optional hint `{ owner_id, owner_addr }` read from the cached owner value. Clients retry a few times and may redirect using the hint. Servers never query the coordinator on the request path.
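A sketch of that request-path check, using illustrative types (`RouteResult` and `OwnerHint` are assumptions, not Silo's API):

```rust
use std::collections::HashMap;

// Hint returned alongside NOT_OWNER, read from locally cached owner values.
struct OwnerHint {
    owner_id: String,
    owner_addr: String,
}

enum RouteResult {
    Serve,
    NotOwner(Option<OwnerHint>),
}

// Consult only local in-memory state; never the coordinator.
fn route(owned: &HashMap<u64, ()>, hints: &HashMap<u64, OwnerHint>, shard: u64) -> RouteResult {
    if owned.contains_key(&shard) {
        RouteResult::Serve
    } else {
        RouteResult::NotOwner(hints.get(&shard).map(|h| OwnerHint {
            owner_id: h.owner_id.clone(),
            owner_addr: h.owner_addr.clone(),
        }))
    }
}

fn main() {
    let mut owned = HashMap::new();
    owned.insert(1u64, ());
    let hints = HashMap::new();
    assert!(matches!(route(&owned, &hints, 1), RouteResult::Serve));
    assert!(matches!(route(&owned, &hints, 2), RouteResult::NotOwner(None)));
}
```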
Concurrency Queues
Concurrency queues limit how many jobs with the same queue key can run simultaneously. They work like a ticket system: a job must hold a ticket for a queue before it can proceed to execution, and each queue has a limited number of tickets available.
Requesters and holders
Section titled “Requesters and holders”The two key concepts are requesters and holders:
- A holder is a task that has been granted a ticket and is occupying one of the queue’s concurrency slots. It’s either currently running on a worker, sitting in the task queue waiting to be picked up, or in the process of being leased. Holders are stored durably in SlateDB (prefix `0x09`) and also tracked in an in-memory set for fast capacity checks.
- A requester is a job that wants a ticket while the queue is full. Requesters are stored durably in SlateDB (prefix `0x08`), ordered by `(start_time, priority, request_id)` so the most eligible job is always first in the scan. Requesters don’t have a corresponding task in the task queue — they’re parked, waiting for a holder to finish and free up a slot.
When a holder finishes (or is cancelled), the system promotes the next eligible requester into a holder — this is the core “release and grant next” operation.
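A simplified sketch of that release-and-grant-next operation, assuming in-memory stand-ins for the durable holder and requester records (types are illustrative, not Silo's actual structures):

```rust
use std::collections::{BTreeMap, HashSet};

// Requesters keyed by (start_time, priority, request_id), matching the
// ordering of the durable 0x08 keys, so iteration yields the most eligible
// requester first.
struct Queue {
    max_concurrency: usize,
    holders: HashSet<u64>,
    requesters: BTreeMap<(u64, u8, u64), u64>, // key -> task_id
}

impl Queue {
    /// A holder finished: free its slot and promote the first eligible requester.
    fn release_and_grant_next(&mut self, finished: u64) -> Option<u64> {
        self.holders.remove(&finished);
        if self.holders.len() < self.max_concurrency {
            if let Some((&key, &task)) = self.requesters.iter().next() {
                self.requesters.remove(&key);
                self.holders.insert(task);
                return Some(task); // this task now holds a ticket
            }
        }
        None
    }
}

fn main() {
    let mut q = Queue {
        max_concurrency: 1,
        holders: HashSet::from([10]),
        requesters: BTreeMap::from([((5, 0, 1), 11), ((9, 0, 2), 12)]),
    };
    assert_eq!(q.release_and_grant_next(10), Some(11)); // earliest requester promoted
    assert!(q.holders.contains(&11));
}
```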
Why in-memory counts?
Checking concurrency capacity requires knowing how many holders exist for a queue. Scanning the durable holder keys in SlateDB on every enqueue or grant would be too slow, so we maintain an in-memory `HashMap<queue, HashSet<task_id>>` of current holders (`ConcurrencyCounts` in `src/concurrency.rs`). This makes capacity checks a fast set-size lookup.
The in-memory counts are lazily hydrated: the first time a queue is accessed, we scan its durable holder keys to populate the set. After that, the in-memory state is kept in sync by updating it on every grant and release. The hydration scan filters by shard range to avoid double-counting holders that were cloned during a shard split.
The tricky part is keeping in-memory counts consistent with durable state. We use an optimistic reservation strategy: in-memory counts are updated before the DB write, and if the write fails, the caller must invoke explicit rollback methods. This eliminates any TOCTOU window where two concurrent operations could both see available capacity and double-grant.
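A sketch of that optimistic reservation pattern (`ConcurrencyCountsSketch` and its methods are illustrative, not the actual `src/concurrency.rs` API):

```rust
use std::collections::{HashMap, HashSet};

// In-memory holder sets per queue, mirroring the durable holder records.
struct ConcurrencyCountsSketch {
    holders: HashMap<String, HashSet<u64>>,
}

impl ConcurrencyCountsSketch {
    /// Reserve a slot in memory *before* the DB write, closing the TOCTOU window:
    /// a concurrent caller now sees the queue as full.
    fn try_reserve(&mut self, queue: &str, task: u64, max: usize) -> bool {
        let set = self.holders.entry(queue.to_string()).or_default();
        if set.len() < max {
            set.insert(task)
        } else {
            false
        }
    }

    /// Caller must invoke this if the subsequent durable write fails.
    fn rollback(&mut self, queue: &str, task: u64) {
        if let Some(set) = self.holders.get_mut(queue) {
            set.remove(&task);
        }
    }
}

fn main() {
    let mut counts = ConcurrencyCountsSketch { holders: HashMap::new() };
    assert!(counts.try_reserve("q", 1, 1));
    assert!(!counts.try_reserve("q", 2, 1)); // full: the job would become a requester
    counts.rollback("q", 1); // pretend the DB write failed
    assert!(counts.try_reserve("q", 2, 1)); // the slot is available again
}
```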
Enqueue flow
When a job with concurrency limits is enqueued, limits are processed sequentially:
- Immediate grant: Check the in-memory holder count against `max_concurrency`. If there’s room, atomically reserve a slot in memory, write a holder record to the batch, and advance to the next limit.
- Queued: If the queue is full and the job is ready to run now, write a requester record to the batch. No task goes in the task queue — the job is parked until a slot opens.
- Future-scheduled: If the queue is full and the job has a future start time, write a `RequestTicket` task to the task queue at that start time. When it fires, it re-attempts the grant.
- All limits passed: Once every limit is satisfied, a `RunAttempt` task is written to the task queue. This task carries a `held_queues` list of all concurrency queues it holds tickets in, which is used to release them when the task completes.
Multiple limits per job
Jobs can specify multiple limits (concurrency, floating concurrency, and rate limits), which are checked in order. `enqueue_limit_task_at_index` loops through them, accumulating granted queues as it goes:
- Each granted concurrency limit adds its queue to the `held_queues` list.
- If any limit blocks, the function returns early. The job resumes from that limit index when the blocking condition clears.
- Rate limits work differently — they create a `CheckRateLimit` task, and when that passes, the limit loop continues from where it left off with the accumulated `held_queues`.
- When all limits pass, the final `RunAttempt` task gets all the `held_queues`.
On retry, limit processing restarts from scratch — the job must re-acquire all concurrency tickets.
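The limit loop described above can be sketched as follows (the enum and function names are hypothetical, and only the concurrency-limit case is modelled):

```rust
// Illustrative stand-in for a job's ordered limits.
enum Limit {
    Concurrency { queue: String, has_capacity: bool },
}

enum Outcome {
    RunAttempt { held_queues: Vec<String> },
    Blocked { at_index: usize, held_queues: Vec<String> },
}

// Loop from `start_index`, accumulating granted queues; return early on the
// first blocking limit so the job can resume from that index later.
fn process_limits(limits: &[Limit], start_index: usize, mut held_queues: Vec<String>) -> Outcome {
    for (i, limit) in limits.iter().enumerate().skip(start_index) {
        match limit {
            Limit::Concurrency { queue, has_capacity } => {
                if *has_capacity {
                    held_queues.push(queue.clone()); // granted: keep going
                } else {
                    return Outcome::Blocked { at_index: i, held_queues };
                }
            }
        }
    }
    Outcome::RunAttempt { held_queues }
}

fn main() {
    let limits = vec![
        Limit::Concurrency { queue: "a".into(), has_capacity: true },
        Limit::Concurrency { queue: "b".into(), has_capacity: false },
    ];
    match process_limits(&limits, 0, Vec::new()) {
        Outcome::Blocked { at_index, held_queues } => {
            assert_eq!(at_index, 1);
            assert_eq!(held_queues, vec!["a".to_string()]); // ticket for "a" is kept
        }
        _ => panic!("expected blocked"),
    }
}
```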
Floating max concurrency limits
Silo supports concurrency queues whose max concurrency is adjusted continuously over time. This allows queues to be adjusted by operators, as well as by automated controllers trying to achieve some setpoint in another system, such as rate-limit usage of a downstream API.
Floating max concurrency limits are specified by each job when the job is enqueued, which puts the job in that concurrency queue. The max concurrency for that queue defaults to an initial value and is then periodically refreshed. If too many jobs are currently running, no more will start until the queue is below its new max concurrency; likewise, if the queue now has more headroom, more jobs will start. The max concurrency value for the queue is refreshed by sending a special type of refresh task to any listening Silo worker, which can then run whatever business logic is necessary to compute a new value.
We’re careful to only push refresh tasks for queues that are actively in use.
Gubernator rate limiting
Incoming jobs may specify that they need to pass a Gubernator rate limit to proceed. For each attempt, Silo will make RPCs to Gubernator to check these rate limits before allowing the attempt to run. This is done by enqueuing and running a task that polls the rate limit in Gubernator. If the rate limit passes, we enqueue the actual task to run the attempt, and if it doesn’t, we re-enqueue a task for the future to poll the rate limit again.
Big decisions
Section titled “Big decisions”Do we group data by tenant or not?
Silo currently groups all the data for a given tenant into the same SlateDB instance. This has major performance benefits, as all transactions for the hot path can transact locally, in-memory, on the one compute node that has that tenant’s shard open. This allows for much higher throughput and lower resource utilization than something that uses consensus or distributed transactions for every state change. When using a local WAL for maximum performance, permanent shard leases ensure the node retains shard ownership through crashes, allowing WAL recovery on restart without data loss.
There are a couple of downsides to this:
- There is a maximum scale any one tenant can reach. It differs by workload, but we’ve measured it to be about 4k jobs / second / shard.
The alternative would be to not group data by tenant, and instead spread every tenant’s data out across a subset (or all) shards in the cluster. This would allow for much higher scale per individual tenant, as you could bring the full might of the cluster to bear on a single tenant’s workload. We deemed the downsides of this approach too severe however:
- Silo’s concurrency queue primitive offers global coordination, where jobs on different shards would need to collaborate to only let the max concurrency proceed at any one time. With cross-shard data storage, we’d need to build an expensive and complicated distributed transaction or inbox/outbox system for mutating data on different shards.
- Introspection queries get more expensive, as now all of them need to do a scatter-gather to check all data on all shards in the system, rather than just routing directly to the shard storing data for a given tenant.
Temporal and Restate both take the more fundamentally scalable approach of not grouping data by tenant, but offer a higher cost per state change because of this.