Internals
Silo is built on top of SlateDB, a low-level key-value store. Silo implements and maintains higher-level data structures within this KV store.
Entities
Job
One unit of work that should be attempted until completion or until retries are exhausted. Moves through states of Scheduled, Waiting, Running, Succeeded, Retrying, Failed, or Cancelled.
Attempt
One attempt to run a unit of work. Moves through states of Running, Succeeded, Failed, or Cancelled.
Task
One item that we need to dequeue and run. Usually it's a task for a worker to run an attempt, but there are other task types that represent internal operations that should happen after some point in the future. Most tasks are processed by the external worker polling Silo for what to do next, but some are handled purely internally.
Task lease
One task that has been leased by a worker. Upon leasing, a task is moved from the task queue into the task leases. A worker must complete the task or heartbeat for it before the lease expires, or else the system will assume the worker has crashed.
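The lease-expiry contract above can be sketched with a couple of illustrative types (these are not Silo's actual structures, and the TTLs are made up):

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch of task-lease bookkeeping.
struct TaskLease {
    expires_at: Instant,
}

impl TaskLease {
    fn new(ttl: Duration) -> Self {
        TaskLease { expires_at: Instant::now() + ttl }
    }

    /// A heartbeat extends the lease; the worker must call this (or complete
    /// the task) before `expires_at`, or the system assumes it crashed.
    fn heartbeat(&mut self, ttl: Duration) {
        self.expires_at = Instant::now() + ttl;
    }

    fn is_expired(&self) -> bool {
        Instant::now() >= self.expires_at
    }
}

fn main() {
    let mut lease = TaskLease::new(Duration::from_secs(60));
    assert!(!lease.is_expired());
    lease.heartbeat(Duration::from_secs(60)); // still alive after a heartbeat
    assert!(!lease.is_expired());
}
```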
Tenancy
All of Silo is built to house job data for multiple tenants, if necessary. All data for a single tenant lives in the same SlateDB database, which ensures fast transactions for every hot-path operation. An individual tenant’s data can’t be split across shards.
Key/value layout
Silo uses the storekey crate for binary key encoding. This provides lexicographically-sortable binary keys that maintain correct ordering when compared as byte strings. Each key type has a unique single-byte prefix to ensure namespace isolation.
Key types and their prefixes
| Prefix | Name | Description |
|---|---|---|
| 0x01 | Job info | Stores job payloads: (tenant, job_id) |
| 0x02 | Job status | Stores job status: (tenant, job_id) |
| 0x03 | Status/time index | Secondary index for jobs by status, newest-first: (tenant, status, inverted_timestamp, job_id) |
| 0x04 | Metadata index | Secondary index for jobs by metadata key/value: (tenant, key, value, job_id) |
| 0x05 | Task | Work items in task order: (task_group, start_time_ms, priority, job_id, attempt) |
| 0x06 | Lease | Leased tasks: (task_id) |
| 0x07 | Attempt | Attempt records: (tenant, job_id, attempt_number) |
| 0x08 | Concurrency request | Jobs waiting for concurrency slots: (tenant, queue, start_time_ms, priority, request_id) |
| 0x09 | Concurrency holder | Tasks holding concurrency slots: (tenant, queue, task_id) |
| 0x0A | Job cancelled | Cancellation flags: (tenant, job_id) |
| 0x0B | Floating limit | Dynamic concurrency limit state: (tenant, queue_key) |
| 0xF0-0xFF | System | Internal counters, cleanup state, and metadata |
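A minimal sketch of why this layout works, assuming (not from the source) a hand-rolled encoding rather than the actual storekey output: big-endian integers and an inverted timestamp keep byte-wise comparison aligned with the intended order, so a plain ascending scan of the 0x03 index returns the newest jobs first.

```rust
// Illustrative encoding only — not Silo's actual storekey output.
fn status_time_index_key(tenant: &str, status: u8, timestamp_ms: u64, job_id: u32) -> Vec<u8> {
    let mut key = vec![0x03]; // single-byte prefix for the status/time index
    key.extend_from_slice(tenant.as_bytes());
    key.push(0x00); // terminator so shorter tenant names sort before extensions
    key.push(status);
    // Inverting the timestamp makes byte-wise ascending scans return newest first.
    key.extend_from_slice(&(u64::MAX - timestamp_ms).to_be_bytes());
    key.extend_from_slice(&job_id.to_be_bytes());
    key
}

fn main() {
    let newer = status_time_index_key("acme", 2, 2_000, 1);
    let older = status_time_index_key("acme", 2, 1_000, 2);
    // Byte-wise comparison: the newer job sorts first.
    assert!(newer < older);
}
```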
Clustering approach
Silo works by separating the compute and storage tiers. Compute is provided by stateless or semi-stateless cloud compute, and storage is provided by cloud object storage. Compute can be deployed, rescheduled, or scaled at will. Data durability comes from object storage, not from the compute nodes themselves.
Within a cluster, all data is sharded. At any one time, each shard is opened and exclusively read or written to by a single compute node. The number of shards can grow over time by splitting shards. These shard leases are permanent — they aren’t locks with liveness checking. Instead, they persist through node crashes to prevent WAL data loss when using the split WAL mode. See Permanent shard leases below for details.
Shard splitting
When a shard becomes overloaded, it can be split into two child shards to distribute the load. Splits divide the tenant ID keyspace at a chosen split point:
- Left child: owns tenant IDs [parent_start, split_point)
- Right child: owns tenant IDs [split_point, parent_end)
The split point is always a tenant ID boundary, ensuring that all data for any individual tenant remains together in a single shard. This preserves tenant atomicity—Silo’s concurrency queues and transactions operate within a single shard, so a tenant’s data can never be divided.
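The half-open ranges can be illustrated with a tiny routing function (tenant IDs shown as integers for simplicity; the ranges and function name are hypothetical):

```rust
// Sketch of post-split routing: which child shard owns a given tenant.
fn owning_child(tenant_id: u64, parent_start: u64, split_point: u64, parent_end: u64) -> &'static str {
    assert!(parent_start <= tenant_id && tenant_id < parent_end);
    if tenant_id < split_point {
        "left" // [parent_start, split_point)
    } else {
        "right" // [split_point, parent_end)
    }
}

fn main() {
    assert_eq!(owning_child(10, 0, 100, 200), "left");
    assert_eq!(owning_child(100, 0, 100, 200), "right"); // the boundary goes right
    assert_eq!(owning_child(199, 0, 100, 200), "right");
}
```

Because the split point is a tenant ID boundary, every key for a given tenant lands on exactly one side.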
The split operation proceeds through several phases:
- SplitRequested — The split is initiated on the node owning the shard.
- SplitPausing — Traffic to the parent shard is paused (clients receive retryable errors).
- SplitCloning — The underlying SlateDB database is cloned for both child shards, then the shard map is atomically updated to replace the parent with the two children. The shard map update is the commit point.
- SplitComplete — Children are active and ready to serve traffic. The split record is cleaned up.
After a split completes, each child shard contains cloned data from the parent. A background cleanup process then removes any data that falls outside each child’s new range (defunct keys), followed by a compaction to reclaim storage.
For crash recovery, the shard map is the source of truth. If a crash occurs before the shard map update (commit point), the split is abandoned and the parent remains intact. Any orphaned clone databases are left behind but don’t affect correctness—they’re just wasted storage that can be garbage collected later. If a crash occurs after the shard map update, the split is committed and children exist in the shard map.
The clustering approach is modelled on Google Cloud Bigtable, where individual compute nodes each hold leases over some subset of the overall shards. Unlike Bigtable, there’s no central coordinator managing which shards are assigned to which nodes. Instead, we maintain an internal hash ring that assigns the entire shard space across all the nodes, shifting around as few shards as possible when cluster membership changes.
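A minimal consistent-hash-ring sketch in that spirit, assuming a Ketama-style sorted map of virtual-node points (names and parameters are illustrative, not Silo's implementation):

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash point -> member node_id. Each member contributes several virtual nodes;
// a shard is owned by the first point at or after its hash, wrapping around.
struct Ring {
    points: BTreeMap<u64, String>,
}

fn hash<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

impl Ring {
    fn new(members: &[&str], vnodes: u32) -> Self {
        let mut points = BTreeMap::new();
        for m in members {
            for v in 0..vnodes {
                points.insert(hash(&(m, v)), m.to_string());
            }
        }
        Ring { points }
    }

    fn owner(&self, shard_id: u64) -> &str {
        let h = hash(&shard_id);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next()) // wrap around the ring
            .map(|(_, id)| id.as_str())
            .unwrap()
    }
}

fn main() {
    let ring = Ring::new(&["node-a", "node-b", "node-c"], 32);
    // Ownership is deterministic: every node computes the same assignment.
    assert_eq!(ring.owner(7), ring.owner(7));
}
```

Because only points belonging to the joining or leaving member move, most shards keep their existing owner across a membership change.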
Shard coordination plan
Goals:
- Elastic compute membership with fast join/leave
- Shards are the atomic unit of load; clients specify a shard on each request
- Minimal downtime on ownership changes and no split-brain
- etcd is used only for control plane (membership, locks, handover), never per-request on the data plane
Key ideas:
- Membership (ephemeral): Each node registers under `coord/members/<node_id>` with a membership lease (TTL, e.g. 10s). Value includes `{ addr, started_at, node_id, capabilities }`.
- Ring computation: All nodes watch `coord/members/` and deterministically build a consistent hash ring (e.g. Ketama with N virtual nodes per member).
- Ownership: The active owner of shard `<S>` is the node that holds the etcd key `coord/shards/<S>/owner`. Value: `{ owner_id, owner_addr, acquired_at }`. Unlike membership, shard ownership keys are not tied to any TTL-based lease — they persist until explicitly released (graceful shutdown) or force-released (operator action).
- Data plane independence: Nodes serve a shard only if they currently hold its `owner` key and are active members. Requests never consult etcd; they rely on local in-memory ownership state maintained via watches/keepalives.
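The put-if-absent and delete-if-owner operations can be sketched against an in-memory stand-in for the etcd control plane (the real system would use etcd transactions; this mock only illustrates the semantics):

```rust
use std::collections::HashMap;

// Stand-in for the etcd control plane; not a real client.
struct ControlPlane {
    kv: HashMap<String, String>,
}

impl ControlPlane {
    /// "Create only if not exists": at most one node wins the owner key.
    fn put_if_absent(&mut self, key: &str, value: &str) -> bool {
        if self.kv.contains_key(key) {
            false // someone else owns the shard; back off and retry later
        } else {
            self.kv.insert(key.to_string(), value.to_string());
            true
        }
    }

    /// Graceful release: delete only if we are still the recorded owner.
    fn delete_if_owner(&mut self, key: &str, owner: &str) -> bool {
        if self.kv.get(key).map(|v| v.as_str()) == Some(owner) {
            self.kv.remove(key);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut cp = ControlPlane { kv: HashMap::new() };
    assert!(cp.put_if_absent("coord/shards/42/owner", "node-a")); // winner
    assert!(!cp.put_if_absent("coord/shards/42/owner", "node-b")); // loser retries
    assert!(!cp.delete_if_owner("coord/shards/42/owner", "node-b")); // not the owner
    assert!(cp.delete_if_owner("coord/shards/42/owner", "node-a")); // graceful release
    assert!(cp.put_if_absent("coord/shards/42/owner", "node-b")); // next owner wins
}
```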
Node lifecycle:
- Join
  - Create a membership lease (TTL ~10s, keepalive ~2s).
  - Put `coord/members/<node_id>` with the membership lease.
  - Scan for any shard leases already held by this node from a previous run (reclaim existing leases for WAL recovery).
  - Build ring from current `coord/members/*` and compute desired shard set.
  - Reconcile: for each desired shard, try to acquire `coord/shards/<S>/owner` via a put-if-absent transaction. Only start serving a shard after this key has been written.
  - Start watches on `coord/members/*` and `coord/shards/<S>/owner` for shards you own or are next-in-line for.
- Steady state
  - Maintain keepalives for the membership lease. If keepalives fail (channel closes or prolonged error), immediately pause serving all owned shards to avoid split-brain.
  - On membership change events, recompute the ring and run reconciliation: acquire new shards, release shards no longer owned.
- Graceful leave
  - Delete `coord/members/<node_id>` (using the membership lease), triggering ring recompute so contenders begin acquisition retries.
  - Quiesce briefly per shard (1–2s) to drain in-flight work.
  - Explicitly release each `coord/shards/*/owner` key for this node (delete-if-owner); new owners acquire on their next retry.
Safe ring changes
On ring change, we must carefully change ownership of shards with minimal downtime. To do this, we ensure that the new owners start trying to acquire the shard lock before the old owners release it.
The flow:
- A node either adds or removes its membership details under `coord/members/<node_id>`.
- All nodes discover this membership change and update their in-memory hash rings, which shifts which shards they are responsible for.
- New shard owners immediately start trying to acquire the lease for the shards they want to own. Each runs a jittered loop: try a `Put` of `coord/shards/<S>/owner` using “create only if not exists” semantics (i.e., CAS on a missing key). If the key exists, back off and retry; never overwrite.
- Old shard owners wait a little while, ensuring that the new shard owners discover the change and begin their retry loops in advance.
- Old shard owners then relinquish their shard locks (explicit delete-if-owner) and stop serving requests for those shards once that’s happened.
Crash case: the previous owner’s shard lease persists (see Permanent shard leases). An operator must force-release the lease via `siloctl shard force-release` before another node can acquire it.
Correctness and split-brain avoidance
Only the node holding `coord/shards/<S>/owner` and with active membership may serve `<S>`. All others must return `NOT_OWNER` immediately. Nodes must pause serving a shard if their membership keepalive fails, even if they still hold the shard lease, because another node might need to take over.
Failure modes:
- Crash: shard leases persist (permanent). The crashed node retains its shard leases, preventing another node from accidentally opening the shard and losing WAL data. When the node restarts with the same `node_id`, it reclaims its leases and recovers the WAL. If the node is permanently lost, an operator must force-release the shard leases via `siloctl shard force-release`.
- Network partition: the owner loses its membership keepalive and must cease serving. Unlike a crash, the shard lease is still held, so no other node can acquire it until it’s released or force-released.
- Simultaneous join: consistent hashing minimizes movement; CAS on `owner` guarantees a single winner.
Permanent shard leases
Shard leases are permanent — they are not tied to any TTL-based lease and survive node crashes. This is critical for preventing WAL data loss when using a local filesystem for the write-ahead log.
Why permanent leases? When a node writes to a shard, data first goes to the local WAL before being flushed to object storage. If a node crashes before the WAL is flushed, that data only exists on the crashed node’s local disk. If another node were to immediately acquire the shard (as would happen with TTL-based ephemeral leases), it would open the shard from object storage and miss the unflushed WAL data, causing data loss.
With permanent leases:
- Node crashes — the shard lease persists, blocking other nodes from acquiring the shard.
- Node restarts — if the node restarts with the same `node_id` (e.g., a Kubernetes StatefulSet pod), it finds its existing shard leases, reclaims them, opens the shard databases, and recovers the WAL data from local disk. No data is lost.
- Node permanently lost — if a node will never come back, an operator uses `siloctl shard force-release <shard-id>` to release the lease. The operator accepts that any unflushed WAL data on that node is lost. Another node then acquires the shard and starts serving it from the last state in object storage.
Configuration for permanent leases: For permanent leases to work correctly across restarts, the `node_id` must be stable. In Kubernetes, set `node_id = "${POD_NAME}"` so that a restarted StatefulSet pod re-acquires its own leases. If `node_id` is not configured, a random UUID is generated on each startup, meaning a restarted node won’t reclaim its previous leases.
When local WAL is not used: If the WAL is stored in object storage (same backend as the main data), permanent leases still apply but are less critical — there’s no local data to lose on crash. An operator could force-release shards from a crashed node without risking data loss in this configuration.
Request handling and redirection
On each RPC, the server checks a local map of owned shards. If it is not the owner, it returns `NOT_OWNER` with an optional hint `{ owner_id, owner_addr }` read from the cached owner value. Clients retry a few times and may redirect using the hint. Servers never query the coordinator on the request path.
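A sketch of that request-path check, using illustrative types (`RouteResult` and `OwnerHint` are assumptions, not Silo's API):

```rust
use std::collections::HashMap;

// Hint returned alongside NOT_OWNER, read from locally cached owner values.
struct OwnerHint {
    owner_id: String,
    owner_addr: String,
}

enum RouteResult {
    Serve,
    NotOwner(Option<OwnerHint>),
}

// Consult only local in-memory state; never the coordinator.
fn route(owned: &HashMap<u64, ()>, hints: &HashMap<u64, OwnerHint>, shard: u64) -> RouteResult {
    if owned.contains_key(&shard) {
        RouteResult::Serve
    } else {
        RouteResult::NotOwner(hints.get(&shard).map(|h| OwnerHint {
            owner_id: h.owner_id.clone(),
            owner_addr: h.owner_addr.clone(),
        }))
    }
}

fn main() {
    let mut owned = HashMap::new();
    owned.insert(1u64, ());
    let hints = HashMap::new();
    assert!(matches!(route(&owned, &hints, 1), RouteResult::Serve));
    assert!(matches!(route(&owned, &hints, 2), RouteResult::NotOwner(None)));
}
```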
Concurrency Queues
Concurrency queues limit how many jobs with the same queue key can run simultaneously. They work like a ticket system: a job must hold a ticket for a queue before it can proceed to execution, and each queue has a limited number of tickets available.
Requesters and holders
Section titled “Requesters and holders”The two key concepts are requesters and holders:
- A holder is a task that has been granted a ticket and is occupying one of the queue’s concurrency slots. It’s either currently running on a worker, sitting in the task queue waiting to be picked up, or in the process of being leased. Holders are stored durably in SlateDB (prefix `0x09`) and also tracked in an in-memory set for fast capacity checks.
- A requester is a job that wants a ticket while the queue is full. Requesters are stored durably in SlateDB (prefix `0x08`), ordered by `(start_time, priority, request_id)` so the most eligible job is always first in the scan. Requesters don’t have a corresponding task in the task queue — they’re parked, waiting for a holder to finish and free up a slot.
When a holder finishes (or is cancelled), the system promotes the next eligible requester into a holder — this is the core “release and grant next” operation.
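A simplified sketch of that release-and-grant-next operation, assuming in-memory stand-ins for the durable holder and requester records (types are illustrative, not Silo's actual structures):

```rust
use std::collections::{BTreeMap, HashSet};

// Requesters keyed by (start_time, priority, request_id), matching the
// ordering of the durable 0x08 keys, so iteration yields the most eligible
// requester first.
struct Queue {
    max_concurrency: usize,
    holders: HashSet<u64>,
    requesters: BTreeMap<(u64, u8, u64), u64>, // key -> task_id
}

impl Queue {
    /// A holder finished: free its slot and promote the first eligible requester.
    fn release_and_grant_next(&mut self, finished: u64) -> Option<u64> {
        self.holders.remove(&finished);
        if self.holders.len() < self.max_concurrency {
            if let Some((&key, &task)) = self.requesters.iter().next() {
                self.requesters.remove(&key);
                self.holders.insert(task);
                return Some(task); // this task now holds a ticket
            }
        }
        None
    }
}

fn main() {
    let mut q = Queue {
        max_concurrency: 1,
        holders: HashSet::from([10]),
        requesters: BTreeMap::from([((5, 0, 1), 11), ((9, 0, 2), 12)]),
    };
    assert_eq!(q.release_and_grant_next(10), Some(11)); // earliest requester promoted
    assert!(q.holders.contains(&11));
}
```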
Why in-memory counts?
Checking concurrency capacity requires knowing how many holders exist for a queue. Scanning the durable holder keys in SlateDB on every enqueue or grant would be too slow, so we maintain an in-memory `HashMap<queue, HashSet<task_id>>` of current holders (`ConcurrencyCounts` in `src/concurrency.rs`). This makes capacity checks a fast set-size lookup.
The in-memory counts are lazily hydrated: the first time a queue is accessed, we scan its durable holder keys to populate the set. After that, the in-memory state is kept in sync by updating it on every grant and release. The hydration scan filters by shard range to avoid double-counting holders that were cloned during a shard split.
The tricky part is keeping in-memory counts consistent with durable state. We use an optimistic reservation strategy: in-memory counts are updated before the DB write, and if the write fails, the caller must invoke explicit rollback methods. This eliminates any TOCTOU window where two concurrent operations could both see available capacity and double-grant.
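A sketch of that optimistic reservation pattern (`ConcurrencyCountsSketch` and its methods are illustrative, not the actual `src/concurrency.rs` API):

```rust
use std::collections::{HashMap, HashSet};

// In-memory holder sets per queue, mirroring the durable holder records.
struct ConcurrencyCountsSketch {
    holders: HashMap<String, HashSet<u64>>,
}

impl ConcurrencyCountsSketch {
    /// Reserve a slot in memory *before* the DB write, closing the TOCTOU window:
    /// a concurrent caller now sees the queue as full.
    fn try_reserve(&mut self, queue: &str, task: u64, max: usize) -> bool {
        let set = self.holders.entry(queue.to_string()).or_default();
        if set.len() < max {
            set.insert(task)
        } else {
            false
        }
    }

    /// Caller must invoke this if the subsequent durable write fails.
    fn rollback(&mut self, queue: &str, task: u64) {
        if let Some(set) = self.holders.get_mut(queue) {
            set.remove(&task);
        }
    }
}

fn main() {
    let mut counts = ConcurrencyCountsSketch { holders: HashMap::new() };
    assert!(counts.try_reserve("q", 1, 1));
    assert!(!counts.try_reserve("q", 2, 1)); // full: the job would become a requester
    counts.rollback("q", 1); // pretend the DB write failed
    assert!(counts.try_reserve("q", 2, 1)); // the slot is available again
}
```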
Enqueue flow
When a job with concurrency limits is enqueued, limits are processed sequentially:
- Immediate grant: Check the in-memory holder count against `max_concurrency`. If there’s room, atomically reserve a slot in memory, write a holder record to the batch, and advance to the next limit.
- Queued: If the queue is full and the job is ready to run now, write a requester record to the batch. No task goes in the task queue — the job is parked until a slot opens.
- Future-scheduled: If the queue is full and the job has a future start time, write a `RequestTicket` task to the task queue at that start time. When it fires, it re-attempts the grant.
- All limits passed: Once every limit is satisfied, a `RunAttempt` task is written to the task queue. This task carries a `held_queues` list of all concurrency queues it holds tickets in, which is used to release them when the task completes.
Multiple limits per job
Jobs can specify multiple limits (concurrency, floating concurrency, and rate limits), which are checked in order. `enqueue_limit_task_at_index` loops through them, accumulating granted queues as it goes:
- Each granted concurrency limit adds its queue to the `held_queues` list.
- If any limit blocks, the function returns early. The job resumes from that limit index when the blocking condition clears.
- Rate limits work differently — they create a `CheckRateLimit` task, and when that passes, the limit loop continues from where it left off with the accumulated `held_queues`.
- When all limits pass, the final `RunAttempt` task gets all the `held_queues`.
On retry, limit processing restarts from scratch — the job must re-acquire all concurrency tickets.
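The limit loop described above can be sketched as follows (the enum and function names are hypothetical, and only the concurrency-limit case is modelled):

```rust
// Illustrative stand-in for a job's ordered limits.
enum Limit {
    Concurrency { queue: String, has_capacity: bool },
}

enum Outcome {
    RunAttempt { held_queues: Vec<String> },
    Blocked { at_index: usize, held_queues: Vec<String> },
}

// Loop from `start_index`, accumulating granted queues; return early on the
// first blocking limit so the job can resume from that index later.
fn process_limits(limits: &[Limit], start_index: usize, mut held_queues: Vec<String>) -> Outcome {
    for (i, limit) in limits.iter().enumerate().skip(start_index) {
        match limit {
            Limit::Concurrency { queue, has_capacity } => {
                if *has_capacity {
                    held_queues.push(queue.clone()); // granted: keep going
                } else {
                    return Outcome::Blocked { at_index: i, held_queues };
                }
            }
        }
    }
    Outcome::RunAttempt { held_queues }
}

fn main() {
    let limits = vec![
        Limit::Concurrency { queue: "a".into(), has_capacity: true },
        Limit::Concurrency { queue: "b".into(), has_capacity: false },
    ];
    match process_limits(&limits, 0, Vec::new()) {
        Outcome::Blocked { at_index, held_queues } => {
            assert_eq!(at_index, 1);
            assert_eq!(held_queues, vec!["a".to_string()]); // ticket for "a" is kept
        }
        _ => panic!("expected blocked"),
    }
}
```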
Floating max concurrency limits
Silo supports concurrency queues whose max concurrency is adjusted continuously over time. This allows queues to be adjusted by operators, as well as by automated controllers trying to achieve some setpoint in another system, such as rate-limit usage of a downstream API.
Floating max concurrency limits are specified by each job when the job is enqueued, which puts the job in that concurrency queue. The max concurrency for that queue defaults to an initial value and is then periodically refreshed. If too many jobs are currently running, no more will start until the queue is below its new max concurrency; likewise, if the queue now has more headroom, more jobs will start. The max concurrency value for the queue is refreshed by sending a special type of refresh task to any listening Silo worker, which can then run whatever business logic is necessary to compute a new value.
We’re careful to only push refresh tasks for queues that are actively in use.
Gubernator rate limiting
Incoming jobs may specify that they need to pass a Gubernator rate limit to proceed. For each attempt, Silo will make RPCs to Gubernator to check these rate limits before allowing the attempt to run. This is done by enqueuing and running a task that polls the rate limit in Gubernator. If the rate limit passes, we enqueue the actual task to run the attempt, and if it doesn’t, we re-enqueue a task for the future to poll the rate limit again.
Big decisions
Section titled “Big decisions”Do we group data by tenant or not?
Silo currently groups all the data for a given tenant into the same SlateDB instance. This has major performance benefits, as all transactions for the hot path can transact locally, in-memory, on the one compute node that has that tenant’s shard open. This allows for much higher throughput and lower resource utilization than something that uses consensus or distributed transactions for every state change. When using a local WAL for maximum performance, permanent shard leases ensure the node retains shard ownership through crashes, allowing WAL recovery on restart without data loss.
There are a couple of downsides to this:
- There is a maximum scale any one tenant can reach. It differs by workload, but we’ve measured it to be about 4k jobs / second / shard.
The alternative would be to not group data by tenant, and instead spread every tenant’s data out across a subset (or all) shards in the cluster. This would allow for much higher scale per individual tenant, as you could bring the full might of the cluster to bear on a single tenant’s workload. We deemed the downsides of this approach too severe however:
- Silo’s concurrency queue primitive offers global coordination, where jobs on different shards would need to collaborate to only let the max concurrency proceed at any one time. With cross-shard data storage, we’d need to build an expensive and complicated distributed transaction or inbox/outbox system for mutating data on different shards.
- Introspection queries get more expensive, as now all of them need to do a scatter-gather to check all data on all shards in the system, rather than just routing directly to the shard storing data for a given tenant.
Temporal and Restate both take the more fundamentally scalable approach of not grouping data by tenant, but offer a higher cost per state change because of this.