Introduction

Silo is a durable, horizontally-scalable background job queue built on top of object storage via SlateDB. It brokers work between your application and your workers over gRPC, storing all job data in object storage for durability and low cost.

Most background job systems fall into two camps:

  • Redis-based queues (Sidekiq, BullMQ, Celery) are simple and fast, but aren’t truly durable. A Redis crash or eviction can lose jobs that were already acknowledged.
  • Workflow engines (Temporal, Restate) are highly durable and scalable, but are complex and expensive to operate, with heavyweight client-side execution environments and multi-step workflow semantics.

Silo sits in between. It combines the simplicity and speed of a traditional job queue with the durability and horizontal scalability of a workflow engine. Jobs are stored in object storage, so there are no Redis disappearing acts, and the compute layer is stateless and elastic.

  • Durable: job data is stored in object storage. No data loss on crashes or scaling events.
  • Horizontally scalable: add nodes to increase throughput. Silo automatically distributes shards across nodes.
  • Multi-tenant: high-cardinality tenancy is built in. Each tenant’s data lives on a single shard for fast, local transactions.
  • Concurrency limits: built-in concurrency and rate limiting with high-cardinality keys and dynamic floating limits.
  • Job lifecycle management: retries with exponential backoff, future scheduling, expediting, cancellation, restart, and deletion.
  • Job results: workers can return results that are stored and retrievable by the enqueuing process.
  • Observability: Prometheus metrics, OpenTelemetry tracing, structured logging, and on-demand CPU profiling.
  • SQL queries: inspect jobs with SQL via the built-in query engine.
  • Web UI: operator-facing dashboard for cluster health, queue inspection, and debugging.

A job is a unit of work to be processed. Each job has a payload (arbitrary JSON), a task group, an optional tenant, and optional configuration for retries, priority, scheduling, and limits.
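
As a rough sketch, a job's shape might look like the following. The field names here are illustrative assumptions, not Silo's actual API:

```typescript
// Illustrative job shape; field names are assumptions, not Silo's exact API.
interface Job {
  payload: Record<string, unknown>; // arbitrary JSON
  taskGroup: string;                // routes the job to a worker pool
  tenant?: string;                  // optional isolation boundary
  maxRetries?: number;              // retry configuration
  priority?: number;
  scheduledAt?: Date;               // future scheduling
}

const job: Job = {
  payload: { orderId: "ord_123", action: "send-receipt" },
  taskGroup: "emails",
  tenant: "customer-42",
  maxRetries: 5,
};
```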

A task group determines which workers process a job. Workers poll for tasks from a specific task group, so you can route different types of work to specialized worker pools and scale them independently.

A tenant is an isolation boundary for job data. All data for a tenant lives on a single shard, enabling fast local transactions. Tenants are routed to shards using hash-based routing (XXH64), which distributes tenants uniformly across the hash space regardless of naming patterns. Tenants can be very high cardinality (millions), but each individual tenant is bounded by a single shard’s throughput (roughly 4,000 jobs/second).

Silo partitions data across shards. Each shard owns a range of the 64-bit hash space. When a job is enqueued for a tenant, Silo hashes the tenant ID (XXH64) and routes it to the shard whose range contains that hash. Shards are backed by their own SlateDB instance in object storage, assigned to compute nodes, and can be split dynamically as load grows. Shard ownership is coordinated via etcd or Kubernetes.
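
The routing step above can be sketched as follows. Silo hashes tenant IDs with XXH64; this example substitutes a 64-bit FNV-1a hash purely to stay self-contained, and the shard layout is a made-up two-shard split:

```typescript
// Sketch of hash-range routing. Silo uses XXH64; a 64-bit FNV-1a stand-in
// keeps this example dependency-free.
function fnv1a64(s: string): bigint {
  let h = 0xcbf29ce484222325n;
  const prime = 0x100000001b3n;
  for (const ch of s) {
    h ^= BigInt(ch.codePointAt(0)!);
    h = (h * prime) & 0xffffffffffffffffn; // wrap to 64 bits
  }
  return h;
}

// Each shard owns a half-open range [start, end) of the 64-bit hash space.
interface Shard { id: number; start: bigint; end: bigint }

function routeTenant(tenantId: string, shards: Shard[]): Shard {
  const h = fnv1a64(tenantId);
  const shard = shards.find((s) => h >= s.start && h < s.end);
  if (!shard) throw new Error("hash space not fully covered");
  return shard;
}

// A hypothetical cluster with two shards splitting the space in half.
const MID = 1n << 63n;
const shards: Shard[] = [
  { id: 0, start: 0n, end: MID },
  { id: 1, start: MID, end: 1n << 64n },
];

const owner = routeTenant("customer-42", shards);
```

Because routing is purely a function of the hash, every node can compute a tenant's shard locally, and splitting a shard only requires dividing its range.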

Workers are your application processes that poll Silo for tasks, execute them, and report outcomes. Workers communicate with Silo over gRPC and can be written in any language. Silo provides a TypeScript client with built-in worker support.
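
A worker's poll-execute-report cycle might be sketched like this. The `SiloClient` interface is a hypothetical stand-in for Silo's gRPC surface, not its actual TypeScript client:

```typescript
// Sketch of one worker poll cycle. `SiloClient`, `lease`, `complete`, and
// `fail` are assumed names, not Silo's real API.
interface Task {
  id: string;
  payload: Record<string, unknown>;
}

interface SiloClient {
  lease(taskGroup: string): Promise<Task | null>;            // poll for a task
  complete(taskId: string, result?: unknown): Promise<void>; // report success
  fail(taskId: string, error: string): Promise<void>;        // report failure
}

// Lease one task, run the handler, and report the outcome.
// Returns true if a task was processed, false if none was available.
async function pollOnce(
  client: SiloClient,
  taskGroup: string,
  handler: (task: Task) => Promise<unknown>,
): Promise<boolean> {
  const task = await client.lease(taskGroup);
  if (!task) return false;
  try {
    const result = await handler(task);
    await client.complete(task.id, result);  // success, optionally with a result
  } catch (err) {
    await client.fail(task.id, String(err)); // Silo retries per the job's policy
  }
  return true;
}
```

A real worker would run this in a loop, with long-polling or backoff when no tasks are available.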

  1. Your application enqueues a job by calling Silo’s gRPC API with a payload and task group.
  2. Silo hashes the tenant ID and stores the job in the appropriate shard, making it available for dequeue.
  3. A worker polling that task group leases the task and processes it.
  4. The worker reports the outcome (success with optional result, or failure) back to Silo.
  5. On failure, Silo automatically retries according to the job’s retry policy.
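
For step 5, an exponential backoff schedule is typically a doubling delay with a cap. The formula and defaults below are illustrative assumptions, not Silo's documented retry policy:

```typescript
// Illustrative exponential backoff: base delay doubles each attempt, capped.
// The base, cap, and formula are assumptions, not Silo's actual defaults.
function retryDelayMs(attempt: number, baseMs = 1_000, maxMs = 60_000): number {
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}

// attempt 1 → 1 s, attempt 2 → 2 s, attempt 3 → 4 s, ... capped at 60 s
```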

Silo’s design and implementation have been influenced by several projects and papers:

  • SlateDB — Silo’s storage engine, providing LSM-tree storage on top of object storage
  • Temporal — inspiration for durable execution semantics and workflow reliability patterns
  • BullMQ — a mature Node.js job queue whose API and feature set informed Silo’s developer experience
  • Resque — the original Redis-backed background job system for Ruby, a foundational influence on job queue design
  • Que — a Ruby job queue using PostgreSQL advisory locks, demonstrating that durable storage can back a high-performance queue
  • QuiCK — Apple’s Queues in FoundationDB paper, which influenced Silo’s approach to building a queue on top of an ordered key-value store
  • Bigtable — Google’s distributed storage system, which influenced Silo’s approach to hash-based sharding and tablet-style data partitioning
  • Redis — inspiration for executing small business logic scripts close to the data, akin to Redis server-side Lua scripting