# Observability
Silo exposes Prometheus metrics for monitoring system health, performance, and capacity. This guide covers the available metrics, how to configure the metrics endpoint, and recommended alerting strategies.
## Metrics Endpoint

Silo exposes metrics in Prometheus format on a separate HTTP port. Configure the address in your configuration file:

```toml
[metrics]
enabled = true
addr = "0.0.0.0:9090"
```

Metrics are available at the `/metrics` endpoint:

```sh
curl http://localhost:9090/metrics
```

## Available Metrics

### Job Lifecycle Metrics

These metrics track jobs as they flow through the system.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_jobs_enqueued_total` | Counter | `shard`, `tenant` | Total number of jobs enqueued |
| `silo_jobs_dequeued_total` | Counter | `shard`, `task_group` | Total number of tasks dequeued for execution |
| `silo_jobs_completed_total` | Counter | `shard`, `status` | Total jobs completed. Status is `succeeded`, `failed`, or `cancelled` |
| `silo_job_attempts_total` | Counter | `shard`, `task_group`, `is_retry` | Total job attempts started. `is_retry=true` for attempts after the first |
| `silo_job_wait_time_seconds` | Histogram | `shard`, `task_group` | Time jobs spent in queue before being dequeued (enqueue-to-dequeue latency) |
Key insights:

- Compare `silo_jobs_enqueued_total` vs `silo_jobs_dequeued_total` to detect queue buildup
- High `silo_job_wait_time_seconds` indicates workers can’t keep up with incoming jobs
- Track `is_retry=true` in `silo_job_attempts_total` to monitor retry rates
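As a sketch, the comparisons above can be expressed in PromQL (assuming standard Prometheus histogram `_bucket` series; the windows and grouping are illustrative, not official recommendations):

```promql
# Net queue growth per shard over 5 minutes: sustained positive values mean buildup
sum by (shard) (rate(silo_jobs_enqueued_total[5m]))
  - sum by (shard) (rate(silo_jobs_dequeued_total[5m]))

# p95 queue wait time per task group
histogram_quantile(0.95,
  sum by (task_group, le) (rate(silo_job_wait_time_seconds_bucket[5m])))

# Retry ratio: share of attempts that are retries
sum(rate(silo_job_attempts_total{is_retry="true"}[5m]))
  / sum(rate(silo_job_attempts_total[5m]))
```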
### Lease Metrics

Leases represent tasks actively being processed by workers.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_task_leases_active` | Gauge | `shard`, `task_group` | Number of tasks currently leased to workers |
| `silo_ready_to_start_latency_ms` | Histogram | `shard`, `task_group` | Time between when a task became ready and when it was first leased (in milliseconds) |
| `silo_lease_reaper_duration_seconds` | Histogram | `shard` | Duration of expired-lease reaper scan operations |
| `silo_lease_reaper_scans_total` | Counter | `shard` | Total number of expired-lease reaper scan operations |
Key insights:

- `silo_task_leases_active` shows in-flight work at any given moment
- Sudden drops may indicate worker crashes or network issues
- Compare against worker count to understand utilization
- `silo_ready_to_start_latency_ms` measures scheduling delay — high values indicate workers aren’t polling fast enough or broker scan intervals are too long. Unlike `silo_job_wait_time_seconds` (which measures total time from enqueue), this metric isolates the time a task sat ready but unleased
- `silo_lease_reaper_duration_seconds` tracks how long the expired-lease reaper takes to scan all leases per shard. High values indicate a large number of active leases or database pressure, and can contribute to increased CPU usage
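A hedged PromQL sketch of the lease signals above (the drop threshold and `offset` window are illustrative):

```promql
# Active leases dropped below half of their level 10 minutes ago
# (possible worker crash)
sum by (shard) (silo_task_leases_active)
  < 0.5 * sum by (shard) (silo_task_leases_active offset 10m)

# p99 scheduling delay (ms) per task group
histogram_quantile(0.99,
  sum by (task_group, le) (rate(silo_ready_to_start_latency_ms_bucket[5m])))
```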
### Broker Metrics

The task broker maintains an in-memory buffer of ready tasks for efficient dequeue operations.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_broker_buffer_size` | Gauge | `shard`, `task_group` | Number of tasks in the broker’s in-memory buffer per task group |
| `silo_broker_inflight_size` | Gauge | `shard`, `task_group` | Tasks claimed but not yet durably leased, per task group |
| `silo_broker_scan_duration_seconds` | Histogram | `shard` | Duration of broker task scanning operations |
| `silo_broker_scans_total` | Counter | `shard` | Total number of broker task scan operations |
| `silo_broker_scan_tasks_read_total` | Counter | `shard`, `task_group`, `outcome` | Tasks read during scans, broken down by outcome: `inserted`, `skipped_future`, `skipped_inflight`, `skipped_tombstone`, `skipped_already_buffered`, `skipped_defunct` |
| `silo_broker_tombstone_count` | Gauge | `shard`, `task_group` | Number of ack tombstones currently held by the broker |
Key insights:

- `silo_broker_buffer_size` near 0 with pending work may indicate scan issues
- Break down `silo_broker_buffer_size` by `task_group` to identify which queues are starved vs saturated
- High `silo_broker_scan_duration_seconds` suggests database pressure
- `silo_broker_inflight_size` should stay low and transient; persistently high values indicate dequeue bottlenecks
- A high rate of `silo_broker_scan_tasks_read_total` with `outcome="skipped_tombstone"` indicates the scanner is repeatedly encountering recently acked keys that haven’t been compacted away yet
- Growing `silo_broker_tombstone_count` may signal that dequeued task keys are persisting in the DB longer than expected
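The tombstone signals can be sketched in PromQL (windows are illustrative):

```promql
# Fraction of scanned tasks skipped due to tombstones, per shard
sum by (shard) (rate(silo_broker_scan_tasks_read_total{outcome="skipped_tombstone"}[10m]))
  / sum by (shard) (rate(silo_broker_scan_tasks_read_total[10m]))

# Tombstone count still growing over the last hour
delta(silo_broker_tombstone_count[1h]) > 0
```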
### Shard & Coordination Metrics

These metrics track distributed shard ownership across the cluster.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_shards_owned` | Gauge | - | Number of shards owned by this node (from the coordinator) |
| `silo_coordination_shards_open` | Gauge | - | Number of shards currently open in this process |
Key insights:

- `silo_shards_owned` should match `silo_coordination_shards_open` after convergence
- Discrepancies indicate shard acquisition/release in progress
- Use for capacity planning: total shards / nodes = shards per node
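A minimal PromQL sketch of the convergence check (this assumes both gauges carry matching instance labels, which is an assumption about Silo’s exporter):

```promql
# Nodes where owned and open shard counts have not converged
silo_shards_owned != silo_coordination_shards_open
```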
### Concurrency Metrics

Concurrency limits control how many jobs with the same concurrency key can run simultaneously.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_concurrency_tickets_granted_total` | Counter | - | Total concurrency tickets granted |
Key insights:

- Track the rate of `silo_concurrency_tickets_granted_total` to understand throughput through concurrency-limited queues
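For example, as a PromQL sketch (window illustrative):

```promql
# Concurrency tickets granted per second
rate(silo_concurrency_tickets_granted_total[5m])
```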
### gRPC Metrics

These metrics track gRPC API performance.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_grpc_requests_total` | Counter | `method`, `status` | Total gRPC requests by method and status |
| `silo_grpc_request_duration_seconds` | Histogram | `method` | gRPC request latency by method |
Key insights:

- Monitor error rates via the `status` label (look for non-OK statuses)
- `silo_grpc_request_duration_seconds` helps identify slow operations
- High latency on `LeaseTasks` may indicate database or broker issues
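A PromQL sketch of these checks. The exact `status` label values (e.g. `OK`) depend on how Silo encodes gRPC statuses; treat them as an assumption:

```promql
# gRPC error ratio per method (non-OK statuses; "OK" value is assumed)
sum by (method) (rate(silo_grpc_requests_total{status!="OK"}[5m]))
  / sum by (method) (rate(silo_grpc_requests_total[5m]))

# p95 latency for LeaseTasks
histogram_quantile(0.95,
  sum by (le) (rate(silo_grpc_request_duration_seconds_bucket{method="LeaseTasks"}[5m])))
```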
## SlateDB Storage Metrics

Silo uses SlateDB as its underlying embedded key-value storage engine. These metrics expose SlateDB’s internal statistics for monitoring storage-layer health and performance.

### Database Operations

| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_slatedb_get_requests_total` | Counter | `shard` | Total number of GET (read) requests to SlateDB |
| `silo_slatedb_scan_requests_total` | Counter | `shard` | Total number of scan (range query) requests |
| `silo_slatedb_write_ops_total` | Counter | `shard` | Total number of individual write operations |
| `silo_slatedb_write_batch_count_total` | Counter | `shard` | Total number of write batches |
| `silo_slatedb_flush_requests_total` | Counter | `shard` | Total number of flush requests to SlateDB |
| `silo_slatedb_backpressure_count_total` | Counter | `shard` | Number of times writes were blocked by back-pressure |
| `silo_slatedb_total_mem_size_bytes` | Gauge | `shard` | Total memory usage of SlateDB (memtables, WAL buffers, etc.) |
Key insights:

- High `silo_slatedb_backpressure_count_total` indicates the storage layer is under write pressure
- Compare `silo_slatedb_write_ops_total` to `silo_slatedb_write_batch_count_total` to understand batching efficiency
- `silo_slatedb_total_mem_size_bytes` helps track memory pressure per shard — sudden increases may indicate write spikes or slow flushes
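These two checks can be sketched in PromQL (windows illustrative):

```promql
# Back-pressure events per second (should normally be near zero)
rate(silo_slatedb_backpressure_count_total[5m])

# Average writes per batch (batching efficiency)
rate(silo_slatedb_write_ops_total[5m])
  / rate(silo_slatedb_write_batch_count_total[5m])
```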
### WAL (Write-Ahead Log) Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_slatedb_wal_buffer_estimated_bytes` | Gauge | `shard` | Estimated bytes buffered in the WAL buffer |
| `silo_slatedb_wal_buffer_flushes_total` | Counter | `shard` | Total number of WAL buffer flushes |
| `silo_slatedb_immutable_memtable_flushes_total` | Counter | `shard` | Total number of immutable memtable flushes to SSTs |
Key insights:

- `silo_slatedb_wal_buffer_estimated_bytes` shows pending writes not yet durably flushed
- Monitor the rate of `silo_slatedb_immutable_memtable_flushes_total` to understand flush frequency
- High WAL buffer sizes may indicate slow object storage writes
### SST Filter (Bloom Filter) Metrics

SlateDB uses bloom filters to avoid unnecessary SST reads. These metrics track filter effectiveness.

| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_slatedb_sst_filter_positives_total` | Counter | `shard` | True positives: key exists and the filter said yes |
| `silo_slatedb_sst_filter_negatives_total` | Counter | `shard` | True negatives: key absent and the filter said no (avoided read) |
| `silo_slatedb_sst_filter_false_positives_total` | Counter | `shard` | False positives: key absent but the filter said yes (wasted read) |
Key insights:

- A high rate of `silo_slatedb_sst_filter_negatives_total` indicates filters are effective at avoiding reads
- `silo_slatedb_sst_filter_false_positives_total` / total lookups gives the false positive rate
- A false positive rate above 1-2% may indicate bloom filter tuning is needed
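The false positive rate from the bullet above, as a PromQL sketch (total lookups taken as positives + negatives + false positives):

```promql
# Bloom filter false-positive rate per shard
sum by (shard) (rate(silo_slatedb_sst_filter_false_positives_total[15m]))
  / (
      sum by (shard) (rate(silo_slatedb_sst_filter_positives_total[15m]))
    + sum by (shard) (rate(silo_slatedb_sst_filter_negatives_total[15m]))
    + sum by (shard) (rate(silo_slatedb_sst_filter_false_positives_total[15m]))
  )
```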
### Compaction Metrics

SlateDB periodically compacts SST files to reclaim space and improve read performance.

| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_slatedb_bytes_compacted_total` | Counter | `shard` | Total number of bytes compacted |
| `silo_slatedb_running_compactions` | Gauge | `shard` | Number of compactions currently running |
| `silo_slatedb_last_compaction_ts_seconds` | Gauge | `shard` | Unix timestamp of the last compaction |
| `silo_slatedb_l0_sst_count` | Gauge | `shard` | Number of Level-0 SSTs (high values indicate compaction lag) |
Key insights:

- `silo_slatedb_running_compactions` > 0 indicates compaction is actively working
- Monitor the rate of `silo_slatedb_bytes_compacted_total` to understand compaction throughput
- If `silo_slatedb_last_compaction_ts_seconds` is very old, compaction may be stuck or disabled
- High `silo_slatedb_l0_sst_count` means scans must merge across many unsorted files, directly increasing scan latency. This is the most important metric for diagnosing slow scans
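A hedged sketch of compaction alerts in PromQL (the L0 threshold is illustrative; tune it for your workload):

```promql
# Compaction staleness: seconds since the last compaction, per shard
time() - silo_slatedb_last_compaction_ts_seconds

# L0 SST count staying high suggests compaction lag (threshold illustrative)
silo_slatedb_l0_sst_count > 8
```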
### Cache Metrics

SlateDB maintains an in-memory block cache for SST data blocks, index blocks, and bloom filters. These metrics track cache effectiveness — low hit rates indicate that reads are falling through to object storage.

| Metric | Type | Labels | Description |
|---|---|---|---|
| `silo_slatedb_cache_data_block_hit_total` | Counter | `shard` | Data block cache hits |
| `silo_slatedb_cache_data_block_miss_total` | Counter | `shard` | Data block cache misses (requires an object storage read) |
| `silo_slatedb_cache_index_hit_total` | Counter | `shard` | Index block cache hits |
| `silo_slatedb_cache_index_miss_total` | Counter | `shard` | Index block cache misses |
| `silo_slatedb_cache_filter_hit_total` | Counter | `shard` | Bloom filter cache hits |
| `silo_slatedb_cache_filter_miss_total` | Counter | `shard` | Bloom filter cache misses |
Key insights:

- Compute hit rates as `hit / (hit + miss)` for each cache tier
- A low data block hit rate combined with a high L0 SST count strongly suggests compaction isn’t keeping up
- Low filter/index cache hit rates may indicate the cache is too small for the working set — consider increasing SlateDB’s block cache size
- Filter cache misses are especially expensive because they force a full block read to check for key existence
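The hit-rate computation for the data block tier, as a PromQL sketch (the index and filter tiers follow the same pattern):

```promql
# Data block cache hit rate per shard
rate(silo_slatedb_cache_data_block_hit_total[10m])
  / (
      rate(silo_slatedb_cache_data_block_hit_total[10m])
    + rate(silo_slatedb_cache_data_block_miss_total[10m])
  )
```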
## Tracing

Silo supports OpenTelemetry tracing via the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable. When set, Silo exports traces using the OTLP protocol:

```sh
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 silo -c config.toml
```

Key spans include:

- `enqueue` - Job enqueue operations
- `dequeue` - Task dequeue and lease creation
- `report_outcome` - Attempt completion reporting
- `concurrency.grant` / `concurrency.release` - Concurrency ticket operations
## Logging

Silo uses structured logging via tracing. Configure the log output format in your configuration file:

```toml
[logging]
format = "json" # "text" (default, human-readable) or "json" (structured)
```

Control log verbosity with the `RUST_LOG` environment variable:

```sh
RUST_LOG=info silo -c config.toml                      # General info level
RUST_LOG=silo::coordination=trace silo -c config.toml  # Trace a specific module
```

Key log events to monitor:

- `shard opened/closed` - Shard lifecycle
- `lease expired` - Worker crashes or timeouts
- `rate limit check failed` - Gubernator connectivity issues
- `failed to close shard` - Graceful shutdown problems
## CPU Profiling

Silo supports on-demand CPU profiling for production debugging. Profiles are captured using pprof-rs, a low-overhead sampling profiler, and returned in standard pprof protobuf format.

### Capturing a Profile

Use siloctl to capture a CPU profile from a running node:

```sh
siloctl -a http://silo-node:7450 profile --duration 30
```

Options:

- `--duration`, `-d`: Profile duration in seconds (1-300, default 30)
- `--frequency`, `-f`: Sampling frequency in Hz (1-1000, default 100)
- `--output`, `-o`: Output file path (default: `profile-{timestamp}.pb.gz`)

Example with all options:

```sh
siloctl -a http://silo-node:7450 profile \
  --duration 60 \
  --frequency 250 \
  --output my-profile.pb.gz
```

### Analyzing Profiles
Profiles are saved in pprof protobuf format (gzip compressed). Analyze with either of these tools:

Using the pprof CLI:

```sh
# Install pprof if needed
go install github.com/google/pprof@latest

# Open interactive web UI
pprof -http=:8080 profile-1706123456.pb.gz
```

Using go tool pprof:

```sh
go tool pprof -http=:8080 profile-1706123456.pb.gz
```

Both tools open an interactive web UI with:
- Flame graphs for visualizing hot paths
- Top functions by CPU time
- Call graphs showing function relationships
- Source code annotation (if source is available)
### Production Considerations

- Low overhead: Profiling uses sampling at the configured frequency, typically adding only 1-2% overhead
- Safe defaults: The default 100Hz frequency is safe for production use
- Higher detail: Increase frequency (up to 1000Hz) for more detail, but expect slightly higher overhead
- Profile size: Typical 30-second profiles are 10-100KB compressed
- Single profile at a time: Only one profile can be captured per node at a time; concurrent requests will wait
### Common Profiling Scenarios

Investigating high CPU usage:

```sh
# Capture a 60-second profile during high load
siloctl -a http://silo-node:7450 profile --duration 60

# Open in pprof and look at the flame graph
pprof -http=:8080 profile-*.pb.gz
```

Comparing before/after a change:

```sh
# Capture baseline profile
siloctl profile --output baseline.pb.gz

# Deploy change, then capture new profile
siloctl profile --output after-change.pb.gz

# Compare using pprof's diff mode
pprof -http=:8080 -diff_base=baseline.pb.gz after-change.pb.gz
```