Observability

Silo exposes Prometheus metrics for monitoring system health, performance, and capacity. This guide covers the available metrics, how to configure the metrics endpoint, and recommended alerting strategies.

Silo exposes metrics in Prometheus format on a separate HTTP port. Configure the address in your configuration file:

```toml
[metrics]
enabled = true
addr = "0.0.0.0:9090"
```

Metrics are available at the /metrics endpoint:

```sh
curl http://localhost:9090/metrics
```

These metrics track jobs as they flow through the system.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| silo_jobs_enqueued_total | Counter | shard, tenant | Total number of jobs enqueued |
| silo_jobs_dequeued_total | Counter | shard, task_group | Total number of tasks dequeued for execution |
| silo_jobs_completed_total | Counter | shard, status | Total jobs completed. Status is succeeded, failed, or cancelled |
| silo_job_attempts_total | Counter | shard, task_group, is_retry | Total job attempts started. is_retry=true for attempts after the first |
| silo_job_wait_time_seconds | Histogram | shard, task_group | Time jobs spent in queue before being dequeued (enqueue-to-dequeue latency) |

Key insights:

  • Compare silo_jobs_enqueued_total vs silo_jobs_dequeued_total to detect queue buildup
  • High silo_job_wait_time_seconds indicates workers can’t keep up with incoming jobs
  • Track is_retry=true in silo_job_attempts_total to monitor retry rates
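As a concrete starting point, the queue-buildup and wait-time checks above could be expressed as Prometheus alerting rules along the following lines. This is a sketch: the rule names, thresholds, and `for` durations are illustrative, not shipped defaults, and it assumes the standard `_bucket` series that Prometheus histograms expose.

```yaml
groups:
  - name: silo-jobs
    rules:
      # Enqueue rate persistently exceeds dequeue rate: the queue is growing.
      - alert: SiloQueueBacklogGrowing
        expr: |
          sum by (shard) (rate(silo_jobs_enqueued_total[5m]))
            > sum by (shard) (rate(silo_jobs_dequeued_total[5m]))
        for: 15m
        labels:
          severity: warning
      # p99 time a job waits in queue before being dequeued.
      - alert: SiloJobWaitTimeHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le, task_group) (rate(silo_job_wait_time_seconds_bucket[5m]))) > 30
        for: 10m
        labels:
          severity: warning
```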

Leases represent tasks actively being processed by workers.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| silo_task_leases_active | Gauge | shard, task_group | Number of tasks currently leased to workers |

Key insights:

  • This metric shows in-flight work at any given moment
  • Sudden drops may indicate worker crashes or network issues
  • Compare against worker count to understand utilization
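One possible way to surface sudden drops is to watch the change in active leases over a window; a large negative value during steady load can point at crashed workers. The window and interpretation here are illustrative:

```promql
# Change in active leases over the last 10 minutes, per shard and task group.
# A sharp negative value during otherwise steady load may indicate worker crashes.
delta(silo_task_leases_active[10m])
```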

The task broker maintains an in-memory buffer of ready tasks for efficient dequeue operations.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| silo_broker_buffer_size | Gauge | shard | Number of tasks in the broker’s in-memory buffer |
| silo_broker_inflight_size | Gauge | shard | Tasks claimed but not yet durably leased |
| silo_broker_scan_duration_seconds | Histogram | shard | Duration of broker task scanning operations |

Key insights:

  • silo_broker_buffer_size near 0 with pending work may indicate scan issues
  • High silo_broker_scan_duration_seconds suggests database pressure
  • silo_broker_inflight_size should be transiently low; persistent high values indicate dequeue bottlenecks
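For instance, broker scan latency can be summarized with a quantile query (assuming the standard `_bucket` series that Prometheus histograms expose):

```promql
# p99 broker scan duration per shard; sustained growth suggests database pressure.
histogram_quantile(0.99,
  sum by (le, shard) (rate(silo_broker_scan_duration_seconds_bucket[5m])))
```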

These metrics track distributed shard ownership across the cluster.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| silo_shards_owned | Gauge | - | Number of shards owned by this node (from coordinator) |
| silo_coordination_shards_open | Gauge | - | Number of shards currently open in this process |

Key insights:

  • silo_shards_owned should match silo_coordination_shards_open after convergence
  • Discrepancies indicate shard acquisition/release in progress
  • Use for capacity planning: total shards / nodes = shards per node
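Since divergence between the two gauges is expected only transiently, a rule fragment like the following could flag ownership that fails to converge. The alert name and the 10-minute grace period are illustrative choices:

```yaml
- alert: SiloShardOwnershipDiverged
  expr: silo_shards_owned != silo_coordination_shards_open
  for: 10m
  labels:
    severity: warning
```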

Concurrency limits control how many jobs with the same concurrency key can run simultaneously.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| silo_concurrency_holders | Gauge | tenant, queue | Active concurrency ticket holders per queue |
| silo_concurrency_tickets_granted_total | Counter | - | Total concurrency tickets granted |

Key insights:

  • silo_concurrency_holders at limit indicates jobs are waiting for capacity
  • Track silo_concurrency_tickets_granted_total rate to understand throughput through limited queues
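These two insights translate directly into queries; note that the configured concurrency limit itself is configuration, not a metric, so comparing holders against it is a manual step here:

```promql
# Throughput through concurrency-limited queues (tickets granted per second).
rate(silo_concurrency_tickets_granted_total[5m])

# Current holders per queue; compare against the configured limit.
sum by (tenant, queue) (silo_concurrency_holders)
```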

These metrics track the gRPC API performance.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| silo_grpc_requests_total | Counter | method, status | Total gRPC requests by method and status |
| silo_grpc_request_duration_seconds | Histogram | method | gRPC request latency by method |

Key insights:

  • Monitor error rates via status label (look for non-OK statuses)
  • silo_grpc_request_duration_seconds helps identify slow operations
  • High latency on LeaseTasks may indicate database or broker issues
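An error-rate and latency view per method might look like the following. This sketch assumes the status label carries gRPC code names with "OK" for success, and the standard `_bucket` series for the histogram:

```promql
# Fraction of gRPC requests failing, per method.
sum by (method) (rate(silo_grpc_requests_total{status!="OK"}[5m]))
  / sum by (method) (rate(silo_grpc_requests_total[5m]))

# p99 request latency per method.
histogram_quantile(0.99,
  sum by (le, method) (rate(silo_grpc_request_duration_seconds_bucket[5m])))
```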

Silo uses SlateDB as its underlying embedded key-value storage engine. These metrics expose SlateDB’s internal statistics for monitoring storage-layer health and performance.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| silo_slatedb_get_requests_total | Gauge | shard | Total number of GET (read) requests to SlateDB |
| silo_slatedb_scan_requests_total | Gauge | shard | Total number of scan (range query) requests |
| silo_slatedb_write_ops_total | Gauge | shard | Total number of individual write operations |
| silo_slatedb_write_batch_count_total | Gauge | shard | Total number of write batches |
| silo_slatedb_backpressure_count_total | Gauge | shard | Number of times writes were blocked by back-pressure |

Key insights:

  • High silo_slatedb_backpressure_count_total indicates the storage layer is under write pressure
  • Compare silo_slatedb_write_ops_total to silo_slatedb_write_batch_count_total to understand batching efficiency
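The batching-efficiency comparison can be expressed as a ratio. Although these series are exposed as gauges, they are cumulative totals, so `rate()` is applicable; the 5-minute window is an illustrative choice:

```promql
# Average write-batch size: write ops per batch (higher generally means better batching).
rate(silo_slatedb_write_ops_total[5m]) / rate(silo_slatedb_write_batch_count_total[5m])

# Back-pressure events per second; sustained non-zero values mean write pressure.
rate(silo_slatedb_backpressure_count_total[5m])
```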

These metrics track write-ahead log (WAL) buffering and memtable flushes.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| silo_slatedb_wal_buffer_estimated_bytes | Gauge | shard | Estimated bytes buffered in the WAL buffer |
| silo_slatedb_wal_buffer_flushes_total | Gauge | shard | Total number of WAL buffer flushes |
| silo_slatedb_immutable_memtable_flushes_total | Gauge | shard | Total number of immutable memtable flushes to SSTs |

Key insights:

  • silo_slatedb_wal_buffer_estimated_bytes shows pending writes not yet durably flushed
  • Monitor silo_slatedb_immutable_memtable_flushes_total rate to understand flush frequency
  • High WAL buffer sizes may indicate slow object storage writes

SlateDB uses bloom filters to avoid unnecessary SST reads. These metrics track filter effectiveness.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| silo_slatedb_sst_filter_positives_total | Gauge | shard | True positives: key exists and filter said yes |
| silo_slatedb_sst_filter_negatives_total | Gauge | shard | True negatives: key absent and filter said no (avoided read) |
| silo_slatedb_sst_filter_false_positives_total | Gauge | shard | False positives: key absent but filter said yes (wasted read) |

Key insights:

  • High silo_slatedb_sst_filter_negatives_total rate indicates filters are effective at avoiding reads
  • silo_slatedb_sst_filter_false_positives_total / total lookups gives the false positive rate
  • A false positive rate above 1-2% may indicate bloom filter tuning is needed
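Following the formula above, the false positive rate can be computed directly, treating total lookups as the sum of all three filter outcomes:

```promql
# Bloom filter false positive rate over the last 5 minutes.
rate(silo_slatedb_sst_filter_false_positives_total[5m])
  / (rate(silo_slatedb_sst_filter_positives_total[5m])
     + rate(silo_slatedb_sst_filter_negatives_total[5m])
     + rate(silo_slatedb_sst_filter_false_positives_total[5m]))
```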

SlateDB periodically compacts SST files to reclaim space and improve read performance.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| silo_slatedb_bytes_compacted_total | Gauge | shard | Total number of bytes compacted |
| silo_slatedb_running_compactions | Gauge | shard | Number of compactions currently running |
| silo_slatedb_last_compaction_ts_seconds | Gauge | shard | Unix timestamp of the last compaction |

Key insights:

  • silo_slatedb_running_compactions > 0 indicates compaction is actively working
  • Monitor silo_slatedb_bytes_compacted_total rate to understand compaction throughput
  • If silo_slatedb_last_compaction_ts_seconds is very old, compaction may be stuck or disabled
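A staleness check for compaction could look like this rule fragment. The one-hour threshold is illustrative, and a mostly write-idle shard may legitimately go longer between compactions, so tune it to your workload:

```yaml
- alert: SiloCompactionStale
  expr: time() - silo_slatedb_last_compaction_ts_seconds > 3600
  for: 15m
  labels:
    severity: warning
```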

Silo supports OpenTelemetry tracing via the OTEL_EXPORTER_OTLP_ENDPOINT environment variable. When set, Silo exports traces using the OTLP protocol:

```sh
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 silo -c config.toml
```

Key spans include:

  • enqueue - Job enqueue operations
  • dequeue - Task dequeue and lease creation
  • report_outcome - Attempt completion reporting
  • concurrency.grant / concurrency.release - Concurrency ticket operations

Silo uses structured logging via tracing. Configure the log output format in your configuration file:

```toml
[logging]
format = "json" # "text" (default, human-readable) or "json" (structured)
```

Control log verbosity with the RUST_LOG environment variable:

```sh
RUST_LOG=info silo -c config.toml                    # General info level
RUST_LOG=silo::coordination=trace silo -c config.toml # Trace a specific module
```

Key log events to monitor:

  • shard opened/closed - Shard lifecycle
  • lease expired - Worker crashes or timeouts
  • rate limit check failed - Gubernator connectivity issues
  • failed to close shard - Graceful shutdown problems

Silo supports on-demand CPU profiling for production debugging. Profiles are captured using pprof-rs, a low-overhead sampling profiler, and returned in standard pprof protobuf format.

Use siloctl to capture a CPU profile from a running node:

```sh
siloctl -a http://silo-node:7450 profile --duration 30
```

Options:

  • --duration, -d: Profile duration in seconds (1-300, default 30)
  • --frequency, -f: Sampling frequency in Hz (1-1000, default 100)
  • --output, -o: Output file path (default: profile-{timestamp}.pb.gz)

Example with all options:

```sh
siloctl -a http://silo-node:7450 profile \
  --duration 60 \
  --frequency 250 \
  --output my-profile.pb.gz
```

Profiles are saved in pprof protobuf format (gzip compressed). Analyze with either of these tools:

Using pprof CLI:

```sh
# Install pprof if needed
go install github.com/google/pprof@latest

# Open interactive web UI
pprof -http=:8080 profile-1706123456.pb.gz
```

Using go tool pprof:

```sh
go tool pprof -http=:8080 profile-1706123456.pb.gz
```

Both tools open an interactive web UI with:

  • Flame graphs for visualizing hot paths
  • Top functions by CPU time
  • Call graphs showing function relationships
  • Source code annotation (if source is available)

Keep the following in mind when profiling production nodes:

  • Low overhead: Profiling uses sampling at the configured frequency, typically adding only 1-2% overhead
  • Safe defaults: The default 100Hz frequency is safe for production use
  • Higher detail: Increase frequency (up to 1000Hz) for more detail, but expect slightly higher overhead
  • Profile size: Typical 30-second profiles are 10-100KB compressed
  • Single profile at a time: Only one profile can be captured per node at a time; concurrent requests will wait

Investigating high CPU usage:

```sh
# Capture a 60-second profile during high load
siloctl -a http://silo-node:7450 profile --duration 60

# Open in pprof and look at the flame graph
pprof -http=:8080 profile-*.pb.gz
```

Comparing before/after a change:

```sh
# Capture baseline profile
siloctl profile --output baseline.pb.gz

# Deploy change, then capture new profile
siloctl profile --output after-change.pb.gz

# Compare using pprof's diff mode
pprof -http=:8080 -diff_base=baseline.pb.gz after-change.pb.gz
```