Observability

Silo exposes Prometheus metrics for monitoring system health, performance, and capacity. This guide covers the available metrics, how to configure the metrics endpoint, and recommended alerting strategies.

Silo exposes metrics in Prometheus format on a separate HTTP port. Configure the address in your configuration file:

[metrics]
enabled = true
addr = "0.0.0.0:9090"

Metrics are available at the /metrics endpoint:

curl http://localhost:9090/metrics

These metrics track jobs as they flow through the system.

Metric | Type | Labels | Description
silo_jobs_enqueued_total | Counter | shard, tenant | Total number of jobs enqueued
silo_jobs_dequeued_total | Counter | shard, task_group | Total number of tasks dequeued for execution
silo_jobs_completed_total | Counter | shard, status | Total jobs completed; status is succeeded, failed, or cancelled
silo_job_attempts_total | Counter | shard, task_group, is_retry | Total job attempts started; is_retry=true for attempts after the first
silo_job_wait_time_seconds | Histogram | shard, task_group | Time jobs spent in queue before being dequeued (enqueue-to-dequeue latency)

Key insights:

  • Compare silo_jobs_enqueued_total vs silo_jobs_dequeued_total to detect queue buildup
  • High silo_job_wait_time_seconds indicates workers can’t keep up with incoming jobs
  • Track is_retry=true in silo_job_attempts_total to monitor retry rates
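
For example, the enqueue/dequeue comparison and retry-rate check above can be expressed as PromQL queries against a Prometheus server that scrapes Silo. This is a sketch: the prometheus:9090 address is a placeholder for your Prometheus server, not Silo’s metrics port.

# Queue buildup: enqueue rate minus dequeue rate per shard; a persistently positive gap means work is accumulating
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=sum by (shard) (rate(silo_jobs_enqueued_total[5m])) - sum by (shard) (rate(silo_jobs_dequeued_total[5m]))'
# Retry rate: share of attempts that are retries, per task group
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=sum by (task_group) (rate(silo_job_attempts_total{is_retry="true"}[5m])) / sum by (task_group) (rate(silo_job_attempts_total[5m]))'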

Leases represent tasks actively being processed by workers.

Metric | Type | Labels | Description
silo_task_leases_active | Gauge | shard, task_group | Number of tasks currently leased to workers
silo_ready_to_start_latency_ms | Histogram | shard, task_group | Time between when a task became ready and when it was first leased (in milliseconds)
silo_lease_reaper_duration_seconds | Histogram | shard | Duration of expired lease reaper scan operations
silo_lease_reaper_scans_total | Counter | shard | Total number of expired lease reaper scan operations

Key insights:

  • silo_task_leases_active shows in-flight work at any given moment
  • Sudden drops may indicate worker crashes or network issues
  • Compare against worker count to understand utilization
  • silo_ready_to_start_latency_ms measures scheduling delay — high values indicate workers aren’t polling fast enough or broker scan intervals are too long. Unlike silo_job_wait_time_seconds (which measures total time from enqueue), this metric isolates the time a task sat ready but unleased
  • silo_lease_reaper_duration_seconds tracks how long the expired lease reaper takes to scan all leases per shard. High values indicate a large number of active leases or database pressure, and can contribute to increased CPU usage
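
As a sketch of how to watch scheduling delay and in-flight work, the queries below read the histogram’s standard Prometheus _bucket series and the leases gauge through the Prometheus HTTP API (the prometheus:9090 address is again a placeholder):

# p99 ready-to-lease latency per task group, in milliseconds
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=histogram_quantile(0.99, sum by (task_group, le) (rate(silo_ready_to_start_latency_ms_bucket[5m])))'
# Tasks currently leased to workers, per shard; compare against your worker count
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=sum by (shard) (silo_task_leases_active)'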

The task broker maintains an in-memory buffer of ready tasks for efficient dequeue operations.

Metric | Type | Labels | Description
silo_broker_buffer_size | Gauge | shard, task_group | Number of tasks in the broker’s in-memory buffer per task group
silo_broker_inflight_size | Gauge | shard, task_group | Tasks claimed but not yet durably leased, per task group
silo_broker_scan_duration_seconds | Histogram | shard | Duration of broker task scanning operations
silo_broker_scans_total | Counter | shard | Total number of broker task scan operations
silo_broker_scan_tasks_read_total | Counter | shard, task_group, outcome | Tasks read during scans, broken down by outcome: inserted, skipped_future, skipped_inflight, skipped_tombstone, skipped_already_buffered, skipped_defunct
silo_broker_tombstone_count | Gauge | shard, task_group | Number of ack tombstones currently held by the broker

Key insights:

  • silo_broker_buffer_size near 0 with pending work may indicate scan issues
  • Break down silo_broker_buffer_size by task_group to identify which queues are starved vs saturated
  • High silo_broker_scan_duration_seconds suggests database pressure
  • silo_broker_inflight_size should stay small and short-lived; persistently high values indicate dequeue bottlenecks
  • A high rate of silo_broker_scan_tasks_read_total with outcome="skipped_tombstone" indicates the scanner is repeatedly encountering recently-acked keys that haven’t been compacted away yet
  • Growing silo_broker_tombstone_count may signal that dequeued task keys are persisting in the DB longer than expected
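
To break down buffer occupancy and tombstone-related scan churn, queries like the following can be run against a Prometheus server that scrapes Silo (placeholder address):

# Broker buffer size per shard and task group, to spot starved vs saturated queues
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=sum by (shard, task_group) (silo_broker_buffer_size)'
# Rate of scan reads skipped because of ack tombstones
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=sum by (shard) (rate(silo_broker_scan_tasks_read_total{outcome="skipped_tombstone"}[5m]))'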

These metrics track distributed shard ownership across the cluster.

Metric | Type | Labels | Description
silo_shards_owned | Gauge | - | Number of shards owned by this node (from the coordinator)
silo_coordination_shards_open | Gauge | - | Number of shards currently open in this process

Key insights:

  • silo_shards_owned should match silo_coordination_shards_open after convergence
  • Discrepancies indicate shard acquisition/release in progress
  • Use for capacity planning: total shards / nodes = shards per node
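
A quick convergence check is to subtract the two gauges per node; a result that stays non-zero for more than a brief period suggests shard acquisition or release is stuck. The query below is a sketch against a Prometheus server scraping each Silo node (placeholder address):

# Difference between shards owned (per the coordinator) and shards actually open in this process
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=silo_shards_owned - silo_coordination_shards_open'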

Concurrency limits control how many jobs with the same concurrency key can run simultaneously.

Metric | Type | Labels | Description
silo_concurrency_tickets_granted_total | Counter | - | Total concurrency tickets granted

Key insights:

  • Track silo_concurrency_tickets_granted_total rate to understand throughput through limited queues
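
The grant rate can be read directly from the counter; this sketch assumes a Prometheus server at a placeholder address:

# Concurrency tickets granted per second over the last 5 minutes
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=rate(silo_concurrency_tickets_granted_total[5m])'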

These metrics track the gRPC API performance.

Metric | Type | Labels | Description
silo_grpc_requests_total | Counter | method, status | Total gRPC requests by method and status
silo_grpc_request_duration_seconds | Histogram | method | gRPC request latency by method

Key insights:

  • Monitor error rates via the status label (look for non-OK statuses)
  • silo_grpc_request_duration_seconds helps identify slow operations
  • High latency on LeaseTasks may indicate database or broker issues
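
Error rates and per-method latency can be derived from these two metrics. The sketch below assumes the standard _bucket series for the duration histogram, a Prometheus server at a placeholder address, and that successful requests carry status="OK"; LeaseTasks is used as the example method:

# Fraction of gRPC requests with a non-OK status, per method
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=sum by (method) (rate(silo_grpc_requests_total{status!="OK"}[5m])) / sum by (method) (rate(silo_grpc_requests_total[5m]))'
# p99 latency of LeaseTasks calls
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=histogram_quantile(0.99, sum by (le) (rate(silo_grpc_request_duration_seconds_bucket{method="LeaseTasks"}[5m])))'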

Silo uses SlateDB as its underlying embedded key-value storage engine. These metrics expose SlateDB’s internal statistics for monitoring storage-layer health and performance.

Metric | Type | Labels | Description
silo_slatedb_get_requests_total | Counter | shard | Total number of GET (read) requests to SlateDB
silo_slatedb_scan_requests_total | Counter | shard | Total number of scan (range query) requests
silo_slatedb_write_ops_total | Counter | shard | Total number of individual write operations
silo_slatedb_write_batch_count_total | Counter | shard | Total number of write batches
silo_slatedb_flush_requests_total | Counter | shard | Total number of flush requests to SlateDB
silo_slatedb_backpressure_count_total | Counter | shard | Number of times writes were blocked by backpressure
silo_slatedb_total_mem_size_bytes | Gauge | shard | Total memory usage of SlateDB (memtables, WAL buffers, etc.)

Key insights:

  • High silo_slatedb_backpressure_count_total indicates the storage layer is under write pressure
  • Compare silo_slatedb_write_ops_total to silo_slatedb_write_batch_count_total to understand batching efficiency
  • silo_slatedb_total_mem_size_bytes helps track memory pressure per shard — sudden increases may indicate write spikes or slow flushes
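
Batching efficiency and backpressure are easiest to read as rates; a sketch against a Prometheus server that scrapes Silo (placeholder address):

# Average write operations per batch, per shard (higher generally means better batching)
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=rate(silo_slatedb_write_ops_total[5m]) / rate(silo_slatedb_write_batch_count_total[5m])'
# Backpressure events per second; persistently non-zero values mean the storage layer is under write pressure
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=rate(silo_slatedb_backpressure_count_total[5m])'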

The following metrics track SlateDB’s write-ahead log buffer and memtable flush activity.

Metric | Type | Labels | Description
silo_slatedb_wal_buffer_estimated_bytes | Gauge | shard | Estimated bytes buffered in the WAL buffer
silo_slatedb_wal_buffer_flushes_total | Counter | shard | Total number of WAL buffer flushes
silo_slatedb_immutable_memtable_flushes_total | Counter | shard | Total number of immutable memtable flushes to SSTs

Key insights:

  • silo_slatedb_wal_buffer_estimated_bytes shows pending writes not yet durably flushed
  • Monitor silo_slatedb_immutable_memtable_flushes_total rate to understand flush frequency
  • High WAL buffer sizes may indicate slow object storage writes
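
For example, pending WAL bytes and flush frequency can be watched with queries like these (placeholder Prometheus address):

# Bytes sitting in the WAL buffer that are not yet durably flushed, per shard
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=silo_slatedb_wal_buffer_estimated_bytes'
# Immutable memtable flushes per second
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=rate(silo_slatedb_immutable_memtable_flushes_total[5m])'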

SlateDB uses bloom filters to avoid unnecessary SST reads. These metrics track filter effectiveness.

Metric | Type | Labels | Description
silo_slatedb_sst_filter_positives_total | Counter | shard | True positives: key exists and the filter said yes
silo_slatedb_sst_filter_negatives_total | Counter | shard | True negatives: key absent and the filter said no (avoided read)
silo_slatedb_sst_filter_false_positives_total | Counter | shard | False positives: key absent but the filter said yes (wasted read)

Key insights:

  • High silo_slatedb_sst_filter_negatives_total rate indicates filters are effective at avoiding reads
  • silo_slatedb_sst_filter_false_positives_total / total lookups gives the false positive rate
  • A false positive rate above 1-2% may indicate bloom filter tuning is needed
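
The false positive rate described above can be computed directly from the three counters; a sketch (placeholder Prometheus address):

# Bloom filter false positive rate per shard: false positives / all filter lookups
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=rate(silo_slatedb_sst_filter_false_positives_total[5m]) / (rate(silo_slatedb_sst_filter_positives_total[5m]) + rate(silo_slatedb_sst_filter_negatives_total[5m]) + rate(silo_slatedb_sst_filter_false_positives_total[5m]))'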

SlateDB periodically compacts SST files to reclaim space and improve read performance.

Metric | Type | Labels | Description
silo_slatedb_bytes_compacted_total | Counter | shard | Total number of bytes compacted
silo_slatedb_running_compactions | Gauge | shard | Number of compactions currently running
silo_slatedb_last_compaction_ts_seconds | Gauge | shard | Unix timestamp of the last compaction
silo_slatedb_l0_sst_count | Gauge | shard | Number of Level-0 SSTs (high values indicate compaction lag)

Key insights:

  • silo_slatedb_running_compactions > 0 indicates compaction is actively working
  • Monitor silo_slatedb_bytes_compacted_total rate to understand compaction throughput
  • If silo_slatedb_last_compaction_ts_seconds is very old, compaction may be stuck or disabled
  • High silo_slatedb_l0_sst_count means scans must merge across many unsorted files, directly increasing scan latency. This is the most important metric for diagnosing slow scans
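
Compaction staleness and L0 buildup translate into simple queries; the 20-file threshold below is an arbitrary example, and the Prometheus address is a placeholder:

# Seconds since the last compaction, per shard
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=time() - silo_slatedb_last_compaction_ts_seconds'
# Shards whose L0 SST count exceeds an example threshold of 20
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=silo_slatedb_l0_sst_count > 20'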

SlateDB maintains an in-memory block cache for SST data blocks, index blocks, and bloom filters. These metrics track cache effectiveness — low hit rates indicate that reads are falling through to object storage.

Metric | Type | Labels | Description
silo_slatedb_cache_data_block_hit_total | Counter | shard | Data block cache hits
silo_slatedb_cache_data_block_miss_total | Counter | shard | Data block cache misses (each miss requires an object storage read)
silo_slatedb_cache_index_hit_total | Counter | shard | Index block cache hits
silo_slatedb_cache_index_miss_total | Counter | shard | Index block cache misses
silo_slatedb_cache_filter_hit_total | Counter | shard | Bloom filter cache hits
silo_slatedb_cache_filter_miss_total | Counter | shard | Bloom filter cache misses

Key insights:

  • Compute hit rates as hit / (hit + miss) for each cache tier
  • Low data block hit rate combined with high L0 SST count strongly suggests compaction isn’t keeping up
  • Low filter/index cache hit rates may indicate the cache is too small for the working set — consider increasing SlateDB’s block cache size
  • Filter cache misses are especially expensive because they force a full block read to check for key existence
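
The hit-rate formula above, applied to the data block cache tier, looks like this (repeat with the index and filter counters for the other tiers; the Prometheus address is a placeholder):

# Data block cache hit rate per shard: hits / (hits + misses)
curl -s http://prometheus:9090/api/v1/query --data-urlencode \
  'query=rate(silo_slatedb_cache_data_block_hit_total[5m]) / (rate(silo_slatedb_cache_data_block_hit_total[5m]) + rate(silo_slatedb_cache_data_block_miss_total[5m]))'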

Silo supports OpenTelemetry tracing via the OTEL_EXPORTER_OTLP_ENDPOINT environment variable. When set, Silo exports traces using the OTLP protocol:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 silo -c config.toml

Key spans include:

  • enqueue - Job enqueue operations
  • dequeue - Task dequeue and lease creation
  • report_outcome - Attempt completion reporting
  • concurrency.grant / concurrency.release - Concurrency ticket operations

Silo uses structured logging via tracing. Configure the log output format in your configuration file:

[logging]
format = "json" # "text" (default, human-readable) or "json" (structured)

Control log verbosity with the RUST_LOG environment variable:

RUST_LOG=info silo -c config.toml # General info level
RUST_LOG=silo::coordination=trace silo -c config.toml # Trace a specific module

Key log events to monitor:

  • shard opened/closed - Shard lifecycle
  • lease expired - Worker crashes or timeouts
  • rate limit check failed - Gubernator connectivity issues
  • failed to close shard - Graceful shutdown problems

Silo supports on-demand CPU profiling for production debugging. Profiles are captured using pprof-rs, a low-overhead sampling profiler, and returned in standard pprof protobuf format.

Use siloctl to capture a CPU profile from a running node:

siloctl -a http://silo-node:7450 profile --duration 30

Options:

  • --duration, -d: Profile duration in seconds (1-300, default 30)
  • --frequency, -f: Sampling frequency in Hz (1-1000, default 100)
  • --output, -o: Output file path (default: profile-{timestamp}.pb.gz)

Example with all options:

siloctl -a http://silo-node:7450 profile \
--duration 60 \
--frequency 250 \
--output my-profile.pb.gz

Profiles are saved in pprof protobuf format (gzip compressed). Analyze with either of these tools:

Using pprof CLI:

# Install pprof if needed
go install github.com/google/pprof@latest
# Open interactive web UI
pprof -http=:8080 profile-1706123456.pb.gz

Using go tool pprof:

go tool pprof -http=:8080 profile-1706123456.pb.gz

Both tools open an interactive web UI with:

  • Flame graphs for visualizing hot paths
  • Top functions by CPU time
  • Call graphs showing function relationships
  • Source code annotation (if source is available)

Profiling notes:

  • Low overhead: Profiling uses sampling at the configured frequency, typically adding only 1-2% overhead
  • Safe defaults: The default 100 Hz frequency is safe for production use
  • Higher detail: Increase the frequency (up to 1000 Hz) for more detail, but expect slightly higher overhead
  • Profile size: Typical 30-second profiles are 10-100 KB compressed
  • Single profile at a time: Only one profile can be captured per node at a time; concurrent requests will wait

Investigating high CPU usage:

# Capture a 60-second profile during high load
siloctl -a http://silo-node:7450 profile --duration 60
# Open in pprof and look at the flame graph
pprof -http=:8080 profile-*.pb.gz

Comparing before/after a change:

# Capture baseline profile
siloctl profile --output baseline.pb.gz
# Deploy change, then capture new profile
siloctl profile --output after-change.pb.gz
# Compare using pprof's diff mode
pprof -http=:8080 -diff_base=baseline.pb.gz after-change.pb.gz