Monitoring
Effective monitoring is essential for maintaining a healthy Citrate node, whether you are running a validator, miner, model host, or oracle. We consider this a non-negotiable part of any production setup. This guide covers the full monitoring stack: Prometheus metrics collection, Grafana visualization, alerting rules for critical events, log management, and health check endpoints.
Prometheus Metrics
The Citrate client (citrated) exposes a Prometheus-compatible metrics endpoint. Enable it in your configuration:
# In ~/.citrate/config.toml
[metrics]
enabled = true
prometheus_addr = "0.0.0.0:9100"
metrics_prefix = "citrate"
Configure Prometheus to scrape your node by adding a job to prometheus.yml:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "citrate-node"
    static_configs:
      - targets: ["localhost:9100"]
  # Only needed if running a model host
  - job_name: "citrate-serve"
    static_configs:
      - targets: ["localhost:9101"]
Key metrics exposed by citrated:
| Metric | Type | Description |
|---|---|---|
| citrate_chain_head_block | Gauge | Current chain head block number |
| citrate_chain_sync_progress | Gauge | Sync progress (0.0 to 1.0) |
| citrate_p2p_peers | Gauge | Number of connected peers |
| citrate_consensus_votes_cast | Counter | Total BFT votes cast (validators) |
| citrate_consensus_votes_missed | Counter | Total BFT votes missed (validators) |
| citrate_mining_blocks_produced | Counter | Total blocks produced (miners) |
| citrate_mining_blue_blocks | Counter | Blue blocks produced (miners) |
| citrate_mining_hashrate | Gauge | Current hashrate (hashes/sec) |
| citrate_inference_requests_served | Counter | Inference requests served (model hosts) |
| citrate_inference_latency_ms | Histogram | Inference latency distribution |
| citrate_oracle_attestations | Counter | Attestations submitted (oracles) |
| citrate_gpu_temperature_celsius | Gauge | GPU temperature (model hosts) |
| citrate_gpu_utilization_percent | Gauge | GPU utilization percentage |
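Before wiring up Prometheus, it can help to confirm that the endpoint serves the metrics listed above. The following is a minimal sketch that parses the Prometheus text exposition format and reads a few gauges. The `/metrics` path is an assumption (it is the Prometheus default scrape path, which the job above does not override), and the sample values are made up for illustration.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabeled samples from Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        name = parts[0]
        # Labeled series (for example histogram buckets) are out of scope here.
        if "{" in name or len(parts) < 2:
            continue
        try:
            samples[name] = float(parts[1])
        except ValueError:
            continue
    return samples


def fetch_metrics(url: str = "http://localhost:9100/metrics",
                  timeout: float = 5.0) -> dict[str, float]:
    """Fetch and parse metrics from a running node (assumed URL)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative payload; real output from citrated may differ.
SAMPLE = """\
# HELP citrate_chain_sync_progress Sync progress (0.0 to 1.0)
# TYPE citrate_chain_sync_progress gauge
citrate_chain_sync_progress 0.97
# TYPE citrate_p2p_peers gauge
citrate_p2p_peers 12
"""

metrics = parse_metrics(SAMPLE)
print(metrics["citrate_chain_sync_progress"], metrics["citrate_p2p_peers"])
```

Against a live node, `fetch_metrics()` returns the same kind of dictionary, so a quick check such as `fetch_metrics()["citrate_p2p_peers"]` shows whether the exporter is reachable.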
Grafana Dashboards
Import the official Citrate Grafana dashboards for pre-built visualizations. Download the dashboard JSON files, then either import them through the Grafana UI or load them from a provisioned dashboards directory.
Download the official dashboards:
curl -L https://releases.cnidarian.cloud/monitoring/grafana-dashboards.tar.gz -o dashboards.tar.gz
Extract them into your Grafana dashboards directory (Grafana loads dashboards from disk only when a provisioning provider points at this path):
tar -xzf dashboards.tar.gz -C /var/lib/grafana/dashboards/
Available dashboards:
- Node Overview -- sync status, peer count, resource usage, uptime
- Validator Performance -- vote participation, missed votes, slashing events, rewards
- Mining Dashboard -- hashrate, block production, blue/red ratio, earnings
- Inference Serving -- request rate, latency percentiles, GPU usage, fee revenue
- Oracle Monitoring -- attestation rate, latency, chain connectivity, fee accumulation
Each dashboard includes template variables for filtering by node name, time range, and specific metrics.
Alerting Rules
Configure Prometheus alerting rules to notify you of critical events before they cause slashing or revenue loss.
# alerting_rules.yml
groups:
  - name: citrate_node
    rules:
      - alert: NodeOutOfSync
        expr: citrate_chain_sync_progress < 0.99
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Citrate node is out of sync"
          description: "Node sync progress is {{ $value }} (expected >= 0.99)"
      - alert: LowPeerCount
        expr: citrate_p2p_peers < 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count: {{ $value }} peers"
  - name: citrate_validator
    rules:
      - alert: MissedVotes
        # Fires when more than 0.1 votes per second are missed over 5 minutes
        expr: rate(citrate_consensus_votes_missed[5m]) > 0.1
        labels:
          severity: critical
        annotations:
          summary: "Validator is missing BFT votes"
      - alert: HighMissRate
        # Ratio over a recent window; dividing raw counter totals would
        # measure the lifetime average and react slowly to new misses
        expr: >
          rate(citrate_consensus_votes_missed[10m])
          / (rate(citrate_consensus_votes_cast[10m]) + rate(citrate_consensus_votes_missed[10m]))
          > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Vote miss rate above 5%"
  - name: citrate_model_host
    rules:
      - alert: HighInferenceLatency
        # histogram_quantile operates on the rate of the _bucket series
        expr: histogram_quantile(0.95, sum by (le) (rate(citrate_inference_latency_ms_bucket[5m]))) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 inference latency exceeds 5 seconds"
      - alert: GPUOverheating
        expr: citrate_gpu_temperature_celsius > 85
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature is {{ $value }}C"
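The P95 latency alert depends on estimating a quantile from histogram buckets rather than from raw samples. The sketch below shows the general idea (cumulative bucket counts plus linear interpolation inside the bucket that contains the target rank). It is a simplified illustration, not Prometheus's exact implementation, and the bucket boundaries and counts are invented for the example.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound is expected to be +Inf and to hold
    the total observation count.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            width = count - prev_count
            if width == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / width
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in milliseconds: (upper bound, cumulative count).
SAMPLE_BUCKETS = [
    (100.0, 50.0),
    (500.0, 80.0),
    (1000.0, 90.0),
    (5000.0, 98.0),
    (math.inf, 100.0),
]

print(histogram_quantile(0.95, SAMPLE_BUCKETS))
```

With these sample buckets the 95th-percentile rank lands in the 1000 to 5000 ms bucket, so the estimate is interpolated within that range. The accuracy of the alert therefore depends on how finely the buckets are spaced around the 5000 ms threshold.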
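We recommend routing alerts to at least two channels so that a single failed notification path does not hide a critical event. Configure Alertmanager for your preferred notification targets; the example below sends alerts to Slack by default and routes critical alerts to PagerDuty: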
# alertmanager.yml
route:
  receiver: "slack-notifications"
  group_wait: 30s
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
receivers:
  - name: "slack-notifications"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#citrate-alerts"
  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_KEY"
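To reason about which receiver an alert ends up at, the routing tree above can be modeled in a few lines. This is a simplified sketch of the first-match behavior only. It ignores Alertmanager features such as `continue`, regex matchers, and grouping, and the `Route` class and `resolve_receiver` function are illustrative names rather than part of any Alertmanager API.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)
    routes: list["Route"] = field(default_factory=list)

    def matches(self, labels: dict[str, str]) -> bool:
        # Every label in `match` must be present with an equal value.
        return all(labels.get(key) == value for key, value in self.match.items())


def resolve_receiver(route: Route, labels: dict[str, str]) -> str:
    """Return the receiver for an alert: the first matching child route wins,
    otherwise the alert falls back to the parent route's receiver."""
    for child in route.routes:
        if child.matches(labels):
            return resolve_receiver(child, labels)
    return route.receiver


# Mirrors the alertmanager.yml example: Slack by default, PagerDuty for critical.
ROOT = Route(
    receiver="slack-notifications",
    routes=[Route(receiver="pagerduty", match={"severity": "critical"})],
)

print(resolve_receiver(ROOT, {"alertname": "GPUOverheating", "severity": "critical"}))
print(resolve_receiver(ROOT, {"alertname": "LowPeerCount", "severity": "warning"}))
```

In this model a critical alert resolves to the PagerDuty receiver only, while everything else goes to Slack. If critical alerts should reach both channels, the routing tree needs to be extended accordingly.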
Log Management
citrated outputs structured JSON logs that can be ingested by any log aggregation system (Loki, Elasticsearch, CloudWatch).
View live logs with pretty-printed JSON:
citrated start --config ~/.citrate/config.toml 2>&1 | jq .
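Because the logs are JSON, filtering by severity is straightforward once each line is parsed. The sketch below keeps only records at or above a chosen level. The field names (`level`, `msg`) and the level names are assumptions about the log schema, and the sample lines are invented, so adjust them to match what citrated actually emits.

```python
import json

# Assumed severity ordering; the real level names may differ.
LEVELS = {"trace": 0, "debug": 1, "info": 2, "warn": 3, "error": 4}


def filter_logs(lines, min_level="warn"):
    """Yield parsed log records at or above min_level, skipping non-JSON lines."""
    threshold = LEVELS[min_level]
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        level = str(record.get("level", "")).lower()
        # Records with an unknown or missing level are dropped.
        if LEVELS.get(level, -1) >= threshold:
            yield record


# Hypothetical log lines for illustration.
SAMPLE_LINES = [
    '{"level": "info", "msg": "imported block", "height": 1024}',
    '{"level": "warn", "msg": "peer disconnected", "peers": 4}',
    "plain text line that is not JSON",
    '{"level": "error", "msg": "vote submission failed"}',
]

for record in filter_logs(SAMPLE_LINES, min_level="warn"):
    print(record["level"], record["msg"])
```

The function accepts any iterable of lines, so passing `sys.stdin` lets it sit at the end of a pipe from the node's output.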
Configure log rotation in ~/.citrate/config.toml:
[node]
log_level = "info"
log_format = "json"
log_file = "/var/log/citrate/citrated.log"
log_max_size_mb = 100
log_max_files = 10
For Docker deployments, use Docker's built-in json-file logging driver with size-based rotation:
docker run -d --name citrate-node --log-driver json-file --log-opt max-size=100m --log-opt max-file=5 ghcr.io/cnidarian/citrated:latest
Health Check Endpoints
The Citrate client exposes HTTP health check endpoints for load balancers and orchestration systems:
Basic health check (returns 200 if the node is running):
curl http://localhost:8545/health
Detailed health with sync status:
curl http://localhost:8545/health/detailed
Ready check (returns 200 only if fully synced and participating):
curl http://localhost:8545/ready
Use these endpoints with Kubernetes liveness and readiness probes, or with external uptime monitoring services, to detect an unhealthy or unsynced node quickly.
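For setups without an orchestrator, the same endpoints can be polled from a small script. The sketch below combines the basic health check and the ready check into a single state, in the same way a liveness and a readiness probe would be combined. The state names and helper functions are illustrative, and the only behavior assumed of the endpoints is that they return 200 on success, as described above.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 3.0):
    """Return the HTTP status code for url, or None if the node is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # Non-2xx responses still carry a status code.
        return err.code
    except (URLError, OSError):
        return None


def classify(health_status, ready_status) -> str:
    """Map the two probe results to a coarse node state."""
    if health_status != 200:
        return "down"
    if ready_status != 200:
        return "not-ready"
    return "ready"


def node_state(base_url: str = "http://localhost:8545") -> str:
    """Probe /health and /ready on a node and classify the result."""
    return classify(
        http_status(f"{base_url}/health"),
        http_status(f"{base_url}/ready"),
    )


# Example: the process is up (200) but the node is still syncing (503).
print(classify(200, 503))
```

Calling `node_state()` on the node's host performs the two requests. A "not-ready" result corresponds to a node that is running but not yet fully synced, which is the case a readiness probe is meant to catch without restarting the process.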
Further Reading
- Hardware Requirements -- ensure your hardware supports the monitoring overhead
- Validator Setup -- validator-specific metrics to watch
- Model Hosting -- inference-specific monitoring
- Oracle Node Setup -- oracle attestation monitoring