
Monitoring

Monitoring is essential to keeping a Citrate node healthy, whether you run a validator, miner, model host, or oracle, and we consider it a non-negotiable part of any production setup. This guide covers the full monitoring stack: Prometheus metrics collection, Grafana visualization, alerting rules for critical events, log management, and health check endpoints.

Prometheus Metrics

The Citrate client (citrated) exposes a Prometheus-compatible metrics endpoint. Enable it in your configuration:

# In ~/.citrate/config.toml
[metrics]
enabled = true
prometheus_addr = "0.0.0.0:9100"
metrics_prefix = "citrate"

Configure Prometheus to scrape your node by adding a job to prometheus.yml:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
scrape_configs:
  - job_name: "citrate-node"
    static_configs:
      - targets: ["localhost:9100"]
 
  # Only needed if running a model host
  - job_name: "citrate-serve"
    static_configs:
      - targets: ["localhost:9101"]

Key metrics exposed by citrated:

| Metric | Type | Description |
| --- | --- | --- |
| citrate_chain_head_block | Gauge | Current chain head block number |
| citrate_chain_sync_progress | Gauge | Sync progress (0.0 to 1.0) |
| citrate_p2p_peers | Gauge | Number of connected peers |
| citrate_consensus_votes_cast | Counter | Total BFT votes cast (validators) |
| citrate_consensus_votes_missed | Counter | Total BFT votes missed (validators) |
| citrate_mining_blocks_produced | Counter | Total blocks produced (miners) |
| citrate_mining_blue_blocks | Counter | Blue blocks produced (miners) |
| citrate_mining_hashrate | Gauge | Current hashrate (hashes/sec) |
| citrate_inference_requests_served | Counter | Inference requests served (model hosts) |
| citrate_inference_latency_ms | Histogram | Inference latency distribution (milliseconds) |
| citrate_oracle_attestations | Counter | Attestations submitted (oracles) |
| citrate_gpu_temperature_celsius | Gauge | GPU temperature in degrees Celsius (model hosts) |
| citrate_gpu_utilization_percent | Gauge | GPU utilization percentage (model hosts) |
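To spot-check these metrics without a full Prometheus setup, you can read the endpoint directly. The following is a minimal Python sketch, assuming the endpoint serves the standard Prometheus text exposition format; the `/metrics` path, the `gpu` label, and the sample values are assumptions for illustration, not taken from the citrated documentation.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition into {series_name: float}.

    Simplified: skips comment lines, keeps labels as part of the key,
    and ignores optional trailing timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split after the closing brace (if any) so label values may contain spaces.
        if "}" in line:
            name, _, rest = line.rpartition("}")
            name += "}"
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            samples[name] = float(fields[0])
        except ValueError:
            continue  # not a numeric sample; skip it
    return samples


def fetch_metrics(url="http://localhost:9100/metrics", timeout=5):
    """Fetch and parse the metrics endpoint (the /metrics path is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative sample only; the values and the gpu label are made up.
SAMPLE = """\
# HELP citrate_chain_sync_progress Sync progress (0.0 to 1.0)
# TYPE citrate_chain_sync_progress gauge
citrate_chain_sync_progress 0.997
citrate_p2p_peers 12
citrate_gpu_temperature_celsius{gpu="0"} 71.5
"""

if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE)
    print(metrics["citrate_chain_sync_progress"], metrics["citrate_p2p_peers"])
```

Against a running node, `fetch_metrics()` returns the same kind of dictionary, which is handy for confirming that the `[metrics]` section took effect before wiring up Prometheus.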

Grafana Dashboards

The official Citrate Grafana dashboards provide pre-built visualizations. Download the dashboard JSON files, then either import them through the Grafana UI or place them in a directory that Grafana loads dashboards from.

Download the official dashboards:

curl -L https://releases.cnidarian.cloud/monitoring/grafana-dashboards.tar.gz -o dashboards.tar.gz

Extract them into your Grafana dashboards directory (Grafana picks up files from this path only if a dashboard provisioning provider points at it):

tar -xzf dashboards.tar.gz -C /var/lib/grafana/dashboards/

Available dashboards:

  • Node Overview -- sync status, peer count, resource usage, uptime
  • Validator Performance -- vote participation, missed votes, slashing events, rewards
  • Mining Dashboard -- hashrate, block production, blue/red ratio, earnings
  • Inference Serving -- request rate, latency percentiles, GPU usage, fee revenue
  • Oracle Monitoring -- attestation rate, latency, chain connectivity, fee accumulation

Each dashboard includes template variables for filtering by node name, time range, and specific metrics.

Alerting Rules

Configure Prometheus alerting rules to notify you of critical events before they cause slashing or revenue loss.

# alerting_rules.yml
groups:
  - name: citrate_node
    rules:
      - alert: NodeOutOfSync
        expr: citrate_chain_sync_progress < 0.99
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Citrate node is out of sync"
          description: "Node sync progress is {{ $value }} (expected > 0.99)"
 
      - alert: LowPeerCount
        expr: citrate_p2p_peers < 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count: {{ $value }} peers"
 
  - name: citrate_validator
    rules:
      - alert: MissedVotes
        expr: rate(citrate_consensus_votes_missed[5m]) > 0.1
        labels:
          severity: critical
        annotations:
          summary: "Validator is missing BFT votes"
 
      - alert: HighMissRate
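        # Ratio over a recent window; dividing the raw counters would give a
        # lifetime average. The 1h window is a suggested starting point.
        expr: >
          rate(citrate_consensus_votes_missed[1h])
          / (rate(citrate_consensus_votes_cast[1h]) + rate(citrate_consensus_votes_missed[1h]))
          > 0.05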
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Vote miss rate above 5%"
 
  - name: citrate_model_host
    rules:
      - alert: HighInferenceLatency
        # histogram_quantile needs per-bucket rates grouped by the le label
        expr: histogram_quantile(0.95, sum by (le) (rate(citrate_inference_latency_ms_bucket[5m]))) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 inference latency exceeds 5 seconds"
 
      - alert: GPUOverheating
        expr: citrate_gpu_temperature_celsius > 85
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature is {{ $value }}°C"
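The latency alert works on a quantile estimated from histogram buckets rather than from raw samples, so its precision depends on where the bucket boundaries sit. The Python sketch below approximates the linear interpolation Prometheus applies inside the bucket that contains the target rank. The bucket boundaries and counts are hypothetical; citrated's actual bucket layout is not documented here.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: iterable of (upper_bound_ms, cumulative_count), including +Inf.
    Interpolates linearly inside the bucket that contains the target rank.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)  # ascending upper bound; +Inf sorts last
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf; report the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts (upper bound in ms, observations <= bound).
BUCKETS = [(100, 10), (500, 60), (1000, 90), (5000, 98), (math.inf, 100)]

if __name__ == "__main__":
    p95 = estimate_quantile(0.95, BUCKETS)
    print(f"estimated p95: {p95:.0f} ms, alert would fire: {p95 > 5000}")
```

With these sample buckets the estimate falls inside the 1000 to 5000 ms bucket, which shows why a wide top bucket makes a 5000 ms threshold coarse: every observation in that bucket is assumed to be spread evenly across it.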

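We recommend routing alerts to at least two channels so that a single failed integration cannot hide a critical event. Configure Alertmanager for your preferred notification targets; the example below sends alerts to Slack by default and routes critical alerts to PagerDuty: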

# alertmanager.yml
route:
  receiver: "slack-notifications"
  group_wait: 30s
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
 
receivers:
  - name: "slack-notifications"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#citrate-alerts"
  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_KEY"
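To reason about which receiver an alert ends up at, it can help to model the routing tree. This is a simplified Python sketch of first-match routing as configured above; it deliberately ignores Alertmanager features such as `continue`, regex matchers, and grouping, and the `pick_receiver` helper is hypothetical, not part of any Alertmanager API.

```python
def pick_receiver(labels, route):
    """Return the receiver for an alert under simplified first-match routing.

    The first child route whose match labels all equal the alert's labels
    wins; if no child matches, the current route's receiver is used.
    """
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            return pick_receiver(labels, child)
    return route["receiver"]


# Mirrors the route section of the alertmanager.yml example.
ROUTE = {
    "receiver": "slack-notifications",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty"},
    ],
}

if __name__ == "__main__":
    for alert in (
        {"alertname": "GPUOverheating", "severity": "critical"},
        {"alertname": "LowPeerCount", "severity": "warning"},
    ):
        print(alert["alertname"], "->", pick_receiver(alert, ROUTE))
```

Under this model a critical alert reaches only the PagerDuty receiver. If you also want critical alerts in Slack, one option is a receiver that lists both `slack_configs` and `pagerduty_configs`.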

Log Management

citrated outputs structured JSON logs that can be ingested by any log aggregation system (Loki, Elasticsearch, CloudWatch).

View live logs with pretty-printed JSON:

citrated start --config ~/.citrate/config.toml 2>&1 | jq .
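Because each log line is a JSON object, simple filters are easy to script. The following Python sketch keeps only records at or above a chosen severity. The field names `level` and `msg` and the set of level names are assumptions; check them against your node's actual log output.

```python
import json

# Assumed level names, ordered from least to most severe.
LEVELS = {"trace": 0, "debug": 1, "info": 2, "warn": 3, "error": 4}


def filter_logs(lines, min_level="warn", level_key="level"):
    """Yield parsed log records at or above min_level.

    Lines that are not JSON objects (for example startup banners) are
    skipped instead of raising, so the filter survives mixed output.
    """
    threshold = LEVELS[min_level]
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        level = str(record.get(level_key, "")).lower()
        if LEVELS.get(level, -1) >= threshold:
            yield record


# Illustrative lines only; real citrated records may use different fields.
SAMPLE_LINES = [
    '{"level": "info", "msg": "imported block"}',
    "plain text banner",
    '{"level": "WARN", "msg": "peer count low"}',
    '{"level": "error", "msg": "vote missed"}',
]

if __name__ == "__main__":
    for record in filter_logs(SAMPLE_LINES):
        print(json.dumps(record))
```

To use it on live output, pass `sys.stdin` as `lines` and pipe the citrated output into the script.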

Configure log rotation in ~/.citrate/config.toml:

[node]
log_level = "info"
log_format = "json"
log_file = "/var/log/citrate/citrated.log"
log_max_size_mb = 100
log_max_files = 10

For Docker deployments, use Docker's built-in json-file logging driver with size-based rotation:

docker run -d --name citrate-node \
  --log-driver json-file \
  --log-opt max-size=100m \
  --log-opt max-file=5 \
  ghcr.io/cnidarian/citrated:latest

Health Check Endpoints

The Citrate client exposes HTTP health check endpoints for load balancers and orchestration systems:

Basic health check (returns 200 if the node is running):

curl http://localhost:8545/health

Detailed health with sync status:

curl http://localhost:8545/health/detailed

Ready check (returns 200 only if fully synced and participating):

curl http://localhost:8545/ready

Use these endpoints for Kubernetes liveness and readiness probes (/health for liveness, /ready for readiness) or with external uptime monitoring services.
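For external uptime checks, the two endpoints can be combined into a single node state. This is a minimal Python sketch using only the standard library; the state names and the `check_node` helper are hypothetical, and it assumes only what is stated above, namely that each endpoint returns 200 when its condition holds.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=3):
    """Return the HTTP status code, or None if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def classify(health_status, ready_status):
    """Map the two probe results onto a coarse node state."""
    if health_status != 200:
        return "down"
    if ready_status != 200:
        return "not-ready"  # running, but not fully synced or not participating
    return "ready"


def check_node(base_url="http://localhost:8545", status_fn=http_status):
    """Probe /health and /ready and return the combined state."""
    return classify(status_fn(base_url + "/health"), status_fn(base_url + "/ready"))


if __name__ == "__main__":
    print(check_node())
```

Passing `status_fn` as a parameter keeps the classification logic testable without a running node; a cron job or uptime service can call `check_node()` and alert on anything other than `ready`.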

Further Reading