Monitoring

Effective monitoring is essential for maintaining a healthy Citrate node, whether you are running a validator, miner, model host, or oracle. We consider this a non-negotiable part of any production setup. This guide covers the full monitoring stack: Prometheus metrics collection, Grafana visualization, alerting rules for critical events, log management, and health check endpoints.

Prometheus Metrics

The Citrate client (citrated) exposes a Prometheus-compatible metrics endpoint. Enable it in your configuration:

# In ~/.citrate/config.toml
[metrics]
enabled = true
prometheus_addr = "0.0.0.0:9100"
metrics_prefix = "citrate"

Configure Prometheus to scrape your node by adding a job to prometheus.yml:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
scrape_configs:
  - job_name: "citrate-node"
    static_configs:
      - targets: ["localhost:9100"]
 
  # Only needed if running a model host
  - job_name: "citrate-serve"
    static_configs:
      - targets: ["localhost:9101"]

Key metrics exposed by citrated:

Metric	Type	Description
`citrate_chain_head_block`	Gauge	Current chain head block number
`citrate_chain_sync_progress`	Gauge	Sync progress (0.0 to 1.0)
`citrate_p2p_peers`	Gauge	Number of connected peers
`citrate_consensus_votes_cast`	Counter	Total BFT votes cast (validators)
`citrate_consensus_votes_missed`	Counter	Total BFT votes missed (validators)
`citrate_mining_blocks_produced`	Counter	Total blocks produced (miners)
`citrate_mining_blue_blocks`	Counter	Blue blocks produced (miners)
`citrate_mining_hashrate`	Gauge	Current hashrate (hashes/sec)
`citrate_inference_requests_served`	Counter	Inference requests served (model hosts)
`citrate_inference_latency_ms`	Histogram	Inference latency distribution in milliseconds (model hosts)
`citrate_oracle_attestations`	Counter	Attestations submitted (oracles)
`citrate_gpu_temperature_celsius`	Gauge	GPU temperature (model hosts)
`citrate_gpu_utilization_percent`	Gauge	GPU utilization percentage
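
For a quick check without a full Prometheus setup, the endpoint can be read directly. The following is a minimal sketch, not an official client: it assumes the metrics are served at the conventional `/metrics` path (the configuration above only specifies the address), and the helper names `parse_metrics`, `fetch_metrics`, and `summarize` are illustrative.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and labelled series such as
        # histogram buckets, which this simple parser does not handle.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, value = parts[0], parts[1]  # an optional timestamp is ignored
        try:
            samples[name] = float(value)
        except ValueError:
            continue
    return samples


def fetch_metrics(url: str = "http://localhost:9100/metrics") -> dict[str, float]:
    # ASSUMPTION: the endpoint path is /metrics; adjust if your node differs.
    with urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


def summarize(samples: dict[str, float]) -> str:
    """One-line status built from three of the gauges in the table above."""
    head = samples.get("citrate_chain_head_block", 0.0)
    sync = samples.get("citrate_chain_sync_progress", 0.0)
    peers = samples.get("citrate_p2p_peers", 0.0)
    return f"head={int(head)} sync={sync:.1%} peers={int(peers)}"
```

Calling `summarize(fetch_metrics())` on the node host prints a one-line status. Labelled series are skipped on purpose, so histogram data still needs Prometheus itself.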

Grafana Dashboards

The official Citrate Grafana dashboards provide pre-built visualizations. They are distributed as JSON files that you can import through the Grafana UI or load from disk.

Download the official dashboards:

curl -L https://releases.cnidarian.cloud/monitoring/grafana-dashboards.tar.gz -o dashboards.tar.gz

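Extract them into your Grafana dashboards directory. The path below is an example; Grafana only loads dashboards from disk when a dashboard provisioning provider points at that directory: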

tar -xzf dashboards.tar.gz -C /var/lib/grafana/dashboards/

Available dashboards:

Node Overview -- sync status, peer count, resource usage, uptime
Validator Performance -- vote participation, missed votes, slashing events, rewards
Mining Dashboard -- hashrate, block production, blue/red ratio, earnings
Inference Serving -- request rate, latency percentiles, GPU usage, fee revenue
Oracle Monitoring -- attestation rate, latency, chain connectivity, fee accumulation

Each dashboard includes template variables for filtering by node name, time range, and specific metrics.

Alerting Rules

Configure Prometheus alerting rules to notify you of critical events before they cause slashing or revenue loss.

# alerting_rules.yml
groups:
  - name: citrate_node
    rules:
      - alert: NodeOutOfSync
        expr: citrate_chain_sync_progress < 0.99
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Citrate node is out of sync"
          description: "Node sync progress is {{ $value }} (expected > 0.99)"
 
      - alert: LowPeerCount
        expr: citrate_p2p_peers < 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count: {{ $value }} peers"
 
  - name: citrate_validator
    rules:
      - alert: MissedVotes
        expr: rate(citrate_consensus_votes_missed[5m]) > 0.1
        labels:
          severity: critical
        annotations:
          summary: "Validator is missing BFT votes"
 
      - alert: HighMissRate
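        # Ratio over a recent window, so old history and counter resets do not skew it
        expr: >
          rate(citrate_consensus_votes_missed[10m])
          / (rate(citrate_consensus_votes_cast[10m]) + rate(citrate_consensus_votes_missed[10m]))
          > 0.05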
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Vote miss rate above 5%"
 
  - name: citrate_model_host
    rules:
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(citrate_inference_latency_ms_bucket[5m]))) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 inference latency exceeds 5 seconds"
 
      - alert: GPUOverheating
        expr: citrate_gpu_temperature_celsius > 85
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature is {{ $value }}C"
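
The HighInferenceLatency rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation of that idea for illustration only. The function name `estimate_quantile` and the sample bucket boundaries are assumptions, and edge cases that Prometheus handles (native histograms, negative bounds) are left out.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket must have an upper bound of +Inf, like the le="+Inf"
    series of a Prometheus histogram.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus an +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank lies beyond the highest finite bound (or the bucket is
                # empty): report the previous bound instead of interpolating.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # not reached for well-formed cumulative buckets


# Hypothetical latency buckets in milliseconds: (upper bound, cumulative count).
SAMPLE_BUCKETS = [(100, 10), (500, 60), (1000, 90), (5000, 100), (math.inf, 100)]
P95_MS = estimate_quantile(0.95, SAMPLE_BUCKETS)
```

With these sample buckets the 95th observation falls halfway through the 1000 to 5000 ms bucket, so the estimate sits between those bounds and stays under the 5000 ms alert threshold. The estimate is only as precise as the bucket layout, so coarse buckets near the threshold make the alert less sharp.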

We recommend routing alerts to at least two channels so that you do not miss a critical event. The example below sends critical alerts to PagerDuty and all other alerts to Slack:

# alertmanager.yml
route:
  receiver: "slack-notifications"
  group_wait: 30s
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
 
receivers:
  - name: "slack-notifications"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#citrate-alerts"
  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_KEY"
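
To see which receiver a given alert reaches, it can help to model the routing tree by hand. The sketch below is a simplified model, not Alertmanager's actual code: it supports only exact `match` labels, picks the first matching child route, and ignores `continue`, regex matchers, and grouping. The name `pick_receiver` is illustrative.

```python
def pick_receiver(labels: dict, route: dict, inherited: str = "") -> str:
    """Return the receiver that a simplified routing tree selects for an alert."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            # First matching child wins; descend in case it has its own children.
            return pick_receiver(labels, child, receiver)
    # No child matched, so the alert stays with this node's receiver.
    return receiver


# The routing tree from alertmanager.yml above, written as a Python dict.
ROUTE = {
    "receiver": "slack-notifications",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty"},
    ],
}
```

Under this model, `pick_receiver({"severity": "critical"}, ROUTE)` selects `pagerduty` and a warning falls through to `slack-notifications`. Each alert therefore reaches exactly one receiver with the configuration as written.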

Log Management

citrated emits structured JSON logs, which log aggregation systems such as Loki, Elasticsearch, and CloudWatch can ingest.

To pretty-print live logs, pipe the node's output through jq:

citrated start --config ~/.citrate/config.toml 2>&1 | jq .

Configure the log level, format, and rotation in ~/.citrate/config.toml:

[node]
log_level = "info"
log_format = "json"
log_file = "/var/log/citrate/citrated.log"
log_max_size_mb = 100
log_max_files = 10
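
Because each log line is a JSON object, the log file can be filtered with a few lines of code before it reaches an aggregator. The sketch below is illustrative only: the field name `level` and the level names in `LEVELS` are assumptions about the log schema, which this guide does not specify, so check a real log line and adjust them.

```python
import json
from typing import Iterable, Iterator

# ASSUMPTION: level names and their ordering; adapt to what citrated emits.
LEVELS = {"trace": 0, "debug": 1, "info": 2, "warn": 3, "error": 4}


def filter_logs(
    lines: Iterable[str],
    min_level: str = "warn",
    level_key: str = "level",  # hypothetical field name
) -> Iterator[dict]:
    """Yield parsed log records at or above min_level."""
    threshold = LEVELS[min_level]
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines, e.g. a startup banner
        if not isinstance(record, dict):
            continue
        level = str(record.get(level_key, "")).lower()
        # Unknown or missing levels rank below every threshold and are dropped.
        if LEVELS.get(level, -1) >= threshold:
            yield record


# Usage sketch, with the path taken from log_file in the configuration above:
#   with open("/var/log/citrate/citrated.log", encoding="utf-8") as fh:
#       for rec in filter_logs(fh, min_level="error"):
#           print(rec)
```

The function accepts any iterable of lines, so the same code works on a file handle, on `sys.stdin` in a pipeline, or on a list in tests.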

For Docker deployments, use Docker's json-file logging driver with size-based rotation:

docker run -d --name citrate-node \
  --log-driver json-file \
  --log-opt max-size=100m \
  --log-opt max-file=5 \
  ghcr.io/cnidarian/citrated:latest

Health Check Endpoints

The Citrate client exposes HTTP health check endpoints for load balancers and orchestration systems:

Basic health check (returns 200 if the node is running):

curl http://localhost:8545/health

Detailed health with sync status:

curl http://localhost:8545/health/detailed

Ready check (returns 200 only if fully synced and participating):

curl http://localhost:8545/ready

Use these endpoints with Kubernetes liveness and readiness probes, or with external uptime monitoring services, to detect an unhealthy or unsynced node quickly.
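
For setups without an orchestrator, the same endpoints can be polled from a small script. This is a minimal sketch under stated assumptions: it treats any status other than 200 from `/ready` as "not ready" (the exact status code the node returns in that case is not documented here), and the names `probe`, `node_state`, and `check_node` are illustrative.

```python
from http.client import HTTPException
from typing import Optional
from urllib.error import HTTPError
from urllib.request import urlopen


def probe(url: str, timeout: float = 3.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the node is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # the server answered with a non-2xx status
    except (OSError, HTTPException):
        return None  # connection refused, timeout, DNS failure, bad response


def node_state(health: Optional[int], ready: Optional[int]) -> str:
    """Combine the /health and /ready results into a single state."""
    if health != 200:
        return "down"
    if ready != 200:
        return "not-ready"  # running, but not fully synced and participating
    return "ready"


def check_node(base_url: str = "http://localhost:8545") -> str:
    return node_state(probe(f"{base_url}/health"), probe(f"{base_url}/ready"))
```

Running `check_node()` from cron or a systemd timer and alerting on anything other than `ready` gives a basic external check that does not depend on the Prometheus pipeline being up.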