Serving Inference
Once you have registered a model on the Citrate ModelRegistry, the next step is running an inference endpoint that can receive requests from the network, execute model inference, and return attested results. We designed the serving stack to be as straightforward as possible, so you can go from registration to live inference in under an hour. This guide covers hardware requirements, serving configuration, the request lifecycle, and fee collection.
Hardware Requirements
Inference serving performance depends directly on your hardware. The Citrate network does not mandate specific hardware, but your node's response latency and throughput directly affect your model's reputation score and therefore its likelihood of being selected for inference requests.
Minimum specifications for GPU inference:
| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA RTX 3060 (12GB VRAM) | NVIDIA A100 (40GB) or H100 |
| CPU | 8 cores, 3.0 GHz | 16+ cores, 3.5 GHz |
| RAM | 32 GB | 64 GB+ |
| Storage | 500 GB NVMe SSD | 2 TB NVMe SSD |
| Network | 100 Mbps symmetric | 1 Gbps symmetric |
For CPU-only inference (smaller models), you can skip the GPU requirement, but response times will be significantly longer. The network tracks latency percentiles, and consistently slow responses reduce your reputation score multiplier.
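A quick way to confirm a machine clears the VRAM floor before committing to hosting is to query the driver directly. The sketch below wraps nvidia-smi in a few lines of Python; it is a local convenience check, not part of the Citrate tooling.

```python
# Local check that the GPU meets the 12 GB VRAM minimum from the table above.
# Requires the NVIDIA driver (nvidia-smi); not part of the Citrate tooling.
import subprocess

MIN_VRAM_MIB = 12 * 1024  # 12 GB minimum from the hardware table

output = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in output.strip().splitlines():
    name, total_mib = (part.strip() for part in line.split(","))
    status = "OK" if int(total_mib) >= MIN_VRAM_MIB else "below minimum"
    print(f"{name}: {total_mib} MiB VRAM -> {status}")
```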
Serving Configuration
Citrate uses a standardized serving daemon called citrate-serve that wraps your model runtime and handles network communication, attestation signing, and fee escrow.
Install and configure the serving daemon:
Install the serving daemon:
curl -L https://releases.cnidarian.cloud/citrate-serve/latest | bash
Initialize the configuration for your model:
citrate-serve init --model-id 0xYOUR_MODEL_ID
This generates a configuration file at ~/.citrate/serve.toml:
[server]
host = "0.0.0.0"
port = 8545
max_concurrent_requests = 16
request_timeout_ms = 30000
[model]
model_id = "0xYOUR_MODEL_ID"
runtime = "onnx" # Options: onnx, vllm, tgi, custom
model_path = "/models/sentiment_v1.onnx"
device = "cuda:0"
[attestation]
private_key_path = "~/.citrate/keys/attestation.key"
verification_tier = "signature" # Options: signature, optimistic, zk
[network]
rpc_url = "https://rpc.cnidarian.cloud"
chain_id = 1337
[fees]
collection_address = "0xYOUR_WALLET_ADDRESS"
auto_claim_threshold = 1.0 # Auto-claim fees above 1 SALT
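Before starting the daemon, it can be worth sanity-checking the generated file. The snippet below is a minimal sketch (not part of citrate-serve) that loads ~/.citrate/serve.toml with Python's standard tomllib and confirms the fields shown above are present.

```python
# Minimal sanity check for ~/.citrate/serve.toml before starting the daemon.
# Convenience script only; requires Python 3.11+ for the stdlib tomllib module.
import tomllib
from pathlib import Path

CONFIG_PATH = Path.home() / ".citrate" / "serve.toml"

REQUIRED = {
    "server": ["host", "port", "max_concurrent_requests", "request_timeout_ms"],
    "model": ["model_id", "runtime", "model_path", "device"],
    "attestation": ["private_key_path", "verification_tier"],
    "network": ["rpc_url", "chain_id"],
    "fees": ["collection_address"],
}

with CONFIG_PATH.open("rb") as f:
    config = tomllib.load(f)

for section, keys in REQUIRED.items():
    for key in keys:
        assert key in config.get(section, {}), f"missing [{section}] {key}"

assert config["model"]["runtime"] in {"onnx", "vllm", "tgi", "custom"}, "unknown runtime"
assert Path(config["model"]["model_path"]).exists(), "model file not found"
print("serve.toml looks complete")
```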
Supported runtimes:
- onnx -- ONNX Runtime for ONNX models, good general-purpose choice
- vllm -- vLLM for large language models with PagedAttention
- tgi -- HuggingFace Text Generation Inference
- custom -- Bring your own HTTP server implementing the Citrate inference API
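If you choose the custom runtime, citrate-serve forwards requests to an HTTP server you operate. The sketch below shows the general shape of such a server using only the Python standard library; the /infer route, JSON field names, and port are illustrative assumptions rather than the official Citrate inference API.

```python
# Bare-bones custom runtime server that citrate-serve (runtime = "custom") could
# forward requests to. The /infer route and JSON field names are illustrative
# assumptions, not the official Citrate inference API.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_model(text: str) -> dict:
    # Stand-in for a real forward pass (e.g. an ONNX Runtime session call).
    return {"label": "positive" if "good" in text.lower() else "negative"}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/infer":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = run_model(payload.get("input", ""))
        body = json.dumps({"output": result}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listen locally; citrate-serve would be pointed at this address.
    HTTPServer(("127.0.0.1", 9000), InferenceHandler).serve_forever()
```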
Handling Requests
When a smart contract calls InferenceEngine.requestInference(), the network routes the request to one or more registered model hosts. Your serving daemon receives the request, executes inference, and returns an attested result.
The request lifecycle:
- Receive: The network delivers the inference request payload to your endpoint
- Validate: citrate-serve validates the input against your model's declared schema
- Execute: The model runtime processes the input and produces an output
- Attest: The daemon signs the output with your attestation key, producing a cryptographic proof of execution
- Return: The attested result is submitted back to the network for delivery to the calling contract
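To make the lifecycle concrete, the sketch below walks a single request through validate, execute, attest, and return (the receive step corresponds to the function being called). It assumes an Ed25519 attestation key via the cryptography package; the real key format, signature envelope, and submission back to the network are handled by citrate-serve and are not reproduced here.

```python
# Illustrative walk-through of the lifecycle above. Assumes an Ed25519 attestation
# key (pip install cryptography); the real key format, signature envelope, and
# network submission are handled by citrate-serve and are not reproduced here.
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

attestation_key = Ed25519PrivateKey.generate()  # in practice, loaded from attestation.key

def handle_request(request: dict) -> dict:
    # Receive: this function being called stands in for delivery to the endpoint.
    # Validate: check the input against the model's declared schema.
    if not isinstance(request.get("input"), str):
        raise ValueError("input must be a string")

    # Execute: run the model (stand-in for the real forward pass).
    output = {"label": "positive", "score": 0.91}

    # Attest: sign a digest binding the request to the produced output.
    digest = hashlib.sha256(
        json.dumps({"request_id": request["request_id"], "output": output},
                   sort_keys=True).encode()
    ).digest()
    signature = attestation_key.sign(digest)

    # Return: the attested result that would be submitted back to the network.
    return {"request_id": request["request_id"], "output": output,
            "attestation": signature.hex()}

print(handle_request({"request_id": "0xabc", "input": "great product"}))
```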
Monitor your serving daemon in real time:
Start the serving daemon:
citrate-serve start
In another terminal, check the daemon status:
citrate-serve status
View request logs in real time:
citrate-serve logs --follow
Check performance metrics:
citrate-serve metrics
Example log output:
[2025-01-15T14:23:01Z] INFO request_id=0xabc... model=sentiment-v1 latency=47ms status=fulfilled
[2025-01-15T14:23:03Z] INFO request_id=0xdef... model=sentiment-v1 latency=52ms status=fulfilled
[2025-01-15T14:23:05Z] WARN request_id=0x123... model=sentiment-v1 latency=timeout status=failed
Failed requests are not penalized if they are rare, but a failure rate above 5% triggers reputation decay. The Mentorship Protocol (described in Gradient Paper IV) allows experienced operators to guide new hosts through optimization.
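Both numbers the network cares about, the failure rate and your latency percentiles, can be computed directly from the log lines above, which are plain key=value pairs. The sketch below assumes you have piped citrate-serve logs to a local file; the field names match the sample output.

```python
# Compute failure rate and p95 latency from citrate-serve log lines like the sample
# above (key=value fields). Assumes the logs have been piped to serve.log.
import re
import statistics

latencies, failures, total = [], 0, 0

with open("serve.log") as f:
    for line in f:
        fields = dict(re.findall(r"(\w+)=(\S+)", line))
        if "status" not in fields:
            continue
        total += 1
        if fields["status"] == "failed":
            failures += 1
        elif fields["latency"].endswith("ms"):
            latencies.append(int(fields["latency"][:-2]))

if total:
    print(f"failure rate: {failures / total:.1%} (reputation decay starts above 5%)")
if len(latencies) >= 2:
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile cut point
    print(f"p95 latency: {p95:.0f} ms")
```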
Fee Collection
Inference fees are collected in SALT and held in escrow by the InferenceEngine precompile until the result is confirmed. Once the calling contract receives and accepts the result, fees are released to your collection address.
Check pending and available fees:
citrate-cli fees balance --address 0xYOUR_WALLET_ADDRESS --rpc https://rpc.cnidarian.cloud
Manually claim accumulated fees:
citrate-cli fees claim --address 0xYOUR_WALLET_ADDRESS --rpc https://rpc.cnidarian.cloud --private-key $PRIVATE_KEY
Fee breakdown per request:
| Component | Percentage | Recipient |
|---|---|---|
| Model operator | 80% | Your collection address |
| Network fee | 15% | Protocol treasury (burned/redistributed) |
| Attestation reward | 5% | Validator that verified the attestation |
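As a quick worked example of the split, the snippet below computes each party's daily share at an assumed per-request fee and volume; both numbers are made up for illustration.

```python
# Worked example of the 80/15/5 fee split. The per-request fee and daily volume
# are hypothetical; substitute your model's actual numbers.
FEE_PER_REQUEST_SALT = 0.02   # hypothetical price per inference, in SALT
REQUESTS_PER_DAY = 5_000      # hypothetical request volume

split = {"model operator": 0.80, "network fee": 0.15, "attestation reward": 0.05}

daily_total = FEE_PER_REQUEST_SALT * REQUESTS_PER_DAY
for recipient, share in split.items():
    print(f"{recipient}: {daily_total * share:.2f} SALT/day")
```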
If you enabled auto_claim_threshold in your configuration, citrate-serve will automatically submit claim transactions when your accumulated fees exceed the threshold, saving you from manual collection.
Performance Optimization
Maximizing throughput and minimizing latency directly increases your revenue and reputation. Start with the defaults and tune incrementally based on your metrics:
- Batch requests: Enable batching in serve.toml with max_batch_size to process multiple requests in a single forward pass (sketched below)
- Model quantization: Use INT8 or FP16 quantization to reduce VRAM usage and increase throughput
- Request queuing: Configure max_concurrent_requests based on your GPU memory to avoid OOM errors
- Health checks: The network pings your endpoint every 30 seconds; ensure your health endpoint responds within 5 seconds
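Batching is usually the first of these optimizations that pays off. The sketch below is a simplified model of what max_batch_size does, draining queued requests into a single forward pass; it is not the actual citrate-serve scheduler.

```python
# Simplified illustration of request batching: drain up to MAX_BATCH_SIZE queued
# requests and run them through one forward pass. This models the idea behind
# max_batch_size in serve.toml; it is not the actual citrate-serve scheduler.
import queue
import threading
import time

MAX_BATCH_SIZE = 8
requests: "queue.Queue[str]" = queue.Queue()

def forward_pass(batch: list[str]) -> list[str]:
    time.sleep(0.05)  # stand-in for one GPU pass; cost is nearly flat in batch size
    return [f"result for {item}" for item in batch]

def batch_worker() -> None:
    while True:
        batch = [requests.get()]                # block until at least one request
        while len(batch) < MAX_BATCH_SIZE:
            try:
                batch.append(requests.get_nowait())
            except queue.Empty:
                break
        for result in forward_pass(batch):
            print(result)

threading.Thread(target=batch_worker, daemon=True).start()
for i in range(20):
    requests.put(f"request-{i}")
time.sleep(0.5)  # give the worker time to drain the queue before exiting
```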
Further Reading
- Hardware Requirements -- full node hardware specifications
- Model Hosting -- detailed hosting configuration
- Verifiable Inference -- attestation tiers explained
- Registering a Model -- the registration process