Serving Inference
Once you have registered a model on the Citrate ModelRegistry, the next step is running an inference endpoint that can receive requests from the network, execute model inference, and return attested results. We designed the serving stack to be as straightforward as possible, so you can go from registration to live inference in under an hour. This guide covers hardware requirements, serving configuration, the request lifecycle, and fee collection.
Hardware Requirements
Inference serving performance depends directly on your hardware. The Citrate network does not mandate specific hardware, but your node's response latency and throughput directly affect your model's reputation score and therefore its likelihood of being selected for inference requests.
Minimum specifications for GPU inference:
| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA RTX 3060 (12GB VRAM) | NVIDIA A100 (40GB) or H100 |
| CPU | 8 cores, 3.0 GHz | 16+ cores, 3.5 GHz |
| RAM | 32 GB | 64 GB+ |
| Storage | 500 GB NVMe SSD | 2 TB NVMe SSD |
| Network | 100 Mbps symmetric | 1 Gbps symmetric |
For CPU-only inference (smaller models), you can skip the GPU requirement, but response times will be significantly longer. The network tracks latency percentiles, and consistently slow responses reduce your reputation score multiplier.
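A quick way to confirm a machine clears the VRAM floor before committing to hosting is to query the driver directly. The sketch below wraps nvidia-smi in a few lines of Python; it is a local convenience check, not part of the Citrate tooling.

```python
# Local check that the GPU meets the 12 GB VRAM minimum from the table above.
# Requires the NVIDIA driver (nvidia-smi); not part of the Citrate tooling.
import subprocess

MIN_VRAM_MIB = 12 * 1024  # 12 GB minimum from the hardware table

output = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in output.strip().splitlines():
    name, total_mib = (part.strip() for part in line.split(","))
    status = "OK" if int(total_mib) >= MIN_VRAM_MIB else "below minimum"
    print(f"{name}: {total_mib} MiB VRAM -> {status}")
```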
Serving Configuration
Citrate uses a standardized serving daemon called citrate-serve that wraps your model runtime and handles network communication, attestation signing, and fee escrow.
Install and configure the serving daemon:
Install the serving daemon:
curl -L https://releases.cnidarian.cloud/citrate-serve/latest | bash
Initialize the configuration for your model:
citrate-serve init --model-id 0xYOUR_MODEL_ID
This generates a configuration file at ~/.citrate/serve.toml:
[server]
host = "0.0.0.0"
port = 8545
max_concurrent_requests = 16
request_timeout_ms = 30000
[model]
model_id = "0xYOUR_MODEL_ID"
runtime = "onnx" # Options: onnx, vllm, tgi, custom
model_path = "/models/sentiment_v1.onnx"
device = "cuda:0"
[attestation]
private_key_path = "~/.citrate/keys/attestation.key"
verification_tier = "signature" # Options: signature, optimistic, zk
[network]
rpc_url = "https://rpc.cnidarian.cloud"
chain_id = 1337
[fees]
collection_address = "0xYOUR_WALLET_ADDRESS"
auto_claim_threshold = 1.0 # Auto-claim fees above 1 SALT
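Before starting the daemon, it can be worth sanity-checking the generated file. The snippet below is a minimal sketch (not part of citrate-serve) that loads ~/.citrate/serve.toml with Python's standard tomllib and confirms the fields shown above are present.

```python
# Minimal sanity check for ~/.citrate/serve.toml before starting the daemon.
# Convenience script only; requires Python 3.11+ for the stdlib tomllib module.
import tomllib
from pathlib import Path

CONFIG_PATH = Path.home() / ".citrate" / "serve.toml"

REQUIRED = {
    "server": ["host", "port", "max_concurrent_requests", "request_timeout_ms"],
    "model": ["model_id", "runtime", "model_path", "device"],
    "attestation": ["private_key_path", "verification_tier"],
    "network": ["rpc_url", "chain_id"],
    "fees": ["collection_address"],
}

with CONFIG_PATH.open("rb") as f:
    config = tomllib.load(f)

for section, keys in REQUIRED.items():
    for key in keys:
        assert key in config.get(section, {}), f"missing [{section}] {key}"

assert config["model"]["runtime"] in {"onnx", "vllm", "tgi", "custom"}, "unknown runtime"
assert Path(config["model"]["model_path"]).exists(), "model file not found"
print("serve.toml looks complete")
```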
Supported runtimes:
- onnx -- ONNX Runtime for ONNX models, good general-purpose choice
- vllm -- vLLM for large language models with PagedAttention
- tgi -- HuggingFace Text Generation Inference
- custom -- Bring your own HTTP server implementing the Citrate inference API
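If you choose the custom runtime, citrate-serve forwards requests to an HTTP server you operate. The sketch below shows the general shape of such a server using only the Python standard library; the /infer route, JSON field names, and port are illustrative assumptions rather than the official Citrate inference API.

```python
# Bare-bones custom runtime server that citrate-serve (runtime = "custom") could
# forward requests to. The /infer route and JSON field names are illustrative
# assumptions, not the official Citrate inference API.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_model(text: str) -> dict:
    # Stand-in for a real forward pass (e.g. an ONNX Runtime session call).
    return {"label": "positive" if "good" in text.lower() else "negative"}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/infer":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = run_model(payload.get("input", ""))
        body = json.dumps({"output": result}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listen locally; citrate-serve would be pointed at this address.
    HTTPServer(("127.0.0.1", 9000), InferenceHandler).serve_forever()
```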
Handling Requests
When a smart contract calls InferenceEngine.requestInference(), the network routes the request to one or more registered model hosts. Your serving daemon receives the request, executes inference, and returns an attested result.
The request lifecycle:
- Receive: The network delivers the inference request payload to your endpoint
- Validate: citrate-serve validates the input against your model's declared schema
- Execute: The model runtime processes the input and produces an output
- Attest: The daemon signs the output with your attestation key, producing a cryptographic proof of execution
- Return: The attested result is submitted back to the network for delivery to the calling contract
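To make the lifecycle concrete, the sketch below walks a single request through validate, execute, attest, and return (the receive step corresponds to the function being called). It assumes an Ed25519 attestation key via the cryptography package; the real key format, signature envelope, and submission back to the network are handled by citrate-serve and are not reproduced here.

```python
# Illustrative walk-through of the lifecycle above. Assumes an Ed25519 attestation
# key (pip install cryptography); the real key format, signature envelope, and
# network submission are handled by citrate-serve and are not reproduced here.
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

attestation_key = Ed25519PrivateKey.generate()  # in practice, loaded from attestation.key

def handle_request(request: dict) -> dict:
    # Receive: this function being called stands in for delivery to the endpoint.
    # Validate: check the input against the model's declared schema.
    if not isinstance(request.get("input"), str):
        raise ValueError("input must be a string")

    # Execute: run the model (stand-in for the real forward pass).
    output = {"label": "positive", "score": 0.91}

    # Attest: sign a digest binding the request to the produced output.
    digest = hashlib.sha256(
        json.dumps({"request_id": request["request_id"], "output": output},
                   sort_keys=True).encode()
    ).digest()
    signature = attestation_key.sign(digest)

    # Return: the attested result that would be submitted back to the network.
    return {"request_id": request["request_id"], "output": output,
            "attestation": signature.hex()}

print(handle_request({"request_id": "0xabc", "input": "great product"}))
```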
Monitor your serving daemon in real time:
Start the serving daemon:
citrate-serve start
In another terminal, check the daemon status:
citrate-serve status
View request logs in real time:
citrate-serve logs --follow
Check performance metrics:
citrate-serve metrics
Example log output:
[2025-01-15T14:23:01Z] INFO request_id=0xabc... model=sentiment-v1 latency=47ms status=fulfilled
[2025-01-15T14:23:03Z] INFO request_id=0xdef... model=sentiment-v1 latency=52ms status=fulfilled
[2025-01-15T14:23:05Z] WARN request_id=0x123... model=sentiment-v1 latency=timeout status=failed
Failed requests are not penalized if they are rare, but a failure rate above 5% triggers reputation decay. The Mentorship Protocol (described in Gradient Paper IV) allows experienced operators to guide new hosts through optimization.
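Both numbers the network cares about, the failure rate and your latency percentiles, can be computed directly from the log lines above, which are plain key=value pairs. The sketch below assumes you have piped citrate-serve logs to a local file; the field names match the sample output.

```python
# Compute failure rate and p95 latency from citrate-serve log lines like the sample
# above (key=value fields). Assumes the logs have been piped to serve.log.
import re
import statistics

latencies, failures, total = [], 0, 0

with open("serve.log") as f:
    for line in f:
        fields = dict(re.findall(r"(\w+)=(\S+)", line))
        if "status" not in fields:
            continue
        total += 1
        if fields["status"] == "failed":
            failures += 1
        elif fields["latency"].endswith("ms"):
            latencies.append(int(fields["latency"][:-2]))

if total:
    print(f"failure rate: {failures / total:.1%} (reputation decay starts above 5%)")
if len(latencies) >= 2:
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile cut point
    print(f"p95 latency: {p95:.0f} ms")
```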
Fee Collection
Inference fees are collected in SALT and held in escrow by the InferenceEngine precompile until the result is confirmed. Once the calling contract receives and accepts the result, fees are released to your collection address.
Check pending and available fees:
citrate-cli fees balance --address 0xYOUR_WALLET_ADDRESS --rpc https://rpc.cnidarian.cloud
Manually claim accumulated fees:
citrate-cli fees claim --address 0xYOUR_WALLET_ADDRESS --rpc https://rpc.cnidarian.cloud --private-key $PRIVATE_KEY
Fee breakdown per request:
| Component | Percentage | Recipient |
|---|---|---|
| Model operator | 80% | Your collection address |
| Network fee | 15% | Protocol treasury (burned/redistributed) |
| Attestation reward | 5% | Validator that verified the attestation |
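As a quick worked example of the split, the snippet below computes each party's daily share at an assumed per-request fee and volume; both numbers are made up for illustration.

```python
# Worked example of the 80/15/5 fee split. The per-request fee and daily volume
# are hypothetical; substitute your model's actual numbers.
FEE_PER_REQUEST_SALT = 0.02   # hypothetical price per inference, in SALT
REQUESTS_PER_DAY = 5_000      # hypothetical request volume

split = {"model operator": 0.80, "network fee": 0.15, "attestation reward": 0.05}

daily_total = FEE_PER_REQUEST_SALT * REQUESTS_PER_DAY
for recipient, share in split.items():
    print(f"{recipient}: {daily_total * share:.2f} SALT/day")
```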
If you enabled auto_claim_threshold in your configuration, citrate-serve will automatically submit claim transactions when your accumulated fees exceed the threshold, saving you from manual collection.
Performance Optimization
Maximizing throughput and minimizing latency directly increases your revenue and reputation. Start with the defaults and tune incrementally based on your metrics:
- Batch requests: Enable batching in serve.toml with max_batch_size to process multiple requests in a single forward pass (sketched below)
- Model quantization: Use INT8 or FP16 quantization to reduce VRAM usage and increase throughput
- Request queuing: Configure max_concurrent_requests based on your GPU memory to avoid OOM errors
- Health checks: The network pings your endpoint every 30 seconds; ensure your health endpoint responds within 5 seconds
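Batching is usually the first of these optimizations that pays off. The sketch below is a simplified model of what max_batch_size does, draining queued requests into a single forward pass; it is not the actual citrate-serve scheduler.

```python
# Simplified illustration of request batching: drain up to MAX_BATCH_SIZE queued
# requests and run them through one forward pass. This models the idea behind
# max_batch_size in serve.toml; it is not the actual citrate-serve scheduler.
import queue
import threading
import time

MAX_BATCH_SIZE = 8
requests: "queue.Queue[str]" = queue.Queue()

def forward_pass(batch: list[str]) -> list[str]:
    time.sleep(0.05)  # stand-in for one GPU pass; cost is nearly flat in batch size
    return [f"result for {item}" for item in batch]

def batch_worker() -> None:
    while True:
        batch = [requests.get()]                # block until at least one request
        while len(batch) < MAX_BATCH_SIZE:
            try:
                batch.append(requests.get_nowait())
            except queue.Empty:
                break
        for result in forward_pass(batch):
            print(result)

threading.Thread(target=batch_worker, daemon=True).start()
for i in range(20):
    requests.put(f"request-{i}")
time.sleep(0.5)  # give the worker time to drain the queue before exiting
```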
Further Reading
- Hardware Requirements -- full node hardware specifications
- Model Hosting -- detailed hosting configuration
- Verifiable Inference -- attestation tiers explained
- Registering a Model -- the registration process