Model Hosting
Model hosting is one of the most economically rewarding roles in the Citrate ecosystem, and it is where we see the most community interest. By making GPU resources available to the network, model hosts earn inference fees every time a smart contract or dApp uses their models. This guide covers model formats, GPU management, hosting configuration, and pricing strategies.
Supported Model Formats
Citrate's inference runtime supports three model formats. Choose based on your model type and performance requirements:
| Format | Extension | Best For | Advantages |
|---|---|---|---|
| ONNX | .onnx | General-purpose models, CNNs, transformers | Broad runtime support, hardware portability |
| SafeTensors | .safetensors | LLMs, diffusion models | Safe deserialization, fast loading |
| GGUF | .gguf | LLMs with quantization | CPU-friendly, flexible quantization levels |
To prepare your model, export it from your training framework:
```bash
# PyTorch to ONNX
python -c "
import torch

model = torch.load('my_model.pt')  # assumes the full module was serialized
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, 'my_model.onnx',
                  opset_version=17,
                  input_names=['image'],
                  output_names=['prediction'])
"

# HuggingFace to SafeTensors (usually already in this format)
python -c "
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('my-finetuned-model')
model.save_pretrained('model_dir', safe_serialization=True)
"

# Convert to GGUF for quantized serving
python convert.py model_dir --outfile my_model.gguf --outtype q4_k_m
```
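Before committing to a quantization level, it helps to estimate whether the resulting file will fit in your GPU's VRAM. A rough rule of thumb is parameter count times bits per weight; the sketch below uses approximate bits-per-weight figures for common GGUF types (these are ballpark values, not exact Citrate numbers):

```python
# Rough size estimate for a quantized model. Bits-per-weight values are
# approximations; real file sizes vary with tensor shapes and metadata.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q4_k_m": 4.85,  # approximate average for the mixed K-quant scheme
}

def estimate_model_gb(param_count: int, outtype: str) -> float:
    """Estimate model size in gigabytes for a given quantization type."""
    bits = BITS_PER_WEIGHT[outtype]
    return param_count * bits / 8 / 1e9

# A 7B-parameter model at different quantization levels:
print(round(estimate_model_gb(7_000_000_000, "f16"), 1))     # 14.0
print(round(estimate_model_gb(7_000_000_000, "q4_k_m"), 1))  # 4.2
```

Leave headroom beyond the weights themselves: the KV cache and activation buffers grow with batch size and context length.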
GPU Allocation
The citrate-serve daemon manages GPU resources. Configure GPU allocation in your serving configuration to control how models are distributed across available hardware.
```toml
# In ~/.citrate/serve.toml
[gpu]
devices = ["cuda:0", "cuda:1"]  # Available GPU devices
memory_fraction = 0.9           # Use 90% of VRAM (reserve 10% for overhead)
auto_offload = true             # Automatically offload layers to CPU if VRAM is full

# Per-model GPU assignment
[[models]]
model_id = "0xABC123..."
device = "cuda:0"
max_batch_size = 32
priority = "high"

[[models]]
model_id = "0xDEF456..."
device = "cuda:1"
max_batch_size = 16
priority = "normal"
```
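The daemon's scheduler itself is not shown here, but the effect of `max_batch_size` and `priority` can be sketched with a small queue-draining routine. Everything below is illustrative (the entry layout and function names are hypothetical, not citrate-serve internals):

```python
import heapq

# Illustrative sketch: drain a per-device request queue, preferring the
# highest-priority model and capping the batch at max_batch_size.
# Queue entries: (priority_rank, arrival_seq, model_id, payload).
PRIORITY_RANK = {"high": 0, "normal": 1, "low": 2}

def next_batch(queue, max_batch_size):
    """Pop up to max_batch_size requests, all for one model, choosing the
    model of the highest-priority request currently waiting."""
    if not queue:
        return []
    heapq.heapify(queue)
    target_model = queue[0][2]          # model of the most urgent request
    batch, skipped = [], []
    while queue and len(batch) < max_batch_size:
        entry = heapq.heappop(queue)
        if entry[2] == target_model:
            batch.append(entry)
        else:
            skipped.append(entry)       # other models wait for the next batch
    for entry in skipped:
        heapq.heappush(queue, entry)
    return batch

queue = [
    (PRIORITY_RANK["normal"], 1, "0xDEF456", "req-a"),
    (PRIORITY_RANK["high"],   2, "0xABC123", "req-b"),
    (PRIORITY_RANK["high"],   3, "0xABC123", "req-c"),
]
batch = next_batch(queue, max_batch_size=32)
print([payload for *_, payload in batch])  # → ['req-b', 'req-c']
```

Batching requests for the same model amortizes a single forward pass across many callers, which is why a higher `max_batch_size` raises throughput at the cost of per-request latency.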
Monitor GPU utilization in real time:
```bash
citrate-serve gpu-status
```
Example output:
```
Device  | VRAM Used | VRAM Total | Utilization | Models Loaded
cuda:0  | 22.4 GB   | 24.0 GB    | 87%         | sentiment-v1, entity-v2
cuda:1  | 15.1 GB   | 24.0 GB    | 62%         | summarizer-v3
```
For multi-GPU setups with large models, enable tensor parallelism:
```toml
[[models]]
model_id = "0x789ABC..."
devices = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
parallel_mode = "tensor"  # Options: tensor, pipeline, none
max_batch_size = 8
```
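Tensor parallelism splits individual weight matrices across devices, so each GPU holds and computes only a shard of every layer. The toy example below illustrates column-wise sharding of a matrix-vector product, with plain Python lists standing in for per-GPU tensors (this is a conceptual sketch, not the runtime's implementation):

```python
# Toy column-parallel linear layer: the weight matrix is split by columns
# across "devices"; each shard computes a slice of the output, and the
# slices are concatenated (the all-gather step in a real runtime).
def matvec(weight_cols, x):
    """weight_cols is a list of columns; returns y[j] = sum_i x[i] * W[i][j]."""
    return [sum(x[i] * col[i] for i in range(len(x))) for col in weight_cols]

def column_parallel_matvec(weight_cols, x, num_devices):
    # Assign contiguous column slices to each device.
    per_dev = (len(weight_cols) + num_devices - 1) // num_devices
    shards = [weight_cols[d * per_dev:(d + 1) * per_dev] for d in range(num_devices)]
    # Each device computes its output slice independently...
    partials = [matvec(shard, x) for shard in shards]
    # ...then the slices are concatenated into the full output.
    return [y for part in partials for y in part]

W_cols = [[1, 0], [0, 1], [1, 1], [2, 0]]  # a 2x4 weight stored as 4 columns
x = [3, 5]
# Sharded computation matches the single-device result:
assert column_parallel_matvec(W_cols, x, num_devices=2) == matvec(W_cols, x)
```

The trade-off is inter-GPU communication at every layer, which is why tensor parallelism works best with fast interconnects (NVLink or similar) and why pipeline mode, which communicates only at stage boundaries, can be preferable over slower links.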
ONNX Runtime Configuration
For ONNX models, fine-tune the runtime for optimal performance:
```toml
[runtime.onnx]
execution_provider = "CUDAExecutionProvider"  # or TensorrtExecutionProvider
graph_optimization_level = "all"
enable_mem_pattern = true
enable_cpu_mem_arena = true
intra_op_num_threads = 4
inter_op_num_threads = 2
```
The TensorRT execution provider offers significant speedups on NVIDIA GPUs but requires the TensorRT SDK to be installed. On first load it compiles an optimized engine for your specific GPU and input shapes; this adds startup time but can reduce inference latency by 2-5x on subsequent requests.
Pricing Strategies
Setting the right price for your inference service balances profitability with competitiveness. The ModelRegistry enforces a minimum price floor, but beyond that, pricing is market-driven.
Check current market rates for your model category:
```bash
citrate-cli model market-rates --category "nlp/sentiment" --rpc https://rpc.cnidarian.cloud
```
Update your model's price:
```bash
citrate-cli model update-price \
  --model-id 0xYOUR_MODEL_ID \
  --new-price 0.0015 \
  --rpc https://rpc.cnidarian.cloud \
  --private-key $PRIVATE_KEY
```
Pricing factors to consider:
| Factor | Impact on Price |
|---|---|
| GPU compute cost | Your primary cost basis -- amortize hardware over expected lifespan |
| Electricity | Significant for GPU-intensive models |
| Network bandwidth | Matters for large input/output payloads |
| Stake opportunity cost | SALT locked as collateral cannot earn staking rewards |
| Model uniqueness | Rare or specialized models can command premium pricing |
| Reputation score | Higher reputation attracts more requests at any price point |
A common strategy is to start with a competitive (lower) price to build reputation, then gradually increase as your model's reputation score grows and attracts organic demand. We've seen hosts do well by undercutting the market by 10-15% early on and then adjusting upward once they have a solid track record.
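As a floor for any strategy, it helps to know your break-even price per request, derived from hardware amortization, power draw, and expected throughput. The sketch below is a back-of-the-envelope calculation; every input is a placeholder to substitute with your own figures:

```python
# Back-of-the-envelope break-even price per inference request.
# All numeric inputs are placeholder assumptions, not network figures.
def break_even_price(gpu_cost_usd, lifespan_hours, power_kw,
                     power_usd_per_kwh, requests_per_hour, margin=0.0):
    """Minimum price per request covering amortized hardware and electricity."""
    hourly_hw = gpu_cost_usd / lifespan_hours    # amortized hardware cost per hour
    hourly_power = power_kw * power_usd_per_kwh  # electricity cost per hour
    cost_per_request = (hourly_hw + hourly_power) / requests_per_hour
    return cost_per_request * (1 + margin)

# Example: a $1,600 GPU amortized over 3 years of uptime, drawing 0.35 kW
# at $0.12/kWh, serving 2,000 requests/hour, with a 25% margin:
price = break_even_price(1600, 3 * 365 * 24, 0.35, 0.12, 2000, margin=0.25)
print(f"{price:.6f} per request")
```

This omits bandwidth and stake opportunity cost from the table above; add those terms if they are material for your setup. Any market price below your break-even is a loss regardless of reputation gains.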
Health and Availability
The network monitors model host availability and penalizes excessive downtime. Configure health checks to ensure your serving daemon stays responsive:
```toml
[health]
check_interval_seconds = 15
timeout_seconds = 5
unhealthy_threshold = 3  # Mark unhealthy after 3 consecutive failed checks
recovery_threshold = 2   # Mark healthy after 2 consecutive successful checks

[health.gpu]
max_temperature_celsius = 85  # Throttle if GPU exceeds this temperature
min_free_vram_mb = 500        # Alert if free VRAM drops below this threshold
```
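The two thresholds behave as a counter-based state machine: consecutive failures flip a host to unhealthy, consecutive successes flip it back. A sketch of that transition logic (illustrative only, not the daemon's actual code):

```python
# Sketch of threshold-based health transitions: unhealthy after N
# consecutive failed checks, healthy again after M consecutive successes.
class HealthTracker:
    def __init__(self, unhealthy_threshold=3, recovery_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.recovery_threshold = recovery_threshold
        self.healthy = True
        self.streak = 0  # consecutive checks pushing toward a state flip

    def record(self, check_passed: bool) -> bool:
        if self.healthy:
            # Count consecutive failures; any success resets the streak.
            self.streak = self.streak + 1 if not check_passed else 0
            if self.streak >= self.unhealthy_threshold:
                self.healthy, self.streak = False, 0
        else:
            # Count consecutive successes; any failure resets the streak.
            self.streak = self.streak + 1 if check_passed else 0
            if self.streak >= self.recovery_threshold:
                self.healthy, self.streak = True, 0
        return self.healthy

t = HealthTracker()
for result in [False, False, False]:  # three failed checks in a row
    t.record(result)
assert t.healthy is False
t.record(True); t.record(True)        # two successes restore health
assert t.healthy is True
```

Because the streak resets on any opposite result, an isolated timeout between successful checks never trips the unhealthy threshold; only sustained failure does.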
Further Reading
- Serving Inference -- the inference request lifecycle
- Registering a Model -- register your hosted model
- Hardware Requirements -- GPU selection guide
- Monitoring -- Prometheus metrics for model hosts