Model Hosting
Model hosting is one of the most economically rewarding roles in the Citrate ecosystem, and it is where we see the most community interest. By making GPU resources available to the network, model hosts earn inference fees every time a smart contract or dApp uses their models. This guide covers model formats, GPU management, hosting configuration, and pricing strategies.
Supported Model Formats
Citrate's inference runtime supports three model formats. Choose based on your model type and performance requirements:
| Format | Extension | Best For | Advantages |
|---|---|---|---|
| ONNX | .onnx | General-purpose models, CNNs, transformers | Broad runtime support, hardware portability |
| SafeTensors | .safetensors | LLMs, diffusion models | Safe deserialization, fast loading |
| GGUF | .gguf | LLMs with quantization | CPU-friendly, flexible quantization levels |
To prepare your model, export it from your training framework:
```bash
# PyTorch to ONNX
python -c "
import torch

model = torch.load('my_model.pt')  # assumes the full module was serialized
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, 'my_model.onnx',
                  opset_version=17,
                  input_names=['image'],
                  output_names=['prediction'])
"

# HuggingFace to SafeTensors (usually already in this format)
python -c "
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('my-finetuned-model')
model.save_pretrained('model_dir', safe_serialization=True)
"

# Convert to GGUF for quantized serving
python convert.py model_dir --outfile my_model.gguf --outtype q4_k_m
```
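Before committing to a quantization level, it helps to estimate whether the resulting file will fit in your GPU's VRAM. A rough rule of thumb is parameter count times bits per weight; the sketch below uses approximate bits-per-weight figures for common GGUF types (these are ballpark values, not exact Citrate numbers):

```python
# Rough size estimate for a quantized model. Bits-per-weight values are
# approximations; real file sizes vary with tensor shapes and metadata.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q4_k_m": 4.85,  # approximate average for the mixed K-quant scheme
}

def estimate_model_gb(param_count: int, outtype: str) -> float:
    """Estimate model size in gigabytes for a given quantization type."""
    bits = BITS_PER_WEIGHT[outtype]
    return param_count * bits / 8 / 1e9

# A 7B-parameter model at different quantization levels:
print(round(estimate_model_gb(7_000_000_000, "f16"), 1))     # 14.0
print(round(estimate_model_gb(7_000_000_000, "q4_k_m"), 1))  # 4.2
```

Leave headroom beyond the weights themselves: the KV cache and activation buffers grow with batch size and context length.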
GPU Allocation
The citrate-serve daemon manages GPU resources. Configure GPU allocation in your serving configuration to control how models are distributed across available hardware.
```toml
# In ~/.citrate/serve.toml
[gpu]
devices = ["cuda:0", "cuda:1"]  # Available GPU devices
memory_fraction = 0.9           # Use 90% of VRAM (reserve 10% for overhead)
auto_offload = true             # Automatically offload layers to CPU if VRAM is full

# Per-model GPU assignment
[[models]]
model_id = "0xABC123..."
device = "cuda:0"
max_batch_size = 32
priority = "high"

[[models]]
model_id = "0xDEF456..."
device = "cuda:1"
max_batch_size = 16
priority = "normal"
```
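The daemon's scheduler itself is not shown here, but the effect of `max_batch_size` and `priority` can be sketched with a small queue-draining routine. Everything below is illustrative (the entry layout and function names are hypothetical, not citrate-serve internals):

```python
import heapq

# Illustrative sketch: drain a per-device request queue, preferring the
# highest-priority model and capping the batch at max_batch_size.
# Queue entries: (priority_rank, arrival_seq, model_id, payload).
PRIORITY_RANK = {"high": 0, "normal": 1, "low": 2}

def next_batch(queue, max_batch_size):
    """Pop up to max_batch_size requests, all for one model, choosing the
    model of the highest-priority request currently waiting."""
    if not queue:
        return []
    heapq.heapify(queue)
    target_model = queue[0][2]          # model of the most urgent request
    batch, skipped = [], []
    while queue and len(batch) < max_batch_size:
        entry = heapq.heappop(queue)
        if entry[2] == target_model:
            batch.append(entry)
        else:
            skipped.append(entry)       # other models wait for the next batch
    for entry in skipped:
        heapq.heappush(queue, entry)
    return batch

queue = [
    (PRIORITY_RANK["normal"], 1, "0xDEF456", "req-a"),
    (PRIORITY_RANK["high"],   2, "0xABC123", "req-b"),
    (PRIORITY_RANK["high"],   3, "0xABC123", "req-c"),
]
batch = next_batch(queue, max_batch_size=32)
print([payload for *_, payload in batch])  # → ['req-b', 'req-c']
```

Batching requests for the same model amortizes a single forward pass across many callers, which is why a higher `max_batch_size` raises throughput at the cost of per-request latency.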
Monitor GPU utilization in real time:
```bash
citrate-serve gpu-status
```
Example output:
```
Device  | VRAM Used | VRAM Total | Utilization | Models Loaded
cuda:0  | 22.4 GB   | 24.0 GB    | 87%         | sentiment-v1, entity-v2
cuda:1  | 15.1 GB   | 24.0 GB    | 62%         | summarizer-v3
```
For multi-GPU setups with large models, enable tensor parallelism:
```toml
[[models]]
model_id = "0x789ABC..."
devices = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
parallel_mode = "tensor"  # Options: tensor, pipeline, none
max_batch_size = 8
```
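Tensor parallelism splits individual weight matrices across devices, so each GPU holds and computes only a shard of every layer. The toy example below illustrates column-wise sharding of a matrix-vector product, with plain Python lists standing in for per-GPU tensors (this is a conceptual sketch, not the runtime's implementation):

```python
# Toy column-parallel linear layer: the weight matrix is split by columns
# across "devices"; each shard computes a slice of the output, and the
# slices are concatenated (the all-gather step in a real runtime).
def matvec(weight_cols, x):
    """weight_cols is a list of columns; returns y[j] = sum_i x[i] * W[i][j]."""
    return [sum(x[i] * col[i] for i in range(len(x))) for col in weight_cols]

def column_parallel_matvec(weight_cols, x, num_devices):
    # Assign contiguous column slices to each device.
    per_dev = (len(weight_cols) + num_devices - 1) // num_devices
    shards = [weight_cols[d * per_dev:(d + 1) * per_dev] for d in range(num_devices)]
    # Each device computes its output slice independently...
    partials = [matvec(shard, x) for shard in shards]
    # ...then the slices are concatenated into the full output.
    return [y for part in partials for y in part]

W_cols = [[1, 0], [0, 1], [1, 1], [2, 0]]  # a 2x4 weight stored as 4 columns
x = [3, 5]
# Sharded computation matches the single-device result:
assert column_parallel_matvec(W_cols, x, num_devices=2) == matvec(W_cols, x)
```

The trade-off is inter-GPU communication at every layer, which is why tensor parallelism works best with fast interconnects (NVLink or similar) and why pipeline mode, which communicates only at stage boundaries, can be preferable over slower links.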
ONNX Runtime Configuration
For ONNX models, fine-tune the runtime for optimal performance:
```toml
[runtime.onnx]
execution_provider = "CUDAExecutionProvider"  # or TensorrtExecutionProvider
graph_optimization_level = "all"
enable_mem_pattern = true
enable_cpu_mem_arena = true
intra_op_num_threads = 4
inter_op_num_threads = 2
```
The TensorRT execution provider offers significant speedups on NVIDIA GPUs but requires the TensorRT SDK to be installed. On first load it compiles an optimized engine for your specific GPU and input shapes; this adds startup time but can reduce inference latency by 2-5x on subsequent requests.
Pricing Strategies
Setting the right price for your inference service balances profitability with competitiveness. The ModelRegistry enforces a minimum price floor, but beyond that, pricing is market-driven.
Check current market rates for your model category:
```bash
citrate-cli model market-rates --category "nlp/sentiment" --rpc https://rpc.cnidarian.cloud
```
Update your model's price:
```bash
citrate-cli model update-price \
  --model-id 0xYOUR_MODEL_ID \
  --new-price 0.0015 \
  --rpc https://rpc.cnidarian.cloud \
  --private-key $PRIVATE_KEY
```
Pricing factors to consider:
| Factor | Impact on Price |
|---|---|
| GPU compute cost | Your primary cost basis -- amortize hardware over expected lifespan |
| Electricity | Significant for GPU-intensive models |
| Network bandwidth | Matters for large input/output payloads |
| Stake opportunity cost | SALT locked as collateral cannot earn staking rewards |
| Model uniqueness | Rare or specialized models can command premium pricing |
| Reputation score | Higher reputation attracts more requests at any price point |
A common strategy is to start with a competitive (lower) price to build reputation, then gradually increase as your model's reputation score grows and attracts organic demand. We've seen hosts do well by undercutting the market by 10-15% early on and then adjusting upward once they have a solid track record.
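As a floor for any strategy, it helps to know your break-even price per request, derived from hardware amortization, power draw, and expected throughput. The sketch below is a back-of-the-envelope calculation; every input is a placeholder to substitute with your own figures:

```python
# Back-of-the-envelope break-even price per inference request.
# All numeric inputs are placeholder assumptions, not network figures.
def break_even_price(gpu_cost_usd, lifespan_hours, power_kw,
                     power_usd_per_kwh, requests_per_hour, margin=0.0):
    """Minimum price per request covering amortized hardware and electricity."""
    hourly_hw = gpu_cost_usd / lifespan_hours    # amortized hardware cost per hour
    hourly_power = power_kw * power_usd_per_kwh  # electricity cost per hour
    cost_per_request = (hourly_hw + hourly_power) / requests_per_hour
    return cost_per_request * (1 + margin)

# Example: a $1,600 GPU amortized over 3 years of uptime, drawing 0.35 kW
# at $0.12/kWh, serving 2,000 requests/hour, with a 25% margin:
price = break_even_price(1600, 3 * 365 * 24, 0.35, 0.12, 2000, margin=0.25)
print(f"{price:.6f} per request")
```

This omits bandwidth and stake opportunity cost from the table above; add those terms if they are material for your setup. Any market price below your break-even is a loss regardless of reputation gains.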
Health and Availability
The network monitors model host availability and penalizes excessive downtime. Configure health checks to ensure your serving daemon stays responsive:
```toml
[health]
check_interval_seconds = 15
timeout_seconds = 5
unhealthy_threshold = 3  # Mark unhealthy after 3 consecutive failed checks
recovery_threshold = 2   # Mark healthy after 2 consecutive successful checks

[health.gpu]
max_temperature_celsius = 85  # Throttle if GPU exceeds this temperature
min_free_vram_mb = 500        # Alert if free VRAM drops below this threshold
```
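The two thresholds behave as a counter-based state machine: consecutive failures flip a host to unhealthy, consecutive successes flip it back. A sketch of that transition logic (illustrative only, not the daemon's actual code):

```python
# Sketch of threshold-based health transitions: unhealthy after N
# consecutive failed checks, healthy again after M consecutive successes.
class HealthTracker:
    def __init__(self, unhealthy_threshold=3, recovery_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.recovery_threshold = recovery_threshold
        self.healthy = True
        self.streak = 0  # consecutive checks pushing toward a state flip

    def record(self, check_passed: bool) -> bool:
        if self.healthy:
            # Count consecutive failures; any success resets the streak.
            self.streak = self.streak + 1 if not check_passed else 0
            if self.streak >= self.unhealthy_threshold:
                self.healthy, self.streak = False, 0
        else:
            # Count consecutive successes; any failure resets the streak.
            self.streak = self.streak + 1 if check_passed else 0
            if self.streak >= self.recovery_threshold:
                self.healthy, self.streak = True, 0
        return self.healthy

t = HealthTracker()
for result in [False, False, False]:  # three failed checks in a row
    t.record(result)
assert t.healthy is False
t.record(True); t.record(True)        # two successes restore health
assert t.healthy is True
```

Because the streak resets on any opposite result, an isolated timeout between successful checks never trips the unhealthy threshold; only sustained failure does.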
Further Reading
- Serving Inference -- the inference request lifecycle
- Registering a Model -- register your hosted model
- Hardware Requirements -- GPU selection guide
- Monitoring -- Prometheus metrics for model hosts