Paraconsistent Consensus
Recursive Mentor-Mentee Learning Over BlockDAG Finality Checkpoints in the Citrate Network
Larry Klosowski
Formalizes how consensus and federated learning are unified in one protocol. Uses Belnap FOUR-valued logic to handle contradictory model outputs without discarding them. Defines the mentor-mentee learning protocol that runs at every BFT checkpoint. Specifies LoRA adapter rank 16, embedding dimension 768, and the slashing parameters for embedding manipulation.
Abstract
We propose Paraconsistent Consensus, a theoretical framework for unifying BlockDAG consensus and federated meta-learning into a single protocol. The core contribution is a formal aggregation function grounded in Belnap’s four-valued logic (FOUR), which classifies each node’s model output into one of four epistemic states (relevant, irrelevant, contradictory, or unknown) relative to a query, and preserves all four states through aggregation rather than collapsing them via averaging. The framework operates atop the Citrate Network’s GhostDAG consensus with BFT finality checkpoints (described in Paper I), extending checkpoint commitments to include learning state. Each network node is treated as both a consensus validator and a computational neuron in a distributed meta-learning architecture. We describe a lightweight routing model (a small attention layer or MLP, not a full transformer) trained at checkpoint intervals to learn input-dependent routing across heterogeneous node models. For model adaptation, we propose LoRA-based adapter generation and composition, and present an honest assessment of the interference bounds: frozen base weights provide structural protection against catastrophic forgetting, but multi-adapter composition introduces interference effects that require active mitigation. We present the theoretical framework with formal definitions and state convergence conditions as hypotheses to be validated empirically. No benchmarks are reported in this paper; experimental methodology for validation is described in Section 8.
Keywords: paraconsistent logic, Belnap FOUR, federated learning, BlockDAG, meta-learning, LoRA composition, GhostDAG, distributed consensus, mixture of experts
1. Introduction
Federated learning systems and blockchain consensus protocols solve structurally similar problems through different mechanisms. Both coordinate distributed agents toward shared objectives without central authority. Both must tolerate Byzantine participants. Both produce collective outputs (aggregated gradients; finalized blocks) from potentially contradictory local observations. Yet the two fields have developed largely independently, with blockchain projects that incorporate AI treating consensus and computation as orthogonal concerns.
This paper asks whether consensus and learning can be the same process. Specifically: can the act of participating in BlockDAG consensus simultaneously contribute to a distributed learning objective, using the DAG’s causal structure for gradient ordering, its blue-score mechanism for trust-weighted aggregation, and its finality checkpoints for learning synchronization?
We propose Paraconsistent Consensus, a framework built on three technical contributions:
First, a formal aggregation function based on Belnap’s four-valued logic [14, 28] that handles contradictory model outputs without discarding minority contributions or collapsing into meaningless averages. Where classical federated averaging treats disagreement as noise, paraconsistent aggregation treats it as information...a signal that different nodes have learned different things, which a routing function can exploit.
Second, a checkpoint-synchronized learning protocol that extends BFT finality checkpoints to include learning state: meta-model routing weights, per-node performance profiles, and LoRA adapter registries. This transforms the checkpoint from a pure consensus mechanism into a learning synchronization point.
Third, a bounded analysis of LoRA adapter composition for distributed model adaptation, grounded in recent literature on task arithmetic [17], interference mitigation [18], and the learn-less-forget-less tradeoff [19]. We present both the structural advantages and the known limitations of adapter-based composition honestly.
Implementation status. This paper describes a theoretical framework. The underlying infrastructure...GhostDAG consensus, BFT finality, LVM execution, and AI precompiles...is implemented and described in Paper I. The extensions proposed here...paraconsistent aggregation, checkpoint-synchronized learning, meta-model training, and LoRA lifecycle management...are designed but not yet implemented. We use [Implemented], [Specified], and [Hypothesis] tags throughout, following the convention established in Paper I.
2. Background and Related Work
2.1 BlockDAG Consensus
[Implemented] The Citrate consensus layer implements GhostDAG [1, 2] with BFT finality, as specified in Paper I. Blocks reference up to 10 parents (1 selected + 9 merge), are classified into blue and red sets via the k-cluster rule (k=18), and achieve finality through committee signatures at checkpoint intervals. The block time is approximately 0.5 seconds (2 BPS), with BFT finality checkpoints every 10 blocks (approximately 5 seconds). These parameters are shared with Paper I and serve as the infrastructure on which this paper’s theoretical extensions are built.
For context, the Kaspa network’s Crescendo hardfork (May 2025) demonstrated that GhostDAG can operate at 10 BPS with k=124 and max parents=16, achieving approximately 3,585-4,000 TPS on mainnet [30]. Citrate’s more conservative configuration (2 BPS, k=18) is chosen to accommodate the per-block overhead of embedding vectors when the learning extensions described in this paper are activated.
2.2 Federated Learning
Federated Averaging (FedAvg) [6] established the canonical approach to distributed learning: N clients train on local data, compute gradient updates, and submit them to a central server that aggregates via weighted averaging. The framework preserves data privacy but introduces a central point of failure. Recent improvements address specific limitations: FedProx [31] adds a proximal regularization term for heterogeneous settings; SCAFFOLD [32] corrects client drift via control variates; and Byzantine-robust aggregation rules (Krum [33], trimmed mean, coordinate-wise median) tolerate malicious updates. None eliminate the trusted aggregator.
The aggregation function in all these approaches is a weighted combination that collapses N client updates into a single update, discarding information about disagreement structure. If two clients produce opposite gradients for the same layer, averaging cancels the signal. This is acceptable when disagreement reflects noise but wasteful when it reflects legitimate specialization...the core insight motivating paraconsistent aggregation.
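The cancellation failure mode is easy to demonstrate with a toy example (the gradient values are illustrative):

```python
import numpy as np

# Two honest clients with legitimate specialization produce opposing
# gradients for the same layer; FedAvg's weighted mean cancels both signals.
g_client_a = np.array([+1.0, -0.5, +0.3])   # hypothetical gradient from client A
g_client_b = -g_client_a                    # client B: opposite specialization

fedavg_update = 0.5 * g_client_a + 0.5 * g_client_b   # equal-weight FedAvg step
# The averaged update is exactly zero: both strong signals are destroyed.
```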
2.3 LoRA and Adapter Composition
Low-Rank Adaptation (LoRA) [8] parameterizes the update to a frozen weight matrix W as a low-rank product BA, so the adapted weights are W + BA, where B and A are factor matrices of rank at most r (rank 16 in our specification). This provides two structural properties relevant to our framework: the base model weights W remain frozen (enabling rollback by subtracting the adapter), and the update magnitude is bounded by the adapter’s spectral norm.
However, the literature on adapter composition reveals significant challenges. Ilharco et al. [17] introduced task arithmetic...adding and subtracting task vectors to edit model behavior...and showed that naive addition of task vectors causes interference when tasks modify overlapping parameters. Yadav et al. [18] proposed TIES-Merging to mitigate this interference through magnitude pruning and sign conflict resolution. Biderman et al. [19] demonstrated the fundamental tradeoff: LoRA “learns less and forgets less” compared to full finetuning, providing better preservation of base model capabilities but reduced task-specific performance. Most recently, OPLoRA [34] identifies that interference concentrates in the dominant singular directions of pre-trained weights and proposes orthogonal projection to mitigate it.
For Paraconsistent Consensus, these findings mean that LoRA-based adaptation provides a structurally favorable starting point...frozen base weights bound maximum regression...but that multi-adapter composition over time will require active interference mitigation. We address this honestly in Section 5.
2.4 Paraconsistent Logic
Classical logic enforces the principle of explosion: from a contradiction, anything follows (ex falso quodlibet). In a distributed network where honest nodes can produce legitimately different outputs from the same input...due to model heterogeneity, different training data, or stochastic inference...this property is destructive. If two nodes disagree, classical aggregation must either average (losing both signals), discard one (losing minority information), or fail entirely.
Paraconsistent logics tolerate contradiction without trivialization [14]. We adopt Belnap’s four-valued logic, specifically the system described in “A Useful Four-Valued Logic” [28] and “How a Computer Should Think” [28b], which was designed precisely for the problem of reasoning with contradictory information from multiple sources. Belnap’s FOUR defines a bilattice with four values: T (true...supported by evidence), F (false...contradicted by evidence), B (both...supported and contradicted simultaneously), and N (neither...no evidence available). The bilattice has two orderings: a truth ordering (F ≤ N ≤ T and F ≤ B ≤ T) and an information ordering (N ≤ T ≤ B and N ≤ F ≤ B). Belnap’s key insight: the information ordering tracks how much we know, independent of what we know. A state of B (contradictory) contains more information than T or F alone...we know that sources disagree, which is itself informative.
This maps directly to our federated learning setting. When nodes agree on a classification, the aggregated state is T. When they disagree, the state is B...and rather than discarding this, we preserve it for the routing function, which can learn to route contradictory inputs to the most reliable expert.
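The two orderings can be made concrete with a small coordinate encoding (an illustrative encoding of our own, not a canonical form): each Belnap value is a (truth, information) pair, and each ordering compares one coordinate.

```python
# Belnap's FOUR as a bilattice, encoded as (truth, information) coordinates.
# This coordinate encoding is our own illustration, not a canonical form.
FOUR = {
    "T": (1.0, 0.5),  # true: maximal truth, partial information
    "F": (0.0, 0.5),  # false: minimal truth, partial information
    "B": (0.5, 1.0),  # both: contradictory, maximal information
    "N": (0.5, 0.0),  # neither: no evidence, minimal information
}

def truth_leq(a, b):
    """Truth ordering: F <= N <= T and F <= B <= T."""
    return FOUR[a][0] <= FOUR[b][0]

def info_leq(a, b):
    """Information ordering: N <= T <= B and N <= F <= B."""
    return FOUR[a][1] <= FOUR[b][1]
```

Note how B sits at the top of the information ordering: knowing that sources disagree is itself the maximally informative state.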
2.5 Mixture of Experts
The Mixture-of-Experts (MoE) architecture [7] provides the template for our routing approach: a gating function routes inputs to specialized sub-networks. In standard MoE, both the experts and the gating function are trained jointly. In our distributed setting, the experts (node models) are trained independently and the gating function (routing model) is trained at consensus checkpoints on aggregated embeddings. The attention mechanism [9] provides the mathematical machinery for input-dependent routing, where each node’s embedding serves as a key/value pair and the inference query serves as the query.
3. The Paraconsistent Aggregation Function
3.1 Formal Framework
[Specified] We formalize the aggregation problem as follows. Let N be the set of nodes in the network, each producing a d-dimensional embedding vector eᵢ ∈ ℝᵈ for a given reference input. Let bᵢ ∈ [0,1] be node i’s normalized blue score (derived from GhostDAG’s blue set classification) and cᵢ ∈ [0,1]ᵈ be node i’s per-dimension confidence vector (derived from softmax entropy).
We define a classification function φ: ℝᵈ × [0,1]ᵈ → {T, F, B, N}ᵈ that maps each node’s embedding and confidence vector to a Belnap state relative to the current query and the network context (in particular, the blue-score-weighted majority direction), operating dimension-wise:
For each dimension j of embedding eᵢ:
• T (relevant): cᵢⱼ > θ_high and eᵢⱼ is directionally consistent with the blue-score-weighted majority. The node is confident and agrees with the network.
• F (irrelevant): cᵢⱼ > θ_high and eᵢⱼ is directionally inconsistent with the majority. The node is confident but disagrees. This is not “wrong”; it signals potential specialization.
• B (contradictory): Multiple nodes with comparable blue scores produce inconsistent embeddings with comparable confidence. The network has conflicting evidence. This state is preserved rather than averaged, signaling to the routing function that this input requires expert disambiguation.
• N (unknown): cᵢⱼ < θ_low. The node’s model is uncertain about this dimension. This embedding dimension contributes minimally to aggregation.
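A minimal sketch of the dimension-wise classifier follows. The thresholds and the “comparable weight” test for B are illustrative assumptions, and mid-range confidence defaults to N here, a simplification of the separate θ_low rule in the text:

```python
import numpy as np

def classify_belnap(e, c, b, theta_high=0.7, comparable=0.3):
    """Dimension-wise Belnap classification of node embeddings.

    e: (n, d) embeddings; c: (n, d) per-dimension confidences in [0, 1];
    b: (n,) normalized blue scores. Returns an (n, d) array of 'T'/'F'/'B'/'N'.
    Thresholds are illustrative; mid-range confidence defaults to N here,
    a simplification of the separate theta_low rule in the text.
    """
    # Blue-score-weighted majority direction per dimension.
    majority = np.sign(np.average(np.sign(e), axis=0, weights=b))   # (d,)
    agree = np.sign(e) == majority                                  # (n, d)

    states = np.full(e.shape, "N", dtype="<U1")
    confident = c > theta_high
    states[confident & agree] = "T"
    states[confident & ~agree] = "F"

    # B: dimensions where confident nodes conflict with comparable blue weight.
    for j in range(e.shape[1]):
        conf_j = confident[:, j]
        w_agree = b[conf_j & agree[:, j]].sum()
        w_disagree = b[conf_j & ~agree[:, j]].sum()
        total = w_agree + w_disagree
        if total > 0 and min(w_agree, w_disagree) / total > comparable:
            states[conf_j, j] = "B"     # mark the whole conflicting group
    return states

# Illustrative three-node, two-dimension example: dimension 0 is consensual,
# dimension 1 carries a confident conflict between two equally trusted nodes.
e = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
c = np.array([[0.9, 0.9], [0.9, 0.9], [0.9, 0.2]])
b = np.array([0.5, 0.5, 0.2])
states = classify_belnap(e, c, b)
```

In the demo, dimension 0 classifies as T for all nodes, while dimension 1 is marked B for the two confident, conflicting nodes and N for the uncertain one.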
3.2 Aggregation Rule
[Specified] The paraconsistent aggregation produces two outputs: an aggregated embedding e_agg and a state vector s ∈ {T, F, B, N}ᵈ characterizing the epistemic status of each dimension.
The aggregated embedding is computed as a trust-confidence-weighted combination:
e_agg[j] = Σᵢ (wᵢⱼ · eᵢ[j]) / Σᵢ wᵢⱼ
where the weight wᵢⱼ = cᵢ[j] · [softmax(b / τ)]ᵢ combines per-dimension confidence with blue-score-derived trust, the softmax being taken across nodes. The temperature τ controls trust concentration: low τ concentrates weight on high-blue-score nodes; high τ distributes weight more uniformly.
Critically, the state vector s is computed independently of the aggregated embedding. Even when e_agg produces a reasonable weighted average, s[j] = B signals that this dimension carried contradictory evidence...information that pure averaging would destroy. The routing model receives both e_agg and s as input, enabling it to make routing decisions informed by the network’s agreement structure, not just the aggregated values.
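The weighted combination itself is straightforward; a minimal numpy sketch (the state vector s is computed separately, as in Section 3.1):

```python
import numpy as np

def paraconsistent_aggregate(e, c, b, tau=1.0):
    """Trust-confidence-weighted aggregation of Section 3.2.

    e: (n, d) embeddings, c: (n, d) confidences, b: (n,) blue scores.
    Returns the aggregated embedding e_agg of shape (d,); the Belnap
    state vector s is computed independently (Section 3.1).
    """
    trust = np.exp(b / tau)
    trust /= trust.sum()                   # softmax over nodes
    w = c * trust[:, None]                 # w[i, j] = c[i, j] * trust[i]
    return (w * e).sum(axis=0) / w.sum(axis=0)

# With uniform confidence and equal blue scores this reduces to a plain mean.
e_agg = paraconsistent_aggregate(np.array([[1.0, 2.0], [3.0, 4.0]]),
                                 np.ones((2, 2)), np.zeros(2))
```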
3.3 Comparison to Standard Aggregation
Table 1. Aggregation Approaches Under Node Disagreement

| Method | Handles Disagreement By | Information Preserved | Limitation |
|---|---|---|---|
| FedAvg [6] | Weighted average | None (cancellation) | Opposing gradients cancel |
| Krum [33] | Selecting the most representative update | One viewpoint only | Discards minority views |
| Trimmed mean | Removing extremes, averaging the rest | Central tendency | Removes potential experts |
| Paraconsistent (ours) | Classifying into Belnap FOUR, preserving the state vector | Full disagreement structure | Requires a routing model to exploit it |
The tradeoff is explicit: paraconsistent aggregation preserves more information but requires a routing model capable of exploiting that information. If the routing model is poorly trained, the additional information is wasted and the system reduces to weighted averaging. The value of paraconsistent aggregation is therefore contingent on the quality of the routing model...a dependency we acknowledge as a primary validation target.
4. Checkpoint-Synchronized Learning
4.1 Extended Checkpoint Structure
[Specified] We propose extending the BFT finality checkpoint (Paper I, Section 2.3) with three additional committed fields: (a) a routing weights hash...the Merkle root of the current meta-model’s routing weights, enabling any node to verify it has the correct routing model; (b) an adapter registry hash...the Merkle root of the active LoRA adapter registry, tracking which adapters are assigned to which nodes; and (c) a performance profile hash...the Merkle root of per-node performance metrics aggregated over the preceding checkpoint interval.
This extension is backward-compatible: nodes that do not participate in federated learning ignore these fields, and the consensus protocol treats learning and non-learning nodes identically for ordering purposes. The checkpoint interval (every 10 blocks, approximately 5 seconds at 2 BPS) establishes the learning synchronization cadence...the minimum interval at which the network’s collective intelligence is committed and propagated.
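The extended checkpoint can be sketched as a plain record of the three commitments; the Merkle construction and field names below are illustrative placeholders, not the wire format defined in Paper I:

```python
import hashlib
from dataclasses import dataclass

def merkle_root(leaves):
    """Toy binary Merkle root over byte-string leaves (illustrative only)."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                  # duplicate last node if odd
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

@dataclass
class LearningCheckpoint:
    """BFT checkpoint extended with the three proposed learning commitments."""
    height: int
    routing_weights_root: bytes      # (a) Merkle root of routing model weights
    adapter_registry_root: bytes     # (b) Merkle root of active LoRA adapters
    performance_profile_root: bytes  # (c) Merkle root of per-node metrics

# Demo with placeholder leaves standing in for serialized learning state.
root = merkle_root([b"routing-w1", b"routing-w2", b"routing-w3"])
cp = LearningCheckpoint(height=10, routing_weights_root=root,
                        adapter_registry_root=root,
                        performance_profile_root=root)
```

Because only three 32-byte roots are committed, the consensus-visible cost of the extension is constant regardless of model size.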
4.2 The Routing Model
[Specified] The routing model is a lightweight component, not a full transformer. In the initial implementation, we specify a small multi-layer perceptron (MLP) or single-layer attention mechanism with approximately 100K-500K parameters. This model takes as input: (a) the query embedding (d dimensions), (b) the aggregated embedding e_agg (d dimensions), (c) the Belnap state vector s (d dimensions, encoded as 2 bits per dimension: one bit for the truth ordering, one for the information ordering), and (d) compressed per-node capability profiles (k dimensions per node, where k ≪ d). It outputs routing weights across participating nodes, determining which nodes should serve inference for the given query.
A model of this size can be retrained in milliseconds on modern hardware, making checkpoint-interval updates feasible even at 5-second intervals. We reserve the term “second-order transformer” for a long-term research goal in which a full attention mechanism is trained over node embeddings...treating each node’s full embedding as a token...but this is not the initial architecture. The initial routing model is deliberately simple to ensure it can be updated reliably at consensus timescales.
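A sketch of the specified routing model at its smallest (a one-hidden-layer MLP over the concatenated inputs); all shapes and the demo sizes are illustrative assumptions:

```python
import numpy as np

def routing_forward(query, e_agg, state_bits, profiles, W1, W2):
    """One-hidden-layer MLP router: concatenated features -> softmax over nodes.

    query, e_agg: (d,); state_bits: (2*d,) Belnap encoding (truth bit and
    information bit per dimension); profiles: (n*k,) flattened per-node
    capability profiles. W1: (h, in_dim) and W2: (n, h) are trainable weights.
    """
    x = np.concatenate([query, e_agg, state_bits, profiles])
    hidden = np.maximum(0.0, W1 @ x)       # ReLU hidden layer
    logits = W2 @ hidden
    z = np.exp(logits - logits.max())      # numerically stable softmax
    return z / z.sum()                     # routing weights over n nodes

# Demo with illustrative sizes: d=4 embedding dims, n=3 nodes, k=2, h=8.
rng = np.random.default_rng(0)
d, n, k, h = 4, 3, 2, 8
in_dim = d + d + 2 * d + n * k             # query + e_agg + state bits + profiles
W1 = 0.1 * rng.normal(size=(h, in_dim))
W2 = 0.1 * rng.normal(size=(n, h))
weights = routing_forward(rng.normal(size=d), rng.normal(size=d),
                          rng.integers(0, 2, size=2 * d).astype(float),
                          rng.normal(size=n * k), W1, W2)
```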
4.3 Training Protocol
[Specified] The learning protocol operates in three phases, described as design specifications rather than implemented behavior:
Phase 1: Embedding Collection. Nodes include embedding vectors in their blocks (adding approximately 3 KB per block for a 768-dimensional float32 embedding). The routing model begins learning from accumulated embeddings after a calibration period. During this phase, no adapters are generated; the system learns which nodes are reliable and which exhibit specialization.
Phase 2: Routing Activation. Once the routing model achieves a confidence threshold (measured by entropy reduction in routing decisions over successive checkpoints), inference queries begin using learned routing rather than uniform distribution. Nodes that consistently provide high-quality responses for specific input classes receive proportionally more traffic.
Phase 3: Adapter Generation. The routing model identifies node weaknesses...input classes where a node’s performance is consistently below the network average...and generates targeted LoRA adapters. These adapters are registered on-chain via the LoRAFactory precompile (Paper I, address 0x1003) and distributed to the target nodes via IPFS. This is the mentorship mechanism: stronger nodes effectively teach weaker nodes through parameter-efficient fine-tuning signals.
We emphasize that the phase boundaries and transition criteria are design parameters, not empirical findings. The specific calibration period duration, confidence thresholds, and adapter generation frequency require tuning on a live testnet...work we describe as future experimental methodology in Section 8.
5. LoRA Adapter Composition: An Honest Assessment
5.1 Structural Advantages
The LoRA architecture provides three structural properties favorable to our federated setting:
Bounded perturbation. Each adapter perturbs the base model by ΔW = BA, where the perturbation magnitude is bounded by ‖BA‖ₛ ≤ ‖B‖ₛ‖A‖ₛ. For rank-r adapters, this provides an explicit, auditable bound on the maximum deviation from base model behavior.
Clean rollback. Since base weights are frozen, an adapter can be removed by subtraction (setting ΔW = 0), restoring the exact pre-adapter model state. This provides a safety mechanism unavailable in full finetuning: if an adapter degrades performance, it can be cleanly reverted.
Reduced forgetting. Biderman et al. [19] demonstrated that LoRA mitigates catastrophic forgetting more effectively than full finetuning, weight decay, and dropout...though it does not eliminate forgetting entirely. Their results show that LoRA better maintains base model performance on tasks outside the fine-tuning domain, which is the core property we rely on for federated composition.
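The bounded-perturbation property is directly checkable numerically; the model width below is illustrative (the specification fixes rank r = 16):

```python
import numpy as np

# For a rank-r adapter, the perturbation Delta_W = B @ A obeys
# ||B @ A||_s <= ||B||_s * ||A||_s (submultiplicativity of the spectral norm).
rng = np.random.default_rng(1)
d_model, r = 64, 16                        # illustrative width; rank 16 as specified
B = rng.normal(size=(d_model, r)) / np.sqrt(d_model)
A = rng.normal(size=(r, d_model)) / np.sqrt(r)

def spec(M):
    return np.linalg.norm(M, ord=2)        # largest singular value

delta_norm = spec(B @ A)
bound = spec(B) * spec(A)                  # auditable upper bound on deviation
```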
5.2 Composition Challenges
We do not claim that adapter composition avoids regression “by construction.” The literature is clear that naive additive composition of LoRA adapters...or more generally, task vectors...causes interference when adapters modify overlapping parameter subspaces [17, 18, 34]. Specifically:
Task arithmetic interference. Ilharco et al. [17] showed that adding task vectors (the difference between fine-tuned and pre-trained weights) enables flexible model editing, but that the sum of multiple task vectors produces interference proportional to the cosine similarity between the vectors. When adapters are independently trained on different tasks, their weight updates can point in conflicting directions, particularly in overlapping parameter regions.
LoRA-specific merging challenges. Recent work at ICLR 2025 demonstrates that standard merging methods (task arithmetic, TIES-Merging) transfer poorly to LoRA models compared to fully fine-tuned models, because LoRA’s low-rank constraint concentrates updates in a lower-dimensional subspace where conflicts are more likely [35]. KnOTS [35] addresses this through SVD-based alignment into a shared space before merging.
Dominant subspace interference. OPLoRA [34] identifies that forgetting concentrates in the dominant singular directions of pre-trained weights...the principal components that encode the most important learned representations. Adapters that inadvertently modify these directions cause disproportionate degradation.
5.3 Bounded Regression Guarantee
[Hypothesis] We state the following bounded guarantee, which holds under specified conditions:
Theorem 1 (Bounded Regression Under Perfect Routing). Let M₀ be the base model with weights W, and let a₁, …, aₖ be k LoRA adapters with perturbations ΔW₁, …, ΔWₖ. If the routing function achieves perfect specialization...each query is routed to exactly one adapter, and no adapter is applied to inputs outside its training distribution...then the maximum regression on any task t served by adapter aᵢ is bounded by ‖ΔWᵢ‖ₛ · L, where L is the Lipschitz constant of the model’s forward pass. Under imperfect routing with bounded error ε, the regression bound becomes O(ε · maxᵢ ‖ΔWᵢ‖ₛ · L).
This guarantee degrades with routing error. If the routing function misassigns queries, multiple adapters may be applied to overlapping inputs, and the interference effects documented in [17, 18, 34] apply. The guarantee is therefore only as strong as the routing model, creating a circular dependency: the routing model’s quality depends on the training signal from the learning loop, which depends on the adapters’ quality, which depends on routing. We acknowledge this circularity and propose the following mitigation strategies as future work:
(a) Orthogonal adapter training. Following OPLoRA [34], constrain adapter updates to lie in the orthogonal complement of the base model’s dominant singular directions. This provably prevents interference with the most important learned representations.
(b) Task-gated composition. Rather than additively composing adapters, use the routing model to select a single adapter per query. This avoids multi-adapter interference entirely but sacrifices the possibility of combining complementary adapter knowledge.
(c) Periodic adapter consolidation. At regular intervals (e.g., every 1,000 checkpoints), consolidate accumulated adapters into a single adapter via SVD-based alignment [35], reducing the number of active adapters and resolving accumulated interference.
(d) Adapter retirement. The LoRAFactory precompile supports an explicit retirement mechanism: adapters whose contribution falls below a threshold (measured by routing weight decay) are removed from the active registry, bounding the total number of concurrent adapters.
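Mitigation (a) can be sketched as a projection of the adapter update onto the orthogonal complement of the base weights’ top singular directions. This is a simplified illustration of the OPLoRA idea [34], not the exact published procedure:

```python
import numpy as np

def project_out_dominant(delta_w, w_base, k=8):
    """Remove an adapter update's component along the top-k left singular
    directions of the base weights (a simplified, OPLoRA-inspired step)."""
    U, _, _ = np.linalg.svd(w_base, full_matrices=False)
    U_k = U[:, :k]                              # dominant singular directions
    return delta_w - U_k @ (U_k.T @ delta_w)    # orthogonal-complement part

# Demo: the projected update has no component along the protected directions.
rng = np.random.default_rng(2)
W = rng.normal(size=(32, 32))                   # stand-in base weight matrix
dW = rng.normal(size=(32, 32))                  # stand-in adapter update
dW_safe = project_out_dominant(dW, W, k=4)
U, _, _ = np.linalg.svd(W, full_matrices=False)
residual = np.linalg.norm(U[:, :4].T @ dW_safe)  # numerically ~0
```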
6. Block Structure and Integration
6.1 Extended Block Fields
[Specified] The standard GhostDAG block structure is extended with three optional fields: (a) an embedding vector eᵢ ∈ ℝᵈ, computed by passing a shared reference input through the node’s local model (approximately 3 KB for d=768, float32); (b) a confidence vector cᵢ ∈ [0,1]ᵈ, derived from softmax entropy (approximately 3 KB); and (c) a gradient commitment...a cryptographic hash of the node’s gradient update, revealed in the subsequent block to prevent front-running. These fields are optional and backward-compatible: non-participating nodes leave them empty.
6.2 Overhead Analysis
[Specified] The learning protocol adds per-block overhead of approximately 6-7 KB (embedding + confidence vectors for d=768). At 2 BPS, this is approximately 12-14 KB/s of additional bandwidth per node, well within the 10 MB maximum block size specified in Paper I. For larger embedding dimensions (d=1024 or d=2048), the overhead scales linearly: 8-9 KB or 16-18 KB per block respectively. If overhead becomes constraining at higher block rates, embedding compression (e.g., float16 quantization, PCA dimensionality reduction, or learned compression) can reduce it by 2-4×, though with some fidelity loss in the Belnap state classification. The tradeoff between embedding fidelity and overhead is an engineering parameter to be tuned on the testnet.
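The overhead arithmetic can be captured in a small helper (defaults match the paper’s d = 768, float32, 2 BPS configuration):

```python
def per_block_overhead_bytes(d=768, dtype_bytes=4, bps=2.0):
    """Back-of-envelope learning overhead: embedding + confidence vector.

    Defaults give the paper's figures: 3,072 bytes per vector (~3 KB),
    6,144 bytes per block (~6 KB), 12,288 bytes/s (~12 KB/s) at 2 BPS.
    """
    per_vector = d * dtype_bytes
    per_block = 2 * per_vector              # one embedding + one confidence vector
    return per_block, per_block * bps

per_block, per_second = per_block_overhead_bytes()
```

Switching to float16 (`dtype_bytes=2`) is the simplest of the compression options, halving the overhead at some cost to Belnap state fidelity.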
7. Safety, Liveness, and Convergence
7.1 Consensus Safety and Liveness
[Implemented...inherited from Paper I] The learning extensions do not modify the consensus protocol’s safety or liveness guarantees. GhostDAG block ordering, blue/red set classification, and BFT finality operate identically whether or not the learning fields are populated. Safety (no conflicting finalized states) holds under f < n/3, and liveness (all honest transactions confirmed) holds under f < n/2, by the arguments in Paper I, Section 2.3. The learning protocol rides on top of consensus, using its outputs (blue scores, finality commitments) as inputs to the aggregation function, but does not participate in the consensus mechanism itself.
7.2 Learning Convergence
[Hypothesis] We state the following convergence conjecture. Under standard federated learning convergence conditions...bounded gradient variance σ², L-smooth loss function, and either convexity or the Polyak-Łojasiewicz condition...the paraconsistent aggregation converges to a stationary point of the global loss function.
The argument proceeds as follows. The aggregation weights wᵢ are normalized (softmax output) and bounded. Under honest majority, blue-score weighting down-weights Byzantine contributions. The confidence weighting provides variance reduction by emphasizing nodes that are more certain. Together, these produce a weighted gradient estimator whose bias and variance can be bounded in terms of the temperature τ and the honest-to-Byzantine ratio.
However, this argument is not a proof. Several complications prevent a clean convergence guarantee: (a) the blue score itself is a function of network dynamics, not a fixed trust weight, introducing a time-varying coefficient that standard convergence proofs do not handle; (b) the Belnap state classification introduces a discontinuity (thresholding) that may interfere with gradient-based optimization; and (c) the routing model and the node models are updated simultaneously, creating a coupled dynamics problem. We state this as Conjecture 1 and propose the experimental methodology to test it in Section 8. Proving convergence for the coupled consensus-learning system under Byzantine conditions is an open problem that we identify as the primary theoretical challenge for this framework.
7.3 Slashing Extensions
[Specified] The standard slashing conditions from Paper I (50% for fraudulent inference, 25% for extended downtime, 10% for equivocation) are extended with one learning-specific condition: embedding manipulation (10% of stake)...submitting embedding vectors that are detectably inconsistent with the node’s registered model. Detection is via random spot-checks where validators independently compute the embedding for the same reference input and compare. This slashing condition requires the verifiable inference infrastructure from Paper I (Section 3.3) and operates at the optimistic verification tier: manipulation is assumed innocent unless challenged.
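The spot-check reduces to recomputing the embedding and comparing; a minimal sketch using cosine similarity with an illustrative threshold:

```python
import numpy as np

def spot_check(submitted, recomputed, cos_threshold=0.99):
    """Validator-side embedding spot-check (threshold is illustrative).

    The validator recomputes the embedding for the shared reference input
    with the node's registered model and flags the node on low similarity.
    """
    cos = float(np.dot(submitted, recomputed) /
                (np.linalg.norm(submitted) * np.linalg.norm(recomputed)))
    return cos >= cos_threshold            # True = consistent with registered model

honest = np.array([0.2, -0.5, 0.7])
passes = spot_check(honest, honest + 1e-6)  # tiny numerical drift: passes
fails = spot_check(honest, -honest)         # flipped embedding: fails
```

A real deployment would need a threshold calibrated to legitimate nondeterminism (hardware, quantization) before any challenge is escalated to the verifiable inference tier.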
8. Experimental Methodology: What We Plan to Test
This section describes our planned experimental validation. No experiments have been conducted. We include this section to demonstrate that the theoretical framework generates testable hypotheses and to specify the methodology before results are collected, avoiding post-hoc rationalization.
8.1 Hypothesis 1: Paraconsistent Aggregation Outperforms Averaging
Claim to test: On a network of heterogeneous models (different architectures, training data distributions, and capability profiles), paraconsistent aggregation with Belnap FOUR state vectors produces higher routing accuracy than blue-score-weighted averaging alone.
Methodology: Deploy N=50-100 nodes on a testnet, each hosting a different fine-tuned model (varying architectures from 1B-7B parameters). Use a held-out evaluation set with known domain labels. Compare routing accuracy (fraction of queries routed to the best-performing node for that domain) across three conditions: (a) uniform routing, (b) blue-score-weighted averaging, (c) paraconsistent aggregation with Belnap state vectors. Report mean accuracy, per-domain accuracy, and calibration curves.
Baseline: If paraconsistent aggregation does not outperform weighted averaging by a statistically significant margin (p < 0.05, corrected for multiple comparisons), the added complexity of Belnap state classification is not justified, and the framework should simplify to weighted averaging...a valuable negative result.
8.2 Hypothesis 2: Adapter Composition Improves Over Time
Claim to test: Over successive checkpoint intervals, LoRA adapters generated by the routing model improve node performance on weak domains without degrading performance on strong domains beyond a bounded regression.
Methodology: Run the full learning loop for an extended period (target: 10,000 checkpoints, approximately 14 hours at 5-second intervals). Track per-node, per-domain accuracy at each checkpoint. Measure: (a) weak-domain improvement rate, (b) strong-domain regression rate, (c) total adapter count and interference metrics (cosine similarity between adapter weight vectors), (d) routing model entropy (measuring specialization). Compare with and without adapter consolidation.
Success criteria: Weak-domain accuracy improves monotonically (within noise) over checkpoint intervals, and strong-domain regression remains below the theoretical bound from Theorem 1. If regression exceeds the bound, identify the routing error rate and determine whether mitigation strategies (a)-(d) from Section 5.3 are necessary.
8.3 Hypothesis 3: Convergence Under Byzantine Conditions
Claim to test: The learning protocol converges to useful routing in the presence of up to f < n/3 Byzantine nodes submitting adversarial embeddings.
Methodology: Introduce Byzantine nodes at varying fractions (10%, 20%, 30% of network) with three adversarial strategies: (a) random embeddings, (b) strategically misleading embeddings (designed to corrupt routing for specific input classes), (c) gradient poisoning (submitting gradients that maximize divergence from the true gradient). Measure convergence time, routing accuracy at convergence, and the effectiveness of blue-score down-weighting at isolating Byzantine contributions.
This is the hardest test. Strategic Byzantine actors who understand the aggregation function can design adversarial embeddings that pass confidence checks while corrupting the routing model. We do not claim robustness to all adversarial strategies...we aim to characterize the attack surface and identify which strategies are mitigated by blue-score weighting and which require additional defense.
9. Biological Inspiration
The recursive learning architecture draws inspiration from cnidarian (jellyfish) neurobiology, as described in the companion Paper IX. The nerve net of Aurelia aurita achieves global motor coordination through purely local interactions: neurons propagate signals through overlapping local neighborhoods without central arbitration, refractory periods prevent message duplication, and the network’s topology determines its computational properties [20, 21]. These biological observations motivated three specific design choices in Paraconsistent Consensus: (a) the use of DAG topology (rather than a centralized aggregator) for embedding propagation, mirroring the nerve net’s leaderless coordination; (b) the checkpoint mechanism as an analogue to the nerve net’s through-conducting pulse, which periodically synchronizes the network’s state; and (c) the emphasis on preserving disagreement (paraconsistent aggregation) rather than forcing consensus, mirroring the nerve net’s tolerance for simultaneous excitatory and inhibitory signals. These analogies are design inspirations, not formal justifications. The system’s properties are established through the distributed systems and machine learning arguments in Sections 3-7, not through biological analogy.
10. Conclusion
Paraconsistent Consensus proposes that blockchain consensus and federated learning can be unified into a single protocol, using DAG topology for gradient ordering, blue scores for trust-weighted aggregation, finality checkpoints for learning synchronization, and Belnap’s four-valued logic for information-preserving aggregation. The framework’s central claim (that preserving contradictory information through Belnap state vectors enables better routing than collapsing it through averaging) is a testable hypothesis, not a demonstrated result.
We have been deliberate about the boundaries of our claims. The aggregation function is formally defined but not empirically validated. The LoRA composition bounds hold under stated conditions that may not obtain in practice. The convergence conjecture is plausible but unproven. The routing model is specified as a lightweight MLP, not the “second-order transformer” of the long-term vision. These are not weaknesses of the paper; they are the honest state of a theoretical framework that awaits experimental validation.
The infrastructure required for validation (GhostDAG consensus, BFT finality, LVM execution, and AI precompiles) is implemented and described in Paper I. The organizational learning theory motivating the mentor-mentee architecture is developed in Paper III. The engineering methodology for testing is described in Paper IV. What remains is the bridge between theory and measurement, which the experimental methodology in Section 8 is designed to provide.
References
[1] Sompolinsky, Y., & Zohar, A. (2015). Secure high-rate transaction processing in Bitcoin. Financial Cryptography and Data Security, 507-527.
[2] Sompolinsky, Y., & Zohar, A. (2018). PHANTOM: A scalable BlockDAG protocol. IACR Cryptology ePrint Archive, Report 2018/104.
[3] Keidar, I., et al. (2021). All you need is DAG. Proceedings of the ACM PODC, 165-175.
[4] Danezis, G., et al. (2022). Narwhal and Tusk: A DAG-based mempool and efficient BFT consensus. EuroSys.
[5] Spiegelman, A., et al. (2022). Bullshark: DAG BFT protocols made practical. CCS.
[6] McMahan, B., et al. (2017). Communication-efficient learning of deep networks from decentralized data. AISTATS, 1273-1282.
[7] Shazeer, N., et al. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ICLR 2017.
[8] Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
[9] Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 30.
[10] Buterin, V. (2014). Ethereum: A next-generation smart contract and decentralized application platform.
[11] Castro, M., & Liskov, B. (1999). Practical Byzantine fault tolerance. OSDI, 173-186.
[12] Ben-Sasson, E., et al. (2014). Succinct non-interactive zero knowledge for a von Neumann architecture. USENIX Security.
[13] Rao, Y., & Opentensor Foundation. (2021). Bittensor: A peer-to-peer intelligence market.
[14] Priest, G. (2006). In Contradiction: A Study of the Transconsistent. Oxford University Press.
[15] Kirkpatrick, J., et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526.
[16] Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system.
[17] Ilharco, G., et al. (2023). Editing models with task arithmetic. ICLR 2023.
[18] Yadav, P., et al. (2023). TIES-Merging: Resolving interference when merging models. NeurIPS 2023.
[19] Biderman, D., et al. (2024). LoRA learns less and forgets less. Transactions on Machine Learning Research. Featured Certification.
[20] Anderson, P. A. V. (1989). Evolution of the First Nervous Systems. NATO ASI Series, Springer.
[21] Weissbourd, B., et al. (2021). A genetically tractable jellyfish model for systems and evolutionary neuroscience. Cell, 184(24), 5854-5868.
[22] Klosowski, L. (2025). The Medusa Paradigm. Cnidarian Foundation Working Paper.
[23] Anthropic. (2024). Model Context Protocol Specification. Anthropic Technical Report.
[24] Senge, P. M. (1990). The Fifth Discipline. Doubleday.
[25] Nonaka, I., & Takeuchi, H. (1995). The Knowledge-Creating Company. Oxford University Press.
[26] Argyris, C., & Schön, D. A. (1978). Organizational Learning. Addison-Wesley.
[27] Pallasdies, F., et al. (2019). From single neurons to behavior in the jellyfish Aurelia aurita. eLife, 8, e50084.
[28] Belnap, N. D. (1977). A useful four-valued logic. In: Dunn, J. M., Epstein, G. (eds) Modern Uses of Multiple-Valued Logic. Episteme, vol 2. Springer.
[28b] Belnap, N. D. (1977). How a computer should think. In: Ryle, G. (ed) Contemporary Aspects of Philosophy. Oriel Press.
[29] Dunn, J. M. (2019). Two, three, four, infinity: The path to the four-valued logic and beyond. In: New Essays on Belnap-Dunn Logic. Springer.
[30] Kaspa Network. (2025). KIP-14: Crescendo Hardfork. Activated May 5, 2025.
[31] Li, T., et al. (2020). Federated optimization in heterogeneous networks. MLSys 2020.
[32] Karimireddy, S. P., et al. (2020). SCAFFOLD: Stochastic controlled averaging for federated learning. ICML 2020.
[33] Blanchard, P., et al. (2017). Machine learning with adversaries: Byzantine tolerant gradient descent. NeurIPS 2017.
[34] OPLoRA. (2025). Orthogonal Projection LoRA Prevents Catastrophic Forgetting. arXiv:2510.13003.
[35] KnOTS. (2025). Knowledge Transfer via SVD for Parameter-Efficient Multi-Task Model Merging. ICLR 2025.
Appendix A: Protocol Parameters (Shared with Paper I)
Table A1. Parameters Referenced in This Paper
| Parameter | Value | Source |
|---|---|---|
| Block time | ~0.5 seconds (2 BPS) | Paper I, Section 2.2 |
| k parameter | 18 | Paper I, Section 2.1 |
| Max parents | 10 (1 selected + 9 merge) | Paper I, Section 2.1 |
| BFT committee | 100 validators, 67 signatures | Paper I, Section 2.3 |
| Checkpoint interval | 10 blocks (~5 seconds) | Paper I, Section 2.3 |
| Max block size | 10 MB | Paper I, Appendix A |
| SALT total supply | 1,000,000,000 | Paper I, Section 5 |
| Minimum stake | 10,000 SALT | Paper I, Section 5 |
| Slashing (fraud) | 50% of stake | Paper I, Section 6 |
| Slashing (downtime) | 25% of stake (>24h) | Paper I, Section 6 |
| Slashing (equivocation) | 10% of stake | Paper I, Section 6 |
| Slashing (embedding manipulation) | 10% of stake | This paper, Section 7.3 |
| Fraud proof window | 100 blocks (~50 seconds) | Paper I, Section 3.3 |
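The time-denominated figures in Table A1 are derived from the block rate. A minimal consistency check, using only the parameter values stated in the table:

```python
# Consistency check for the derived timings in Table A1 (illustrative only).
BPS = 2                          # blocks per second (Paper I)
BLOCK_TIME_S = 1 / BPS           # ~0.5 s per block
CHECKPOINT_INTERVAL_BLOCKS = 10  # BFT checkpoint every 10 blocks
FRAUD_PROOF_WINDOW_BLOCKS = 100  # fraud proofs accepted for 100 blocks

checkpoint_interval_s = CHECKPOINT_INTERVAL_BLOCKS * BLOCK_TIME_S
fraud_proof_window_s = FRAUD_PROOF_WINDOW_BLOCKS * BLOCK_TIME_S
print(checkpoint_interval_s, fraud_proof_window_s)  # 5.0 50.0
```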
Table A2. Learning Protocol Parameters (Specified, Not Implemented)
| Parameter | Specified Value | Rationale |
|---|---|---|
| Embedding dimension (d) | 768 (default), configurable | Matches common transformer hidden sizes |
| Embedding precision | float32 (default), float16 optional | Tradeoff: fidelity vs. bandwidth |
| Per-block embedding overhead | ~3 KB (d=768, float32) | Within 10 MB block size limit |
| Routing model size | 100K-500K parameters | Must retrain at checkpoint interval |
| Belnap threshold (θ_high) | 0.8 (proposed) | Requires calibration on testnet |
| Belnap threshold (θ_low) | 0.3 (proposed) | Requires calibration on testnet |
| Temperature (τ) | 1.0 (default) | Governs blue-score trust concentration |
| Adapter rank (r) | 16 (default), configurable | Standard LoRA configuration |
| Adapter consolidation interval | Every 1,000 checkpoints | Balance: freshness vs. interference |
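The ~3 KB per-block embedding overhead in Table A2 follows directly from the dimension and precision parameters. A quick sanity check (illustrative arithmetic only):

```python
# Sanity check of the per-block embedding overhead in Table A2 (illustrative).
D = 768                              # embedding dimension (default)
MAX_BLOCK_BYTES = 10 * 1024 * 1024   # 10 MB block size limit (Paper I)

fp32_bytes = D * 4                   # float32: 3072 bytes, i.e. ~3 KB
fp16_bytes = D * 2                   # float16 halves the bandwidth cost
fraction_of_block = fp32_bytes / MAX_BLOCK_BYTES
print(fp32_bytes, fp16_bytes, f"{fraction_of_block:.5%}")
```

At float32 the embedding consumes well under 0.1% of the block size limit, which is why the table treats the float16 option purely as a bandwidth optimization rather than a capacity requirement.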
───
This paper is part of the Gradient Papers series published by the Cnidarian Foundation.
Correspondence: larry@cnidarianfoundation.org