
ATIS

Analog Token Importance Scoring for Energy-Efficient Transformer Attention Pruning

Larry Klosowski · Claude (Anthropic)

Analog Token Importance Scoring. A research proposal for using field-programmable analog arrays (FPAAs) to pre-filter transformer attention before digital compute receives it. If this works at scale, inference energy costs drop by orders of magnitude. This is the most speculative paper in the series and the most clearly labeled as such.

Abstract

Transformer models have become foundational to modern AI, yet their quadratic attention complexity creates significant energy and latency bottlenecks, particularly for edge deployment. Current token pruning approaches rely on digital computation to score token importance after computing expensive matrix multiplications, missing the opportunity for early-stage filtering. We propose ATIS (Analog Token Importance Scoring), a novel analog middleware architecture based on Field-Programmable Analog Arrays (FPAAs) that performs ultra-low-power importance scoring before digital attention computation begins. By leveraging operational transconductance amplifiers (OTAs) for approximate dot-product computation and analog comparators for threshold-based pruning, ATIS targets elimination of 60-80% of tokens from subsequent digital processing while consuming microwatt-scale power in the analog core. Critically, ATIS is distinguished from prior analog-digital hybrid attention accelerators, which employ fixed-function ASIC designs with in-memory computing, by its use of reconfigurable FPAA hardware that enables rapid prototyping, model-agnostic deployment, and indefinite reprogramming without device wear-out. We present the theoretical framework, proposed circuit architecture, an honest assessment of the DAC/ADC interface bottleneck that constrains all analog neural accelerators, and a phased research roadmap toward proof-of-concept implementation.

Keywords: Field-Programmable Analog Arrays, transformer attention, token pruning, analog computing, edge AI, reconfigurable hardware, energy-efficient inference, hybrid analog-digital architecture

1. Introduction

The transformer architecture has revolutionized machine learning, achieving state-of-the-art performance across natural language processing, computer vision, and multimodal tasks. However, the self-attention mechanism at the heart of transformers presents a fundamental computational challenge: its O(n²) complexity with respect to sequence length creates severe bottlenecks in both energy consumption and latency, particularly as context windows expand to accommodate increasingly complex tasks.

Recent research has demonstrated that attention patterns in trained transformers exhibit significant sparsity: the majority of tokens contribute minimally to any given output computation. Studies on vision transformers show that up to 66% of input tokens can be pruned with less than 0.5% accuracy degradation [3]. In language models, learned token pruning methods demonstrate that 70-80% of tokens can be filtered per layer while maintaining task performance [2]. The zero-shot token pruning framework Zero-TPrune further demonstrates that importance scoring can leverage existing attention graph structure without additional training overhead [14].

The critical insight motivating this work is that current pruning approaches operate after the expensive attention computation has already begun. Digital importance scoring requires fetching token embeddings, computing partial products, and making threshold comparisons, all before any computational savings are realized. This creates a paradox: the very computation required to determine what can be skipped often negates the benefits of skipping.

We propose a fundamentally different approach: analog pre-filtering of token importance before digital attention computation begins. By leveraging the inherent properties of analog circuits (continuous-time operation, parallel computation via physical laws, and extremely low power consumption in the microwatt to nanowatt range), we can create a lightweight “importance scoring” middleware that sits between the token embedding stage and the digital attention accelerator.

Field-Programmable Analog Arrays (FPAAs) provide an ideal platform for this approach. Unlike fixed-function analog circuits, FPAAs offer reconfigurability comparable to their digital counterparts (FPGAs) while maintaining analog’s inherent advantages for certain mathematical operations. Modern FPAAs incorporate operational transconductance amplifiers (OTAs), programmable comparators, and configurable routing: precisely the building blocks needed for approximate vector-matrix multiplication and threshold-based decision making. Critically, FPAAs based on floating-gate transistors can implement vector-matrix multiplication within the routing fabric itself, achieving extraordinary density and energy efficiency [4, 5, 8].

Differentiation from prior work. Moradifirouzabadi et al. [6] demonstrated a hybrid analog-digital attention accelerator fabricated in 65nm CMOS that uses charge-based SRAM compute-in-memory (CIM) to prune approximately 75% of low-score tokens. While ATIS shares the high-level philosophy of analog pre-filtering followed by digital precision computation, our approach differs in three fundamental respects: (1) ATIS operates on token embeddings before any QKᵀ computation, whereas CIM-based approaches compute approximate attention scores from queries and keys that have already been projected; (2) ATIS uses reconfigurable FPAA hardware that can be reprogrammed for different models without fabricating new silicon; and (3) ATIS targets a middleware role that is platform-agnostic, designed to sit between any embedding pipeline and any digital accelerator, rather than being an integrated processor.

2. Background and Related Work

2.1 Transformer Attention Mechanism

The scaled dot-product attention mechanism computes: Attention(Q, K, V) = softmax(QKᵀ / √d_k)V, where Q (queries), K (keys), and V (values) are linear projections of the input sequence. For a sequence of length n with embedding dimension d, the QKᵀ computation alone requires O(n²d) multiply-accumulate operations. In multi-head attention with h heads, this becomes O(hn²d), repeated at every layer. ATIS is primarily applicable to encoder-style attention and the prefill-phase computation in decoder models, where full input sequences are processed simultaneously.
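To make the scaling concrete, the operation count for the QKᵀ stage can be tallied in a few lines (a back-of-envelope sketch; the sizes used are illustrative, not taken from any specific model):

```python
# Multiply-accumulate count for the QK^T stage alone:
# n = sequence length, d = per-head dimension, h = number of heads.
def qkt_macs(n: int, d: int, h: int) -> int:
    # Each head forms an n x n score matrix; every entry is a
    # length-d dot product, i.e. d multiply-accumulates.
    return h * n * n * d

# Doubling the sequence length quadruples the QK^T cost: O(n^2) in n.
assert qkt_macs(2048, 64, 12) == 4 * qkt_macs(1024, 64, 12)
```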

2.2 Token Pruning Approaches

Token pruning methods fall into several categories. Static pruning removes tokens based on position or fixed criteria, while dynamic pruning adapts to input content. Learned Token Pruning (LTP) trains per-layer thresholds [2]. DynamicViT inserts lightweight prediction modules between transformer blocks [3]. The A2SF framework [16] extended token pruning to decoder-based large language models with fairness corrections for tokens of different ages. A critical observation from this literature is that token importance can often be approximated from early-layer features: the cumulative attention received by each token provides a reliable importance signal. Zero-TPrune [14] demonstrated that a Weighted Page Rank algorithm applied to the attention graph can identify unimportant tokens zero-shot. This enables our analog approach: we do not need perfect importance scores, only approximate rankings sufficient to identify clearly unimportant tokens.

2.3 Analog Computing for Neural Networks

Analog in-memory computing has emerged as a promising approach for energy-efficient neural network acceleration. Memristor crossbar arrays can perform vector-matrix multiplication in O(1) time by exploiting Ohm’s law and Kirchhoff’s current law for accumulation. Recent demonstrations have shown 70,000× energy reduction and 100× speedup compared to GPUs for attention computation using gain-cell-based analog memory [13]. However, memristor-based approaches face challenges for attention mechanisms specifically: the key and query matrices change dynamically during inference, requiring constant reprogramming that exceeds device endurance limits (10⁴-10⁸ write cycles).

2.4 The DAC/ADC Interface Bottleneck

A critical and often underappreciated constraint on all analog neural network accelerators is the energy and latency overhead of analog-to-digital and digital-to-analog conversion. Research has demonstrated that DAC/ADC peripheral circuits can account for up to 85% of total power consumption in mixed-signal neural computing hardware [19]. The Burr et al. comprehensive review [9] confirms that peripheral circuits typically dominate the energy consumption, area, and latency of analog neuromorphic accelerators. ATIS’s approach (using binary comparator outputs rather than high-resolution ADC readout) inherently sidesteps the ADC bottleneck on the output side, but the DAC overhead for converting digital embeddings to analog voltages remains a significant cost that must be honestly accounted for in any energy analysis (see Section 4.2).

2.5 Field-Programmable Analog Arrays

FPAAs consist of Configurable Analog Blocks (CABs) connected through a programmable routing network. Each CAB typically contains OTAs, capacitors, comparators, and sometimes dedicated multiplier blocks. The RASP (Reconfigurable Analog Signal Processor) family, developed at Georgia Tech, specifically includes 4×4 vector-matrix multiplier modules within their CABs [5, 8]. A key innovation in modern FPAAs is the use of floating-gate transistors for both routing switches and weight storage. As Hasler describes [4], the floating-gate-enabled routing crossbar provides excellent switches while also enabling vector-matrix multiplication in the routing fabric through analog programming. SoC FPAA devices have demonstrated end-to-end machine learning applications from microphones to classified results at power levels below 23 µW [23], suggesting that next-generation FPAAs designed for machine learning could handle workloads at dramatically lower energy than digital alternatives.

2.6 Differentiation from Prior Analog-Digital Hybrid Accelerators

Table 1. ATIS vs. Related Analog-Digital Hybrid Accelerators

| System | Hardware | Stage of Operation | Reconfigurable? | Write Endurance |
|---|---|---|---|---|
| Moradifirouzabadi [6] | 65nm CMOS ASIC, SRAM CIM | After Q/K projection | No (ASIC) | Unlimited (SRAM) |
| HARDSEA [17] | ReRAM + SRAM hybrid | Product-quantized token relevance | No (NVM weights) | Limited (10⁴-10⁸) |
| Sebastian [13] | Gain-cell analog memory | Full attention acceleration | Partial | Good (gain-cell) |
| Wang [11] | Memristor crossbar | Full attention (PSPICE sim) | No | Limited (10⁴-10⁸) |
| ATIS (ours) | FPAA (floating-gate) | Before Q/K projection | Yes (indefinite) | Unlimited (floating-gate) |

3. ATIS Architecture

3.1 System Overview

ATIS operates as a middleware layer between token embedding and attention computation. The core insight is that we do not need to compute exact attention scores to identify unimportant tokens; we only need to identify tokens that are clearly below an importance threshold. This relaxed accuracy requirement enables analog implementation with significant energy savings, provided the DAC interface overhead is properly managed.

The system flow: (1) Token embeddings from the digital domain are converted to analog voltages via DACs; (2) the FPAA computes approximate importance scores using OTA-based inner products against a learned query prototype; (3) analog comparators generate binary keep/prune decisions; (4) only tokens exceeding the threshold are forwarded to the digital attention accelerator; (5) the digital system computes precise attention only for the surviving token subset. The output of the analog pipeline is a simple binary mask (not a high-resolution analog signal), which means the output path requires only comparators rather than energy-expensive ADCs.
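The five-step flow can be sketched behaviorally in a few lines of NumPy (function and variable names are ours, and the DAC is modeled as simple uniform quantization rather than any specific converter):

```python
import numpy as np

def atis_prefilter(embeddings, query_prototype, threshold, dac_bits=6):
    # (1) DAC model: quantize each dimension to 2^dac_bits levels over [-1, 1].
    levels = 2 ** dac_bits - 1
    analog = np.round((embeddings + 1) / 2 * levels) / levels * 2 - 1
    # (2) OTA array: per-token inner product against the query prototype;
    #     current summation at a shared node is just the dot product.
    scores = analog @ query_prototype
    # (3) Comparators: binary keep/prune decisions -- no ADC on the output path.
    return scores > threshold, scores

rng = np.random.default_rng(0)
emb = rng.uniform(-1, 1, size=(16, 8))   # 16 tokens, 8 compressed dimensions
q = rng.uniform(-1, 1, size=8)           # learned query prototype
keep, scores = atis_prefilter(emb, q, threshold=0.0)
# (4)-(5) Only the surviving subset reaches the digital attention accelerator.
survivors = emb[keep]
```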

3.2 Importance Score Computation

We define an approximate importance score for token i as:

Sᵢ ≈ Σⱼ (qⱼ · kᵢⱼ)

where q represents a learned or averaged query vector and kᵢ is the key projection of token i. This formulation has several advantages for analog implementation. First, it requires only vector-vector products, not full matrix multiplication; each token’s importance can be computed independently and in parallel. Second, the accumulation naturally maps to Kirchhoff’s current law: currents from multiple OTA outputs sum at a common node without explicit adder circuits. Third, the result is a scalar voltage that can be directly compared against a threshold, bypassing the need for high-resolution ADC.

The query vector q can be: (a) a learned static prototype representing “important” tokens, trained jointly with the transformer; (b) an exponential moving average of recent queries, computed with a simple analog RC filter; or (c) extracted from early attention layers and held constant for subsequent layers. Zero-TPrune [14] demonstrated that attention-graph-based importance from pre-trained models provides reliable pruning guidance without fine-tuning, supporting the viability of a fixed prototype.
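Option (b) has a simple discrete-time counterpart: the analog RC filter's behavior is an exponential moving average (the α value below is illustrative, not a circuit parameter from this paper):

```python
# Discrete-time analogue of the RC low-pass filter used to track recent queries.
def update_prototype(q_ema, q_new, alpha=0.1):
    # Small alpha ~ long RC time constant: the prototype drifts slowly.
    return (1 - alpha) * q_ema + alpha * q_new

# A constant stream of identical queries pulls the prototype toward them.
q = 0.0
for _ in range(100):
    q = update_prototype(q, 1.0)
```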

3.3 OTA-Based Inner Product Circuit

The operational transconductance amplifier produces an output current proportional to the product of its transconductance g_m and the differential input voltage: I_out = g_m × (V+ − V−). By modulating g_m with one operand and the input voltage with another, we achieve analog multiplication [15]. For inner product computation, an array of N OTAs (where N is the compressed embedding dimension) has each OTA’s transconductance programmed to represent one element of the query vector q, while the differential inputs receive the corresponding key element kᵢ. Output currents from all OTAs sum at a common node, implementing the dot product. Subthreshold OTA designs have demonstrated power consumption as low as 20-75 nW per amplifier [24, 25].
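A behavioral model of this array can be written down directly (our sketch; the mismatch level is an illustrative assumption, not measured silicon):

```python
import numpy as np

def ota_dot_product(k_vec, gm_weights, sigma_mismatch=0.02, rng=None):
    # Each OTA outputs I = g_m * V_in; the programmed g_m encodes one query element.
    rng = rng if rng is not None else np.random.default_rng(0)
    # Device mismatch: each OTA's transconductance deviates by a few percent.
    gm_actual = gm_weights * (1 + sigma_mismatch * rng.standard_normal(len(gm_weights)))
    currents = gm_actual * k_vec   # per-OTA output currents
    return currents.sum()          # Kirchhoff summation at the common node

exact = float(np.dot(np.ones(64), np.ones(64)))    # ideal dot product: 64.0
noisy = ota_dot_product(np.ones(64), np.ones(64))  # mismatch-corrupted estimate
```

With mismatch set to zero the model reduces exactly to the dot product, which is the sense in which the analog array "computes" Sᵢ.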

The programmable transconductance is achieved through floating-gate biasing. Values can be programmed with 8-10 bit precision and retained indefinitely without refresh. For the static query prototype approach, this programming occurs once during model deployment, avoiding the write endurance concerns that plague memristor and phase-change memory approaches.

3.4 Threshold Comparison and Adaptive Control

The summed current from the OTA array is converted to a voltage via a transimpedance amplifier, then fed to a programmable comparator. The comparator reference voltage sets the importance threshold θ. Hysteresis (5-10% of threshold) prevents oscillation near the boundary. The comparator output is a digital signal that directly gates token forwarding, building a binary mask indicating which tokens participate in subsequent digital attention.

A fixed threshold may be suboptimal across different inputs and layers. We propose an adaptive mechanism: a low-pass filtered version of the comparator output generates an “average activity” signal that modulates the threshold voltage through analog feedback. If pruning is too aggressive, the threshold decreases; if too permissive, it increases. This ensures useful sparsity levels even as input distributions shift.
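A discrete-time sketch of this feedback loop (time constant and gain are illustrative): the low-pass filter estimates the recent keep rate, and the integrating threshold update steers it toward a target sparsity.

```python
import numpy as np

def adapt(theta, activity, keep_bit, target_keep=0.3, tau=0.05, gain=0.1):
    # Low-pass filter of the comparator output = estimated keep rate.
    activity = (1 - tau) * activity + tau * keep_bit
    # Too permissive (activity above target) -> raise the threshold, and vice versa.
    theta = theta + gain * (activity - target_keep)
    return theta, activity

rng = np.random.default_rng(0)
theta, activity, kept = 0.0, 0.0, []
for score in rng.standard_normal(20000):
    keep = float(score > theta)
    kept.append(keep)
    theta, activity = adapt(theta, activity, keep)
rate = float(np.mean(kept[10000:]))  # long-run keep rate settles near target_keep
```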

4. Precision and Energy Analysis

4.1 Precision Requirements

Low precision (4-6 bits effective) is sufficient for the pre-filtering task. The goal is not to compute exact attention scores but to identify clearly unimportant tokens. Errors in the “gray zone” near the threshold have minimal impact since the digital system performs precise computation on retained tokens. Empirical studies on quantized attention show that 4-bit attention scores maintain accuracy within 1% of floating-point baselines for most tasks. The analog noise floor, device mismatch, and limited transconductance linearity collectively limit effective precision to approximately 6 bits, which is adequate for our purposes.
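The gray-zone argument can be checked with a toy experiment (ours; a uniform quantizer stands in for the analog precision limit):

```python
import numpy as np

def quantize(x, bits):
    # Uniform quantizer over the observed range -- a stand-in for the
    # effective precision of the analog score path.
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

rng = np.random.default_rng(1)
scores = rng.standard_normal(10000)
theta = np.quantile(scores, 0.7)           # prune the bottom 70%
exact_keep = scores > theta
approx_keep = quantize(scores, 6) > theta  # 6-bit effective precision
# Keep/prune decisions disagree only for scores in the gray zone near theta.
agreement = float((exact_keep == approx_keep).mean())
```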

4.2 Energy Consumption: An Honest Accounting

This analysis is essential because it reveals a genuine limitation. Consider 512-dimensional embeddings compressed to 64 dimensions for analog processing. With 64 OTAs at 100 nA bias current and 1V supply, static power is approximately 6.4 µW. Dynamic power adds roughly 1 µW. The analog core total: under 10 µW per token stream.

However, the DAC interface overhead must be honestly included. Converting 64 compressed dimensions at 6-bit precision requires 64 DACs. At approximately 1 pJ/conversion/bit for moderate-speed DACs, this adds 64 × 6 × 1 pJ = 384 pJ per token. At a 1 MHz token rate, DAC power is approximately 384 µW, dramatically exceeding the sub-10 µW analog core power. This is consistent with the broader finding that DAC/ADC peripherals can dominate total power [9, 19].

Mitigation strategies: (a) time-multiplexing fewer DACs across dimensions, trading latency for power; (b) low-resolution (3-4 bit) DACs with accuracy-aware training; (c) switched-capacitor DAC architectures integrated within the FPAA fabric; (d) pairing ATIS with analog sensor front-ends where signals are already analog (e.g., vision transformers processing camera outputs). The optimal strategy depends on deployment scenario and remains an active research question.

Corrected system-level comparison: a digital dot product of 64 compressed dimensions requires 64 MACs at ~1 pJ/MAC = 64 pJ per token. Including DAC overhead, ATIS totals ~391 pJ (DAC-dominated) versus 64 pJ digital. The analog core provides orders-of-magnitude savings, but the DAC bottleneck erases this advantage at the system level unless mitigated. DAC-free or DAC-reduced architectures represent the most important engineering challenge for ATIS.
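The arithmetic above can be reproduced directly (all figures taken from the text; the ~7.4 pJ core term follows from the ~7.4 µW static-plus-dynamic core estimate at a 1 MHz token rate):

```python
dims, bits, e_per_bit = 64, 6, 1e-12  # 64 DACs, 6 bits, ~1 pJ/conversion/bit
e_dac = dims * bits * e_per_bit       # DAC energy per token: 384 pJ
token_rate = 1e6                      # 1 MHz token stream
p_dac = e_dac * token_rate            # DAC power: 384e-6 W = 384 microwatts

e_digital = 64 * 1e-12                # digital baseline: 64 MACs at ~1 pJ/MAC
e_atis = e_dac + 7.4e-12              # DAC + analog core, per token (~391 pJ)
# The DAC term dominates: per-token, unmitigated ATIS costs ~6x the digital score.
assert e_atis > e_digital
```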

4.3 System-Level Impact

Despite the DAC overhead challenges, ATIS provides its greatest value at the system level through downstream savings. If analog pre-filtering removes 70% of tokens, the digital attention computation is reduced by 70% in the QKᵀ product and proportionally in subsequent operations. For a sequence of 1024 tokens with 70% pruning, the attention matrix shrinks from approximately 1M elements to approximately 90K elements, an 11× reduction. This compounds across layers: a 12-layer transformer with per-layer analog pruning achieves cumulative sparsity far exceeding single-layer approaches. Even if ATIS’s per-token energy cost is comparable to digital scoring, the downstream savings from reduced digital computation can justify the analog pre-filter stage.
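The sequence-level numbers work out as follows (straight from the figures in the text):

```python
n, keep_frac = 1024, 0.3          # 1024 tokens, 70% pruned by the analog stage
full = n * n                      # ~1.05M attention-score elements
pruned = int(n * keep_frac) ** 2  # ~94K elements after pre-filtering
reduction = full / pruned         # ~11x smaller QK^T score matrix
```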

5. Challenges and Open Questions

5.1 Dimensionality Reduction

Modern transformers use embedding dimensions of 768-4096, far exceeding practical analog array sizes. Dimensionality reduction is essential. Options include: (a) random projection, which preserves relative distances via the Johnson-Lindenstrauss lemma: a 768-dimensional vector can be projected to 64 dimensions while preserving pairwise distances within a factor of (1±ε); (b) learned projection matrices trained for importance preservation; (c) selecting embedding dimensions with highest variance. The projection is performed digitally before DAC conversion, reducing both DAC count and analog array size.
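Option (a) is easy to sanity-check numerically: a standard Gaussian Johnson-Lindenstrauss sketch, with the 768→64 sizes from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out = 768, 64
# Random Gaussian projection, scaled so squared norms are preserved in expectation.
P = rng.standard_normal((d_in, d_out)) / np.sqrt(d_out)

x, y = rng.standard_normal(d_in), rng.standard_normal(d_in)
orig = float(np.linalg.norm(x - y))
proj = float(np.linalg.norm((x - y) @ P))
ratio = proj / orig  # close to 1: pairwise distance approximately preserved
```

The distortion shrinks as d_out grows, which is the tradeoff against DAC count and analog array size.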

5.2 Multi-Head Attention

Transformer attention uses multiple heads (typically 12-96) that attend to different aspects of the input. Three approaches: (a) a single “consensus” score averaged across heads, supported by findings that many heads are redundant; (b) replicated scoring circuits for head groups (e.g., 4 groups for a 12-head model); (c) time-multiplexed FPAA reprogramming to cycle through head-group prototypes. The consensus approach is most practical for initial prototyping.
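Option (a), the consensus score, is the simplest to sketch (ours, illustrative): one averaged score per token, then a single comparator decision.

```python
import numpy as np

def consensus_keep(per_head_scores, threshold):
    # per_head_scores: (heads, tokens). Average across heads to exploit
    # head redundancy, then make one keep/prune decision per token.
    return per_head_scores.mean(axis=0) > threshold

scores = np.array([[0.9, 0.1, 0.5],
                   [0.7, 0.2, 0.1]])  # 2 heads, 3 tokens
mask = consensus_keep(scores, threshold=0.4)  # per-token means: 0.8, 0.15, 0.3
```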

5.3 Calibration and Drift

Subthreshold OTAs are sensitive to temperature: a 10°C change can shift transconductance by 30-40%. Calibration strategies include periodic auto-calibration, differential architectures for common-mode cancellation, temperature-compensated current references, and on-chip temperature sensors with digital lookup tables. These are well-understood techniques in the analog IC design community but require careful engineering for production deployment.

5.4 Applicability Across Architectures

ATIS is most directly applicable to encoder-style transformers (BERT, ViTs) and the prefill phase of decoder models. For autoregressive LLM generation, ATIS could serve as a KV-cache compressor, analogous to the Heavy-Hitter Oracle (H2O) approach. Vision transformers are a particularly attractive target: image patches are tokenized at the sensor interface where analog signals are naturally available, potentially eliminating DAC overhead entirely.

6. Implementation Roadmap

6.1 Phase 1: Simulation and Algorithm Validation

SPICE-level simulation of FPAA components: model OTA-based inner product circuits with realistic noise, mismatch, and nonlinearity; simulate the complete scoring and thresholding pipeline including DAC models; validate against software baselines using actual transformer attention patterns from DeiT-S and BERT-base; and characterize the precision-sparsity-accuracy tradeoff. A critical deliverable is an honest energy model that includes DAC overhead and identifies deployment scenarios where ATIS provides net energy savings.

6.2 Phase 2: FPAA Prototype

Proof-of-concept on commercial FPAAs (Anadigm AN231E04) or research-grade SoC FPAAs from Georgia Tech’s RASP family with native VMM blocks. A 16-64 dimensional prototype at reduced token rates is the most realistic initial target. The prototype will be integrated with a digital FPGA implementing a small transformer model for end-to-end measurements of scoring accuracy, latency, energy consumption, and task accuracy impact.

6.3 Phase 3: Custom ASIC Development

Target specifications: 256-1024 dimension support, sub-microsecond latency per token, under 100 µW power consumption for the analog core (DAC overhead architecture-dependent), and compatibility with standard digital transformer accelerators.

7. Relationship to the Gradient Papers Series

ATIS occupies a unique position in the Gradient Papers: it is the hardware research layer. While Papers I-IV describe software-level protocol architecture, organizational design, and engineering methodology, Paper V asks: can the computational bottleneck in transformer inference, the operation that Citrate nodes must perform thousands of times per second, be addressed at the silicon level?

Paper I (Citrate Technical Paper) specifies that each network node hosts transformer models ranging from 0.5 to 25 GB and performs inference as part of consensus participation. The embedding vectors committed to blocks (Paper I, Section 2.2; Paper II, Section 6.1) require passing shared reference inputs through the node’s local model. ATIS could reduce the per-inference energy cost of this operation by pre-filtering tokens before attention computation, directly improving the economics of node operation.

Paper II (Paraconsistent Consensus) proposes that nodes submit d-dimensional embeddings (d=768 default) at every block (~0.5 seconds). Each embedding computation requires a full transformer forward pass. If ATIS can reduce the attention computation cost by 70% per forward pass, the bandwidth and energy overhead of embedding generation described in Paper II (approximately 6-7 KB per block) becomes more sustainable for resource-constrained nodes.

Paper IX (The Medusa Paradigm) emphasizes biological signal processing as a design principle. Cnidarian nerve nets operate entirely in the analog domain: electrochemical signal propagation, graded potentials, and threshold-based firing. ATIS is the most literal realization of this biological principle: analog signal processing for threshold-based filtering of information before expensive digital computation. The nerve net does not digitize every signal; it pre-filters through analog biophysics and only propagates signals that exceed a threshold. ATIS does the same for transformer tokens.

Current status. ATIS is at the theoretical framework stage (Phase 1 of the roadmap). No prototypes have been built. The connection to the Citrate Network is aspirational: if ATIS proves viable through the research roadmap, it could serve as a hardware optimization layer for Citrate validator nodes, but this depends on outcomes that have not yet been demonstrated.

8. Conclusion

ATIS proposes a novel approach to transformer efficiency that leverages reconfigurable analog computing for token importance scoring. By performing approximate importance estimation in the analog domain before digital attention computation, ATIS targets elimination of the majority of tokens from expensive digital processing while consuming microwatt-scale power in the analog core. The key insight is that reconfigurable analog systems, specifically FPAAs, provide a unique platform for this middleware role that is distinct from the fixed-function ASIC approaches pursued by other analog-digital hybrid accelerators.

We have provided an honest assessment of the challenges, most critically the DAC interface bottleneck. The energy advantage of the analog core is substantial, but system-level benefit depends on effective DAC overhead mitigation, an active area of research across the broader analog computing community. Vision transformer applications, where analog sensor signals can potentially bypass DAC conversion entirely, represent the most immediately promising deployment scenario.

This paper opens a new design space at the intersection of reconfigurable analog computing and neural network sparsity exploitation. Within the Gradient Papers series, it represents the hardware research layer: the investigation of whether biologically inspired analog pre-filtering can make on-chain AI inference economically sustainable.

Acknowledgments

This work was developed through a collaborative human-AI research process, with Claude (Anthropic) contributing to literature review, critical analysis of energy claims, identification of the DAC bottleneck as a primary challenge, and manuscript preparation. The authors affirm that all technical claims have been verified against published literature and that limitations are transparently disclosed.

References

[1] Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 30.

[2] Kim, S., et al. (2022). Learned Token Pruning for Transformers. Proc. ACM SIGKDD, pp. 784-794.

[3] Rao, Y., et al. (2021). DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. NeurIPS 34.

[4] Hasler, J. (2020). Large-Scale Field-Programmable Analog Arrays. Proc. IEEE, 108(8), pp. 1283-1302.

[5] Schlottmann, C. R., & Hasler, P. E. (2011). A Highly Dense, Low Power, Programmable Analog Vector-Matrix Multiplier. IEEE J. ETCAS, 1(3).

[6] Moradifirouzabadi, A., et al. (2024). An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing. Proc. IEEE A-SSCC.

[7] Le Gallo, M., et al. (2023). A 64-core mixed-signal in-memory compute chip based on phase-change memory. Nature Electronics, 6, pp. 680-693.

[8] Hall, T. S., et al. (2005). Developing large-scale field-programmable analog arrays. Int. J. Embedded Systems, 1, pp. 179-192.

[9] Burr, G. W., et al. (2020). Analog architectures for neural network acceleration based on non-volatile memory. Applied Physics Reviews, 7, 031301.

[10] Ambrogio, S., et al. (2023). An analog-AI chip for energy-efficient speech recognition. Nature, 620, pp. 768-775.

[11] Wang, H., et al. (2024). Efficient memristor accelerator for transformer self-attention. Scientific Reports, 14, 24112.

[12] Zhou, H., et al. (2022). Photonic matrix multiplication lights up photonic accelerator and beyond. Light: Sci. & Appl., 11, 30.

[13] Sebastian, A., et al. (2025). Analog in-memory computing attention mechanism for fast and energy-efficient LLMs. Nature Computational Science.

[14] Wang, H., et al. (2024). Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph. Proc. IEEE/CVF CVPR.

[15] Texas Instruments. (2018). Demystifying the Operational Transconductance Amplifier. Application Report SBOA117A.

[16] Jo, H., & Kim, J. (2024). A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning. arXiv:2407.20485.

[17] Xu, Z., et al. (2024). HARDSEA: Hybrid Analog-ReRAM and Digital-SRAM Accelerator for Dynamic Sparse Self-Attention. IEEE Trans. VLSI Systems, 32(3).

[18] Bal, S., et al. (2024). Xpikeformer: Hybrid Analog-Digital Hardware Acceleration for Spiking Transformers. arXiv:2408.08794.

[19] Wang, Y., et al. (2025). An overhead-reduced, efficient, fully analog neural-network computing hardware. Science Advances, 11, eadv7555.

[20] Kim, Y., et al. (2022). Extreme Partial-Sum Quantization for Analog CIM Accelerators. ACM J. ETCS, 18(4).

[21] Navardi, M., et al. (2024). ADC/DAC-Free Analog Acceleration with Frequency Transformation. IEEE Trans. VLSI Systems.

[22] Bai, Z., et al. (2025). HyAtten: Hybrid Photonic-digital Accelerator for Attention Mechanism. arXiv:2501.11286.

[23] Hasler, J. (2022). The Potential of SoC FPAAs for Emerging Ultra-Low-Power Machine Learning. J. Low Power Electron. Appl., 12(2), 33.

[24] Magnelli, L., et al. (2014). Design of a 75-nW, 0.5-V subthreshold CMOS operational amplifier. Int. J. Circuit Theory and Applications, 42, pp. 967-977.

[25] Akbari, M., et al. (2017). A 63-dB gain OTA operating in subthreshold with 20-nW power consumption. Int. J. Circuit Theory and Applications, 45, pp. 843-858.

Appendix A: Cross-Paper Parameter Consistency

Table A1. Citrate Network Parameters Referenced in This Paper

| Parameter | Value | Source |
|---|---|---|
| Block time | ~0.5 seconds (2 BPS) | Paper I, Section 2.2 |
| Embedding dimension (d) | 768 (default), configurable | Paper II, Appendix A2 |
| Per-block embedding overhead | ~6-7 KB (d=768, float32) | Paper II, Section 6.2 |
| Checkpoint interval | 10 blocks (~5 seconds) | Paper I, Section 2.3 |
| Node model range | 0.5-25 GB | Paper I, Section 2.1 |
| Target pruning rate (ATIS) | 60-80% of tokens | This paper, Section 3.1 |
| Analog core power | <10 µW per token stream | This paper, Section 4.2 |
| DAC overhead (unmitigated) | ~384 µW at 1 MHz (64-dim) | This paper, Section 4.2 |
| Effective analog precision | 4-6 bits | This paper, Section 4.1 |

───

This paper is part of the Gradient Papers series published by the Cnidarian Foundation.

Correspondence: larry@cnidarianfoundation.org
