Behavioral Issues
Behavior-Driven Development as Methodology for Agentic System Engineering
Larry Klosowski
The engineering methodology paper. Establishes Behavior-Driven Development with Gherkin specification as the required methodology for building agentic systems on Citrate. Every feature is specified before it is coded. Red tests come before green tests. This discipline is what keeps the agent swarm predictable at scale.
Abstract
AI coding agents (Claude Code, Cursor, Windsurf, Aider, Replit Agent) can generate functionally correct code but lack the engineering discipline that prevents scope drift, architectural violations, and silent regressions over extended development sessions. This paper argues that Behavior-Driven Development (BDD), specifically the Given/When/Then specification language (Gherkin) combined with the red-green-refactor cycle, provides the missing structural discipline for agentic system engineering. The argument is grounded in practitioner experience: the author ran three concurrent agentic builds (IKWE.ai, Citrate Network components, and the Polyp Framework) using BDD-first methodology across multiple agent platforms and observed consistent patterns of agent behavior improvement when constrained by executable specifications versus free-form instructions. We formalize these observations as a methodology and connect them to the Citrate Network’s Mentorship Protocol (Paper III): the BDD test suite functions as the mentor, the coding agent as the mentee, and passing tests as successful knowledge transfer. This is a practitioner essay, not an empirical study. We report observations from real projects but do not claim statistical significance. We propose experimental methodology for validating the observations.
Keywords: behavior-driven development, BDD, Gherkin, agentic systems, test-driven development, AI coding agents, red-green-refactor, specification language, software engineering methodology
1. Introduction
The AI coding agent revolution has inverted a classical software engineering assumption: that the bottleneck in development is writing code. Agents can generate hundreds of lines of syntactically correct, functionally plausible code per minute. The new bottleneck is specification: telling the agent what to build with sufficient precision that the output matches the intent, integrates with existing architecture, and does not silently break previously working functionality.
This paper makes a specific claim: Behavior-Driven Development (BDD), as formalized by Dan North [1] and implemented through the Gherkin specification language [2], provides a natural and effective discipline for constraining agentic code generation. The Given/When/Then template forces the developer to specify observable behavior before the agent writes implementation code. The red-green-refactor cycle (write a failing test, make it pass, clean up) provides incremental checkpoints that prevent scope drift. And the executable specification document (.feature files) provides living documentation that persists across agent sessions, solving the context window problem that plagues long-running agentic builds.
Practitioner basis. The observations in this paper come from running three concurrent projects using BDD-first methodology with AI coding agents between October 2025 and February 2026: (a) IKWE.ai, a HIPAA-compliant healthcare chatbot built with TypeScript, LangChain, and Cucumber.js; (b) components of the Citrate Network’s consensus layer; and (c) the Polyp Framework, a universal project scaffold for agentic IDEs. These projects used different agent platforms (Claude Code, Windsurf/Cascade, Cursor, Replit Agent) and different technology stacks, providing a degree of variation in the observations, but not controlled experimental conditions.
Implementation status. This paper describes engineering methodology: the builder’s workflow for developing systems described in Papers I-III. It does not modify the Citrate protocol. We use [Observation] for patterns noticed across projects, [Practice] for recommended workflows derived from experience, and [Hypothesis] for testable predictions about agent behavior under BDD constraints.
2. Background: BDD and Gherkin
2.1 Origins and Principles
Behavior-Driven Development was introduced by Dan North in 2003-2006 [1] as a response to recurring confusion in Test-Driven Development (TDD). North observed that developers new to TDD consistently struggled with three questions: where to start, what to test, and what not to test. By replacing the word “test” with “behavior,” North reframed the practice: instead of testing implementation details, developers describe the next behavior the system should exhibit. This shift from verification to specification is the key insight.
BDD draws from three traditions. First, TDD’s red-green-refactor cycle [3, 4]: write a failing test, make it pass with minimal code, then refactor. Second, Eric Evans’ Domain-Driven Design [5] and its emphasis on ubiquitous language, a shared vocabulary between technical and business stakeholders that permeates the codebase. Third, acceptance test-driven development (ATDD), where acceptance criteria are defined before implementation and serve as the definition of done [6].
2.2 The Gherkin Specification Language
Gherkin is a structured natural language format for describing software behavior [2]. Its core template, Given/When/Then, decomposes behavior into preconditions, actions, and expected outcomes:
Feature: User Authentication
  As a healthcare provider
  I want to authenticate with MFA
  So that patient data remains HIPAA-compliant

  Scenario: Successful MFA login
    Given a registered provider with email "dr@clinic.org"
    And MFA is enabled for their account
    When they submit valid credentials
    And they enter the correct MFA code
    Then they should receive an authenticated session
    And the session should expire after 30 minutes of inactivity
Each scenario is executable: tools like Cucumber.js [7], pytest-bdd [8], or Behave map Given/When/Then steps to code functions. The feature file serves simultaneously as specification, test, and documentation. This triple function is what makes Gherkin particularly suited to agentic development, where the agent needs all three in a single parseable format.
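The step-to-function mapping these tools perform can be sketched in miniature. The registry, decorator, and handler names below are illustrative, not any framework’s real API; the sketch shows only the core idea that a regex binds each Gherkin line to a function, and an unmatched line surfaces as a gap.

```python
import re

# Minimal sketch of a BDD step registry (illustrative names throughout).
STEPS = []

def step(pattern):
    """Register a handler whose regex matches a Given/When/Then line."""
    def register(fn):
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register

@step(r'a registered provider with email "(.+)"')
def given_provider(ctx, email):
    ctx["user"] = {"email": email, "mfa": False}

@step(r"MFA is enabled for their account")
def given_mfa(ctx):
    ctx["user"]["mfa"] = True

def run_step(ctx, line):
    """Strip the Gherkin keyword, then dispatch to a matching handler."""
    text = re.sub(r"^(Given|When|Then|And|But)\s+", "", line.strip())
    for pattern, fn in STEPS:
        match = pattern.fullmatch(text)
        if match:
            fn(ctx, *match.groups())
            return True
    return False  # undefined step: a gap is surfaced, as in the red phase

ctx = {}
run_step(ctx, 'Given a registered provider with email "dr@clinic.org"')
run_step(ctx, "And MFA is enabled for their account")
```

Real frameworks add reporting, hooks, and scenario lifecycle management, but the dispatch mechanism is essentially this.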
2.3 The Red-Green-Refactor Cycle
Kent Beck’s TDD cycle [3] proceeds in three phases. Red: write a test that fails (confirming the behavior does not yet exist). Green: write the minimum code to make the test pass (confirming the behavior now exists). Refactor: restructure the code without changing behavior (confirmed by the still-passing test). The cycle repeats for each new behavior.
The critical property of this cycle for agentic development is that each phase produces a verifiable checkpoint. The red phase confirms the test infrastructure works. The green phase confirms the implementation satisfies the specification. The refactor phase confirms no regressions. An agent that follows this cycle can be stopped and resumed at any checkpoint without losing progress: each phase’s output is committed to version control and can be verified independently.
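The cycle is small enough to show in full for a single scenario step. The function and constant names below are illustrative (they are not drawn from the projects described in this paper); the point is the ordering: the test exists, and fails, before the implementation does.

```python
# Red: this test is written first. Run before the implementation exists,
# it fails with a NameError -- that failure is the red checkpoint.
def test_session_expires_after_inactivity():
    # Mirrors the scenario step "the session should expire after
    # 30 minutes of inactivity".
    assert is_expired(idle_minutes=31)
    assert not is_expired(idle_minutes=30)

# Green: the minimum code that satisfies the scenario, nothing more.
INACTIVITY_LIMIT_MINUTES = 30

def is_expired(idle_minutes: int) -> bool:
    return idle_minutes > INACTIVITY_LIMIT_MINUTES

test_session_expires_after_inactivity()  # now passes: the green checkpoint
```

The refactor phase would then restructure this code (say, extracting a session policy object) while the test stays green.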
3. The Agent Discipline Problem
3.1 Observed Failure Modes
[Observation] Across three concurrent projects using multiple agent platforms, the author observed five recurring failure modes when agents operated without BDD constraints:
Scope drift. Given a task like “implement user authentication,” agents frequently expanded scope mid-implementation (adding password reset flows, email verification, admin panels) without being asked. Each addition was individually reasonable but collectively produced a system much larger and more complex than specified.
Architectural amnesia. In long development sessions (>2 hours), agents progressively forgot architectural decisions made earlier in the session. A module that was specified to use dependency injection would, fifty messages later, be reimplemented with hardcoded dependencies. This is a context window problem: earlier decisions fall out of the attention window as the conversation grows.
Silent regressions. When agents modified existing code to add new features, they frequently broke previously working functionality without detecting or reporting the breakage. Without a test suite enforcing prior behavior, there was no mechanism to surface these regressions.
Stub proliferation. Agents often created stub implementations (functions that return placeholder values or throw “not implemented” errors) and then lost track of which stubs remained unimplemented. In the IKWE.ai project, a post-build audit revealed 23 stub functions that the agent had created and forgotten across multiple sessions.
Platform inconsistency. The same specification given to different agent platforms (Claude Code vs. Windsurf vs. Cursor) produced architecturally incompatible implementations. Without a shared specification format, each agent interpreted the requirements differently, making cross-platform agent collaboration impractical.
3.2 Why BDD Addresses These Failures
Each failure mode maps to a BDD mechanism that mitigates it:
Table 1. Agent Failure Modes and BDD Mitigations
| Failure Mode | BDD Mitigation | Mechanism |
|---|---|---|
| Scope drift | Feature file defines exact scope | Agent can only implement specified scenarios |
| Architectural amnesia | .feature files persist across sessions | Specification survives context window limits |
| Silent regressions | Prior scenarios remain executable | New code must pass all existing tests |
| Stub proliferation | Red phase catches unimplemented stubs | Failing tests explicitly surface gaps |
| Platform inconsistency | Gherkin is platform-agnostic | Same .feature file works across all agents |
The mitigation is structural, not behavioral: it does not require the agent to be “more careful” or “remember better.” The .feature file is an external artifact that constrains the agent’s behavior regardless of its internal state. This is the engineering insight: the discipline comes from the process, not the practitioner.
4. The BDD-First Agentic Workflow
4.1 The Workflow
[Practice] Based on experience across three projects, we recommend the following workflow for agentic development:
Step 1: Specification. The developer (human) writes Gherkin feature files describing the desired behavior. This is the developer’s primary creative contribution: deciding what the system should do. The agent does not write feature files; the developer does.
Step 2: Red. The agent writes step definitions (the code that maps Gherkin steps to executable functions) and runs them. All tests fail. This confirms the test infrastructure is correctly configured and that the desired behavior does not yet exist.
Step 3: Green. The agent writes implementation code to make the failing tests pass. The agent is constrained: it should write the minimum code necessary to pass the current scenario, not anticipate future scenarios.
Step 4: Refactor. The agent refactors the passing code (extracting functions, improving naming, applying architectural patterns) while keeping all tests green. The developer reviews the refactored code.
Step 5: Commit and advance. The passing scenario is committed to version control with a conventional commit message (e.g., “🟢 FEAT: authentication - MFA login passes”). The next scenario is activated. Return to Step 2.
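Steps 2-5 can be enforced mechanically with a gate that refuses to commit unless the suite is green. The sketch below is one possible shape, not a prescribed tool; the test and git commands are illustrative, and the `runner` parameter is injectable so the gate itself can be exercised without a real repository.

```python
import subprocess
from typing import Callable, Sequence

def green_gate(
    test_cmd: Sequence[str],
    commit_msg: str,
    runner: Callable = subprocess.run,
) -> bool:
    """Commit to version control only when the BDD suite is green.

    By default this shells out via subprocess.run; pass a fake `runner`
    to test the gating logic in isolation.
    """
    if runner(test_cmd).returncode != 0:
        return False  # red: leave the working tree uncommitted
    runner(("git", "add", "-A"))
    runner(("git", "commit", "-m", commit_msg))
    return True

# Example invocation (would actually run the suite and git):
# green_gate(("npx", "cucumber-js"), "🟢 FEAT: authentication - MFA login passes")
```

Wiring this into a pre-commit hook or the agent’s task loop makes the “commit only on green” discipline structural rather than a matter of agent compliance.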
4.2 The Developer’s Role
In this workflow, the developer’s role shifts from code author to specification author and quality reviewer. The developer writes the Gherkin specifications (deciding what the system should do), reviews the agent’s implementation (verifying it does what was specified without unintended side effects), and makes architectural decisions (choosing between approaches when the specification admits multiple valid implementations). The agent writes the code, the tests, and the documentation, but always within the constraints of the developer’s specification.
This is not a claim about all development contexts. For exploratory prototyping, research code, or one-off scripts, BDD adds overhead that may not be justified. The claim is specific: for production systems that must maintain correctness over extended development sessions with AI coding agents, BDD-first methodology reduces the failure modes described in Section 3.
5. BDD as Mentorship: Connection to Paper III
The BDD cycle maps directly to the Mentorship Protocol described in Paper III. The test suite is the mentor: it encodes the specification of correct behavior. The coding agent is the mentee: it receives guidance (failing tests) and demonstrates learning (passing tests). The red-green-refactor cycle is the mentorship loop: the mentor identifies a gap (red), the mentee fills it (green), and the mentee refines its work (refactor).
Table 2. BDD → Mentorship Protocol Mapping
| BDD Concept | Mentorship Protocol (Paper III) | Citrate Network (Papers I-II) |
|---|---|---|
| Feature file | Performance profile | Per-node capability metrics at checkpoint |
| Failing test (Red) | Identified weakness | Belnap F/N state on input class |
| Passing test (Green) | Successful knowledge transfer | LoRA adapter improves node accuracy |
| Refactor | Double-loop learning | Routing model restructures coordination |
| Test suite | Central Oracle’s knowledge base | Committed embeddings + state vectors |
| Commit | Checkpoint | BFT finality checkpoint (~5 seconds) |
Note on analogy boundaries. This mapping is structural, not operational. A .feature file is not literally a performance profile, and a failing test is not literally a Belnap state classification. The claim is that both systems implement the same organizational pattern (targeted guidance based on observed weakness) through different mechanisms. The organizational learning theory in Paper III provides the design rationale for why this pattern works; this paper provides the engineering methodology for how to implement it in practice.
6. Gherkin for Adapter Validation
[Practice] In the Citrate Network, LoRA adapters are generated at BFT checkpoints and distributed to nodes (Paper II, Section 4.3). Before an adapter is registered on-chain, it should pass a validation suite. We propose that this validation suite be expressed in Gherkin, providing human-readable acceptance criteria for adapter quality:
Feature: LoRA Adapter Validation
  As the Citrate Network
  I want to validate adapters before on-chain registration
  So that poisoned or degraded adapters are rejected

  Scenario Outline: Adapter improves target domain
    Given a node with baseline accuracy of <baseline> on <domain>
    When the adapter generated at checkpoint <cp> is applied
    Then accuracy on <domain> should be >= <baseline> + <threshold>

  Scenario: Adapter does not degrade other domains
    Given a node with baseline accuracy across all domains
    When the adapter is applied
    Then accuracy on non-target domains should not drop by more than 5%

  Scenario Outline: Adapter passes fraud proof verification
    Given the adapter hash committed at checkpoint <cp>
    When any validator replays the generation from committed state
    Then the regenerated adapter should match the committed hash
This is a proposed application, not an implemented feature. The Gherkin scenarios above describe what the validation suite should test; the step definitions that implement these scenarios require the testing infrastructure described in Paper II, Section 8. The point is that Gherkin provides a natural specification language for on-chain validation criteria, one that is readable by developers, auditors, and governance participants who are not machine learning specialists.
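To make the proposal concrete, the step-definition bodies for the three scenarios might reduce to checks like the following. This is a sketch under stated assumptions: all function names are hypothetical, and we read the “drop > 5%” bound as a relative drop (an absolute reading is noted in the comment).

```python
import hashlib

def improves_target(baseline: float, adapted: float, threshold: float) -> bool:
    """Scenario 1: target-domain accuracy rises by at least `threshold`."""
    return adapted >= baseline + threshold

def no_collateral_damage(
    baseline: dict, adapted: dict, target: str, max_drop: float = 0.05
) -> bool:
    """Scenario 2: no non-target domain drops by more than 5%.

    Interpreted here as a relative drop; an absolute reading would
    compare baseline[d] - adapted[d] against 0.05 instead.
    """
    return all(
        adapted[d] >= baseline[d] * (1 - max_drop)
        for d in baseline
        if d != target
    )

def matches_commitment(adapter_bytes: bytes, committed_hash: str) -> bool:
    """Scenario 3: the replayed adapter hashes to the committed value.

    SHA-256 is assumed here; the production hash would be whatever
    commitment scheme Paper I specifies.
    """
    return hashlib.sha256(adapter_bytes).hexdigest() == committed_hash
```

The value of the Gherkin layer is that each of these predicates is anchored to a human-readable acceptance criterion that governance participants can audit without reading the code.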
7. Practitioner Observations
The following observations are drawn from the author’s experience across three concurrent agentic builds. They are reported as practitioner observations, not experimental results. We do not claim these observations generalize beyond the specific projects, agent platforms, and developer described here.
7.1 Agent Compliance with BDD Constraints
[Observation] When provided with .feature files in the project root, all tested agent platforms (Claude Code, Windsurf/Cascade, Cursor, Replit Agent) recognized the Gherkin format without explicit instruction and generated step definitions that mapped to the specified scenarios. Agent compliance was highest when the .feature file was included in the initial prompt context and lowest when it existed only on disk (requiring the agent to read it proactively). This suggests that BDD discipline works best when the specification is surfaced in the agent’s attention window, not merely available in the file system.
7.2 Regression Detection Rate
[Observation] In the IKWE.ai project, switching from free-form instructions to BDD-first methodology mid-project produced a noticeable reduction in regressions. Before BDD adoption (first 3 weeks), the author manually detected an average of roughly 4 regressions per development session that the agent had not flagged. After BDD adoption (remaining 5 weeks), the test suite caught most regressions automatically, with the author manually detecting roughly 1 per session. These are approximate counts from memory and development logs, not controlled measurements. We do not know what fraction of regressions went undetected in either period.
7.3 Context Window Survival
[Observation] Feature files proved effective at preserving architectural decisions across agent sessions. When starting a new session with a fresh context window, providing the agent with .feature files and the existing test suite was sufficient to bring it up to speed on the project’s behavioral requirements. Without these artifacts, the onboarding prompt required extensive prose description of prior decisions, which was error-prone and incomplete. The Polyp Framework was specifically designed to address this: its conversational initialization system generates .feature files as part of project scaffolding, ensuring that every agentic session begins with executable specifications.
7.4 Cross-Platform Portability
[Observation] The same .feature files were used across Claude Code, Windsurf, and Cursor during Polyp Framework development. All three platforms generated compatible step definitions from the same Gherkin scenarios, confirming that Gherkin serves as a platform-agnostic specification language for agentic development. The step definitions differed in implementation style (Windsurf preferred class-based patterns; Claude Code preferred functional patterns), but the behavioral specifications, and therefore the acceptance criteria, were identical.
8. Experimental Hypotheses
The practitioner observations in Section 7 suggest hypotheses that should be tested under controlled conditions. We have not conducted these experiments.
8.1 Hypothesis 1: BDD Reduces Agent-Introduced Regressions
Claim: Coding agents operating under BDD constraints (given .feature files, required to run tests at each commit) introduce fewer regressions per feature implementation than agents operating with free-form natural language instructions.
Proposed methodology: Select 20 well-defined feature implementation tasks of comparable complexity. Assign each task to two conditions: (a) BDD-first (agent receives .feature file and must pass tests) and (b) free-form (agent receives equivalent natural language description). Use the same agent platform for both conditions. Measure: regression count (detected by a hidden test suite not shown to the agent), implementation completeness, and total time to completion. Report with confidence intervals.
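The “report with confidence intervals” step could be done with a percentile bootstrap on the per-task regression counts, which avoids distributional assumptions that small count data rarely satisfy. The counts below are invented placeholders, not measurements from any project described in this paper.

```python
import random

def bootstrap_ci_diff(bdd, freeform, n_boot=10_000, alpha=0.05, seed=7):
    """Percentile-bootstrap CI for the difference in mean regressions
    per task (free-form minus BDD)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        b = [rng.choice(bdd) for _ in bdd]
        f = [rng.choice(freeform) for _ in freeform]
        diffs.append(sum(f) / len(f) - sum(b) / len(b))
    diffs.sort()
    return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot) - 1]

# Placeholder regression counts per task; NOT real measurements.
bdd_counts = [0, 1, 0, 0, 2, 1, 0, 0, 1, 0]
freeform_counts = [3, 4, 2, 5, 3, 4, 2, 6, 3, 4]
low, high = bootstrap_ci_diff(bdd_counts, freeform_counts)
# An interval that excludes zero would suggest the difference is not noise.
```

For the real experiment, a paired design (same task under both conditions) would warrant bootstrapping the per-task differences instead of the two samples independently.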
8.2 Hypothesis 2: Feature Files Improve Cross-Session Continuity
Claim: An agent starting a new session with .feature files and existing tests produces output more consistent with prior sessions than an agent starting with prose description alone.
Proposed methodology: Build a partial implementation over 3 sessions. Start session 4 under two conditions: (a) .feature files + test suite provided and (b) prose summary of prior work provided. Measure: number of architectural inconsistencies with prior sessions, time to first productive commit, and adherence to established patterns. Assess with blind code review by developers unfamiliar with the conditions.
8.3 Hypothesis 3: Gherkin Portability Across Agent Platforms
Claim: The same .feature files produce behaviorally equivalent implementations across different agent platforms (Claude Code, Cursor, Windsurf, Aider), while free-form instructions produce architecturally divergent implementations.
Proposed methodology: Give 5 different agent platforms the same 10 .feature files. Separately, give 5 platforms equivalent free-form natural language instructions. Measure: test pass rate (using a common test suite), API compatibility (can the implementations be swapped without client-side changes), and architectural similarity (measured by module dependency graphs). If Gherkin produces higher cross-platform consistency, it validates its role as a universal specification language for agentic development.
9. Relationship to the Gradient Papers Series
Paper I (Citrate Technical Paper) describes the system that this BDD methodology is applied to build. The GhostDAG consensus, LVM, and AI precompiles described in Paper I were developed using the workflow described in Section 4.
Paper II (Paraconsistent Consensus) proposes adapter validation at checkpoints. Section 6 of this paper proposes Gherkin as the specification language for that validation.
Paper III (The Mentorship Protocol) provides the organizational learning theory. This paper provides the engineering practice. The BDD test suite is the operational realization of the Central Oracle’s mentorship function: it encodes performance expectations and provides targeted feedback.
Paper IX (The Medusa Paradigm) describes the Strobilation Pipeline principle: the cnidarian reproductive process where juvenile polyps differentiate through progressive stages. The red-green-refactor cycle is the engineering realization of this principle: each cycle produces a more capable system through incremental differentiation. This is an inspirational analogy, not a formal equivalence.
10. Conclusion
Behavior-Driven Development provides a natural and effective engineering discipline for AI coding agents. The Gherkin specification language gives developers a structured way to express intent that survives context window limits, works across agent platforms, and produces executable acceptance criteria. The red-green-refactor cycle provides incremental checkpoints that prevent scope drift, detect regressions, and surface unimplemented stubs. And the BDD-as-mentorship mapping connects this engineering practice to the organizational learning theory in Paper III, showing that the same patterns that make human organizations effective (targeted guidance, observable behavior, progressive mastery) make agentic development effective too.
The observations reported here are from a single practitioner across three projects. They suggest that BDD-first methodology reduces common agent failure modes, but they do not constitute controlled evidence. The experimental hypotheses in Section 8 describe how to test these observations rigorously. Until those experiments are conducted, this paper is a practitioner report: useful as engineering guidance, honest about its evidential limitations.
References
[1] North, D. (2006). Introducing BDD. Better Software Magazine / dannorth.net. Originally published March 2006.
[2] Wynne, M., & Helmé, A. (2012). The Cucumber Book: Behaviour-Driven Development for Testers and Developers. Pragmatic Bookshelf.
[3] Beck, K. (2002). Test-Driven Development: By Example. Addison-Wesley Professional.
[4] Astels, D. (2003). Test-Driven Development: A Practical Guide. Prentice Hall.
[5] Evans, E. (2003). Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley.
[6] Adzic, G. (2009). Bridging the Communication Gap: Specification by Example and Agile Acceptance Testing. Neuri Limited.
[7] Cucumber.js. (2025). Cucumber for JavaScript. https://github.com/cucumber/cucumber-js
[8] pytest-bdd. (2025). BDD library for pytest. https://github.com/pytest-dev/pytest-bdd
[9] Pereira, L., Sharp, H., de Souza, C., Oliveira, G., Marczak, S., & Bastos, R. (2018). Behavior-Driven Development benefits and challenges: Reports from an industrial study. EASE 2018.
[10] Irshad, M., Britto, R., & Petersen, K. (2021). Adapting Behavior-Driven Development (BDD) for Large-Scale Software Systems. Journal of Systems and Software, 177, 110944.
[11] Binamungu, L. P., Embury, S. M., & Konstantinou, N. (2018). Maintaining Behaviour-Driven Development Specifications: Challenges and Opportunities. SANER 2018, IEEE, pp. 175-184.
[12] Klosowski, L. (2026). Citrate: Protocol Specification for an AI-Native BlockDAG Network. The Gradient Papers No. I. Cnidarian Foundation.
[13] Klosowski, L. (2026). Paraconsistent Consensus: Federated Meta-Learning Over BlockDAG Finality Checkpoints. The Gradient Papers No. II. Cnidarian Foundation.
[14] Klosowski, L. (2026). The Mentorship Protocol: Organizational Learning Theory for Decentralized Agent Swarm Orchestration. The Gradient Papers No. III. Cnidarian Foundation.
[15] Klosowski, L. (2025). The Medusa Paradigm. Cnidarian Foundation Working Paper.
[16] Senge, P. M. (1990). The Fifth Discipline: The Art and Practice of the Learning Organization. Doubleday.
[17] Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
[18] Couto, T., Marczak, S., Callegari, D., Móra, M., & Rocha, F. (2022). On the Characterization of Behavior-Driven Development Adoption Benefits: A Multiple Case Study. SBQS 2022.
Appendix A: Cross-Paper Parameter Consistency
Table A1. Citrate Parameters Referenced in This Paper
| Parameter | Value | Source |
|---|---|---|
| Block time | ~0.5 seconds (2 BPS) | Paper I, Section 2.2 |
| Checkpoint interval | 10 blocks (~5 seconds) | Paper I, Section 2.3 |
| BFT committee | 100 validators, 67 signatures | Paper I, Section 2.3 |
| LoRAFactory precompile | Address 0x1003 | Paper I, Table 1 |
| Adapter rank (r) | 16 (default) | Paper II, Appendix A2 |
| Fraud proof window | 100 blocks (~50 seconds) | Paper I, Section 3.3 |
| Adapter regression threshold | 5% (proposed) | This paper, Section 6 |
───
This paper is part of the Gradient Papers series published by the Cnidarian Foundation.
Correspondence: larry@cnidarianfoundation.org