Cairn · Jun 8, 2026 · Quality and Delivery

CI/CD Maturity in the Age of Agentic Engineering

How a small high-throughput team can evolve from CI to continuous delivery to selective continuous deployment without outrunning its safety systems · ~24 min read · Suggested by Bob · engineering, operations

devops ai security culture

AI agents can make code appear faster than a small team can review, validate, operate, or recover from it. CI/CD maturity is the discipline of increasing release automation only as verification, observability, rollback, and human accountability catch up. The goal is not to deploy everything automatically; the goal is to make each change small, traceable, testable, observable, reversible, and safe to release.


Executive summary

Agentic engineering changes the rate at which code can be proposed, not the rate at which a team can safely accept risk. A greenfield startup should usually aim first for trustworthy CI, then preview or staging delivery, then continuous delivery with manual production promotion. Selective continuous deployment comes later for low-risk components whose tests, observability, rollback, and ownership have proven themselves.

Key Takeaway

Do not let AI-generated throughput outrun human and automated verification capacity.

Why agentic engineering changes CI/CD

Agents make large, plausible diffs cheap. Scarce work shifts from creation toward supervision: intent setting, review, test design, operational judgment, security review, and production ownership. The delivery system has to absorb more proposals without converting review gaps into production incidents.

Definitions

Continuous integration means frequent integration into a shared mainline with automated verification. Continuous delivery means every successful build is deployable on demand, even if production promotion remains a human decision. Continuous deployment means successful changes automatically flow to production through automated policy and health gates. Deploying code is not the same as releasing functionality to users; feature flags and dark launches separate those moments.

Evidence summary

DORA’s metrics pair throughput with instability: lead time and deployment frequency sit beside change failure rate and failed deployment recovery time. DORA 2024 found AI adoption improved perceived productivity and some local work measures while correlating negatively with delivery performance. DORA 2025 reframed AI as an amplifier of existing organizational strengths and weaknesses. Controlled and field studies show the same pattern: AI can speed some work, but mature codebases, review burden, and weak process can erase or reverse the gain.

The core risk model: throughput rises before confidence rises

Throughput rises before confidence rises. Agents can generate code, tests, docs, migrations, dependencies, and infrastructure edits faster than a two-to-four-person team can reason about blast radius. Raise automation only when review capacity, automated checks, observability, recovery, and governance have caught up.

Stage 0: Local prototype velocity

Stage 0 is fast discovery before production matters. Use agents aggressively for scaffolding, exploration, throwaway implementations, and tests, but keep a hard boundary between prototype and product code. Promote only when a clean checkout builds and tests, basic conventions exist, and a human owns the result.

Stage 1: Trustworthy CI

Stage 1 protects main. Required checks, lint, type checks, unit tests, dependency scanning, secret scanning, reviewer expectations, and small PRs all apply, and failed CI blocks merge. Together they establish that agent output must pass the same bar as human output, or a higher one.

Stage 2: CI plus preview and staging delivery

Stage 2 deploys every mainline change to preview, dev, or staging. The team starts exercising deployment automation, config, secrets, migrations, smoke tests, artifact traceability, and environment recreation before production release automation depends on them.

Stage 3: Continuous delivery with manual production promotion

Stage 3 is the best first serious production target for most early startups. Production deploy is automated and boring, but a human still chooses promotion. Feature flags, backups, observability, alerting, named ownership, incident basics, and tested rollback make manual approval a business/risk gate rather than a technical crutch.

Stage 4: Selective continuous deployment

Stage 4 auto-deploys only low-risk components after automated checks and health gates. Good candidates include internal tools, documentation, non-critical services, additive backend changes, and flagged components that can be canaried or rolled back safely.

Stage 5: Broad continuous deployment with governed autonomy

Stage 5 is advanced: most ordinary changes flow automatically, while high-risk changes are detected and gated. It requires mature tests, observability, SLOs, progressive delivery, provenance, SBOMs, dependency governance, incident management, scoped agent permissions, and clear human accountability.

Stage gates and checklists

Useful gates are lightweight but real: type checks, lint, unit and integration tests, contract tests, human ownership, small PRs, agent permissions, dependency and secret scanning, immutable artifacts, smoke tests, feature flags, migration rehearsals, dashboards, alerts, backups, and product acceptance. Security and supply-chain gates belong early because agents increase dependency and code volume quickly.

Metrics for high-throughput agentic teams

Track DORA-style metrics plus agentic workflow pressure: PR size, review latency, flaky tests, build duration, rollback frequency, deployment rework, escaped defects, percentage of flagged changes, percentage auto-deployed, and cause categories for incidents. Do not treat deployment frequency or AI-generated lines of code as productivity.

Recommended path for a greenfield startup

Start at Stage 1 immediately. Reach Stage 2 before shared team development gets busy. Reach Stage 3 before real customer production usage. Adopt Stage 4 selectively after rollback, observability, flags, and tests are real. Treat Stage 5 as the result of demonstrated maturity, not a starting identity.

Anti-patterns and failure modes

Warning signs include red CI that stays red, large agent-generated PRs with shallow tests, manual production approval because nobody trusts automation, staging drift, untested rollback, permanent flags, health checks that only verify process start, and metrics that reward motion while defects and rework rise.

Final recommendation

Continuous delivery with manual production promotion is not timid. For a small greenfield team near first customers, it keeps deployment automation improving while preserving human product and risk judgment.

Appendix: continuous deployment eligibility checklist

Before auto-deploying a component, verify low blast radius, reviewable change size, meaningful tests, automated deploys, artifact traceability, safe secrets, safe migrations, post-deploy checks, observability, rollback, flags, scanning, provenance where needed, and explicit exclusions for destructive or sensitive work.

Discussion prompts

Ask which delivery bottleneck appears first when PR volume doubles, which component is the first plausible selective-CD candidate, and where manual approval is adding judgment versus compensating for weak automation.

References

The references combine DORA, Google Cloud, GitHub/Microsoft/IBM/METR productivity research, NIST SSDF, SLSA, OWASP, Martin Fowler delivery references, and related Cairns on quality gates and AI-native workflow migration.

The tempting story is simple: if agents can write more code, the team should deploy more code. That story is incomplete. Agentic engineering increases the volume and speed of proposed change; it does not automatically increase correctness, maintainability, security, operability, or product judgment. A startup can move faster with agents, but only if its delivery system can verify, review, observe, recover from, and govern a higher rate of change.

This cairn is a maturity model for that work. It is written for a greenfield startup before pilot customers, with a small team, high code churn, and first production engagements approaching. It is deliberately non-prescriptive about the end state. Continuous deployment may become right for parts of the system. It may never be right globally. The point is to move toward the next operating model when the evidence says the team can absorb it safely.

Key Takeaway

The goal is not to deploy everything automatically as soon as possible. The goal is to make every change small, traceable, testable, observable, reversible, and safe to release.

Executive summary

CI/CD maturity is not a badge ladder. It is a risk-management ladder. Each stage increases automation only after the team proves it can preserve or improve confidence at the new speed.

For a small agentic startup, the practical default is:

Stage 1 immediately: trustworthy CI, branch protection, required checks, small PRs, human ownership of every agent-generated change.
Stage 2 before shared development gets busy: automatic deploys to preview, dev, or staging so deployment automation is exercised continuously before production depends on it.
Stage 3 before real customer production usage: continuous delivery with manual production promotion, feature flags, backups, observability, alerting, and tested rollback or roll-forward.
Stage 4 selectively: automatic production deployment for low-risk components after the team has real evidence that checks, health gates, and recovery work.
Stage 5 only with proof: broad continuous deployment is an outcome of mature delivery, not a starting goal.

Continuous deployment requires continuous delivery. Continuous delivery does not require continuous deployment. A team can keep every successful build deployable on demand, promote production manually, and operate at a high level for a long time.

Why agentic engineering changes CI/CD

Agentic workflows lower the cost of producing code-shaped artifacts. Agents can scaffold, refactor, test, document, migrate, and open pull requests. They can also generate plausible code that is wrong, tests that assert the implementation rather than the requirement, dependencies nobody needed, migrations that are technically valid but operationally dangerous, and infrastructure edits whose blast radius is not obvious from the diff.

That changes the delivery conversation in three ways.

Review capacity becomes the first bottleneck. A two-to-four-person team can have ten branches in flight if agents make branch creation cheap. The team still has only two to four people who can accept risk. Small PR discipline is no longer a style preference; it is load control.

Verification becomes a design problem. Passing tests are useful only if the tests encode the right expectations and cannot be quietly weakened in the same PR. Agents shift human work from typing code to evaluating evidence: what was tested, what changed, what assumptions moved, what could fail in production, and whether the change should exist at all.

Operations arrive earlier. A greenfield team can produce deployable surface area quickly. That does not mean it can diagnose incidents, restore data, rotate secrets, unwind migrations, or explain customer impact. Delivery automation without operational readiness turns local speed into production uncertainty.

Warning

Agentic engineering turns weak review, weak tests, weak ownership, and weak observability into scaled failure modes. The agent amplifies the system it runs inside.

Definitions

The words matter because teams often avoid a needed argument by using the same acronym for different operating models.

Continuous Integration is frequent integration into a shared mainline with automated verification. In practice, this means small changes, a protected main branch, required checks, and a habit of keeping main buildable.

Continuous Delivery means every successful build is deployable to production on demand. The release to production may still require a human or business decision. DORA describes continuous delivery as the ability to release changes quickly, safely, and sustainably, with the software kept in a deployable state throughout its lifecycle.

Continuous Deployment means every successful change automatically flows to production, subject to automated policy, verification, and health gates. The absence of a manual production approval is the key difference.

Deploy versus release: deploying code puts it into production infrastructure. Releasing functionality exposes behavior to users. Feature flags, config flags, dark launches, canaries, and percentage rollouts let a team deploy code before releasing a feature.

Agentic engineering is software development where AI agents can generate, modify, test, refactor, document, or propose code changes with varying levels of autonomy.

Supervisory engineering is the human work of directing, evaluating, correcting, constraining, and accepting or rejecting AI-generated output. It includes architecture, product judgment, security review, operational risk acceptance, and ownership of production consequences.
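The deploy-versus-release distinction above can be made concrete with a percentage rollout. This is a minimal sketch, assuming a hypothetical in-process helper rather than any particular feature-flag product; the names `bucket_for`, `is_enabled`, and `rollout_percent` are illustrative. The code is deployed for everyone, but only a stable slice of users sees the behavior.

```python
import hashlib


def bucket_for(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True when the user falls inside the rollout percentage.

    0 means deployed but dark; 100 means fully released.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return bucket_for(flag_name, user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    exposed = sum(is_enabled("new-checkout", u, 10) for u in users)
    print(f"{exposed} of {len(users)} users see the feature at 10% rollout")
```

Because the bucket is derived from a hash rather than a random draw, a user who sees the feature at 10% still sees it at 50%, which keeps a gradual release consistent for each user.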

Evidence summary

DORA’s original contribution was not simply that faster delivery is good. It was the pairing of throughput and stability. The common DORA metrics are lead time for changes, deployment frequency, change failure rate, and failed deployment recovery time. The first two measure flow. The last two keep the team honest about instability.

That distinction matters more under AI-assisted development. The DORA 2024 report found broad enthusiasm for AI and improvements in measures such as perceived productivity, documentation quality, code quality, and review speed, while also reporting that AI adoption was associated with worse software delivery performance in that year’s data. Google summarized the finding plainly: AI can help local work while not automatically improving delivery performance.

DORA 2025 narrowed the point. Its headline finding is that AI acts as an amplifier: it magnifies existing strengths and weaknesses. Organizations with strong underlying systems get more value; fragmented organizations get more fragmentation at speed. The report also moved attention away from tool adoption by itself and toward capabilities around platforms, value streams, user focus, and organizational conditions.

The broader research picture is mixed in the same way. The GitHub Copilot controlled experiment found developers completed a bounded JavaScript task about 56% faster with Copilot. Field experiments at Microsoft, Accenture, and another large company found productivity gains, especially for less experienced developers. IBM’s enterprise case study found perceived productivity benefits, but not uniformly across users or tasks.

The counterweight is important. METR’s randomized controlled trial of experienced open-source developers working in mature repositories found that early-2025 AI tools made tasks take about 19% longer in that setting, even though developers expected and perceived speedups. The study does not show that AI slows all engineering. It shows that context matters: mature codebases, high quality bars, prior codebase familiarity, and review or correction burden can offset generation speed.

The operational reading is straightforward: AI can increase local speed and perceived productivity. That is valuable. It is not the same thing as improved delivery performance. Delivery performance improves when the whole system changes: smaller changes, better tests, reliable CI, review discipline, deployable artifacts, observability, recovery, and product feedback.

The core risk model: throughput rises before confidence rises

The failure mode is not mysterious. Agentic teams can generate change proposals faster than they can verify them. Confidence is slower because it depends on many assets that do not appear automatically with the code: test suites, environment parity, safe migration patterns, dashboards, alert thresholds, rollback rehearsals, incident habits, security scanning, dependency policy, and human domain understanding.

That creates a predictable curve:

Code volume rises.
PR size or PR count rises.
Review latency rises.
Tests are added, but often shallowly at first.
CI becomes noisy or slow.
Deploys are batched because the team no longer trusts the path.
Releases get riskier even though each local task felt faster.

The maturity model below exists to interrupt that curve. Each stage asks one question: what must be true before the next level of automation reduces risk rather than merely increasing motion?
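The first steps of that curve are simple arithmetic. The sketch below is a toy model with made-up rates, not measured data: if agents help open more PRs per day than the team can review, the unreviewed backlog grows linearly. The function name and parameters are illustrative.

```python
from typing import List


def review_backlog(prs_opened_per_day: float, reviews_per_day: float, days: int) -> List[float]:
    """Return the end-of-day count of unreviewed PRs for each day."""
    backlog = 0.0
    history = []
    for _ in range(days):
        # The backlog cannot go negative: spare review capacity is simply unused.
        backlog = max(0.0, backlog + prs_opened_per_day - reviews_per_day)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Assumed rates: 10 PRs opened per day, 6 reviewed per day.
    print(review_backlog(10, 6, 5))  # [4.0, 8.0, 12.0, 16.0, 20.0]
```

With these assumed rates the team is a full working week behind after five days, which is usually when batching and rubber-stamp review start.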

Stage 0: Local prototype velocity

Stage 0 is for discovery before production matters. Agents can be used aggressively for scaffolding, exploration, prototypes, throwaway implementations, documentation, and test spikes. The team should exploit the cheapness of generation while keeping a hard boundary between prototype code and production-intent code.

Dimension	Stage 0 posture
Goal	Learn fast without pretending the prototype is production.
Required practices	Source control from day one, basic project conventions, reproducible local setup, lightweight coding standards, scripted build/test commands, explicit labels for prototype versus product code.
Agentic risks	Nobody owns generated code, prototype shortcuts become architecture, generated setup steps are not reproducible, hidden dependencies appear.
Gate to Stage 1	Clean checkout can build and test, main exists and is treated as protected, tooling is scriptable, basic automated tests cover core logic, AI-generated code never merges without human ownership.
Metrics to watch	Time from clean checkout to running app, number of manual setup steps, test existence for core logic, prototype code promoted to product code.
Anti-patterns	“The agent wrote it, but nobody owns it”; no clean way to recreate dev/test state; prototype shortcuts silently become production architecture.

Stage 0 can be gloriously messy inside a sandbox. It should be disciplined at the boundary where anything becomes product code.

Stage 1: Trustworthy CI

Stage 1 establishes the contract. Every change is automatically checked before merge. Agent output must pass the same bar as human output, and usually a more explicit one because agents are good at producing confident-looking bulk.

Dimension	Stage 1 posture
Goal	Main is protected, buildable, and defended by automatic checks.
Required practices	Branch protection, required CI on PRs, compile/type checks, lint/format checks, unit tests, basic integration tests where useful, dependency scanning, secret scanning, reviewer expectations, small PR discipline, PR template with intent/test evidence/risk notes, AI-risk checklist for non-trivial changes, failed CI blocks merge.
Agentic risks	Large plausible diffs, shallow tests, rubber-stamp review, hallucinated or unnecessary dependencies, PR volume exceeding review capacity.
Gate to Stage 2	CI is reliable and fast enough that developers do not bypass it, flaky tests are rare and fixed, PRs remain reviewable, core domain logic has meaningful tests, dependencies and secrets are scanned, build artifacts are reproducible or versioned.
Metrics to watch	CI pass rate, build duration, test flake rate, PR size, review latency, dependency findings, secret-scan findings.
Anti-patterns	CI red for long periods, “fix later” merges, huge agent-generated PRs, superficial snapshot tests as confidence theater, style-only reviews that miss product/security/architecture risk.

Stage 1 is the first serious target. If a startup does nothing else, it should do this early. Strong CI is cheaper to install before the team has normalized bypasses.
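Several Stage 1 practices can be expressed as a merge gate that fails closed. The following is a sketch under stated assumptions: the `PullRequest` fields, the `MAX_LINES_CHANGED` threshold of 400, and the blocker messages are all hypothetical and would need to be mapped onto whatever the team's code host and CI actually expose.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PullRequest:
    lines_changed: int
    ci_passed: bool
    human_owner: Optional[str]   # named person accountable for the change
    test_evidence: str = ""      # what was run and observed, from the PR template
    new_dependencies: List[str] = field(default_factory=list)


MAX_LINES_CHANGED = 400  # assumed team threshold, not a standard


def merge_blockers(pr: PullRequest) -> List[str]:
    """Return reasons the PR cannot merge; an empty list means it may proceed."""
    blockers = []
    if not pr.ci_passed:
        blockers.append("CI failed or did not run")
    if not pr.human_owner:
        blockers.append("no human owner named")
    if pr.lines_changed > MAX_LINES_CHANGED:
        blockers.append(f"diff exceeds {MAX_LINES_CHANGED} lines; split the PR")
    if not pr.test_evidence.strip():
        blockers.append("no test evidence in the PR description")
    if pr.new_dependencies:
        blockers.append("new dependencies need explicit review: " + ", ".join(pr.new_dependencies))
    return blockers
```

Returning every blocker at once, rather than stopping at the first, gives the author (human or agent) a complete list to fix in one pass.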

Stage 2: CI plus preview and staging delivery

Stage 2 exercises deployment automation continuously before production release automation. Every mainline change deploys automatically to a non-production environment: preview, dev, staging, or an equivalent shared target.

Dimension	Stage 2 posture
Goal	Deployment stops being a special event, at least outside production.
Required practices	Automated deploy to preview/dev/staging, infrastructure as code or reproducible environment setup, environment-specific config and secrets management, automated lower-environment migrations, post-deploy smoke tests, API contract tests where boundaries exist, thin critical-path end-to-end tests, build once and promote the same artifact, release note or changelog automation, deploy markers in observability.
Agentic risks	Runtime assumptions not covered by unit tests, broken deployment descriptors, config drift, migration mistakes, hidden service coupling, infrastructure edits with misunderstood operational effects.
Gate to Stage 3	Mainline deploys automatically to a shared or preview environment, environments can be recreated without heroics, migrations are tested before production, smoke tests catch obvious broken deploys, the team can identify which commit/artifact runs where, secrets stay out of repos/prompts/logs/docs, rollback or roll-forward is understood.
Metrics to watch	Non-prod deploy success rate, smoke-test failure rate, environment creation time, migration failure rate, artifact traceability, staging drift incidents.
Anti-patterns	Manually maintained staging, testing one artifact and deploying another, production migration rituals, preview environments nobody checks, deploy failures debugged only through tribal knowledge.

Stage 2 is where delivery becomes a muscle. The team learns whether its IaC, secrets, migrations, and smoke tests are real while the blast radius is still low.
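"Build once and promote the same artifact" can be enforced by comparing content digests instead of trusting version labels. This is an illustrative sketch: the environment names, the `deployed` mapping, and `PromotionError` are assumptions, and a real pipeline would read digests from its artifact registry.

```python
import hashlib
from typing import Dict


class PromotionError(Exception):
    """Raised when the artifact to promote is not the one that was verified."""


def artifact_digest(data: bytes) -> str:
    """Content digest that identifies an artifact regardless of its tag."""
    return hashlib.sha256(data).hexdigest()


def promote(digest: str, deployed: Dict[str, str], source_env: str, target_env: str) -> Dict[str, str]:
    """Promote only the exact artifact already running in source_env."""
    if deployed.get(source_env) != digest:
        raise PromotionError(
            f"artifact {digest[:12]} was not verified in {source_env}; refusing to deploy to {target_env}"
        )
    updated = dict(deployed)
    updated[target_env] = digest
    return updated
```

The check fails when someone rebuilds from the same commit and tries to ship the rebuild, which is exactly the "testing one artifact and deploying another" anti-pattern.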

Stage 3: Continuous delivery with manual production promotion

Stage 3 is the best default first production target for an early startup. Production deployment is automated and routine, but a human still chooses when to promote a build. The approval is a business and risk gate, not a substitute for automation.

Dimension	Stage 3 posture
Goal	Production can be deployed on demand in a boring, repeatable way.
Required practices	One-command or button-click production deployment, manual approval as business/risk gate, feature flags for incomplete or risky work, dark launches where useful, backward-compatible migrations, backups and restore process, logs/metrics/traces where useful, critical dashboards, error monitoring, actionable alerts, named production owner during and after deploy, incident response basics, documented and tested rollback or roll-forward, security checks in CI/CD, dependency and container scanning, immutable artifact discipline and ideally provenance.
Agentic risks	Change volume outpaces product/design/architecture review, features appear faster than value can be validated, subtle data model mistakes, tests pass but operational expectations fail, humans become approvers rather than reviewers.
Gate to Stage 4	Production deploys are boring, rollback/roll-forward has been rehearsed or used, bad deploys are detected quickly, feature flags control meaningful release risk, migrations follow expand/contract or equivalent patterns, DORA-style throughput and instability metrics are tracked, change failure rate is known, production ownership is clear, low-risk components have enough test and observability confidence to consider automation.
Metrics to watch	Lead time, deployment frequency, change failure rate, failed deployment recovery time, rollback frequency, incident count by cause, alert quality, flag coverage, manual approval wait time.
Anti-patterns	Manual approval because nobody trusts tests, deploys require one senior person every time, production health is unclear, rollback is theoretical, releases are batched into large bundles, feature flags accumulate permanently.

For a pre-pilot or early-pilot product, Stage 3 is not excessive caution. It gives the team automated production deployment while keeping the release decision close to customer, founder, and operational context.
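The expand/contract pattern named in the Stage 3 gate can be sketched with Python's built-in `sqlite3` module. The table, the column names, and the rename from `name` to `full_name` are hypothetical; the point is that each step is a separate, backward-compatible deploy, and the destructive step runs only after a check confirms nothing still depends on the old shape.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Step 1: add the new column; old code that reads `name` keeps working."""
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def backfill(conn: sqlite3.Connection) -> None:
    """Step 2: copy existing data into the new column."""
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


def safe_to_contract(conn: sqlite3.Connection) -> bool:
    """Gate for step 3: every row must already have the new column populated."""
    (missing,) = conn.execute("SELECT COUNT(*) FROM users WHERE full_name IS NULL").fetchone()
    return missing == 0


def contract(conn: sqlite3.Connection) -> None:
    """Step 3: remove the old column by rebuilding the table (irreversible)."""
    if not safe_to_contract(conn):
        raise RuntimeError("backfill incomplete; refusing to drop the old column")
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
        INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )
```

In practice the contract step would ship in a later release than the expand step, after all application code has stopped reading the old column, and would sit behind a manual gate because it cannot be undone without a restore.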

Stage 4: Selective continuous deployment

Stage 4 begins selectively, not globally. Low-risk changes or low-blast-radius services can deploy automatically after passing automated checks. The team should choose candidates by risk class, not by enthusiasm.

Good candidates include internal tools, documentation sites, non-critical services, components protected by feature flags, additive backend changes, services with strong automated tests and low migration risk, and changes that can be canaried or rolled back safely.

Dimension	Stage 4 posture
Goal	Low-risk production changes flow automatically through policy and health gates.
Required practices	Progressive delivery with canary/ring/blue-green/percentage rollout, automated health gates, post-deploy smoke tests, synthetic checks for critical flows, error-rate/latency/saturation/business-metric monitoring where appropriate, automated rollback or very fast human rollback, kill switches, named responder coverage during deployment windows, exception path for high-risk changes, policy-as-code where useful, stronger supply-chain controls such as signed artifacts, provenance, SBOMs, pinned dependencies, trusted builders.
Agentic risks	Many safe-looking small changes collectively degrade quality, automated deploys convert review gaps into incidents, agents modify tests and code together, blast radius is not obvious from the diff, systems optimize for deployment frequency over customer value.
Gate to Stage 5	Low-risk CD has operated successfully for a meaningful period, change failure remains acceptable, recovery time is low and improving, deployment rework is tracked, alerts are actionable, incidents receive lightweight reviews, health gates have caught or been validated against real failures, criteria for auto-deploy eligibility are explicit, agent-generated PRs are constrained by repo instructions and policy.
Metrics to watch	Auto-deploy success rate, canary aborts, automated rollback count, health-gate precision, deployment rework, low-risk component change failure, manual exception rate.
Anti-patterns	Global continuous deployment because “we are high velocity,” no distinction between low- and high-risk changes, health checks only verify process start, frequency celebrated while defects rise, nobody can explain why a change was safe to auto-deploy.

Stage 4 should feel boring in the components where it applies. If it feels daring, the team is probably compensating for missing gates with optimism.
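An automated health gate needs to compare behavior, not just confirm that the process started. The sketch below is one simple way to do that, assuming hypothetical thresholds (`min_requests=500`, an absolute error-rate increase of one percentage point) and a single signal; a real gate would likely also consider latency, saturation, and a business metric.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,
    max_absolute_increase: float = 0.01,
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the current observation window."""
    if canary.requests < min_requests:
        # Too little traffic to judge; do not promote on silence.
        return "wait"
    if canary.error_rate - baseline.error_rate > max_absolute_increase:
        return "rollback"
    return "promote"
```

Returning "wait" for low traffic matters: a canary that has served almost no requests has a perfect error rate for the wrong reason.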

Stage 5: Broad continuous deployment with governed autonomy

Stage 5 is advanced. Most normal changes can flow to production automatically, while high-risk changes are detected and gated. This is not a badge of honor. It is a consequence of durable delivery, observability, incident, security, and governance practices.

Dimension	Stage 5 posture
Goal	Broad automation with explicit human accountability and risk-sensitive gates.
Required practices	Mature CI/CD, progressive delivery, strong tests across unit/integration/contract/migration/security/critical E2E, mature observability and SLOs, automated policy checks, provenance, signed artifacts, SBOMs, dependency governance, mature feature-flag lifecycle, safe migration patterns, on-call and incident management, post-incident learning, clear accountability for agent-generated and agent-modified changes, scoped agent permissions, no CI/review/secrets/policy bypass, measurement of customer value and product outcomes.
Agentic risks	Bad architecture scales faster, generated output is accepted without domain understanding, review burden shifts to architecture/product/security/ops, metrics are gamed, automation hides sociotechnical problems until production exposes them.
Ongoing gates	Maintain low change failure and fast recovery, track rework and escaped defects, watch PR size/review latency/flaky tests/dependency risk/flag debt/cognitive load, keep architecture review for cross-cutting work, require manual gates for destructive migrations, auth/security, billing, privacy-sensitive changes, irreversible data changes, major dependency upgrades, infrastructure permissions, and contractual customer-visible behavior.
Metrics to watch	All prior metrics plus SLO impact, escaped defects, incidents attributable to AI-generated or AI-modified code, feature flag debt, reviewer load, dependency risk trend, customer outcome metrics.
Anti-patterns	Agents can self-approve, protected branches are bypassable, manual gates disappear for irreversible work, DORA metrics become quotas, product outcomes are invisible.

Stage 5 is where automation is trusted because the team has earned that trust in production. It is not where a startup begins.
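Detecting and gating high-risk changes can start as a plain path-based classifier. This is a deliberately crude sketch: the directory layout in `MANUAL_GATE_PATTERNS` is invented for illustration, and path matching will miss risky changes that live in unexpected files, so it complements human judgment rather than replacing it.

```python
from fnmatch import fnmatch
from typing import Iterable, List

# Hypothetical repository layout; adjust the patterns to the real codebase.
MANUAL_GATE_PATTERNS = {
    "auth/security": ["src/auth/*", "src/security/*"],
    "billing": ["src/billing/*"],
    "data migration": ["migrations/*"],
    "infrastructure permissions": ["infra/iam/*"],
}


def manual_gate_reasons(changed_paths: Iterable[str]) -> List[str]:
    """Return the risk categories that require a manual gate for this change."""
    paths = list(changed_paths)
    reasons = []
    for reason, patterns in MANUAL_GATE_PATTERNS.items():
        if any(fnmatch(path, pattern) for path in paths for pattern in patterns):
            reasons.append(reason)
    return reasons


if __name__ == "__main__":
    print(manual_gate_reasons(["docs/intro.md"]))                # []
    print(manual_gate_reasons(["src/billing/invoice.py"]))       # ['billing']
```

An empty result means the change is eligible for the automatic path, not that it is safe; the other gates still apply.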

Stage gates and checklists

The checklist should be small enough to use and strict enough to matter. A startup does not need enterprise ceremony, but it does need gates that fail closed when a high-throughput workflow creates more change than people can manually inspect.

Code quality gates: type checks, lint and format checks, unit tests, integration tests, contract tests at service boundaries, migration tests where data changes, property or mutation testing where correctness is subtle, coverage expectations focused on meaningful behavior rather than vanity percentages.

Review gates: human owner for every change, small PRs, reviewer assignment, risk labeling, test evidence in the PR, architecture review triggers for cross-cutting changes, explicit product acceptance for customer-visible behavior.

Agent-specific gates: repository instructions, allowed tool lists, sandboxing, no secret access unless explicitly required and scoped, no direct push to protected branches, no CI bypass, no self-approval, AI-risk checklist for non-trivial changes, disclosure when an agent generated or materially modified the change.

Security gates: SAST, dependency scanning, secret scanning, container scanning, IaC scanning, license checks, threat modeling for sensitive areas, secure defaults in templates, security review triggers for auth, authorization, cryptography, billing, privacy, and data export.

Supply-chain gates: immutable artifacts, build once and promote, build provenance, SBOMs, signed artifacts where appropriate, pinned dependencies, trusted build runners, dependency update policy, reproducible builds where practical.

Deployment gates: smoke tests, post-deploy checks, deploy markers, rollback or roll-forward, feature flags, kill switches, migration safety, environment parity, config validation.

Operational gates: logs, metrics, traces where useful, dashboards for critical paths, actionable alerts, SLOs or service-level expectations, named responder ownership, incident review habit.

Product gates: feature flags, dark launches, product acceptance, analytics, customer-impact review, support/readiness notes when customer behavior changes.

Data gates: backups, restore tests, migration rehearsals, expand/contract migrations, data retention and privacy checks, manual gates for irreversible changes.
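Two of the agent-specific gates above, no self-approval and scrutiny when tests and implementation change together, are easy to check mechanically. The sketch assumes hypothetical inputs: a list of reviewer account names, a set of known agent accounts, and a naming convention where test files live under `tests/` or start with `test_`.

```python
from typing import Iterable, List, Set


def has_valid_approval(author: str, reviewers: List[str], agent_accounts: Set[str]) -> bool:
    """True if at least one approval comes from a human other than the author."""
    return any(r != author and r not in agent_accounts for r in reviewers)


def is_test_file(path: str) -> bool:
    """Assumed convention: tests live under tests/ or are named test_*.py."""
    filename = path.rsplit("/", 1)[-1]
    return path.startswith("tests/") or "/tests/" in path or filename.startswith("test_")


def changes_tests_and_code(changed_paths: Iterable[str]) -> bool:
    """True if one change touches both tests and implementation.

    That is often legitimate, but it is also how a failing test gets
    quietly weakened, so it is worth flagging for the reviewer.
    """
    paths = list(changed_paths)
    touches_tests = any(is_test_file(p) for p in paths)
    touches_code = any(not is_test_file(p) for p in paths)
    return touches_tests and touches_code
```

The second check should label a PR for closer review rather than block it, since most honest changes update code and tests together.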

Metrics for high-throughput agentic teams

Metrics should reveal bottlenecks, not create quotas. DORA metrics remain useful because they keep throughput and instability in the same frame:

Change lead time: how long it takes a committed change to reach production or a deployable production artifact.
Deployment frequency: how often production deployment happens.
Failed deployment recovery time: how quickly the team recovers when a deployment fails.
Change failure rate: the share of production changes that cause degraded service, incidents, rollback, hotfix, or urgent remediation.
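The four definitions above reduce to a few lines once deployments are recorded consistently. This is a minimal sketch, assuming a hypothetical `Deployment` record with four fields; it uses medians, which is a choice made here for robustness to outliers rather than something the definitions require.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Deployment:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # degraded service, rollback, or hotfix
    recovered_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments: List[Deployment], window_days: int) -> Dict[str, object]:
    """Compute the four metrics for deployments observed in a window of days."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]
    return {
        "median_lead_time": median(lead_times),
        "deploys_per_day": len(deployments) / window_days,
        "change_failure_rate": len(failures) / len(deployments),
        "median_recovery_time": median(recoveries) if recoveries else None,
    }
```

Reporting all four together is the point: a rising `deploys_per_day` next to a rising `change_failure_rate` reads very differently from either number alone.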

Agentic teams should add pressure metrics around the new bottlenecks:

Deployment rework rate: changes that require follow-up because the original deployment was incomplete, risky, or wrong.
Escaped defects: bugs found after release, especially customer-visible ones.
PR size: files changed, lines changed, and conceptual size.
Review latency: time waiting for human review and time spent reviewing.
Test flake rate: flaky failures as a percentage of test runs.
Build duration: slow gates invite bypasses.
Rollback frequency: rollback and roll-forward are useful, but frequent use signals upstream gaps.
Percentage of changes behind flags: useful when interpreted by risk class.
Percentage auto-deployed versus manually promoted: a health metric only if paired with failure and rework data.
Agent-generated or agent-assisted change ratio: collect only if it does not incentivize larger diffs or less review.
Incident attribution: test gap, review gap, migration issue, dependency issue, prompt/agent issue, product misunderstanding, operational gap.
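
As one example from the list above, test flake rate needs a working definition of "flaky" before it can be measured. The sketch below assumes a failure counts as flaky when the same test also passed on the same commit, for instance on a retry. That definition, the tuple layout, and the `flake_rate` function are assumptions made for illustration.

```python
from collections import defaultdict


def flake_rate(runs: list) -> float:
    """Return flaky failures as a percentage of all test runs.

    Each run is a (commit, test_name, passed) tuple. A failure is treated
    as flaky only if the same test also passed on the same commit.
    """
    if not runs:
        return 0.0
    outcomes = defaultdict(set)
    for commit, test, passed in runs:
        outcomes[(commit, test)].add(passed)
    flaky_failures = sum(
        1
        for commit, test, passed in runs
        if not passed and outcomes[(commit, test)] == {True, False}
    )
    return 100.0 * flaky_failures / len(runs)


# Illustrative data only.
runs = [
    ("a1", "test_login", False),   # failed, then passed on retry: flaky
    ("a1", "test_login", True),
    ("a1", "test_export", True),
    ("b2", "test_export", False),  # failed consistently: a real failure
    ("b2", "test_export", False),
]
rate = flake_rate(runs)
print(rate)  # 20.0
```

Separating flaky failures from consistent ones matters because the two call for different responses: the first erodes trust in the gate, while the second is the gate doing its job.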

Do not optimize for deployment frequency alone. Do not treat AI-generated lines of code as productivity. Do not measure agent contribution in ways that reward larger diffs. Do not use DORA metrics as quotas. Do not let agents update tests and implementation together without scrutiny. Passing tests are evidence, not proof of product correctness.

Tip

The most useful metric in an agentic workflow may be review latency per unit of PR size. It tells you whether code generation is outrunning the team’s ability to understand what it is accepting.
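
A minimal sketch of that ratio follows, assuming lines changed as the size measure and hours waiting for review as the latency measure. Both are proxies, since PR size also includes files changed and conceptual size. The `PullRequest` fields and the `review_hours_per_100_lines` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequest:
    """Hypothetical PR record; lines changed is only a proxy for size."""
    lines_changed: int
    review_wait_hours: float


def review_hours_per_100_lines(prs: list) -> Optional[float]:
    """Total review wait divided by total size, scaled to 100 changed lines."""
    total_lines = sum(pr.lines_changed for pr in prs)
    if total_lines == 0:
        return None
    total_hours = sum(pr.review_wait_hours for pr in prs)
    return 100 * total_hours / total_lines


# Illustrative data only: two reporting periods.
last_month = [PullRequest(200, 4.0), PullRequest(300, 6.0)]
this_month = [PullRequest(1200, 30.0), PullRequest(800, 30.0)]

before = review_hours_per_100_lines(last_month)
after = review_hours_per_100_lines(this_month)
print(before, after)  # 2.0 3.0
```

In this made-up data, PRs grew and the wait per 100 lines grew with them, which is the pattern the tip warns about. A falling ratio on growing PRs deserves equal suspicion, because it can mean reviews are getting shallower rather than faster.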

Recommended path for a greenfield startup

The startup-realistic target is Stage 3: continuous delivery with manual production promotion. That is the posture that lets a small team deploy on demand without pretending automated checks can yet make every production decision.

Start with strong CI, not production automation. Keep PRs small because review capacity is the first bottleneck in agentic workflows. Require every agent-generated change to have a human owner. Separate deploy from release with feature flags. Build deployment automation before removing human production approval. Treat observability and rollback as prerequisites, not later enhancements.

Standardize paved roads early, but keep the platform small: repo templates, CI templates, service templates, test patterns, secrets handling, deployment scripts, and agent instructions. Do not build an internal platform as an act of faith. Build it where repeated work has already appeared.

Automate security and supply-chain checks early. Agents can increase dependency count, generated code volume, and configuration surface quickly. The cheapest time to add secret scanning, dependency scanning, and artifact discipline is before the first customer incident makes them urgent.

Keep humans responsible for architecture, product judgment, risk acceptance, and production ownership. Supervisory engineering is still engineering.

Anti-patterns and failure modes

The dangerous failures are usually recognizable before they cause an incident.

Pretend CI: required checks exist but stay red, flaky tests are ignored, or changes are merged over failing checks with a promise to fix them later.

Confidence theater: agents generate tests that prove their implementation, not the requirement; snapshot tests grow while meaningful behavior remains uncovered.

Review collapse: PRs become too large to review, and reviewers comment on naming while missing data, security, product, or operational risk.

Staging fiction: staging drifts from production, deploys are hand-maintained, and nobody knows whether a green staging deploy predicts a safe production deploy.

Artifact confusion: the team tests one build and deploys another, or cannot identify which commit is running in each environment.

Manual approval as crutch: production promotion is manual because nobody trusts the pipeline. That is not continuous delivery; it is a warning light.

Rollback theater: rollback is documented but never rehearsed, or migrations make rollback impossible without the team admitting it.

Flag debt: feature flags are added for release control and never removed, turning product state into archaeology.

Global CD by identity: the team enables continuous deployment everywhere because it wants to see itself as high-velocity, without proving eligibility by component and risk class.

Metric gaming: deployment frequency rises while rework, incidents, escaped defects, and customer confusion rise with it.

Final recommendation

For a greenfield startup using agents, the right target is usually Stage 1 immediately, Stage 2 before shared team development becomes busy, Stage 3 before real customer production usage, Stage 4 selectively after production observability and rollback are proven, and Stage 5 only after the team has evidence that automation is reducing risk rather than merely increasing motion.

Continuous delivery with manual production promotion can be the right operating model for a long time: before product-market fit, while the system changes rapidly, while customer commitments are still forming, and while production observability and incident discipline are still becoming real.

Agentic engineering increases proposed change volume before it increases delivery confidence.
Continuous delivery is the first serious production target; continuous deployment is optional and risk-classed.
Small PRs, human ownership, reliable CI, observability, rollback, and security automation are speed enablers, not bureaucracy.
Measure stability and rework beside throughput, or the team will optimize for motion.
Keep humans accountable for architecture, product judgment, risk acceptance, and production ownership.

Continuous deployment is not the reward for moving fast. It is the reward for proving that the system can absorb fast change safely.

Appendix: continuous deployment eligibility checklist

Before auto-deploying a service or component, ask these questions. A “no” does not mean the team is weak. It means manual promotion is still doing useful risk work.

Is the component low blast radius, or can the blast radius be constrained?
Are changes small enough to review reliably?
Are tests meaningful across unit, integration, contract, and critical paths?
Can the service deploy without manual environment edits?
Is the same artifact built once and promoted?
Are secrets managed outside repos, prompts, logs, and generated docs?
Are migrations backward-compatible or otherwise safely gated?
Are smoke tests and synthetic checks run after deploy?
Are error rate, latency, saturation, and relevant business signals observable?
Are alerts actionable and routed to a named responder?
Can the team roll back or roll forward quickly?
Are risky features behind flags or kill switches?
Are dependency, secret, container, and IaC scans in place where relevant?
Are artifacts immutable, and is provenance or signing in place where needed?
Are auth, billing, privacy, destructive data, irreversible migration, major dependency, and infrastructure permission changes excluded from auto-deploy?
Can someone explain why this class of change is safe to deploy automatically?
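
The checklist above can also be encoded as a policy check that runs before an automated promotion. This is a sketch under stated assumptions: the change-class labels, the checklist keys, and the `auto_deploy_decision` function are hypothetical, and the final question about explaining why a class of change is safe still needs a human answer.

```python
from dataclasses import dataclass

# Change classes the checklist excludes from auto-deploy (labels are illustrative).
EXCLUDED_CHANGE_CLASSES = {
    "auth", "billing", "privacy", "destructive-data",
    "irreversible-migration", "major-dependency", "infrastructure-permissions",
}


@dataclass
class ChangeAssessment:
    """Hypothetical input: what the change touches and which checks hold."""
    component: str
    change_classes: set
    checklist: dict  # question key -> True if the answer is "yes"


def auto_deploy_decision(assessment: ChangeAssessment) -> tuple:
    """Return (eligible, blockers). Any blocker means manual promotion."""
    blockers = [
        f"excluded class: {name}"
        for name in sorted(assessment.change_classes & EXCLUDED_CHANGE_CLASSES)
    ]
    blockers += [
        f"unmet: {question}"
        for question, ok in sorted(assessment.checklist.items())
        if not ok
    ]
    return (not blockers, blockers)


# Illustrative data only.
docs_change = ChangeAssessment(
    component="docs-site",
    change_classes={"content"},
    checklist={"same_artifact_promoted": True, "rollback_rehearsed": True},
)
billing_change = ChangeAssessment(
    component="api",
    change_classes={"billing"},
    checklist={"same_artifact_promoted": True, "rollback_rehearsed": False},
)

docs_result = auto_deploy_decision(docs_change)
billing_result = auto_deploy_decision(billing_change)
print(docs_result)
print(billing_result)
```

Returning the list of blockers, not just a boolean, keeps the reason for manual promotion visible, so a "no" reads as useful risk work and not as an unexplained gate.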

Discussion prompts

Which part of the current delivery system would become the bottleneck first if PR volume doubled next month?
What class of change would be the first safe candidate for selective continuous deployment, and what evidence is still missing?
Where is manual approval adding product or risk judgment, and where is it compensating for weak automation?

References

DORA: Continuous Delivery - DORA's definition of continuous delivery as release-on-demand, safely and sustainably.
DORA Accelerate State of DevOps 2024 Report - 2024 findings on AI adoption, productivity perception, and delivery-performance caveats.
DORA: State of AI-assisted Software Development 2025 - AI as an amplifier of existing organizational strengths and weaknesses.
Google Cloud DevOps research summary - DORA metrics and software-delivery capabilities, including AI-related 2024 findings.
The Impact of AI on Developer Productivity: Evidence from GitHub Copilot - Controlled experiment showing faster completion on a bounded programming task.
The Effects of Generative AI on High-Skilled Work - Field experiments with software developers across Microsoft, Accenture, and another large company.
IBM Research: AI code assistant productivity and experience - Enterprise case study on perceived productivity benefits and uneven outcomes.
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - METR randomized controlled trial showing context-dependent slowdown in mature repositories.
NIST SP 800-218: Secure Software Development Framework - Baseline secure software development practices across preparation, protection, production, and response.
SLSA: Supply-chain Levels for Software Artifacts - Build integrity, provenance, and supply-chain hardening model.
OWASP Top 10 for Large Language Model Applications - LLM application risks including prompt injection, insecure output handling, supply-chain risk, and excessive agency.
OWASP Secure Coding with AI Cheat Sheet - Practical guidance for reviewing, constraining, and validating AI-generated code.
Feature Toggles - Martin Fowler article explaining feature flags and separating deployment from release.
Martin Fowler Software Delivery Guide - Reference hub for continuous delivery, canary release, blue-green deployment, and related delivery patterns.
Quality Gates: The Contract That Lets You Move Fast - Local Cairns companion on deterministic gates for agent-driven work.
AI-Native Engineering: Can You Migrate Incrementally? - Local Cairns companion on workflow redesign versus tool adoption.

Generated by Cairns · Agent-powered with Claude
