The productivity paradox

The research is converging on a counterintuitive pattern: AI tools dramatically increase individual output while leaving system-level delivery roughly flat. Three studies tell the same story from different angles, and once you have the angles in view the apparent contradictions in the popular discourse mostly evaporate.

The first is METR’s randomized controlled trial, published in July 2025. Sixteen experienced developers worked through 246 real tasks in their own open-source repositories: repos with five-plus years of context, 22k+ stars, and over a million lines of code. They used Cursor Pro with Claude 3.5 and 3.7 Sonnet. Tasks were randomly assigned to “AI allowed” or “AI disallowed.” An RCT is the methodological gold standard, and the result was the opposite of what the participants had forecast:

  • Developers forecast a 24% speedup before the study.
  • Developers self-reported a 20% speedup after the study.
  • Actual measured outcome: a 19% slowdown.

That is a 39-point gap between perception and measurement (a 20% self-reported speedup against a 19% measured slowdown), robust across 21 controlled variables. METR’s February 2026 follow-up softened the finding slightly, because selection effects (developers refusing to participate without AI) contaminated the rerun, but it didn’t refute the original.

The second is Faros AI’s AI Productivity Paradox report, drawing telemetry from over ten thousand developers across 1,255 enterprise teams. The headline numbers tell the same story from a different vantage:

  • +21% tasks completed per developer
  • +98% pull requests merged
  • +154% larger PRs on average
  • +91% longer PR review time per PR
  • Net delivery throughput: essentially flat

Key Takeaway

Coding is a fraction of total cycle time. When AI accelerates the generation step without changing review, QA, and deployment, the gains pile up in front of an unchanged bottleneck.

The Faros report frames this as Amdahl’s Law in plain dress. Even a theoretical 100% coding speedup yields only 15–25% system-level improvement, because coding has never been the bulk of the work. Addy Osmani’s analysis of the same data lands at the same place: the bottleneck moved, and most teams haven’t.
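
As a rough check on that range, Amdahl’s Law can be written out directly. The sketch below reads “100% coding speedup” as coding running twice as fast, and it assumes coding takes somewhere between about a quarter and 40% of total cycle time. Those fractions are illustrative assumptions chosen to show the arithmetic, not figures taken from the Faros report, and the function names are hypothetical.

```python
def system_speedup(coding_fraction: float, coding_speedup: float) -> float:
    """Amdahl's Law: overall speedup when only the coding share gets faster.

    coding_fraction: share of total cycle time spent coding (assumed, 0..1).
    coding_speedup:  how many times faster the coding step becomes (2.0 = twice as fast).
    """
    if not 0.0 <= coding_fraction <= 1.0:
        raise ValueError("coding_fraction must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    untouched = 1.0 - coding_fraction            # review, QA, deployment, waiting
    accelerated = coding_fraction / coding_speedup
    return 1.0 / (untouched + accelerated)


def speedup_ceiling(coding_fraction: float) -> float:
    """Upper bound on overall speedup as coding_speedup grows without limit."""
    if not 0.0 <= coding_fraction < 1.0:
        raise ValueError("coding_fraction must be in [0, 1)")
    return 1.0 / (1.0 - coding_fraction)
```

Under these assumptions, a coding share of 40% with coding twice as fast gives an overall speedup of about 1.25 (a 25% improvement), and a share of 26% gives roughly 1.15. The ceiling for a 40% share is about 1.67, no matter how fast the coding step gets.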

The third is DORA. The 2024 report found that every 25% increase in AI adoption was associated with −1.5% throughput and −7.2% stability. The 2025 report reversed this: AI adoption is now linked to higher throughput. The most useful reading isn’t that one report was wrong — it’s that orgs which invested in workflow redesign started compounding gains twelve to eighteen months in, while orgs that handed out licenses and kept everything else the same are still flat.

Read all three together and the original question — can you migrate incrementally? — answers itself. Incremental adoption as a tool decision doesn’t produce the 10x narrative. Incremental adoption as a workflow redesign sometimes does. The two paths look similar at the start and diverge sharply by month twelve.

What twelve months added (and changed)

Some of the studies above are now a year old, which in this space is a long time. They measured developers working in unchanged workflows with the AI tools and harnesses of early 2025 — before Claude Code matured, before MCP, before multi-agent harnesses were broadly available. The picture in mid-2026 is sharper in places and meaningfully different in others.

METR’s own follow-up moved. The May 2026 self-reported impact survey of 349 technical workers found a median value uplift of 1.4–2x and a speed uplift of 3x, with retrospective estimates of 1.3x in March 2025 → 2x in March 2026 → forecast 2.5x for March 2027. METR staff using current agentic tools measured their own time savings at 2x to over 10x, depending on the individual. The 2025 RCT’s 19% slowdown is still the right warning about dropping tools into unchanged workflows, but the gap between perception and measurement has narrowed, and the underlying tooling has compounded.

DORA 2026 formalized the curve. The April 2026 ROI of AI-Assisted Software Development report introduced a J-Curve model: organizations experience a temporary productivity dip while team workflows adapt, code review overhead grows, and testing and approval processes catch up — then recover and compound. DORA calls the dip “the tuition cost of transformation” and warns that leaders who misread it as failure withdraw funding prematurely. For an illustrative 500-person engineering organization, DORA models ~39% ROI with an 8-month payback period, while accepting a small stability tax — change failure rate rising from 5% to 6%. The 2025 report’s framing of “AI as a multiplier of existing engineering conditions” now has a shape and a dollar figure attached to it.
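
To make the shape of a J-curve and the idea of a payback period concrete, here is a minimal sketch. The monthly figures are invented for illustration and deliberately constructed to pay back in month 8; they are not DORA’s model or data, and the function names are hypothetical.

```python
from itertools import accumulate
from typing import List, Optional, Sequence

# Hypothetical monthly net value (benefit minus cost) in arbitrary units.
# Invented for illustration only; these are NOT DORA's figures.
ILLUSTRATIVE_MONTHLY_NET = [-30, -20, -10, 0, 10, 15, 15, 20, 25, 25, 30, 30]


def cumulative_value(monthly_net: Sequence[float]) -> List[float]:
    """Running total of net value; the dip and recovery trace the 'J'."""
    return list(accumulate(monthly_net))


def payback_month(monthly_net: Sequence[float]) -> Optional[int]:
    """First 1-indexed month where cumulative net value is non-negative.

    Returns None if the series never recovers within the horizon.
    """
    for month, total in enumerate(accumulate(monthly_net), start=1):
        if total >= 0:
            return month
    return None


def trough(monthly_net: Sequence[float]) -> float:
    """Deepest point of the cumulative curve, for a non-empty series."""
    return min(cumulative_value(monthly_net))
```

With the illustrative series, the cumulative curve bottoms out at -60 and returns to zero in month 8. A leader who evaluates the program at month 3 or 4 sees only the bottom of the curve, which is the misreading the report warns about.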

Adoption is approaching the ceiling. JetBrains’s January 2026 survey found 90% of developers regularly use AI tools at work, with 74% using specialized AI dev tools (not just chatbots) and 64% having engaged with agentic tools (25% regularly, 39% experimentally). The question is no longer whether teams use AI; it’s how the surrounding work has changed.

Harness maturity is now load-bearing. Anthropic’s 2026 Agentic Coding Trends Report finds that engineers use AI in roughly 60% of their work but fully delegate only 0–20% of tasks; the human is still in the loop for most of it. 22% of production deployments coordinate three or more agents, and Model Context Protocol servers crossed 9,400 publicly registered, evidence that the support stack discussed below is no longer aspirational. The same report cites a Claude Code autonomous run in the vLLM codebase (12.5 million lines, seven hours, 99.9% numerical accuracy) as a concrete capability marker for what mature agents can now do unattended in real brownfield code.

But the failure rate is still real. Across enterprise surveys, 88% of agent pilots never reach production, and 22% of production deployments report negative ROI at twelve months. The reported causes line up with the workflow-redesign thesis: unclear success criteria, insufficient tool or data access, and drift in evaluation coverage. The same conditions that turned a tool dropped into an unchanged workflow in 2025 into a 19% slowdown turn an agent dropped into an unchanged workflow in 2026 into a stuck pilot.

Key Takeaway

The 2025 evidence said “tool adoption alone yields 5–20%.” The 2026 evidence says “tool adoption now yields more, but the gap to 10x is still the same gap — workflow, skills, and support infrastructure — and the J-curve quantifies how long it costs to cross it.”

The thesis the rest of this cairn argues — that AI-native engineering is a workflow transformation, not a tool deployment — survives the year in better shape than it arrived in. The new evidence sharpens the recovery curve, raises the ceiling, and gives the migration a credible expected ROI. It does not collapse the bottom of the J.

Size, maturity, and risk tolerance sort the outcomes

Adoption is now nearly uniform; results are bimodal. The same JetBrains/Faros/DORA studies that put adoption at 90% also surface a wide variance in what teams get for it, and three axes — org size, engineering maturity, compliance burden — explain most of the spread. The migration path that works for one segment is usually the wrong path for another.

AI-native startups operate from a different baseline

The headline 10x stories almost all originate from the same shape: small, senior, technically homogeneous teams that wrote their first lines of production code after the harnesses matured.

  • 25% of Y Combinator Winter 2025 startups have 95%+ AI-generated codebases — for that cohort, “AI-augmented” is a backwards way to describe what they’re doing; the human contribution is review, architecture, and product decisions.
  • Andreessen Horowitz portfolio companies report their best engineers’ productivity up 10–20x with AI coding tools. The qualifier is doing real work: those are the engineers who already knew how to specify, decompose, and review.
  • The cost-structure argument compounds the productivity one: a solo founder offloading 80–85% of execution to agents runs at single-digit-percent of a traditional team’s burn, which makes the per-engineer multipliers look conservative against the per-company outcome.

For this segment, Track 2 of the two-track recommendation isn’t a pilot — it’s the only track. Track 1 (traditional engineering, AI-augmented) doesn’t exist because there’s no legacy process to keep moving in parallel. The pitfalls are different too: brittle architecture choices made fast and locked in, no institutional memory to push back on agent output, and the bus-factor-zero problem at maximum intensity because there is no senior peer to catch the cowboy.

Mid-sized teams compound — but only if they redesigned

The DORA 2026 J-curve has a different shape for a 30-person engineering org than for a 500-person one. Small teams (1–10 engineers) show 15–20% better per-engineer metrics, and small-batch work amplifies AI’s positive effects on product performance. A 30-engineer team that took the workflow-redesign bet seriously in 2025 is, by mid-2026, often outshipping a 500-engineer team that handed out licenses and kept everything else the same.

The catch is that mid-sized orgs frequently have just enough legacy process to slow Track 2 without enough scale to fund a separate pilot lane. The honest answer for many of them is: pick a single product surface, run it AI-native, accept that the rest of the org learns by watching, and resist the urge to roll out one official process across the whole engineering team.

Large enterprises hit the visibility crisis

Adoption at the top end is essentially universal — 90% of Fortune 100 companies have deployed GitHub Copilot — and the headline reductions are real (33–36% reduction in time spent on development-related activities in large-enterprise studies). What’s also real is that the productivity gain has not converted cleanly into delivery throughput, and the DORA 2026 framing is now more pointed about why.

  • AI tools create a bimodal distribution: 30–40% better cycle time alongside 15–25% higher change failure rates. The stability tax shows up as incidents, not as headline numbers.
  • High pre-AI engineering maturity is not protective. DORA 2026 found no evidence that organizations with strong pre-AI performance are insulated from the quality degradation that comes with high adoption; high-maturity orgs are experiencing the same downstream deterioration as everyone else. The platform foundations help recover faster from the J-curve dip — they do not let you skip it.
  • Visibility is the new bottleneck. With agent work invisible between humans and PR sizes inflated, classic DORA metrics measure less of the real picture. Several 2026 commentaries argue that “DORA metrics are not enough” without telemetry on AI authorship, review depth, and code-comprehension signals.
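
One way to picture the missing telemetry is to split a classic DORA metric by authorship. The sketch below computes change failure rate per authorship tag. The `Deployment` record and the tag values are hypothetical; no real tool’s schema is implied, and how a team decides that a change is “AI-heavy” is left open.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Deployment:
    authorship: str        # hypothetical tag, e.g. "ai_heavy" or "human_heavy"
    caused_incident: bool  # True if the change led to a failure in production


def change_failure_rate_by_authorship(
    deployments: Iterable[Deployment],
) -> Dict[str, float]:
    """Change failure rate (failed deployments / all deployments) per tag."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for deployment in deployments:
        totals[deployment.authorship] += 1
        if deployment.caused_incident:
            failures[deployment.authorship] += 1
    return {tag: failures[tag] / totals[tag] for tag in totals}
```

A single blended change failure rate can hide a gap between the two groups; reporting them side by side is the smallest version of the authorship telemetry the commentaries ask for.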

For large enterprises, Track 2 has to be sponsored as a protected lane, run by senior engineers, on bounded work — and the temptation to declare success based on Track 1’s adoption numbers has to be actively resisted.

Regulated industries hit the governance wall before the workflow wall

The most under-discussed finding in the 2026 data is that compliance is now the dominant barrier to AI-native adoption in regulated sectors, not technical feasibility.

  • 42% of companies abandoned most AI initiatives in 2025 — up from 17% the year before — with compliance and governance failures (not technical failures) as the most common cause.
  • 78% of business executives cannot pass an independent AI governance audit within 90 days.
  • Only 20% of companies have a mature governance model for autonomous AI agents. The remaining 80% operate without the architecture that the EU AI Act, updated HIPAA rules, and the OWASP Agentic AI Top 10 (2026) now effectively require.
  • In high-compliance settings, deploying consumer-grade AI coding tools without enterprise controls is associated with 40% higher credential exposure.

Warning

For regulated orgs, the two-track migration needs a Track 0 (governance, audit trails, model and data residency, agent identity and permissions, evaluation coverage) underneath Track 1 and Track 2. Start it at the same time as the other tracks and run it in parallel; don’t bolt it on afterward. Skipping Track 0 is what produces the 42% abandonment number.

The implication isn’t that regulated industries can’t be AI-native; it’s that the order of operations is different. The pilot lane that produces the muscle has to live inside the compliance perimeter from day one, or it doesn’t transfer to anything that matters.

The composite picture

Lay the three axes against each other and the spectrum of expected outcomes sharpens. An AI-native startup with high engineering maturity and low compliance burden compounds toward the 10x stories — those are the cases the public narrative is built on. A mid-sized team with high maturity and low compliance burden, taking the redesign bet, lands somewhere in the 2–5x range with a real J-curve to climb. A large regulated enterprise with strong pre-AI maturity but heavy compliance lands at 30–40% cycle-time improvement with a stability tax and a multi-year governance project running underneath. A large enterprise with weak engineering maturity and weak governance is where the failure stories come from — and the 42% abandonment number is mostly drawn from that quadrant.

The two-track migration approach below is written for the broad middle of this spectrum. Adjust the tracks to where your org actually sits before treating it as a literal recipe.

Where 10x lives (and where it doesn’t)

The 5–20% vs 10x range maps cleanly onto three axes. Each one is a place a migration can stall, and each one is independently expensive to fix.

1. Greenfield vs brownfield (really: codebase quality)

The MIT Sloan piece “The Hidden Costs of Coding With Generative AI” names greenfield/brownfield as the dominant risk axis. Greenfield work is low risk, gains are realized, and the technical debt is cheap to rewrite if it goes wrong. Brownfield work compounds existing problems, especially with less-experienced developers, and the cleanup burden falls on the seniors who were already the constraint.

A Tilburg University study of Copilot adoption in open-source projects sharpens this further: productivity gains come overwhelmingly from peripheral (less-experienced) developers. Core developers review 6.5% more code post-Copilot and show a 19% drop in their own original-code productivity. AI velocity isn’t free; it’s redistributed onto the people maintaining the system.

A contrarian refinement is worth knowing. Ström Capital’s “Why AI works better on existing codebases” argues that a well-architected brownfield repo gives agents constraints to imitate, where greenfield invites invented inconsistencies. This doesn’t refute the MIT and Tilburg findings — it refines them. The real axis is codebase quality and pattern coherence, with greenfield-vs-brownfield as a proxy. A clean brownfield monorepo with strong conventions is friendlier to agents than a hand-rolled greenfield repo with no patterns yet established.

By mid-2026, spec-driven development for brownfield work is an active practice rather than a research idea. The pattern is consistent across the field reports: an agent scans the codebase, produces a compact summary of relevant state, and works on a planned change rather than vibe-coding into the existing structure. Long-running autonomous runs are now realistic — Claude Code’s seven-hour pass on the 12.5M-line vLLM codebase at 99.9% numerical accuracy is the headline example — but the technique that gets there is still careful research-plan-implement loops, not “let it cook.”

2. Mindset and skill

Agentic-first development is a different job. The shift is from individual contributor to lead of a team of agents. That means:

  • Letting go of reading every line of code.
  • Trusting agents with the same autonomy you’d give a human teammate.
  • Specifying intent precisely instead of writing implementation.
  • Maintaining architectural coherence as a primary deliverable.
  • Composing parallel workstreams rather than executing sequential ones.

arXiv 2504.13903 — “From Teacher to Colleague” shows experienced developers frame AI as a junior colleague, while less-experienced developers frame it as a teacher. The reframe is real and consequential: it determines what work gets delegated, how oversight is structured, and what the engineer thinks their own job is.

Most engineers, at some level, emotionally identify with coding. That makes the shift hard for reasons that have nothing to do with technical skill. The engineer who’s proud of their typing speed and their familiarity with the codebase is not naturally the same engineer who’s proud of running a clean delegation system. The crossover is a career transition wearing the costume of a tooling upgrade.

The named load-bearing skill in 2026 is context engineering — the discipline of assembling, curating, and passing the right information to agents at the right time. It’s the agent-team-lead version of what senior engineers used to do implicitly when they explained a feature to a new hire: framing the problem, surfacing the relevant prior art, naming the constraints, deciding what to leave out. The teams that learned to do this deliberately are the ones whose 10x stories survive scrutiny.
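
A minimal sketch of what “assembling and curating” can look like when it is made explicit. The `ContextPacket` type, its field names, and the character budget are all hypothetical; they mirror the four moves named above (framing the problem, surfacing prior art, naming constraints, deciding what to leave out) and do not describe any particular agent harness.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextPacket:
    """Hypothetical bundle of information handed to an agent for one task."""
    problem: str
    prior_art: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    out_of_scope: List[str] = field(default_factory=list)

    def render(self, max_chars: int = 4000) -> str:
        """Render the packet as plain text, skipping empty sections.

        Raises ValueError when over budget instead of truncating, so the
        author has to decide what to cut.
        """
        sections = [
            ("Problem", [self.problem]),
            ("Relevant prior art", self.prior_art),
            ("Constraints", self.constraints),
            ("Out of scope", self.out_of_scope),
        ]
        lines: List[str] = []
        for title, items in sections:
            if not items:
                continue
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        text = "\n".join(lines)
        if len(text) > max_chars:
            raise ValueError(
                f"context is {len(text)} chars, budget is {max_chars}; curate it"
            )
        return text
```

Failing loudly on an oversized packet is a design choice, not a requirement: it treats deciding what to leave out as part of the work rather than something a truncation step does silently.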

3. Workflow and process

Agile, scrum, and PR culture are all built around humans operating at human speed. Agents work like humans in many ways — they need guardrails, coordination, and memory — but they need those structures rebuilt to operate at agent speed. Ticket cadence, sprint boundaries, code review queues, and standup ceremony all assume the bottleneck is human throughput. Once that assumption is wrong, the existing collaboration infrastructure is the constraint.

Warning

This is what “increment the workflow, not just the tool” actually means: rebuild the surrounding system to match the new pace of work, or watch the gains get absorbed by the unchanged bottleneck.
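
The “absorbed by the unchanged bottleneck” effect shows up even in a toy model. The sketch below simulates a single downstream stage (review, QA, or deployment) with a fixed daily capacity. The function name and all numbers are illustrative assumptions, not Faros data.

```python
from typing import Tuple


def simulate_stage(
    arrivals_per_day: int, capacity_per_day: int, days: int
) -> Tuple[int, int]:
    """Toy model of one downstream stage with fixed capacity.

    Returns (items_delivered, items_still_waiting) after `days` days.
    """
    waiting = 0
    delivered = 0
    for _ in range(days):
        waiting += arrivals_per_day                  # new work lands in the queue
        done_today = min(waiting, capacity_per_day)  # the stage clears what it can
        waiting -= done_today
        delivered += done_today
    return delivered, waiting
```

With 10 arrivals a day against a capacity of 10, twenty days deliver 200 items and leave nothing waiting. Double the arrivals to 20 while capacity stays at 10, and the same twenty days still deliver 200 items, with another 200 sitting in the queue. Output upstream doubled; delivery did not move.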

The IC-to-team-lead shift

This deserves its own section because it’s where most enterprises trip. The technology is doing its part; the org chart and the job description aren’t.

Andrew Storms’s piece “Cowboy Coder Is Back. This Time, They Scale” names a specific failure mode. An engineer prompts an agent, the agent emits 800 lines, the engineer skims, tests pass, merge. Repeat ten times a day. Output is enormous; velocity charts look incredible. But nobody on the team has reasoned through that code. The author can’t walk you through it under questioning — they prompted it, didn’t write it. The reviewer can’t either; half the time the reviewer is another agent. Storms calls this bus factor zero: code enters the repository with no human-resident mental model of why it’s the way it is.

This is the same cowboy antipattern the industry spent twenty years building defenses against (code review, pair programming, design docs, collective ownership), now operating at machine speed without the social brakes. Agents don’t have egos, reputations, or peers who can push back. (Kilo’s “Inside Kilo Speed” profile, in the same publication, is a useful complement on what good AI-native practice looks like.)

The mitigation isn’t to abandon agents. It’s to extend agent-equivalents of the practices that historically contained cowboys:

  • Require comprehension, not just approval. Authors should be able to walk through any meaningful PR without re-prompting the agent to find out what’s in it.
  • Cap PR size. Code review evolved assuming limited human throughput on both sides. Fifty lines can be reviewed. Eight hundred lines get rubber-stamped.
  • Tag agent involvement. Make AI authorship first-class metadata. Track incident rates and refactor cost on AI-heavy modules vs human-heavy ones, and let the numbers steer where you accept the risk.
  • Reframe tech debt as unread code. The most dangerous code in the repo is no longer the bad code; it’s the unread code — modules nobody has internalized. The bad code can be fixed by anyone who understands it; the unread code stays brittle even when it’s locally clean.
  • Protect deliberate practice for juniors. Engineers who never struggle through hard bugs don’t become senior engineers who can debug under pressure. There has to be a path through the difficulty, not around it.
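
The first three practices in the list above are mechanical enough to sketch as a pre-merge check. Everything here is hypothetical: the `PullRequest` fields, the 200-line default cap, and the rule that a written walkthrough stands in for comprehension. A real gate would read these values from whatever review tooling a team already uses.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_MAX_CHANGED_LINES = 200  # assumed cap; pick a number your reviewers can actually read


@dataclass
class PullRequest:
    changed_lines: int
    ai_assisted: Optional[bool]   # None means the author never tagged the PR
    author_walkthrough: str = ""  # the author's own explanation of the change


def review_gate(
    pr: PullRequest, max_changed_lines: int = DEFAULT_MAX_CHANGED_LINES
) -> List[str]:
    """Return the reasons a PR should not go to review yet (empty list = ok)."""
    problems: List[str] = []
    if pr.changed_lines > max_changed_lines:
        problems.append(
            f"PR changes {pr.changed_lines} lines; cap is {max_changed_lines}"
        )
    if pr.ai_assisted is None:
        problems.append("AI-authorship tag is missing")
    if not pr.author_walkthrough.strip():
        problems.append("author walkthrough is missing")
    return problems
```

A non-empty walkthrough is only a proxy for comprehension. The check makes the expectation visible; it does not verify that the author understands the code.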

The harder problem is that agents lack the institutional memory humans accumulate. Five years of “we tried that and it broke” lives in human teammates’ heads automatically; for agents, it has to be made explicit. That requires a support stack — specs, memory layers, review agents, harnesses, coordination tools — that has matured fast over 2025–2026 but is still hard to pick between. MCP servers crossed 9,400 publicly registered by early 2026; 22% of production agent deployments now coordinate three or more agents; the harness layer is no longer a research artifact. Choosing wrong is expensive; waiting to choose is also expensive. The stack we run inside Constructured (beads as durable issue memory, project memory systems, agent-readable architecture docs, harness-level orchestration) is one such bet, but the broader point is that some equivalent has to exist or the agents stay junior forever.

Process friction: why the same tool yields 5–20%

Return to the Faros data for a moment. Review time grew 91% in AI-heavy teams. That’s the system telling you exactly where the choke point moved. The friction isn’t mysterious — it’s the predictable consequence of every assumption that was wired into the workflow before AI arrived.

Pre-AI assumption                             | What breaks with agents
----------------------------------------------+-------------------------------------------
Limited human typing speed                    | Code can be generated faster than reviewed
Devs read all their own diffs                 | Agents produce code humans skim
One commit = one author with one mental model | Code arrives with no human-resident model
Sprint cadence matches dev cadence            | Agents operate at sub-hour cycle times
Standup makes coordination visible            | Agent work is invisible between humans
Tribal knowledge spreads via teammates        | Agents have no tribal knowledge by default

Every row in that table is a small infrastructure project. Some of them are tooling problems (review automation, smaller PRs, agent-readable specs). Some of them are coordination problems (different standup format, different ticket cadence, different ownership boundaries). Some of them are talent problems (different hiring profile, different career ladder, different definition of senior). Pick the wrong one to start with and the gains stay flat.

DevOps deserves special mention. By some counts, 80%+ of building and running cloud software is the operational work around the code: infrastructure, deployment, observability, incident response, security posture. That’s the highest-trust zone — handing it to agents requires more confidence in their judgment than greenfield prototyping does. The trust gradient runs roughly:

Definition

Greenfield prototyping → greenfield production → brownfield refactor → brownfield production → DevOps → incident response. Gains arrive in roughly that order. Teams expecting uniform 10x across the gradient will be disappointed; teams that sequence adoption along it can compound wins.
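
Sequencing adoption along the gradient can be expressed as an ordered type with a single rule: move one step at a time, and only when the current step is holding up. The sketch below is illustrative; the enum and the `current_stage_is_stable` flag are hypothetical, and what counts as “stable” is a judgment each team has to define for itself.

```python
from enum import IntEnum


class TrustStage(IntEnum):
    """The trust gradient, ordered from lowest to highest required trust."""
    GREENFIELD_PROTOTYPING = 1
    GREENFIELD_PRODUCTION = 2
    BROWNFIELD_REFACTOR = 3
    BROWNFIELD_PRODUCTION = 4
    DEVOPS = 5
    INCIDENT_RESPONSE = 6


def next_stage(current: TrustStage, current_stage_is_stable: bool) -> TrustStage:
    """Advance one step along the gradient, never skipping a stage.

    Stays put if the current stage is not yet stable or is already the last.
    """
    if not current_stage_is_stable or current is TrustStage.INCIDENT_RESPONSE:
        return current
    return TrustStage(current + 1)
```

The point of the one-step rule is that a team cannot jump from prototypes to production-critical operations on the strength of early wins.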

A migration approach that actually works

The two-track recommendation holds up well against the research, and it’s the shape of every AI-native transition that hasn’t stalled.

Track 1 — Traditional engineering, AI-augmented. Existing teams, existing process, AI tools added. Expect 5–20% gains, concentrated on routine tasks — boilerplate, documentation, test scaffolding, simple refactors. Don’t try to make this track produce the 10x narrative; it can’t. Its job is to keep core production work moving while Track 2 develops, and to fund the migration without the disruption of a full rebuild.

Track 2 — AI-native pilot lane. New rules:

  • Smaller specs, written for agent execution.
  • Tight, fast, disposable branches.
  • Aggressive automation, especially for tests and review.
  • Humans as architects and editors, not authors.
  • Different team composition — fewer engineers, more senior on average, more architecture-leaning.
  • Different metrics — system-level outcomes, not individual commits.

The pilot lane proves the pattern on bounded work, develops the muscle, and exposes which support infrastructure your org actually needs. Then you let it expand into adjacent work where the bet is favorable. The expansion order matters as much as the pilot itself:

  1. Greenfield internal tools — lowest stakes, highest learning rate.
  2. Greenfield customer-facing prototypes — still recoverable if it goes wrong.
  3. Brownfield refactor work in well-architected modules — good agent context, contained blast radius.
  4. Brownfield feature work in well-architected modules — confidence compounding.
  5. DevOps automation in low-blast-radius areas — log shippers, dashboards, CI tweaks.
  6. Eventually: brownfield work in legacy or poorly-architected modules; DevOps in production-critical paths.

Tip

You don’t migrate cleanly. You incubate the new operating model, prove it where the bet is favorable, then let it eat the old process where it wins. The chrysalis doesn’t optimize the caterpillar — it becomes something else, while the caterpillar keeps the lights on.

The shape of this is familiar from earlier transitions — agile rollouts, DevOps adoption, microservices — and the failure modes are familiar too. Asking the whole org to convert at once produces theater. Forcing legacy work to use the new operating model first produces backlash. Measuring Track 2 with Track 1’s metrics produces despair. The migration only works when each track has its own job description, its own definition of done, and its own honest expectation of pace.

The 2026 base rate for this not working is sobering. Across enterprise surveys, 88% of agent pilots never reach production, and the named causes — unclear success criteria, insufficient tool or data access, drift in evaluation coverage — are exactly what a serious Track 2 has to solve for. The pilot lane that does reach production is the one that picked a bounded, measurable, owned problem on day one.

What to tell skeptics and enthusiasts

To the skeptic — the one who says “AI doesn’t actually help” — point at the workflow, not the tool. The METR study shows what happens when experienced devs use AI tools inside an unchanged workflow on familiar codebases: they get slower. That’s a real finding and it should be taken seriously. It’s also exactly the predicted outcome of dropping a tool into a system not designed for it. The skeptic’s evidence is genuine; their conclusion is incomplete.

To the enthusiast, the one who says “we’ll get 10x next quarter,” point at Amdahl’s Law and the Faros review-time data. Even with an infinite coding speedup, system gain is capped by the time still spent in review and deployment. The 10x stories are real, but they come from rebuilt operating models, not from tool adoption. Quarterly horizons aren’t long enough to redesign workflow, build a support stack, and grow the IC-to-architect muscle on the people who used to write the code. The enthusiast isn’t wrong about the destination; they’re wrong about the trip.

To the org asking how to start: greenfield, small, with a willing team. Build the muscle, build the support infrastructure, build the metrics. Let the rest of the org watch and learn. The first AI-native lane in your company doesn’t have to be impressive — it has to be honest, observable, and improving on a curve that the next lane can join when it’s ready.

Summary

  1. The 2025 evidence (METR, Faros, DORA 2024) said AI tools increase individual output while leaving system-level delivery roughly flat. The 2026 evidence (METR May 2026, DORA 2026 ROI, Anthropic 2026 Agentic Coding Trends) raises the ceiling — 1.4–2x value uplift, 39% modeled ROI, 60% of engineering work AI-assisted — without changing the underlying thesis.
  2. DORA 2026's J-Curve formalizes the recovery shape: a productivity dip during workflow adaptation, then compounding gains. Leaders who misread the dip as failure withdraw funding prematurely.
  3. 10x outcomes track three axes — codebase quality and pattern coherence, mindset and skill (especially the IC-to-team-lead shift and context engineering as the named load-bearing skill), and workflow and process. All three usually have to move together.
  4. Outcomes are bimodal by org segment. AI-native startups (small, senior, low compliance) compound toward 10x. Mid-sized teams with redesign discipline land at 2–5x. Large enterprises see 30–40% cycle-time gains with a stability tax. Regulated industries hit the governance wall before the workflow wall — 42% abandonment in 2025, only 20% with mature agent governance — and need a Track 0 alongside Tracks 1 and 2.
  5. Bus factor zero is the dominant failure mode: code merges that nobody on the team has reasoned through. The defenses are the old defenses against cowboy coding, restated for machine speed. Engineers fully delegate only 0–20% of tasks even now; the human is still in the loop for most of it.
  6. Agents lack institutional memory by default. A support stack — specs, memory layers, review agents, harnesses (MCP at 9,400+ servers), coordination tools — is now load-bearing infrastructure, not a nice-to-have. 22% of production deployments coordinate three or more agents.
  7. Run two tracks (or three, when compliance is heavy): traditional-augmented for the existing surface area, AI-native pilot for bounded new work, governance/compliance running in parallel when the org sits in a regulated space. Expand the pilot along the trust gradient (greenfield → brownfield → DevOps → incident response) rather than uniformly across the org. The base rate is brutal: 88% of agent pilots never reach production.
  8. The migration is a career transition wearing a tooling-upgrade costume. Treat it as such — different job descriptions, different metrics, different definition of senior, different hiring profile.

Discussion prompts

  • Pick one team you're close to. Which of the three axes (codebase quality, mindset, workflow) is most under-invested for them today? What would the smallest meaningful intervention look like?
  • Look at the most recent five PRs you've merged. For how many of them can you, right now, without re-opening the diff, narrate what the code does and why it's structured the way it is? What does that ratio say about your bus factor on that work?
  • Locate your org on the size/maturity/compliance grid — AI-native startup, mid-sized post-redesign team, large enterprise with stability tax, regulated industry behind the governance wall, or somewhere in between. Which migration tracks (0, 1, 2) does your segment actually need running, and which one are you under-resourcing today?
  • If your org started an AI-native pilot lane tomorrow, who would you put on it, what work would you give it, and what metric would you watch first? Where does that answer make you uncomfortable, and why?

References

2025 — the original wave

  1. METR (2025): Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity — The RCT showing a 19% slowdown against a 20% self-reported speedup; the cleanest single piece of evidence that perception and measurement had diverged in early 2025.
  2. METR (2026): We Are Changing Our Developer Productivity Experiment Design — Follow-up that softens the slowdown estimate by acknowledging selection effects in the rerun.
  3. Faros AI (2025): The AI Productivity Paradox Report — Telemetry from 10,000+ developers across 1,255 enterprise teams; introduces the +98% PRs / +91% review-time framing and the Amdahl's-Law reading.
  4. DORA (2024): Accelerate State of DevOps Report — Original throughput-negative finding (−1.5% throughput, −7.2% stability per 25% AI adoption).
  5. DORA (2025): State of AI-Assisted Software Development — Frames AI as a multiplier of existing engineering conditions; the prelude to the 2026 ROI report.
  6. Xu et al., Tilburg (2025): AI-Assisted Programming Decreases the Productivity of Experienced Developers — The 19% original-code productivity drop and 6.5% review-load increase on core developers.
  7. arXiv 2504.13903: From Teacher to Colleague — Frames the perceptual difference between experienced and inexperienced developers when working with AI.
  8. MIT Sloan: The Hidden Costs of Coding With Generative AI — Names greenfield vs brownfield as the dominant risk axis and the redistribution of cleanup burden onto seniors.
  9. Addy Osmani: The Reality of AI-Assisted Software Engineering Productivity — Useful synthesis of the Faros and METR results into a single framing readable by non-researchers.
  10. Andrew Storms / Kilo: Cowboy Coder Is Back. This Time, They Scale — Coins "bus factor zero" and maps the cowboy antipattern to agent-driven work.
  11. Kilo: Inside Kilo Speed — Counterpart profile showing what good AI-native engineering practice looks like in the same publication.
  12. Ström Capital: Why AI Works Better on Existing Codebases — Refines the greenfield/brownfield axis into codebase quality and pattern coherence.

2026 — the year the numbers moved

  1. METR (May 2026): Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity — 349-participant survey: median value uplift 1.4–2x, speed uplift 3x; retrospective 1.3x (March 2025) → 2x (March 2026) → forecast 2.5x (March 2027); METR staff measuring 2x–10x on their own work.
  2. DORA (April 2026): ROI of AI-Assisted Software Development — Introduces the J-Curve of value realization, the "tuition cost of transformation" frame, and an illustrative ~39% ROI / 8-month payback model. Names platform foundations, workflow clarity, and team alignment as the dominant variables.
  3. InfoQ (May 2026): New DORA Report Claims Strong Engineering Foundations Drive AI Return on Investment — Concise summary of the DORA 2026 ROI report's J-Curve, the stability tax (5% → 6% change failure rate), and the high-maturity-no-insulation finding.
  4. Anthropic (2026): Agentic Coding Trends Report — 60% of engineering work AI-assisted, 0–20% fully delegated; 22% of production deployments coordinating three or more agents; MCP at 9,400+ public servers; the Claude Code seven-hour vLLM run.
  5. JetBrains (January 2026): Which AI Coding Tools Do Developers Actually Use at Work? — 90% of developers using AI at work, 74% using specialized AI dev tools, 64% engaging with agentic tools; tool-share data on Copilot, Cursor, Claude Code, Antigravity.
  6. Dubach (2026): AI Coding Productivity Paradox — 93% Adoption, 10% Gains — Mid-2026 synthesis showing that the original "tool ≠ workflow" gap is alive and well at the broad enterprise mid-tier.
  7. Faros AI (2026): What METR's Study Missed About AI Productivity in the Wild — Pushes back on METR's lab finding using enterprise telemetry; useful counterpoint for understanding what the May 2026 numbers do and don't capture.
  8. Oobeya (2026): DORA Metrics Are Not Enough in 2026 — Argues the classic DORA four are blind to agent authorship, review depth, and code-comprehension drift; one entry point into the "visibility crisis" framing.
  9. Augment Code (2026): Spec-Driven Development for Brownfield Codebases — Representative of the 2026 practice consensus on running agents against legacy code: research, plan, implement, in separate phases.