Three Years On: What GenAI Changed and Didn't About Effective Developer Teams

Transformation of a Developer Team

Introduction

A sequel to "Strategy for Effective and Efficient Developer Teams" (December 2022)

Three years ago this month, I published an article arguing that test-driven development, CI/CD investment, and a culture of rigor and transparency were strategic priorities for effective software teams. Since then, we’ve witnessed the most significant shift in development tooling since version control systems went mainstream: the widespread adoption of generative AI. The 2025 DORA State of AI-Assisted Software Development report confirms that 90% of developers now use AI tools at work, a 14% increase from last year alone.

This sequel examines what has changed, what has endured, and what remains uncertain. The central finding from DORA’s multi-year research resonates throughout: AI doesn’t fix a team; it amplifies what’s already there. Strong foundations become stronger. Dysfunction accelerates.

What Endures: Foundations That Still Matter

The Core Metrics Remain Relevant

The four DORA metrics I cited in 2022 (deployment frequency, lead time for changes, time to restore service, and change failure rate) remain foundational. The 2024 and 2025 DORA reports reorganized these into throughput metrics (deployment frequency, lead time) and stability metrics (change failure rate, failed deployment recovery time), but the underlying insight holds: teams that deploy frequently, recover quickly, and maintain low failure rates outperform those that don’t.

If anything, these metrics matter more now. AI-assisted development dramatically increases the volume and velocity of code changes. Without the infrastructure to deploy safely and frequently, organizations simply ship bad changes faster, in larger and riskier batches.

Test-Driven Development Becomes More Critical, Not Less

When I wrote about TDD in 2022, I argued that writing tests first catches defects early and creates living documentation — and that holds up. The AI era has made this practice even more valuable. Google’s companion guide to the 2025 DORA report emphasizes that the core principles of TDD are more critical than ever, and that AI’s benefits are amplified when combined with these practices.

The reasoning is straightforward. AI generates code rapidly, but that code requires verification. Tests written first serve as specification and guardrails for AI-generated implementations. When you provide tests alongside requirements in a prompt, you give the model concrete success criteria. Empirical studies on AI-assisted coding show that providing LLMs with explicit test cases tends to increase functional correctness of generated code, especially when the model can iterate against failing tests.
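To make this concrete, here is a minimal sketch of tests acting as the specification. The `slugify` function and its behavior are invented for illustration; the test file is written before any implementation exists, supplied with the prompt, and then run unchanged in CI against whatever code comes back.

```python
# test_slugify.py -- written before any implementation exists (red-first TDD).
# Pasted into the prompt alongside the requirements, these cases give the model
# concrete success criteria; the same file then runs unchanged in CI against
# whatever implementation comes back.
import pytest


def slugify(text: str) -> str:
    # Placeholder: the AI-generated implementation replaces this stub.
    raise NotImplementedError


def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"


def test_collapses_whitespace_and_strips_punctuation():
    assert slugify("  Rock & Roll!  ") == "rock-roll"


def test_rejects_empty_input():
    with pytest.raises(ValueError):
        slugify("")
```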

The discipline TDD requires, decomposing problems into small, testable increments, maps naturally to effective AI-assisted workflows. Small batches remain a critical capability. The DORA AI Capabilities Model identifies “working in small batches” as one of seven foundational capabilities that amplify AI’s benefits, helping teams manage the higher volume and potential instability that AI can introduce.

CI/CD Investment Pays Compound Returns

My 2022 argument for early CI/CD investment has proven prescient. The 2025 DORA AI Capabilities Model identifies “strong version control practices” as one of seven foundational capabilities that amplify AI’s positive impact. With AI accelerating code generation, mature version control and automated deployment become even more critical for managing increased volume and velocity.

The phased deployment practices I described, namely local testing, CI environments, beta phases, preproduction, limited production scope, and progressive rollout, remain essential. The Faros AI analysis of the 2025 DORA findings noted that teams with strong platform foundations see AI productivity gains translate to organizational improvements. Teams with weak infrastructure see individual gains absorbed by bottlenecks downstream.
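As an illustration of the progressive-rollout end of that pipeline, here is a minimal sketch; the step sizes, error budget, and the `fetch_error_rate`/`set_traffic_percent` hooks are hypothetical stand-ins for whatever your platform actually provides.

```python
# Minimal sketch of a progressive-rollout gate: advance the canary only while
# its observed error rate stays within budget, otherwise roll back and stop.
# fetch_error_rate and set_traffic_percent stand in for whatever your platform
# actually provides; the step sizes and thresholds are illustrative.
import time

ROLLOUT_STEPS = [1, 5, 25, 50, 100]   # percent of traffic per phase
ERROR_BUDGET = 0.01                   # abort if more than 1% of canary requests fail
SOAK_SECONDS = 600                    # observation window per step


def progressive_rollout(fetch_error_rate, set_traffic_percent) -> bool:
    for percent in ROLLOUT_STEPS:
        set_traffic_percent(percent)
        time.sleep(SOAK_SECONDS)          # let the new version soak under real traffic
        if fetch_error_rate() > ERROR_BUDGET:
            set_traffic_percent(0)        # roll back and stop the release
            return False
    return True                           # fully rolled out
```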

Culture Still Trumps Tools

The section of my original article on rigor and transparency (psychological safety, blameless postmortems, knowledge sharing) remains entirely applicable. The 2025 JetBrains Developer Ecosystem Survey found that 89% of developers say non-technical factors (job design, clear communication, peer and manager support, and actionable feedback) influence their productivity, compared with 84% who cite technical factors such as tool performance.

The DORA research reinforces this. User-centric focus is identified as a prerequisite for AI success. Teams without clear direction see negative impacts from AI adoption, while those with a strong user focus see amplified benefits. AI becomes most useful when pointed at a clear problem, and the organization (and its culture) must provide that direction.

What’s Different: The AI Amplifier Effect

The foundations discussed above remain important, but GenAI has also shifted daily development work dramatically. This is not just faster coding; it is a systemic change in where the bottlenecks appear and what the primary risks are.

This section explores four areas where the calculus for effective developer teams has changed: bottlenecks, metrics, infrastructure, and methodologies.

Review Becomes the Bottleneck

Perhaps the most significant shift is where constraints now appear in the development workflow. The Greptile State of AI Coding 2025 report captures the change: engineering velocity is increasingly constrained by review and verification, not code generation. Median lines of code per developer grew from 4,450 to 7,839 as AI tools acted as force multipliers. But this code still needs review.

Research suggests that human reviewers lose effectiveness as PR size grows. SmartBear’s large-scale Cisco study found reviewers are most effective examining 200–400 lines of code at a time, with defect detection dropping significantly beyond that threshold. When AI generates larger PRs more frequently, the review burden grows faster than human capacity. The DevTools Academy analysis argues AI reviewers are becoming essential to maintain pace. The Qodo research found that teams using AI in code review see quality improvements up to 81%, compared to 55% for teams gaining equal speed without AI review.

This suggests a shift in the testing pyramid and review strategy I described in 2022. AI-assisted review at scale, with humans focusing on high-risk changes and architectural coherence, may become standard practice. One useful pattern is a producer/critic pairing: one agent drafts, another tries to break it (tests, edge cases, security, style), and the human reviewer arbitrates. Use AI tools, plus careful review, to keep specs and docs sharp so they remain reliable context for future agent runs.
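A minimal sketch of that producer/critic pairing, assuming only a generic `llm(prompt) -> text` completion function rather than any particular agent framework:

```python
# Sketch of the producer/critic pattern: one call drafts a change, a second call
# attacks it, and the human reviewer arbitrates. `llm` is a hypothetical
# completion function (prompt -> text); swap in whichever client your team uses.
from typing import Callable


def produce_and_critique(task: str, llm: Callable[[str], str], rounds: int = 2) -> dict:
    draft = llm(f"Implement the following change, with tests:\n{task}")
    critiques = []
    for _ in range(rounds):
        critique = llm(
            "Act as a hostile reviewer. Find missing edge cases, security issues, "
            f"and style problems in this patch:\n{draft}"
        )
        critiques.append(critique)
        draft = llm(f"Revise the patch to address this review:\n{critique}\n\nPatch:\n{draft}")
    # The human sees the final draft plus the critic's objections, and can focus
    # on risk and architectural coherence instead of line-by-line reading.
    return {"draft": draft, "critiques": critiques}
```

The point is the shape of the loop, not the prompts: the critic’s objections travel with the draft, so the human reviewer arbitrates rather than re-derives them.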

New Metrics: Rework Rate and Instability

Starting with the 2024 Accelerate State of DevOps research, DORA expanded the classic four key metrics with a fifth metric: rework rate, defined as the proportion of unplanned deployments made to fix production issues. By 2025, DORA provides benchmarks for all five metrics and treats rework rate as a core indicator of stability. This acknowledges a reality of AI-driven development: speed without quality safeguards creates downstream corrective work. The Scrum.org analysis of the DORA findings is blunt: “AI increases throughput. It also increases instability.”
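Measured in its simplest form, rework rate is just a ratio over deployment records; the field names below are illustrative.

```python
# Rework rate as defined above: the share of deployments that were unplanned
# fixes for production issues. The record format is illustrative.
def rework_rate(deployments: list[dict]) -> float:
    if not deployments:
        return 0.0
    unplanned_fixes = sum(1 for d in deployments if d.get("unplanned_fix", False))
    return unplanned_fixes / len(deployments)


deploys = [
    {"id": 101, "unplanned_fix": False},
    {"id": 102, "unplanned_fix": True},   # hotfix for a production incident
    {"id": 103, "unplanned_fix": False},
    {"id": 104, "unplanned_fix": True},   # another unplanned corrective deploy
]
print(f"rework rate: {rework_rate(deploys):.0%}")  # -> rework rate: 50%
```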

Multiple analyses suggest AI-generated code has higher defect rates than human-written code. CodeRabbit’s analysis found AI code introduces approximately 1.7× more issues, including a higher likelihood of high-severity defects. GitClear’s data shows code churn increasing with AI adoption. Effective teams now explicitly budget for AI-induced defect risk, expanding test suites and targeting review on AI-heavy areas.

Platform Engineering as Distribution Layer

The 2025 DORA research found that 90% of organizations have adopted at least one internal platform. More significantly, high-quality internal platforms correlate directly with an organization’s ability to unlock AI value. Platforms serve as the distribution layer that scales individual AI productivity gains into organizational improvements.

This represents a shift in how we think about infrastructure investment. In 2022, I argued for CI/CD automation primarily in terms of deployment frequency and change failure rate. The platform engineering movement adds another dimension: platforms provide guardrails, shared capabilities, and the observability needed to safely absorb AI-accelerated development.

The Rise of Spec-driven Development

One of the most significant methodological developments since my prior article is the emergence of spec-driven development (SDD), a structured counterpoint to the looser practice that has come to be called vibe coding. As Thoughtworks noted in their December 2025 analysis, SDD “may not have the visibility of a term like vibe coding, but it’s nevertheless one of the most important practices to emerge in 2025.”

Vibe coding, per Wikipedia, is an “approach to creating software where the developer describes a project or task to a large language model (LLM), which generates code.” In practice it typically means coding without editing, or often even reading, the code directly: an agent writes it and frequently handles testing and review as well.

These terms are still new, and some people use “vibe coding” to describe any coding with agents, even work driven by highly structured prompts and specifications.

The core idea is a return to first principles: before generating code, write a detailed specification. GitHub released Spec Kit in September 2025; AWS launched Kiro with spec-first workflows; JetBrains integrated spec-driven approaches into Junie. The pattern follows a consistent structure: Specify → Plan → Tasks → Implement, with explicit validation checkpoints between phases.
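The tools differ in their details, but the phase-gated shape is easy to picture. The sketch below is not how Spec Kit, Kiro, or Junie are implemented; it only illustrates the Specify → Plan → Tasks → Implement flow with a human validation checkpoint between phases (the file layout and `approve` helper are hypothetical).

```python
# Illustration of the phase-gated flow only -- not how Spec Kit, Kiro, or Junie
# are implemented. The file layout and approve() helper are hypothetical.
from pathlib import Path

PHASES = ["specify", "plan", "tasks", "implement"]


def approve(phase: str, artifact: Path) -> bool:
    # Validation checkpoint: a human reviews the artifact before the next phase.
    answer = input(f"Approve the {phase} output in {artifact}? [y/N] ")
    return answer.strip().lower() == "y"


def run_workflow(run_phase, workdir: Path = Path("specs/feature-x")) -> bool:
    workdir.mkdir(parents=True, exist_ok=True)
    for phase in PHASES:
        artifact = workdir / f"{phase}.md"
        artifact.write_text(run_phase(phase, workdir))  # agent output for this phase
        if not approve(phase, artifact):
            return False  # stop; refine this artifact before moving on
    return True
```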

Why Specs Now?

The shift from “vibe coding” to spec-driven development reflects hard-won lessons about AI’s limitations. As the Red Hat analysis puts it: “AI coding assistants are like those talented musicians, helping us build solutions quickly. But relying solely on impromptu interactions (‘vibe coding’) can lead to brilliant bursts of creativity mixed with brittle code that might crumble under pressure.”

Specifications serve three critical functions in AI-assisted development. First, they provide a way for developers and stakeholders to understand and agree on goals before code exists. Second, they give AI agents a North Star to guide larger tasks without getting lost. Third, they transform prompt engineering from an ad-hoc exercise into a version-controlled, human-readable “super prompt” that can be refined over time.

The connection to my 2022 recommendations is direct. I argued for test-driven development because writing tests first forces clarity about interfaces and behavior. Spec-driven development extends this principle: writing specifications first forces clarity about requirements, constraints, and success criteria. Both practices share the insight that articulating intent before implementation improves outcomes.

Open Questions About Spec-as-Source

The SDD community hasn’t reached consensus on a fundamental question: is the specification or the code the ultimate artifact? The Thoughtworks analysis identifies this tension explicitly. At one end, specifications are merely prompts that drive initial code generation and executable code remains the source of truth. At the other, specifications become the primary artifact, with code treated as a byproduct that can be regenerated.

Birgitta Böckeler’s exploration of SDD tools raises practical concerns. Despite larger context windows being cited as enablers, “just because the windows are larger, doesn’t mean that AI will properly pick up on everything that’s in there.” Her experiments found agents ignoring instructions, generating duplicates of existing classes, or following directives so eagerly they went overboard.

More fundamentally, if specs become the primary artifact, experienced programmers may find that over-formalized specifications slow down change and feedback cycles, echoing problems from waterfall development. The balance between structure and agility remains a relevant question.

The Bitter Lesson and Its Implications for Practice

Any serious discussion of AI development practices in 2025 must face Richard Sutton’s “The Bitter Lesson.” Sutton’s 2019 essay argues that “general methods that leverage computation are ultimately the most effective, and by a large margin.” Time after time, across chess, Go, speech recognition, computer vision and other applications, approaches that prioritized scale and learning outperformed systems built on hand-crafted human knowledge.

The lesson is “bitter” because it’s humbling. Researchers invest years encoding domain expertise into systems, only to see those systems surpassed by approaches that throw more compute at general learning algorithms. As Sutton writes: “Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.”

Implications for How We Build with AI

The bitter lesson has direct implications for how developers should work with AI tools, especially agentic ones. A thoughtful October 2025 analysis from Nilenso observes that many AI-maximalist programmers have setups “full of text files that describe ‘rules’, ‘modes’, ‘roles’, prompts, or subagents… with lots of ‘pleading’ language (or threats), capitalisation and even step-by-step logic telling an LLM how it should think and act.”

This is precisely what the bitter lesson warns against: baking human assumptions about workflow into systems that are increasingly capable of learning from feedback. The engineer who has digested the bitter lesson “will instead set up an environment that can provide feedback loops to the agent. This setup is simpler and better accommodates frontier reasoning models that are scaled with reinforcement learning by getting out of their way.”

The evolution of coding agents illustrates this. First-generation tools like early Cursor and Copilot relied on chunk-and-embed paradigms, prefilling retrieved code chunks into context windows. Newer tools like Claude Code, Windsurf, and OpenHands favor agentic search: tell the AI how to invoke a search and let it figure things out. This simpler architecture embodies the bitter lesson by not baking in assumptions about what information the model needs.
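The difference is easy to see in code. Rather than pre-selecting “relevant” chunks for the model, an agentic setup exposes a search tool and lets the model decide what to look for. The tool schema and `search_repo` helper below are generic illustrations, not tied to any particular agent framework.

```python
# Expose a search tool to the agent instead of prefilling retrieved chunks.
# The tool schema and helper are generic illustrations.
from pathlib import Path

SEARCH_TOOL_SPEC = {
    "name": "search_repo",
    "description": "Search the repository for a literal string and return matching lines.",
    "parameters": {"query": "string to look for", "glob": "file pattern, e.g. *.py"},
}


def search_repo(query: str, glob: str = "*.py", root: str = ".") -> list[str]:
    hits = []
    for path in Path(root).rglob(glob):
        try:
            for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
                if query in line:
                    hits.append(f"{path}:{lineno}: {line.strip()}")
        except OSError:
            continue
    return hits[:50]  # cap output so results fit comfortably in the context window
```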

A Tension with Best Practices

Here lies a genuine tension. My 2022 article and much of this sequel advocates for structured practices: TDD, CI/CD pipelines, phased deployment, spec-driven development. These are, in Sutton’s terms, “human knowledge-based methods.” They encode expertise about how software development should work.

Should we expect the bitter lesson to eventually make these practices obsolete? Will future AI systems simply learn what good software looks like from outcomes, rendering our carefully constructed processes unnecessary?

The answer, I believe, is nuanced. The bitter lesson applies most clearly where objectives are easily quantified: victory in chess, matching speech to text, classifying images. Software development objectives are messier. “Good code” depends on context, constraints, organizational values, and future maintainability that’s hard to specify operationally.

More practically, the Nilenso analysis offers guidance: “When the model isn’t good at your task yet, but may get there eventually under the current scaling regime—design an artisanal architecture to build what is needed today, but do so with the knowledge that some day you may have to throw this functionality away.” Best practices remain valuable, but we should hold them loosely, prepared to simplify as models improve.

What Won’t Scale with Compute

Not everything succumbs to the bitter lesson. Current training methods still struggle with reliable context utilization, so practices around context management may persist. Reliable execution, retries, and interface design aren’t solved by bigger models. And fundamentally, defining what “success” means for a software system remains a human responsibility.

This suggests where human expertise remains essential: setting objectives, defining quality, designing feedback loops, and providing the organizational context that models can’t learn from data. The DORA finding that user-centric focus is a prerequisite for AI success aligns with this. Models can learn to code; they can’t learn what’s worth building.

What Remains Unclear: Evolving Questions

The Productivity Paradox: Completion Time vs. Reported Productivity

The most provocative finding of 2025 comes from METR’s randomized controlled trial of experienced open-source developers. Despite 90% AI adoption and widespread reports of productivity gains, the study found that allowing AI use actually increased task completion time by 19%, even though the developers estimated it had made them 20% faster.

This contradicts vendor studies showing 20-55% speedups and conflicts with self-reported productivity gains. The MIT Technology Review notes that while Stack Overflow’s 2025 survey shows 65% of developers using AI tools weekly, trust and positive sentiment toward these tools fell significantly for the first time.

What explains this? METR’s researchers suggest experienced developers working on familiar codebases may spend additional time understanding, validating, and integrating AI suggestions. The Faros AI analysis proposes an “AI Productivity Paradox”: individual output metrics rise (21% more tasks, 98% more PRs), but organizational delivery metrics stay flat. The Jevons Paradox applies to code as well: when writing becomes cheap, we write more, but reading remains expensive.

Optimal Human-AI Collaboration Patterns

Where AI truly accelerates work, and where it slows things down when misapplied, remains task- and context-dependent. Evidence suggests clear benefits for boilerplate generation, test scaffolding, unfamiliar APIs, and documentation. Benefits are less clear for complex business logic, security-sensitive code, and deeply familiar codebases.

The research suggests effective teams design explicit workflows rather than letting each developer independently figure out how to use AI. Standard prompts, context management, and review patterns reduce friction. But best practices are still emerging, and what works varies significantly by domain and team.

Long-Term Potential for Developer Skills to Atrophy

The MIT Technology Review quotes an engineer who, after using AI tools heavily at work, found himself struggling with tasks that previously came naturally when working without them. “Things that used to be instinct became manual, sometimes even cumbersome.” Just as athletes still run basic drills, developers may need to practice regularly without AI assistance to keep those instincts sharp.

A Stanford study found employment among software developers aged 22-25 fell nearly 20% between 2022 and 2025. Whether this reflects AI displacement, economic factors, or other causes remains unclear. But the question of what skills developers need to cultivate, and what skills can safely atrophy, has no settled answer.

Practical Implications

Based on the research since 2022, I offer these updated recommendations:

Strengthen foundations before scaling AI. AI amplifies what exists. Organizations with mature CI/CD, strong version control practices, and clear priorities see AI gains translate to organizational improvements. Those without see individual productivity absorbed by infrastructure friction.

Treat AI adoption as organizational transformation, not tool deployment. The value of AI is unlocked by reimagining the system of work it inhabits. This requires investment in platforms, policies, and practices, not just tool licenses.

Invest in review infrastructure proportional to AI-assisted code generation. As code generation accelerates, review becomes the constraint. AI-assisted review, expanded automated testing, and explicit quality gates prevent speed from becoming accelerated chaos.

Use TDD and specs as guardrails, not straitjackets. Tests and specifications provide verification and direction for AI-generated code. But use these practices with an expectation they will need to change as models improve. Simpler approaches that leverage feedback loops may prove more effective than elaborate human-designed workflows.

Track rework rate alongside traditional metrics. Speed without stability shifts bottlenecks downstream. Measuring unplanned deployments due to incidents reveals the true cost of AI-accelerated development.

Design for feedback loops, not prescriptive workflows. The bitter lesson suggests that setting up environments where AI can learn from outcomes will outperform elaborate prompt engineering. Invest in defining success operationally and providing clear feedback signals.

Prepare to throw away artisanal solutions. Today’s necessary workarounds may become tomorrow’s technical debt when the next model generation arrives. Build with the knowledge that simplification may be required, and, as always, design your systems to be evolvable.

Conclusion

The 2025 DORA report concludes that AI’s transformative potential in software development remains largely unrealized. While individual productivity gains are real and widespread, translating these into organizational advantages requires intentional system-level changes. Organizations that treat AI adoption as a transformation opportunity, investing in the capabilities that amplify benefits while addressing systemic issues, will separate themselves from those that simply deploy tools and hope.

My 2022 thesis mostly holds with an important amendment: investing early in test-driven development, CI/CD infrastructure, and team culture pays dividends, but we must be prepared to evolve these human-knowledge-based practices as the Bitter Lesson’s inexorable march simplifies the tools.

The need to define what success looks like, set direction, and make judgment calls about what’s worth building won’t change, at least not soon. AI amplifies execution; humans provide intent. The question isn’t whether to adopt AI; it’s how to become an organization that can use AI for meaningful outcomes.

The most fundamental skill for teams and organizations to have is and always has been the ability to adapt rapidly to change and harness it into useful outcomes.

References

Amazon Web Services. (2025). Kiro: Agentic AI development from prototype to production. https://kiro.dev/

Becker, J., Maas, N., Gupta, U., & Huben, R. (2025, July). Measuring the impact of early-2025 AI on experienced open-source developer productivity. METR. https://arxiv.org/abs/2507.09089

Böckeler, B. (2025). Understanding spec-driven development: Kiro, spec-kit, and Tessl. Martin Fowler. https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html

Brynjolfsson, E., Chandar, B., & Chen, R. (2025, August). Canaries in the coal mine? Six facts about the recent employment effects of artificial intelligence. Stanford Digital Economy Lab. https://digitaleconomy.stanford.edu/wp-content/uploads/2025/08/Canaries_BrynjolfssonChandarChen.pdf

CD Foundation. (2025, October 16). The DORA 4 key metrics become 5. https://cd.foundation/blog/2025/10/16/dora-5-metrics/

CodeRabbit. (2025). State of AI vs human code generation report. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report

DORA. (2025). AI capabilities model. https://dora.dev/ai/

DORA. (2025). Platform engineering. https://dora.dev/capabilities/platform-engineering/

DORA. (2025). User-centric focus. https://dora.dev/capabilities/user-centric-focus/

DORA & Google Cloud. (2025). 2025 state of AI-assisted software development report. https://dora.dev/dora-report-2025

DevTools Academy. (2025). State of AI code review tools in 2025. https://www.devtoolsacademy.com/blog/state-of-ai-code-review-tools-2025/

Faros AI. (2025, July). AI productivity paradox research report 2025. https://www.faros.ai/blog/ai-software-engineering

Faros AI. (2025). Key takeaways from the DORA report 2025. https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025

GitClear. (2024). Coding on Copilot: 2023 data suggests downward pressure on code quality. https://www.gitclear.com/coding_on_copilot_data_shows_ais_downward_pressure_on_code_quality

GitHub. (2025, September). Spec-driven development with AI: Get started with a new open source toolkit. GitHub Blog. https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/

Google Cloud. (2025). Introducing DORA’s inaugural AI capabilities model. Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/introducing-doras-inaugural-ai-capabilities-model

Google Cloud. (2025). TDD and AI: Quality in the DORA report. https://cloud.google.com/discover/how-test-driven-development-amplifies-ai-success

Greptile. (2025). State of AI coding 2025. https://www.greptile.com/state-of-ai-coding-2025

IT Revolution. (2025). AI’s mirror effect: How the 2025 DORA report reveals your organization’s true capabilities. https://itrevolution.com/articles/ais-mirror-effect-how-the-2025-dora-report-reveals-your-organizations-true-capabilities/

JetBrains. (2025, October). The state of developer ecosystem 2025. https://devecosystem-2025.jetbrains.com/productivity

JetBrains. (2025). Junie, the AI coding agent by JetBrains. https://www.jetbrains.com/junie/

MIT Technology Review. (2025, December 15). AI coding is now everywhere. But not everyone is convinced. https://www.technologyreview.com/2025/12/15/1128352/rise-of-ai-coding-developers-2026/

Qodo. (2025). State of AI code quality in 2025. https://www.qodo.ai/reports/state-of-ai-code-quality/

Raykar, A. (2025, October 14). Artisanal shims for the bitter lesson age. Nilenso Blog. https://blog.nilenso.com/blog/2025/10/14/bitter-lesson-applied-ai/

Red Hat. (2025, October 22). How spec-driven development improves AI coding quality. Red Hat Developer. https://developers.redhat.com/articles/2025/10/22/how-spec-driven-development-improves-ai-coding-quality

Scrum.org. (2025). DORA report 2025 summary (State of AI-assisted software development). https://www.scrum.org/resources/blog/dora-report-2025-summary-state-ai-assisted-software-development

Sequoia Capital. (2025, September). Richard Sutton’s second bitter lesson. Inference. https://inferencebysequoia.substack.com/p/richard-suttons-second-bitter-lesson

SmartBear Software. (2018). Best practices for peer code review. https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/

Sutton, R. (2019, March 13). The bitter lesson. Incomplete Ideas. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Thoughtworks. (2025, December). Spec-driven development: Unpacking one of 2025’s key new AI-assisted engineering practices. https://www.thoughtworks.com/en-us/insights/blog/agile-engineering-practices/spec-driven-development-unpacking-2025-new-engineering-practices

Wikipedia. (2025). Jevons paradox. https://en.wikipedia.org/wiki/Jevons_paradox