Code review has always been the quality gate that software teams rely on most. It is where context gets shared, mistakes get caught, and standards get enforced. It is also, in 2026, the part of most engineering workflows that is under the most strain - and the part where most organisations are doing the least to adapt.

The problem is not subtle. AI coding tools have materially increased the volume of code that developers produce. They have not increased the number of reviewers, the hours available for review, or - critically - the quality of what arrives in the review queue. In many teams, the opposite has happened.

A 2025 analysis by CodeRabbit examining 470 open-source pull requests found that AI-co-authored PRs contained approximately 1.7 times more issues than human-only PRs. Logic errors were 75% more common. Security issues were up to 2.74 times higher. Performance problems - particularly excessive I/O - were eight times more prevalent. Readability issues spiked more than threefold.

These are not small margins. And most teams are reviewing this code with processes designed for a different era.


Why AI code is harder to review - not easier

The naive assumption is that AI-generated code should be easier to review. It is syntactically clean. It is often well-structured. It follows naming conventions. It comes with comments. All of this is true.

The problem is that code review was never primarily about syntax. It was about understanding what the code is doing and whether it is the right thing to do in this specific context. And AI-generated code is harder to review on that dimension - not easier - for a specific reason: you do not know what the model was asked, what context it had, or how the engineer arrived at what they submitted.

When a human engineer writes code, reviewers have a working model of how it came to exist. They know who wrote it, what they were trying to do, how experienced they are, what kinds of mistakes they typically make. That model helps reviewers allocate attention. They know where to look carefully.

With AI-assisted code, that model breaks down. The code may have been generated from a vague prompt, refined through several iterations, and lightly edited by an engineer who understood some but not all of what the AI produced. The reviewer has no way to know which parts were carefully validated and which were accepted because they looked right. They have to approach more of the code with more scrutiny - which takes more time and imposes more cognitive load.

DX’s 2025 research captures this in practice: meetings, interruptions, and review delays already cost developers more time than AI saves. When review quality degrades - because volume has outpaced capacity - the net effect is not productivity gain but a different kind of debt accumulation.


The specific failure modes to watch for

The CodeRabbit data is specific enough to be actionable. These are not generic quality concerns - they are the patterns that appear most consistently in AI-generated code, and they are the ones your reviewers need to be calibrated to find.

Logic and correctness errors

The most expensive category. AI tools produce code that is locally correct - it does what a reasonable reading of the prompt would suggest - but globally wrong. It does not account for the edge cases your system actually encounters. It makes assumptions about data shapes, API behaviour, or state that are not documented in the code and are not valid in your context.

This is what I call the competent surface problem: the code passes a quick read and even a test suite, but it fails the context test. An engineer without deep knowledge of the system will not necessarily see the problem. A reviewer under time pressure will not necessarily see it either.

The mitigation is not a faster scan - it is a different question. Instead of “does this code do what it says it does?” the more important question is “does this code do what we need it to do, given what we know about how this system actually behaves?” That is a harder question, and it requires the reviewer to bring context the AI did not have.

Security patterns that look right in isolation

AI tools generate authentication code reliably. They generate authorisation code far less reliably. The most consistent pattern I see - and the research confirms this - is code that correctly verifies a user’s identity but fails to verify whether that identity is entitled to the specific resource or action being requested.

In a REST API, this often means endpoints with a valid token check but no check that the token’s owner is permitted to access the requested data. The code looks correct. The test suite passes. The vulnerability only appears when a user manipulates a resource ID in the URL and retrieves data that is not theirs.
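The gap is easier to see in code than in prose. Here is a minimal sketch of the pattern - the data store and function names are invented for illustration, not drawn from any real codebase:

```python
# Illustrative sketch of the authorisation gap described above.
# Assume authentication (the token check) already happened upstream;
# the question is whether ownership of the resource is ever verified.

INVOICES = {
    101: {"owner_id": 1, "amount": 250},
    102: {"owner_id": 2, "amount": 990},
}

def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> dict:
    """Looks correct and passes happy-path tests, but never checks
    ownership: any authenticated user can read any invoice by
    manipulating the ID (an insecure direct object reference)."""
    return INVOICES[invoice_id]

def get_invoice_checked(current_user_id: int, invoice_id: int) -> dict:
    """Same lookup, plus the authorisation check the generated
    code typically omits."""
    invoice = INVOICES[invoice_id]
    if invoice["owner_id"] != current_user_id:
        raise PermissionError("user does not own this invoice")
    return invoice
```

The vulnerable version is the one a test suite written against the happy path will happily pass.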

Input validation failures follow a similar pattern. AI tools validate the inputs that were mentioned in the prompt. They do not reason systematically about all the ways an adversary might subvert the input. The happy path is handled. The adversarial path is often not.
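A small, hypothetical example of the same asymmetry: a validator that covers the format the prompt mentioned versus one that also closes off inputs the prompt never spelled out. The regex and limits here are illustrative, not a recommended email specification:

```python
import re

# A deliberately simple pattern of the kind AI tools often emit.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_email_naive(value: str) -> bool:
    # Checks only the format the prompt asked for. Note that Python's
    # `$` anchor also matches just before a trailing newline, so
    # "a@b.co\n" slips through.
    return bool(EMAIL_RE.match(value))

def validate_email_defensive(value: str) -> bool:
    # Also bounds length and rejects embedded control characters -
    # cheap checks that cut off common adversarial paths.
    if len(value) > 254:
        return False
    if any(ord(c) < 32 for c in value):
        return False
    return bool(EMAIL_RE.match(value))
```

The naive version accepts `"a@b.co\n"`; the defensive one rejects it. Neither difference shows up in a test suite that only exercises well-formed input.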

Performance at scale

The 8× increase in excessive I/O operations in AI-generated code is the finding that surprises people most. It is a consequence of how AI tools reason about correctness without reasoning about efficiency. An N+1 query is correct - it returns the right data. It is also something that only reveals itself as a problem when the table grows beyond a threshold the developer was not testing against.
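The N+1 pattern above can be sketched in a few lines. This is a self-contained illustration using an in-memory SQLite database; the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
""")

def customer_names_n_plus_one() -> list:
    # Correct output, but one query per order row: N+1 round trips.
    # Invisible at 3 rows; catastrophic at 3 million.
    rows = conn.execute("SELECT customer_id FROM orders").fetchall()
    return [
        conn.execute(
            "SELECT name FROM customers WHERE id = ?", (cid,)
        ).fetchone()[0]
        for (cid,) in rows
    ]

def customer_names_batched() -> list:
    # Same result in exactly two queries: fetch the order rows,
    # then one IN (...) lookup for all distinct customer IDs.
    cids = [cid for (cid,) in conn.execute("SELECT customer_id FROM orders")]
    placeholders = ",".join("?" * len(set(cids)))
    lookup = dict(conn.execute(
        f"SELECT id, name FROM customers WHERE id IN ({placeholders})",
        sorted(set(cids)),
    ))
    return [lookup[cid] for cid in cids]
```

Both functions return identical data, which is exactly why the first one survives review when nobody is looking for it.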

Code review is currently the best defence against this. But it requires reviewers who are specifically looking for it, and who have the domain knowledge to know when a query pattern that looks fine at small scale will be catastrophic at production scale.

The AI Code Review Checklist is in the book

The full checklist for reviewing AI-generated code, plus the complete 90-Day Plan and 15-question AI Readiness Assessment, are in The AI-Ready Engineering Team.

Get the Book on Amazon

The alternative view: maybe checklists are not the answer

There is a conventional response to all of this that I want to push back on. The standard advice - and I have given versions of it myself - is to extend your code review checklist. Add items for authorisation. Add items for input validation. Add items for performance patterns. Train reviewers to apply the checklist more carefully to AI-generated code.

This is not wrong. But it is insufficient, and in some cases it makes the underlying problem worse.

Checklists scale linearly with the number of reviewers you have. AI-assisted code volume scales with the number of developers using AI tools multiplied by how much more each of them can now produce - a multiplier, not a constant. If your review capacity is already constrained, adding checklist items to each review does not solve the capacity problem - it spreads the same amount of reviewer attention more thinly across more items.

The more fundamental question is whether the review process itself needs to change, not just what reviewers look for. There are a few directions worth considering.

Shift quality gates earlier in the process. The most effective teams I have observed are investing in the conditions that make AI output better before it reaches review - not just catching problems at review. This means shared prompt libraries that encode your conventions, CLAUDE.md files and equivalent context that tell AI tools how your system works, and standards documents that developers use when prompting rather than after generating. Code that arrives at review already aligned with your patterns requires less corrective scrutiny.
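As an illustration of the kind of context file mentioned above, a CLAUDE.md fragment might encode review-relevant conventions directly. The specific rules here are hypothetical - yours would reflect your own system:

```markdown
# CLAUDE.md (illustrative fragment - contents are team-specific)

## Conventions
- All API handlers must verify resource ownership, not just a valid token.
- Database access goes through the repository layer; no raw SQL in handlers.
- Batch lookups with a single IN (...) query; never query per row in a loop.

## Testing
- Every endpoint change needs a test for the unauthorised-access path.
```

Conventions encoded this way are applied at generation time, which is cheaper than catching the same violations at review time.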

Use AI to review AI. This sounds circular, but it is not. Running a PR through an AI review before it goes to a human reviewer catches a significant proportion of the most visible issues - the ones that a thorough reading would find, but that the developer who wrote the code has stopped seeing. The human review then focuses on what AI cannot catch: context, architectural judgement, the institutional knowledge about why a particular approach was chosen and why the apparently simpler alternative was deliberately rejected.

Restructure review authority, not just review thoroughness. Some teams are moving toward a model where AI-generated code in high-risk areas requires a different reviewer profile than AI-generated boilerplate. This is not about distrust - it is about matching scrutiny to risk. An endpoint that touches financial data or user authentication should be reviewed by someone with specific domain knowledge of your security model, regardless of how it was generated. A test fixture probably does not need that level of attention. Making this distinction explicit reduces the total review load while concentrating attention where it matters.


PR size as a signal - not just a norm

One of the most useful indicators I have found for how well a developer is using AI assistance is the size and coherence of their pull requests.

An engineer using AI well does not generate an entire feature with a single prompt and submit what comes back. They decompose the problem first - into components, layers, and concerns - and use AI assistance on each piece while maintaining their own understanding of how the pieces fit together. Their PRs reflect that decomposition: reviewable in 30 minutes by someone unfamiliar with the specific context.

A large, monolithic PR that covers multiple concerns and arrives fully formed is often evidence that the thinking happened inside the AI tool rather than in the engineer’s head first. The problem is not the size per se - it is what the size signals about how the code was produced and whether the engineer can explain the decisions in it.

Treating PR size as a coaching signal rather than a bureaucratic limit changes how you have the conversation. The question is not “why is this PR so big?” It is “walk me through how you approached this - what did you think through before you started generating?” The answer tells you far more than the diff alone about whether the engineer understands what they have built.


The review bottleneck is a leadership problem, not a reviewer problem

The deepest issue with the code review crisis is that it is being experienced at the reviewer level - as time pressure, cognitive load, and professional frustration - but it is caused and sustained at the leadership level by decisions about resourcing, norms, and what success looks like.

If your senior engineers are spending an increasing proportion of their week on review, and that time is coming from building, mentoring, or architectural work, that is not a reviewer problem. It is a resource allocation decision that leadership has made by default rather than intentionally.

Review is not overhead. It is engineering work. If a sprint will generate 25 pull requests, someone needs to budget the time to review 25 pull requests. That time needs to be in the sprint, counted and protected, not expected to materialise from the edges of the delivery schedule.

The teams that are managing this best have made review explicit in their planning: expected review load per sprint, named reviewers with time allocated, a clear escalation path for PRs that require domain expertise the primary reviewer does not have. This is not complicated. It is just a decision that has to be made and enforced.


The question most teams are not asking

The code review research points to something that goes beyond process: Qodo’s 2025 State of AI Code Quality report found that 81% of teams with AI review in the loop reported quality improvements, compared to 55% of equally fast teams without it. Adding structured review - even AI-assisted review - to AI-generated code produces a materially better outcome than generating code quickly and hoping for the best.

But the organisations capturing that 81% figure are doing something specific: they have decided that review is a first-class engineering activity, that AI-generated code is not production-ready by default, and that the speed gain from generation needs to be balanced by investment in validation.

Most teams have not made that decision explicitly. They have let the generation speed increase while hoping the review process will absorb the difference. It does not. And the evidence is accumulating in their incident logs, their defect rates, and the quiet exhaustion of the senior engineers holding the quality bar.

The question worth asking is not “how do we review faster?” It is “how do we create conditions in which the code that arrives at review requires less remedial scrutiny?” That question points back to developer practice, to the standards encoded in your AI tooling setup, and to the culture in which engineers decide what is ready to submit and what needs more work first.


Research cited in this article

Data in this article draws on CodeRabbit’s State of AI vs Human Code Generation Report (470 open-source PRs analysed), DX’s Q4 2025 AI Engineering findings, and Qodo’s 2025 State of AI Code Quality Report. The AI Code Review Checklist referenced in this piece, along with the full framework for adapting your review process, is in The AI-Ready Engineering Team.


Russell Ward is an engineering leader and CTO with over 20 years’ experience building and scaling software engineering teams globally. He writes about engineering leadership, AI adoption, and distributed teams. Find him on LinkedIn.