The Coal Problem
In 1865, an English economist named William Stanley Jevons noticed something that should have been impossible. James Watt's improvements to the steam engine had made coal dramatically more efficient; you could extract more work from less fuel. The logical prediction was that England would consume less coal. The opposite happened. As coal became more efficient, its uses multiplied, total consumption exploded, and the country burned more coal than ever before.
Jevons' observation has held across nearly every resource efficiency gain in modern history. Make engines more fuel-efficient, people drive more. Make storage cheaper, organizations hoard more data. Make bandwidth faster, Netflix shows up and consumes a quarter of North American internet traffic. In every case, the efficiency gain was real AND the total consumption increased. Both things are true simultaneously, which is what makes the paradox so durable and so hard to reason about intuitively.
I think software is Jevons' latest case study, and possibly his most consequential one.
The Productivity Illusion
GitHub's data on Copilot tells a clean story: users complete coding tasks 55% faster and accept roughly 30% of suggestions. The implied conclusion was productivity gain, faster delivery of the same output. But that framing contains an unstated corollary that almost nobody in the discourse acknowledged: those developers aren't producing the same amount of code in less time, they're producing more code in the same amount of time. Or more code in less time. Or both.
This distinction matters enormously and it's easy to miss. A 55% speed increase doesn't mean you ship the same feature by 2pm and take the afternoon off. It means you ship that feature by 2pm and start the next one. The sprint fills, the backlog drains faster, the product manager sees velocity go up and adds more items. The quarterly roadmap gets more ambitious. At no point does anyone decide to produce more code; it's just the natural economic response to a resource becoming cheaper.
Baldur Bjarnason made this argument forcefully: the software industry already had a "too much code" problem before AI entered the picture. Most systems are overbuilt, most features are underused, most codebases contain more code than any single person can comprehend. AI code generation didn't create this dynamic, it accelerated it, removing one of the few natural brakes on code production.
Dan McKinley's "Choose Boring Technology" essay captured the pre-AI version of this constraint. McKinley argued that engineering organizations should be extremely conservative about adopting new technologies, not because new technologies are bad, but because each one consumes a finite budget of organizational attention and maintenance capacity. The constraint on production, the slowness and cost of building things, was itself a form of discipline. It forced tradeoff conversations. It made people ask "do we actually need this?" before investing weeks of engineering time.
When generation becomes nearly free, that question stops getting asked. Not because anyone consciously decides it doesn't matter, but because the economics no longer force it. If building a feature costs four weeks of developer time, you think carefully about whether it's worth building. If it costs an afternoon, you just build it and see what happens. The aggregate effect, across thousands of teams making thousands of these micro-decisions, is an enormous increase in total code production.
Code Is Not an Asset
Here's the counterintuitive core of this piece, the thing that most organizations get structurally wrong: code is not an asset. Code is a liability with occasional asset-like properties.
I know that sounds provocative and I want to be clear about what I mean. The behavior that code produces can be an asset; a feature that delights users, a system that processes transactions reliably, an API that enables a partner integration. But the code itself, the text, the artifact, every line of it carries cost. Maintenance cost, security surface area, cognitive load on every engineer who has to read it, build time, test execution time, dependency management burden, migration complexity when the underlying platform changes. Martin Fowler and the "software entropy" school have been making this argument for decades: code decays. Left unattended, it becomes harder to change, harder to understand, harder to reason about. The natural trajectory of any codebase is toward disorder, and the rate of that disorder is proportional to the volume of code.
This is not a quality argument: even well-written code is expensive to maintain, and clean, well-tested, well-documented code still adds to the cognitive surface area of a system. The question is never "is this code good?" but rather "does the value this code produces justify the ongoing cost of keeping it alive?"
When generation was expensive, organizations answered that question implicitly. The cost of writing code acted as a natural filter; if something wasn't worth the human effort to build, it didn't get built. The filter was imperfect, plenty of unnecessary code got written anyway, but it existed. AI code generation removed the filter without replacing it with a conscious decision-making process. The result is that organizations are now producing code whose maintenance cost exceeds its value, and they can't see it happening because their entire measurement infrastructure is oriented around output volume.
The strongest critique of the Jevons framing comes from practitioners, and I think it's partly right. The argument goes: Jevons Paradox is overblown as stated. What's actually happening is a split. Teams with strong engineering culture are shipping better software faster, and teams without it are drowning in AI-generated slop. The paradox framing assumes a uniform effect and the reality is that AI is an amplifier.
I think that's right at the team level and wrong at the industry level. Individual teams with strong discipline absolutely can resist the Jevons dynamic. They can use AI to write the same amount of better code rather than more code of the same quality. But the industry-wide effect is still Jevons, because the distribution of engineering culture across organizations is not uniform. Most teams don't have strong specification discipline; Piece 1 of this series argued that the industry never built the infrastructure for it. So the aggregate outcome is more code, even if a minority of teams are using the efficiency gains wisely. The paradox describes the system behavior, not any individual actor's behavior, and that's what makes it so hard to address.
The Maintenance Trap
Here is where it gets structural. Organizations are now generating technical debt at machine speed while paying it down at human speed.
Think about what the code lifecycle actually looks like when AI can generate code in minutes. Reviewing that code still takes a human reading it carefully, understanding the system context, evaluating design decisions. Testing that code, in the meaningful sense of validating that it does what was intended, still requires a human who understands the specification well enough to know what "correct" means. Maintaining that code when requirements change, when dependencies update, when adjacent systems evolve, still requires a human who can reason about the implications.
Every stage of the lifecycle except generation is still bottlenecked on human cognition. But the generation stage is now producing input to those downstream stages at a rate that humans cannot absorb.
The result is a self-reinforcing cycle that I find genuinely alarming. AI-generated code creates complexity. Complexity justifies more AI tooling, because humans can't manage the complexity unaided. More AI tooling generates more code, which creates more complexity, which justifies more tooling. At no point does the cycle produce a natural stopping point. There is no equilibrium, only escalation.
And the truly insidious part: more code means more code for the next AI to be trained on. If AI models are trained on codebases that are increasingly bloated with unnecessary generated code, the models learn that bloat is normal and they reproduce it. Researchers have documented model collapse in text generation, where models trained on their own outputs degrade over generations. Nobody is tracking the code equivalent.
The empirical picture is thinner than the volume of discourse suggests. GitClear's five-year longitudinal data shows short-term churn rising from 5.5% in 2020 to 7.9% in 2024, with code duplication growing roughly fourfold. METR's 2025 RCT found experienced open-source developers were 19% slower using AI tools, not faster, and mistaken about it by roughly 40 percentage points in their own estimates. Industry analyses of bug density and maintenance cost exist but vary in rigor. What's still missing is a controlled, multi-year study comparing defect density and maintenance hours between codebases with high versus low AI generation ratios. Without that, both sides can still argue from priors, but the priors have gotten harder to defend on the pro-AI side.
The Organizational Incentive Problem
The Jevons dynamic wouldn't be so dangerous if organizational incentives worked against it. They don't; they work with it.
Consider how most software organizations measure developer productivity. Lines of code, pull requests merged, features shipped, story points completed, sprint velocity. Every one of these metrics rewards output volume. An engineer who generates 500 lines of clean AI-assisted code per day looks more productive, by every standard organizational metric, than an engineer who deletes 200 lines of unnecessary code and simplifies an architecture. The first engineer is "shipping." The second is "not producing."
This isn't a new problem; Goodhart's Law has been eating software metrics for decades. But AI amplifies it catastrophically, because the gap between what's easy to measure (output) and what actually matters (value delivered per unit of ongoing cost) has widened enormously. When generation was slow, high output at least correlated loosely with high effort, which correlated loosely with deliberation, which correlated loosely with value. Remove the generation bottleneck and those correlations collapse. High output now correlates with nothing except the availability of an AI coding assistant.
The rational individual choice (generate more, ship faster, get promoted) is producing a collectively irrational outcome: systems that are more expensive to maintain than they need to be, harder to change than they should be, and more fragile than anyone realizes until something breaks. This is a commons problem. The "commons" is system comprehensibility and maintainability. Each individual contribution of unnecessary code degrades it slightly, but no single contribution is enough to trigger alarm. The degradation is invisible until it's acute.
And this is why I think it's important to name the Jevons dynamic specifically rather than framing it as a quality problem: quality problems suggest quality solutions. Better models, better code review, better linting. But this isn't a quality problem. The code that's being generated might be perfectly fine in isolation. The problem is that there's too much of it, and "too much good code" is a concept that most engineering organizations have no framework for reasoning about. You can't lint your way out of a commons problem; what you need is structural intervention.
The Intervention: Treating Code as Cost
I don't think the Jevons dynamic is unstoppable, but I think the interventions that work are organizational and cultural, not technical. They require changing what organizations measure, what they reward, and how they think about the relationship between code and value.
Treat code as a cost center, not an output metric. The single most impactful change an engineering organization can make is to stop measuring code production and start measuring value delivered per line of code maintained. This is harder to measure, obviously, which is why nobody does it. But the alternative, continuing to reward volume, actively makes systems worse. The metric doesn't have to be perfect; even a rough approximation shifts the conversation from "how much did we ship?" to "was it worth shipping?"
Make deletion a first-class engineering activity. Most organizations treat code deletion as cleanup, something you do when you get around to it, never prioritized, never celebrated. This needs to invert. Regularly schedule what I'd call "code diet" sprints, focused entirely on removing unnecessary code, simplifying abstractions, reducing surface area. Measure and celebrate net-negative code changes the way you'd celebrate a new feature. An engineer who removes 1,000 lines of dead code has reduced maintenance cost, reduced security surface, reduced cognitive load, and made the system easier to change. That's at least as valuable as adding a feature, usually more so.
Establish code budgets. Team-level constraints on net code growth per quarter, forcing explicit tradeoff conversations. If your team has a budget of N net new lines per quarter, every addition requires either a justification or a corresponding deletion. This sounds draconian and it is, deliberately. The point isn't the specific number, it's the forcing function. The constraint makes teams ask the question that Jevons dynamics suppress: "is this worth the ongoing cost?"
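A team that wants this forcing function can approximate it straight from version control. Here is a minimal sketch, assuming input in the per-file record format that `git log --numstat` produces; the budget figure is purely illustrative, not a recommendation:

```python
# Sketch of a net-code-growth budget check. Each input record follows the
# `git log --numstat` per-file format: "<added>\t<deleted>\t<path>",
# with "-" in place of counts for binary files.

def net_loc_change(numstat_lines):
    """Sum (added - deleted) lines across numstat records, skipping binaries."""
    net = 0
    for line in numstat_lines:
        added, deleted, _path = line.split("\t", 2)
        if added == "-" or deleted == "-":
            continue  # binary file; git reports no line counts
        net += int(added) - int(deleted)
    return net

def within_budget(numstat_lines, budget):
    """True if the period's net growth stays under the agreed budget."""
    return net_loc_change(numstat_lines) <= budget

records = [
    "120\t30\tsrc/feature.py",  # net +90
    "0\t200\tsrc/legacy.py",    # net -200: deletion offsets growth
    "-\t-\tassets/logo.png",    # binary, ignored
]
print(net_loc_change(records))      # -110
print(within_budget(records, 500))  # True
```

The point of wiring something like this into CI is not the arithmetic; it's that exceeding the budget forces the tradeoff conversation to happen in review rather than never.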
Track maintenance ratios. For every N lines generated, track how many person-hours of maintenance they consume over six months. This is the data that nobody has and everybody needs. If AI-generated code consumes maintenance at the same rate as hand-written code, the Jevons concern is overblown and the efficiency gain is real. If it consumes more, and I suspect it does because of the comprehension debt problem from Piece 3, then we need to factor that cost into the generation decision. You can't manage what you don't measure, and right now nobody is measuring this.
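The metric itself is simple once the bookkeeping exists. A minimal sketch, where every name and figure is hypothetical: track lines landed and later maintenance hours per cohort, then compare hours per thousand lines between AI-generated and hand-written code.

```python
# Sketch of the maintenance-ratio metric: person-hours of maintenance
# consumed per 1,000 lines, tracked separately by code origin.
# All cohort figures below are hypothetical, for illustration only.

from dataclasses import dataclass

@dataclass
class Cohort:
    lines: int                 # lines landed during the tracking window
    maintenance_hours: float   # hours later spent fixing/adapting those lines

    def hours_per_kloc(self):
        """Maintenance hours consumed per 1,000 lines in this cohort."""
        return 1000 * self.maintenance_hours / self.lines

ai_generated = Cohort(lines=40_000, maintenance_hours=320.0)
hand_written = Cohort(lines=15_000, maintenance_hours=90.0)

print(ai_generated.hours_per_kloc())  # 8.0
print(hand_written.hours_per_kloc())  # 6.0
```

If the first number is persistently higher than the second, that difference is a real cost of generation, and it belongs in the decision about whether to generate at all.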
Apply the "would we write this by hand?" heuristic. This is the simplest intervention and maybe the most powerful. Before accepting a generated feature or component, ask: if we didn't have AI assistance, would this have been worth the engineering time to build manually? If the answer is no, it's probably not worth the maintenance burden of having it in the system. The fact that generation is cheap doesn't make maintenance cheap. This heuristic reconnects the build decision to the value decision that free generation severed.
The key principle underlying all of these: the cost of code isn't in writing it. It never was, really, but now it's undeniable. The cost is in everything that comes after. Reading it, understanding it, changing it, securing it, testing it, explaining it, and eventually, deciding it's no longer worth keeping. Every line of code is a commitment to future cognitive effort, and organizations that don't account for that commitment are writing checks their engineering teams will be cashing for years.
The Amplifier
The previous pieces each named a structural problem. Specification rigor determines whether the code you generate actually does what you intended. Comprehension discipline determines whether your team can understand and maintain the code you've generated. Testing integrity determines whether you can verify the code's behavior against its specification. Each is a problem in its own right, each has its own interventions.
Jevons makes all of them worse, because Jevons means there's more code to specify, more code to comprehend, more code to test, and the rate of increase is accelerating. Address specification rigor, comprehension discipline, and testing integrity, and the Jevons dynamic becomes manageable; you're generating more code, but it's code you understand, code you've verified, code that does what you intended. Ignore them, and Jevons ensures they compound. More code that nobody specified explicitly means more behavior that nobody intended. More code that nobody comprehends means more systems that nobody can maintain. More code that nobody tested meaningfully means more defects that nobody catches until production.
The volume isn't the disease, it's the fever. The disease is an industry that lost its natural brake on complexity and hasn't yet built a conscious one to replace it. Jevons tells us that the brake won't rebuild itself, that efficiency gains don't self-correct, that the economic incentives point toward more, always more. The intervention has to be deliberate, structural, and it has to come from organizations that are willing to measure what matters instead of what's easy.
The good news, if you want to call it that, is that this is a known pattern. Every industry that has experienced a Jevons dynamic has eventually developed the institutional controls to manage it. Energy got efficiency standards and carbon accounting. Data got retention policies and privacy regulations. Software will get its equivalent, the question is whether individual organizations build those controls proactively or wait for the consequences to force them.
I'd rather not wait.