The Feature That Worked

Last fall, I watched a team ship a feature in a single afternoon. A developer opened their AI assistant, described the requirement in a few sentences, and the thing generated a complete implementation: API endpoint, database migration, validation logic, tests. The developer reviewed it, made a few tweaks, CI went green, merged before end of day. Management was thrilled; the sprint velocity chart looked like a hockey stick.

Three weeks later, the feature started dropping transactions under load. Not often enough to trigger alerts, but enough that customers noticed. A senior engineer pulled up the code to diagnose the problem and found herself staring at something she couldn't reason about. Not because it was poorly written; by most surface metrics, it was clean. But nobody on the team had a mental model of what it was doing or why it was structured the way it was. The generated code had made a series of reasonable-looking decisions about connection pooling, retry logic, and error handling that interacted badly under concurrency. Each decision was defensible in isolation, but together they formed a system nobody had designed.
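The details of that incident aren't mine to share, but the shape of the failure is easy to sketch. Here's a hypothetical Python toy (every name here is invented, not the team's actual code) showing one of the classic interactions: two retry policies, each defensible on its own, that compound into load amplification nobody designed.

```python
# Hypothetical sketch: a client library retries on timeout (reasonable),
# and a generated handler wraps it in its own retry (also reasonable).
# Together they hammer an already-struggling dependency.

attempts = {"count": 0}

def flaky_call():
    """Stand-in for a call to a slow downstream dependency."""
    attempts["count"] += 1
    raise TimeoutError("dependency is slow")

def with_retries(fn, n):
    """Generic retry wrapper: try fn up to n times, re-raise the last error."""
    def wrapped():
        last = None
        for _ in range(n):
            try:
                return fn()
            except TimeoutError as e:
                last = e
        raise last
    return wrapped

inner = with_retries(flaky_call, 3)   # the client library's default policy
outer = with_retries(inner, 3)        # the generated handler's added policy

try:
    outer()
except TimeoutError:
    pass

print(attempts["count"])  # 9 attempts against a dependency that is already down
```

Each layer in isolation passes review; the 9x amplification only exists at the system level, which is exactly where nobody was looking.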

It took four days to diagnose what had taken four hours to build.

I'd love to tell you this is a one-off. It isn't. Variations of this are playing out across thousands of engineering organizations right now. And most of the industry's response has been to focus on making the generation part even faster.

That response is understandable, but it's also exactly wrong.


Something Is Off

Let's start with what's true. AI code generation is a genuine technical achievement. GitHub reports that Copilot users accept roughly 30% of suggestions, and its controlled study found developers completing a benchmark task 55% faster. Developers are building things in hours that used to take days. The velocity gains are real, and dismissing them would be dishonest.

But something doesn't add up. If we're generating code faster than ever, why are teams reporting that their systems are getting harder to change? Why is code churn (the rate at which recently written code gets rewritten or deleted) increasing in the Copilot era, as GitClear's 2024 study documented? Why are senior engineers spending more time reading and less time writing, yet feeling less confident about their systems than they did two years ago?

A senior staff engineer put it to me this way: "I used to spend 70% of my time writing code and 30% thinking about code, now it's reversed. Whether that's the 'comprehension-constrained era' or just the natural evolution of a maturing industry, I honestly don't know. But the problems described are not theoretical, they're my Tuesday."

That last line stayed with me.

The problems we're seeing don't match the problems we expected. We expected AI code generation to create quality issues such as sloppy code, security vulnerabilities, obvious bugs. Some of that happened, but the deeper problems are structural. They're about comprehension, specification, and judgment: the activities that surround code generation, not the generation itself.

To understand why, we need to revisit something the industry has known, and conveniently ignored, for forty years.


The Thing Brooks Told Us in 1986

In 1986, Fred Brooks published "No Silver Bullet," one of the most cited papers in software engineering. His central argument was a distinction between two kinds of difficulty in software development:

Accidental complexity: the difficulty imposed by our tools, languages, and processes. Slow compilers, manual memory management, clumsy version control. Real problems, but artifacts of the current state of tooling, not inherent to the work itself.

Essential complexity: the difficulty of deciding what the software should do, how it should behave in edge cases, how it interacts with other systems, how it maps to the actual messiness of the real-world domain it serves. This complexity doesn't go away with better tools. It is the work.

Brooks argued that most of the remaining difficulty in software was essential, not accidental. The hard part was specification (deciding precisely what to build and how it should behave), not generation (the act of expressing that decision as code).

He was right, the industry acknowledged he was right, and then it organized its entire operational infrastructure as if he were wrong.

Think about how we run software organizations. We measure productivity in story points completed, pull requests merged, lines of code committed. We estimate projects in developer-days of coding effort. We gate quality primarily through code review, by reading the generated artifact rather than validating the specification that produced it. We hire engineers by testing their ability to generate code under time pressure. We define seniority partly by the ability to produce more code, faster, across more of the system, with less rework. We build entire career ladders around "10x developers," a concept that implicitly assumes generation speed is the primary variable.

Even our tooling tells the story: IDEs are optimized for writing code, version control tracks changes to code, and CI/CD pipelines validate code. The entire feedback loop is organized around the artifact rather than the intent.

This isn't because anyone sat down and decided specification didn't matter. It's because generation was the visible, measurable, time-consuming activity. It was the thing that felt like the work. And for decades, generation was slow enough, and correlated enough with specification effort, that optimizing for generation was a reasonable proxy for optimizing for the whole process. If you had to type every line, you were at least forced to think about every line.

The bottleneck was fake, but the forcing function was real. AI code generation removed the forcing function without resolving the underlying problem.


The Collapse of Pretense

Here is the thesis of this series, stated plainly:

AI didn't shift the bottleneck from generation to specification; the bottleneck was always specification. AI collapsed the pretense that generation was where the difficulty lived.

This distinction matters more than it might seem at first. If the bottleneck had genuinely shifted, we'd be facing a new problem and could engineer our way through it with new tools and processes, the way we always have. But that's not what happened. What happened is that a load-bearing fiction, the fiction that generation was the core constraint, was removed from the system, and now the system is exposed.

The industry has almost no infrastructure for the thing that actually matters. We don't have good ways to measure specification quality and we don't systematically train engineers in specification skills. Our processes don't distinguish between "code that does something" and "code whose behavior is fully understood and intentionally designed." We have no equivalent of CI/CD for intent, no automated way to verify that what was built matches what was meant, at the level of system behavior rather than test cases.
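What would a first step toward "CI for intent" even look like? One existing building block points in the right direction: stating intended behavior as properties that must hold across many inputs, rather than as a handful of hand-picked examples. A minimal Python sketch, with an invented `apply_discount` function standing in for any piece of business logic:

```python
# Hypothetical sketch: intent expressed as checkable properties, not just
# example-based tests. All names here are invented for illustration.

def apply_discount(price_cents: int, percent: int) -> int:
    """Return the discounted price, rounded down to whole cents."""
    return price_cents * (100 - percent) // 100

# Properties of the *intended* behavior, checked over a grid of inputs.
# These encode what the function must mean, not just what it did once.
for price in range(0, 500, 7):
    for percent in range(0, 101):
        result = apply_discount(price, percent)
        assert 0 <= result <= price             # never negative, never a markup
        assert apply_discount(price, 100) == 0  # a full discount means free
        if percent > 0:
            # a bigger discount never yields a higher price
            assert result <= apply_discount(price, percent - 1)

print("all intent properties hold")
```

This is a toy, and property-based testing is decades old; the point is that it validates a statement of intent rather than a single generated artifact, which is the direction the missing infrastructure would have to grow in.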

This is what Andrej Karpathy inadvertently named when he coined "vibe coding": the practice of generating code by feel, accepting what looks right, and moving forward without deep comprehension. It was meant as a lighthearted description of a new workflow, but it's actually a precise diagnosis of what happens when you remove generation effort without replacing it with specification discipline. You code by vibes because there's no longer a structural reason to do anything else.

And here's the part that makes this urgent rather than merely interesting: it's bimodal. AI doesn't uniformly degrade or improve engineering practice; it amplifies whatever was already there. Teams with strong engineering culture, where specification was already valued, where system understanding was already prioritized, and where code review meant more than syntax checking, are genuinely shipping better, faster work. They had the specification discipline already, and now they have faster generation too.

Teams without that culture are drowning. More code that solves less, systems that nobody understands, technical debt accumulating at a rate that makes the pre-AI era look quaint. Same tool, opposite outcomes. The variable isn't the AI; it's the specification capability of the humans directing it.

I'm still very much in the experiment phase of figuring out what "specification discipline" looks like in practice, and I'll get into concrete approaches later in this series. But the pattern is already clear enough to name.


"But Isn't This Just CI/CD Again?"

This is the strongest counterargument, and it deserves a real answer.

Software development has been through this before. Compilation used to be slow and expensive; now it's free. Deployment used to be a manual, terrifying process; now it's automated. Infrastructure provisioning used to take weeks of procurement; now it takes minutes on a cloud console. Each of these was a genuine revolution; none of them created a crisis. The industry adapted, absorbed the efficiency gains, and moved on. Why should code generation be different?

Because those revolutions automated peripheral activities. Compilation, deployment, and infrastructure are things that happen around the core productive act of software development. Necessary, often painful, but they don't constitute the fundamental work. Making compilation faster didn't change what it meant to be a software engineer; it just removed a tax.

Code generation is different because generation is the core productive act, or at least the activity the industry treated as the core productive act for decades. When you automate the thing you organized your entire profession around, the consequences are categorically different from automating the things around it.

The correct historical parallel is not CI/CD; it's manufacturing robotics.

When industrial robots automated the core productive act of manufacturing, they didn't just make factories faster; they restructured entire industries and redefined what it meant to work in manufacturing. The valuable skills shifted from manual dexterity and physical endurance to system design, maintenance, quality assurance, and process engineering. Roles didn't disappear; they transformed. But that transformation was genuine and disorienting, and the organizations that treated robotics as "just faster assembly" got it catastrophically wrong. They kept optimizing headcount on the line while their competitors redesigned the entire production system around what automation made possible: new quality paradigms, new roles, new ways of thinking about what the factory was actually for.

I think software engineering is in the early stages of an equivalent transformation. The organizations treating AI code generation as "just faster typing" will be the ones struggling to understand why their velocity metrics look great and their systems are falling apart.


What This Means

If this analysis is right, and the rest of this series will make the case in detail, then the software industry is facing three structural challenges, not one:

The Specification Bottleneck. The thing that was always the hard part is now the only part, and we have almost no tooling, process, or training infrastructure for it. We need to build that infrastructure, and it looks nothing like what we have.

The Local-to-Global Coherence Gap. AI generates excellent code at the local level: individual functions, single files, isolated features. But software systems are defined by how their pieces interact, and AI has no persistent model of whole-system coherence. As generation scales up, the gap between local correctness and global coherence widens.

The Trust Calibration Impossibility. Engineers must simultaneously trust AI output enough to achieve velocity gains and distrust it enough to catch the failures that matter. There is no stable equilibrium here. Vigilance degrades with automation; this is not a character flaw but a well-documented property of human cognition. We need structural solutions, not exhortations to "review carefully."

And amplifying all three: a Jevons Paradox dynamic. As generation becomes cheaper, we don't generate the same amount of code more efficiently; we generate much more code, which makes all three problems worse.
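The second of those challenges, the coherence gap, is concrete enough to sketch in code. A hypothetical Python example (all names invented): two functions that each pass their own unit tests, composed across an assumption that neither one states.

```python
# Hypothetical sketch: two locally "correct" units whose composition
# silently violates a system-level assumption neither of them states.

def session_age(created_at_s: float, now_s: float) -> float:
    """Return session age in seconds. Locally correct, locally tested."""
    return now_s - created_at_s

def is_expired(age_ms: float) -> bool:
    """Expire sessions older than 30 minutes. Expects milliseconds."""
    return age_ms > 30 * 60 * 1000

# Each function passes its own tests. Composed, seconds are fed where
# milliseconds are assumed, so nothing ever expires.
age = session_age(created_at_s=0.0, now_s=3600.0)  # one hour old, in seconds
print(is_expired(age))  # False - a one-hour-old session looks fresh
```

No local review catches this, because each piece is fine; the bug lives entirely in the unstated contract between them, which is exactly the layer AI generation has no persistent model of.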

But here is where I want to be clear: this is not a pessimistic diagnosis. For the first time, we can see the actual structure of the problem. For forty years, the essential difficulty of software was obscured by the accidental difficulty of generation. Now the accidental difficulty is being stripped away, and what remains is the real work. Engineering doesn't disappear in this world; it's elevated. The skills that always mattered most (the ability to specify precisely, to think in systems, to exercise judgment about tradeoffs, to build shared understanding across a team) become the primary skills rather than secondary ones.

I don't think that's a crisis; I think it's the first honest look at what the work actually is. It's a problem we've danced around but never approached head-on.


The Road Ahead

This is the first piece in a series called The Specification Age. The title is a claim: that we are entering an era where the specification of intent, not the generation of code, is the primary activity, primary bottleneck, and primary skill of software engineering.

The next piece maps the three root causes in detail and traces them to their structural origins. From there, the series addresses the expertise pipeline crisis, the accountability vacuum, the measurement problem, and the organizational changes required to navigate this transition. It culminates in what I'm calling The Curation Thesis: a concrete framework for what engineering becomes when generation is free and specification is everything.

I want to be honest about uncertainty here: AI comprehension is improving rapidly, and some of what I describe as structural constraints may turn out to be transitional gaps that better models resolve. I'm not certain where those lines will land, and I'll say so where I think that's the case. But even transitions require navigation. The organizational problems, how we hire, how we measure, how we train, how we define the work, don't auto-resolve with better models. Waiting for the "right time" to address them will probably result in them never getting addressed; they require deliberate redesign.

If you're a senior engineer feeling a shift you can't quite name, the sense that you're working harder on different things, that the hard part of your job is getting harder while the easy part gets automated, that the metrics say everything is fine but your instincts say otherwise; this series is an attempt to name it precisely.

The bottleneck didn't shift, the pretense collapsed, and what's left is the real work.