A New Kind of Not-Knowing
Every engineering organization has code nobody understands. This is not new. For decades, we called it "the legacy system Dave wrote before he left," and we managed it with a combination of institutional knowledge, careful archaeology, and the quiet understanding that certain parts of the codebase were load-bearing mysteries. It wasn't ideal, but it worked because the incomprehension was always temporary: somebody had understood that code once. Dave understood it when he wrote it. His teammate understood it during review. The knowledge existed in the system at some point, even if it had since dissipated.
What's happening now is qualitatively different, and I think the industry is underestimating the difference.
When an engineer uses an AI assistant to generate a feature, something subtle breaks in the knowledge chain. The code arrives fully formed, syntactically clean, functionally correct in the narrow sense that it passes the tests the AI also helped write. But the design rationale, the why of every structural decision, exists nowhere. Not in the developer's head, not in documentation, not in the AI's memory. The intent evaporated at the moment of generation, and what remains is an artifact with no provenance.
I'm calling this comprehension debt, and it's distinct from technical debt in ways that matter.
Technical debt is a conscious tradeoff. You know you're cutting a corner, you accept the future cost, and in theory you track it. The metaphor works because the borrower understands the terms. Comprehension debt is different: it's the growing gap between what a system does and what any human understands about why it does it that way. You don't choose to take it on, you don't know how much you're accumulating, and there's no ledger. It accrues silently every time someone merges code they can't fully explain.
GitClear's 2024 study documented that code churn, the rate at which recently written code gets rewritten or deleted, increased measurably in the Copilot era. That finding matters, but I think churn is the visible symptom of something less measurable. Churn tells you code is being replaced; it doesn't tell you why. And increasingly, the why is: nobody understood it well enough to modify it, so they regenerated it. Regeneration becomes the path of least resistance when comprehension fails, and comprehension fails more often when nobody had it in the first place.
There's a distinction the industry doesn't have a category for yet. Legacy code that was understood once is at least a recoverable problem. You can read the artifact and reconstruct what it does functionally. Whether you can figure out why depends on whether someone left a trail: a design doc, a commit message with context, an ADR, anything that captured the reasoning behind the decisions. Sometimes that trail exists and sometimes it doesn't, but the possibility was there. Someone made deliberate choices, and those choices left traces in the structure even when the documentation didn't survive.
Code that was never understood by a human doesn't have that property. There are no deliberate choices to trace, no reasoning to reconstruct, because the decisions were made by statistical defaults rather than intentional design. It's not an information-retrieval problem, it's an information-absence problem. The intent was never in the system. You can't retrieve what was never stored.
The Cognitive Mechanism
There's a concept in cognitive psychology called the generation effect: information you produce yourself is remembered more durably than information you passively receive. If you work through a proof yourself, you remember the logic better than if someone hands you the completed proof. If you write a function from scratch, reasoning through each decision, the structure and rationale lodge in your working model of the system in a way that reading someone else's function doesn't achieve.
This isn't a productivity hack; it's a well-documented property of human memory. The act of generation isn't just a means of producing code; it's the mechanism by which engineers build and maintain mental models of their systems. When you spend an afternoon working through connection pooling in a concurrent service, you come out the other side understanding not just the code you wrote, but the constraints that shaped it, the alternatives you rejected, and the failure modes you anticipated. That understanding persists, and it informs the next decision you make in that system.
AI-assisted development weakens this mechanism in proportion to how much generation it handles. I want to be precise here, because the claim isn't that AI makes engineers stupid; the claim is structural. When generation is externalized, the cognitive process that builds comprehension is bypassed. Not degraded, bypassed. The engineer still reads the code, still reviews it, still approves it. But reading and generating are different cognitive activities with different retention outcomes, and the industry is treating them as interchangeable.
This is why the "just review it carefully" response is insufficient. An author holds the full decision tree in their head: the paths explored, the tradeoffs weighed, the failure modes anticipated. A reviewer sees only the path that was taken. When the author is an AI, even inference about unchosen paths breaks down, because the AI's "decision process" doesn't map to human reasoning in a way that makes alternatives recoverable.
How It Compounds
The insidious thing about comprehension debt is that it compounds in ways that technical debt doesn't.
Each AI-generated component adds code that works but that nobody on the team can fully explain. In isolation, this is manageable; you can treat any single module as a black box and move on. But systems aren't collections of isolated modules. They're webs of interaction, and when you need to understand how component A's retry logic interacts with component B's connection pooling under component C's load balancing, the comprehension debt in each component doesn't add, it multiplies. You need to understand all three, and their interaction, and nobody understands any of them deeply enough to reason about emergent behavior.
Six months after that feature shipped, an on-call engineer at 2am is reverse-engineering intent from an artifact produced by a stochastic process. The code is reasonable, which makes it harder, not easier. If it were obviously bad, the problem would be localized. Reasonable code that nobody designed is harder to debug than bad code that someone understood, because the failure mode isn't in any single decision, it's in the interaction between decisions that nobody made together.
Onboarding becomes archaeology. New engineers traditionally build mental models by reading code and having experienced teammates walk them through the reasoning. But what happens when the experienced teammate didn't write the code and can't explain why it's structured the way it is? Remove the assumption that someone authored the code with articulable intent, and onboarding degrades from education into excavation.
And then comes the dependency loop that should worry people more than it does: the organization becomes dependent on the AI tool to understand its own codebase. When nobody can explain a component, the default response is to ask the AI. But the AI's explanation is a plausible reconstruction, not a factual account; it's generating an interpretation, not recalling design intent, because there was no design intent to recall. The organization is using one stochastic process to interpret the output of another stochastic process, and treating the result as understanding. This is a dependency with no fallback.
The Auditability Collapse
When a production incident occurs in a high-comprehension-debt codebase, debugging changes character in ways that directly affect reliability.
In a traditional incident, the first question is: what was this code trying to do? With human-authored code, even poorly documented code, you can usually answer that by reading the commit history, talking to the author, or tracing logic through a coherent design. The design may be flawed, the implementation buggy, but there's an intent to reconstruct.
With AI-generated code, that question becomes speculative. You're not tracing through logic someone designed; you're reverse-engineering intent from an artifact that has no intent. The commit message says "implement retry logic" but doesn't explain why the retry interval was chosen, why the backoff curve looks the way it does, why the circuit breaker threshold is set where it is. Nobody chose those values through a reasoning process that can be interrogated; they were generated.
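To make this concrete, here's a hypothetical sketch, not taken from any real codebase, of what such generated retry logic tends to look like; the function name and every constant are invented for illustration. The code is clean and plausible, and none of the numbers carries a recoverable rationale:

```python
import time

MAX_RETRIES = 3            # why 3? no recorded rationale
BACKOFF_FACTOR = 1.7       # why 1.7 rather than 2.0? generated, not chosen
BREAKER_THRESHOLD = 5      # why does the breaker trip at 5 failures?

_consecutive_failures = 0  # module-level circuit breaker state

def call_with_retry(request_fn):
    """Call request_fn, retrying connection errors with exponential
    backoff; fail fast once the crude circuit breaker is open."""
    global _consecutive_failures
    if _consecutive_failures >= BREAKER_THRESHOLD:
        raise RuntimeError("circuit open")
    delay = 0.5  # why a 0.5s base delay? also unexplained
    for attempt in range(MAX_RETRIES):
        try:
            result = request_fn()
            _consecutive_failures = 0  # success resets the breaker
            return result
        except ConnectionError:
            _consecutive_failures += 1
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(delay)
            delay *= BACKOFF_FACTOR
```

Every line is defensible in isolation. The question the incident responder needs answered, whether any of these values reflects a constraint or just a sampling default, is the one the artifact cannot answer.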
This degrades incident response concretely. Mean time to resolution increases because you're not finding a bug in a design, you're trying to infer a design from code that was never designed. Root cause analysis becomes speculative because you can't distinguish between "this was an intentional tradeoff that went wrong" and "this was an arbitrary decision that happened to work until it didn't." Post-incident reviews lose their educational value because the lesson isn't "we made a wrong choice," it's "something generated this and none of us know why."
Charity Majors has been arguing for years that observability is more critical than most organizations realize. Comprehension debt makes observability harder at the source. You can instrument everything, you can have perfect telemetry, but if nobody on the team can form a hypothesis about why the system is behaving a certain way, the telemetry gives you symptoms without etiology. You can see that latency spiked at 2:17am, you can see which component caused it, but you can't explain why that component behaves differently under those conditions, because the behavior was generated, not designed.
Why Existing Solutions Don't Reach the Problem
The natural response to comprehension debt is to lean harder on existing quality practices. More rigorous code review, better documentation, stricter standards. I think these are necessary but insufficient, because each of them assumes something about the human-code relationship that AI-assisted development undermines.
Code review assumes the author understands the code. The traditional review is an asymmetric dialogue: the author has deep context, the reviewer applies fresh eyes. When AI generates the code, both parties are interpreting. Neither holds the design rationale. The review becomes two people reading code they didn't write, negotiating which interpretation seems more plausible. That's a reading comprehension exercise, not a design review.
AI-generated documentation restates what the code does, not why. When you ask an AI to document a function it generated, it produces a description of behavior: inputs, outputs, steps. What it doesn't produce, because it can't, is the reasoning that led to this structure rather than another. Why was the retry count set to three? Why does this component handle its own connection pooling instead of using the shared pool? These are design decisions, and when the design was generated rather than reasoned through, the rationale doesn't exist to be documented.
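A hypothetical illustration of that gap, with the function and its parameters invented for this example: the docstring below is the kind an AI produces for its own generated code. It's accurate, and it contains no design reasoning.

```python
def fetch_user(user_id, retries=3):
    """Fetch a user record by ID.

    Retries the request up to `retries` times on connection failure
    and returns None if every attempt fails.
    """
    # What no generated docstring supplies: why three retries, why
    # None instead of an exception, why this function owns its own
    # retry loop instead of using a shared client. The behavior is
    # fully described; the rationale was never anywhere to capture.
    ...
```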
Comments and commit messages thin out. When a developer writes code manually, the friction of the process forces engagement with decisions. That engagement produces artifacts: comments explaining tricky logic, commit messages describing the approach, PR descriptions walking through the design. When code is generated, these artifacts become perfunctory. The commit message says "add retry logic" because that's what was requested. The design decisions, such as they are, are compressed into the prompt, and prompts are almost never preserved as project documentation.
Test coverage doesn't equal comprehension coverage. You can have 100% test coverage on code nobody understands. The tests verify that the code does what it does, which is a tautology when the tests were also generated from the same specification, or lack thereof. What testing doesn't verify is whether the code does what it should do in the full space of scenarios the team needs it to handle. That verification requires understanding, and understanding is exactly what's missing.
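A hypothetical example of the tautology, reusing the call_with_retry sketch from earlier: a test generated from the implementation pins current behavior without asserting anything about intended behavior.

```python
def test_gives_up_after_three_attempts():
    attempts = []

    def failing_request():
        attempts.append(1)
        raise ConnectionError("simulated outage")

    try:
        call_with_retry(failing_request)  # the sketch from earlier
    except ConnectionError:
        pass

    # This asserts what the code does (three attempts), not what it
    # should do. If 3 was an arbitrary generated default, 100% coverage
    # just enshrines the arbitrariness.
    assert len(attempts) == 3
```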
The Intervention: Comprehension Gates
I don't think comprehension debt is inevitable. I think it's the predictable result of organizations optimizing for generation speed without building corresponding infrastructure for comprehension. The generation infrastructure is excellent; we're better at producing code than we've ever been. The comprehension infrastructure barely exists.
Here are the concrete interventions I've landed on, still experimental but grounded in what I've seen work:
Comprehension gates in the review process. Before a PR is approved, the submitter should be able to explain the approach without looking at the code. Not recite it line by line, but whiteboard the design: why this structure, what alternatives were considered, what failure modes are handled. If the submitter can't do this, the code isn't ready for review regardless of its test status. This is actually fast for code the developer genuinely understands; it's only slow for code they don't, and that's the signal it's meant to detect.
Specification-first workflow. Write the spec before generating the code. I mean a real specification: the constraints, the failure modes, the interaction with other components, the design rationale for the approach you're about to take. Then generate the code from the spec, and review the spec independently of the code. The spec becomes the durable artifact that preserves intent; the code is the implementation of that intent, replaceable and re-generatable. This inverts the current flow, where the code is the primary artifact and the spec, if it exists at all, is an afterthought.
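A minimal sketch of what such a spec might contain. The component, numbers, and file layout are illustrative, not prescriptive; they echo the retry example from earlier.

```
Spec: retry behavior, payments client (illustrative example)

Constraints:
  - Upstream rate-limits at 10 req/s; retries must add at most 3
    extra requests per call.
  - The caller's p99 latency budget is 2s; worst-case backoff must
    fit inside it.

Failure modes handled:
  - Transient connection resets: retry with backoff.
  - Sustained outage: circuit breaker opens after 5 consecutive
    failures so we fail fast instead of queueing work.

Rationale:
  - 3 retries with 1.7x backoff on a 0.5s base adds at most ~1.35s
    of delay, inside the 2s budget. A fourth retry would blow it.

Rejected alternatives:
  - Shared retry middleware: this client needs a tighter breaker
    threshold than the shared default provides.
```

Notice that the spec answers exactly the questions the 2am responder will ask; the code merely implements the answers.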
Comprehension audits. Periodic exercises, I'd suggest quarterly, where engineers explain AI-generated components to each other. Not as a gotcha, but as a diagnostic. Where the team can't explain its own systems, comprehension debt has accumulated and needs to be addressed. These audits also serve an educational function; the act of preparing to explain a component forces the kind of deep engagement that builds real understanding.
Measuring comprehension debt. This one is harder, but worth pursuing. Track the fraction of the codebase that at least one team member can confidently explain. Monitor time-to-resolution for incidents in different parts of the codebase as a proxy for comprehension depth. None of these are perfect metrics, but they're better than the current approach, which is to not measure comprehension at all. One rough starting point is sketched below.
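Everything in this sketch is hypothetical infrastructure: a manifest mapping components to the people who claim they can explain them, and a ratio computed from it.

```python
import json

def comprehension_ratio(manifest_path):
    """Fraction of components at least one person claims to be able
    to explain. Self-reported, so a ceiling, not a guarantee."""
    with open(manifest_path) as f:
        # e.g. {"billing/retry.py": ["alice"], "auth/session.py": []}
        manifest = json.load(f)
    if not manifest:
        return 1.0
    explained = sum(1 for explainers in manifest.values() if explainers)
    return explained / len(manifest)
```

Tracked over time, even a crude ratio like this makes the trend visible, which is the point.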
Prompt preservation. Treat the prompts and specifications used to generate code as first-class project documentation. When an AI generates a significant component, the prompt, the context, and the iterative conversation that produced it should be captured alongside the code. This doesn't solve comprehension debt, but it gives future engineers a starting point for understanding intent that they currently don't have.
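One possible convention, illustrative rather than any established standard: reference the preserved prompt from the commit itself, so the trail survives alongside the code.

```
# illustrative commit message; the trailer names and paths are made up
Add retry logic to payments client

Retry parameters follow docs/specs/payments-retry.md.

Generated-With: <assistant name and version>
Prompt-Ref: docs/prompts/2024-06-12-payments-retry.md
```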
I want to be honest that I'm still in the experiment phase with most of these. The comprehension gate is the one I've seen produce the most immediately visible results; teams that adopt it report catching significant design gaps before they reach production. The others are earlier in validation. But the direction feels right, and the alternative, continuing to accumulate comprehension debt without measuring or addressing it, feels clearly wrong.
The Structural View
Comprehension debt derives directly from the first two root causes I described in Piece 2: the Specification Bottleneck and the Local-to-Global Coherence Gap. When specification is weak, that is, when code is generated without deep engagement with intent, comprehension debt accrues at the individual component level. When coherence is missing, that is, when locally correct components interact in ways nobody designed, comprehension debt accrues at the system level. The two multiply each other.
This isn't about AI being bad at what it does. It's quite good at what it does, which is generating locally correct code from specifications. The problem is that organizations are consuming that capability without building the infrastructure to maintain comprehension of what's being generated. We built the generation engine and forgot to build the understanding engine.
The parallel to other engineering disciplines is instructive. When CAD automated drafting in mechanical engineering, organizations didn't stop requiring engineers to understand their designs. They created new review processes, new documentation standards, new ways of ensuring the humans using the automation understood the systems they were producing. The drafting was automated; the comprehension was not.
Software engineering needs to make the same distinction, and make it soon. Every month that passes, the comprehension debt grows, the organizational dependency on AI-as-oracle deepens, and the cost of eventually addressing it increases. The organizations that build comprehension infrastructure now, while the debt is still manageable, will have a significant structural advantage over those that wait.
The bottleneck was always specification. Comprehension debt is what happens when you generate without specifying, and it's accumulating faster than most organizations realize.