A Load-Bearing Principle

Software testing rests on a principle so foundational that it's easy to forget it exists: the thing you test against cannot be derived from the thing you're testing. It's a principle the industry could afford to treat loosely when humans wrote both code and tests. AI-assisted workflows are quietly making it load-bearing again, and mostly violating it.


The Oracle Problem

The principle has a name. In testing theory, the oracle is the source of truth that tells you what the correct output should be, and it must be independent of the system under test. When you write a test that asserts calculateTax(100) === 7.25, the oracle is your understanding of the tax rules, not the function itself. The test has value precisely because the expected value comes from a different source than the implementation.
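To make the distinction concrete, here's a minimal sketch in TypeScript with Jest-style test syntax; the implementation and the 7.25% rate are invented for illustration:

```typescript
// The implementation under test (invented for illustration).
function calculateTax(amount: number): number {
  return (amount * 7.25) / 100; // 7.25% sales tax
}

// Independent oracle: the expected value is worked out by hand from
// the tax rules, before ever looking at the implementation.
test("applies the 7.25% rate", () => {
  expect(calculateTax(100)).toBe(7.25);
});

// Tautology: the expected value is derived from the implementation,
// so this passes no matter what the function actually does.
test("returns whatever it returns", () => {
  expect(calculateTax(100)).toBe(calculateTax(100));
});
```

The second test can never fail, which is exactly why it verifies nothing.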

This shows up everywhere in formal verification: you can't prove a system correct by using that same system as your reference. The specification has to be independent. It's not a nice-to-have, it's the thing that makes verification meaningful at all.

When AI generates both the code and the tests, this principle breaks. Not sometimes, not in edge cases, but structurally. The same model that decided how to implement a function is the same model deciding what that function's correct behavior looks like. Its blind spots in generation are its blind spots in verification. If the model doesn't understand a subtle concurrency issue well enough to avoid it in the code, it doesn't understand it well enough to test for it either. The failure modes are correlated, and correlated failure modes are precisely the thing independent verification exists to catch.

Two observations from practitioners point at this. Hillel Wayne has articulated it sharply: AI-generated tests tend to have low mutation scores, meaning that when you inject small faults into the code, the tests don't catch them. Emily Bache has framed the same pattern from a different angle: AI-generated tests achieve structural coverage without semantic coverage. You can hit 95% line coverage and still have a test suite that doesn't know what the software is supposed to do, because the tests were derived from the implementation rather than from a specification of intent. That's a circular reference dressed up as quality assurance.


Grading Your Own Homework

"Write tests for this function" is one of the most common AI-assisted development prompts, and it contains the entire problem in six words. What it actually asks the model to do is: look at this implementation, infer what it's supposed to do, and then verify that it does the thing you just inferred. The tests will always pass, because the model is testing the code against its own interpretation of the code.

The distinction between "what the code does" and "what the code should do" is the entire point of testing. AI collapses that distinction. Not out of malice or incompetence, but because the model has no independent access to the specification. It only has the implementation, and it reverse-engineers intent from structure. When the structure contains a bug, the inferred intent contains the same bug.

I think the most useful analogy here is grading your own homework, but at industrial scale. A student who grades their own work will consistently rate their answers as correct, not because they're dishonest, but because the same understanding that produced the wrong answer will evaluate it as right. The errors are invisible from inside the system that generated them. This is why we have independent grading in education, independent audits in finance, independent verification in engineering.

And yet we're building development workflows that do exactly this, and we're calling the results "tested."


Confidence Theater

Here's where it stops being an abstract concern and starts being dangerous. The artifacts of AI-generated testing look exactly like the artifacts of real testing. Coverage reports show 90%+ and the CI pipelines go green. Code review sees tests passing alongside the implementation, which actually increases reviewer confidence, because the presence of tests signals rigor even when the tests are tautological.

I think this is Goodhart's Law applied to software quality: when coverage becomes the target, it ceases to be a useful measure. Organizations that gate deployments on coverage thresholds now have an incentive structure where AI can trivially satisfy the gate without satisfying the purpose the gate was supposed to serve.
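The degenerate case looks something like this (processOrder is a hypothetical stand-in for whatever the gate protects): a test that executes every line and verifies almost nothing.

```typescript
// Hypothetical function under the coverage gate.
declare function processOrder(
  order: { items: { sku: string; qty: number }[] }
): unknown;

// Every line of processOrder executes, the coverage threshold is
// satisfied, and nothing about the order's actual handling is checked.
test("processOrder runs", () => {
  const result = processOrder({ items: [{ sku: "A1", qty: 2 }] });
  expect(result).toBeDefined();
});
```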

The failure mode is subtle and delayed. The code deploys, the tests pass, and the system works under all the conditions that both the code and the tests considered. The problem shows up under the conditions that neither considered, which are by definition the conditions the model's training didn't prepare it for. Race conditions, unusual input distributions, interaction effects between components, edge cases that depend on domain knowledge the model doesn't have. The code doesn't handle them; the tests don't test for them; and the coverage report provides no signal that anything is missing.

Consider a concrete scenario: AI generates a caching function with a subtle race condition in its invalidation logic. The same model generates tests that verify the cache returns correct values on read, correctly invalidates on write, handles null keys, times out properly. The race condition isn't tested because the model that introduced it doesn't see it as a possibility. The tests are comprehensive by every structural metric and completely blind to the actual defect. You won't find it until production traffic hits a concurrency pattern that neither the code nor the tests anticipated.
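A compressed sketch of that scenario, with all names invented. The defect is a check-then-act sequence that isn't atomic across the await:

```typescript
// Hypothetical backing store for the cache.
declare function fetchFromSource(key: string): Promise<string>;

const cache = new Map<string, { value: string; stale: boolean }>();

async function get(key: string): Promise<string> {
  const entry = cache.get(key);
  if (entry && !entry.stale) return entry.value;
  const fresh = await fetchFromSource(key); // a write can invalidate here...
  cache.set(key, { value: fresh, stale: false }); // ...and this clobbers it
  return fresh;
}

function invalidate(key: string): void {
  const entry = cache.get(key);
  if (entry) entry.stale = true; // lost if a refresh completes after this
}
```

Single-request tests for get and invalidate all pass. Only a test that interleaves an invalidation inside an in-flight refresh exposes the defect, and that interleaving is exactly what the generating model didn't conceive of.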

This is confidence theater: the performance of quality assurance without its substance.


It's Not That the Tests Are Bad

AI can write perfectly competent tests: syntactically correct, structurally thorough, covering branches and edge cases that a human might forget. The problem isn't quality in the conventional sense, it's independence.

Early research on AI-generated test quality points in the same direction. Meta's TestGen-LLM work focused on improving test coverage through AI generation, and while coverage increased, the quality of those tests as fault-detectors is a separate question. JetBrains Research and others have examined mutation survival rates for AI-generated test suites and found that the tests consistently perform worse at catching injected faults than human-written tests targeting the same code. The tests look good, they cover the branches, they just don't catch bugs at the same rate, because they were generated from the same model of the code that the bugs emerged from.

This isn't a problem that better models straightforwardly solve, because it's not a capability gap, it's a structural property. A more capable model will write more sophisticated tests that still share its own blind spots. Capability and independence are different axes. You can move arbitrarily far along one without moving at all along the other.


The Specification Gap, Again

All of this traces back to the specification bottleneck from Piece 1. The reason AI collapses the distinction between "what the code does" and "what it should do" is that the specification, the independent statement of intent, usually doesn't exist in a form the testing process can reference.

When a human developer writes tests, the oracle is their understanding of the requirement: messy, incomplete, sometimes wrong, but independent of the implementation. They read the spec, they talk to the product owner, they think about edge cases from domain experience, and they write tests that encode that separate understanding. The tests might be wrong, but they're wrong for different reasons than the code is wrong, and that independence is what gives testing its power.

When AI writes both code and tests from the same prompt, the only oracle is the prompt itself. If the prompt says "implement a function that calculates tax" and says nothing about rounding behavior, tax-exempt categories, or regional variations, the code will make silent assumptions and the tests will validate those same silent assumptions. There's no independent source of truth to push back, no second perspective to introduce the kind of productive disagreement that catches errors.

The testing tautology is, at its root, a specification problem. With a clear, independent specification, the tautology breaks. You test the code against the spec, not against the code's own self-image. The problem isn't testing, it's the absence of something to test against.


Breaking the Loop

I'm still working through what the full solution looks like, but I think the interventions that matter share a common principle: they reintroduce independence into the verification process. Some of these I've seen work in practice, some I'm still experimenting with, and I'll say which is which.

Spec-first testing. Write test specifications before generating code: natural language, structured requirements, formal invariants, whatever fits your context. Then generate tests against the spec, not against the implementation. This sounds obvious, but it inverts the workflow most teams are using. The spec becomes the oracle, and because the spec exists before the code does, it can't be derived from the code's assumptions. I've seen teams adopt this with good results, though the discipline of writing specs before code is exactly the muscle the industry has let atrophy.
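As a sketch, with invented requirement wording and a hypothetical calculateOrderTax signature, the spec-derived suite looks like this; every assertion traces to a requirement, not to anything the code does:

```typescript
// The spec exists before any code does (wording invented for illustration):
//   R1. Tax is 7.25% of the taxable subtotal, rounded to whole cents.
//   R2. Items marked taxExempt contribute nothing to the taxable subtotal.
//   R3. A negative price is a programming error: throw, don't guess.

type Item = { price: number; taxExempt: boolean };
declare function calculateOrderTax(items: Item[]): number;

test("R1: 7.25% of the subtotal, rounded to whole cents", () => {
  // 10.34 * 7.25% = 0.74965, which rounds to 0.75
  expect(calculateOrderTax([{ price: 10.34, taxExempt: false }])).toBe(0.75);
});

test("R2: exempt items are excluded", () => {
  expect(
    calculateOrderTax([
      { price: 100, taxExempt: false },
      { price: 50, taxExempt: true },
    ])
  ).toBe(7.25);
});

test("R3: negative price throws", () => {
  expect(() => calculateOrderTax([{ price: -1, taxExempt: false }])).toThrow();
});
```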

Property-based testing. Instead of testing specific inputs and outputs, define invariants that must hold regardless of implementation. "The output of sort should always be in ascending order and contain exactly the same elements as the input." Property-based tests are harder for AI to accidentally make tautological, because they assert invariants rather than specific input-output pairs. You can ask "what must always be true about this function?" and get tests with genuine verification power, because the properties come from the domain, not from the code. This is the approach I've had the most direct success with; pairing property-based testing with AI generation is one of the few combinations where the AI's speed genuinely enhances rather than undermines quality.
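Here's the sort example using fast-check, the usual property-testing library on the TypeScript side; mySort stands in for the generated implementation:

```typescript
import fc from "fast-check";

// Hypothetical implementation under test.
declare function mySort(xs: number[]): number[];

// The properties come from the domain, not from mySort's code:
// the output is ascending, and it is a permutation of the input.
fc.assert(
  fc.property(fc.array(fc.integer()), (xs) => {
    const out = mySort(xs);
    const ascending = out.every((v, i) => i === 0 || out[i - 1] <= v);
    const samePopulation =
      out.length === xs.length &&
      [...out].sort((a, b) => a - b).join(",") ===
        [...xs].sort((a, b) => a - b).join(",");
    return ascending && samePopulation;
  })
);
```

fast-check generates hundreds of arbitrary inputs and shrinks any failure to a minimal counterexample, so the test's power doesn't depend on anyone, human or model, having imagined the right specific case.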

Mutation testing as meta-validation. Run mutation testing on AI-generated test suites. Mutation testing injects small faults (changing a > to >=, flipping a boolean, removing a line) and checks whether the tests catch them. If mutants survive, the tests aren't validating behavior. This doesn't fix the tautology, but it exposes it, which is the necessary first step. Think of it as a test for your tests, and importantly, one that the AI can't game, because the mutations are generated independently of both the code and the tests.
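The shape of it, with an invented function; StrykerJS is the usual mutation-testing tool for TypeScript:

```typescript
// Original implementation (invented for illustration):
function canCheckout(age: number): boolean {
  return age >= 18;
}

// A mutation tool injects variants like:
//   return age > 18;   // boundary mutated
//   return true;       // condition removed
//
// This test lets both mutants survive: it never probes the boundary
// and never exercises the rejection path.
test("adults can check out", () => {
  expect(canCheckout(30)).toBe(true);
});

// These kill them: the boundary and the rejection case are both pinned.
test("exactly 18 can check out", () => {
  expect(canCheckout(18)).toBe(true);
});
test("17 cannot check out", () => {
  expect(canCheckout(17)).toBe(false);
});
```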

Adversarial test generation. Use a separate model instance, or a separate prompting session with deliberate adversarial framing, to try to break the code. "Here is a function. Your job is to find inputs that cause it to fail, behave unexpectedly, or violate its stated contract." This reintroduces the independence that single-session generation destroys. The adversarial framing matters; you're explicitly asking the model to work against the code rather than to validate it.
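A sketch of the two-session split; generateTests and the file path here are hypothetical, since the plumbing depends on your model API, but the framing is the point:

```typescript
import { readFileSync } from "node:fs";

// Hypothetical wrapper around whatever model API you use.
declare function generateTests(args: {
  prompt: string;
  code: string;
  contract: string;
}): Promise<string>;

const ADVERSARIAL_PROMPT = `
You are reviewing someone else's code. Do NOT show that it works.
Find inputs that make it fail, hang, or violate the stated contract,
and return each finding as a failing test case.
`;

// Session 1 wrote the implementation. This session sees only the code
// and the contract, never session 1's reasoning or its tests.
const attackSuite = await generateTests({
  prompt: ADVERSARIAL_PROMPT,
  code: readFileSync("src/parseDuration.ts", "utf8"),
  contract: "Parses strings like '1h30m' to milliseconds; rejects anything else.",
});
```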

Contract testing for integration boundaries. Define contracts between services independently of either service's implementation. The contract says "service A will send this shape of data, service B will accept this shape of data," and both sides are tested against the contract rather than against each other. This is well-established practice that becomes more important when AI is generating both sides of an integration, because the correlated-failure-mode problem is amplified across service boundaries.
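Pact is the established tooling here; a hand-rolled sketch with zod (the service functions are hypothetical) shows the shape:

```typescript
import { z } from "zod";
import { randomUUID } from "node:crypto";

// The contract lives in its own package, owned by neither service.
export const OrderCreatedEvent = z.object({
  orderId: z.string().uuid(),
  total: z.number().nonnegative(),
  currency: z.enum(["USD", "EUR"]),
});

// Hypothetical code on each side of the boundary.
declare function buildOrderCreatedEvent(orderId: string): unknown; // service A
declare function handleOrderCreated(
  event: z.infer<typeof OrderCreatedEvent>
): void; // service B

// Producer side: whatever service A emits must satisfy the contract.
test("order service emits a contract-valid event", () => {
  const event = buildOrderCreatedEvent(randomUUID());
  expect(() => OrderCreatedEvent.parse(event)).not.toThrow();
});

// Consumer side: service B must accept anything the contract allows,
// constructed from the contract, never from service A's current output.
test("billing service accepts a contract-conformant event", () => {
  expect(() =>
    handleOrderCreated({ orderId: randomUUID(), total: 0, currency: "USD" })
  ).not.toThrow();
});
```

Neither test references the other service's implementation, so a model regenerating one side can't silently drag the other side's assumptions along with it.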


The Principle

The interventions are varied, but the principle underneath them is singular: the oracle must be independent of the generator. The thing you test against cannot be derived from the thing you're testing.

This is not a new principle, it's as old as formal verification. What's new is that AI collapses the natural variation between writing code and writing tests, and what used to hold implicitly now has to be enforced structurally.

The good news, if you want to call it that, is that this connects directly to the specification bottleneck in a way that makes both problems reinforce the same solution. If you invest in better specifications, you solve the testing tautology as a side effect, because specifications give your tests an independent oracle. And if you invest in breaking the testing tautology, you end up building specification infrastructure as a prerequisite.

The work is the same work. It's specification all the way down.