Test Levels

Categories: Programming

I’m planning on starting a series of blog posts discussing different testing methodologies, but before I do that, I need to get a rant off my chest, and it pertains to a topic that is discussed ad nauseam whenever you bring up “testing methodologies.” It’s a topic that is, in my mind, neither very interesting nor very valuable, and it gets in the way of more useful discussions.

The pointless conversation we keep having

Have you ever been in a meeting where someone starts discussing the relative importance of unit, integration, and end-to-end testing? Have you heard people argue about whether we have enough unit tests? What about too many unit tests and not enough integration tests? What about too many integration tests? Or heard someone argue that we shouldn’t write end-to-end tests because they’re too brittle? I certainly have, and I’m guessing that anybody who has programmed in a professional context for more than a hot minute has as well. I have an opinion about those conversations and every other flavour of “do we have enough or too many of each type of test”:

These discussions are a waste of time and we should stop spending effort and mental energy having them.

Now, at the core of those discussions there is a kernel of something valuable, but in its typical form, the discussion is a clumsy proxy for the questions that we actually care about. Not because the categories themselves are wrong, but because they’re a low-resolution quantization of something that’s actually continuous, and just like over-quantizing an LLM, it leads to a less useful output.

It’s a gradient, not categories

Here’s the model I want you to hold for the rest of this post. Imagine a single axis. On the far left: a test that exercises one small piece of self-contained logic — a pure function, no I/O, no clock, no network. Setup is trivial because there’s almost nothing to set up. Think of a math operation like add or multiply. On the far right is a test that drives the whole system the way a real user or client would: real database, real network hops, real everything. Setup is a project in itself. Think opening a bank account—you need to validate all the inputs for correctness, spin up your anti-fraud services, check for duplicates, generate signup data that won’t raise any flags within your system (which is effectively committing white-hat fraud against your system), etc.

Isolation-fidelity Axis

Every test you’ve ever written lives somewhere on the isolation-fidelity axis. You could even extend the line and add user acceptance testing on the right and perhaps syntactic analysis (done by linters, compilers, etc.) on the left. At the end of the day, “unit,” “integration,” and “end-to-end” aren’t three distinct kinds of test; they’re three regions that we’ve loosely labeled on a continuous line, each covering a different amount of scope. Try to pin the regions down and the questions pile up. What defines a module? When does the number of modules connected together hit the critical mass where it ceases to be an integration test and becomes an end-to-end one? What if you only have one module? Is everything a unit test or is everything an end-to-end test? What if you use the actual module instead of a test fake to satisfy a dependency? Is it now an integration test? What if that module is really simple?

The borders between them are blurry and overlapping, drawn in different places by different people †. But the fuzziness isn’t really a problem. Categories, even loose ones, help us talk about things efficiently. If you had to delve into the nuances of each test making up a suite to discuss it as a whole, you’d never be able to get anywhere. However, if you treat these categories as fixed regions for prescriptive goals rather than as the fuzzy generalizations they are, then you immediately have a problem: whose definition are you using? You can’t budget across buckets whose edges nobody agrees on. That’s not to mention the even larger issue that doing so often ignores the all-too-critical nuances of “what are you testing” and “with what tools are you testing it” that should be the basis of any discussion regarding testing strategy.

† If you want to see an example of this in practice, just go into a group of software engineers and assert that sociable unit tests are actually misnamed integration tests. In the ensuing discussion, you will probably hear about as many different definitions of where the boundary between “unit test” and “integration test” sits as there are people taking part.

What actually changes as you move along the axis

The reason the axis is useful is that real, discussion-relevant things change as you slide along it.

Setup gets harder. On the left, you call a function and assert on the return value. As you move right, you’re standing up dependencies: a database, a fake third-party service, a message queue, seed data, a way to manipulate the system clocks across multiple machines (if you think I’m joking about this last one, this is one of the things that Jepsen can do). And the cost isn’t linear; the sad paths are usually where it bites the hardest. Testing that your happy-path checkout works against a real database is one thing. Forcing that same database to fail mid-transaction so you can test your rollback is a different scenario that requires significantly more effort to create.
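To make the setup gradient concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `apply_discount`, `checkout`, the two-table schema, and the use of an in-memory SQLite database as a stand-in for a real one. The sad path is forced with a CHECK constraint, which is far easier than making a production database fail mid-transaction, so treat it as a lower bound on the effort involved.

```python
import sqlite3


def apply_discount(price_cents: int, percent: int) -> int:
    """Pure logic: no I/O, so a test needs no setup at all."""
    return price_cents - (price_cents * percent) // 100


def checkout(conn: sqlite3.Connection, item: str, qty: int) -> None:
    """Record an order and decrement stock in a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty))
        conn.execute("UPDATE stock SET qty = qty - ? WHERE item = ?", (qty, item))


def make_db() -> sqlite3.Connection:
    """The setup that tests further right need: schema plus seed data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (item TEXT, qty INTEGER)")
    conn.execute(
        "CREATE TABLE stock (item TEXT PRIMARY KEY, qty INTEGER CHECK (qty >= 0))"
    )
    conn.execute("INSERT INTO stock VALUES ('widget', 5)")
    conn.commit()
    return conn


def test_discount():
    # Far left: call a function, assert on the return value.
    assert apply_discount(1000, 15) == 850


def test_checkout_happy_path():
    conn = make_db()
    checkout(conn, "widget", 2)
    assert conn.execute("SELECT qty FROM stock").fetchone() == (3,)
    assert conn.execute("SELECT COUNT(*) FROM orders").fetchone() == (1,)


def test_checkout_rolls_back_on_failure():
    conn = make_db()
    # Force the second statement to fail: the CHECK constraint rejects qty < 0.
    try:
        checkout(conn, "widget", 99)
    except sqlite3.IntegrityError:
        pass
    else:
        raise AssertionError("expected the stock update to fail")
    # The order row inserted before the failure must have been rolled back.
    assert conn.execute("SELECT COUNT(*) FROM orders").fetchone() == (0,)
    assert conn.execute("SELECT qty FROM stock").fetchone() == (5,)
```

Even in this toy version, the sad-path test is the longest of the three, and it only works because the schema happens to offer a convenient way to trigger the failure.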

Tests cover more area. This can be good, but more surface area means more reasons to break that have nothing to do with what you are trying to test. The further right you go, the more incidental things a test implicitly depends on, and the more often it’ll fail for reasons that just get lumped together as “flakiness.” Using property-based tests instead of example-based tests can ameliorate this a bit, but it’s still a cost you’re eating. Another downside is that it becomes harder to debug why a test failed. Eventually, you’ll hit a level of complexity where you need application telemetry and telemetry analysis systems to understand the cause of a failure, whereas on the left all you needed was the name of the test that failed and maybe a debugger.
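For readers who haven’t met the distinction, here is a rough sketch of an example-based test next to a property-based one. It is hand-rolled with the standard library’s `random` module to stay self-contained; a dedicated library such as Hypothesis would generate and shrink inputs far better. The run-length encoder is a hypothetical subject chosen only because its invariants are easy to state. One plausible reading of the claim above is that a property asserts on an invariant rather than on one exact output, so it leans on fewer incidental details.

```python
import random


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse runs of repeated characters into (char, count) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)


def test_example_based():
    # Pins one input to one exact output.
    assert rle_encode("aaab") == [("a", 3), ("b", 1)]


def test_property_based(trials: int = 200, seed: int = 0):
    # Checks invariants over many generated inputs; the fixed seed keeps
    # the run reproducible.
    rng = random.Random(seed)
    for _ in range(trials):
        text = "".join(rng.choice("ab ") for _ in range(rng.randint(0, 20)))
        runs = rle_encode(text)
        # Property 1: decoding undoes encoding.
        assert rle_decode(runs) == text
        # Property 2: neighbouring runs never share a character.
        assert all(a[0] != b[0] for a, b in zip(runs, runs[1:]))
```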

That said, the extra surface area of tests on the right gives them a property that is invaluable: the test is more representative of production usage. This is the payoff that pulls us in this direction in the first place. A left-end test proves a function works in isolation; it cannot prove the function is wired up correctly, called with the right arguments, that the three things it coordinates actually coordinate, or that the problem it is solving is actually what the application needs rather than a different problem that is just similar enough that it also fits the ticket’s requirements. The further right you go, the more your test resembles what your users actually do, and the more a green result actually means the thing works.

So as you move right, the cost of the test gets worse (maintenance cost, development effort to set it up, time cost of investigating errors), but in exchange you get a higher fidelity test that better models reality. That’s the big tradeoff. Notice that none of it is about which label the test wears. The label is a side effect of where you landed when writing the test ‡.

‡ You can also (quite successfully in my mind) argue that the structure of the test is a natural consequence of how you structured the code, which makes this just as much a question of application architecture as it is one of testing strategy.

Where you want to land depends on what and why

So if you forget about the labels, how do you think about your test suite? You strip things down to two basic questions: what are you testing and what are you asking of it?

Some tests earn their keep on the left. A pricing calculation, a parser, a date-math utility, a state machine’s transition rules: these all have the kind of self-contained logic that makes them a natural fit for narrow, isolated tests. And the gnarlier and more branch-heavy this logic is, the more valuable your isolated tests become. It’s a lot more work to explore all the complex permutations of a parser from elsewhere in the application than it is to test all of that behavior on the parser in isolation.
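As an illustration, take the state machine case. The order workflow below is invented for this sketch (the states, events, and `next_state` are all hypothetical), but it shows why the left end suits this kind of logic: sweeping every state and event combination costs a couple of loops, with nothing to stand up first.

```python
STATES = ("pending", "paid", "shipped", "cancelled", "refunded")
EVENTS = ("pay", "cancel", "ship", "refund")

# Hypothetical transition rules for an order workflow.
_RULES = {
    ("pending", "pay"): "paid",
    ("pending", "cancel"): "cancelled",
    ("paid", "ship"): "shipped",
    ("paid", "refund"): "refunded",
    ("shipped", "refund"): "refunded",
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def next_state(state: str, event: str) -> str:
    try:
        return _RULES[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} not allowed from {state!r}") from None


def rejects(state: str, event: str) -> bool:
    """Helper: True if the transition is refused."""
    try:
        next_state(state, event)
    except InvalidTransition:
        return True
    return False


def test_allowed_transitions():
    cases = [
        ("pending", "pay", "paid"),
        ("pending", "cancel", "cancelled"),
        ("paid", "ship", "shipped"),
        ("paid", "refund", "refunded"),
        ("shipped", "refund", "refunded"),
    ]
    for state, event, expected in cases:
        assert next_state(state, event) == expected


def test_terminal_states_reject_every_event():
    for state in ("cancelled", "refunded"):
        for event in EVENTS:
            assert rejects(state, event)


def test_unpaid_order_cannot_ship_or_refund():
    assert rejects("pending", "ship")
    assert rejects("pending", "refund")
```

Reaching the same combinations through a running application would mean driving an order into each state before every check.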

Other things only mean anything when they’re on the right. “Can a new user sign up, verify their email, and log in?” is not a question you can answer by testing a function in isolation, no matter how many functions you isolate. The value of the test is that it crosses the boundaries, and the boundaries are where issues often arise for functionality like this. Mock them away and you’ve tested everything except the thing you were worried about.
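A small sketch of how mocking away a boundary can hide exactly the bug that lives there. The `Mailer` class and `sign_up` function are hypothetical, and the bug (two arguments passed in the wrong order) is planted deliberately. `Mock` is the real class from Python’s `unittest.mock`, which accepts any call it is given.

```python
from unittest.mock import Mock


class Mailer:
    """Hypothetical collaborator that sits on the other side of the boundary."""

    def __init__(self) -> None:
        self.outbox: list[tuple[str, str]] = []

    def send_verification(self, email: str, token: str) -> None:
        if "@" not in email:
            raise ValueError(f"not an email address: {email!r}")
        self.outbox.append((email, token))


def sign_up(mailer, email: str, token: str) -> None:
    # Deliberate bug for the sketch: the arguments are swapped.
    mailer.send_verification(token, email)


def sign_up_looks_green(mailer) -> bool:
    """Return True if sign-up appears to succeed against this mailer."""
    try:
        sign_up(mailer, "ada@example.com", "tok123")
    except ValueError:
        return False
    return True


print(sign_up_looks_green(Mock()))    # True: the mock accepts anything
print(sign_up_looks_green(Mailer()))  # False: the real boundary rejects it
```

The function under test is identical in both runs; only the test that crosses the boundary notices the wiring mistake.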

The useful question is never “should we write unit tests or integration tests?” It’s “what am I actually trying to find out, and where on the axis does a test have to live to tell me that?” Sometimes that’s the far left. Sometimes it’s the far right. Usually you want some of both, aimed at different questions.

And on the application, architecture, and tooling

Even with a clear question, where the related test should live on the axis moves depending on the application and the tools available for testing it.

Tests for a pure library with no I/O might live almost entirely on the left, and that’s not laziness; there’s genuinely not much else to exercise. An extreme example would be something like the much-maligned left-pad. If you have a package like this in your project, would you write a narrow test for it, or would you spin up a headless Chrome instance and try to test it by rendering some component that uses it? Sure, they’d both test it, but in the latter case you’ve done way more work, and if the test fails, how much effort will it take to work out that it failed because of a change in left-pad rather than a global style or UI library change?
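For scale, this is roughly all the narrow version takes. The `left_pad` below is a Python stand-in written for this sketch, not the real JavaScript package or its exact behavior.

```python
def left_pad(text: str, length: int, fill: str = " ") -> str:
    """Pad text on the left with fill until it is at least length long."""
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    return fill * max(0, length - len(text)) + text


def test_left_pad():
    assert left_pad("5", 3, "0") == "005"
    assert left_pad("abc", 2) == "abc"  # already long enough: unchanged
    assert left_pad("", 2) == "  "      # default fill is a space
```

If one of those assertions fails, the test name and the failing line already point at the cause.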

A simple CRUD web app’s most valuable tests probably sit somewhere in the middle of the axis, where the routing, validation, serialization, and database access meet, because that seam is where its real behavior actually happens. There probably isn’t a lot of complex gnarly logic that you can only really explore via narrow focused tests, and operations are simple enough that you can probably track down the source of failures without too much difficulty. If the tools for doing so are good enough, you might even go all the way to the right and mostly test the application as a fully integrated running whole.

A distributed system pushes you rightward whether you like it or not: the interesting failures are in the interactions between services, timeouts, retries, and partial failures, none of which exist inside any single component. You can see this in tools like Jepsen that spin up and manipulate a whole running distributed system in order to test it. Or, you could argue that chaos engineering is another expression of this. If you want to answer the question “how will my distributed system perform in production when failures occur” it’s hard to argue with the fidelity you get by actually creating failures in production (although you might have quite the argument on your hands convincing people that this is worth the cost of potentially breaking production).
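A single process can’t reproduce what Jepsen or chaos engineering does, but the underlying move, injecting a failure on purpose and asserting on how the caller copes, can be shown in miniature. `FlakyService` and `fetch_with_retry` are hypothetical names invented for this sketch.

```python
class FlakyService:
    """Fake dependency that times out a set number of times, then recovers."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TimeoutError("injected timeout")
        return "ok"


def fetch_with_retry(service: FlakyService, attempts: int = 3) -> str:
    """Call service.fetch(), retrying on timeout up to attempts times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return service.fetch()
        except TimeoutError as exc:
            last_error = exc
    raise last_error


def test_recovers_after_transient_failures():
    service = FlakyService(failures_before_success=2)
    assert fetch_with_retry(service, attempts=3) == "ok"
    assert service.calls == 3


def test_gives_up_when_failures_persist():
    service = FlakyService(failures_before_success=3)
    try:
        fetch_with_retry(service, attempts=3)
    except TimeoutError:
        pass
    else:
        raise AssertionError("expected the retries to be exhausted")
    assert service.calls == 3
```

What this cannot tell you is how several services behave when they time out on each other at once, which is the part that only shows up further right.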

Tooling moves the line too, and this is the part the blanket rules tend to forget. The setup complexity we talked about previously isn’t fixed: it’s a function of your tools. The day someone wires up a reusable way to reliably spin up a fast, ephemeral database instance per test, a whole column of tests that used to be “too slow and flaky to bother with” quietly becomes cheap and easy. If you think about it, you typically want your tests as far to the right as possible, as long as the cost (maintenance, setup, development effort, etc.) doesn’t climb too high. It’s a bit of a pipe dream, but imagine you live in a world where it is trivial to spin up an entire system with all of its dependencies, trivial to inject arbitrary faults into the system, and trivial to track down the cause of a test failure. In such an environment, wouldn’t you want to write a lot more tests on the right-hand side of the axis rather than the left?
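Here is a minimal sketch of what such a reusable helper might look like. It assumes an in-memory SQLite database as a stand-in for whatever your project really uses (a containerized database would play the same role), and `ephemeral_db`, `register`, and the schema are hypothetical.

```python
import sqlite3
from contextlib import contextmanager

# Hypothetical schema, applied fresh for every test.
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL);
"""


@contextmanager
def ephemeral_db():
    """Yield a brand-new database and throw it away afterwards."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        yield conn
    finally:
        conn.close()


def register(conn: sqlite3.Connection, email: str) -> int:
    """Insert a user and return the generated id."""
    with conn:  # commit on success, roll back on error
        cursor = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    return cursor.lastrowid


def test_register_assigns_first_id():
    with ephemeral_db() as conn:
        assert register(conn, "ada@example.com") == 1


def test_duplicate_email_is_rejected():
    # A fresh database means no state leaks in from the previous test.
    with ephemeral_db() as conn:
        register(conn, "ada@example.com")
        try:
            register(conn, "ada@example.com")
        except sqlite3.IntegrityError:
            return
        raise AssertionError("expected the duplicate email to be rejected")
```

Once a helper like this exists, each new database-backed test costs one `with` line of setup, which is the shift in cost the paragraph above describes.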

This is one of the main reasons I distrust prescriptions for how you should distribute your testing effort. How your tests should be distributed along this axis is inherently going to be a function of things like your architecture, your tooling, what your project does, and other properties that can vary wildly between projects. Even within the same project, these properties change over time as the feature set grows, new tools are introduced, architectures are refined, etc. A rule that ignores this is, at best, a description of what worked on one project at one company at one moment, and it’s never going to be a universal law of nature.

Just write good tests

I’ve spent five sections telling you the categories don’t matter much. Let me be clear about what I’m not saying. I’m not saying the categories are meaningless, or that you should never say “unit test” again, or that there’s no such thing as too many slow end-to-end tests. Those words and phrases are fine as rough directions. What I take issue with is treating them as THE major unit of decision-making.

So here’s the decision-making I’d actually endorse. For anything you want to test, ask:

  1. What am I trying to find out? (What bug would this catch that I care about?)
  2. Where on the axis does a test have to live to find it?
  3. What will that cost me: to write, to run, to keep running, to update when new functionality is added?
  4. Is the thing I learn worth that cost?

And if the answer to #4 is “no”, you have three options:

  1. Don’t test it.
  2. Reduce the cost of the test. How this is done will always be situational and project dependent: it might be adopting testcontainers, it could be refactoring the application, or maybe it’s introducing deployment tooling that allows you to easily spin up full ephemeral test environments.
  3. Decompose the big question into smaller questions that will, hopefully, in aggregate capture the scenario you really care about, and repeat the process.

That’s it. Answer those and the “level” sorts itself out as a byproduct without you ever needing to decide whether you should write a unit test or an integration test. A good test suite isn’t a product of whether it fits into a certain polygon of categories. It’s a set of tests that each answer a question you care about, at a cost you’re willing to pay. Aim for that directly.