Aiming for 100% Unit Test Coverage will Likely Increase the Amount of Bugs in your Code
I have a lot, a lot, a lot to say on the topic of unit tests in software, because I think the vast majority of software developers are doing it wrong. But the explanation for why it’s wrong is very long. I’ve tried several times to write an essay explaining it all, but it always ended up being way too long. So now I will try to break up the argument into smaller modular pieces, and produce multiple essays on the topic. This is the first such essay.
For the purpose of this post, I want to take as an axiom we write unit tests to reduce the chances of introducing bugs into the code base.
I wrote a post previously about how there are many different types of unit tests, and not all of them are regression tests.
Two things:
If you like, imagine the axiom has a bunch of caveats, like “The Primary Purpose of Most Unit Tests is to Reduce Bugs”.
But also, even without the caveats, I don’t think that axiom contradicts that earlier post.
To expand on (2) a bit, in that earlier post, I said:
For example, someone might write unit tests to document the contract of a method.
Also, it looks like I neglected to talk about “Understanding Tests”, which are (often) Unit Tests that document some quirk of the code you don’t own.
I claim that all of the above — documentation tests, understanding tests, and regression tests — ultimately are there to help reduce bugs.
So for the purpose of this post, let’s just accept as an axiom that “yes, the purpose of unit tests is to reduce bugs” so that I can proceed to make my main point.
My main point is that most developers accept this axiom, yet they then proceed to write tests that run counter to this goal. That is to say, the unit tests they write tend to lead to an increase in the number of bugs.
One of the most common mistakes I see is developers trying to increase the “test coverage” metric on their projects.
There’s a whole separate post I can write about Goodharting1, which I'll save for another time, so I won't get into it here. But in addition to the whole problem with Goodharting, trying to increase your test coverage implicitly means you're not thinking about what code should and should not be unit-tested.
Many developers are surprised to discover that there’s code that should not be unit-tested, but it’s true: Your unit tests should assert the contract of your method, not its implementation. That means that your unit tests should not test any part of the code that is a pure implementation detail.
Let me give you an example:
final static int RECTANGLE_OBLONG = 1
final static int RECTANGLE_SQUARE = 2
/**
* Determines the type of rectangle, e.g. square or oblong.
* Behavior is not defined for width or height equal to zero.
*/
int getRectangleType(int width, int height) {
if (width == 0 || height == 0) {
throw new IllegalArgumentException("neither width nor height can be zero");
}
/*TODO: Something weird happens when width is 42. Maybe some sort of hardware bug? See JIRA-123 for more details.*/
if (width == 42) {
logger.debug("Got width of 42 and height was {}", height);
}
if (width == height) {
return RECTANGLE_SQUARE;
} else {
return RECTANGLE_OBLONG;
}
}
An oblong rectangle is a rectangle that’s not a square (i.e., its width is different from its height).
The contract for the getRectangleType
function is that you give it the width and height of the rectangle, and it tells you what type of rectangle you have. That’s primarily implicit, inferred from: the combination of the text of Javadoc; the names and types of the parameters; and the adjacency of the method to the nearby defined constants. But this contract also explicitly calls out that its behavior is not defined when width or height is zero. And also, implicitly, you could infer that it’s probably not defined for widths and heights that are negative, despite the Javadoc not explicitly calling that out.2
If you peek into the implementation of the function, you’ll notice that there seems to be some bug that the developers are trying to track down. Something to do with the number 42. The comment contains what seems to be an ID to an issue in their issue-tracking software, where they are probably tracking the investigation of that bug. It seems like, for now, they’re just logging whenever width is 42 to help with their debugging.
Okay, so given all this, what tests should we write? Or, more pertinently for this article, which lines of code should we not test?
Hopefully, we’re all in agreement that: we should not write a test that mocks the logger, and then asserts that a message gets logged at the debug level when the width is set to 42. That’s an implementation detail that is totally irrelevant to the behavior that callers rely on when they invoke this function. The implementers of this function should be free to remove that line (for example, when they’ve finally root-caused the bug) without breaking any of the tests.
If your tests break when you make a change that did not introduce a bug, that’s bad
Recall our axiom: The purpose of unit tests is to reduce bugs. The mechanism through which this happens is that when a bug is introduced into the code, we are hopeful that the tests will break. Of course, you might get some false negatives (that is, the tests do not break, despite the fact that some change did indeed introduce a bug), but that’s unavoidable, even if you have 100% test coverage. But you really don’t want any false positives: you don’t want your tests to ever break when in fact, there was no bug.
And one of the worst things that can happen is that your unit tests produce so many false positives that your team gets complacent and assumes that anytime the tests break, it’s a false positive: They make a change to the code, a test breaks, and then rather than wonder whether or not a bug got introduced, they just assume no bug was introduced, and they go and change the test around until the tests pass again.
If your team culture has reached this point, not only are your tests completely useless (the output of these tests is now pure noise), but they are also acting as mindless busy-work, needing to be updated after every change. At that point, better to just not have any tests at all: you’ll still be in the state where you have zero tests that help you with detecting bugs, but at least you’ll have more bandwidth to fix the bugs you (manually?) discover anyway.
If a given behavior is not part of the contract, there should not be a test asserting that behavior
The contract tells us what expectation the caller is allowed to have when they invoke the function under question. In the case of our getRectangleType
example, the contract does not guarantee that a log message will be produced when width is 42. Nor does the contract guarantee that the function will refrain from logging a message. So the implementer is free to add or remove log statements as they see fit, without any tests breaking.
This also means there’s an excellent chance that you will not achieve 100% test coverage on this method (unless your test coincidentally uses a width of 42 as one of its test inputs, unrelated to testing the log message). This is not because we’re actively avoiding 100% test coverage. Instead, it is because there will naturally be lines in the implementation that are not related to the contract, but that the implementer found helpful. As long as the contract of the function is thoroughly tested, it doesn’t matter whether the function is 100% covered, 90% covered, or even 1% covered.
Okay, so here’s a more controversial one:
If a contract states that a behavior is undefined, there should not be a test asserting that behavior
The contract explicitly says that the behavior is not defined when width or height is 0. So that means there should not be a test asserting what happens when we pass in a width or height of 0.
In our example, right now, the implementation throws an IllegalArgumentException
. There should not be a test that passes in a width of 0, and asserts that an IllegalArgumentException
gets thrown.
Why not? Because the implementer could change that implementation detail, and it would still satisfy the contract.
For example, the implementer might decide that a rectangle whose width and height are both zero is indeed a square rectangle, and a rectangle where one of the width and height (but not both) is zero is indeed an oblong rectangle.
Or the implementer may add a third constant RECTANGLE_DEGENERATE = 3
, and return that whenever at least one of the width or height is zero.
All of these changes do not violate the contract, and thus do not introduce a bug, and thus should not trigger a test failure. If a test asserted that an exception was thrown, that test would fail despite no new bug being introduced, which would be a false positive, which would be bad.
As an aside, undefined behavior in contracts is a powerful and often underestimated tool for allowing your system to evolve gracefully over time. However, that, too, is a topic that deserves its own future article.3
All this to say that the line that throws the IllegalArgumentException
will also not get covered by your tests, which again means that either you won’t have 100% code coverage or you’ll have net-negative value tests.
Aiming for 100% code coverage will lead you to writing worse-than-useless tests, which increase the odds of shifting your team culture into a mode with an increased tendency to introduce bugs.
The concept of an “implicit contract” versus an “explicit contract” deserves its own post, so I won’t get too deep into it here. By “contract”, I’m referring to the (sometimes implicit) agreement between the caller of a function and the implementer of a function: An agreement as to what expectations the implementer can make about the inputs to the function (the preconditions), and what expectations the caller can make about the outputs to the function (the post-conditions).
You can use formal methods and require that contracts be fully explicit and machine-verified. However, in practice, handwavily describing the contract in Javadoc, with readers generously interpreting the Javadoc (as opposed to “maliciously lawyering” the exact wording), gets you 80% of the benefit of contracts with 20% of the effort.
I made up those two numbers.
The basic idea is that tests are covariant in their post-conditions and contravariant in their preconditions. Of course, you want your functions to “do something useful”, or else no caller will ever call them. But by judiciously carving out parts of the behavior-space that the callers don’t really care about, you open up extension points for your contract to evolve without breaking any existing callers. This is extra important if your callers are “distant” from you, such that you cannot simply make a breaking change and expect to be able to track down all the callers and fix them.