Bugs and Battleships : ezyang's blog

Bugs and Battleships

December 19, 2011

Do you remember your first computer program? When you had finished writing it, what was the first thing you did? You did the simplest possible test: you ran it.

As programs increase in size, so do the amount of possible tests. It’s worth considering which tests we actually end up running: imagine the children’s game Battleship, where the ocean is the space of all possible program executions, the battleships are the bugs that you are looking for, and each individual missile you fire is a test you run (white if the test passes, red if the test fails.) You don’t have infinite missiles, so you have to decide where you are going to send them.

In the case of “your first computer program,” the answer seems pretty obvious: there’s only one way to run the program, only a few cases to test.

But this fantasy is quickly blown away by an encounter with real software. Even if your program has no inputs, hardware, operating system, development environment, and other environmental factors immediately increase the space of tests. Add explicit inputs and nondeterminism to the application, and you’re looking at the difference between a swimming pool and an ocean.

How do we decide what to test? What is our strategy—where do we send more missiles, where do we send less? Different testing strategies result in different distributions of tests on the space of all possible executions. Even though we may not be thinking about the distribution of test cases when we write up tests or run the whole system in an integration test, different test strategies result in different coverage.

For example, you might decide not to do any tests, and rely on your users to give you bug reports. The result is that you will end up with high coverage in frequently used areas of your application, and much less coverage in the rarely used areas. In some sense, this is an optimal strategy when you have a large user base willing to tolerate failure—though anyone who has run into bugs using software in unusual circumstances might disagree!

There is a different idea behind regression testing, where you add an automatic test for any bug that occurred in the past. Instead of focusing coverage on frequently used area, a regression test suite will end up concentrated on “tricky” areas of the application, the areas where the most bugs have been found in the past. The hypothesis behind this strategy is that regions of code that historically had bugs are more likely to have bugs in the future.

You might even have some a priori hypotheses about where bugs in applications occur; maybe you think that boundary cases in the application are most likely to have bugs. Then you might reasonable focus your testing efforts on those areas on the outset.

Other testing strategies might focus specifically on the distribution of tests. This is especially important when you are concerned about worst-case behavior (e.g. security vulnerabilities) as opposed to average-case behavior (ordinary bugs.) Fuzz testing, for example, involves randomly spattering the test space without any regard to such things as usage frequency: the result is that you get a lot more distribution on areas that are rarely used and don’t have many discovered bugs.

You might notice, however, that while fuzz testing changes the distribution of tests, it doesn’t give any guarantees. In order to guarantee that there aren’t any bugs, you’d have to test every single input, which in modern software engineering practice is impossible. Actually, there is a very neat piece of technology called the model checker, designed specifically with all manner of tricks for speed to do this kind of exhaustive testing. For limited state spaces, anyway—there are also more recent research projects (e.g. Alloy) which perform this exhaustive testing, but only up to a certain depth.

Model checkers are “dumb” in some sense, in that they don’t really understand what the program is trying to do. Another approach we might take is to take advantage of the fact that we know how our program works, in order to pick a few, very carefully designed test inputs, which “generalize” to cover the entire test space. (We’ll make this more precise shortly.)

The diagram above is a bit misleading, however: test-cases rarely generalize that readily. One might even say that the ability to generalize behavior of specific tests to the behavior of the program is precisely what distinguishes a good program from a bad one. A bad program is filled with many, many different cases, all of which must be tested individually in order to achieve assurance. A good program is economical in its cases, it tries to be as complex as the problem it tries to solve, and no more.

What does it mean to say that a test-case generalizes? My personal belief is that chunks of the test input space which are said to be equivalent to each other correspond to a single case, part of a larger mathematical proof, which can be argued in a self-contained fashion. When you decompose a complicated program into parts in order to explain what it does, each of those parts should correspond to an equivalence partition of the program.

The corollary of this belief is that good programs are easy to prove correct.

This is a long way from “running the program to see if it works.” But I do think this is a necessary transition for any software engineer interested in making correct and reliable software (regardless of whether or not they use any of the academic tools like model checkers and theorem provers which take advantage of this way of thinking.) At the end of the day, you will still need to write tests. But if you understand the underlying theory behind the distributions of tests you are constructing, you will be much more effective.

Postscript. The relationship between type checking and testing is frequently misunderstood. I think this diagram sums up the relationship well:

Types eliminate certain regions of bugs and fail to affect others. The idea behind dependent types is to increase these borders until they cover all of the space, but the benefits are very tangible even if you only manage to manage a subset of the test space.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

26 Comments

paurullan
2011-12-19 13:45:47
Great :D
David Barbour
2011-12-19 14:52:09
Excellent article.
Joel Wang
2011-12-20 04:47:44
As a novice software tester I find this very intuitive. Thanks!
Maintenance Man
2011-12-20 15:39:56
At first I thought your diagrams were battleship layouts. Heheh.
Looking at your last diagram, is the conclusion of the matter to employ types and type checking to increase testing efficiency?
Aidan Delaney
2011-12-20 17:42:33
As someone who has taught software testing to undergrads for several years, I commend your insight and choice of metaphor. This post clarifies exactly what I try and teach. Would you mind if I used some of the text and images on slides in my class?
John Regehr
2011-12-20 23:10:07
Nice!
mpe
2011-12-21 07:51:36
Great article. Though I think your final picture grossly overstates the space that is taken up by bugs that type checking can find, at least in real programs.
Natalia
2011-12-21 14:02:16
Good article! Thank you!
Elliot Smith
2011-12-21 18:57:02
Great article. Thanks
Anonymous
2011-12-22 09:57:49
make programs single-shot, only. compose them.
Pk
2011-12-25 17:31:13
Edward, if you write a book in the same approach as your articles, I would buy it.
Critic
2011-12-25 19:06:49
Um .. so what’s new? I think Dijkstra has said everything you just did, and much better.
Edward Z. Yang
2011-12-25 19:17:42
Are you referring to the very end, where I channel Dijkstra’s essay “On the Cruelty of Really Teaching Computer Science.” Actually, that was never the original intent of the essay (which was this vague idea I had in my head that test suites ought not to be thought of as discrete things, but as continuous distributions.) I may be channeling Dijkstra there, but the relevant essay escapes me.
Anonymous
2011-12-31 14:24:39
It occurs to me that modularity is kind of like separating the rows and columns of the Battleship board, in order to reduce the number of dimensions you need to search in.
Pelle
2012-01-01 13:56:55
Great article. Also have you looked at QuickCheck type of property based testing?
Tom Crayford
2012-01-06 18:21:40
This post dearly needs a diagram about isolated (ie mock/stub based) unit testing.
Edward Z. Yang
2012-01-06 22:44:52
Tom: Metaphors should not be over-stretched, and I’m not really sure what mock/stub based testing would look like.
rdm
2012-03-15 12:34:19
This was a great read.
That said, this essay said almost nothing about “feature testing”. (Feature testing is not focused on “finding bugs”, it’s about dealing with cost/benefit issues, it’s about sketching out the design, and is about tracking who to be discussing priorities with, when things need to be changing. Features are not uniformly valuable, and have costs.)
Edward Z. Yang
2012-03-15 17:40:17
Yes, I’m not really trying to talk about feature testing, because it’s a bit different. Bugs tend to be a bit more clear cut (though they’re not black and white), cost/benefit of features is much mushier. Certainly you’re not going to be able to formally verify something like this.
Illustration in Teaching | My Own Fortune
2012-09-22 09:05:00
[…] “Bugs and Battleships” by Edward Z. Yang […]
New top story on Hacker News: Bugs and Battleships (2011) – protipsss
2020-04-06 11:41:20
[…] Bugs and Battleships (2011) 3 by gklitt | 0 comments on Hacker News. […]
New top story on Hacker News: Bugs and Battleships (2011) – News about world
2020-04-06 11:41:53
[…] Bugs and Battleships (2011) 3 by gklitt | 0 comments on Hacker News. […]
New top story on Hacker News: Bugs and Battleships (2011) – Latest news
2020-04-06 11:42:23
[…] Bugs and Battleships (2011) 3 by gklitt | 0 comments on Hacker News. […]
New top story on Hacker News: Bugs and Battleships (2011) – Hckr News
2020-04-06 11:53:32
[…] Bugs and Battleships (2011) 4 by gklitt | 0 comments on Hacker News. […]
New top story on Hacker News: Bugs and Battleships (2011) – Ultimate News
2020-04-06 11:55:18
[…] Bugs and Battleships (2011) 5 by gklitt | 0 comments on Hacker News. […]
Bugs and Battleships (2011) | Hacker News
2020-07-20 12:10:51
[…] Bugs and Battleships (2011) (ezyang.com) […]