December 19, 2011
Do you remember your first computer program? When you had finished writing it, what was the first thing you did? You did the simplest possible test: you ran it.
As programs increase in size, so do the amount of possible tests. It’s worth considering which tests we actually end up running: imagine the children’s game Battleship, where the ocean is the space of all possible program executions, the battleships are the bugs that you are looking for, and each individual missile you fire is a test you run (white if the test passes, red if the test fails.) You don’t have infinite missiles, so you have to decide where you are going to send them.

In the case of “your first computer program,” the answer seems pretty obvious: there’s only one way to run the program, only a few cases to test.
But this fantasy is quickly blown away by an encounter with real software. Even if your program has no inputs, hardware, operating system, development environment, and other environmental factors immediately increase the space of tests. Add explicit inputs and nondeterminism to the application, and you’re looking at the difference between a swimming pool and an ocean.

How do we decide what to test? What is our strategy—where do we send more missiles, where do we send less? Different testing strategies result in different distributions of tests on the space of all possible executions. Even though we may not be thinking about the distribution of test cases when we write up tests or run the whole system in an integration test, different test strategies result in different coverage.
For example, you might decide not to do any tests, and rely on your users to give you bug reports. The result is that you will end up with high coverage in frequently used areas of your application, and much less coverage in the rarely used areas. In some sense, this is an optimal strategy when you have a large user base willing to tolerate failure—though anyone who has run into bugs using software in unusual circumstances might disagree!

There is a different idea behind regression testing, where you add an automatic test for any bug that occurred in the past. Instead of focusing coverage on frequently used area, a regression test suite will end up concentrated on “tricky” areas of the application, the areas where the most bugs have been found in the past. The hypothesis behind this strategy is that regions of code that historically had bugs are more likely to have bugs in the future.

You might even have some a priori hypotheses about where bugs in applications occur; maybe you think that boundary cases in the application are most likely to have bugs. Then you might reasonable focus your testing efforts on those areas on the outset.

Other testing strategies might focus specifically on the distribution of tests. This is especially important when you are concerned about worst-case behavior (e.g. security vulnerabilities) as opposed to average-case behavior (ordinary bugs.) Fuzz testing, for example, involves randomly spattering the test space without any regard to such things as usage frequency: the result is that you get a lot more distribution on areas that are rarely used and don’t have many discovered bugs.

You might notice, however, that while fuzz testing changes the distribution of tests, it doesn’t give any guarantees. In order to guarantee that there aren’t any bugs, you’d have to test every single input, which in modern software engineering practice is impossible. Actually, there is a very neat piece of technology called the model checker, designed specifically with all manner of tricks for speed to do this kind of exhaustive testing. For limited state spaces, anyway—there are also more recent research projects (e.g. Alloy) which perform this exhaustive testing, but only up to a certain depth.

Model checkers are “dumb” in some sense, in that they don’t really understand what the program is trying to do. Another approach we might take is to take advantage of the fact that we know how our program works, in order to pick a few, very carefully designed test inputs, which “generalize” to cover the entire test space. (We’ll make this more precise shortly.)

The diagram above is a bit misleading, however: test-cases rarely generalize that readily. One might even say that the ability to generalize behavior of specific tests to the behavior of the program is precisely what distinguishes a good program from a bad one. A bad program is filled with many, many different cases, all of which must be tested individually in order to achieve assurance. A good program is economical in its cases, it tries to be as complex as the problem it tries to solve, and no more.

What does it mean to say that a test-case generalizes? My personal belief is that chunks of the test input space which are said to be equivalent to each other correspond to a single case, part of a larger mathematical proof, which can be argued in a self-contained fashion. When you decompose a complicated program into parts in order to explain what it does, each of those parts should correspond to an equivalence partition of the program.
The corollary of this belief is that good programs are easy to prove correct.

This is a long way from “running the program to see if it works.” But I do think this is a necessary transition for any software engineer interested in making correct and reliable software (regardless of whether or not they use any of the academic tools like model checkers and theorem provers which take advantage of this way of thinking.) At the end of the day, you will still need to write tests. But if you understand the underlying theory behind the distributions of tests you are constructing, you will be much more effective.
Postscript. The relationship between type checking and testing is frequently misunderstood. I think this diagram sums up the relationship well:

Types eliminate certain regions of bugs and fail to affect others. The idea behind dependent types is to increase these borders until they cover all of the space, but the benefits are very tangible even if you only manage to manage a subset of the test space.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
November 14, 2011tl;dr — Save this page for future reference.
Have you ever been in the situation where you need to quickly understand what a piece of code in some unfamiliar language does? If the language looks a lot like what you’re comfortable with, you can usually guess what large amounts of the code does; even if you may not be completely familiar how all the language features work.
For Haskell, this is a little more difficult, since Haskell syntax looks very different from traditional languages. But there’s no really deep difference here; you just have to squint at it just right. Here is a fast, mostly incorrect, and hopefully useful guide for interpreting Haskell code like a Pythonista. By the end, you should be able to interpret this fragment of Haskell (some code elided with ...):
runCommand env cmd state = ...
retrieveState = ...
saveState state = ...
main :: IO ()
main = do
args <- getArgs
let (actions, nonOptions, errors) = getOpt Permute options args
opts <- foldl (>>=) (return startOptions) actions
when (null nonOptions) $ printHelp >> throw NotEnoughArguments
command <- fromError $ parseCommand nonOptions
currentTerm <- getCurrentTerm
let env = Environment
{ envCurrentTerm = currentTerm
, envOpts = opts
}
saveState =<< runCommand env command =<< retrieveState
Types. Ignore everything you see after :: (similarly, you can ignore type, class, instance and newtype. Some people claim that types help them understand code; if you’re a complete beginner, things like Int and String will probably help, and things like LayoutClass and MonadError won’t. Don’t worry too much about it.)
Arguments. f a b c translates into f(a, b, c). Haskell code omits parentheses and commas. One consequence of this is we sometimes need parentheses for arguments: f a (b1 + b2) c translates into f(a, b1 + b2, c).
Dollar sign. Since complex statements like a + b are pretty common and Haskellers don’t really like parentheses, the dollar sign is used to avoid parentheses: f $ a + b is equivalent to the Haskell code f (a + b) and translates into f(a + b). You can think of it as a big opening left parenthesis that automatically closes at the end of the line (no need to write )))))) anymore!) In particular, if you stack them up, each one creates a deeper nesting: f $ g x $ h y $ a + b is equivalent to f (g x (h y (a + b))) and translates into f(g(x,h(y,a + b)) (though some consider this bad practice).
In some code, you may see a variant of $: <$> (with angled brackets). You can treat <$> the same way as you treat $. (You might also see <*>; pretend that it’s a comma, so f <$> a <*> b translates to f(a, b). There’s not really an equivalent for regular $)
Backticks. x `f` y translates into f(x,y). The thing in the backticks is a function, usually binary, and the things to the left and right are the arguments.
Equals sign. Two possible meanings. If it’s at the beginning of a code block, it just means you’re defining a function:
doThisThing a b c = ...
==>
def doThisThing(a, b, c):
...
Or if you see it to near a let keyword, it’s acting like an assignment operator:
let a = b + c in ...
==>
a = b + c
...
Left arrow. Also acts like an assignment operator:
a <- createEntry x
==>
a = createEntry(x)
Why don’t we use an equals sign? Shenanigans. (More precisely, createEntry x has side effects. More accurately, it means that the expression is monadic. But that’s just shenanigans. Ignore it for now.)
Right arrow. It’s complicated. We’ll get back to them later.
Do keyword. Line noise. You can ignore it. (It does give some information, namely that there are side effects below, but you never see this distinction in Python.)
Return. Line-noise. Also ignore. (You’ll never see it used for control flow.)
Dot. f . g $ a + b translates to f(g(a + b)). Actually, in a Python program you’d probably have been more likely to see:
x = g(a + b)
y = f(x)
But Haskell programmers are allergic to extra variables.
Bind and fish operators. You might see things like =<<, >>=, <=< and >=>. These are basically just more ways of getting rid of intermediate variables:
doSomething >>= doSomethingElse >>= finishItUp
==>
x = doSomething()
y = doSomethingElse(x)
finishItUp(y)
Sometimes a Haskell programmer decides that it’s prettier if you do it in the other direction, especially if the variable is getting assigned somewhere:
z <- finishItUp =<< doSomethingElse =<< doSomething
==>
x = doSomething()
y = doSomethingElse(x)
z = finishItUp(y)
The most important thing to do is to reverse engineer what’s actually happening by looking at the definitions of doSomething, doSomethingElse and finishItUp: it will give you a clue what’s “flowing” across the fish operator. If you do that, you can read <=< and >=> the same way (they actually do function composition, like the dot operator). Read >> like a semicolon (e.g. no assignment involved):
doSomething >> doSomethingElse
==>
doSomething()
doSomethingElse()
Partial application. Sometimes, Haskell programmers will call a function, but they won’t pass enough arguments. Never fear; they’ve probably arranged for the rest of the arguments to be given to the function somewhere else. Ignore it, or look for functions which take anonymous functions as arguments. Some of the usual culprits include map, fold (and variants), filter, the composition operator ., the fish operators (=<<, etc). This happens a lot to the numeric operators: (+3) translates into lambda x: x + 3.
Control operators. Use your instinct on these: they do what you think they do! (Even if you think they shouldn’t act that way.) So if you see: when (x == y) $ doSomething x, it reads like “When x equals y, call doSomething with x as an argument.”
Ignore the fact that you couldn’t actually translate that into when(x == y, doSomething(x)) (Since, that would result in doSomething always being called.) In fact, when(x == y, lambda: doSomething x) is more accurate, but it might be more comfortable to just pretend that when is also a language construct.
if and case are built-in keywords. They work the way you’d expect them to.
Right arrows (for real!) Right arrows have nothing to do with left arrows. Think of them as colons: they’re always nearby the case keyword and the backslash symbol, the latter of which is lambda: \x -> x translates into lambda x: x.
Pattern matching using case is a pretty nice feature, but a bit hard to explain in this blog post. Probably the easiest approximation is an if..elif..else chain with some variable binding:
case moose of
Foo x y z -> x + y * z
Bar z -> z * 3
==>
if isinstance(moose, Foo):
x = moose.x # the variable binding!
y = moose.y
z = moose.z
return x + y * z
elif isinstance(moose, Bar):
z = moose.z
return z * 3
else:
raise Exception("Pattern match failure!")
Bracketing. You can tell something is a bracketing function if it starts with with. They work like contexts do in Python:
withFile "foo.txt" ReadMode $ \h -> do
...
==>
with open("foo.txt", "r") as h:
...
(You may recall the backslash from earlier. Yes, that’s a lambda. Yes, withFile is a function. Yes, you can define your own.)
Exceptions. throw, catch, catches, throwIO, finally, handle and all the other functions that look like this work essentially the way you expect them to. They may look a little funny, however, because none of these are keywords: they’re all functions, and follow all those rules. So, for example:
trySomething x `catch` \(e :: IOException) -> handleError e
===
catch (trySomething x) (\(e :: IOException) -> handleError e)
==>
try:
trySomething(x)
except IOError as e:
handleError(e)
Maybe. If you see Nothing, it can be thought of as None. So isNothing x tests if x is None. What’s the opposite of it? Just. For example, isJust x tests if x is not None.
You might see a lot of line noise associated with keeping Just and None in order. Here’s one of the most common ones:
maybe someDefault (\x -> ...) mx
==>
if mx is None:
x = someDefault
else:
x = mx
...
Here’s one specific variant, for when a null is an error condition:
maybe (error "bad value!") (\x -> ...) x
==>
if x is None:
raise Exception("bad value!")
Records. The work they way you’d expect them too, although Haskell lets you create fields that have no names:
data NoNames = NoNames Int Int
data WithNames = WithNames {
firstField :: Int,
secondField :: Int
}
So NoNames would probably be represented as a tuple (1, 2) in Python, and WithNames a class:
class WithNames:
def __init__(self, firstField, secondField):
self.firstField = firstField
self.secondField = secondField
Then creation is pretty simple NoNames 2 3 translates into (2, 3), and WithNames 2 3 or WithNames { firstField = 2, secondField = 3 } translates into WithNames(2,3).
Accessors are a little more different. The most important thing to remember is Haskellers put their accessors before the variable, whereas you might be most familiar with them being after. So field x translates to x.field. How do you spell x.field = 2? Well, you can’t really do that. You can copy one with modifications though:
return $ x { field = 2 }
==>
y = copy(x)
y.field = 2
return y
Or you can make one from scratch if you replace x with the name of the data structure (it starts with a capital letter). Why do we only let you copy data structures? This is because Haskell is a pure language; but don’t let that worry you too much. It’s just another one of Haskell’s quirks.
List comprehensions. They originally came from the Miranda-Haskell lineage! There are just more symbols. :
[ x * y | x <- xs, y <- ys, y > 2 ]
==>
[ x * y for x in xs for y in ys if y > 2 ]
It also turns out Haskellers often prefer list comprehensions written in multi-line form (perhaps they find it easier to read). They look something like:
do
x <- xs
y <- ys
guard (y > 2)
return (x * y)
So if you see a left arrow and it doesn’t really look like it’s doing side effects, maybe it’s a list comprehension.
More symbols. Lists work the way you would expect them to in Python; [1, 2, 3] is in fact a list of three elements. A colon, like x:xs means construct a list with x at the front and xs at the back (cons, for you Lisp fans.) ++ is list concatenation. !! means indexing. Backslash means lambda. If you see a symbol you don’t understand, try looking for it on Hoogle (yes, it works on symbols!).
More line noise. The following functions are probably line noise, and can probably be ignored. liftIO, lift, runX (e.g. runState), unX (e.g. unConstructor), fromJust, fmap, const, evaluate, an exclamation mark before an argument (f !x), seq, a hash sign (e.g. I# x).
Bringing it all together. Let’s return to the original code fragment:
runCommand env cmd state = ...
retrieveState = ...
saveState state = ...
main :: IO ()
main = do
args <- getArgs
let (actions, nonOptions, errors) = getOpt Permute options args
opts <- foldl (>>=) (return startOptions) actions
when (null nonOptions) $ printHelp >> throw NotEnoughArguments
command <- fromError $ parseCommand nonOptions
currentTerm <- getCurrentTerm
let env = Environment
{ envCurrentTerm = currentTerm
, envOpts = opts
}
saveState =<< runCommand env command =<< retrieveState
With some guessing, we can pop out this translation:
def runCommand(env, cmd, state):
...
def retrieveState():
...
def saveState(state):
...
def main():
args = getArgs()
(actions, nonOptions, errors) = getOpt(Permute(), options, args)
opts = **mumble**
if nonOptions is None:
printHelp()
raise NotEnoughArguments
command = parseCommand(nonOptions)
currentTerm = getCurrentTerm()
env = Environment(envCurrentTerm=currentTerm, envOpts=opts)
state = retrieveState()
result = runCommand(env, command, state)
saveState(result)
This is not bad, for a very superficial understanding of Haskell syntax (there’s only one obviously untranslatable bit, which requires knowing what a fold is. Not all Haskell code is folds; I’ll repeat, don’t worry about it too much!)
Most of the things I have called “line noise” actually have very deep reasons behind them, and if you’re curious behind the actual reasons behind these distinctions, I recommend learning how to write Haskell. But if you’re just reading Haskell, I think these rules should be more than adequate.
Thanks to Keegan McAllister, Mats Ahlgren, Nelson Elhage, Patrick Hurst, Richard Tibbetts, Andrew Farrell and Geoffrey Thomas for comments. Also thanks to two kind denizens of #python, asdf and talljosh`, for acting as Python-using guinea pigs.
Postscript. If you’re really curious what foldl (>>=) (return startOptions) actions does, it implements the chain of responsibility pattern. Hell yeah.
October 24, 2011What do automatic memory management, static types and purity have in common? They are methods which take advantage of the fact that we can make programs obviously correct (for some partial definition of correctness) upon visual inspection. Code using automatic memory management is obviously correct for a class of memory bugs. Code using static types is obviously correct for a class of type bugs. Code using purity (no mutable references or side effects) is obviously correct for a class of concurrency bugs. When I take advantage of any of these techniques, I don’t have to prove my code has no bugs: it just is, automatically!
Unfortunately, there’s a catch. What all of these “obviously correct” methodologies ask you do is to sacrifice varying degrees of expressiveness at their altar. No more pointer tricks. No more playing fast and loose with data representation. No more mutation. If this expressiveness was something most people really didn’t want anyway (e.g. memory management), it is happily traded away. But if it’s something they want, well, as language designers, we’re making it harder for people to do things that they want to do, and it shouldn’t surprise us when they grab their torches and pitchforks and storm the ivory tower, assertions about correctness and maintainability be damned.
It seems to me that we must fight fire with fire: if we’re going to take away features, we better be giving them compelling new features. With static types you also get pattern matching, QuickCheck style property testing, and performance benefits. With purity, you get software transactional memory and speculative evaluation. Discovering and implementing more of these “killer apps” is the key to adoption. (Some research that I’m currently doing with Adam Chlipala is leveraging purity to offer automatic caching for web applications. It’s not much, but I think it’s in the right direction.)
I still have a fanatical devotion to correctness. But these days, I suspect that for most people, it’s something bitter, like medicine, to be taken with some better tasting features. That’s fine. Our challenge, as programming language researchers, is to exploit correctness to bring tangible benefits now, rather than nebulous maintainability benefits later.
Thanks Nelson Elhage and Keegan McAllister for their comments.
Postscript: Performance of static types versus dynamic types. An earlier draft of this post pointed at Quora’s decision to move to Scala from Python as a clear indicator of this fact. Unfortunately, as several pre-readers pointed out, there are too many confounding factors to make this claim definitive: CPython was never explicitly engineered for performance, whereas the JVM had decades of work poured into it. So I’ll have to leave you with a more theoretical argument for the performance of static types: the optimization techniques of runtime just-in-time compilers for dynamic compilers involves identifying sections of code which are actually statically typed, and compiling them into the form a static compiler will. So, if you know this information ahead of time, you will always do better than if you know this information later: it’s only a question of degree. (Of course, this doesn’t address the possibility that JIT can identify information that would have been difficult to determine statically.)
Postscript: Shared transactional memory. Joe Duffy had a great retrospective on transactional memory and the experience he had attempting to implement it for Microsoft’s stack. And despite a great enthusiasm for this idea, it’s interesting to note this quote:
Throughout all of this, we searched and searched for the killer TM app. It’s unfair to pin this on TM, because the industry as a whole still searches for a killer concurrency app. But as we uncovered more successes in the latter, I became less and less convinced that the killer concurrency apps we will see broadly deployed in the next 5 years needed TM. Most enjoyed natural isolation, like embarrassingly parallel image processing apps. If you had sharing, you were doing something wrong.
Richard Tibbetts points out that concurrency is often addressed at an architectural level lower than what most working programmers want to deal with, and so while STM is a killer application for those platforms, most developers don’t want to think about concurrency at all.