ezyang's blog

the arc of software bends towards understanding

The new Reflections on Trusting Trust

In his classic essay Reflections on Trusting Trust, Ken Thompson describes a self-replicating compiler bug which is undetectable by source code inspection. The self-replication is made possible by the fact that most compilers are self-compiling: old versions of a compiler are used to compile new ones, and if the old version is malicious, it can slip the same bug into the new compiler when it detects it is compiling itself.

A new trend is precisely this self-hosting process, but for self-certifying typecheckers: typecheckers which are used to prove their own correctness. (Note that these are powerful typecheckers, close to being able to check arbitrary theorems about code.) This may seem a little odd, since I could write a trivial typechecker which always claimed it was correct. In order to work around this, we must bootstrap the correctness proof by proving the typechecker correct in another language (in the case of F*, this language is Coq). Once this has been done, we can then use this verified typechecker to check a specification of itself. This process is illustrated below.

image

image

image

The question then is whether or not such self-certifying typecheckers are similarly vulnerable to the problem Ken described for self-hosting compilers. For argument’s sake, let’s assume that the backend compiler and runtime are certified (a strong assumption that is almost universally untrue, including for F*). Since the typechecker can’t insert malicious bugs into the programs it compiles (it only, you know, typechecks), one would have to rely on a bug in the source code itself. Surely such a bug would be obvious!

This is unclear: we have certified our implementation, but what of our specification? In Coq, we proved various theorems about the soundness and adequacy of our type system, which give us at least some hope that it is correct in the way we expect. But these proofs are nowhere to be seen in the emancipated F* world. If we want to evolve our specification (less plausible for a full-blooded dependently typed language, but within the realm of possibility for a less powerful one), we must turn back to Coq and adjust the relevant theorems. Otherwise, we run the risk of changing our type system to an unsound one.

image

image

Fortunately, that’s all we have to do: we can use the old F* type checker to certify the new one, rather than attempt to export certificates and reverify them with Coq. All told, though, don’t throw out your Coq code yet… not, at least, if you think your type system may change in the future.

Obviously Correct

What do automatic memory management, static types and purity have in common? They are methods which take advantage of the fact that we can make programs obviously correct (for some partial definition of correctness) upon visual inspection. Code using automatic memory management is obviously correct for a class of memory bugs. Code using static types is obviously correct for a class of type bugs. Code using purity (no mutable references or side effects) is obviously correct for a class of concurrency bugs. When I take advantage of any of these techniques, I don’t have to prove my code has no bugs: it just is, automatically!

Unfortunately, there’s a catch. What all of these “obviously correct” methodologies ask you to do is sacrifice varying degrees of expressiveness at their altar. No more pointer tricks. No more playing fast and loose with data representation. No more mutation. If this expressiveness was something most people really didn’t want anyway (e.g. memory management), it is happily traded away. But if it’s something they want, well, as language designers, we’re making it harder for people to do things that they want to do, and it shouldn’t surprise us when they grab their torches and pitchforks and storm the ivory tower, assertions about correctness and maintainability be damned.

It seems to me that we must fight fire with fire: if we’re going to take away features, we better be giving them compelling new features. With static types you also get pattern matching, QuickCheck style property testing, and performance benefits. With purity, you get software transactional memory and speculative evaluation. Discovering and implementing more of these “killer apps” is the key to adoption. (Some research that I’m currently doing with Adam Chlipala is leveraging purity to offer automatic caching for web applications. It’s not much, but I think it’s in the right direction.)

I still have a fanatical devotion to correctness. But these days, I suspect that for most people, it’s something bitter, like medicine, to be taken with some better tasting features. That’s fine. Our challenge, as programming language researchers, is to exploit correctness to bring tangible benefits now, rather than nebulous maintainability benefits later.

Thanks Nelson Elhage and Keegan McAllister for their comments.


Postscript: Performance of static types versus dynamic types. An earlier draft of this post pointed at Quora’s decision to move to Scala from Python as a clear indicator of this fact. Unfortunately, as several pre-readers pointed out, there are too many confounding factors to make this claim definitive: CPython was never explicitly engineered for performance, whereas the JVM had decades of work poured into it. So I’ll have to leave you with a more theoretical argument for the performance of static types: the optimization techniques of runtime just-in-time compilers for dynamic languages involve identifying sections of code which are actually statically typed, and compiling them into the form a static compiler would have produced. So, if you know this information ahead of time, you will always do better than if you only learn it later: it’s only a question of degree. (Of course, this doesn’t address the possibility that a JIT can identify information that would have been difficult to determine statically.)

Postscript: Software transactional memory. Joe Duffy wrote a great retrospective on transactional memory and his experience attempting to implement it for Microsoft’s stack. Despite his great enthusiasm for the idea, it’s interesting to note this quote:

Throughout all of this, we searched and searched for the killer TM app. It’s unfair to pin this on TM, because the industry as a whole still searches for a killer concurrency app. But as we uncovered more successes in the latter, I became less and less convinced that the killer concurrency apps we will see broadly deployed in the next 5 years needed TM. Most enjoyed natural isolation, like embarrassingly parallel image processing apps. If you had sharing, you were doing something wrong.

Richard Tibbetts points out that concurrency is often addressed at an architectural level lower than what most working programmers want to deal with, and so while STM is a killer application for those platforms, most developers don’t want to think about concurrency at all.

Polyglot programming

Being back in town over MIT’s Independent Activities Period is making me think about what kind of short lecture I want to try teaching. I’ve been turning over the idea of a polyglot programming class in my head: the idea is that while most people feel really comfortable with only one or two programming languages, you can learn how to make this knowledge work for you in almost any programming language you could possibly encounter.

Unfortunately, I don’t have a good idea of what these skills actually are, nor do I have a sense for what kinds of things people would want to know. Nor do I think that I could jam this into two hours of lecturing: topics that I probably would want to cover are:

History of Programming Languages. Knowing how all the lineages tie together will help you figure out when a language feature will work the way you expect it to (since they just stole it from another language in the same line), and when, actually, it won’t work at all. It lets you nicely encapsulate the main big ideas of language features, which you can then explore the infinite variations of. It gives you groups of languages which mostly have the same idioms.

Street smarts and bootstrapping. What are the first things you should look for when you’re getting acquainted with a new language? Syntax? Cheat sheets? Tutorials? How to organize this torrent of information, what to do first, where to ask questions, what to learn how to do. How to interpret error messages you know nothing about. How to navigate the development ecosystem and assess libraries you know nothing about. How to source dive code in languages you know nothing about. Common bumps on the road towards Hello World. Unusual and universal ways of debugging.

Interoperability and FFI. What are the basic building blocks for higher-level data types in most of these languages, and what do they look like in memory? How do you make lots of different languages talk to each other efficiently? How about garbage collection, reference pinning and concurrency? Common impedance mismatches between languages.

Suggestions and comments would be appreciated.

Why you shouldn't do a PhD in systems

The opinions presented in this post are not necessarily mine. I’m just one very confused undergraduate senior with a lot of soul searching to do.

When I tell my friends, “I’m going to get a PhD,” I sometimes get the response, “Good for you!” But other times, I get the response, “Why would you want to do that?” as if I was some poor, misguided soul, brainwashed by a society which views research as the “highest calling”, the thing that the best go to, while the rest go join industry. “If you’re a smart hacker and you join industry,” they say, “you will have more fun immediately, making more impact and more money.”

Time. Getting a PhD takes a lot of time, they tell me. Most programs like to advertise something like five, or maybe six years, but if you actually look at the statistics, it’s actually something that could potentially extend up to nine or ten years. That’s a really huge chunk of your life: effectively all of your twenties and, honestly, there might be much better things for you to do with these tender years of your life. A PhD is merely a stepping stone, a necessary credential to be taken seriously in academia but not anything indicative of having made a significant contribution to a field. And for a stepping stone, it’s an extremely time consuming one.

There are many other things that could have happened during this time. You could have begun a startup and seen it get acquired or sink over a period of three years: six years seems like two lifetimes in a context like that. You could have begun a career as a professional in an extremely hot job market for software engineers, hopping from job to job until you found work that you were genuinely interested in: as a PhD you are shackled to your advisor and your university. The facilities for change are so incredibly heavyweight: if you switch advisors you effectively have to start over, and it’s easy to think, “What did I do with the last three years of my life?”

Money. There is one thing you didn’t do in those last few years: make money. PhDs are the slave labor that make the academic complex run. It’s not that universities aren’t well funded by grants: indeed, the government spends large amounts of money funding research programs. But most of this money never finds its way to PhDs: you’re looking at a $30k stipend, when a software engineer can easily be making $150k to $200k in a few years at a software company. Even once you make it into a tenure track position, you are still routinely making less than people working for industry. You don’t go into academia expecting to get rich.

Scarcity. Indeed, you shouldn’t go into academia expecting to get much of anything. The available tenured positions are greatly outstripped by the number of PhD applicants, to the point that your bid into the academic establishment is more like a lottery ticket. You have to be doing a postdoc—i.e., in a holding pattern—at precisely the right time when a tenure position becomes vacated (maybe the professor died), or spend years building up your network of contacts in academia in hopes of landing a position through that connection. Most people don’t make it, even at a second or third tier university. The situation is similar for industrial research labs, which become rarer and rarer by the year: Microsoft Research is highly selective, and as a prospective PhD, you are making a ten-year bet about its ability to survive. Intel Labs certainly didn’t.

Tenure isn’t all that great. But even if you do make it to tenure, it isn’t actually all that great. You’ve spent the last decade and a half fighting for the position in a competitive environment that doesn’t allow for any change of pace (god forbid you disappear for a year while your tenure clock is ticking), and now what? You are now going to stay at the institution you got tenure at for the rest of your life: all you might have is a several month sabbatical every few years. It’s the ivory handcuffs, and you went through considerably more effort to put them on than that guy who went to Wall Street. And as for the work? Well, you still have to justify your work and get grant funding, and you still need to serve on committees and do other obligatory tasks which are not at all related to your research. In industry, you could simply hire someone to handle your post on your university’s “Disciplinary Committee”—in academia, that’s simply not how it works.

Lack of access. And if you are a systems researcher, you don’t even get the facilities you need to do the large-scale research that is really interesting. Physicists get particle accelerators, biologists get giant labs, but what does the systems researcher get? Certainly not a software system used by millions of users around the world. To get that sort of system, you have to go to industry. Just ask Matt Welsh, who made a splash leaving a tenured position at Harvard to go join Google. Working in that context lets you actually see whether your crazy ideas work.

Don’t do it now. Of course, at this point I’m mentally protesting that this is all incredibly unfair to a PhD, that you do get more freedom, that maybe some people don’t care that much about money, that this is simply a question about value systems, and really, for some people, it’s the right decision. I might say that your twenties are also the best time to do your PhD, that academia is the correct late-career path, that you can still do a startup as a professor.

Perhaps, they say, but you need to figure out if this is the right decision for you. You need experience in both areas to make this decision, and the best way to get it is to sample industry for two or three years before deciding whether you want to go back to academia. After you finish your PhD, people mentally set a timer on your potential as an academic, and if you don’t publish during that time people will stop taking you seriously. But if you start a PhD in your mid-twenties, no one will bat an eye. Everyone can have a bad software internship; don’t let that turn you away from industry. We solve cool problems. We are more diverse, and in aggregate we give more freedom. In no sense of the word are we second-class.

They might be right. I don’t know.

Let's play a game

Ever wondered how Haskellers are magically able to figure out the implementation of functions just by looking at their type signature? Well, now you can learn this ability too. Let’s play a game.

You are an inventor, world renowned for your ability to make machines that transform things into other things. You are a proposer.

image

But there are many who would doubt your ability to invent such things. They are the verifiers.

image

The game we play goes as follows. You, the proposer, make a claim as to some wondrous machine you know how to implement, e.g. (a -> b) -> a -> b (which says given a machine which turns As into Bs, and an A, it can create a B). The verifier doubts your ability to have created such a machine, but being a fair minded skeptic, furnishes you with the inputs to your machine (the assumptions), in hopes that you can produce the goal.

image

As a proposer, you can take the inputs and machines the verifier gives you, and apply them to each other.

image

But that’s not very interesting. Sometimes, after the verifier gives you some machines, you want to make another proposal. Usually, this is because one of the machines takes a machine which you don’t have, but you also know how to make.

image

The verifier is obligated to furnish more assumptions for this new proposal, but these are placed inside the cloud of abstraction.

image

You can use assumptions that the verifier furnished previously (below the cloud of abstraction),

image

but once you’ve finished the proposal, all of the new assumptions go away. All you’re left with is a shiny new machine (which you ostensibly want to pass to another machine) which can be used for the original goal.

image

These are all the rules we need for now. (They constitute the most useful subset of what you can do in constructive logic.)

Let’s play a game.

image

Our verifier supplies the machines we need to play this game. Our goal is r.

image

That’s a lot of machines, and it doesn’t look like we can run any of them. There’s no way we can fabricate an a from scratch to run the bottom one, so maybe we can make an a -> r. (It may seem like I’ve conjured this proposal out of thin air, but if you look carefully it’s the only possible choice that will work in this circumstance.) Let’s make a new proposal for a -> r.

image

Our new goal for this sub-proposal is also r, but unlike in our original case, we can create an r with our extra ingredient, an a: just take two of the original machines and the newly furnished a. Voila, an r!

This discharges the cloud of abstraction, leaving us with a shiny new a -> r to pass to the remaining machine, and fulfill the original goal with.

image

Let’s give these machines some names. I’ll pick some suggestive ones for you.

image

Oh hey, you just implemented bind for the continuation monad.

image

Here is the transformation step by step:

m a -> (a -> m b) -> m b
Cont r a -> (a -> Cont r b) -> Cont r b
((a -> r) -> r) -> (a -> ((b -> r) -> r)) -> ((b -> r) -> r)
((a -> r) -> r) -> (a -> (b -> r) -> r) -> (b -> r) -> r

The last step is perhaps the most subtle, but can be done because arrows right associate.
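
Spelled out in Haskell against the desugared Cont type, the machine we just assembled is this (a quick sketch; the name bind is mine):

bind :: ((a -> r) -> r) -> (a -> (b -> r) -> r) -> (b -> r) -> r
bind m k = \c -> m (\a -> k a c)

The lambda \a -> k a c is exactly the sub-proposal we built under the cloud of abstraction: the a -> r machine that we then fed to the original machine m.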

As an exercise, do return :: a -> (a -> r) -> r (wait, that looks kind of familiar…), fmap :: (a -> b) -> ((a -> r) -> r) -> (b -> r) -> r and callCC :: ((a -> (b -> r) -> r) -> (a -> r) -> r) -> (a -> r) -> r (important: that’s a b inside the first argument, not an a !).

This presentation is the game semantic account of intuitionistic logic, though I have elided treatment of negation and quantifiers, which are more advanced topics than the continuation monad, at least in this setting.

8 ways to report errors in Haskell revisited

In 2007, Eric Kidd wrote a quite popular article named 8 ways to report errors in Haskell. However, it has been four years since the original publication of the article. Does this affect the veracity of the original article? Some names have changed, and some of the original advice given may have been a bit… dodgy. We’ll take a look at each of the recommendations from the original article, and also propose a new way of conceptualizing all of Haskell’s error reporting mechanisms.

I recommend reading this article side by side with the old article.

1. Use error

No change. My personal recommendation is that you should only use error in cases which imply programmer error; that is, you have some invariant that only a programmer (not an end-user) could have violated. And don’t forget, you should probably see if you can enforce this invariant in the type system, rather than at runtime. It is also good style to include the name of the function which the error is associated with, so you say “myDiv: division by zero” rather than just “Division by zero.”

Another important thing to note is that error e is actually an abbreviation for throw (ErrorCall e), so you can explicitly pattern match against this class of errors using:

import qualified Control.Exception as E

-- myDiv1 is the divide-by-zero example from the original article
myDiv1 :: Float -> Float -> Float
myDiv1 _ 0 = error "myDiv1: division by zero"
myDiv1 x y = x / y

example1 :: Float -> Float -> IO ()
example1 x y =
  E.catch (putStrLn (show (myDiv1 x y)))
          (\(E.ErrorCall e) -> putStrLn e)

However, testing for string equality of error messages is bad juju, so if you do need to distinguish specific error invocations, you may need something better.

2. Use Maybe a

No change. Maybe is a convenient, universal mechanism for reporting failure when there is only one possible failure mode and it is something that a user probably will want to handle in pure code. You can easily convert a returned Maybe into an error using fromMaybe (error "bang") m. Maybe gives no indication what the error was, so it’s a good idea for a function like head or tail but not so much for doSomeComplicatedWidgetThing.
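
For instance, here is a minimal sketch of the pattern (safeHead and headOrBang are my names, for illustration):

import Data.Maybe (fromMaybe)

-- One failure mode (the empty list), reported with Maybe
safeHead :: [a] -> Maybe a
safeHead (x:_) = Just x
safeHead []    = Nothing

-- Convert the failure back into an error when the caller considers it impossible
headOrBang :: [a] -> a
headOrBang xs = fromMaybe (error "headOrBang: empty list") (safeHead xs)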

3. Use Either String a

I can’t really recommend using this method in any circumstance. If you don’t need to distinguish errors, you should have used Maybe. If you don’t need to handle errors while you’re in pure code, use exceptions. If you need to distinguish errors in pure code, for the love of god, don’t use strings, make an enumerable type!
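
Here is a minimal sketch of what that can look like (ParseError and parseDigit are made-up names for illustration):

-- An enumerable error type in place of Either String
data ParseError = EmptyInput | NotADigit Char
  deriving (Show, Eq)

parseDigit :: String -> Either ParseError Int
parseDigit [] = Left EmptyInput
parseDigit (c:_)
  | c >= '0' && c <= '9' = Right (fromEnum c - fromEnum '0')
  | otherwise            = Left (NotADigit c)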

However, in base 4.3 or later (GHC 7), this monad instance comes for free in Control.Monad.Instances; you no longer have to do the ugly Control.Monad.Error import. But there are some costs to having changed this: see below.

4. Use Monad and fail to generalize 1-3

If you are at all a theoretician, you reject fail as an abomination that should not belong in Monad, and refuse to use it.

If you’re a bit more practical than that, it’s tougher to say. I’ve already made the case that catching string exceptions in pure code isn’t a particularly good idea, and if you’re in the Maybe monad fail simply swallows your nicely written exception. If you’re running base 4.3, Either will not treat fail specially either:

-- Prior to base-4.3
Prelude Control.Monad.Error> fail "foo" :: Either String a
Loading package mtl-1.1.0.2 ... linking ... done.
Left "foo"

-- After base-4.3
Prelude Control.Monad.Instances> fail "foo" :: Either String a
*** Exception: foo

So you have this weird generalization that doesn’t actually do what you want most of the time. It just might (and even so, only barely) come in handy if you have a custom error handling application monad, but that’s it.

It’s worth noting Data.Map does not use this mechanism anymore.

5. Use MonadError and a custom error type

MonadError has become a lot more reasonable in the new world order, and if you are building your own application monad it’s a pretty reasonable choice, either as a transformer in the stack or an instance to implement.

Contrary to the old advice, you can use MonadError on top of IO: you just transform the IO monad and lift all of your IO actions. I’m not really sure why you’d want to, though, since IO has its own nice, efficient and extensible error throwing and catching mechanisms (see below).

I’ll also note that canonicalizing the errors of the libraries you are interoperating with is a good thing: it makes you think about what information you care about and how you want to present it to the user. You can always create a MyParsecError constructor which takes the parsec error verbatim, but for a really good user experience you should be considering each case individually.
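
As a rough sketch of what this can look like (using the newer Control.Monad.Except module from mtl; AppError, App and lookupUserId are made-up names for illustration):

import Control.Monad.Except

-- A custom application error type
data AppError = UserNotFound String | ParseFailure String
  deriving Show

type App = Except AppError

lookupUserId :: String -> App Int
lookupUserId "alice" = return 1
lookupUserId name    = throwError (UserNotFound name)

-- runExcept (lookupUserId "bob") evaluates to Left (UserNotFound "bob")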

6. Use throw in the IO monad

It’s not called throwDyn and catchDyn anymore (unless you import Control.OldException), just throw and catch. You don’t even need a Typeable instance; just a trivial Exception instance. I highly recommend this method for unchecked exception handling in IO: despite the mutation of these libraries over time, the designers of Haskell and GHC’s maintainers have put a lot of thought into how these exceptions should work, and they have broad applicability, from normal synchronous exception handling to asynchronous exception handling, which is very nifty. There are a load of bracketing, masking and other functions which you simply cannot do if you’re passing Eithers around.

Make sure you do use throwIO and not throw if you are in the IO monad, since the former guarantees ordering; the latter, not necessarily.
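
A minimal sketch of this style (WidgetError is a made-up exception type; on GHCs older than 7.10 you would also need DeriveDataTypeable and a derived Typeable instance):

import Control.Exception (Exception, handle, throwIO)

data WidgetError = WidgetMissing String
  deriving Show

instance Exception WidgetError  -- the default methods suffice

main :: IO ()
main = handle (\(WidgetMissing name) -> putStrLn ("missing widget: " ++ name)) $
         throwIO (WidgetMissing "frobnicator")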

7. Use ioError and catch

No reason to use this, it’s around for hysterical raisins.

8. Go nuts with monad transformers

This is, for all intents and purposes, the same as 5; just in one case you roll your own, and in this case you compose it with transformers. The same caveats apply. Eric does give good advice here shooing you away from using this with IO.


Here are some new mechanisms which have sprung up since the original article was published.

9. Checked exceptions

Pepe Iborra wrote a nifty checked exceptions library which allows you to explicitly say what Control.Exception style exceptions a piece of code may throw. I’ve never used it before, but it’s gratifying to know that Haskell’s type system can be (ab)used in this way. Check it out if you don’t like the fact that it’s hard to tell if you caught all the exceptions you care about.

10. Failure

The Failure typeclass is a really simple library that attempts to solve the interoperability problem by making it easy to wrap and unwrap third-party errors. I’ve used it a little, but not enough to have any authoritative opinion on the matter. It’s also worth taking a look at the Haskellwiki page.

Conclusion

There are two domains of error handling that you need to consider: pure errors and IO errors. For IO errors, there is a very clear winner: the mechanisms specified in Control.Exception. Use it if the error is obviously due to an imperfection in the outside universe. For pure errors, a bit more taste is necessary. Maybe should be used if there is one and only one failure case (and maybe it isn’t even that big of a deal), error may be used if it encodes an impossible condition, string errors may be OK in small applications that don’t need to react to errors, custom error types in those that do. For interoperability problems, you can easily accommodate them with your custom error type, or you can try using some of the frameworks that other people are building: maybe one will some day gain critical mass.

It should be clear that there is a great deal of choice for Haskell error reporting. However, I don’t think this choice is unjustified: each tool has situations which are appropriate for its use, and one joy of working in a high level language is that error conversion is, no really, not that hard.

Joseph and the Amazing Technicolor Box

Consider the following data type in Haskell:

data Box a = B a

How many computable functions of type Box a -> Box a are there? If we strictly use denotational semantics, there are seven:

image

But if we furthermore distinguish the source of the bottom (a very operational notion), some functions with the same denotation have more implementations…

  1. Irrefutable pattern match: f ~(B x) = B x. No extras.
  2. Identity: f b = b. No extras.
  3. Strict: f (B !x) = B x. No extras.
  4. Constant boxed bottom: Three possibilities: f _ = B (error "1"); f b = B (case b of B _ -> error "2"); and f b = B (case b of B !x -> error "3").
  5. Absent: Two possibilities: f (B _) = B (error "4"); and f (B x) = B (x `seq` error "5").
  6. Strict constant boxed bottom: f (B !x) = B (error "6").
  7. Bottom: Three possibilities: f _ = error "7"; f (B _) = error "8"; and f (B !x) = error "9".

The list was ordered by the colors of the rainbow. If this was hieroglyphics to you, may I interest you in this blog post?
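
To make one of the operational distinctions concrete, here is a small sketch that tells the irrefutable pattern match (1) apart from the identity (2) by pattern matching on the result of applying each to bottom:

data Box a = B a

fIrrefutable, fIdentity :: Box a -> Box a
fIrrefutable ~(B x) = B x   -- maps bottom to B bottom
fIdentity b = b             -- maps bottom to bottom

isBox :: Box a -> Bool
isBox (B _) = True

main :: IO ()
main = do
  print (isBox (fIrrefutable undefined))  -- prints True
  print (isBox (fIdentity undefined))     -- throws Prelude.undefined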

Postscript. GHC can and will optimize f b = B (case b of B !x -> error "3"), f (B x) = B (x `seq` error "5") and f (B !x) = error "9" into alternative forms, because in general we don’t say whether seq (error "1") (error "2") is semantically equivalent to error "1" or error "2": either one is possible due to imprecise exceptions. But if you really care, you can use pseq. However, even with exception set semantics, there are more functions in this “refined” view than in the normal denotational semantics.

Diskless Paxos crash recovery

This is an edited version of an email I sent last week. Unfortunately, it does require you to be familiar with the original Paxos correctness proof, so I haven’t even tried to expand it into something appropriate for a lay audience. The algorithm is probably too simple to be in the literature, except maybe informally mentioned—however, if it is wrong, I would love to know, since real code depends on it.

I would like to describe an algorithm for Paxos crash-recovery that does not require persistent storage, by utilizing synchronized clocks and a lattice-based epoch numbering. The basic idea is to increase the ballot/proposal number to one for which it is impossible for the crashed node to have made any promises. Such an algorithm, as noted in Paxos Made Live, is useful in the case of disk corruption, where persistent storage is lost. (Unfortunately, the algorithm they describe in the paper for recovering from this situation is incorrect. The reason is left as an exercise for the reader.) It is inspired by van Renesse’s remark about an “epoch-based system”, and the epoch-based crash-recovery algorithm described in JPaxos: State Machine Replication Based on the Paxos Protocol. However, in correspondence with Nuno, I discovered that proofs for the correctness of their algorithm had not been published, so I took it upon myself to convince myself of its correctness, and in the process discovered a simpler version. It may be the case that this algorithm is already in the community folklore, in which case all the better, since my primary interest is implementation.

First, let’s extend proposal numbers from a single, namespaced value n to a tuple (e, n), where n is a namespaced proposal number as before, and e is an epoch vector, with length equal to the number of nodes in the Paxos cluster, and the usual Cartesian product lattice structure imposed upon it.

Let’s establish what behavior we’d like from a node during a crash:

KNOWN-UNKNOWNS. An acceptor knows a value e* such that, for all e with e* ≤ e (using the lattice ordering), the acceptor knows whether it has responded to prepare requests of the form (e, n) (for any n).

That is to say, the acceptor knows which set of proposal numbers it is guaranteed not to have made any promises for.

How can we establish this invariant? We might write a value to persistent storage, and then increment it upon a crash; this behavior is then established by monotonicity. It turns out we have other convenient sources of monotonic numbers: synchronized clocks (which are useful for Paxos in other contexts) have this behavior. So instead of using a vector of integers, we use a vector of timestamps. Upon a crash, a process sets its epoch to be the zero vector, except for its own entry, which is set to its current timestamp.

In Paxos made Simple, Lamport presents the following invariant on the operation of acceptors:

P1a. An acceptor can accept a proposal numbered n iff it has not responded to a prepare request numbered greater than n.

We can modify this invariant to the following:

P1b. An acceptor can accept a proposal numbered (e, n) iff e* ≤ e and it has not responded to a prepare request (_, n') with n' > n.

Notice that this invariant “strengthens” P1a in the sense that an acceptor accepts a proposal in strictly less cases (namely, it refuses proposals when e* ≰ e). Thus, safety is preserved, but progress is now suspect.

When establishing progress of Paxos, we require that there exist a stable leader, and that this leader eventually pick a proposal number that is “high enough”. So the question is, can the leader eventually pick a proposal number that is “high enough”? Yes, define this number to be (lub{e}, max{n} + 1). Does this epoch violate KNOWN-UNKNOWNS? No, as a zero vector with a single later timestamp for that node is always incomparable with any epoch the existing system may have converged upon.

Thus, the modifications to the Paxos algorithm are as follows:

  • Extend ballot numbers to include epoch numbers;
  • On initial startup, set e* to be the zero vector, with the current timestamp in this node’s entry;
  • Additionally reject accept requests whose epoch numbers are not greater-than or equal to e*;
  • When selecting a new proposal number to propose, take the least upper bound of all epoch numbers.
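
To make the extension concrete, here is a small Haskell sketch of the epoch bookkeeping (the representation and all the names are mine, not from any particular implementation; epochs are assumed to all have the same length):

import Data.List (foldl')

type NodeId = Int
type Epoch  = [Integer]            -- one timestamp per node in the cluster
type Ballot = (Epoch, Integer)     -- (epoch vector, proposal number n)

-- Lattice order on epochs: pointwise comparison (two epochs may be incomparable)
epochLE :: Epoch -> Epoch -> Bool
epochLE e1 e2 = and (zipWith (<=) e1 e2)

-- Least upper bound of two epochs: pointwise maximum
epochLUB :: Epoch -> Epoch -> Epoch
epochLUB = zipWith max

-- On crash recovery, node i resets its epoch to the zero vector except for its
-- own entry, which is set to its current clock reading
recoveryEpoch :: Int -> NodeId -> Integer -> Epoch
recoveryEpoch numNodes i now = [ if j == i then now else 0 | j <- [0 .. numNodes - 1] ]

-- P1b: accept (e, n) only if eStar <= e and no prepare request with a higher
-- proposal number has been answered
canAccept :: Epoch -> Integer -> Ballot -> Bool
canAccept eStar highestPrepared (e, n) = epochLE eStar e && n >= highestPrepared

-- A leader picks a "high enough" ballot: (lub of all epochs seen, max n seen + 1)
nextBallot :: Int -> [Ballot] -> Ballot
nextBallot numNodes seen =
  ( foldl' epochLUB (replicate numNodes 0) (map fst seen)
  , maximum (0 : map snd seen) + 1 )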

An optimization: on non-crash start, initialize e* to be just the zero vector; this eliminates the need to establish an epoch in the first round of prepare requests. Cloning state from a snapshot is an orthogonal problem, and can be addressed using the same mechanisms that fix lagging replicas. We recommend also implementing the optimization in which a leader only sends accept messages to a known good quorum, so a recovered node does not immediately force a view change.

I would be remiss if I did not mention some prior work in this area. In particular, in Failure Detection and Consensus in the Crash-Recovery Model, the authors present a remarkable algorithm that, without stable storage, can handle more than the majority of nodes failing simultaneously (under some conditions, which you can find in the paper). Unfortunately, their solution is dramatically more complicated than the solution I have described above, and I do not know of any implementations of it. Additionally, an alternate mechanism for handling crashed nodes with no memory is a group membership mechanism. However, group membership is notoriously subtle to implement correctly.

First impressions of module programming

During my time at Jane Street, I’ve done a fair bit of programming involving modules. I’ve touched functors, type and module constraints, modules in modules, even first class modules (though only peripherally). Unfortunately, the chapter on modules in Advanced Topics in Types and Programming Languages made my eyes glaze over, so I can’t really call myself knowledgeable in module systems yet, but I think I have used them enough to have a few remarks about them. (All remarks about convention should be taken to be indicative of Jane Street style. Note: they’ve open sourced a bit of their software, if you actually want to look at some of the stuff I’m talking about.)

The good news is that they basically work the way you expect them to. In fact, they’re quite nifty. The most basic idiom you notice when beginning to work with a codebase that uses modules a lot is this:

module Sexp = struct
  type t = ...
  ...
end

There is in fact a place where I have seen this style before: Henning Thielemann’s code on Hackage, in particular data-accessor, which I have covered previously. Unlike in Haskell, this style actually makes sense in OCaml, because you never include Sexp (an unqualified import in Haskell lingo) in the conventional sense; you usually refer to the type as Sexp.t. So the basic unit of abstraction can be thought of as a type—and most simple modules are exactly this—but you can add auxiliary types and functions that operate on that type. This is pretty simple to understand, and you can mostly parse the module system as a convenient namespacing mechanism.

Then things get fun.

When you use Haskell type classes, each function individually specifies what constraints on the argument there are. OCaml doesn’t have any type classes, so if you want to do that, you have to manually pass the dictionary to a function. You can do that, but it’s annoying, and OCaml programmers think bigger. So instead of passing a dictionary to a function, you pass a module to a functor, and you specialize all of your “generic” functions at once. It’s more powerful, and this power gets over the annoyance of having to explicitly specify what module you’re using at any given time. Constraints and modules-in-modules fall out naturally from this basic idea, when you actually try to use the module system in practice.

Probably the hardest thing (for me) to understand about the module system is how type inference and checking operate over it. Part of this is the impedance mismatch with how type classes work. When I have a function:

f :: Monoid m => m -> Int -> m

m is a polymorphic type variable that can get unified with any specific type. So if I do f 5 2 + 2, that’s completely fair game if I have an appropriate Monoid instance defined for Int (even though + is not a Monoid method).

However, if I do the same trick with modules, I have to be careful about adding extra type constraints to teach the compiler that some types are, indeed, the same. Here is an example of an extra type restriction that feels like it should get unified away, but doesn’t:

module type SIG = sig
    type t
    val t_of_string : string -> t
end

module N : SIG = struct
    type t = string
    let t_of_string x = x
end

let () = print_endline (N.t_of_string "foo")

Actually, you have to specify that t and string are the same when you add that SIG declaration:

module N : SIG with type t = string = struct

Funny! (Actually, it gets more annoying when you’re specifying constraints for large numbers of types, not just one.) It’s also tricky to get right when functors are involved, and there were some bugs in pre-3.12 OCaml which meant that you had to do some ugly things to ensure you could actually write the type constraints you wanted (with type t = t… those ts are different…)

There are some times, however, when you feel like you would really, really like typeclasses in OCaml. Heavily polymorphic functionality tends to be the big one: if you have something like Sexpable (types that can be converted into S-expressions), using the module system feels very much like duck typing: if it has a sexp_of_t function, and it’s typed right, it’s “sexpable.” Goodness, most of the hairy functors in our base library are because we need to handle the moral equivalent of multiparameter type classes.

Monadic bind is, of course, hopeless. Well, it works OK if you’re only using one monad in your program (then you just specialize your >>= to that module’s implementation by opening the module). But in most applications you’re usually in one specific monad, and if you want to quickly drop into the option monad you’re out of luck. Or you could redefine the operator to be >>=~ and hope no one stabs you. :-)

In-program GC stats redux

Hac Phi was quite productive (I managed to get two blog posts out of it!). On Saturday I committed a new module GHC.Stats to base which implemented a modified subset of the API I proposed previously. Here is the API; to use it you’ll need to compile GHC from Git. Please test and let me know if things should get changed or clarified!

-- | Global garbage collection and memory statistics.
data GCStats = GCStats
    { bytes_allocated :: Int64 -- ^ Total number of bytes allocated
    , num_gcs :: Int64 -- ^ Number of garbage collections performed
    , max_bytes_used :: Int64 -- ^ Maximum number of live bytes seen so far
    , num_byte_usage_samples :: Int64 -- ^ Number of byte usage samples taken
    -- | Sum of all byte usage samples, can be used with
    -- 'num_byte_usage_samples' to calculate averages with
    -- arbitrary weighting (if you are sampling this record multiple
    -- times).
    , cumulative_bytes_used :: Int64
    , bytes_copied :: Int64 -- ^ Number of bytes copied during GC
    , current_bytes_used :: Int64 -- ^ Current number of live bytes
    , current_bytes_slop :: Int64 -- ^ Current number of bytes lost to slop
    , max_bytes_slop :: Int64 -- ^ Maximum number of bytes lost to slop at any one time so far
    , peak_megabytes_allocated :: Int64 -- ^ Maximum number of megabytes allocated
    -- | CPU time spent running mutator threads.  This does not include
    -- any profiling overhead or initialization.
    , mutator_cpu_seconds :: Double
    -- | Wall clock time spent running mutator threads.  This does not
    -- include initialization.
    , mutator_wall_seconds :: Double
    , gc_cpu_seconds :: Double -- ^ CPU time spent running GC
    , gc_wall_seconds :: Double -- ^ Wall clock time spent running GC
    -- | Number of bytes copied during GC, minus space held by mutable
    -- lists held by the capabilities.  Can be used with
    -- 'par_max_bytes_copied' to determine how well parallel GC utilized
    -- all cores.
    , par_avg_bytes_copied :: Int64
    -- | Sum of number of bytes copied each GC by the most active GC
    -- thread each GC.  The ratio of 'par_avg_bytes_copied' divided by
    -- 'par_max_bytes_copied' approaches 1 for a maximally sequential
    -- run and approaches the number of threads (set by the RTS flag
    -- @-N@) for a maximally parallel run.
    , par_max_bytes_copied :: Int64
    } deriving (Show, Read)

-- | Retrieves garbage collection and memory statistics as of the last
-- garbage collection.  If you would like your statistics as recent as
-- possible, first run a 'performGC' from "System.Mem".
getGCStats :: IO GCStats
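
Here is a small usage sketch (assuming a GHC that ships this module; depending on the version you may also need to run the program with the RTS flag -T, i.e. ./prog +RTS -T, so that statistics are actually collected):

import GHC.Stats (GCStats(..), getGCStats)
import System.Mem (performGC)

main :: IO ()
main = do
  print (sum [1 .. 1000000 :: Int])   -- do some allocation
  performGC                           -- make the stats as recent as possible
  stats <- getGCStats
  putStrLn ("bytes allocated: " ++ show (bytes_allocated stats))
  putStrLn ("collections:     " ++ show (num_gcs stats))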