Inside 206-105

Existential Pontification and Generalized Abstract Digressions

Ott ⇔ PLT Redex

Ott and PLT Redex are a pair of complimentary tools for the working semanticist. Ott is a tool for writing definitions of programming languages in a nice ASCII notation, which then can be typeset in LaTeX or used to generate definitions for a theorem prover (e.g. Coq). PLT Redex is a tool for specifying and debugging operational semantics. Both tools are easy to install, which is a big plus. Since the tools are quite similar, I thought it might be interesting to do a comparison of how various common tasks are done in both languages. (Also, I think the Redex manual is pretty terrible.)

Variables. In Ott, variables are defined by way of metavariables (metavar x), which then serve as variable (by either using the metavariable alone, or suffixing it with a number, index variable or tick).

In Redex, there is no notion of a metavariable; a variable is just another production. There are a few different ways say that a production is a variable: the simplest method is to use variable-not-otherwise-mentioned, which automatically prevents keywords from acting as variables. There are also several other variable patterns variable, variable-except and variable-prefix, which afford more control over what symbols are considered variables. side-condition may also be useful if you have a function which classifies variables.

Grammar. Both Ott and Redex can identify ambiguous matches. Ott will error when it encounters an ambiguous parse. Redex, on the other hand, will produce all valid parses; while this is not so useful when parsing terms, it is quite useful when specifying non-deterministic operational semantics (although this can have bad performance implications). check-redundancy may be useful to identify ambiguous patterns.

Binders. In Ott, binders are explicitly declared in the grammar using bind x in t; there is also a binding language for collecting binders for pattern-matching. Ott can also generate substitution/free variable functions for the semantics. In Redex, binders are not stated in the grammar; instead, they are implemented solely in the reduction language, usually using substitution (Redex provides a workhorse substitution function for this purpose), and explicitly requiring a variable to be fresh. Redex does have a special-form in the metalanguage for doing let-binding (term-let), which substitutes immediately.

Lists. Ott supports two forms of lists: dot forms and list comprehensions. A dot form looks like x1 , .. , xn and requires an upper bound. A list comprehension looks like </ xi // i IN 1 .. n />; the bounds can be omitted. A current limitation of Ott is that it doesn’t understand how to deal with nested dot forms, this can be worked around by doing a comprension over a production, and then elsewhere stating the appropriate equalities the production satisfies.

Redex supports lists using ellipsis patterns, which looks like (e ...). There is no semantic content here: the ellipses simply matches zero or more copies of e, which can lead to nondeterministic matches when there are multiple ellipses. Nested ellipses are supported, and simply result in nested lists. Bounds can be specified using side-conditions; however, Redex supports a limited form of bounding using named ellipses (e.g. ..._1), where all ellipses with the same name must have the same length.

Semantics. Ott is agnostic to whatever semantics you want to define; arbitrary judgments can be specified. One can also define judgments as usual in Redex, but Redex provides special support for evaluation semantics, in which a semantics is given in terms of evaluation contexts, thus allowing you to avoid the use of structural rules. So a usual use-case is to define a normal expression language, extend the language to have evaluation contexts, and then define a reduction-relation using in-hole to do context decomposition. The limitation is that if you need to do anything fancy (e.g. multi-hole evaluation contexts), you will have to fall back to judgment forms.

Type-setting. Ott supports type-setting by translation into LaTeX. Productions can have custom LaTeX associated with them, which is used to generate their output. Redex has a pict library for directly typesetting into PDF or Postscript; it doesn’t seem like customized typesetting is an intended use-case for PLT Redex, though it can generate reasonable Lisp-like output.

Conclusion. If I had to say what the biggest difference between Ott and PLT Redex was, it is that Ott is primarily concerned with the abstract semantic meaning of your definitions, whereas PLT Redex is primarily concerned with how you would go about matching against syntax (running it). One way to see this is in the fact that in Ott, your grammar is a BNF, which is fed into a CFG parser; whereas in PLT Redex, your grammar is a pattern language for the pattern-matching machine. This should not be surprising: one would expect each tool’s design philosophy to hew towards their intended usage.

When a lock is better than an MVar

MVars are an amazingly flexible synchronization primitive, which can serve as locks, one-place channels, barriers, etc. or be used to form higher-level abstractions. As far as flexibility is concerned, MVars are the superior choice of primitive for the runtime system to implement—as opposed to just implementing, say, a lock.

However, I was recently thinking about GHC's BlockedIndefinitelyOnMVar exception, and it occurred to me that a native implementation of locks could allow perfect deadlock detection, as opposed to the approximate detection for MVars we currently provide. (I must emphasize, however, that here, I define deadlock to mean a circular waits-for graph, and not “thread cannot progress further.”)

Here is how the new primitive would behave:

  • There would be a new type Lock, with only one function withLock :: Lock -> IO a -> IO a. (For brevity, we do not consider the generalization of Lock to also contain a value.)
  • At runtime, the lock is represented as two closure types, indicating locked and unlocked states. The locked closure contains a waiting queue, containing threads which are waiting for the lock.
  • When a thread takes out a free lock, it adds the lock to a (GC'd) held locks set associated with the thread. When it returns the lock, the lock is removed from this set.
  • When a thread attempts to take a busy lock, it blocks itself (waiting for a lock) and adds itself to the waiting queue of the locked closure.
  • Critically, references to the lock are treated as weak pointers when the closure is locked. (Only the pointer from the held lock set is strong.) Intuitively, just because a pointer to the lock doesn’t mean you can unlock; the only person who can unlock it is the thread who has the lock in their held locks set.
  • If a thread attempts to take out a lock on a dead weak pointer, it is deadlocked.

Theorem. Any set of threads in a waits-for cycle is unreachable, if there are no other pointers to thread besides the pointer from the waiting queue of the locks in the cycle.

Proof. Consider a single thread in the cycle: we show that the only (strong) pointer to it is from the previous thread in the cycle. When a thread is blocked, it is removed from the run queue (which counts as a GC root). Given the assumption, the only pointer to the thread is from the waiting queue of the lock it is blocked on. We now consider pointers to the lock it is blocked on. As this lock is busy, all pointers to it are weak, except for the pointer from the thread which is holding the lock. But this is exactly the previous thread in the cycle. ■

At the cost of a weak-pointer dereference when a lock is taken out, we can now achieve perfect deadlock detection. Deadlock will be detected as soon as a garbage collection runs that detects the dead cycle of threads. (At worst, this will be the next major GC.)

Why might this be of interest? After all, normally, it is difficult to recover from a deadlock, so while accurate deadlock reporting might be nice-to-have, it is by no means necessary. One clue comes from a sentence in Koskinen and Herlihy's paper Dreadlocks: Efficient Deadlock Detection: “an application that is inherently capable of dealing with abortable lock software transactional memory (STM).” If you are in an STM transaction, deadlock is no problem at all; just rollback one transaction, breaking the cycle. Normally, one does not take out locks in ordinary use of STM, but this can occur when you are using a technique like transactional boosting (from the same authors; the relationship between the two papers is no coincidence!)

Exercise for the reader, formulate a similar GC scheme for MVars restricted to be 1-place channels. (Hint: split the MVar into a write end and a read end.)

So you want to add a new concurrency primitive to GHC…

One of the appealing things about GHC is that the compiler is surprisingly hackable, even when you don’t want to patch the compiler itself. This hackability comes from compiler plugins, which let you write custom optimization passes on Core, as well as foreign primops, which let you embed low-level C-- to manipulate the low-level representation of various primitives. These hooks let people implement and distribute features that would otherwise be to unstable or speculative to put into the compiler proper.

A particular use-case that has garnered some amount of interest recently is that of concurrency primitives. We engineers like to joke that, in the name of performance, we are willing to take on nearly unbounded levels of complexity: but this is almost certainly true when it comes to concurrency primitives, where the use of ever more exotic memory barriers and concurrent data structures can lead to significant performance boosts (just ask the Linux kernel developers). It’s very tempting to look at this situation and think, “Hey, we could implement this stuff in GHC too, using the provided compiler hooks!” But there are a lot of caveats involved here.

After answering a few questions related to this subject on the ghc-devs list and noticing that many of the other responses were a bit garbled, I figured I ought to expand on my responses a bit in a proper blog post. I want to answer the following questions:

  1. What does it mean to have a memory model for a high-level language like Haskell? (You can safely skip this section if you know what a memory model is.)
  2. What is (GHC) Haskell’s memory model?
  3. How would I go about implementing a (fast) memory barrier in GHC Haskell?

Memory models are semantics

What is a memory model? If you ask a hardware person, they might tell you, “A memory model is a description of how a multi-processor CPU interacts with its memory, e.g. under what circumstances a write by one processor is guaranteed to be visible by another.” If you ask a compiler person, they might tell you, “A memory model says what kind of compiler optimizations I’m allowed to do on operations which modify shared variables.” A memory model must fulfill both purposes (a common misconception is that it is only one or the other). To be explicit, we define a memory model as follows (adapted from Adve-Boehm):

A memory model is a semantics for shared variables, i.e. the set of values that a read in a program is allowed to return.

That’s right: a memory model defines the behavior of one the most basic operations in your programming language. Without it, you can’t really say what your program is supposed to do.

Why, then, are memory models so rarely discussed, even in a language community that is so crazy about semantics? In the absence of concurrency, the memory model is irrelevant: the obvious semantics apply. In the absence of data races, the memory model can be described quite simply. For example, a Haskell program which utilizes only MVars for inter-thread communication can have its behavior described completely using a relatively simple nondeterministic operational semantics (see Concurrent Haskell paper (PS)); software transactional memory offers high-level guarantees of atomicity with respect to reads of transactional variables. Where a memory model becomes essential is when programs contain data races: when you have multiple threads writing and reading IORefs without any synchronization, a memory model is responsible for defining the behavior of this program. With modern processors, this behavior can be quite complex: we refer to these models as relaxed memory models. Sophisticated synchronization primitives will often take advantage of a relaxed memory model to avoid expensive synchronizations and squeeze out extra performance.

GHC Haskell’s memory (non) model

One might say the Haskell tradition is one that emphasizes the importance of semantics... except for a number of notable blind spots. The memory model is one of those blind spots. The original Haskell98 specification did not contain any specification of concurrency. Concurrent Haskell paper (PS) gave a description of semantics for how concurrency might be added to the language, but the paper posits only the existence of MVars, and is silent on how MVars ought to interact with IORefs.

One of the very first discussions that took place on the haskell-prime committee when it was inaugurated in 2006 was whether or not Concurrent Haskell should be standardized. In the discussion, it was quickly discovered that a memory model for IORefs would be needed (continued here). As of writing, no decision has been made as to whether or not IORefs should have a strong or weak memory model.

The upshot is that, as far as Haskell the standardized language goes, the behavior here is completely undefined. To really be able to say anything, we’ll have to pick an implementation (GHC Haskell), and we’ll have to infer which aspects of the implementation are specified behavior, as opposed to things that just accidentally happen to hold. Notably, memory models have implications for all levels of your stack (it is a common misconception that a memory barrier can be used without any cooperation from your compiler), so to do this analysis we’ll need to look at all of the phases of the GHC compilation chain. Furthermore, we’ll restrict ourselves to monadic reads/writes, to avoid having to wrangle with the can of worms that is laziness.

Here’s GHC’s compilation pipeline in a nutshell:


At the very top of the compiler pipeline lie the intermediate languages Core and STG. These will preserve sequential consistency with no trouble, as the ordering of reads and writes is fixed by the use of monads, and preserved throughout the desugaring and optimization passes: as far as the optimizer is concerned, the primitive operations which implement read/write are complete black boxes. In fact, monads will over-sequentialize in many cases! (It is worth remarking that rewrite rules and GHC plugins could apply optimizations which do not preserve the ordering imposed by monads. Of course, both of these facilities can be used to also change the meaning of your program entirely; when considering a memory model, these rules merely have a higher burden of correctness.)

The next step of the pipeline is a translation into C--, a high-level assembly language. Here, calls to primitive operations like readMutVar# and writeMutVar# are translated into actual memory reads and writes in C--. Importantly, the monadic structure that was present in Core and STG is now eliminated, and GHC may now apply optimizations which reorder reads and writes. What actually occurs is highly dependent on the C-- that is generated, as well as the optimizations that GHC applies, and C-- has no memory model, so we cannot appeal to even that.

This being said, a few things can be inferred from a study of the optimization passes that GHC does implement:

  • GHC reserves the right to reorder stores: the WriteBarrier mach-op (NB: not available from Haskell!) is defined to prevent future stores from occurring before preceding stores. In practice, GHC has not implemented any C-- optimizations which reorder stores, so if you have a story for dealing with the proceeding stages of the pipeline, you can dangerously assume that stores will not be reordered in this phase.
  • GHC reserves the right to reorder loads, and does so extensively. One of the most important optimizations we perform is a sinking pass, where assignments to local variables are floated as close to their use-sites as possible. As of writing, there is no support for read barrier, which would prevent this floating from occurring.

There are a few situations where we happen to avoid read reordering (which may be dangerously assumed):

  • Reads don’t seem to be reordered across foreign primops (primops defined using the foreign prim keywords). This is because foreign primops are implemented as a jump to another procedure (the primop), and there are no inter-procedural C-- optimizations at present.
  • Heap reads don’t seem to be reordered across heap writes. This is because we currently don’t do any aliasing analysis and conservatively assume the write would have clobbered the read. (This is especially dangerous to assume, since you could easily imagine getting some aliasing information from the frontend.)

Finally, the C-- is translated into either assembly (via the NCG—N for native) or to LLVM. During translation, we convert the write-barrier mach-op into an appropriate assembly instruction (no-op on x86) or LLVM intrinsic (sequential consistency barrier); at this point, the behavior is up to the memory model defined by the processor and/or by LLVM.

It is worth summarizing the discussion here by comparing it to the documentation at Data.IORef, which gives an informal description of the IORef memory model:

In a concurrent program, IORef operations may appear out-of-order to another thread, depending on the memory model of the underlying processor architecture...The implementation is required to ensure that reordering of memory operations cannot cause type-correct code to go wrong. In particular, when inspecting the value read from an IORef, the memory writes that created that value must have occurred from the point of view of the current thread.

In other words, “We give no guarantees about reordering, except that you will not have any type-safety violations.” This behavior can easily occur as a result of reordering stores or loads. However, the type-safety guarantee is an interesting one: the last sentence remarks that an IORef is not allowed to point to uninitialized memory; that is, we’re not allowed to reorder the write to the IORef with the write that initializes a value. This holds easily on x86, due to the fact that C-- does not reorder stores; I am honestly skeptical that we are doing the right thing on the new code generator for ARM (but no one has submitted a bug yet!)

What does it all mean?

This dive into the gory internals of GHC is all fine and nice, but what does it mean for you, the prospective implementor of a snazzy new concurrent data structure? There are three main points:

  1. Without inline foreign primops, you will not be able to convince GHC to emit the fast-path assembly code you are looking for. As we mentioned earlier, foreign primops currently always compile into out-of-line jumps, which will result in a bit of extra cost if the branch predictor is unable to figure out the control flow. On the plus side, any foreign primop call will accidentally enforce the compiler-side write/read barrier you are looking for.
  2. With inline foreign primops, you will still need make modifications to GHC in order to ensure that optimization passes respect your snazzy new memory barriers. For example, John Lato’s desire for a load-load barrier (the email which kicked off this post) will be fulfilled with no compiler changes by a out-of-line foreign primop, but not by the hypothetical inline foreign primop.
  3. This stuff is really subtle; see the position paper Relaxed memory models must be rigorous, which argues that informal descriptions of memory models (like this blog post!) are far too vague to be useful: if you want to have any hope of being correct, you must formalize it! Which suggests an immediate first step: give C-- a memory model. (This should be a modest innovation over the memory models that C and C++ have recently received.)

For the rest of us, we’ll use STM instead, and be in a slow but compositional and dead-lock free nirvana.

Two bugs in the borrow checker every Rust developer should know about

Apologies in advance: this post assumes familiarity with Rust.

Anyone who has done some coding in Rust may be familiar with the dreaded borrow checker, famous for obstructing the compilation of otherwise “perfectly reasonable code.” In many cases, the borrow checker is right: you’re writing your code wrong, and there is another, clearer way to write your code that will appease the borow checker. But sometimes, even after you’ve skimmed the tutorial, memorized the mantra “a &mut pointer is the only way to mutate the thing that it points at” and re-read the borrowed pointers tutorial, the borrow-checker might still stubbornly refuse to accept your code.

If that’s the case, you may have run into one of the two (in)famous bugs in the borrow-checker. In this post, I want to describe these two bugs, give situations where they show up and describe some workarounds. This is the kind of post which I hope becomes obsolete quickly, but the fixes for them are pretty nontrivial, and you are inevitably going to run into these bugs if you try to program in Rust today.

Mutable borrows are too eager (#6268)

Summary. When you use &mut (either explicitly or implicitly), Rust immediately treats the lvalue as borrowed and imposes its restrictions (e.g. the lvalue can’t be borrowed again). However, in many cases, the borrowed pointer is not used until later, so imposing the restrictions immediately results in spurious errors. This situation is most likely to occur when there is an implicit use of &mut. (Bug #6268)

Symptoms. You are getting the error “cannot borrow `foo` as immutable because it is also borrowed as mutable”, but the reported second borrow is an object dispatching a method call, or doesn’t seem like it should have been borrowed at the time the flagged borrow occured.

Examples. The original bug report describes the situation for nested method calls, where the outer method call has &mut self in its signature:

fn main() {
  let mut map = std::hashmap::HashMap::new();
  map.insert(1, 2);
  map.insert(2, *map.get(&1)); // XXX
} 4:20 error: cannot borrow `map` as immutable because it is also borrowed as mutable   map.insert(2, *map.get(&1)); // XXX
                           ^~~ 4:5 note: second borrow of `map` occurs here   map.insert(2, *map.get(&1)); // XXX

This code would like to retrieve the value at key 1 and store it in key 2. Why does it fail? Consider the signature fn insert(&mut self, key: K, value: V) -> bool: the insert method invocation immediately takes out a &mut borrow on map before attempting to evaluate its argument. If we desugar the method invocation, the order becomes clear: HashMap::insert(&mut map, 2, *map.get(&1)) (NB: this syntax is not implemented yet). Because Rust evaluates arguments left to right, this is equivalent to:

let x_self : &mut HashMap<int> = &mut map;
let x_arg1 : int = 2;
let x_arg2 : int = *map.get(&1); // XXX
HashMap::insert(x_self, x_arg1, x_arg2);

meaning there is an active borrow by the time we call map.get. A minor rewrite resolves the problem:

fn main() {
  let mut map = std::hashmap::HashMap::new();
  map.insert(1, 2);
  let x = *map.get(&1);
  map.insert(2, x);

Sensitivity to order of arguments even when no method invocation is involved. Here is another example in which there is no method invocation:

fn g(x: &mut int) -> int { *x }
fn f(x: &mut int, y: int) { *x += y; }
fn main() {
    let mut a = 1;
    f(&mut a, g(&mut a));

Discussion. Fortunately, this bug is pretty easy to work around, if a little annoying: move all of your sub-expressions to let-bindings before the ill-fated mutable borrow (see examples for a worked example). Note: the borrows that occur in these sub-expressions really do have to be temporary; otherwise, you have a legitimate “cannot borrow mutable twice” error on your hands.

Borrow scopes should not always be lexical (#6393)

Summary. When you borrow a pointer, Rust assigns it a lexical scope that constitutes its lifetime. This scope can be as small as a single statement, or as big as an entire function body. However, Rust is unable to calculate lifetimes that are not lexical, e.g. a borrowed pointer is only live until halfway through a function. As a result, borrows may live longer than users might expect, causing the borrow checker to reject some statements. (Bug #6393)

Symptoms. You are getting a “cannot borrow foo as immutable/mutable because it is also borrowed as immutable/mutable”, but you think the previous borrow should have already expired.

Examples. This problem shows up in a variety of situations. The very simplest example which tickles this bug can be seen here:

fn main() {
    let mut x = ~1;
    let y = &mut *x;
    *y = 1;
    let z = &mut *x;
    *z = 1;
} 5:19 error: cannot borrow `*x` as mutable more than once at a time     let z = &mut *x;
                      ^~~~~~~ 3:19 note: second borrow of `*x` as mutable occurs here     let y = &mut *x;

Clearly y is dead after *y = 1, but the borrow checker can’t see that. Fortunately, in this case it is very easy to add a new lexical scope to solve the problem:

fn main() {
    let mut x = ~1;
        let y = &mut *x;
        *y = 1;
    let z = &mut *x;
    *z = 1;

So, when does this actually become a problem? The usual culprit is match statements. Here is some common code involving maps that you might want to write:

extern mod extra;
fn main() {
    let mut table = extra::treemap::TreeMap::new();
    let key = ~"test1";
    match table.find_mut(&key) {
        None    => table.insert(key.clone(), ~[1]), // XXX
        Some(v) => { v.push(1); false }
} 6:24 error: cannot borrow `table` as mutable more than once at a time         None    => table.insert(key.clone(), ~[1]), // XXX
                             ^~~~~ 5:15 note: second borrow of `table` as mutable occurs here     match table.find_mut(&key) {

table is a map of integer keys to vectors. The code performs an insert at key: if the map has no entry, then we create a new singleton vector and insert it in that location; otherwise, it just pushes the value 1 onto the existing vector. Why is table borrowed in the None branch? Intuitively, the borrow for table.find_mut should be dead, since we no longer are using any of the results; however, to Rust, the only lexical scope it can assign the borrowed pointer encompasses the entire match statement, since the borrowed pointer continues to be used in the Some branch (note that if the Some branch is removed, this borrow checks). Unfortunately, it’s not possible to insert a new lexical scope, as was possible in the previous example. (At press time, I wasn’t able to find a small example that only used if.)

Sometimes, the lifetime associated with a variable can force it to be assigned to a lexical scope that is larger than you would expect. Issue #9113 offers a good example of this (code excerpted below):

pub fn read1<'a>(&'a mut self, key: int) -> Option<&'a Data> {
    match self.cache.find(&key) {
        Some(data) => return Some(data),
        None => ()
    match self.db.find(&key) {
        Some(data) => {
            let result: &Data = self.cache.find_or_insert(key, data.clone());
        None => None
} 22:46 error: cannot borrow `(*self).cache` as mutable because it is also borrowed as immutable                 let result: &Data = self.cache.find_or_insert(key, data.clone());
                                               ^~~~~~~~~~ 15:24 note: second borrow of `(*self).cache` occurs here         match self.cache.find(&key) {

This code is attempting to perform a database lookup; it first consults the cache and returns a cached entry if available. Otherwise, it looks for the value in the database, caching the value in the process. Ordinarily, you would expect the borrow of self.cache in the first match to extend only for the first expression. However, the return statement throws a spanner in the works: it forces the lifetime of data to be 'a, which encompasses the entire function body. The borrow checker then concludes that there is a borrow everywhere in the function, even though the function immediately returns if it takes out this borrow.

Discussion. The workaround depends on the nature of the scope that is causing trouble. When match is involved, you can usually arrange for the misbehaving borrow to be performed outside of the match statement, in a new, non-overlapping lexical scope. This is easy when the relevant branch does not rely on any variables from the pattern-match by using short-circuiting control operators:

extern mod extra;
use extra::treemap::TreeMap;
fn main() {
    let mut table: TreeMap<~str,~[int]> = TreeMap::new();
    let key = ~"test1";
    match table.find_mut(&key) {
        None    => {},
        Some(v) => { v.push(1); return }
    table.insert(key.clone(), ~[1]); // None-case

Alternately, instead of directly returning, the match can assign a boolean to indicate whether or not the None-case should be run:

extern mod extra;
use extra::treemap::TreeMap;
fn main() {
    let mut table: TreeMap<~str,~[int]> = TreeMap::new();
    let key = ~"test1";
    let is_none = match table.find_mut(&key) {
        None    => true,
        Some(v) => { v.push(1); false }
    if is_none {
        table.insert(key.clone(), ~[1]);

The boolean can be elaborated into an enum that holds any non-references from the pattern-match you might need. Note that this will not work for borrowed references; but in that case, the borrow truly was still live!

It is a bit more difficult to workaround problems regarding lifetimes, since there is nowhere in the function the pointer is not “borrowed”. One trick which can work in some situations is to convert the function to continuation passing style: that is, instead of returning the borrowed pointer, accept a function argument which gets invoked with the function. pnkfelix describes how you might go about fixing the third example. This removes the lifetime constraint on the variable and resolves the problem.

The lexical scope assigned to a borrow can be quite sensitive to code pertubation, since removing a use of a borrow can result in Rust assigning a (much) smaller lexical scope to the borrow, which can eliminate the error. Sometimes, you can avoid the problem altogether by just avoiding a borrow.


To sum up:

  1. Bug #6268 can cause borrows to start too early (e.g. in method invocations), work around it by performing temporary borrows before you do the actual borrow.
  2. Bug #6393 can cause borrows to end too late (e.g. in match statements), work around it by deferring operations that need to re-borrow until the original lexical scope ends.

Keep these in mind, and you should be able to beat the borrow checker into submission. That is, until Niko fixes these bugs.

Visualizing a block allocator

GHC’s block allocator is a pretty nifty piece of low-level infrastructure. It offers a much more flexible way of managing a heap, rather than trying to jam it all in one contiguous block of memory, and is probably something that should be of general interest to anyone who is implementing low-level code like a runtime. The core idea behind it is quite old (BIBOP: Big Bag of Pages), and is useful for any situation where you have a number of objects that are tagged with the same descriptor, and you don’t want to pay the cost of the tag on each object.

Managing objects larger than pages is a bit tricky, however, and so I wrote a document visualizing the situation to help explain it to myself. I figured it might be of general interest, so you can get it here:

Some day I’ll convert it into wikiable form, but I don’t feel like Gimp'ing the images today...

Xmonad and media keys on Saucy

Ubuntu continues on its rampage of breaking perfectly good software, and on my most recent upgrade to Saucy Salamander, I discovered to my dismay that my media keys (e.g. volume keys, fn (function) keys, suspend button, etc) had stopped working. Of course, it worked fine if I logged into my user using Unity, but who wants to use a silly window manager like that...

The root problem, according to these Arch Linux forum posts is that Gnome has moved media-key support out of gnome-settings-daemon (which any self-respecting Xmonad user is sure to spawn) and into their window manager proper. Which, of course, is no good because I don’t want to use their window manager!

For now, it seems the simplest method of bringing back this functionality is to run a 3.6 version of gnome-settings-daemon. Fortunately, at least for Saucy, there are a few builds of 3.6 available before they upgraded to 3.8. So, all you need to do is grab these two deb files appropriate for your architecture (you need gnome-control-center too, because it has a dependency on gnome-settings-daemon):

Once you've downloaded the appropriate deb files, a dpkg -i $DEBFILE and then apt-mark hold gnome-control-center gnome-settings-daemon should do the trick. You should run an aptitude upgrade to make sure you haven't broken any other dependencies (for example, gnome-shell). (Power-users can add the debs to a local repo and then downgrade explicitly from apt-get.)

Moving forward, we will probably be forced to reimplement media key bindings in some other software package, and it would be nice if this could be standardized in some way. Linux Mint has already forked gnome-settings-daemon, with their cinnamon-settings-daemon, but I've not tried it, and have no idea how well it works.

Update. Trusty has an updated version of this package which restores support, so I am providing backports via my PPA.

If you’re using lift, you’re doing it wrong (probably)

David Darais asked me to make this public service announcement: If you're using lift, you're doing it wrong. This request was prompted by several talks at ICFP about alternatives to monad transformers in Haskell, which all began their talk with the motivation, "Everyone hates lifting their operations up the monad stack; therefore, we need another way of organizing effects." This StackOverflow question describes the standard technique that mtl uses to remove the use of lift in most monadic code.

Now, as most things go, the situation is a bit more nuanced than just "never use lift", and a technically incorrect quip at the beginning of a talk does not negate the motivation behind other effect systems. Here are some of the nuances:

  • As everyone is well aware, when a monad transformer shows up multiple times in the monad stack, the automatic type class resolution mechanism doesn't work, and you need to explicitly say which monad transformer you want to interact with.
  • This mechanism only works if the monadic operations you are interacting with are suitably generalized to begin with, e.g. MonadReader a m => m a rather than Monad m => ReaderT m a or Reader a. This is especially evident for the IO monad, where most people have not generalized their definitions to MonadIO. Fortunately, it is generally the case that only one liftIO is necessary.

And of course, there are still many reasons why you would want to ditch monad transformers:

  • Type-class instances are inherently unordered, and thus a generalized MonadCont m, MonadState m => m a monadic value says nothing about what order the two relevant monads are composed. But the order of this composition has an important semantic effect on how the monad proceeds (does the state transfer or reset over continuation jumps). Thus, monad transformers can have subtle interactions with one another, when sometimes you want non-interfering effects that are truly commutative with one another. And indeed, when you are using the type class approach, you usually use only monads that commute with one another.
  • The interference between different monad transformers makes it difficult to lift certain functions. For example, the type of mask :: ((forall a. IO a -> IO a) -> IO b) -> IO b. If we think operationally what has to happen when IO is composed with State, the lifter has to some how arrange for the state to transfer all the way into the code that runs with exceptions restored. That's very tricky to do in a general way. It gets even worse when these callbacks are invoked multiple times.
  • At the end of the day, while the use of type classes makes the monad stack somewhat abstract and allows the elision of lifts, most of this code is written with some specific monad stack in mind. Thus, it is very rare for nontrivial programs to make use of multiple effects in a modular way, or for effects to be instantiated (i.e. a concrete monad selected) without concretizing the rest of the monad stack.

Monad transformers have problems, let's argue against them for the right reasons!

Of Monadic Fixpoints and Heap Offsets

Here at ICFP, sometimes the so-called “hallway track” is sometimes just as important as the ordinary track. Johan Tibell was wanting to avoid an out-of-line call to allocate function in GHC when a small array of statically known size was allocated. But he found the way that GHC's new code generator handles heap allocation a bit confusing, and so we skipped out of one session today to work it out. In this post, I would like to explain how the code generation monad figures out what the heap offsets in the code are, by way of a kind of cute (and also slightly annoying) trick involving a “monadic” fixpoint.

First, some background about the code generator. The big overall pattern of a function that GHC has to generate code for is something like:

  1. Check if there is enough heap space, if not GC,
  2. Write a bunch of data to the heap,
  3. Push some things to the stack,
  4. Jump to the appropriate continuation.

Concretely, the code will be along the lines of:

    _s28e::P64 = R2;
    // Check if there is enough heap space
    Hp = Hp + 40;
    if (Hp > HpLim) goto c2ET; else goto c2ES;
    // If not enough space, GC
    HpAlloc = 40;
    R2 = _s28e::P64;
    R1 = withEmpty_riC_static_closure;
    call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
    // Write a bunch of data to the heap
    I64[Hp - 32] = sat_s28f_info;
    _c2EG::P64 = Hp - 32;
    I64[Hp - 16] = :_con_info;
    P64[Hp - 8] = _c2EG::P64;
    P64[Hp] = _s28e::P64;
    _c2EO::P64 = Hp - 14;
    R1 = _c2EO::P64;
    // No stack updates this time
    // Jump to the continuation
    call (P64[Sp])(R1) args: 8, res: 0, upd: 8;

This seems reasonable, but how does one go about actually generating this code? The code is generated in order, but the amount of heap that needs to be checked is not known until we've finished laying out the rest of the code. If we put on our mutation hats, we might say, “Well, leave it out for now, and then mutate it in when you know the actual value”, but there is still the knotty question of what the offsets should be when we are writing values to the heap. Notice that in the above code, we only bump the heap pointer once; if we repeatedly bump the heap pointer, then the offsets are easy to calculate, but we are wasting instructions; x86 addressing modes support writing to a register plus some offset directly.

Let’s take a look what GHC does when it allocates a dynamic closure to the heap (simplified):

allocDynClosureCmm info_tbl args_offsets
  = do  { virt_hp <- getVirtHp
        ; let rep = cit_rep info_tbl -- cit = c info table
              info_offset = virt_hp + 1 -- virtual heap offset of first word of new object
              info_ptr = CmmLit (CmmLabel (cit_lbl info_tbl))
        ; base <- getHpRelOffset (virt_hp + 1)
        ; emitSetDynHdr base info_ptr
        ; let (args, offsets) = unzip args_offsets
        ; hpStore base args offsets
        ; setVirtHp (virt_hp + heapClosureSize rep)
        ; getHpRelOffset info_offset

In words, it:

  1. Retrieves a “virtual heap pointer” (more on this later),
  2. Gets the true Hp - n expression (base) using the virtual heap pointer (getHpRelOffset, N.B. the off-by-one),
  3. Emits a bunch of writes to the memory at base (emitSetDynHdr and hpStore),
  4. Bumps the virtual Hp up with the size of the just allocated closure,
  5. Returns the Hp - n expression.

As it turns out, the virtual heap pointer is just an ordinary state variable in the code generation monad FCode (it’s good to take a look at the implementation of the monad you’re using!):

newtype FCode a = FCode (CgInfoDownwards -> CgState -> (# a, CgState #))

data CgState
  = MkCgState { ...
     cgs_hp_usg  :: HeapUsage,
     ... }

data HeapUsage =
  HeapUsage {
        virtHp :: VirtualHpOffset, -- Virtual offset of highest-allocated word
                                   --   Incremented whenever we allocate
        realHp :: VirtualHpOffset  -- realHp: Virtual offset of real heap ptr
                                   --   Used in instruction addressing modes

So virtHp just marches upwards as we allocate things; it is, in effect, the contents of the Hp register in our inefficient, rebumping implementation.

Which leaves us with the pressing question, what is realHp? Well, it starts off as zero (since the offset of the real heap pointer is just zero), but once we bump the heap pointer to do the stack check, it is now precisely the amount of heap we did the heap check for. Calling back our example:

    _s28e::P64 = R2;
    // Check if there is enough heap space
    // virtHp = 0; realHp = 0
    Hp = Hp + 40;
    // virtHp = 0; realHp = 40
    if (Hp > HpLim) goto c2ET; else goto c2ES;
    // If not enough space, GC
    HpAlloc = 40;
    R2 = _s28e::P64;
    R1 = withEmpty_riC_static_closure;
    call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
    // Write a bunch of data to the heap
    // First closure
    // virtHp = 0; realHp = 40
    I64[Hp - 32] = sat_s28f_info;
    _c2EG::P64 = Hp - 32;
    // virtHp = 8; realHp = 40
    I64[Hp - 16] = :_con_info;
    P64[Hp - 8] = _c2EG::P64;
    P64[Hp] = _s28e::P64;
    _c2EO::P64 = Hp - 14;
    // virtHp = 32; realHp = 40
    R1 = _c2EO::P64;
    // No stack updates this time
    // Jump to the continuation
    call (P64[Sp])(R1) args: 8, res: 0, upd: 8;

(Actually, internally the offsets are recorded as words, so, this being 64-bit code, divide everything by eight. BTW, virtHp + 8 == realHp, and that's where the off-by-one comes from.) The math is a little fiddly, but getHpRelOffset will calculate the offsets for you; you just have to make sure the virtual offset is right!

OK, but we still haven’t figured out how we get this magic number 40 from in the first place! The key is to look at the code generator responsible for doing the heap check, heapCheck, which is wraps the call to code, which is actually responsible for the code generation:

heapCheck :: Bool -> Bool -> CmmAGraph -> FCode a -> FCode a
heapCheck checkStack checkYield do_gc code
  = getHeapUsage $ \ hpHw ->

Hey, what's that magic getHeapUsage function?

-- 'getHeapUsage' applies a function to the amount of heap that it uses.
-- It initialises the heap usage to zeros, and passes on an unchanged
-- heap usage.
-- It is usually a prelude to performing a GC check, so everything must
-- be in a tidy and consistent state.
-- Note the slightly subtle fixed point behaviour needed here

getHeapUsage :: (VirtualHpOffset -> FCode a) -> FCode a
getHeapUsage fcode
  = do  { info_down <- getInfoDown
        ; state <- getState
        ; let   fstate_in = state { cgs_hp_usg  = initHpUsage }
                (r, fstate_out) = doFCode (fcode hp_hw) info_down fstate_in
                hp_hw = heapHWM (cgs_hp_usg fstate_out)        -- Loop here!

        ; setState $ fstate_out { cgs_hp_usg = cgs_hp_usg state }
        ; return r }

And here, we see the monadic fixpoint. In order to provide the heap usage to fcode, GHC writes itself a check: hp_hw. The check is borrowed from the result of generating fcode, and the string attached is this: “As long as you don’t cash this check before you finish generating the code, everything will be OK!” (It’s a bit like a big bank in that respect.) Cute—and we only need to do the code generation once!

This technique is not without its dark side. hp_hw is dangerous; if you force it in the wrong place, you will chunder into an infinite loop. There are two uses of this variable, both in compiler/codeGen/StgCmmLayout.hs, which are careful not to force it. What would be nice is if one could explicitly mark hp_hw as blackholed, and attach a custom error message, to be emitted in the event of an infinite loop. How this might be accomplished is left as an exercise for the reader.

BTW, in case you aren't aware, I've been live-tumblr'ing coverage of ICFP at — the coverage is not 100%, and the editing is rough, but check it out!

Induction and logical relations

Logical relations are a proof technique which allow you to prove things such as normalization (all programs terminate) and program equivalence (these two programs are observationally equivalent under all program contexts). If you haven't ever encountered these before, I highly recommend Amal Ahmed's OPLSS lectures on the subject; you can find videos and notes from yours truly. (You should also be able to access her lectures from previous years.) This post is an excuse to talk about a formalization of two logical relations proofs in Agda I worked on during OPLSS and the weeks afterwards. I'm not going to walk through the code, but I do want expand on two points about logical relations:

  1. They work when simple induction would not, and
  2. The logical relation is not an inductive definition.

The full development is in the lr-agda repository on GitHub. Many thanks to Dan Licata for providing the initial development as a homework assignment for his OPLSS Agda course and for bushwhacking the substitution lemmas which are often the greatest impediment to doing proofs about the lambda calculus.

If you didn't know any better, you might try to prove normalization inductively, as follows:

To show that all programs normalize to a value, let us proceed inductively on the typing derivations. For example, in the application case, we need to show e1 e2 normalizes to some value v, given that e1 normalizes to v1 and e2 normalizes to v2. Well, the type of v1 is t1 -> t2, which means v1 = λx. e'. Uh oh: this should step to e'[v2/x], but I don’t know anything about this expression (e' could be anything). Stuck!

What is the extra oomph that logical relations gives you, that allows you to prove what was previously unprovable by usual induction? Let's think about our second proof sketch: the problem was that we didn't know anything e'. If we knew something extra about it, say, "Well, for some appropriate v, e'[v/x] will normalize," then we'd be able to make the proof go through. So if this definition of WN was our old proof goal:

WN : (τ : Tp) → [] ⊢ τ → Set
WN τ e = e ⇓

then what we'd like to do is extend this definition to include that "extra stuff":

WN : (τ : Tp) → [] ⊢ τ → Set
WN τ e = e ⇓ × WN' τ e
At this point, it would be good to make some remarks about how to read the Agda code here. WN is a type family, that is, WN τ e is the unary logical relation of type τ for expression e. The type of a type is Tp, which is a simple inductive definition; the type of a term is a more complicated [] ⊢ τ (utilizing Agda's mixfix operators; without them, you might write it Expr [] τ), which not only tells us that e is an expression, but well-typed, having the type τ under the empty context []. (This is an instance of the general pattern, where an inductive definition coincides with a well-formedness derivation, in this case the typing derivation.) e ⇓ is another mixfix operator, which is defined to be a traditional normalization (there exists some value v such that e reduces to v, e.g. Σ (λ v → value v × e ↦* v)).

But what is this extra stuff? In the case of simple types, e.g. booleans, we don't actually need anything extra, since we're never going to try to apply it like a function:

WN' : (τ : Tp) → [] ⊢ τ → Set
WN' bool e = Unit

For a function type, let us say that a function is WN (i.e. in the logical relation) if, when given a WN argument, it produces a WN result. (Because we refer to WN, this is in fact a mutually recursive definition.) This statement is in fact the key proof idea!

WN' (τ1 ⇒ τ2) e = (e1 : [] ⊢ τ1) → WN τ1 e1 → WN τ2 (app e e1)

There are a number of further details, but essentially, when you redo the proof, proving WN instead of plain old normalization, you no longer get stuck on the application case. Great! The flip-side, however, is that the proof in the lambda case is no longer trivial; you have to do a bit of work to show that the extra stuff (WN') holds. This was described to me as the "balloon" principle.

The two sides of the balloon are the "use of the inductive hypothesis" and the "proof obligation". When you have a weak inductive hypothesis, it doesn't give you very much, but you don't have to work as hard to prove it either. When you strengthen the inductive hypothesis, you can prove more things with it; however, your proof obligation is correspondingly increased. In the context of a normalization proof, the "use of the inductive hypothesis" shows up in the application case, and the "proof obligation" shows up in the lambda case. When you attempt the straightforward inductive proof, the lambda case is trivial, but the inductive hypothesis is so weak that the application case is impossible. In the logical relations proof, the application case falls out easily from the induction hypothesis, but you have to do more work in the lambda case.

We now briefly take a step back, to remark about the way we have defined the WN' type family, on our way to the discussion of why WN' is not an inductive definition. In Agda, there are often two ways of defining a type family: it can be done as a recursive function, or it can be done as an inductive definitions. A simple example of this is in the definition of a length indexed list. The standard inductive definition goes like this:

data Vec : Set → Nat → Set where
  vnil : {A : Set} → Vec A 0
  vcons : {A : Set} → A → Vec A n → Vec A (S n)

But I could also build the list out of regular old products, using a recursive function on the index:

Vec : Set → Nat → Set
Vec A 0 = Unit
Vec A (S n) = A × Vec A n

The two different encodings have their own strengths and weaknesses: using a recursive function often means that certain equalities are definitional (whereas you'd have to prove a lemma with the inductive definition), but an inductive definition lets you do case-analysis on the different possibilities.

Sometimes, it is simply not possible to do the inductive definition, and this is the case for a logical relation. This strengthening of the inductive hypothesis is to blame:

data WN' : τ → [] ⊢ τ → Set where
  WN/bool : {e : [] ⊢ bool} → WN' bool e
  WN/⇒ : {e1 : [] ⊢ τ1} → (WN τ1 e1 → WN τ2 (app e e1)) → WN' (τ1 ⇒ τ2) e

Agda doesn't complain about inductive-recursive definitions (though one should beware: they are not too well metatheoretically grounded at the moment), but it will complain about this definition. The problem is a familiar one: WN does not occur in strictly positive positions; in particular, it shows up as an argument to a function contained by the WN/⇒ constructor. So we can't use this!

As it turns out, the inability to define the logical relation inductively is not a big deal for normalization. However, it causes a bit more headaches for more complicated logical relations proofs, e.g. for equivalence of programs. When considering program equivalence, you need a binary relation relating values to values, saying when two values are equal. This can be stated very naturally inductively:

data V'⟦_⟧ : (τ : Tp) → [] ⊢ τ → [] ⊢ τ → Set where
   V/bool-#t : V'⟦ bool ⟧ #t #t
   V/bool-#f : V'⟦ bool ⟧ #f #f
   V/⇒ : {τ₁ τ₂ : Tp} {e₁ e₂ : [] ,, τ₁ ⊢ τ₂}
     → ((v₁ v₂ : [] ⊢ τ₁) → V'⟦ τ₁ ⟧ v₁ v₂ → E⟦ τ₂ ⟧ (subst1 v₁ e₁) (subst1 v₂ e₂))
     → V'⟦ τ₁ ⇒ τ₂ ⟧ (lam e₁) (lam e₂)

We define the relation by type. If a value is a boolean, then we say that #t (true) is related to itself, and #f is related to itself. If the value is a function, then we say that a lambda term is related to another lambda term if, when applied to two related values, the results are also related. This "function" is directly analogous to the extra stuff we added for the normalization proof. (If you like, you can mentally replace "related" with "equal", but this is misleading since it doesn't capture what is going on in the function case). But this fails the strict positivity check, so we have to define it recursively instead:

V⟦_⟧ : (τ : Tp) → [] ⊢ τ → [] ⊢ τ → Set
V⟦ bool ⟧    #t #t = Unit
V⟦ bool ⟧    #f #f = Unit
V⟦ bool ⟧ _ _  = Void
V⟦ τ₁ ⇒ τ₂ ⟧ (lam e) (lam e') = (v v' : [] ⊢ τ₁) → V⟦ τ₁ ⟧ v v' → E⟦ τ₂ ⟧ (subst1 v e) (subst1 v' e')
V⟦ τ₁ ⇒ τ₂ ⟧ _ _ = Void

Notice that the definition here is much less nice than the inductive definition: we need two fall through cases which assert a contradiction when two things could not possibly be equal, e.g. #t could not possibly be equal to #f. Furthermore, supposing that we are given V as a hypothesis while performing a proof, we can no longer just case-split on it to find out what kind of information we have; we have to laboriously first case split over the type and expression, at which point the function reduces. To give you a sense of how horrible this is, consider this function which converts from a inductive definition to the recursive definitions:

pV : {τ : Tp} → {e e' : [] ⊢ τ} → V⟦ τ ⟧ e e' → V'⟦ τ ⟧ e e'
pV {bool} {#t} {#t} V = V/bool-#t
pV {bool} {#t} {#f} ()
pV {bool} {#t} {if _ then _ else _} ()
pV {bool} {#t} {var _} ()
pV {bool} {#t} {app _ _} ()
pV {bool} {#f} {#t} ()
pV {bool} {#f} {#f} V = V/bool-#f
pV {bool} {#f} {if _ then _ else _} ()
pV {bool} {#f} {var _} ()
pV {bool} {#f} {app _ _} ()
pV {bool} {if _ then _ else _} ()
pV {bool} {var _} ()
pV {bool} {app _ _} ()
pV {_ ⇒ _} {if _ then _ else _} ()
pV {_ ⇒ _} {var _} ()
pV {_ ⇒ _} {lam _} {if _ then _ else _} ()
pV {_ ⇒ _} {lam _} {var _} ()
pV {_ ⇒ _} {lam _} {lam _} f = V/⇒ (\ v v' V → pE (f v v' V))
pV {_ ⇒ _} {lam _} {app _ _} ()
pV {_ ⇒ _} {app _ _} ()

Good grief! Perhaps the situation could be improved by improving how Agda handles wildcards in pattern matching, but at the moment, all of this is necessary.

"But wait, Edward!" you might say, "Didn't you just say you couldn't define it inductively?" Indeed, this function does not operate on the inductive definition I presented previously, but a slightly modified one, which banishes the non-strictly-positive occurrence by replacing it with V, the recursive definition:

V/⇒ : {τ₁ τ₂ : Tp} {e e' : [] ,, τ₁ ⊢ τ₂}
  → (V : (v v' : [] ⊢ τ₁) → V⟦ τ₁ ⟧ v v' {- the critical position! -} → E'⟦ τ₂ ⟧ (subst1 v e) (subst1 v' e'))
  → V'⟦ τ₁ ⇒ τ₂ ⟧ (lam e) (lam e')

This conversion function helps a lot, because agda-mode interacts a lot more nicely with inductive definitions (C-c C-c works!) than with recursive definitions in cases like this.

Why do logical relations in Agda? (or any proof assistant, for that matter?) Proofs using logical relations often follow the pattern of defining an appropriate logical relation for your problem, and then a lot of book-keeping to actually push the relation through a proof. Computers are great for doing book-keeping, and I think it is hugely informative to work through a logical relation proof in a proof assistant. An interesting challenge would be to extend this framework to a non-terminating language (adding step-indexes to the relation: the very pinnacle of book-keeping) or extending the lambda calculus with polymorphism (which requires some other interesting techniques for logical relations).

Cost semantics for STG in modern GHC

One of the problems with academic publishing is that it’s hard to keep old papers up-to-date. This is the certainly case for this 1995 Sansom paper on profiling non-strict, higher-order functional languages. While the basic ideas of the paper still hold, the actual implementation of cost centers in GHC has changed quite a bit, perhaps the most dramatic change being the introduction of cost center stacks. So while the old paper is good for giving you the basic idea of how profiling in GHC works, if you really want to know the details, the paper offers little guidance.

So what do you do when your cost semantics are out-of-date? Why, update them of course! I present an updated cost-semantics for STG in modern GHC (PDF) (GitHub). Eventually these will go in the GHC repository proper, alongside core-spec which is a similar document for Core. However, I haven't done any proofs with these semantics yet, so they are probably a little buggy.

Despite the lack of proofs, the formalization has been helpful already: I’ve already spotted one bug in the current implementation (remarked upon in the document). I’ve also identified a potential refactoring based on the way the rules are currently setup. Please let me know about any other bugs you find!