Existential Pontification and Generalized Abstract Digressions

## Elimination with a Motive (in Coq)

Elimination rules play an important role in computations over datatypes in proof assistants like Coq. In his paper "Elimination with a Motive", Conor McBride argued that "we should exploit a hypothesis not in terms of its immediate consequences, but in terms of the leverage it exerts on an arbitrary goal: we should give elimination a motive." In other words, proofs in a refinement setting (backwards reasoning) should use their goals to guide elimination.

I recently had the opportunity to reread this historical paper, and in the process, I thought it would be nice to port the examples to Coq. Here is the result:

http://web.mit.edu/~ezyang/Public/motive/motive.html

It's basically a short tutorial motivating John Major equality (also known as heterogeneous equality.) The linked text is essentially an annotated version of the first part of the paper—I reused most of the text, adding comments here and there as necessary. The source is also available at:

http://web.mit.edu/~ezyang/Public/motive/motive.v

• May 7, 2014

## The cost of weak pointers and finalizers in GHC

Weak pointers and finalizers are a very convenient feature for many types of programs. Weak pointers are useful for implementing memo tables and solving certain classes of memory leaks, while finalizers are useful for fitting "allocate/deallocate" memory models into a garbage-collected language. Of course, these features don’t come for free, and so one might wonder what the cost of utilizing these two (closely related) features is in GHC. In this blog post, I want to explain how weak pointers and finalizers are implemented in the GHC runtime system and characterize what extra overheads you incur by using them. This post assumes some basic knowledge about how the runtime system and copying garbage collection work.

### The userland API

The API for weak pointers is in System.Mem.Weak; in its full generality, a weak pointer consists of a key and a value, with the property that if the key is alive, then the value is considered alive. (A "simple" weak reference is simply one where the key and value are the same.) A weak pointer can also optionally be associated with a finalizer, which is run when the object is garbage collected. Haskell finalizers are not guaranteed to run.
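
For concreteness, here is a minimal sketch of the userland API in action; the particular key/value choice and the printed messages are mine, not from anything above, and (as just noted) the finalizer may or may not get a chance to run before the program exits:

import Control.Concurrent (threadDelay)
import Data.IORef
import System.Mem (performGC)
import System.Mem.Weak (mkWeak, deRefWeak)

main :: IO ()
main = do
  key <- newIORef (42 :: Int)
  -- a "simple" weak reference: the key and the value are the same object,
  -- with a Haskell finalizer attached
  w <- mkWeak key key (Just (putStrLn "finalizer ran"))
  m <- deRefWeak w
  case m of
    Just r  -> readIORef r >>= print   -- key still alive, so the value is too
    Nothing -> putStrLn "weak pointer already died"
  performGC
  threadDelay 100000  -- finalizers run on a separate thread, and only maybe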

Foreign pointers in Foreign.ForeignPtr also have the capability to attach a C finalizer; i.e. a function pointer that might get run during garbage collection. As it turns out, these finalizers are also implemented using weak pointers, but C finalizers are treated differently from Haskell finalizers.
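
As a quick illustration (a sketch only; the buffer here is allocated and then simply abandoned), attaching the standard free finalizer to a raw allocation looks like this:

import Foreign.ForeignPtr (newForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (mallocBytes, finalizerFree)
import Foreign.Ptr (Ptr)

main :: IO ()
main = do
  p <- mallocBytes 64 :: IO (Ptr ())
  -- finalizerFree is a C function pointer (to free), so unlike a Haskell
  -- finalizer it can be run directly during garbage collection
  fp <- newForeignPtr finalizerFree p
  withForeignPtr fp $ \_ -> putStrLn "buffer in use"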

### Representation of weak pointers

A weak pointer is a special type of object with the following layout:

typedef struct _StgWeak {   /* Weak v */
  StgClosure *cfinalizers;
  StgClosure *key;
  StgClosure *value;        /* v */
  StgClosure *finalizer;
  struct _StgWeak *link;
} StgWeak;


As we can see, we have pointers to the key and value, as well as separate pointers for a single Haskell finalizer (just a normal closure) and C finalizers (which have the type StgCFinalizerList). There is also a link field for linking weak pointers together. In fact, when the weak pointer is created, it is added to the nursery's list of weak pointers (aptly named weak_ptr_list). As of GHC 7.8, this list is global, so we do have to take out a global lock when a new weak pointer is allocated; however, the lock has been removed in HEAD.

### Garbage collecting weak pointers

Pop quiz! When we do a (minor) garbage collection on weak pointers, which of the fields in StgWeak are considered pointers, and which fields are considered non-pointers? The correct answer is: only the first field is considered a “pointer”; the rest are treated as non-pointers by normal GC. This is actually what you would expect: if we handled the key and value fields as normal pointer fields during GC, then they wouldn’t be weak at all.

Once garbage collection has been completed (modulo all of the weak references), we then go through the weak pointer list and check if the keys are alive. If they are, then the values and finalizers should be considered alive, so we mark them as live, and head back and do more garbage collection. This process will continue as long as we keep discovering new weak pointers to process; however, this will only occur when the key and the value are different (if they are the same, then the key must have already been processed by the GC). Live weak pointers are removed from the "old" list and placed into the new list of live weak pointers, for the next time.
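
To make the iteration concrete, here is a toy model of that fixpoint. All the names and the set-based "heap" are invented for illustration; the real RTS of course works over closures and a mutable heap, not strings:

import qualified Data.Set as S

type Obj = String

data WeakPtr = WeakPtr { wkey :: Obj, wvalue :: Obj }

-- ordinary reachability: transitive closure over the strong pointer graph
reachableFrom :: [(Obj, Obj)] -> S.Set Obj -> S.Set Obj
reachableFrom edges = go
  where
    go live | live == live' = live
            | otherwise     = go live'
      where live' = live `S.union`
                    S.fromList [ b | (a, b) <- edges, a `S.member` live ]

-- the weak-pointer fixpoint: whenever a key turns out to be live, mark its
-- value live and run "GC" again, until no new weak pointers fire
gcWithWeaks :: [(Obj, Obj)] -> [WeakPtr] -> S.Set Obj -> S.Set Obj
gcWithWeaks edges weaks roots = go (reachableFrom edges roots)
  where
    go live
      | null newlyLive = live
      | otherwise      = go (reachableFrom edges (live `S.union` S.fromList newlyLive))
      where newlyLive = [ wvalue w | w <- weaks
                        , wkey w `S.member` live
                        , not (wvalue w `S.member` live) ]

main :: IO ()
main = print (gcWithWeaks [("root", "k1")]
                          [WeakPtr "k1" "k2", WeakPtr "k2" "k3"]
                          (S.fromList ["root"]))
-- a chain of weak links: "k2" only becomes live after "k1" does, then "k3"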

Once there are no more newly discovered live pointers, the list of dead pointers is collected together, and the finalizers are scheduled (scheduleFinalizers). C finalizers are run on the spot during GC, while Haskell finalizers are batched together into a list and then shunted off to a freshly created thread to be run.

That's it! There are some details for how to handle liveness of finalizers (which are heap objects too, so even if an object is dead we have to keep the finalizer alive for one more GC) and threads (a finalizer for a weak pointer can keep a thread alive).

### Tallying up the costs

To summarize, here are the extra costs of a weak pointer:

1. Allocating a weak pointer requires taking a global lock (will be fixed in GHC 7.10) and costs six words (fairly hefty as far as Haskell heap objects tend to go.)
2. During each minor GC, processing weak pointers takes time linear to the size of the weak pointer lists for all of the generations being collected. Furthermore, this process involves traversing a linked list, so data locality will not be very good. This process may happen more than once, although once it is determined that a weak pointer is live, it is not processed again. The cost of redoing GC when a weak pointer is found to be live is simply the cost of synchronizing all parallel GC threads together.
3. The number of times you have to switch between GC'ing and processing weak pointers depends on the structure of the heap. Take a heap and add a special "weak link" from a key to its dependent weak value. Then we can classify objects by the minimum number of weak links we must traverse from a root to reach the object: call this the "weak distance". Supposing that a given weak pointer's weak distance is n, then we spend O(n) time processing that weak pointer during minor GC. The maximum weak distance determines how many times we need to redo the GC.

In short, weak pointers are reasonably cheap when they are not deeply nested: you only pay the cost of traversing a linked list of all of the pointers you have allocated once per garbage collection. In the pessimal case (a chain of weak links, where the value of each weak pointer was not considered reachable until we discovered its key is live in the previous iteration), we can spend quadratic time processing weak pointers.

• May 4, 2014

## Calculating Shanten in Mahjong

Move aside, poker! While the probabilities of various poker hands are well understood and tabulated, the Chinese game of chance Mahjong [1] enjoys a far more intricate structure of expected values and probabilities. [2] This is due in large part to the much larger variety of tiles available (136 tiles, as opposed to the standard playing card deck size of 52), as well as the turn-by-turn gameplay, which means there is quite a lot of strategy involved with what is ostensibly a game of chance. In fact, the subject is so intricate, I’ve decided to write my PhD thesis on it. This blog post is a condensed version of one chapter of my thesis, considering the calculation of shanten, which we will define below. I’ll be using Japanese terms, since my favorite variant of mahjong is Riichi Mahjong; you can consult the Wikipedia article on the subject if you need to translate.

### Calculating Shanten

The basic gameplay of Mahjong involves drawing a tile into a hand of thirteen tiles, and then discarding another tile. The goal is to form a hand of fourteen tiles (that is, after drawing, but before discarding a tile) which is a winning configuration. There are a number of different winning configurations, but most winning configurations share a similar pattern: the fourteen tiles must be grouped into four triples and a single pair. Triples are either three of the same tile, or three tiles in a sequence (there are three “suits” which can be used to form sequences); the pair is two of the same tile. Here is an example:

Represented numerically, this hand consists of the triples and pair 123 55 234 789 456.

One interesting quantity that is useful to calculate given a mahjong hand is the shanten number, that is, the number of tiles away from winning you are. This can be used to give you the most crude heuristic of how to play: discard tiles that get you closer to tenpai. The most widely known shanten calculator is this one on Tenhou’s website [3]; unfortunately, the source code for this calculator is not available. There is also a StackOverflow question on the subject, but the “best” answer offers only a heuristic approach with no proof of correctness! Can we do better?

Naïvely, the shanten number can be computed by a breadth-first search over modifications of a hand. When a winning hand is found, the algorithm terminates and reports the depth the search had reached. Such an algorithm is obviously correct; unfortunately, with 136 tiles, one would have to traverse $((136-13)\times 14)^n$ hands (choices of new tiles times choices of discard) while searching for a winning hand that is n-shanten away. If you are four tiles away, you will have to traverse over six trillion hands. We can reduce this number by avoiding redundant work if we memoize the shanten associated with hands: however, the total number of possible hands is roughly $\binom{136}{13}$, or 59 bits. Though we can fit (via a combinatorial number system) a hand into a 64-bit integer, the resulting table is still far too large to hope to fit in memory.

The trick is to observe that shanten calculation for each of the suits is symmetric; thus, we can use dynamic programming over a much smaller space, the tiles 1 through 9 for a single generic suit, and then reuse these results when assembling the final calculation. $\binom{9 \times 4}{13}$ is still rather large, so we can take advantage of the fact that because there are four copies of each tile, an equivalent representation is a 9-vector of the numbers zero to four, with the constraint that the sum of these numbers is 13. Even without the constraint, the count $5^9$ is only about two million, which is quite tractable. At a byte per entry, that’s 2MB of memory; less than your browser is using to view this webpage. (In fact, we want the constraint to be that the sum is less than or equal to 13, since not all hands are single-suited, so the number of tiles from any one suit may be smaller.)
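
For instance, such a suit configuration packs neatly into a base-5 index, making the per-suit table a flat array. This is only a sketch; the encode/decode names are hypothetical helpers, not anything from an actual implementation:

-- counts of tiles 1..9 for one suit, each between 0 and 4
type SuitConfig = [Int]

encode :: SuitConfig -> Int
encode = foldr (\digit acc -> acc * 5 + digit) 0

decode :: Int -> SuitConfig
decode n = take 9 (go n)
  where go m = m `mod` 5 : go (m `div` 5)

main :: IO ()
main = do
  print (5 ^ (9 :: Int))                  -- 1953125 entries, ~2MB at a byte each
  print (encode [1,2,3,0,0,0,0,0,0])      -- index of an example configuration
  print (decode (encode [1,2,3,0,0,0,0,0,0]))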

The breadth-first search for solving a single suit proceeds as follows (a Haskell sketch of the table-filling loop appears after the list):

1. Initialize a table A indexed by tile configuration (a 9-vector of 0..4).
2. Initialize a todo queue Q of tile configurations.
3. Initialize all winning configurations in table A with shanten zero (this can be done by enumeration), recording these configurations in Q.
4. While the todo queue Q is not empty, pop the front element, mark the shanten of all adjacent uninitialized nodes as one greater than that node, and push those nodes onto the todo queue.
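
Here is the promised sketch. The winning-configuration enumeration and the exact adjacency relation are elided: the winning seed and the swapOne neighbour function below are simplified stand-ins of my own invention, and only the distance-labelling structure of the search is the point:

import qualified Data.Map as M

type Config = [Int]   -- counts of tiles 1..9, each 0..4

-- swap a single tile within the suit (a simplified stand-in for adjacency)
swapOne :: Config -> [Config]
swapOne c = [ bump i (-1) (bump j 1 c)
            | i <- [0..8], c !! i > 0
            , j <- [0..8], j /= i, c !! j < 4 ]
  where bump k d xs = [ if n == k then x + d else x | (n, x) <- zip [0..] xs ]

-- label every configuration reachable from the shanten-zero seeds with its
-- breadth-first distance from those seeds
bfsTable :: [Config] -> M.Map Config Int
bfsTable winning = go (M.fromList [ (w, 0) | w <- winning ]) winning
  where
    go table []          = table
    go table (c : queue) = go table' (queue ++ new)
      where d      = table M.! c
            new    = [ n | n <- swapOne c, not (n `M.member` table) ]
            table' = foldr (\n t -> M.insert n (d + 1) t) table new

main :: IO ()
main = do
  -- toy seed: pretend "three of tile 2" is the only complete shape
  let table = bfsTable [[0,3,0,0,0,0,0,0,0]]
  print (M.lookup [1,2,0,0,0,0,0,0,0] table)   -- Just 1: one swap away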

With this information in hand, we can assemble the overall shanten of a hand. It suffices to try every distribution of the triples and the pair over the four types of tiles (also including null tiles), consult the shanten of the requested shape, and return the minimum over all these configurations. There are $4 \times \binom{4+4-1}{4}$ combinations (by stars and bars), for a total of 140 configurations. Computing the shanten of each configuration is a constant-time lookup into the table generated by the per-suit calculation. A true shanten calculator must also accommodate the rare other hands which do not follow this configuration, but these winning configurations are usually highly constrained, and quite easy to (separately) compute the shanten of.
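
The 140 figure is easy to sanity-check; here is a quick sketch (splits is a hypothetical helper enumerating the stars-and-bars distributions):

-- distribute 4 melds over the four tile categories, and pick which category
-- holds the pair; the count should be 4 * C(4+4-1, 4) = 140
distributions :: [([Int], Int)]
distributions = [ (melds, pairSuit) | melds <- splits 4 4, pairSuit <- [0 .. 3] ]
  where
    -- splits n k: all length-k lists of nonnegative integers summing to n
    splits n 1 = [[n]]
    splits n k = [ i : rest | i <- [0 .. n], rest <- splits (n - i) (k - 1) ]

main :: IO ()
main = print (length distributions)   -- 140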

With a shanten calculator, there are a number of other quantities which can be calculated. Uke-ire refers to the number of possible draws which can reduce the shanten of your hand: one strives for high uke-ire because it means you are more likely to draw a tile which moves your hand closer to winning. Given a hand, it's very easy to calculate its uke-ire: just look at all adjacent hands and count the number of them which have lower shanten.

### Further extensions

Suppose that you are trying to design an AI which can play Mahjong. Would the above shanten calculator provide a good evaluation metric for your hand? Not really: it has a major drawback, in that it does not consider the fact that some tiles are simply unavailable (they were discarded). For example, if all four “nine stick” tiles are visible on the table, then no hand configuration containing a nine stick is actually reachable. Adjusting for this situation is actually quite difficult, for two reasons: first, we can no longer precompute a shanten table, since we need to adjust at runtime what the reachability metric is; second, the various suits are no longer symmetric, so we have to do three times as much work. (We can avoid an exponential blowup, however, since there is no inter-suit interaction.)

Another downside of the shanten and uke-ire metrics is that they are not direct measures of “tile efficiency”: that is, they do not directly dictate a strategy for discards which minimizes the expected time before you get a winning hand. Consider, for example, a situation where you have the tiles 233, and only need to make another triple in order to win. You have two possible discards: you can discard a 2 or a 3. In both cases, your shanten is zero, but discarding a 2, you can only win by drawing a 3, whereas discarding a 3, you can win by drawing a 1 or a 4. Maximizing efficiency requires considering the lifetime uke-ire of your hands.

Even then, perfect tile efficiency is not enough to see victory: every winning hand is associated with a point-score, and so in many cases it may make sense to go for a lower-probability hand that has higher expected value. Our decomposition method completely falls apart here, as while the space of winning configurations can be partitioned, scoring has nonlocal effects, so the entire hand has to be considered as a whole. In such cases, one might try for a Monte Carlo approach, since the probability space is too difficult to directly characterize. However, in the Japanese Mahjong scoring system, there is yet another difficulty with this approach: the scoring system is exponential. Thus, we are in a situation where the majority of samples will be low scoring, but exponentially few samples have an exponential payoff. In such cases, it’s difficult to say if random sampling will actually give a good result, since it is likely to miscalculate the payoff, unless exponentially many samples are taken. (On the other hand, because these hands are so rare, an AI might do considerably well simply ignoring them.)

To summarize, Mahjong is a fascinating game, whose large state space makes it difficult to accurately characterize the probabilities involved. In my thesis, I attempt to tackle some of these questions; please check it out if you are interested in more.

[1] No, I am not talking about the travesty that is mahjong solitaire.

[2] To be clear, I am not saying that poker strategy is simple—betting strategy is probably one of the most interesting parts of the game—I am simply saying that the basic game is rather simple, from a probability perspective.

[3] Tenhou is a popular Japanese online mahjong client. The input format for the Tenhou calculator is 123m123p123s123z, where numbers before m indicate man tiles, p pin tiles, s sou tiles, and z honors (in order, they are: east, south, west, north, white, green, red). Each entry indicates which tile you can discard to move closer to tenpai; the next list is of uke-ire (and the number of tiles which move the hand further).

• April 1, 2014

## Haskell for Coq programmers

So you may have heard about this popular new programming language called Haskell. What's Haskell? Haskell is a non-dependently typed programming language, sporting general recursion, type inference and built-in side-effects. It is true that dependent types are considered an essential component of modern, expressive type systems. However, giving up dependence can result in certain benefits for other aspects of software engineering, and in this article, we'd like to talk about the omissions that Haskell makes to support these changes.

### Syntax

There are a number of syntactic differences between Coq and Haskell, which we will point out as we proceed in this article. To start with, we note that in Coq, typing is denoted using a single colon (false : Bool); in Haskell, a double colon is used (False :: Bool). Additionally, Haskell has a syntactic restriction, where constructors must be capitalized, while variables must be lower-case.

Similar to my OCaml for Haskellers post, code snippets will have the form:

(* Coq *)

{- Haskell -}


### Universes/kinding

A universe is a type whose elements are types. They were originally introduced to constructive type theory by Per Martin-Löf. Coq sports an infinite hierarchy of universes (e.g. Type (* 0 *) : Type (* 1 *), Type (* 1 *) : Type (* 2 *), and so forth).

Given this, it is tempting to draw an analogy between universes and Haskell’s kind of types * (pronounced “star”), which classifies types in the same way Type (* 0 *) classifies primitive types in Coq. Furthermore, the sort box classifies kinds (* : BOX, although this sort is strictly internal and cannot be written in the source language). However, the resemblance here is only superficial: it is misleading to think of Haskell as a language with only two universes. The differences can be summarized as follows:

1. In Coq, universes are used purely as a sizing mechanism, to prevent the creation of types which are too big. In Haskell, types and kinds do double duty to enforce the phase distinction: if a has kind *, then x :: a is guaranteed to be a runtime value; likewise, if k has sort box, then a :: k is guaranteed to be a compile-time value. This structuring is a common pattern in traditional programming languages, although knowledgeable folks like Conor McBride think that ultimately this is a design error, since one doesn’t really need a kinding system to have type erasure.
2. In Coq, universes are cumulative: a term which has type Type (* 0 *) also has type Type (* 1 *). In Haskell, there is no cumulativity between types and kinds: if Nat is a type (i.e. has kind *), it is not automatically a kind. However, in some cases, partial cumulativity can be achieved using datatype promotion, which constructs a separate kind-level replica of a type, where the data constructors are now type-level constructors. Promotion is also capable of promoting type constructors to kind constructors.
3. In Coq, a common term language is used at all levels of universes. In Haskell, there are three distinct languages: a language for handling base terms (the runtime values), a language for handling type-level terms (e.g. types and type constructors) and a language for handling kind-level terms. In some cases this syntax is overloaded, but in later sections, we will often need to say how a construct is formulated separately at each level of the kinding system.

One further remark: Type in Coq is predicative; in Haskell, * is impredicative, following the tradition of System F and other languages in the lambda cube, where kinding systems of this style are easy to model.

### Function types

In Coq, given two types A and B, we can construct the type A -> B denoting functions from A to B (for A and B of any universe). As in Coq, functions with multiple arguments are supported via currying. Haskell supports function types for both types (Int -> Int) and kinds (* -> *, the kind of type constructors), with application by juxtaposition (e.g. f x). (Function types are subsumed by pi types; we defer this discussion until later.) However, Haskell has some restrictions on how one may construct functions, and utilizes different syntax when handling types and kinds:

For expressions (with type a -> b where a, b :: *), both direct definitions and lambdas are supported. A direct definition is written in an equational style:

Definition f x := x + x.

f x = x + x


while a lambda is represented using a backslash:

fun x => x + x

\x -> x + x


For type-level functions (of kind k1 -> k2, where k1 and k2 are kinds), the lambda syntax is not supported. In fact, no higher-order behavior is permitted at the type-level; while we can directly define appropriately kinded type functions, at the end of the day, these functions must be fully applied or they will be rejected by the type-checker. From an implementation perspective, the omission of type lambdas makes type inference and checking much easier.

1. Type synonyms:

Definition Endo A := A -> A.

type Endo a = a -> a


Type synonyms are judgmentally equal to their expansions. As mentioned above, they cannot be partially applied. They were originally intended as a limited syntactic mechanism for making type signatures more readable.

2. Closed type (synonym) families:

Inductive fcode :=
| intcode : fcode
| anycode : fcode.
Definition interp (c : fcode) : Type :=
  match c with
  | intcode => bool
  | anycode => char
  end.

type family F a where
F Int = Bool
F a   = Char


While closed type families look like the addition of typecase (and would violate parametricity in that case), this is not the case, as closed type families can only return types. In fact, closed type families correspond to a well-known design pattern in Coq, where one writes an inductive data type representing codes of types, and then defines an interpretation function which interprets the codes as actual types. As we have stated earlier, Haskell has no direct mechanism for defining functions on types, so this useful pattern had to be supported directly in the type families functionality. Once again, closed type families cannot be partially applied.

In fact, the closed type family functionality is a bit more expressive than an inductive code. In particular, closed type families support non-linear pattern matches (F a a = Int) and can sometimes reduce a term when no iota reductions are available, because some of the inputs are not known. The reason for this is because closed type families are “evaluated” using unification and constraint-solving, rather than ordinary term reduction as would be the case with codes in Coq. Indeed, nearly all of the “type level computation” one may perform in Haskell, is really just constraint solving. Closed type families are not available in a released version of GHC (yet), but there is a Haskell wiki page describing closed type families in more detail.

3. Open type (synonym) families:

(* Not directly supported in Coq *)

type family F a
type instance F Int = Char
type instance F Char = Int


Unlike closed type families, open type families operate under an open universe, and have no analogue in Coq. Open type families do not support nonlinear matching, and must completely unify to reduce. Additionally, there are a number of restrictions on the left-hand side and right-hand side of such families in order to maintain decidable type inference. The section of the GHC manual Type instance declarations expands on these limitations.

Both closed and open type families can be used to implement computation at the type level over data constructors which were lifted to the type level via promotion. Unfortunately, any such algorithm must be implemented twice: once at the expression level, and once at the type level. Use of metaprogramming can alleviate some of the boilerplate necessary; see, for example, the singletons library.
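
To illustrate the duplication, here is a sketch of addition written once at each level, over an unpromoted and a promoted Nat (the Plus name is mine, and the closed-family syntax is the one described above, which at the time of writing requires a development version of GHC):

{-# LANGUAGE DataKinds, TypeFamilies, KindSignatures #-}

data Nat = Z | S Nat deriving Show

-- the algorithm at the expression level...
plus :: Nat -> Nat -> Nat
plus Z     n = n
plus (S m) n = S (plus m n)

-- ...and the same algorithm again at the type level, over the promoted
-- constructors 'Z and 'S
type family Plus (m :: Nat) (n :: Nat) :: Nat where
  Plus 'Z     n = n
  Plus ('S m) n = 'S (Plus m n)

main :: IO ()
main = print (plus (S Z) (S (S Z)))   -- S (S (S Z))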

### Dependent function types (Π-types)

A Π-type is a function type whose codomain type can vary depending on the element of the domain to which the function is applied. Haskell does not have Π-types in any meaningful sense. However, if you want to use a Π-type solely for polymorphism, Haskell does have support. For polymorphism over types (e.g. with type forall a : k, a -> a, where k is a kind), Haskell has a twist:

Definition id : forall (A : Type), A -> A := fun A => fun x => x.

id :: a -> a
id = \x -> x


In particular, the standard notation in Haskell is to omit both the type-lambda (at the expression level) and the quantification (at the type level). The quantification at the type level can be recovered using the explicit universal quantification extension:

id :: forall a. a -> a


However, there is no way to directly explicitly state the type-lambda. When the quantification is not at the top-level, Haskell requires an explicit type signature with the quantification put in the right place. This requires the rank-2 (or rank-n, depending on the nesting) polymorphism extension:

Definition f : (forall A, A -> A) -> bool := fun g => g bool true.

f :: (forall a. a -> a) -> Bool
f g = g True


Polymorphism is also supported at the kind-level using the kind polymorphism extension. However, there is no explicit forall for kind variables; you must simply mention a kind variable in a kind signature.

Proper dependent types cannot be supported directly, but they can be simulated by first promoting data types from the expression level to the type-level. A runtime data-structure called a singleton is then used to refine the result of a runtime pattern-match into type information. This pattern of programming in Haskell is not standard, though there are recent academic papers describing how to employ it. One particularly good one is Hasochism: The Pleasure and Pain of Dependently Typed Haskell Programming, by Sam Lindley and Conor McBride.
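
Here is a small sketch of the pattern. The SNat and Vec names are conventional but mine (the singletons library generates this kind of boilerplate for you), and Vec is the same vector type that reappears later in this post:

{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

data Nat = Z | S Nat

-- the singleton: a runtime witness whose type mirrors a promoted Nat
data SNat :: Nat -> * where
  SZ :: SNat 'Z
  SS :: SNat n -> SNat ('S n)

data Vec :: Nat -> * -> * where
  VNil  :: Vec 'Z a
  VCons :: a -> Vec n a -> Vec ('S n) a

-- pattern-matching on the singleton refines the type-level length
vreplicate :: SNat n -> a -> Vec n a
vreplicate SZ     _ = VNil
vreplicate (SS n) x = VCons x (vreplicate n x)

vlength :: Vec n a -> Int
vlength VNil         = 0
vlength (VCons _ xs) = 1 + vlength xs

main :: IO ()
main = print (vlength (vreplicate (SS (SS SZ)) 'x'))   -- 2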

### Product types

Coq supports cartesian product over types, as well as a nullary product type called unit. Very similar constructs are also implemented in the Haskell standard library:

(true, false) : bool * bool
(True, False) :: (Bool, Bool)

tt : unit
() :: ()


Pairs can be destructed using pattern-matching:

match p with
| (x, y) => ...
end

case p of
(x, y) -> ...


Red-blooded type theorists may take issue with this identification: in particular, Haskell’s default pair type is what is considered a negative type, as it is lazy in its values. (See more on polarity.) As Coq’s pair is defined inductively, i.e. positively, a more accurate identification would be with a strict pair, defined as data SPair a b = SPair !a !b; i.e. upon construction, both arguments are evaluated. This distinction is difficult to see in Coq, since positive and negative pairs are logically equivalent, and Coq does not distinguish between them. (As a total language, it is indifferent to choice of evaluation strategy.) Furthermore, it's relatively common practice to extract pairs into their lazy variants when doing code extraction.

### Dependent pair types (Σ-types)

Dependent pair types are the generalization of product types to be dependent. As before, Σ-types cannot be directly expressed, except in the case where the first component is a type. In this case, there is an encoding trick utilizing data types which can be used to express so-called existential types:

Definition p : { A : Type & A -> bool } := existT _ bool negb.

data Ex = forall a. Ex (a -> Bool)
p = Ex not


As was the case with polymorphism, the type argument to the dependent pair is implicit. It can be specified explicitly by way of an appropriately placed type annotation.

### Recursion

In Coq, all recursive functions must have a structurally decreasing argument, in order to ensure that all functions terminate. In Haskell, this restriction is lifted for the expression level; as a result, expression level functions may not terminate. At the type-level, by default, Haskell enforces that type level computation is decidable. However, this restriction can be lifted using the UndecidableInstances flag. It is generally believed that undecidable instances cannot be used to cause a violation of type safety, as nonterminating instances would simply cause the compiler to loop infinitely, and due to the fact that in Haskell, types cannot (directly) cause a change in runtime behavior.
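
For example, the following instance is rejected by the default termination check but accepted with the flag (a deliberately degenerate sketch); asking GHC to reduce Loop () merely exhausts its reduction-depth limit rather than breaking type safety:

{-# LANGUAGE TypeFamilies, UndecidableInstances #-}

type family Loop a
-- the right-hand side is no smaller than the left, so GHC cannot prove
-- that reduction terminates; UndecidableInstances waives the check
type instance Loop a = Loop a

main :: IO ()
main = putStrLn "compiles, but any use of Loop () would never finish reducing"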

### Inductive types/Recursive types

In Coq, one has the capacity to define inductive data types. Haskell has a similar-looking mechanism for defining data types, but there are a number of important differences which lead many to avoid using the moniker inductive data types for Haskell data types (although it’s fairly common for Haskellers to use the term anyway.)

Basic types like boolean can be defined with ease in both languages (in all cases, we will use the GADT syntax for Haskell data-types, as it is closer in form to Coq’s syntax and strictly more powerful):

Inductive bool : Type :=
| true : bool
| false : bool.

data Bool :: * where
True :: Bool
False :: Bool


Both also support recursive occurrences of the type being defined:

Inductive nat : Type :=
| z : nat
| s : nat -> nat.

data Nat :: * where
Z :: Nat
S :: Nat -> Nat


One has to be careful though: our definition of Nat in Haskell admits one more term: infinity (an infinite chain of successors). This is similar to the situation with products, and stems from the fact that Haskell is lazy.

Haskell’s data types support parameters, but these parameters may only be types, and not values. (Though, recall that data types can be promoted to the type level). Thus, the standard type family of vectors may be defined, assuming an appropriate type-level nat (as usual, explicit forall has been omitted):

Inductive vec (A : Type) : nat -> Type :=
| vnil  : vec A 0
| vcons : forall n, A -> vec A n -> vec A (S n).

data Vec :: Nat -> * -> * where
VNil  :: Vec Z a
VCons :: a -> Vec n a -> Vec (S n) a


As type-level lambda is not supported but partial application of data types is (in contrast to type families), the order of arguments in the type must be chosen with care. (One could define a type-level flip, but it could not be partially applied.)

Haskell data type definitions do not have the strict positivity requirement, since we are not requiring termination; thus, peculiar data types that would not be allowed in Coq can be written:

data Free f a where
Free :: f (Free f a) -> Free f a
Pure :: a -> Free f a

data Mu f where
Roll :: f (Mu f) -> Mu f


### Inference

Coq has support for requesting that a term be inferred by the unification engine, either by placing an underscore in a context or by designating an argument as implicit (how one might implement in Coq the omission of type arguments of polymorphic functions as seen in Haskell). Generally, one cannot expect all inference problems in a dependently typed language to be solvable, and the inner-workings of Coq’s unification engines (plural!) are considered a black art (no worry, as the trusted kernel will verify that the inferred arguments are well-typed).

Haskell as specified in Haskell'98 enjoys principal types and full type inference under Hindley-Milner. However, to recover many of the advanced features enjoyed by Coq, Haskell has added numerous extensions which cannot be easily accommodated by Hindley-Milner, including type-class constraints, multiparameter type classes, GADTs and type families. The current state-of-the-art is an algorithm called OutsideIn(X). With these features, there is no completeness guarantee. However, if the inference algorithm accepts a definition, then that definition has a principal type and that type is the type the algorithm found.

### Conclusion

This article started as a joke over in OPLSS'13, where I found myself explaining some of the hairier aspects of Haskell’s type system to Jason Gross, who had internalized Coq before he had learned much Haskell. Its construction was iced for a while, but later I realized that I could pattern the post off of the first chapter of the homotopy type theory book. While I am not sure how useful this document will be for learning Haskell, I think it suggests a very interesting way of mentally organizing many of Haskell’s more intricate type-system features. Are proper dependent types simpler? Hell yes. But it’s also worth thinking about where Haskell goes further than most existing dependently typed languages...

### Postscript

Bob Harper complained over Twitter that this post suggested misleading analogies in some situations. I've tried to correct some of his comments, but in some cases I wasn't able to divine the full content of his comments. I invite readers to see if they can answer these questions:

1. Because of the phase distinction, Haskell’s type families are not actually type families, in the style of Coq, Nuprl or Agda. Why?
2. This post is confused about the distinction between elaboration (type inference) and semantics (type structure). Where is this confusion?
3. Quantification over kinds is not the same as quantification over types. Why?

• March 17, 2014

## Equality, roughly speaking

In Software Foundations, equality is defined in this way:

Even Coq's equality relation is not built in. It has (roughly) the following inductive definition.

Inductive eq0 {X:Type} : X -> X -> Prop :=
refl_equal0 : forall x, eq0 x x.


Why the roughly? Well, as it turns out, Coq defines equality a little differently (reformatted to match the Software Foundations presentation):

Inductive eq1 {X:Type} (x:X) : X -> Prop :=
refl_equal1 : eq1 x x.


What’s the difference? The trick is to look at the induction principles that Coq generates for each of these:

eq0_ind
: forall (X : Type) (P : X -> X -> Prop),
(forall x : X, P x x) -> forall y y0 : X, eq0 y y0 -> P y y0

eq1_ind
: forall (X : Type) (x : X) (P : X -> Prop),
P x -> forall y : X, eq1 x y -> P y


During our Homotopy Type Theory reading group, Jeremy pointed out that the difference between these two principles is exactly the difference between path induction (eq0) and based path induction (eq1). (This is covered in the Homotopy Type Theory book in section 1.12.) So, Coq uses the slightly weirder definition because it happens to be a bit more convenient. (I’m sure this is folklore, but I sure didn’t notice this until now! For more reading, check out this excellent blog post by Dan Licata.)

• January 30, 2014

## How to maintain a pristine copy of your configuration files

etckeeper is a pretty good tool for keeping your /etc under version control, but one thing that it won’t tell you is the diff between your current configuration and a pristine version of your configuration (that is, what you would have if you had installed the same packages on the system, but didn’t change any configuration). People have wanted this, but I couldn’t find anything that actually did this. A month ago, I figured out a nice, easy way to achieve this under etckeeper with a Git repository. The idea is to maintain a pristine branch, and when an upgrade occurs, automatically apply the (automatically generated) patch to the pristine branch. This procedure works best on a fresh install, since I don’t have a good way of reconstructing history if you haven’t been tracking the pristine from the start.

Here’s how it goes:

1. Install etckeeper. It is best if you are using etckeeper 1.10 or later, but if not, you should replace 30store-metadata with a copy from the latest version. This is important, because pre-1.10, the metadata store included files that were ignored, which means you’ll get lots of spurious conflicts.

2. Initialize the Git repository using etckeeper init and make an initial commit git commit.

3. Create a pristine branch: git branch pristine (but stay on the master branch)

4. Modify the etckeeper configuration so that VCS="git", AVOID_DAILY_AUTOCOMMITS=1 and AVOID_COMMIT_BEFORE_INSTALL=1:

diff --git a/etckeeper/etckeeper.conf b/etckeeper/etckeeper.conf
index aedf20b..99b4e43 100644
--- a/etckeeper/etckeeper.conf
+++ b/etckeeper/etckeeper.conf
@@ -1,7 +1,7 @@
# The VCS to use.
#VCS="hg"
-#VCS="git"
-VCS="bzr"
+VCS="git"
+#VCS="bzr"
#VCS="darcs"

# Options passed to git commit when run by etckeeper.
@@ -18,7 +18,7 @@ DARCS_COMMIT_OPTIONS="-a"

# Uncomment to avoid etckeeper committing existing changes
# to /etc automatically once per day.
-#AVOID_DAILY_AUTOCOMMITS=1
+AVOID_DAILY_AUTOCOMMITS=1

# Uncomment the following to avoid special file warning
# (the option is enabled automatically by cronjob regardless).
@@ -27,7 +27,7 @@ DARCS_COMMIT_OPTIONS="-a"
# Uncomment to avoid etckeeper committing existing changes to
# /etc before installation. It will cancel the installation,
# so you can commit the changes by hand.
-#AVOID_COMMIT_BEFORE_INSTALL=1
+AVOID_COMMIT_BEFORE_INSTALL=1

# The high-level package manager that's being used.
# (apt, pacman-g2, yum, zypper etc)

5. Apply this patch to etckeeper/commit.d/50vcs-commit. This patch is responsible for keeping the pristine branch up-to-date (more explanation below).

6. Create a .gitattributes file with contents .etckeeper merge=union. This makes merges on the metadata file use the union strategy, which reduces spurious conflicts dramatically:

diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..b7a1f4d
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1 @@
+.etckeeper merge=union

7. Commit these changes.

8. Permit pushes to the checked out /etc by running git config receive.denyCurrentBranch warn

9. All done! Try installing a package that has some configuration and then running sudo gitk in /etc to view the results. You can run a diff by running sudo git diff pristine master.

So, what’s going on under the hood? The big problem that blocked me from a setup like this in the past is that you would like the package manager to apply its changes into the pristine etc, so that you can merge in the changes yourself on the production version, but it’s not obvious how to convince dpkg that /etc lives somewhere else. Nor do you want to revert your system configuration to the pristine version, apply the update, and then revert back: this is just asking for trouble. So the idea is to apply the (generated) patch as normal, but then reapply the patch (using a cherry-pick) to the pristine branch, and then rewrite history so the parent pointers are correct. All of this happens outside of /etc, so the production copy of the configuration files never gets touched.

Of course, sometimes the cherry-pick might fail. In that case, you’ll get an error like this:

Branch pristine set up to track remote branch pristine from origin.
Switched to a new branch 'pristine'
error: could not apply 4fed9ce... committing changes in /etc after apt run
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'
Failed to import changes to pristine
TMPREPO = /tmp/etckeeper-gitrepo.CUCpBEuVXg
TREEID = 8c2fbef8a8f3a4bcc4d66d996c5362c7ba8b17df


Do not fret: all that has happened is that the pristine branch is not up-to-date. You can resolve this problem by looking at $TMPREPO/etc, where you will see some sort of merge conflict. Resolve the conflict and commit. Now you will need to manually complete the rest of the script; this can be done with:

git checkout master
git merge -s ours pristine
git push -f origin master
git push origin pristine


To make sure you did it right, go back to /etc and run git status: it should report the working directory as clean. Otherwise, there are discrepancies and you may not have done the merges correctly.

I’ve been testing this setup for a month now, and it has proceeded very smoothly (though I’ve never attempted to do a full release upgrade with this setup). Unfortunately, as I’ve said previously, I don’t have a method for constructing a pristine branch from scratch, if you have an existing system you’d like to apply this trick to. There’s nothing stopping you, though: you can always decide to start, in which case you will record just the diffs from the time you started recording pristine. Give it a spin!

• January 20, 2014

## PEPM’14: The HERMIT in the Stream

POPL is almost upon us! I’ll be live-Tumblr-ing it when the conference comes upon us proper, but in the meantime, I thought I’d write a little bit about one paper in the colocated PEPM'14 program: The HERMIT in the Stream, by Andrew Farmer, Christian Höner zu Sierdissen and Andy Gill. This paper presents an implementation of an optimization scheme for fusing away use of the concatMap combinator in the stream fusion framework, which was developed using the HERMIT optimization framework. The HERMIT project has been chugging along for some time now, and a stream of papers describing various applications of the framework has been trickling out (as anyone who was at the Haskell implementors workshop can attest.)

“But wait,” you may ask, “don’t we already have stream fusion?” You’d be right: but while stream fusion is available as a library, it has not replaced the default fusion system that ships with GHC: foldr/build fusion. What makes a fusion scheme good? One important metric is the number of list combinators it supports. Stream fusion nearly dominates foldr/build fusion, except for the case of concatMap, a problem which has resisted resolution for seven years and has prevented GHC from switching to using stream fusion as its default.

As it turns out, we’ve known how to optimize concatMap for a long time; Duncan Coutts gave a basic outline in his thesis. The primary contribution of this paper was a prototype implementation of this optimization, including an elucidation of the important technical details (increasing the applicability of the original rule, necessary modifications to the simplifier, and rules for desugaring list comprehensions). The paper also offers some microbenchmarks and real world benchmarks arguing for the importance of optimizing concatMap.

I was glad to see this paper, since it is an important milestone on the way to replacing foldr/build fusion with stream fusion in the GHC standard libraries. It also seems the development of this optimization was greatly assisted by the use of HERMIT, which seems like a good validation for HERMIT (though the paper does not go into very much detail about how HERMIT assisted in the process of developing this optimization).

There is something slightly unsatisfying with the optimization as stated in the paper, which can be best articulated by considering the paper from the perspective of a prospective implementor of stream fusion. She has two choices:

• She can try to use the HERMIT system directly. However, HERMIT induces a 5-20x compilation slowdown, which is quite discouraging for real use. This slowdown is probably not fundamental, and will be erased in due time, but that is certainly not the case today. The limited implementation of stream fusion in the prototype (they don’t implement all of the combinators, just enough so they could run their numbers) also recommends against direct use of the system.
• She can directly incorporate the rules as stated into a compiler. This would require special-case code to apply the non-semantics preserving simplifications only to streams, and essentially would require a reimplementation of the system, with the guidance offered by this paper. But this special-case code is of limited applicability beyond its utility for concatMap, which is a negative mark.

So, it seems, at least from the perspective of an average GHC user, we will have to wait a bit longer before stream fusion is in our hands. Still, I agree that the microbenchmarks and ADPFusion case study show the viability of the approach, and the general principle of the novel simplification rules seems reasonable, if a little ad hoc.

One note if you’re reading the nofib performance section: the experiment was done comparing their system to foldr/build, so the delta is mostly indicative of the benefit of stream fusion (in the text, they point out which benchmarks benefitted the most from concatMap fusion). Regardless, it’s a pretty cool paper: check it out!

• January 17, 2014

## Ott ⇔ PLT Redex

Ott and PLT Redex are a pair of complementary tools for the working semanticist. Ott is a tool for writing definitions of programming languages in a nice ASCII notation, which then can be typeset in LaTeX or used to generate definitions for a theorem prover (e.g. Coq). PLT Redex is a tool for specifying and debugging operational semantics. Both tools are easy to install, which is a big plus. Since the tools are quite similar, I thought it might be interesting to do a comparison of how various common tasks are done in both languages. (Also, I think the Redex manual is pretty terrible.)

Variables. In Ott, variables are defined by way of metavariables (metavar x), which then serve as variable (by either using the metavariable alone, or suffixing it with a number, index variable or tick).

In Redex, there is no notion of a metavariable; a variable is just another production. There are a few different ways to say that a production is a variable: the simplest method is to use variable-not-otherwise-mentioned, which automatically prevents keywords from acting as variables. There are also several other variable patterns variable, variable-except and variable-prefix, which afford more control over what symbols are considered variables. side-condition may also be useful if you have a function which classifies variables.

Grammar. Both Ott and Redex can identify ambiguous matches. Ott will error when it encounters an ambiguous parse. Redex, on the other hand, will produce all valid parses; while this is not so useful when parsing terms, it is quite useful when specifying non-deterministic operational semantics (although this can have bad performance implications). check-redundancy may be useful to identify ambiguous patterns.

Binders. In Ott, binders are explicitly declared in the grammar using bind x in t; there is also a binding language for collecting binders for pattern-matching. Ott can also generate substitution/free variable functions for the semantics. In Redex, binders are not stated in the grammar; instead, they are implemented solely in the reduction language, usually using substitution (Redex provides a workhorse substitution function for this purpose), and explicitly requiring a variable to be fresh. Redex does have a special-form in the metalanguage for doing let-binding (term-let), which substitutes immediately.

Lists. Ott supports two forms of lists: dot forms and list comprehensions. A dot form looks like x1 , .. , xn and requires an upper bound. A list comprehension looks like </ xi // i IN 1 .. n />; the bounds can be omitted. A current limitation of Ott is that it doesn’t understand how to deal with nested dot forms; this can be worked around by doing a comprehension over a production, and then elsewhere stating the appropriate equalities the production satisfies.

Redex supports lists using ellipsis patterns, which looks like (e ...). There is no semantic content here: the ellipses simply matches zero or more copies of e, which can lead to nondeterministic matches when there are multiple ellipses. Nested ellipses are supported, and simply result in nested lists. Bounds can be specified using side-conditions; however, Redex supports a limited form of bounding using named ellipses (e.g. ..._1), where all ellipses with the same name must have the same length.

Semantics. Ott is agnostic to whatever semantics you want to define; arbitrary judgments can be specified. One can also define judgments as usual in Redex, but Redex provides special support for evaluation semantics, in which a semantics is given in terms of evaluation contexts, thus allowing you to avoid the use of structural rules. So a usual use-case is to define a normal expression language, extend the language to have evaluation contexts, and then define a reduction-relation using in-hole to do context decomposition. The limitation is that if you need to do anything fancy (e.g. multi-hole evaluation contexts), you will have to fall back to judgment forms.

Type-setting. Ott supports type-setting by translation into LaTeX. Productions can have custom LaTeX associated with them, which is used to generate their output. Redex has a pict library for directly typesetting into PDF or Postscript; it doesn’t seem like customized typesetting is an intended use-case for PLT Redex, though it can generate reasonable Lisp-like output.

Conclusion. If I had to say what the biggest difference between Ott and PLT Redex was, it is that Ott is primarily concerned with the abstract semantic meaning of your definitions, whereas PLT Redex is primarily concerned with how you would go about matching against syntax (running it). One way to see this is in the fact that in Ott, your grammar is a BNF, which is fed into a CFG parser; whereas in PLT Redex, your grammar is a pattern language for the pattern-matching machine. This should not be surprising: one would expect each tool’s design philosophy to hew towards their intended usage.

• January 13, 2014

## When a lock is better than an MVar

MVars are an amazingly flexible synchronization primitive, which can serve as locks, one-place channels, barriers, etc. or be used to form higher-level abstractions. As far as flexibility is concerned, MVars are the superior choice of primitive for the runtime system to implement—as opposed to just implementing, say, a lock.

However, I was recently thinking about GHC's BlockedIndefinitelyOnMVar exception, and it occurred to me that a native implementation of locks could allow perfect deadlock detection, as opposed to the approximate detection for MVars we currently provide. (I must emphasize, however, that here, I define deadlock to mean a circular waits-for graph, and not “thread cannot progress further.”)

Here is how the new primitive would behave (a rough user-level sketch of the API follows the list):

• There would be a new type Lock, with only one function withLock :: Lock -> IO a -> IO a. (For brevity, we do not consider the generalization of Lock to also contain a value.)
• At runtime, the lock is represented as two closure types, indicating locked and unlocked states. The locked closure contains a waiting queue, containing threads which are waiting for the lock.
• When a thread takes out a free lock, it adds the lock to a (GC'd) held locks set associated with the thread. When it returns the lock, the lock is removed from this set.
• When a thread attempts to take a busy lock, it blocks itself (waiting for a lock) and adds itself to the waiting queue of the locked closure.
• Critically, references to the lock are treated as weak pointers when the closure is locked. (Only the pointer from the held lock set is strong.) Intuitively, merely holding a pointer to the lock does not mean you can unlock it; the only thread that can unlock it is the one which has the lock in its held locks set.
• If a thread attempts to take out a lock on a dead weak pointer, it is deadlocked.
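
As promised, here is a rough user-level sketch of the intended semantics, written in terms of MVar. This captures only the blocking behaviour of withLock, not the native closure representation or the weak-pointer treatment that makes deadlock detection possible; newLock is an assumed constructor not named above:

import Control.Concurrent.MVar
import Control.Exception (bracket_)

newtype Lock = Lock (MVar ())

newLock :: IO Lock
newLock = fmap Lock (newMVar ())

-- take the lock, run the action, and release the lock even on exception
withLock :: Lock -> IO a -> IO a
withLock (Lock m) act = bracket_ (takeMVar m) (putMVar m ()) act

main :: IO ()
main = do
  l <- newLock
  withLock l (putStrLn "holding the lock")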

Theorem. Any set of threads in a waits-for cycle is unreachable, if there are no other pointers to the threads besides the pointers from the waiting queues of the locks in the cycle.

Proof. Consider a single thread in the cycle: we show that the only (strong) pointer to it is from the previous thread in the cycle. When a thread is blocked, it is removed from the run queue (which counts as a GC root). Given the assumption, the only pointer to the thread is from the waiting queue of the lock it is blocked on. We now consider pointers to the lock it is blocked on. As this lock is busy, all pointers to it are weak, except for the pointer from the thread which is holding the lock. But this is exactly the previous thread in the cycle. ■

At the cost of a weak-pointer dereference when a lock is taken out, we can now achieve perfect deadlock detection. Deadlock will be detected as soon as a garbage collection runs that detects the dead cycle of threads. (At worst, this will be the next major GC.)

Why might this be of interest? After all, normally, it is difficult to recover from a deadlock, so while accurate deadlock reporting might be nice-to-have, it is by no means necessary. One clue comes from a sentence in Koskinen and Herlihy's paper Dreadlocks: Efficient Deadlock Detection: “an application that is inherently capable of dealing with abortable lock requests...is software transactional memory (STM).” If you are in an STM transaction, deadlock is no problem at all; just rollback one transaction, breaking the cycle. Normally, one does not take out locks in ordinary use of STM, but this can occur when you are using a technique like transactional boosting (from the same authors; the relationship between the two papers is no coincidence!)

Exercise for the reader: formulate a similar GC scheme for MVars restricted to be 1-place channels. (Hint: split the MVar into a write end and a read end.)

• January 7, 2014

## So you want to add a new concurrency primitive to GHC…

One of the appealing things about GHC is that the compiler is surprisingly hackable, even when you don’t want to patch the compiler itself. This hackability comes from compiler plugins, which let you write custom optimization passes on Core, as well as foreign primops, which let you embed low-level C-- to manipulate the low-level representation of various primitives. These hooks let people implement and distribute features that would otherwise be too unstable or speculative to put into the compiler proper.

A particular use-case that has garnered some amount of interest recently is that of concurrency primitives. We engineers like to joke that, in the name of performance, we are willing to take on nearly unbounded levels of complexity: but this is almost certainly true when it comes to concurrency primitives, where the use of ever more exotic memory barriers and concurrent data structures can lead to significant performance boosts (just ask the Linux kernel developers). It’s very tempting to look at this situation and think, “Hey, we could implement this stuff in GHC too, using the provided compiler hooks!” But there are a lot of caveats involved here.

After answering a few questions related to this subject on the ghc-devs list and noticing that many of the other responses were a bit garbled, I figured I ought to expand on my responses a bit in a proper blog post. I want to answer the following questions:

1. What does it mean to have a memory model for a high-level language like Haskell? (You can safely skip this section if you know what a memory model is.)
2. What is (GHC) Haskell’s memory model?
3. How would I go about implementing a (fast) memory barrier in GHC Haskell?

### Memory models are semantics

What is a memory model? If you ask a hardware person, they might tell you, “A memory model is a description of how a multi-processor CPU interacts with its memory, e.g. under what circumstances a write by one processor is guaranteed to be visible by another.” If you ask a compiler person, they might tell you, “A memory model says what kind of compiler optimizations I’m allowed to do on operations which modify shared variables.” A memory model must fulfill both purposes (a common misconception is that it is only one or the other). To be explicit, we define a memory model as follows (adapted from Adve-Boehm):

A memory model is a semantics for shared variables, i.e. the set of values that a read in a program is allowed to return.

That’s right: a memory model defines the behavior of one of the most basic operations in your programming language. Without it, you can’t really say what your program is supposed to do.

Why, then, are memory models so rarely discussed, even in a language community that is so crazy about semantics? In the absence of concurrency, the memory model is irrelevant: the obvious semantics apply. In the absence of data races, the memory model can be described quite simply. For example, a Haskell program which utilizes only MVars for inter-thread communication can have its behavior described completely using a relatively simple nondeterministic operational semantics (see Concurrent Haskell paper (PS)); software transactional memory offers high-level guarantees of atomicity with respect to reads of transactional variables. Where a memory model becomes essential is when programs contain data races: when you have multiple threads writing and reading IORefs without any synchronization, a memory model is responsible for defining the behavior of this program. With modern processors, this behavior can be quite complex: we refer to these models as relaxed memory models. Sophisticated synchronization primitives will often take advantage of a relaxed memory model to avoid expensive synchronizations and squeeze out extra performance.
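
As a tiny example of the kind of program whose possible behaviours only a memory model can pin down (an illustrative sketch, with arbitrary variable names):

import Control.Concurrent (forkIO, threadDelay)
import Data.IORef

main :: IO ()
main = do
  x <- newIORef (0 :: Int)
  y <- newIORef (0 :: Int)
  _ <- forkIO $ do
         writeIORef x 1
         writeIORef y 1
  threadDelay 1000
  b <- readIORef y
  a <- readIORef x
  -- under sequential consistency, b == 1 implies a == 1; a relaxed model
  -- could additionally allow (a, b) == (0, 1)
  print (a, b)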

### GHC Haskell’s memory (non) model

One might say the Haskell tradition is one that emphasizes the importance of semantics... except for a number of notable blind spots. The memory model is one of those blind spots. The original Haskell98 specification did not contain any specification of concurrency. The Concurrent Haskell paper (PS) gave a semantics for how concurrency might be added to the language, but the paper posits only the existence of MVars, and is silent on how MVars ought to interact with IORefs.

One of the very first discussions that took place on the haskell-prime committee when it was inaugurated in 2006 was whether or not Concurrent Haskell should be standardized. In the discussion, it was quickly discovered that a memory model for IORefs would be needed (continued here). As of writing, no decision has been made as to whether or not IORefs should have a strong or weak memory model.

The upshot is that, as far as Haskell the standardized language goes, the behavior here is completely undefined. To really be able to say anything, we’ll have to pick an implementation (GHC Haskell), and we’ll have to infer which aspects of the implementation are specified behavior, as opposed to things that just accidentally happen to hold. Notably, memory models have implications for all levels of your stack (it is a common misconception that a memory barrier can be used without any cooperation from your compiler), so to do this analysis we’ll need to look at all of the phases of the GHC compilation chain. Furthermore, we’ll restrict ourselves to monadic reads/writes, to avoid having to wrangle with the can of worms that is laziness.

Here’s GHC’s compilation pipeline in a nutshell:
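
In schematic form (a rough sketch; each stage is discussed in turn below):

```
Haskell source
  → Core / STG    (ordering fixed by monadic sequencing)
  → C--           (monadic structure gone; GHC may reorder loads and stores)
  → assembly via the NCG, or LLVM
                  (the processor's and/or LLVM's memory model takes over)
```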

At the very top of the compiler pipeline lie the intermediate languages Core and STG. These preserve sequential consistency with no trouble, as the ordering of reads and writes is fixed by the use of monads and preserved throughout the desugaring and optimization passes: as far as the optimizer is concerned, the primitive operations which implement reads and writes are complete black boxes. In fact, monads will over-sequentialize in many cases! (It is worth remarking that rewrite rules and GHC plugins could apply optimizations which do not preserve the ordering imposed by monads. Of course, both of these facilities can also be used to change the meaning of your program entirely; when considering a memory model, these rules merely carry a higher burden of correctness.)

The next step of the pipeline is a translation into C--, a high-level assembly language. Here, calls to primitive operations like readMutVar# and writeMutVar# are translated into actual memory reads and writes in C--. Importantly, the monadic structure that was present in Core and STG is now gone, and GHC may apply optimizations which reorder reads and writes. What actually happens is highly dependent on the C-- that is generated and on the optimizations that GHC applies; C-- itself has no memory model, so we cannot even appeal to that.
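
For reference, here is a hand-rolled sketch of how such a mutable reference bottoms out in those primops (the real IORef goes through GHC.IORef and GHC.STRef; this version is only for illustration). The readMutVar#/writeMutVar# calls below are precisely what become plain loads and stores in C--:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (MutVar#, RealWorld, newMutVar#, readMutVar#, writeMutVar#)
import GHC.IO (IO(..))

-- An illustrative IORef-alike built directly on MutVar#.
data Ref a = Ref (MutVar# RealWorld a)

newRef :: a -> IO (Ref a)
newRef x = IO $ \s -> case newMutVar# x s of
  (# s', var #) -> (# s', Ref var #)

readRef :: Ref a -> IO a
readRef (Ref var) = IO $ \s -> readMutVar# var s   -- a single memory read in C--

writeRef :: Ref a -> a -> IO ()
writeRef (Ref var) x = IO $ \s -> case writeMutVar# var x s of
  s' -> (# s', () #)                                -- a single memory write in C--
```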

This being said, a few things can be inferred from a study of the optimization passes that GHC does implement:

• GHC reserves the right to reorder stores: the WriteBarrier mach-op (NB: not available from Haskell!) is defined to prevent future stores from occurring before preceding stores. In practice, GHC has not implemented any C-- optimizations which reorder stores, so if you have a story for dealing with the subsequent stages of the pipeline, you can dangerously assume that stores will not be reordered in this phase.
• GHC reserves the right to reorder loads, and does so extensively. One of the most important optimizations we perform is a sinking pass, where assignments to local variables are floated as close to their use-sites as possible. As of writing, there is no support for a read barrier which would prevent this floating from occurring.

There are a few situations where we happen to avoid read reordering (which may be dangerously assumed):

• Reads don’t seem to be reordered across foreign primops (primops defined using foreign import prim). This is because foreign primops are implemented as a jump to another procedure (the primop), and there are no inter-procedural C-- optimizations at present.
• Heap reads don’t seem to be reordered across heap writes. This is because we currently don’t do any aliasing analysis and conservatively assume the write would have clobbered the read. (This is especially dangerous to assume, since you could easily imagine getting some aliasing information from the frontend.)

Finally, the C-- is translated into either assembly (via the NCG, the native code generator) or LLVM. During this translation, we convert the WriteBarrier mach-op into an appropriate assembly instruction (a no-op on x86) or LLVM intrinsic (a sequential-consistency barrier); from this point on, the behavior is up to the memory model defined by the processor and/or by LLVM.

It is worth summarizing the discussion here by comparing it to the documentation at Data.IORef, which gives an informal description of the IORef memory model:

In a concurrent program, IORef operations may appear out-of-order to another thread, depending on the memory model of the underlying processor architecture... The implementation is required to ensure that reordering of memory operations cannot cause type-correct code to go wrong. In particular, when inspecting the value read from an IORef, the memory writes that created that value must have occurred from the point of view of the current thread.

In other words, “We give no guarantees about reordering, except that you will not have any type-safety violations.” This behavior can easily arise from reordering stores or loads. However, the type-safety guarantee is an interesting one: the last sentence says that an IORef is never allowed to point to uninitialized memory; that is, we are not allowed to reorder the write to the IORef with the writes that initialize the value it points to. This holds easily on x86, since the C-- optimizer does not currently reorder stores; I am honestly skeptical that we are doing the right thing on the new code generator for ARM (but no one has submitted a bug yet!).
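
Concretely, the guarantee says that a (contrived, illustrative) program like the following can see the old value or the new one, but never a pointer to a half-built Just cell: the stores that construct the cell may not be reordered after the store that publishes it through the IORef.

```haskell
import Control.Concurrent (forkIO)
import Data.IORef

main :: IO ()
main = do
  ref <- newIORef Nothing
  _ <- forkIO $ writeIORef ref (Just (3 :: Int))  -- build the Just cell, then publish the pointer
  mx <- readIORef ref
  case mx of
    Nothing -> putStrLn "saw the old value"
    Just x  -> print x  -- always a well-formed Int, never uninitialized memory
```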

### What does it all mean?

This dive into the gory internals of GHC is all fine and nice, but what does it mean for you, the prospective implementor of a snazzy new concurrent data structure? There are three main points:

1. Without inline foreign primops, you will not be able to convince GHC to emit the fast-path assembly code you are looking for. As we mentioned earlier, foreign primops currently always compile into out-of-line jumps, which will result in a bit of extra cost if the branch predictor is unable to figure out the control flow. On the plus side, any foreign primop call will accidentally enforce the compiler-side write/read barrier you are looking for.
2. With inline foreign primops, you will still need to make modifications to GHC in order to ensure that optimization passes respect your snazzy new memory barriers. For example, John Lato’s desire for a load-load barrier (the email which kicked off this post) will be fulfilled with no compiler changes by an out-of-line foreign primop, but not by the hypothetical inline foreign primop.
3. This stuff is really subtle; see the position paper Relaxed memory models must be rigorous, which argues that informal descriptions of memory models (like this blog post!) are far too vague to be useful: if you want to have any hope of being correct, you must formalize it! Which suggests an immediate first step: give C-- a memory model. (This should be a modest innovation over the memory models that C and C++ have recently received.)

For the rest of us, we’ll use STM instead, and be in a slow but compositional and deadlock-free nirvana.
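
As a parting sketch, here is what that nirvana looks like with the stm package (the transfer function is purely illustrative):

```haskell
import Control.Concurrent.STM

-- No barriers or reorderings to reason about: the transaction is atomic
-- and composes with other transactions.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to n = do
  modifyTVar' from (subtract n)
  modifyTVar' to   (+ n)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 42)
  readTVarIO b >>= print  -- 42
```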

• January 1, 2014